feat: Add data quality verification & cleanup scripts
## Data Quality & Cleanup (Priorities 1-6)

Added a comprehensive data quality verification and cleanup system.

**Scripts added**:
- verify_data_quality.py: Full work-by-work quality analysis
- clean_duplicate_documents.py: Removes duplicate Documents
- populate_work_collection.py/clean.py: Populates the Work collection
- fix_chunks_count.py: Fixes inconsistent chunksCount values
- manage_orphan_chunks.py: Manages orphan chunks (3 options)
- clean_orphan_works.py: Deletes Works with no chunks
- add_missing_work.py: Creates a missing Work
- generate_schema_stats.py: Automatic stats generation
- migrate_add_work_collection.py: Safe Work collection migration

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: Complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: Quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: Cleanup session report
- ANALYSE_QUALITE_DONNEES.md: Initial quality analysis
- rapport_qualite_donnees.txt: Raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated, then cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: Fixed (231 → 5,230; declared now matches actual)
- Full consistency: 9 Works = 9 Documents = 9 source works

**Code changes**:
- schema.py: Added Work collection with vectorization
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: Disabled concepts (.lower() issue)
- utils/word_toc_extractor.py: Correct Word metadata
- .gitignore: Excluded temporary files (*.wav, output/*, NUL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
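The chunksCount fix above boils down to reconciling each Document's declared count against the actual number of chunks referencing it. A minimal sketch of that reconciliation logic (hypothetical helper names; not the actual fix_chunks_count.py, which reads both sides from Weaviate):

```python
from collections import Counter

def reconcile_chunk_counts(declared: dict[str, int],
                           chunk_doc_ids: list[str]) -> dict[str, tuple[int, int]]:
    """Return {doc_id: (declared, actual)} for every document whose
    declared chunksCount disagrees with its actual chunk count."""
    actual = Counter(chunk_doc_ids)          # count chunks per document
    all_ids = set(declared) | set(actual)    # cover orphans on both sides
    return {doc_id: (declared.get(doc_id, 0), actual[doc_id])
            for doc_id in all_ids
            if declared.get(doc_id, 0) != actual[doc_id]}

# Example: doc "a" declares 2 chunks but actually has 3; "b" is consistent.
mismatches = reconcile_chunk_counts({"a": 2, "b": 1}, ["a", "a", "a", "b"])
print(mismatches)  # {'a': (2, 3)}
```

A chunk whose document id appears in no Document (an orphan) shows up here as a `(0, n)` entry, which is how the 174 orphans above surface.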
@@ -41,6 +41,15 @@ Vectorization Strategy:
- Metadata fields use skip_vectorization=True for filtering only
- Work and Document collections have no vectorizer (metadata only)

Vector Index Configuration (2026-01):
- **Dynamic Index**: Automatically switches from flat to HNSW based on collection size
  - Chunk: Switches at 50,000 vectors
  - Summary: Switches at 10,000 vectors
- **Rotational Quantization (RQ)**: Reduces memory footprint by ~75%
  - Minimal accuracy loss (<1%)
  - Essential for scaling to 100k+ chunks
- **Distance Metric**: Cosine similarity (matches BGE-M3 training)

Migration Note (2024-12):
Migrated from MiniLM-L6 (384-dim) to BAAI/bge-m3 (1024-dim) for:
- 2.7x richer semantic representation
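The ~75% figure in the docstring follows from simple arithmetic, assuming RQ stores roughly one byte per dimension instead of a 32-bit float (a back-of-envelope sketch using this commit's own numbers, not measured RAM):

```python
# Memory estimate for the post-cleanup collection (assumption: RQ keeps
# ~8 bits per dimension, versus 4 bytes per dimension for raw float32).
DIM = 1024        # BGE-M3 output dimension
N_CHUNKS = 5_230  # chunks after cleanup

raw_bytes = N_CHUNKS * DIM * 4  # uncompressed float32 vectors
rq_bytes = N_CHUNKS * DIM * 1   # ~1 byte per dimension with RQ

print(f"raw: {raw_bytes / 2**20:.1f} MiB")   # 20.4 MiB
print(f"RQ:  {rq_bytes / 2**20:.1f} MiB")    # 5.1 MiB
print(f"saved: {1 - rq_bytes / raw_bytes:.0%}")  # 75%
```

At 100k+ chunks the same ratio holds, which is why the docstring calls RQ essential for scaling.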
@@ -226,6 +235,11 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
    Note:
        Uses text2vec-transformers for vectorizing 'text' and 'keywords' fields.
        Other fields have skip_vectorization=True for filtering only.

    Vector Index Configuration:
        - Dynamic index: starts with flat, switches to HNSW at 50k vectors
        - Rotational Quantization (RQ): reduces memory by ~75% with minimal accuracy loss
        - Optimized for scaling from small (1k) to large (1M+) collections
    """
    client.collections.create(
        name="Chunk",
@@ -233,6 +247,21 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        # Dynamic index with RQ for an optimal memory/performance trade-off
        vector_index_config=wvc.Configure.VectorIndex.dynamic(
            threshold=50000,  # Switch to HNSW at 50k chunks
            hnsw=wvc.Configure.VectorIndex.hnsw(
                # RQ provides ~75% memory reduction with <1% accuracy loss;
                # well suited to scaling philosophical text collections
                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
                distance_metric=wvc.VectorDistances.COSINE,  # BGE-M3 uses cosine similarity
            ),
            flat=wvc.Configure.VectorIndex.flat(
                distance_metric=wvc.VectorDistances.COSINE,
            ),
        ),
        properties=[
            # Main content (vectorized)
            wvc.Property(
@@ -319,6 +348,11 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:

    Note:
        Uses text2vec-transformers for vectorizing summary text.

    Vector Index Configuration:
        - Dynamic index: starts with flat, switches to HNSW at 10k vectors
        - Rotational Quantization (RQ): reduces memory by ~75%
        - Lower threshold than Chunk (summaries are fewer and shorter)
    """
    client.collections.create(
        name="Summary",
@@ -326,6 +360,20 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        # Dynamic index with RQ (lower threshold for summaries)
        vector_index_config=wvc.Configure.VectorIndex.dynamic(
            threshold=10000,  # Switch to HNSW at 10k summaries (fewer than chunks)
            hnsw=wvc.Configure.VectorIndex.hnsw(
                # RQ works well for summaries (shorter, more uniform text)
                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
                distance_metric=wvc.VectorDistances.COSINE,
            ),
            flat=wvc.Configure.VectorIndex.flat(
                distance_metric=wvc.VectorDistances.COSINE,
            ),
        ),
        properties=[
            wvc.Property(
                name="sectionPath",
@@ -496,6 +544,10 @@ def print_summary() -> None:
    print(" - Document: NONE")
    print(" - Chunk: text2vec (text + keywords)")
    print(" - Summary: text2vec (text)")
    print("\n✓ Vector Index (2026 optimization):")
    print(" - Chunk: Dynamic (flat → HNSW @ 50k) + RQ (~75% less RAM)")
    print(" - Summary: Dynamic (flat → HNSW @ 10k) + RQ")
    print(" - Distance: Cosine (BGE-M3 compatible)")
    print("=" * 80)
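Both indexes in this diff pin cosine as the distance metric. For reference, cosine distance as vector databases such as Weaviate report it is 1 − cosine similarity, so 0 means identical direction and 2 means opposite (a plain-Python illustration, not the engine's implementation):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 - cosine similarity, ranging from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

This metric ignores vector magnitude, which is why it matches BGE-M3's normalized embeddings.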