feat: Add data quality verification & cleanup scripts

## Data Quality & Cleanup (Priorities 1-6)

Added comprehensive data quality verification and cleanup system:

**Scripts added**:
- verify_data_quality.py: full work-by-work quality analysis
- clean_duplicate_documents.py: removes duplicate Document objects
- populate_work_collection.py/clean.py: populates the Work collection
- fix_chunks_count.py: fixes inconsistent chunksCount values (see the sketch after this list)
- manage_orphan_chunks.py: manages orphan chunks (3 options)
- clean_orphan_works.py: removes Work objects that have no chunks
- add_missing_work.py: creates a missing Work object
- generate_schema_stats.py: automated schema statistics generation
- migrate_add_work_collection.py: safe migration adding the Work collection
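A minimal sketch of the repair fix_chunks_count.py presumably performs: recompute each Work's chunksCount from the actual Chunk objects. The property names (`title`, `workTitle`) and the Work→Chunk join key are assumptions for illustration, not taken from the script:

```python
# Hypothetical sketch; property names and the Work->Chunk join key are assumed.
import weaviate
from weaviate.classes.query import Filter

with weaviate.connect_to_local() as client:
    works = client.collections.get("Work")
    chunks = client.collections.get("Chunk")

    for work in works.iterator():
        # Count the Chunk objects that actually reference this Work
        actual = chunks.aggregate.over_all(
            filters=Filter.by_property("workTitle").equal(work.properties["title"]),
            total_count=True,
        ).total_count
        if work.properties.get("chunksCount") != actual:
            works.data.update(uuid=work.uuid, properties={"chunksCount": actual})
```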

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: cleanup session report
- ANALYSE_QUALITE_DONNEES.md: initial data quality analysis
- rapport_qualite_donnees.txt: raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated, then cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: corrected (declared total 231 → 5,230, now matching the actual count)
- Perfect consistency: 9 Works = 9 Documents = 9 distinct works (see the verification sketch below)
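A hedged sketch of the kind of consistency check behind these numbers, assuming a local Weaviate instance; collection and field names follow the report above, the rest is illustrative:

```python
# Illustrative check, not the actual verify_data_quality.py code.
import weaviate

with weaviate.connect_to_local() as client:
    def count(name: str) -> int:
        return client.collections.get(name).aggregate.over_all(total_count=True).total_count

    n_works, n_docs, n_chunks = count("Work"), count("Document"), count("Chunk")
    declared = sum(
        w.properties.get("chunksCount", 0)
        for w in client.collections.get("Work").iterator()
    )
    print(f"Works={n_works} Documents={n_docs} Chunks={n_chunks}")
    print(f"chunksCount declared={declared}, actual={n_chunks}: "
          f"{'OK' if declared == n_chunks else 'MISMATCH'}")
```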

**Code changes**:
- schema.py: added the Work collection with vectorization (sketched below)
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: disabled concept extraction (.lower() issue)
- utils/word_toc_extractor.py: correct Word metadata extraction
- .gitignore: exclude temporary files (*.wav, output/*, NUL)
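For context, a rough sketch of what the new Work collection in schema.py might look like; the property set is an assumption based on the cleanup report, not the actual file:

```python
# Assumed shape of the Work collection; property names are illustrative only.
import weaviate
import weaviate.classes.config as wvc

def create_work_collection(client: weaviate.WeaviateClient) -> None:
    client.collections.create(
        name="Work",
        # Vectorized, per the commit notes above
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        properties=[
            wvc.Property(name="title", data_type=wvc.DataType.TEXT),
            wvc.Property(name="author", data_type=wvc.DataType.TEXT),
            wvc.Property(
                name="chunksCount",
                data_type=wvc.DataType.INT,
                skip_vectorization=True,  # metadata only, used for consistency checks
            ),
        ],
    )
```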

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
schema.py

@@ -41,6 +41,15 @@ Vectorization Strategy:
- Metadata fields use skip_vectorization=True for filtering only
- Work and Document collections have no vectorizer (metadata only)

Vector Index Configuration (2026-01):
- **Dynamic Index**: Automatically switches from flat to HNSW based on collection size
  - Chunk: switches at 50,000 vectors
  - Summary: switches at 10,000 vectors
- **Rotational Quantization (RQ)**: Reduces memory footprint by ~75%
  - Minimal accuracy loss (<1%)
  - Essential for scaling to 100k+ chunks
- **Distance Metric**: Cosine similarity (matches BGE-M3 training)

Migration Note (2024-12):
Migrated from MiniLM-L6 (384-dim) to BAAI/bge-m3 (1024-dim) for:
- 2.7x richer semantic representation
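The ~75% figure is straightforward arithmetic: BGE-M3 vectors hold 1024 float32 dimensions (4 bytes each), while rotational quantization stores roughly one byte per dimension (8-bit codes). A quick illustration, using the 100k-chunk scale from the note above:

```python
# Back-of-the-envelope memory estimate; real indexes add some overhead.
DIM = 1024        # BGE-M3 embedding dimension
N = 100_000       # target scale: 100k+ chunks

raw_bytes = N * DIM * 4   # float32: 4 bytes per dimension
rq_bytes = N * DIM * 1    # RQ: ~1 byte per dimension (8-bit codes)

print(f"float32: {raw_bytes / 2**20:.0f} MiB")  # ~391 MiB
print(f"RQ:      {rq_bytes / 2**20:.0f} MiB")   # ~98 MiB, i.e. ~75% saved
```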
@@ -226,6 +235,11 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
    Note:
        Uses text2vec-transformers for vectorizing 'text' and 'keywords' fields.
        Other fields have skip_vectorization=True for filtering only.

    Vector Index Configuration:
        - Dynamic index: starts with flat, switches to HNSW at 50k vectors
        - Rotational Quantization (RQ): reduces memory by ~75% with minimal accuracy loss
        - Optimized for scaling from small (1k) to large (1M+) collections
    """
    client.collections.create(
        name="Chunk",
@@ -233,6 +247,21 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        # Dynamic index with RQ for an optimal memory/performance trade-off
        vector_index_config=wvc.Configure.VectorIndex.dynamic(
            threshold=50000,  # Switch to HNSW at 50k chunks
            hnsw=wvc.Configure.VectorIndex.hnsw(
                # RQ provides ~75% memory reduction with <1% accuracy loss,
                # well suited to scaling philosophical text collections
                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
                distance_metric=wvc.VectorDistances.COSINE,  # BGE-M3 uses cosine similarity
            ),
            flat=wvc.Configure.VectorIndex.flat(
                distance_metric=wvc.VectorDistances.COSINE,
            ),
        ),
        properties=[
            # Main content (vectorized)
            wvc.Property(
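A hedged usage sketch for confirming which index the dynamic configuration is actually running, assuming a local instance with this schema deployed:

```python
# Sketch: inspect the live index configuration of the Chunk collection.
import weaviate

with weaviate.connect_to_local() as client:
    config = client.collections.get("Chunk").config.get()
    print(config.vector_index_type)    # expected: dynamic
    print(config.vector_index_config)  # threshold plus HNSW/flat sub-configs
```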
@@ -319,6 +348,11 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
    Note:
        Uses text2vec-transformers for vectorizing summary text.

    Vector Index Configuration:
        - Dynamic index: starts with flat, switches to HNSW at 10k vectors
        - Rotational Quantization (RQ): reduces memory by ~75%
        - Lower threshold than Chunk (summaries are fewer and shorter)
    """
    client.collections.create(
        name="Summary",
@@ -326,6 +360,20 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        # Dynamic index with RQ (lower threshold for summaries)
        vector_index_config=wvc.Configure.VectorIndex.dynamic(
            threshold=10000,  # Switch to HNSW at 10k summaries (fewer than chunks)
            hnsw=wvc.Configure.VectorIndex.hnsw(
                # RQ suits summaries well (shorter, more uniform text)
                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
                distance_metric=wvc.VectorDistances.COSINE,
            ),
            flat=wvc.Configure.VectorIndex.flat(
                distance_metric=wvc.VectorDistances.COSINE,
            ),
        ),
        properties=[
            wvc.Property(
                name="sectionPath",
@@ -496,6 +544,10 @@ def print_summary() -> None:
print(" - Document: NONE")
print(" - Chunk: text2vec (text + keywords)")
print(" - Summary: text2vec (text)")
print("\n✓ Index Vectoriel (Optimisation 2026):")
print(" - Chunk: Dynamic (flat → HNSW @ 50k) + RQ (~75% moins de RAM)")
print(" - Summary: Dynamic (flat → HNSW @ 10k) + RQ")
print(" - Distance: Cosine (compatible BGE-M3)")
print("=" * 80)