feat: Add data quality verification & cleanup scripts

## Data Quality & Cleanup (Priorities 1-6)

Added comprehensive data quality verification and cleanup system:

**Scripts added**:
- verify_data_quality.py: full work-by-work quality analysis
- clean_duplicate_documents.py: removes duplicate Document objects
- populate_work_collection.py/clean.py: populates the Work collection
- fix_chunks_count.py: fixes inconsistent chunksCount values (see the sketch after this list)
- manage_orphan_chunks.py: manages orphan chunks (3 options)
- clean_orphan_works.py: removes Work objects that have no chunks
- add_missing_work.py: creates a missing Work object
- generate_schema_stats.py: automated schema statistics generation
- migrate_add_work_collection.py: safe migration adding the Work collection
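A minimal sketch of the repair fix_chunks_count.py presumably performs: recompute each Work's chunksCount from the actual Chunk objects. The property names (`title`, `workTitle`) and the Work→Chunk join key are assumptions for illustration, not taken from the script:

```python
# Hypothetical sketch; property names and the Work->Chunk join key are assumed.
import weaviate
from weaviate.classes.query import Filter

with weaviate.connect_to_local() as client:
    works = client.collections.get("Work")
    chunks = client.collections.get("Chunk")

    for work in works.iterator():
        # Count the Chunk objects that actually reference this Work
        actual = chunks.aggregate.over_all(
            filters=Filter.by_property("workTitle").equal(work.properties["title"]),
            total_count=True,
        ).total_count
        if work.properties.get("chunksCount") != actual:
            works.data.update(uuid=work.uuid, properties={"chunksCount": actual})
```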

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: cleanup session report
- ANALYSE_QUALITE_DONNEES.md: initial data quality analysis
- rapport_qualite_donnees.txt: raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated, then cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: corrected (declared total 231 → 5,230, now matching the actual count)
- Perfect consistency: 9 Works = 9 Documents = 9 distinct works (see the verification sketch below)
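A hedged sketch of the kind of consistency check behind these numbers, assuming a local Weaviate instance; collection and field names follow the report above, the rest is illustrative:

```python
# Illustrative check, not the actual verify_data_quality.py code.
import weaviate

with weaviate.connect_to_local() as client:
    def count(name: str) -> int:
        return client.collections.get(name).aggregate.over_all(total_count=True).total_count

    n_works, n_docs, n_chunks = count("Work"), count("Document"), count("Chunk")
    declared = sum(
        w.properties.get("chunksCount", 0)
        for w in client.collections.get("Work").iterator()
    )
    print(f"Works={n_works} Documents={n_docs} Chunks={n_chunks}")
    print(f"chunksCount declared={declared}, actual={n_chunks}: "
          f"{'OK' if declared == n_chunks else 'MISMATCH'}")
```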

**Code changes**:
- schema.py: added the Work collection with vectorization (sketched below)
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: disabled concept extraction (.lower() issue)
- utils/word_toc_extractor.py: correct Word metadata extraction
- .gitignore: exclude temporary files (*.wav, output/*, NUL)
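For context, a rough sketch of what the new Work collection in schema.py might look like; the property set is an assumption based on the cleanup report, not the actual file:

```python
# Assumed shape of the Work collection; property names are illustrative only.
import weaviate
import weaviate.classes.config as wvc

def create_work_collection(client: weaviate.WeaviateClient) -> None:
    client.collections.create(
        name="Work",
        # Vectorized, per the commit notes above
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        properties=[
            wvc.Property(name="title", data_type=wvc.DataType.TEXT),
            wvc.Property(name="author", data_type=wvc.DataType.TEXT),
            wvc.Property(
                name="chunksCount",
                data_type=wvc.DataType.INT,
                skip_vectorization=True,  # metadata only, used for consistency checks
            ),
        ],
    )
```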

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
schema.py

@@ -41,6 +41,15 @@ Vectorization Strategy:
- Metadata fields use skip_vectorization=True for filtering only
- Work and Document collections have no vectorizer (metadata only)

Vector Index Configuration (2026-01):
- **Dynamic Index**: Automatically switches from flat to HNSW based on collection size
  - Chunk: switches at 50,000 vectors
  - Summary: switches at 10,000 vectors
- **Rotational Quantization (RQ)**: Reduces memory footprint by ~75%
  - Minimal accuracy loss (<1%)
  - Essential for scaling to 100k+ chunks
- **Distance Metric**: Cosine similarity (matches BGE-M3 training)

Migration Note (2024-12):
Migrated from MiniLM-L6 (384-dim) to BAAI/bge-m3 (1024-dim) for:
- 2.7x richer semantic representation
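The ~75% figure is straightforward arithmetic: BGE-M3 vectors hold 1024 float32 dimensions (4 bytes each), while rotational quantization stores roughly one byte per dimension (8-bit codes). A quick illustration, using the 100k-chunk scale from the note above:

```python
# Back-of-the-envelope memory estimate; real indexes add some overhead.
DIM = 1024        # BGE-M3 embedding dimension
N = 100_000       # target scale: 100k+ chunks

raw_bytes = N * DIM * 4   # float32: 4 bytes per dimension
rq_bytes = N * DIM * 1    # RQ: ~1 byte per dimension (8-bit codes)

print(f"float32: {raw_bytes / 2**20:.0f} MiB")  # ~391 MiB
print(f"RQ:      {rq_bytes / 2**20:.0f} MiB")   # ~98 MiB, i.e. ~75% saved
```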
@@ -226,6 +235,11 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
    Note:
        Uses text2vec-transformers for vectorizing 'text' and 'keywords' fields.
        Other fields have skip_vectorization=True for filtering only.

    Vector Index Configuration:
        - Dynamic index: starts with flat, switches to HNSW at 50k vectors
        - Rotational Quantization (RQ): reduces memory by ~75% with minimal accuracy loss
        - Optimized for scaling from small (1k) to large (1M+) collections
    """
    client.collections.create(
        name="Chunk",
@@ -233,6 +247,21 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        # Dynamic index with RQ for an optimal memory/performance trade-off
        vector_index_config=wvc.Configure.VectorIndex.dynamic(
            threshold=50000,  # Switch to HNSW at 50k chunks
            hnsw=wvc.Configure.VectorIndex.hnsw(
                # RQ provides ~75% memory reduction with <1% accuracy loss,
                # well suited to scaling philosophical text collections
                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
                distance_metric=wvc.VectorDistances.COSINE,  # BGE-M3 uses cosine similarity
            ),
            flat=wvc.Configure.VectorIndex.flat(
                distance_metric=wvc.VectorDistances.COSINE,
            ),
        ),
        properties=[
            # Main content (vectorized)
            wvc.Property(
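A hedged usage sketch for confirming which index the dynamic configuration is actually running, assuming a local instance with this schema deployed:

```python
# Sketch: inspect the live index configuration of the Chunk collection.
import weaviate

with weaviate.connect_to_local() as client:
    config = client.collections.get("Chunk").config.get()
    print(config.vector_index_type)    # expected: dynamic
    print(config.vector_index_config)  # threshold plus HNSW/flat sub-configs
```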
@@ -319,6 +348,11 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
    Note:
        Uses text2vec-transformers for vectorizing summary text.

    Vector Index Configuration:
        - Dynamic index: starts with flat, switches to HNSW at 10k vectors
        - Rotational Quantization (RQ): reduces memory by ~75%
        - Lower threshold than Chunk (summaries are fewer and shorter)
    """
    client.collections.create(
        name="Summary",
@@ -326,6 +360,20 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        # Dynamic index with RQ (lower threshold for summaries)
        vector_index_config=wvc.Configure.VectorIndex.dynamic(
            threshold=10000,  # Switch to HNSW at 10k summaries (fewer than chunks)
            hnsw=wvc.Configure.VectorIndex.hnsw(
                # RQ suits summaries well (shorter, more uniform text)
                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
                distance_metric=wvc.VectorDistances.COSINE,
            ),
            flat=wvc.Configure.VectorIndex.flat(
                distance_metric=wvc.VectorDistances.COSINE,
            ),
        ),
        properties=[
            wvc.Property(
                name="sectionPath",
@@ -496,6 +544,10 @@ def print_summary() -> None:
print(" - Document: NONE")
print(" - Chunk: text2vec (text + keywords)")
print(" - Summary: text2vec (text)")
print("\n✓ Index Vectoriel (Optimisation 2026):")
print(" - Chunk: Dynamic (flat → HNSW @ 50k) + RQ (~75% moins de RAM)")
print(" - Summary: Dynamic (flat → HNSW @ 10k) + RQ")
print(" - Distance: Cosine (compatible BGE-M3)")
print("=" * 80)