feat: Add vectorized summary field and migration tools

- Add 'summary' field to Chunk collection (vectorized with text2vec)
- Migrate from Dynamic index to HNSW + RQ for both Chunk and Summary
- Add LLM summarizer module (utils/llm_summarizer.py)
- Add migration scripts (migrate_add_summary.py, restore_*.py)
- Add summary generation utilities and progress tracking
- Add testing and cleaning tools (outils_test_and_cleaning/)
- Add comprehensive documentation (ANALYSE_*.md, guides)
- Remove obsolete files (linear_config.py, old test files)
- Update .gitignore to exclude backups and temp files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 22:56:03 +01:00
parent feb215dae0
commit 636ad6206c
40 changed files with 11937 additions and 712 deletions
--- a/generations/library_rag/schema.py
+++ b/generations/library_rag/schema.py
@@ -26,7 +26,7 @@ Collections:

    **Chunk** (vectorized with text2vec-transformers):
        Text fragments optimized for semantic search (200-800 chars).
-        Vectorized fields: text, keywords.
+        Vectorized fields: text, summary, keywords.
        Non-vectorized fields: sectionPath, chapterTitle, unitType, orderIndex.
        Includes nested Document and Work references.

@@ -36,15 +36,13 @@ Collections:
        Includes nested Document reference.

 Vectorization Strategy:
-    - Only Chunk.text, Chunk.keywords, Summary.text, and Summary.concepts are vectorized
+    - Only Chunk.text, Chunk.summary, Chunk.keywords, Summary.text, and Summary.concepts are vectorized
    - Uses text2vec-transformers (BAAI/bge-m3 with 1024-dim via Docker)
    - Metadata fields use skip_vectorization=True for filtering only
    - Work and Document collections have no vectorizer (metadata only)

 Vector Index Configuration (2026-01):
-    - **Dynamic Index**: Automatically switches from flat to HNSW based on collection size
-        - Chunk: Switches at 50,000 vectors
-        - Summary: Switches at 10,000 vectors
+    - **HNSW Index**: Hierarchical Navigable Small World for efficient search
    - **Rotational Quantization (RQ)**: Reduces memory footprint by ~75%
        - Minimal accuracy loss (<1%)
        - Essential for scaling to 100k+ chunks
@@ -233,13 +231,13 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
        client: Connected Weaviate client.

    Note:
-        Uses text2vec-transformers for vectorizing 'text' and 'keywords' fields.
+        Uses text2vec-transformers for vectorizing 'text', 'summary', and 'keywords' fields.
        Other fields have skip_vectorization=True for filtering only.

        Vector Index Configuration:
-            - Dynamic index: starts with flat, switches to HNSW at 50k vectors
+            - HNSW index for efficient similarity search
            - Rotational Quantization (RQ): reduces memory by ~75% with minimal accuracy loss
-            - Optimized for scaling from small (1k) to large (1M+) collections
+            - Optimized for scaling to large (100k+) collections
    """
    client.collections.create(
        name="Chunk",
@@ -247,20 +245,12 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
-        # Dynamic index with RQ for optimal memory/performance trade-off
-        vector_index_config=wvc.Configure.VectorIndex.dynamic(
-            threshold=50000,  # Switch to HNSW at 50k chunks
-            hnsw=wvc.Reconfigure.VectorIndex.hnsw(
-                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(
-                    enabled=True,
-                    # RQ provides ~75% memory reduction with <1% accuracy loss
-                    # Perfect for scaling philosophical text collections
-                ),
-                distance_metric=wvc.VectorDistances.COSINE,  # BGE-M3 uses cosine similarity
-            ),
-            flat=wvc.Reconfigure.VectorIndex.flat(
-                distance_metric=wvc.VectorDistances.COSINE,
-            ),
+        # HNSW index with RQ for optimal memory/performance trade-off
+        vector_index_config=wvc.Configure.VectorIndex.hnsw(
+            distance_metric=wvc.VectorDistances.COSINE,  # BGE-M3 uses cosine similarity
+            quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
+            # RQ provides ~75% memory reduction with <1% accuracy loss
+            # Perfect for scaling philosophical text collections
        ),
        properties=[
            # Main content (vectorized)
@@ -269,6 +259,11 @@ def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
                description="The text content to be vectorized (200-800 chars optimal).",
                data_type=wvc.DataType.TEXT,
            ),
+            wvc.Property(
+                name="summary",
+                description="LLM-generated summary of this chunk (100-200 words, VECTORIZED).",
+                data_type=wvc.DataType.TEXT,
+            ),
            # Hierarchical context (not vectorized, for filtering)
            wvc.Property(
                name="sectionPath",
@@ -350,9 +345,9 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
        Uses text2vec-transformers for vectorizing summary text.

        Vector Index Configuration:
-            - Dynamic index: starts with flat, switches to HNSW at 10k vectors
+            - HNSW index for efficient similarity search
            - Rotational Quantization (RQ): reduces memory by ~75%
-            - Lower threshold than Chunk (summaries are fewer and shorter)
+            - Optimized for summaries (shorter, more uniform text)
    """
    client.collections.create(
        name="Summary",
@@ -360,19 +355,11 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
-        # Dynamic index with RQ (lower threshold for summaries)
-        vector_index_config=wvc.Configure.VectorIndex.dynamic(
-            threshold=10000,  # Switch to HNSW at 10k summaries (fewer than chunks)
-            hnsw=wvc.Reconfigure.VectorIndex.hnsw(
-                quantizer=wvc.Configure.VectorIndex.Quantizer.rq(
-                    enabled=True,
-                    # RQ optimal for summaries (shorter, more uniform text)
-                ),
-                distance_metric=wvc.VectorDistances.COSINE,
-            ),
-            flat=wvc.Reconfigure.VectorIndex.flat(
-                distance_metric=wvc.VectorDistances.COSINE,
-            ),
+        # HNSW index with RQ for optimal memory/performance trade-off
+        vector_index_config=wvc.Configure.VectorIndex.hnsw(
+            distance_metric=wvc.VectorDistances.COSINE,
+            quantizer=wvc.Configure.VectorIndex.Quantizer.rq(),
+            # RQ optimal for summaries (shorter, more uniform text)
        ),
        properties=[
            wvc.Property(
@@ -537,16 +524,16 @@ def print_summary() -> None:
    print("\n✓ Architecture:")
    print("  - Work: Source unique pour author/title")
    print("  - Document: Métadonnées d'édition avec référence vers Work")
-    print("  - Chunk: Fragments vectorisés (text + keywords)")
-    print("  - Summary: Résumés de chapitres vectorisés (text)")
+    print("  - Chunk: Fragments vectorisés (text + summary + keywords)")
+    print("  - Summary: Résumés de chapitres vectorisés (text + concepts)")
    print("\n✓ Vectorisation:")
    print("  - Work:    NONE")
    print("  - Document: NONE")
-    print("  - Chunk:   text2vec (text + keywords)")
-    print("  - Summary: text2vec (text)")
+    print("  - Chunk:   text2vec (text + summary + keywords)")
+    print("  - Summary: text2vec (text + concepts)")
    print("\n✓ Index Vectoriel (Optimisation 2026):")
-    print("  - Chunk:   Dynamic (flat → HNSW @ 50k) + RQ (~75% moins de RAM)")
-    print("  - Summary: Dynamic (flat → HNSW @ 10k) + RQ")
+    print("  - Chunk:   HNSW + RQ (~75% moins de RAM)")
+    print("  - Summary: HNSW + RQ")
    print("  - Distance: Cosine (compatible BGE-M3)")
    print("=" * 80)
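
The "~75% less memory" figure quoted for Rotational Quantization in the docstrings above can be sanity-checked with back-of-envelope arithmetic. The sketch below is not part of the commit. It assumes uncompressed vectors are stored as float32 (4 bytes per dimension) and that RQ stores roughly one byte per dimension; it ignores the HNSW graph and any per-vector quantizer metadata, so real usage will be somewhat higher. The helper name `vector_memory_bytes` and the 100,000-chunk collection size are illustrative only; the 1024 dimensions come from the BGE-M3 note in the schema docstring.

```python
def vector_memory_bytes(num_vectors: int, dims: int, bytes_per_dim: int) -> int:
    """Raw storage for the vectors alone (no index graph, no metadata)."""
    return num_vectors * dims * bytes_per_dim


DIMS = 1024           # BGE-M3 output size, per the schema docstring
NUM_CHUNKS = 100_000  # illustrative collection size

# Assumption: uncompressed vectors are float32, 4 bytes per dimension.
raw_bytes = vector_memory_bytes(NUM_CHUNKS, DIMS, 4)

# Assumption: RQ keeps about 1 byte per dimension (8-bit codes).
rq_bytes = vector_memory_bytes(NUM_CHUNKS, DIMS, 1)

reduction = 1 - rq_bytes / raw_bytes

print(f"float32: {raw_bytes / 2**20:.1f} MiB")
print(f"RQ:      {rq_bytes / 2**20:.1f} MiB")
print(f"saving:  {reduction:.0%}")
```

Under these assumptions, 100,000 chunks drop from roughly 391 MiB to roughly 98 MiB of vector storage, which matches the 75% figure in the docstrings.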