feat: Remove Document collection from schema

BREAKING CHANGE: Document collection removed from Weaviate schema

Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)

Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2
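
A minimal pure-Python sketch of the updated collection check. The real `verify_schema()` reads collection names from a live Weaviate client; only the set comparison is shown here, and the expected names are taken from this commit's summary (the diff below still registers them as `Chunk`/`Summary` in places):

```python
# Expected collections after this commit, per the summary above.
EXPECTED_COLLECTIONS = {"Work", "Chunk_v2", "Summary_v2"}

def schema_is_valid(actual_names: set[str]) -> bool:
    """Return True when exactly the three expected collections exist."""
    return actual_names == EXPECTED_COLLECTIONS

# A schema still containing the legacy Document collection fails the check.
print(schema_is_valid({"Work", "Chunk_v2", "Summary_v2"}))              # True
print(schema_is_valid({"Work", "Document", "Chunk_v2", "Summary_v2"}))  # False
```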

Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict

Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid
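
A sketch of the renamed result type. Only the `document_uuid → work_uuid` rename is stated by this commit; the other fields and their names are illustrative assumptions:

```python
from typing import TypedDict

class WeaviateIngestResult(TypedDict):
    work_uuid: str           # renamed from document_uuid in this commit
    chunks_ingested: int     # illustrative field, not confirmed by the commit
    summaries_ingested: int  # illustrative field, not confirmed by the commit

result: WeaviateIngestResult = {
    "work_uuid": "123e4567-e89b-12d3-a456-426614174000",
    "chunks_ingested": 42,
    "summaries_ingested": 5,
}
```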

Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization

Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)
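
A sketch of the cleanup logic described above. In practice the drop would go through the Weaviate v4 client (`client.collections.delete(name)`); here only the selection of legacy collections is shown as a pure function:

```python
# Legacy collections removed by this commit (Document: 13 objects,
# Chunk: 0 objects from the old schema).
LEGACY_COLLECTIONS = {"Document", "Chunk"}

def collections_to_drop(existing: set[str]) -> list[str]:
    """Return the legacy collections still present, in sorted order."""
    return sorted(LEGACY_COLLECTIONS & existing)

print(collections_to_drop({"Work", "Document", "Chunk", "Chunk_v2", "Summary_v2"}))
# ['Chunk', 'Document']
```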

Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint

Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-09 14:13:51 +01:00
parent 625c52a925
commit 53f6a92365
8 changed files with 698 additions and 238 deletions


@@ -5,13 +5,12 @@ Library RAG application. It provides functions to create, verify, and display
 the schema configuration for indexing and searching philosophical texts.
 
 Schema Architecture:
-    The schema follows a normalized design with denormalized nested objects
-    for efficient querying. The hierarchy is::
+    The schema follows a denormalized design with nested objects for efficient
+    querying. The hierarchy is::
 
         Work (metadata only)
-        ├── Document (edition/translation instance)
-        │   ├── Chunk (vectorized text fragments)
-        │   └── Summary (vectorized chapter summaries)
+        ├── Chunk_v2 (vectorized text fragments)
+        └── Summary_v2 (vectorized chapter summaries)
 
 Collections:
     **Work** (no vectorization):
@@ -19,27 +18,24 @@ Collections:
         Stores canonical metadata: title, author, year, language, genre.
         Not vectorized - used only for metadata and relationships.
 
-    **Document** (no vectorization):
-        Represents a specific edition or translation of a Work.
-        Contains: sourceId, edition, language, pages, TOC, hierarchy.
-
+    **Chunk_v2** (manual GPU vectorization):
+        Text fragments optimized for semantic search (200-800 chars).
+        Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
+        Vectorized fields: text, keywords.
+        Non-vectorized fields: workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex.
+        Includes nested Work reference for denormalized access.
-    **Chunk** (vectorized with text2vec-transformers):
-        Text fragments optimized for semantic search (200-800 chars).
-        Vectorized fields: text, summary, keywords.
-        Non-vectorized fields: sectionPath, chapterTitle, unitType, orderIndex.
-        Includes nested Document and Work references.
 
-    **Summary** (vectorized with text2vec-transformers):
+    **Summary_v2** (manual GPU vectorization):
         LLM-generated chapter/section summaries for high-level search.
+        Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
         Vectorized fields: text, concepts.
-        Includes nested Document reference.
+        Includes nested Work reference for denormalized access.
 
 Vectorization Strategy:
-    - Only Chunk.text, Chunk.summary, Chunk.keywords, Summary.text, and Summary.concepts are vectorized
-    - Uses text2vec-transformers (BAAI/bge-m3 with 1024-dim via Docker)
+    - Only Chunk_v2.text, Chunk_v2.keywords, Summary_v2.text, and Summary_v2.concepts are vectorized
+    - Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
     - Metadata fields use skip_vectorization=True for filtering only
-    - Work and Document collections have no vectorizer (metadata only)
+    - Work collection has no vectorizer (metadata only)
 
 Vector Index Configuration (2026-01):
     - **HNSW Index**: Hierarchical Navigable Small World for efficient search
@@ -58,12 +54,10 @@ Migration Note (2024-12):
 Nested Objects:
     Instead of using Weaviate cross-references, we use nested objects for
     denormalized data access. This allows single-query retrieval of chunk
-    data with its Work/Document metadata without joins::
+    data with its Work metadata without joins::
 
-        Chunk.work = {title, author}
-        Chunk.document = {sourceId, edition}
-        Document.work = {title, author}
-        Summary.document = {sourceId}
+        Chunk_v2.work = {title, author}
+        Summary_v2.work = {title, author}
 
 Usage:
     From command line::
@@ -156,74 +150,6 @@ def create_work_collection(client: weaviate.WeaviateClient) -> None:
     )
 
-def create_document_collection(client: weaviate.WeaviateClient) -> None:
-    """Create the Document collection for edition/translation instances.
-
-    Args:
-        client: Connected Weaviate client.
-
-    Note:
-        Contains nested Work reference for denormalized access.
-    """
-    client.collections.create(
-        name="Document",
-        description="A specific edition or translation of a work (PDF, ebook, etc.).",
-        vectorizer_config=wvc.Configure.Vectorizer.none(),
-        properties=[
-            wvc.Property(
-                name="sourceId",
-                description="Unique identifier for this document (filename without extension).",
-                data_type=wvc.DataType.TEXT,
-            ),
-            wvc.Property(
-                name="edition",
-                description="Edition or translator (e.g., 'trad. Cousin', 'Loeb Classical Library').",
-                data_type=wvc.DataType.TEXT,
-            ),
-            wvc.Property(
-                name="language",
-                description="Language of this edition (e.g., 'fr', 'en').",
-                data_type=wvc.DataType.TEXT,
-            ),
-            wvc.Property(
-                name="pages",
-                description="Number of pages in the PDF/document.",
-                data_type=wvc.DataType.INT,
-            ),
-            wvc.Property(
-                name="chunksCount",
-                description="Total number of chunks extracted from this document.",
-                data_type=wvc.DataType.INT,
-            ),
-            wvc.Property(
-                name="toc",
-                description="Table of contents as JSON string [{title, level, page}, ...].",
-                data_type=wvc.DataType.TEXT,
-            ),
-            wvc.Property(
-                name="hierarchy",
-                description="Full hierarchical structure as JSON string.",
-                data_type=wvc.DataType.TEXT,
-            ),
-            wvc.Property(
-                name="createdAt",
-                description="Timestamp when this document was ingested.",
-                data_type=wvc.DataType.DATE,
-            ),
-            # Nested Work reference
-            wvc.Property(
-                name="work",
-                description="Reference to the Work this document is an instance of.",
-                data_type=wvc.DataType.OBJECT,
-                nested_properties=[
-                    wvc.Property(name="title", data_type=wvc.DataType.TEXT),
-                    wvc.Property(name="author", data_type=wvc.DataType.TEXT),
-                ],
-            ),
-        ],
-    )
-
 def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
     """Create the Chunk collection for vectorized text fragments.
@@ -410,7 +336,7 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
 def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True) -> None:
     """Create the complete Weaviate schema for Library RAG.
 
-    Creates all four collections: Work, Document, Chunk, Summary.
+    Creates all three collections: Work, Chunk, Summary.
 
     Args:
         client: Connected Weaviate client.
@@ -429,16 +355,13 @@ def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True)
     print("  → Work (métadonnées œuvre)...")
     create_work_collection(client)
 
-    print("  → Document (métadonnées édition)...")
-    create_document_collection(client)
-
     print("  → Chunk (fragments vectorisés)...")
     create_chunk_collection(client)
 
     print("  → Summary (résumés de chapitres)...")
     create_summary_collection(client)
 
-    print("4 collections créées")
+    print("3 collections créées")
 
 def verify_schema(client: weaviate.WeaviateClient) -> bool:
@@ -453,7 +376,7 @@ def verify_schema(client: weaviate.WeaviateClient) -> bool:
     print("\n[3/4] Vérification des collections...")
     collections = client.collections.list_all()
 
-    expected: Set[str] = {"Work", "Document", "Chunk", "Summary"}
+    expected: Set[str] = {"Work", "Chunk", "Summary"}
     actual: Set[str] = set(collections.keys())
 
     if expected == actual:
@@ -480,7 +403,7 @@ def display_schema(client: weaviate.WeaviateClient) -> None:
     collections = client.collections.list_all()
 
-    for name in ["Work", "Document", "Chunk", "Summary"]:
+    for name in ["Work", "Chunk", "Summary"]:
         if name not in collections:
             continue
@@ -523,14 +446,12 @@ def print_summary() -> None:
     print("=" * 80)
 
     print("\n✓ Architecture:")
     print("  - Work: Source unique pour author/title")
-    print("  - Document: Métadonnées d'édition avec référence vers Work")
-    print("  - Chunk: Fragments vectorisés (text + summary + keywords)")
+    print("  - Chunk: Fragments vectorisés (text + keywords)")
     print("  - Summary: Résumés de chapitres vectorisés (text + concepts)")
 
     print("\n✓ Vectorisation:")
     print("  - Work: NONE")
-    print("  - Document: NONE")
-    print("  - Chunk: text2vec (text + summary + keywords)")
-    print("  - Summary: text2vec (text + concepts)")
+    print("  - Chunk: GPU embedder (BAAI/bge-m3, 1024-dim)")
+    print("  - Summary: GPU embedder (BAAI/bge-m3, 1024-dim)")
 
     print("\n✓ Index Vectoriel (Optimisation 2026):")
     print("  - Chunk: HNSW + RQ (~75% moins de RAM)")
     print("  - Summary: HNSW + RQ")