feat: Remove Document collection from schema

BREAKING CHANGE: Document collection removed from Weaviate schema

Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)
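
The file-based side of this split can be read back with a small loader. A minimal sketch, assuming `chunks.json` holds a JSON array of per-chunk records (the actual layout is not specified by this commit, and the function name is illustrative):

```python
import json
from pathlib import Path

def load_chunk_metadata(path: Path = Path("chunks.json")) -> list[dict]:
    """Read file-based chunk metadata.

    Assumption: the file contains a JSON array of per-chunk records;
    the real chunks.json layout may differ.
    """
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path} should contain a JSON array of chunk records")
    return data
```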

Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2
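
With the Document collection gone, the verification step reduces to a set comparison. A minimal sketch (the constant and the signature are illustrative, not the actual `schema.py` code):

```python
# The three collections that must exist after this commit.
EXPECTED_COLLECTIONS = {"Work", "Chunk_v2", "Summary_v2"}

def verify_schema(existing_collections: set[str]) -> bool:
    """Return True when exactly the three expected collections exist;
    the old Document collection must be gone."""
    return existing_collections == EXPECTED_COLLECTIONS
```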

Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict

Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid
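
The renamed result type might look like the following sketch; the fields other than `work_uuid` are illustrative assumptions, not the actual `types.py` definition:

```python
from typing import TypedDict

class WeaviateIngestResult(TypedDict):
    """Ingestion outcome, keyed by the Work object's UUID
    (this field was previously named document_uuid)."""
    work_uuid: str
    chunks_inserted: int      # illustrative field
    summaries_inserted: int   # illustrative field

result: WeaviateIngestResult = {
    "work_uuid": "9f1c2e4a-0000-0000-0000-000000000000",
    "chunks_inserted": 120,
    "summaries_inserted": 14,
}
```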

Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization

Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)

Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint

Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
commit 53f6a92365
parent 625c52a925
Date: 2026-01-09 14:13:51 +01:00
8 changed files with 698 additions and 238 deletions

.claude/CLAUDE.md

@@ -138,34 +138,30 @@ The core of the application is `utils/pdf_pipeline.py`, which orchestrates a 10-
 - `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC)
 - `ingest_to_weaviate=True` - Insert chunks into Weaviate
-### Weaviate Schema (4 Collections)
+### Weaviate Schema (3 Collections)
-Defined in `schema.py`, the database uses a normalized design with denormalized nested objects:
+Defined in `schema.py`, the database uses a denormalized design with nested objects:
 ```
 Work (no vectorizer)
   title, author, year, language, genre
-  ├─► Document (no vectorizer)
-  │     sourceId, edition, pages, toc, hierarchy
-  │     ├─► Chunk (text2vec-transformers) ⭐ PRIMARY
-  │     │     text (VECTORIZED)
-  │     │     keywords (VECTORIZED)
-  │     │     sectionPath, chapterTitle, unitType, orderIndex
-  │     │     work: {title, author} (nested)
-  │     │     document: {sourceId, edition} (nested)
-  │     │
-  │     └─► Summary (text2vec-transformers)
-  │           text (VECTORIZED)
-  │           concepts (VECTORIZED)
-  │           sectionPath, title, level, chunksCount
-  │           document: {sourceId} (nested)
+  ├─► Chunk_v2 (manual GPU vectorization) ⭐ PRIMARY
+  │     text (VECTORIZED)
+  │     keywords (VECTORIZED)
+  │     workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex
+  │     work: {title, author} (nested)
+  │
+  └─► Summary_v2 (manual GPU vectorization)
+        text (VECTORIZED)
+        concepts (VECTORIZED)
+        sectionPath, title, level, chunksCount
+        work: {title, author} (nested)
 ```
 **Vectorization Strategy:**
-- Only `Chunk.text`, `Chunk.keywords`, `Summary.text`, `Summary.concepts` are vectorized
+- Only `Chunk_v2.text`, `Chunk_v2.keywords`, `Summary_v2.text`, `Summary_v2.concepts` are vectorized
+- Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
 - Metadata fields use `skip_vectorization=True` for filtering performance
 - Nested objects avoid joins for efficient single-query retrieval
+- BAAI/bge-m3 model: 1024 dimensions, 8192 token context
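
The vectorization contract above can be expressed as a small helper that selects only the embedded fields of an object. A sketch (the constant and function names are illustrative, not part of the codebase):

```python
# Only these fields feed the GPU embedder; all other properties are
# metadata stored with skip_vectorization=True.
VECTORIZED_FIELDS = {
    "Chunk_v2": ("text", "keywords"),
    "Summary_v2": ("text", "concepts"),
}

def embedding_input(collection: str, obj: dict) -> str:
    """Concatenate the vectorized fields of one object into the single
    string handed to the BAAI/bge-m3 embedder (1024-dim output)."""
    fields = VECTORIZED_FIELDS[collection]
    return "\n".join(str(obj[f]) for f in fields if obj.get(f))
```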