feat: Remove Document collection from schema
BREAKING CHANGE: Document collection removed from Weaviate schema

Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)

Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2

Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict

Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid

Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization

Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)

Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint

Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
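The `document_uuid → work_uuid` rename in types.py can be illustrated with a minimal TypedDict sketch. Only the field rename itself is stated by the commit; the counter fields, helper names, and values below are hypothetical:

```python
from typing import TypedDict


class WeaviateIngestResult(TypedDict):
    """Result of ingesting one document's chunks (sketch; only work_uuid
    is confirmed by the commit — the other fields are assumptions)."""
    work_uuid: str           # replaces the removed document_uuid
    chunks_inserted: int     # hypothetical counter field
    summaries_inserted: int  # hypothetical counter field


class DeleteResult(TypedDict):
    """Result of delete_document_chunks() after Document removal (sketch)."""
    work_uuid: str
    chunks_deleted: int      # hypothetical counter field


# Illustrative value — the UUID string is made up:
result: WeaviateIngestResult = {
    "work_uuid": "6f1c0000-0000-0000-0000-000000000000",
    "chunks_inserted": 42,
    "summaries_inserted": 7,
}
assert "document_uuid" not in result
```

Callers that previously read `result["document_uuid"]` would now read `result["work_uuid"]`, which is what makes this a breaking change for downstream code.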
@@ -138,34 +138,30 @@ The core of the application is `utils/pdf_pipeline.py`, which orchestrates a 10-
 - `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC)
 - `ingest_to_weaviate=True` - Insert chunks into Weaviate
 
-### Weaviate Schema (4 Collections)
+### Weaviate Schema (3 Collections)
 
-Defined in `schema.py`, the database uses a normalized design with denormalized nested objects:
+Defined in `schema.py`, the database uses a denormalized design with nested objects:
 
 ```
 Work (no vectorizer)
   title, author, year, language, genre
 
-  ├─► Document (no vectorizer)
-  │     sourceId, edition, pages, toc, hierarchy
-  │
-  │     ├─► Chunk (text2vec-transformers) ⭐ PRIMARY
-  │     │     text (VECTORIZED)
-  │     │     keywords (VECTORIZED)
-  │     │     sectionPath, chapterTitle, unitType, orderIndex
-  │     │     work: {title, author} (nested)
-  │     │     document: {sourceId, edition} (nested)
-  │     │
-  │     └─► Summary (text2vec-transformers)
-  │           text (VECTORIZED)
-  │           concepts (VECTORIZED)
-  │           sectionPath, title, level, chunksCount
-  │           document: {sourceId} (nested)
+  ├─► Chunk_v2 (manual GPU vectorization) ⭐ PRIMARY
+  │     text (VECTORIZED)
+  │     keywords (VECTORIZED)
+  │     workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex
+  │     work: {title, author} (nested)
+  │
+  └─► Summary_v2 (manual GPU vectorization)
+        text (VECTORIZED)
+        concepts (VECTORIZED)
+        sectionPath, title, level, chunksCount
+        work: {title, author} (nested)
 ```
 
 **Vectorization Strategy:**
-- Only `Chunk.text`, `Chunk.keywords`, `Summary.text`, `Summary.concepts` are vectorized
+- Only `Chunk_v2.text`, `Chunk_v2.keywords`, `Summary_v2.text`, `Summary_v2.concepts` are vectorized
+- Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
 - Metadata fields use `skip_vectorization=True` for filtering performance
 - Nested objects avoid joins for efficient single-query retrieval
 - BAAI/bge-m3 model: 1024 dimensions, 8192 token context
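For context, a collection like the `Chunk_v2` in the schema diagram could be declared roughly as follows with the weaviate-client v4 API. This is a sketch reconstructed from the diagram, not the project's actual `schema.py`; `create_chunk_v2_collection` is a hypothetical helper name, and a running Weaviate instance is assumed:

```python
from weaviate.classes.config import Configure, DataType, Property


def create_chunk_v2_collection(client):
    """Sketch of a Chunk_v2 definition matching the schema diagram.

    Vectorizer 'none' means vectors are computed externally (here, the
    GPU BAAI/bge-m3 embedder) and supplied at insert time.
    """
    return client.collections.create(
        "Chunk_v2",
        vectorizer_config=Configure.Vectorizer.none(),
        properties=[
            # The two vectorized fields
            Property(name="text", data_type=DataType.TEXT),
            Property(name="keywords", data_type=DataType.TEXT_ARRAY),
            # Metadata fields: skipped from vectorization, used for filtering
            Property(name="workTitle", data_type=DataType.TEXT,
                     skip_vectorization=True),
            Property(name="workAuthor", data_type=DataType.TEXT,
                     skip_vectorization=True),
            Property(name="sectionPath", data_type=DataType.TEXT,
                     skip_vectorization=True),
            Property(name="chapterTitle", data_type=DataType.TEXT,
                     skip_vectorization=True),
            Property(name="unitType", data_type=DataType.TEXT,
                     skip_vectorization=True),
            Property(name="orderIndex", data_type=DataType.INT),
            # Denormalized nested object — avoids a join at query time
            Property(
                name="work",
                data_type=DataType.OBJECT,
                nested_properties=[
                    Property(name="title", data_type=DataType.TEXT),
                    Property(name="author", data_type=DataType.TEXT),
                ],
            ),
        ],
    )


# Insertion then supplies the externally computed 1024-dim bge-m3 vector:
# chunks = client.collections.get("Chunk_v2")
# chunks.data.insert(properties={...}, vector=embedding)  # embedding: list[float]
```

With `Configure.Vectorizer.none()`, Weaviate stores and indexes whatever vector the ingestion code passes in, which is what lets the pipeline keep embedding on the local RTX 4070 instead of running a text2vec-transformers module server-side.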