feat: Remove Document collection from schema

BREAKING CHANGE: Document collection removed from Weaviate schema

Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)
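
The file-based side of this split can be read back with a small loader. A minimal sketch, assuming `chunks.json` holds a JSON array of per-chunk records (the actual layout is not specified by this commit, and the function name is illustrative):

```python
import json
from pathlib import Path

def load_chunk_metadata(path: Path = Path("chunks.json")) -> list[dict]:
    """Read file-based chunk metadata.

    Assumption: the file contains a JSON array of per-chunk records;
    the real chunks.json layout may differ.
    """
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"{path} should contain a JSON array of chunk records")
    return data
```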

Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2
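
With the Document collection gone, the verification step reduces to a set comparison. A minimal sketch (the constant and the signature are illustrative, not the actual `schema.py` code):

```python
# The three collections that must exist after this commit.
EXPECTED_COLLECTIONS = {"Work", "Chunk_v2", "Summary_v2"}

def verify_schema(existing_collections: set[str]) -> bool:
    """Return True when exactly the three expected collections exist;
    the old Document collection must be gone."""
    return existing_collections == EXPECTED_COLLECTIONS
```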

Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict

Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid
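
The renamed result type might look like the following sketch; the fields other than `work_uuid` are illustrative assumptions, not the actual `types.py` definition:

```python
from typing import TypedDict

class WeaviateIngestResult(TypedDict):
    """Ingestion outcome, keyed by the Work object's UUID
    (this field was previously named document_uuid)."""
    work_uuid: str
    chunks_inserted: int      # illustrative field
    summaries_inserted: int   # illustrative field

result: WeaviateIngestResult = {
    "work_uuid": "9f1c2e4a-0000-0000-0000-000000000000",
    "chunks_inserted": 120,
    "summaries_inserted": 14,
}
```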

Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization

Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)

Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint

Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
commit 53f6a92365
parent 625c52a925
Date: 2026-01-09 14:13:51 +01:00
8 changed files with 698 additions and 238 deletions

.claude/CLAUDE.md

@@ -138,34 +138,30 @@ The core of the application is `utils/pdf_pipeline.py`, which orchestrates a 10-
 - `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC)
 - `ingest_to_weaviate=True` - Insert chunks into Weaviate
-### Weaviate Schema (4 Collections)
+### Weaviate Schema (3 Collections)
-Defined in `schema.py`, the database uses a normalized design with denormalized nested objects:
+Defined in `schema.py`, the database uses a denormalized design with nested objects:
 ```
 Work (no vectorizer)
   title, author, year, language, genre
-  ├─► Document (no vectorizer)
-  │     sourceId, edition, pages, toc, hierarchy
-  │     ├─► Chunk (text2vec-transformers) ⭐ PRIMARY
-  │     │     text (VECTORIZED)
-  │     │     keywords (VECTORIZED)
-  │     │     sectionPath, chapterTitle, unitType, orderIndex
-  │     │     work: {title, author} (nested)
-  │     │     document: {sourceId, edition} (nested)
-  │     │
-  │     └─► Summary (text2vec-transformers)
-  │           text (VECTORIZED)
-  │           concepts (VECTORIZED)
-  │           sectionPath, title, level, chunksCount
-  │           document: {sourceId} (nested)
+  ├─► Chunk_v2 (manual GPU vectorization) ⭐ PRIMARY
+  │     text (VECTORIZED)
+  │     keywords (VECTORIZED)
+  │     workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex
+  │     work: {title, author} (nested)
+  │
+  └─► Summary_v2 (manual GPU vectorization)
+        text (VECTORIZED)
+        concepts (VECTORIZED)
+        sectionPath, title, level, chunksCount
+        work: {title, author} (nested)
 ```
 **Vectorization Strategy:**
-- Only `Chunk.text`, `Chunk.keywords`, `Summary.text`, `Summary.concepts` are vectorized
+- Only `Chunk_v2.text`, `Chunk_v2.keywords`, `Summary_v2.text`, `Summary_v2.concepts` are vectorized
+- Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
 - Metadata fields use `skip_vectorization=True` for filtering performance
 - Nested objects avoid joins for efficient single-query retrieval
+- BAAI/bge-m3 model: 1024 dimensions, 8192 token context
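
The vectorization contract above can be expressed as a small helper that selects only the embedded fields of an object. A sketch (the constant and function names are illustrative, not part of the codebase):

```python
# Only these fields feed the GPU embedder; all other properties are
# metadata stored with skip_vectorization=True.
VECTORIZED_FIELDS = {
    "Chunk_v2": ("text", "keywords"),
    "Summary_v2": ("text", "concepts"),
}

def embedding_input(collection: str, obj: dict) -> str:
    """Concatenate the vectorized fields of one object into the single
    string handed to the BAAI/bge-m3 embedder (1024-dim output)."""
    fields = VECTORIZED_FIELDS[collection]
    return "\n".join(str(obj[f]) for f in fields if obj.get(f))
```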