refactor: Rename Chunk_v2/Summary_v2 collections to Chunk/Summary
- Add migrate_rename_collections.py script for data migration
- Update flask_app.py to use new collection names
- Update weaviate_ingest.py to use new collection names
- Update schema.py documentation
- Update README.md and ANALYSE_MCP_TOOLS.md

Migration completed: 5372 chunks + 114 summaries preserved with vectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
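The copy-verify-drop pattern behind a rename like this can be sketched as follows. This is not the contents of migrate_rename_collections.py, which this commit view does not show. It is a minimal in-memory illustration that assumes a hypothetical `store` dict in place of the database, with each object holding its `properties` and its `vector`. Only the two collection name pairs come from the commit title.

```python
from copy import deepcopy

# Old-to-new collection names, taken from the commit title.
RENAMES = {"Chunk_v2": "Chunk", "Summary_v2": "Summary"}


def rename_collections(store, renames=RENAMES):
    """Copy each old collection to its new name, verify, then drop the old one.

    `store` is a hypothetical in-memory stand-in for the database:
    {collection_name: [{"properties": {...}, "vector": [...]}, ...]}
    Returns {new_name: number_of_objects_migrated}.
    """
    report = {}
    for old_name, new_name in renames.items():
        if old_name not in store:
            continue  # nothing to migrate, e.g. the migration already ran
        if new_name in store:
            raise ValueError(f"target collection {new_name!r} already exists")

        source = store[old_name]
        copied = deepcopy(source)

        # Verify before deleting anything: same count, vectors unchanged.
        if len(copied) != len(source):
            raise RuntimeError(f"object count mismatch for {old_name!r}")
        for before, after in zip(source, copied):
            if before["vector"] != after["vector"]:
                raise RuntimeError(f"vector mismatch in {old_name!r}")

        store[new_name] = copied
        del store[old_name]
        report[new_name] = len(copied)
    return report
```

Dropping the old collection only after the count and vector checks pass keeps the original data available if the copy goes wrong, and skipping missing sources makes a second run a no-op.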
@@ -9,8 +9,8 @@ Schema Architecture:
 querying. The hierarchy is::
 
 Work (metadata only)
-├── Chunk_v2 (vectorized text fragments)
-└── Summary_v2 (vectorized chapter summaries)
+├── Chunk (vectorized text fragments)
+└── Summary (vectorized chapter summaries)
 
 Collections:
 **Work** (no vectorization):
@@ -18,21 +18,21 @@ Collections:
 Stores canonical metadata: title, author, year, language, genre.
 Not vectorized - used only for metadata and relationships.
 
-**Chunk_v2** (manual GPU vectorization):
+**Chunk** (manual GPU vectorization):
 Text fragments optimized for semantic search (200-800 chars).
 Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
 Vectorized fields: text, keywords.
 Non-vectorized fields: workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex.
 Includes nested Work reference for denormalized access.
 
-**Summary_v2** (manual GPU vectorization):
+**Summary** (manual GPU vectorization):
 LLM-generated chapter/section summaries for high-level search.
 Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
 Vectorized fields: text, concepts.
 Includes nested Work reference for denormalized access.
 
 Vectorization Strategy:
-- Only Chunk_v2.text, Chunk_v2.keywords, Summary_v2.text, and Summary_v2.concepts are vectorized
+- Only Chunk.text, Chunk.keywords, Summary.text, and Summary.concepts are vectorized
 - Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
 - Metadata fields use skip_vectorization=True for filtering only
 - Work collection has no vectorizer (metadata only)
@@ -56,8 +56,8 @@ Nested Objects:
 denormalized data access. This allows single-query retrieval of chunk
 data with its Work metadata without joins::
 
-Chunk_v2.work = {title, author}
-Summary_v2.work = {title, author}
+Chunk.work = {title, author}
+Summary.work = {title, author}
 
 Usage:
 From command line::
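The nested `work` object described in the docstring above can be illustrated with a small sketch. The property names (`text`, `keywords`, `workTitle`, `workAuthor`, `work`) come from the docstring. The helper functions `build_chunk` and `describe` are hypothetical and are not part of schema.py or weaviate_ingest.py.

```python
def build_chunk(text, keywords, work_title, work_author, **metadata):
    """Build a Chunk payload that carries its Work metadata inline.

    Hypothetical helper: extra metadata such as sectionPath or orderIndex
    is passed through unchanged as keyword arguments.
    """
    return {
        "text": text,
        "keywords": keywords,
        "workTitle": work_title,
        "workAuthor": work_author,
        # Denormalized copy of the parent Work, so a reader of this chunk
        # never needs a second lookup in the Work collection.
        "work": {"title": work_title, "author": work_author},
        **metadata,
    }


def describe(chunk):
    """Format a chunk using only data stored on the chunk itself."""
    work = chunk["work"]
    return f'{work["title"]} ({work["author"]}): {chunk["text"]}'
```

The trade-off is the usual one for denormalization: reads need no join, but a change to a Work's title or author has to be written to every chunk and summary that embeds it.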