refactor: Rename Chunk_v2/Summary_v2 collections to Chunk/Summary

- Add migrate_rename_collections.py script for data migration
- Update flask_app.py to use new collection names
- Update weaviate_ingest.py to use new collection names
- Update schema.py documentation
- Update README.md and ANALYSE_MCP_TOOLS.md

Migration completed: 5372 chunks + 114 summaries preserved with vectors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-14 23:59:03 +01:00
parent 5a732e885f
commit 1bf570e201
6 changed files with 383 additions and 46 deletions

@@ -9,8 +9,8 @@ Schema Architecture:
querying. The hierarchy is::
Work (metadata only)
-├── Chunk_v2 (vectorized text fragments)
-└── Summary_v2 (vectorized chapter summaries)
+├── Chunk (vectorized text fragments)
+└── Summary (vectorized chapter summaries)
Collections:
**Work** (no vectorization):
@@ -18,21 +18,21 @@ Collections:
Stores canonical metadata: title, author, year, language, genre.
Not vectorized - used only for metadata and relationships.
-**Chunk_v2** (manual GPU vectorization):
+**Chunk** (manual GPU vectorization):
Text fragments optimized for semantic search (200-800 chars).
Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
Vectorized fields: text, keywords.
Non-vectorized fields: workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex.
Includes nested Work reference for denormalized access.
-**Summary_v2** (manual GPU vectorization):
+**Summary** (manual GPU vectorization):
LLM-generated chapter/section summaries for high-level search.
Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
Vectorized fields: text, concepts.
Includes nested Work reference for denormalized access.
Vectorization Strategy:
-- Only Chunk_v2.text, Chunk_v2.keywords, Summary_v2.text, and Summary_v2.concepts are vectorized
+- Only Chunk.text, Chunk.keywords, Summary.text, and Summary.concepts are vectorized
- Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
- Metadata fields use skip_vectorization=True for filtering only
- Work collection has no vectorizer (metadata only)
@@ -56,8 +56,8 @@ Nested Objects:
denormalized data access. This allows single-query retrieval of chunk
data with its Work metadata without joins::
-Chunk_v2.work = {title, author}
-Summary_v2.work = {title, author}
+Chunk.work = {title, author}
+Summary.work = {title, author}
Usage:
From command line::