feat: Remove Document collection from schema

BREAKING CHANGE: Document collection removed from Weaviate schema

Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)

Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2

Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict

Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid

Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization

Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)

Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint

Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
2026-01-09 14:13:51 +01:00
parent 625c52a925
commit 53f6a92365
8 changed files with 698 additions and 238 deletions

2
.gitignore vendored
View File

@@ -45,4 +45,4 @@ vectorize_remaining.py
migrate_chunk_*.py migrate_chunk_*.py
# Archives (migration scripts moved here) # Archives (migration scripts moved here)
archive/ archive/chunk_v2_backup.json

View File

@@ -0,0 +1,156 @@
# Analyse: Collection Document - À supprimer
**Date**: 2026-01-09
**Statut**: ✅ CONFIRMATION - La collection Document n'est PAS utilisée et DOIT être supprimée
## Problème identifié
La collection `Document` est toujours définie dans le schéma et contient actuellement **13 objets**, alors que l'architecture devrait utiliser uniquement:
- `Work` - Métadonnées des œuvres
- `Chunk_v2` - Fragments vectorisés (5,372 chunks)
- `Summary_v2` - Résumés de sections (114 summaries)
## État actuel
### Collections existantes (Weaviate):
```
Work: 19 objets ✓ UTILISÉ
Document: 13 objets ✗ NON UTILISÉ (à supprimer)
Chunk_v2: 5,372 objets ✓ UTILISÉ
Summary_v2: 114 objets ✓ UTILISÉ
Chunk: 0 objets (ancienne collection, peut être supprimée)
Conversation, Message, Thought: Collections chat (séparées)
```
### Données dans Document:
```json
{
"sourceId": "Alan_Turing_and_John_von_Neumann_Their_B",
"edition": null,
"pages": 0,
"chunksCount": 11,
"work": null
}
```
**Observation**: La plupart des champs sont NULL ou 0 (pas de données utiles).
## Analyse du code
### 1. Schéma (`schema.py`)
**Lignes 159-224**: Définition complète de la collection Document
- Créée par défaut lors de l'initialisation du schéma
- Propriétés: sourceId, edition, language, pages, chunksCount, toc, hierarchy, createdAt, work (nested)
**Problème de cohérence** (lignes 747-757 dans `weaviate_ingest.py`):
```python
doc_obj: Dict[str, Any] = {
"sourceId": doc_name,
"title": title, # ❌ N'EXISTE PAS dans schema.py
"author": author, # ❌ N'EXISTE PAS dans schema.py
"toc": json.dumps(toc),
"hierarchy": json.dumps(hierarchy),
"pages": pages,
"chunksCount": chunks_count,
"language": metadata.get("language"),
"createdAt": datetime.now().isoformat(),
}
```
Le code d'ingestion essaie d'insérer des champs `title` et `author` qui n'existent pas dans le schéma! Cela devrait causer une erreur mais est silencieusement ignoré.
### 2. Ingestion (`utils/weaviate_ingest.py`)
**Fonction `ingest_document_metadata()` (lignes 695-765)**:
- Insère les métadonnées du document dans la collection Document
- Stocke: sourceId, toc, hierarchy, pages, chunksCount, language, createdAt
**Fonction `ingest_document()` (lignes 891-1107)**:
- Paramètre: `ingest_document_collection: bool = True` (ligne 909)
- Par défaut, la fonction INSÈRE dans Document collection (ligne 1010)
**Fonction `delete_document_from_weaviate()` (lignes 1213-1267)**:
- Supprime de la collection Document (ligne 1243)
### 3. Flask App (`flask_app.py`)
**Résultat**: ✅ AUCUNE référence à la collection Document
- Pas de `collections.get("Document")`
- Pas de requêtes vers Document
- Les TOC et métadonnées sont chargées depuis les fichiers `chunks.json` (ligne 3360)
## Conclusion: Document n'est PAS nécessaire
### Données actuellement dans Document:
| Champ | Disponible ailleurs? | Source alternative |
|-------|---------------------|-------------------|
| `sourceId` | ✓ | `Chunk_v2.workTitle` (dénormalisé) |
| `toc` | ✓ | `output/<doc>/<doc>_chunks.json` |
| `hierarchy` | ✓ | `output/<doc>/<doc>_chunks.json` |
| `pages` | ✓ | `output/<doc>/<doc>_chunks.json` (metadata.pages) |
| `chunksCount` | ✓ | Dérivable via `Chunk_v2.aggregate.over_all(filter=workTitle)` |
| `language` | ✓ | `Work.language` + `Chunk_v2.language` |
| `createdAt` | ✓ | Dérivable via horodatage système des fichiers output/ |
| `edition` | ✗ | Jamais renseigné (toujours NULL) |
| `work` (nested) | ✓ | Collection `Work` dédiée |
**Verdict**: Toutes les informations utiles de Document sont disponibles ailleurs. La collection est redondante.
## Impact de la suppression
### ✅ Aucun impact négatif:
- Flask app n'utilise pas Document
- TOC/hierarchy chargés depuis fichiers JSON
- Métadonnées disponibles dans Work et Chunk_v2
### ✅ Bénéfices:
- Simplifie l'architecture (3 collections au lieu de 4)
- Réduit la mémoire Weaviate (~13 objets + index)
- Simplifie le code d'ingestion (moins d'étapes)
- Évite la confusion sur "quelle collection utiliser?"
## Plan d'action recommandé
### Étape 1: Supprimer la collection Document de Weaviate
```python
import weaviate
client = weaviate.connect_to_local()
client.collections.delete("Document")
client.close()
```
### Étape 2: Supprimer de `schema.py`
- Supprimer fonction `create_document_collection()` (lignes 159-224)
- Supprimer appel dans `create_schema()` (ligne 432)
- Mettre à jour `verify_schema()` pour ne plus vérifier Document (ligne 456)
- Mettre à jour `display_schema()` pour ne plus afficher Document (ligne 483)
### Étape 3: Nettoyer `utils/weaviate_ingest.py`
- Supprimer fonction `ingest_document_metadata()` (lignes 695-765)
- Supprimer paramètre `ingest_document_collection` (ligne 909)
- Supprimer appel à `ingest_document_metadata()` (ligne 1010)
- Supprimer suppression de Document dans `delete_document_from_weaviate()` (lignes 1241-1248)
### Étape 4: Mettre à jour la documentation
- Mettre à jour `schema.py` docstring (ligne 12: supprimer Document de la hiérarchie)
- Mettre à jour `CLAUDE.md` (ligne 11: supprimer Document)
- Mettre à jour `.claude/CLAUDE.md` (supprimer références à Document)
### Étape 5: Supprimer aussi la collection `Chunk` (ancienne)
```python
# Chunk_v2 la remplace complètement
client.collections.delete("Chunk")
```
## Risques
**Aucun risque identifié** car:
- Collection non utilisée par l'application
- Données disponibles ailleurs
- Pas de dépendances externes
---
**Recommandation finale**: Procéder à la suppression immédiate de la collection Document.

View File

@@ -138,34 +138,30 @@ The core of the application is `utils/pdf_pipeline.py`, which orchestrates a 10-
- `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC) - `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC)
- `ingest_to_weaviate=True` - Insert chunks into Weaviate - `ingest_to_weaviate=True` - Insert chunks into Weaviate
### Weaviate Schema (4 Collections) ### Weaviate Schema (3 Collections)
Defined in `schema.py`, the database uses a normalized design with denormalized nested objects: Defined in `schema.py`, the database uses a denormalized design with nested objects:
``` ```
Work (no vectorizer) Work (no vectorizer)
title, author, year, language, genre title, author, year, language, genre
├─► Document (no vectorizer) ├─► Chunk_v2 (manual GPU vectorization) ⭐ PRIMARY
sourceId, edition, pages, toc, hierarchy text (VECTORIZED)
│ keywords (VECTORIZED)
│ workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex
│ work: {title, author} (nested) │ work: {title, author} (nested)
│ ├─► Chunk (text2vec-transformers) ⭐ PRIMARY └─► Summary_v2 (manual GPU vectorization)
│ │ text (VECTORIZED) text (VECTORIZED)
│ │ keywords (VECTORIZED) concepts (VECTORIZED)
│ │ sectionPath, chapterTitle, unitType, orderIndex sectionPath, title, level, chunksCount
│ │ work: {title, author} (nested) work: {title, author} (nested)
│ │ document: {sourceId, edition} (nested)
│ │
│ └─► Summary (text2vec-transformers)
│ text (VECTORIZED)
│ concepts (VECTORIZED)
│ sectionPath, title, level, chunksCount
│ document: {sourceId} (nested)
``` ```
**Vectorization Strategy:** **Vectorization Strategy:**
- Only `Chunk.text`, `Chunk.keywords`, `Summary.text`, `Summary.concepts` are vectorized - Only `Chunk_v2.text`, `Chunk_v2.keywords`, `Summary_v2.text`, `Summary_v2.concepts` are vectorized
- Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
- Metadata fields use `skip_vectorization=True` for filtering performance - Metadata fields use `skip_vectorization=True` for filtering performance
- Nested objects avoid joins for efficient single-query retrieval - Nested objects avoid joins for efficient single-query retrieval
- BAAI/bge-m3 model: 1024 dimensions, 8192 token context - BAAI/bge-m3 model: 1024 dimensions, 8192 token context

View File

@@ -0,0 +1,122 @@
"""
Fix Turings_Machines ingestion with corrected metadata.
The LLM returned prompt instructions instead of actual metadata.
This script:
1. Loads chunks from Turings_Machines_chunks.json
2. Corrects workTitle and workAuthor
3. Re-ingests to Weaviate with GPU embedder
"""
import json
import sys
from pathlib import Path
# Add current directory to path for imports
current_dir = Path(__file__).parent.absolute()
sys.path.insert(0, str(current_dir))
# Now import can work
import utils.weaviate_ingest as weaviate_ingest
def fix_turings_machines():
"""Fix and re-ingest Turings_Machines with corrected metadata."""
# Load chunks JSON
chunks_file = Path("output/Turings_Machines/Turings_Machines_chunks.json")
if not chunks_file.exists():
print(f"ERROR: File not found: {chunks_file}")
return
with open(chunks_file, "r", encoding="utf-8") as f:
data = json.load(f)
print("Loaded chunks JSON")
print(f" - Chunks: {len(data.get('chunks', []))}")
print(f" - Current title: {data.get('metadata', {}).get('title', 'N/A')[:80]}")
print(f" - Current author: {data.get('metadata', {}).get('author', 'N/A')[:80]}")
# Correct metadata
corrected_metadata = {
"title": "Turing's Machines",
"author": "Dorian Wiszniewski, Richard Coyne, Christopher Pierce",
"year": 2000, # Approximate - from references (Coyne 1999, etc.)
"language": "en"
}
# Update metadata
data["metadata"] = corrected_metadata
# Update all chunks with corrected metadata
for chunk in data.get("chunks", []):
chunk["workTitle"] = corrected_metadata["title"]
chunk["workAuthor"] = corrected_metadata["author"]
chunk["year"] = corrected_metadata["year"]
print("\nCorrected metadata:")
print(f" - Title: {corrected_metadata['title']}")
print(f" - Author: {corrected_metadata['author']}")
print(f" - Year: {corrected_metadata['year']}")
# Prepare chunks for ingestion (format expected by ingest_document)
chunks_for_ingestion = []
for i, chunk in enumerate(data.get("chunks", [])):
chunks_for_ingestion.append({
"text": chunk["text"],
"sectionPath": chunk.get("section", ""),
"sectionLevel": chunk.get("section_level", 1),
"chapterTitle": "",
"canonicalReference": "",
"unitType": chunk.get("type", "main_content"),
"keywords": chunk.get("concepts", []),
"language": "en",
"orderIndex": i,
})
print(f"\nPrepared {len(chunks_for_ingestion)} chunks for ingestion")
# Re-ingest to Weaviate
print("\nStarting re-ingestion with GPU embedder...")
result = weaviate_ingest.ingest_document(
doc_name="Turings_Machines",
chunks=chunks_for_ingestion,
metadata=corrected_metadata,
language="en"
)
if result.get("success"):
print(f"\nRe-ingestion successful!")
print(f" - Chunks inserted: {result.get('count', 0)}")
print(f" - Work UUID: {result.get('work_uuid', 'N/A')}")
else:
print(f"\nRe-ingestion failed!")
print(f" - Error: {result.get('error', 'Unknown')}")
# Save corrected chunks JSON
corrected_file = chunks_file.parent / f"{chunks_file.stem}_corrected.json"
with open(corrected_file, "w", encoding="utf-8") as f:
json.dump(data, f, ensure_ascii=False, indent=2)
print(f"\nSaved corrected chunks to: {corrected_file}")
return result
if __name__ == "__main__":
print("=" * 70)
print("Fix Turings_Machines Ingestion")
print("=" * 70)
result = fix_turings_machines()
if result and result.get("success"):
print("\n" + "=" * 70)
print("FIX COMPLETED SUCCESSFULLY")
print("=" * 70)
sys.exit(0)
else:
print("\n" + "=" * 70)
print("FIX FAILED")
print("=" * 70)
sys.exit(1)

View File

@@ -0,0 +1,355 @@
"""
Migrate Chunk_v2 schema from TEXT2VEC_TRANSFORMERS to NONE vectorizer.
This allows pure manual vectorization with GPU embedder, removing dependency
on Docker text2vec-transformers service.
Steps:
1. Export all existing chunks with their vectors
2. Drop Chunk_v2 collection
3. Recreate Chunk_v2 with vectorizer=none()
4. Re-insert all chunks with their vectors
5. Verify data integrity
"""
import weaviate
import weaviate.classes as wvc
from weaviate.classes.config import Configure, Property, DataType, VectorDistances
import sys
from pathlib import Path
import json
from typing import List, Dict, Any
import time
# Add to path for imports
sys.path.insert(0, str(Path(__file__).parent))
def export_chunks():
"""Export all chunks with their vectors."""
print("\n" + "="*70)
print("STEP 1: Exporting existing chunks")
print("="*70)
client = weaviate.connect_to_local()
try:
chunk_coll = client.collections.get("Chunk_v2")
# Count total
count = chunk_coll.aggregate.over_all().total_count
print(f"Total chunks to export: {count}")
# Export all with vectors
chunks = []
batch_size = 1000
for i, obj in enumerate(chunk_coll.iterator(include_vector=True)):
if i % 100 == 0:
print(f" Exported {i}/{count} chunks...", end='\r')
chunks.append({
'uuid': str(obj.uuid),
'properties': obj.properties,
'vector': obj.vector
})
print(f" Exported {len(chunks)}/{count} chunks... DONE")
# Save to file
backup_file = Path("chunk_v2_backup.json")
with open(backup_file, 'w', encoding='utf-8') as f:
json.dump(chunks, f, ensure_ascii=False, indent=2)
print(f"\nBackup saved to: {backup_file}")
print(f" File size: {backup_file.stat().st_size / 1024 / 1024:.2f} MB")
return chunks
finally:
client.close()
def recreate_schema():
"""Drop and recreate Chunk_v2 with vectorizer=none()."""
print("\n" + "="*70)
print("STEP 2: Recreating Chunk_v2 schema")
print("="*70)
client = weaviate.connect_to_local()
try:
# Drop existing collection
print("Dropping existing Chunk_v2 collection...")
try:
client.collections.delete("Chunk_v2")
print(" Collection dropped")
except Exception as e:
print(f" Warning: {e}")
time.sleep(2)
# Create new collection with vectorizer=none()
print("\nCreating new Chunk_v2 with vectorizer=none()...")
client.collections.create(
name="Chunk_v2",
description="Document chunks with manual GPU vectorization (BAAI/bge-m3, 1024-dim)",
vectorizer_config=Configure.Vectorizer.none(), # MANUAL VECTORIZATION ONLY
vector_index_config=Configure.VectorIndex.hnsw(
distance_metric=VectorDistances.COSINE,
ef_construction=128,
max_connections=32,
quantizer=Configure.VectorIndex.Quantizer.rq()
),
properties=[
Property(name="text", data_type=DataType.TEXT, description="Chunk text content"),
Property(name="workTitle", data_type=DataType.TEXT, skip_vectorization=True, description="Work title"),
Property(name="workAuthor", data_type=DataType.TEXT, skip_vectorization=True, description="Work author"),
Property(name="sectionPath", data_type=DataType.TEXT, skip_vectorization=True, description="Section path"),
Property(name="sectionLevel", data_type=DataType.INT, skip_vectorization=True, description="Section level"),
Property(name="chapterTitle", data_type=DataType.TEXT, skip_vectorization=True, description="Chapter title"),
Property(name="canonicalReference", data_type=DataType.TEXT, skip_vectorization=True, description="Canonical reference"),
Property(name="unitType", data_type=DataType.TEXT, skip_vectorization=True, description="Unit type"),
Property(name="keywords", data_type=DataType.TEXT_ARRAY, skip_vectorization=True, description="Keywords"),
Property(name="language", data_type=DataType.TEXT, skip_vectorization=True, description="Language code"),
Property(name="year", data_type=DataType.INT, skip_vectorization=True, description="Publication year"),
Property(name="orderIndex", data_type=DataType.INT, skip_vectorization=True, description="Order index"),
]
)
print(" Collection created with vectorizer=none()")
# Verify
chunk_coll = client.collections.get("Chunk_v2")
config = chunk_coll.config.get()
print(f"\nVerification:")
print(f" Vectorizer: {config.vectorizer}")
print(f" Vector index: {config.vector_index_type}")
if str(config.vectorizer) == "Vectorizers.NONE":
print(" SUCCESS: Manual vectorization configured")
return True
else:
print(" ERROR: Vectorizer not set to NONE")
return False
finally:
client.close()
def reimport_chunks(chunks: List[Dict[str, Any]]):
"""Re-import all chunks with their vectors."""
print("\n" + "="*70)
print("STEP 3: Re-importing chunks with vectors")
print("="*70)
client = weaviate.connect_to_local()
try:
chunk_coll = client.collections.get("Chunk_v2")
print(f"Total chunks to import: {len(chunks)}")
# Batch import
batch_size = 50
total_inserted = 0
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i+batch_size]
# Prepare DataObjects with vectors
import weaviate.classes.data as wvd
data_objects = []
for chunk in batch:
data_objects.append(
wvd.DataObject(
properties=chunk['properties'],
vector=chunk['vector']
)
)
# Insert batch
try:
response = chunk_coll.data.insert_many(data_objects)
total_inserted += len(batch)
print(f" Imported {total_inserted}/{len(chunks)} chunks...", end='\r')
except Exception as e:
print(f"\n ERROR in batch {i//batch_size + 1}: {e}")
print(f" Imported {total_inserted}/{len(chunks)} chunks... DONE")
# Verify count
time.sleep(2)
final_count = chunk_coll.aggregate.over_all().total_count
print(f"\nFinal count: {final_count}")
if final_count == len(chunks):
print(" SUCCESS: All chunks imported")
return True
else:
print(f" WARNING: Expected {len(chunks)}, got {final_count}")
return False
finally:
client.close()
def verify_search():
"""Verify search still works with GPU embedder."""
print("\n" + "="*70)
print("STEP 4: Verifying search functionality")
print("="*70)
# Import GPU embedder
from memory.core import get_embedder
client = weaviate.connect_to_local()
try:
chunk_coll = client.collections.get("Chunk_v2")
embedder = get_embedder()
# Test query
query = "Turing machine computation"
print(f"Test query: '{query}'")
# Generate query vector
query_vector = embedder.embed_single(query)
print(f" Query vector shape: {query_vector.shape}")
# Search
results = chunk_coll.query.near_vector(
near_vector=query_vector.tolist(),
limit=5,
return_metadata=wvc.query.MetadataQuery(distance=True)
)
print(f"\nSearch results: {len(results.objects)}")
for i, obj in enumerate(results.objects[:3]):
similarity = 1 - obj.metadata.distance
print(f" {i+1}. Work: {obj.properties.get('workTitle', 'N/A')[:50]}")
print(f" Similarity: {similarity:.3f}")
if len(results.objects) > 0:
print("\n SUCCESS: Search works with GPU embedder")
return True
else:
print("\n ERROR: No search results")
return False
finally:
client.close()
def test_new_insertion():
"""Test inserting new chunk with manual vector."""
print("\n" + "="*70)
print("STEP 5: Testing new chunk insertion")
print("="*70)
from memory.core import get_embedder
client = weaviate.connect_to_local()
try:
chunk_coll = client.collections.get("Chunk_v2")
embedder = get_embedder()
# Create test chunk
test_text = "This is a test chunk to verify manual vectorization works perfectly."
test_vector = embedder.embed_single(test_text)
print(f"Test text: '{test_text}'")
print(f"Test vector shape: {test_vector.shape}")
# Insert with manual vector
import weaviate.classes.data as wvd
uuid = chunk_coll.data.insert(
properties={
'text': test_text,
'workTitle': 'TEST_MIGRATION',
'workAuthor': 'Test Author',
'sectionPath': 'Test Section',
'language': 'en',
'year': 2026,
'orderIndex': 999999
},
vector=test_vector.tolist()
)
print(f"\nTest chunk inserted: {uuid}")
# Verify insertion
obj = chunk_coll.query.fetch_object_by_id(uuid, include_vector=True)
if obj and obj.vector and len(obj.vector) == 1024:
print(f" SUCCESS: Chunk inserted with {len(obj.vector)}-dim vector")
# Clean up test chunk
chunk_coll.data.delete_by_id(uuid)
print(f" Test chunk deleted")
return True
else:
print(f" ERROR: Chunk insertion failed")
return False
finally:
client.close()
def main():
"""Run full migration."""
print("\n" + "="*70)
print("CHUNK_V2 SCHEMA MIGRATION: TEXT2VEC_TRANSFORMERS -> NONE")
print("GPU Embedder (BAAI/bge-m3) for Manual Vectorization")
print("="*70)
try:
# Step 1: Export
chunks = export_chunks()
if not chunks:
print("\nERROR: No chunks exported")
return False
# Step 2: Recreate schema
if not recreate_schema():
print("\nERROR: Schema recreation failed")
return False
# Step 3: Reimport
if not reimport_chunks(chunks):
print("\nERROR: Reimport failed")
return False
# Step 4: Verify search
if not verify_search():
print("\nERROR: Search verification failed")
return False
# Step 5: Test new insertion
if not test_new_insertion():
print("\nERROR: New insertion test failed")
return False
print("\n" + "="*70)
print("MIGRATION COMPLETE - SUCCESS")
print("="*70)
print("\nChunk_v2 now uses:")
print(" - Vectorizer: NONE (manual vectorization only)")
print(" - GPU Embedder: BAAI/bge-m3 (1024-dim)")
print(" - All existing chunks preserved")
print(" - Search functionality verified")
print(" - New insertions working")
print("\nYou can now ingest documents with GPU embedder!")
print("text2vec-transformers is GONE forever.")
return True
except Exception as e:
print(f"\nFATAL ERROR: {e}")
import traceback
traceback.print_exc()
return False
if __name__ == "__main__":
success = main()
sys.exit(0 if success else 1)

View File

@@ -5,13 +5,12 @@ Library RAG application. It provides functions to create, verify, and display
the schema configuration for indexing and searching philosophical texts. the schema configuration for indexing and searching philosophical texts.
Schema Architecture: Schema Architecture:
The schema follows a normalized design with denormalized nested objects The schema follows a denormalized design with nested objects for efficient
for efficient querying. The hierarchy is:: querying. The hierarchy is::
Work (metadata only) Work (metadata only)
── Document (edition/translation instance) ── Chunk_v2 (vectorized text fragments)
├── Chunk (vectorized text fragments) └── Summary_v2 (vectorized chapter summaries)
└── Summary (vectorized chapter summaries)
Collections: Collections:
**Work** (no vectorization): **Work** (no vectorization):
@@ -19,27 +18,24 @@ Collections:
Stores canonical metadata: title, author, year, language, genre. Stores canonical metadata: title, author, year, language, genre.
Not vectorized - used only for metadata and relationships. Not vectorized - used only for metadata and relationships.
**Document** (no vectorization): **Chunk_v2** (manual GPU vectorization):
Represents a specific edition or translation of a Work. Text fragments optimized for semantic search (200-800 chars).
Contains: sourceId, edition, language, pages, TOC, hierarchy. Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
Vectorized fields: text, keywords.
Non-vectorized fields: workTitle, workAuthor, sectionPath, chapterTitle, unitType, orderIndex.
Includes nested Work reference for denormalized access. Includes nested Work reference for denormalized access.
**Chunk** (vectorized with text2vec-transformers): **Summary_v2** (manual GPU vectorization):
Text fragments optimized for semantic search (200-800 chars).
Vectorized fields: text, summary, keywords.
Non-vectorized fields: sectionPath, chapterTitle, unitType, orderIndex.
Includes nested Document and Work references.
**Summary** (vectorized with text2vec-transformers):
LLM-generated chapter/section summaries for high-level search. LLM-generated chapter/section summaries for high-level search.
Vectorized with Python GPU embedder (BAAI/bge-m3, 1024-dim).
Vectorized fields: text, concepts. Vectorized fields: text, concepts.
Includes nested Document reference. Includes nested Work reference for denormalized access.
Vectorization Strategy: Vectorization Strategy:
- Only Chunk.text, Chunk.summary, Chunk.keywords, Summary.text, and Summary.concepts are vectorized - Only Chunk_v2.text, Chunk_v2.keywords, Summary_v2.text, and Summary_v2.concepts are vectorized
- Uses text2vec-transformers (BAAI/bge-m3 with 1024-dim via Docker) - Manual vectorization with Python GPU embedder (BAAI/bge-m3, 1024-dim, RTX 4070)
- Metadata fields use skip_vectorization=True for filtering only - Metadata fields use skip_vectorization=True for filtering only
- Work and Document collections have no vectorizer (metadata only) - Work collection has no vectorizer (metadata only)
Vector Index Configuration (2026-01): Vector Index Configuration (2026-01):
- **HNSW Index**: Hierarchical Navigable Small World for efficient search - **HNSW Index**: Hierarchical Navigable Small World for efficient search
@@ -58,12 +54,10 @@ Migration Note (2024-12):
Nested Objects: Nested Objects:
Instead of using Weaviate cross-references, we use nested objects for Instead of using Weaviate cross-references, we use nested objects for
denormalized data access. This allows single-query retrieval of chunk denormalized data access. This allows single-query retrieval of chunk
data with its Work/Document metadata without joins:: data with its Work metadata without joins::
Chunk.work = {title, author} Chunk_v2.work = {title, author}
Chunk.document = {sourceId, edition} Summary_v2.work = {title, author}
Document.work = {title, author}
Summary.document = {sourceId}
Usage: Usage:
From command line:: From command line::
@@ -156,74 +150,6 @@ def create_work_collection(client: weaviate.WeaviateClient) -> None:
) )
def create_document_collection(client: weaviate.WeaviateClient) -> None:
"""Create the Document collection for edition/translation instances.
Args:
client: Connected Weaviate client.
Note:
Contains nested Work reference for denormalized access.
"""
client.collections.create(
name="Document",
description="A specific edition or translation of a work (PDF, ebook, etc.).",
vectorizer_config=wvc.Configure.Vectorizer.none(),
properties=[
wvc.Property(
name="sourceId",
description="Unique identifier for this document (filename without extension).",
data_type=wvc.DataType.TEXT,
),
wvc.Property(
name="edition",
description="Edition or translator (e.g., 'trad. Cousin', 'Loeb Classical Library').",
data_type=wvc.DataType.TEXT,
),
wvc.Property(
name="language",
description="Language of this edition (e.g., 'fr', 'en').",
data_type=wvc.DataType.TEXT,
),
wvc.Property(
name="pages",
description="Number of pages in the PDF/document.",
data_type=wvc.DataType.INT,
),
wvc.Property(
name="chunksCount",
description="Total number of chunks extracted from this document.",
data_type=wvc.DataType.INT,
),
wvc.Property(
name="toc",
description="Table of contents as JSON string [{title, level, page}, ...].",
data_type=wvc.DataType.TEXT,
),
wvc.Property(
name="hierarchy",
description="Full hierarchical structure as JSON string.",
data_type=wvc.DataType.TEXT,
),
wvc.Property(
name="createdAt",
description="Timestamp when this document was ingested.",
data_type=wvc.DataType.DATE,
),
# Nested Work reference
wvc.Property(
name="work",
description="Reference to the Work this document is an instance of.",
data_type=wvc.DataType.OBJECT,
nested_properties=[
wvc.Property(name="title", data_type=wvc.DataType.TEXT),
wvc.Property(name="author", data_type=wvc.DataType.TEXT),
],
),
],
)
def create_chunk_collection(client: weaviate.WeaviateClient) -> None: def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
"""Create the Chunk collection for vectorized text fragments. """Create the Chunk collection for vectorized text fragments.
@@ -410,7 +336,7 @@ def create_summary_collection(client: weaviate.WeaviateClient) -> None:
def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True) -> None: def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True) -> None:
"""Create the complete Weaviate schema for Library RAG. """Create the complete Weaviate schema for Library RAG.
Creates all four collections: Work, Document, Chunk, Summary. Creates all three collections: Work, Chunk, Summary.
Args: Args:
client: Connected Weaviate client. client: Connected Weaviate client.
@@ -429,16 +355,13 @@ def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True)
print(" → Work (métadonnées œuvre)...") print(" → Work (métadonnées œuvre)...")
create_work_collection(client) create_work_collection(client)
print(" → Document (métadonnées édition)...")
create_document_collection(client)
print(" → Chunk (fragments vectorisés)...") print(" → Chunk (fragments vectorisés)...")
create_chunk_collection(client) create_chunk_collection(client)
print(" → Summary (résumés de chapitres)...") print(" → Summary (résumés de chapitres)...")
create_summary_collection(client) create_summary_collection(client)
print("4 collections créées") print("3 collections créées")
def verify_schema(client: weaviate.WeaviateClient) -> bool: def verify_schema(client: weaviate.WeaviateClient) -> bool:
@@ -453,7 +376,7 @@ def verify_schema(client: weaviate.WeaviateClient) -> bool:
print("\n[3/4] Vérification des collections...") print("\n[3/4] Vérification des collections...")
collections = client.collections.list_all() collections = client.collections.list_all()
expected: Set[str] = {"Work", "Document", "Chunk", "Summary"} expected: Set[str] = {"Work", "Chunk", "Summary"}
actual: Set[str] = set(collections.keys()) actual: Set[str] = set(collections.keys())
if expected == actual: if expected == actual:
@@ -480,7 +403,7 @@ def display_schema(client: weaviate.WeaviateClient) -> None:
collections = client.collections.list_all() collections = client.collections.list_all()
for name in ["Work", "Document", "Chunk", "Summary"]: for name in ["Work", "Chunk", "Summary"]:
if name not in collections: if name not in collections:
continue continue
@@ -523,14 +446,12 @@ def print_summary() -> None:
print("=" * 80) print("=" * 80)
print("\n✓ Architecture:") print("\n✓ Architecture:")
print(" - Work: Source unique pour author/title") print(" - Work: Source unique pour author/title")
print(" - Document: Métadonnées d'édition avec référence vers Work") print(" - Chunk: Fragments vectorisés (text + keywords)")
print(" - Chunk: Fragments vectorisés (text + summary + keywords)")
print(" - Summary: Résumés de chapitres vectorisés (text + concepts)") print(" - Summary: Résumés de chapitres vectorisés (text + concepts)")
print("\n✓ Vectorisation:") print("\n✓ Vectorisation:")
print(" - Work: NONE") print(" - Work: NONE")
print(" - Document: NONE") print(" - Chunk: GPU embedder (BAAI/bge-m3, 1024-dim)")
print(" - Chunk: text2vec (text + summary + keywords)") print(" - Summary: GPU embedder (BAAI/bge-m3, 1024-dim)")
print(" - Summary: text2vec (text + concepts)")
print("\n✓ Index Vectoriel (Optimisation 2026):") print("\n✓ Index Vectoriel (Optimisation 2026):")
print(" - Chunk: HNSW + RQ (~75% moins de RAM)") print(" - Chunk: HNSW + RQ (~75% moins de RAM)")
print(" - Summary: HNSW + RQ") print(" - Summary: HNSW + RQ")

View File

@@ -848,7 +848,7 @@ class WeaviateIngestResult(TypedDict, total=False):
inserted: List of inserted chunk summaries (first 10). inserted: List of inserted chunk summaries (first 10).
work: Title of the ingested work. work: Title of the ingested work.
author: Author of the ingested work. author: Author of the ingested work.
document_uuid: UUID of created Document object (if any). work_uuid: UUID of created Work object (if any).
all_objects: Complete list of all inserted ChunkObjects. all_objects: Complete list of all inserted ChunkObjects.
Note: Note:
@@ -863,7 +863,7 @@ class WeaviateIngestResult(TypedDict, total=False):
inserted: List[Any] # List[InsertedChunkSummary] from weaviate_ingest inserted: List[Any] # List[InsertedChunkSummary] from weaviate_ingest
work: str work: str
author: str author: str
document_uuid: Optional[str] work_uuid: Optional[str]
all_objects: List[Any] # List[ChunkObject] from weaviate_ingest all_objects: List[Any] # List[ChunkObject] from weaviate_ingest

View File

@@ -190,9 +190,8 @@ class DeleteResult(TypedDict, total=False):
Attributes: Attributes:
success: Whether deletion succeeded. success: Whether deletion succeeded.
error: Error message if deletion failed. error: Error message if deletion failed.
deleted_chunks: Number of chunks deleted from Chunk collection. deleted_chunks: Number of chunks deleted from Chunk_v2 collection.
deleted_summaries: Number of summaries deleted from Summary collection. deleted_summaries: Number of summaries deleted from Summary_v2 collection.
deleted_document: Whether the Document object was deleted.
Example: Example:
>>> result = delete_document_chunks("platon_republique") >>> result = delete_document_chunks("platon_republique")
@@ -203,7 +202,6 @@ class DeleteResult(TypedDict, total=False):
error: str error: str
deleted_chunks: int deleted_chunks: int
deleted_summaries: int deleted_summaries: int
deleted_document: bool
# ============================================================================= # =============================================================================
@@ -379,7 +377,8 @@ def validate_document_metadata(
) )
# Validate title (required for work.title nested object) # Validate title (required for work.title nested object)
title = metadata.get("title") or metadata.get("work") # Priority: work > original_title > title (to avoid LLM prompt instructions)
title = metadata.get("work") or metadata.get("original_title") or metadata.get("title")
if not title or not str(title).strip(): if not title or not str(title).strip():
raise ValueError( raise ValueError(
f"Invalid metadata for '{doc_name}': 'title' is missing or empty. " f"Invalid metadata for '{doc_name}': 'title' is missing or empty. "
@@ -388,7 +387,8 @@ def validate_document_metadata(
) )
# Validate author (required for work.author nested object) # Validate author (required for work.author nested object)
author = metadata.get("author") # Priority: original_author > author (to avoid LLM prompt instructions)
author = metadata.get("original_author") or metadata.get("author")
if not author or not str(author).strip(): if not author or not str(author).strip():
raise ValueError( raise ValueError(
f"Invalid metadata for '{doc_name}': 'author' is missing or empty. " f"Invalid metadata for '{doc_name}': 'author' is missing or empty. "
@@ -649,8 +649,10 @@ def create_or_get_work(
logger.warning(f"Collection Work non trouvée: {e}") logger.warning(f"Collection Work non trouvée: {e}")
return None return None
title = metadata.get("title") or doc_name # Priority: work > original_title > title (to avoid LLM prompt instructions)
author = metadata.get("author") or "Inconnu" title = metadata.get("work") or metadata.get("original_title") or metadata.get("title") or doc_name
# Priority: original_author > author (to avoid LLM prompt instructions)
author = metadata.get("original_author") or metadata.get("author") or "Inconnu"
year = metadata.get("year", 0) if metadata.get("year") else 0 year = metadata.get("year", 0) if metadata.get("year") else 0
try: try:
@@ -686,76 +688,6 @@ def create_or_get_work(
return None return None
def ingest_document_metadata(
client: WeaviateClient,
doc_name: str,
metadata: Dict[str, Any],
toc: List[Dict[str, Any]],
hierarchy: Dict[str, Any],
chunks_count: int,
pages: int,
) -> Optional[str]:
"""Insert document metadata into the Document collection.
Creates a Document object containing metadata about a processed document,
including its table of contents, hierarchy structure, and statistics.
Args:
client: Active Weaviate client connection.
doc_name: Unique document identifier (sourceId).
metadata: Extracted metadata dict with keys: title, author, language.
toc: Table of contents as a hierarchical list of dicts.
hierarchy: Complete document hierarchy structure.
chunks_count: Total number of chunks in the document.
pages: Number of pages in the source PDF.
Returns:
UUID string of the created Document object, or None if insertion failed.
Example:
>>> with get_weaviate_client() as client:
... uuid = ingest_document_metadata(
... client,
... doc_name="platon_republique",
... metadata={"title": "La Republique", "author": "Platon"},
... toc=[{"title": "Livre I", "level": 1}],
... hierarchy={},
... chunks_count=150,
... pages=300,
... )
Note:
The TOC and hierarchy are serialized to JSON strings for storage.
The createdAt field is set to the current timestamp.
"""
try:
doc_collection: Collection[Any, Any] = client.collections.get("Document")
except Exception as e:
logger.warning(f"Collection Document non trouvée: {e}")
return None
try:
doc_obj: Dict[str, Any] = {
"sourceId": doc_name,
"title": metadata.get("title") or doc_name,
"author": metadata.get("author") or "Inconnu",
"toc": json.dumps(toc, ensure_ascii=False) if toc else "[]",
"hierarchy": json.dumps(hierarchy, ensure_ascii=False) if hierarchy else "{}",
"pages": pages,
"chunksCount": chunks_count,
"language": metadata.get("language", "fr"),
"createdAt": datetime.now(timezone.utc).isoformat(),
}
result = doc_collection.data.insert(doc_obj)
logger.info(f"Document metadata ingéré: {doc_name}")
return str(result)
except Exception as e:
logger.warning(f"Erreur ingestion document metadata: {e}")
return None
def ingest_summaries( def ingest_summaries(
client: WeaviateClient, client: WeaviateClient,
doc_name: str, doc_name: str,
@@ -897,14 +829,13 @@ def ingest_document(
toc: Optional[List[Dict[str, Any]]] = None, toc: Optional[List[Dict[str, Any]]] = None,
hierarchy: Optional[Dict[str, Any]] = None, hierarchy: Optional[Dict[str, Any]] = None,
pages: int = 0, pages: int = 0,
ingest_document_collection: bool = True,
ingest_summary_collection: bool = False, ingest_summary_collection: bool = False,
) -> IngestResult: ) -> IngestResult:
"""Ingest document chunks into Weaviate with nested objects. """Ingest document chunks into Weaviate with nested objects.
Main ingestion function that inserts chunks into the Chunk collection Main ingestion function that inserts chunks into the Chunk_v2 collection
with nested Work and Document references. Optionally also creates with nested Work references. Optionally also creates entries in the
entries in the Document and Summary collections. Summary_v2 collection.
This function uses batch insertion for optimal performance and This function uses batch insertion for optimal performance and
constructs proper nested objects for filtering capabilities. constructs proper nested objects for filtering capabilities.
@@ -922,12 +853,10 @@ def ingest_document(
- author: Author name - author: Author name
- edition (optional): Edition identifier - edition (optional): Edition identifier
language: ISO language code. Defaults to "fr". language: ISO language code. Defaults to "fr".
toc: Optional table of contents for Document/Summary collections. toc: Optional table of contents for Summary collection.
hierarchy: Optional complete document hierarchy structure. hierarchy: Optional complete document hierarchy structure.
pages: Number of pages in source document. Defaults to 0. pages: Number of pages in source document. Defaults to 0.
ingest_document_collection: If True, also insert into Document ingest_summary_collection: If True, also insert into Summary_v2
collection. Defaults to True.
ingest_summary_collection: If True, also insert into Summary
collection (requires toc). Defaults to False. collection (requires toc). Defaults to False.
Returns: Returns:
@@ -937,7 +866,7 @@ def ingest_document(
- inserted: Preview of first 10 inserted chunks - inserted: Preview of first 10 inserted chunks
- work: Work title - work: Work title
- author: Author name - author: Author name
- document_uuid: UUID of Document object (if created) - work_uuid: UUID of Work object (if created)
- all_objects: Complete list of inserted ChunkObjects - all_objects: Complete list of inserted ChunkObjects
- error: Error message (if failed) - error: Error message (if failed)
@@ -995,14 +924,6 @@ def ingest_document(
client, doc_name, metadata, pages client, doc_name, metadata, pages
) )
# Insérer les métadonnées du document (optionnel)
doc_uuid: Optional[str] = None
if ingest_document_collection:
doc_uuid = ingest_document_metadata(
client, doc_name, metadata, toc or [], hierarchy or {},
len(chunks), pages
)
# Insérer les résumés (optionnel) # Insérer les résumés (optionnel)
if ingest_summary_collection and toc: if ingest_summary_collection and toc:
ingest_summaries(client, doc_name, toc, {}) ingest_summaries(client, doc_name, toc, {})
@@ -1018,8 +939,10 @@ def ingest_document(
objects_to_insert: List[ChunkObject] = [] objects_to_insert: List[ChunkObject] = []
# Extraire et valider les métadonnées (validation déjà faite, juste extraction) # Extraire et valider les métadonnées (validation déjà faite, juste extraction)
title: str = metadata.get("title") or metadata.get("work") or doc_name # Priority: work > original_title > title (to avoid LLM prompt instructions)
author: str = metadata.get("author") or "Inconnu" title: str = metadata.get("work") or metadata.get("original_title") or metadata.get("title") or doc_name
# Priority: original_author > author (to avoid LLM prompt instructions)
author: str = metadata.get("original_author") or metadata.get("author") or "Inconnu"
edition: str = metadata.get("edition", "") edition: str = metadata.get("edition", "")
for idx, chunk in enumerate(chunks): for idx, chunk in enumerate(chunks):
@@ -1153,7 +1076,7 @@ def ingest_document(
inserted=inserted_summary, inserted=inserted_summary,
work=title, work=title,
author=author, author=author,
document_uuid=doc_uuid, work_uuid=work_uuid,
all_objects=objects_to_insert, all_objects=objects_to_insert,
) )
@@ -1169,9 +1092,8 @@ def ingest_document(
def delete_document_chunks(doc_name: str) -> DeleteResult: def delete_document_chunks(doc_name: str) -> DeleteResult:
"""Delete all data for a document from Weaviate collections. """Delete all data for a document from Weaviate collections.
Removes chunks, summaries, and the document metadata from their Removes chunks and summaries from their respective collections.
respective collections. Uses nested object filtering to find Uses nested object filtering to find related objects.
related objects.
This function is useful for re-processing a document after changes This function is useful for re-processing a document after changes
to the processing pipeline or to clean up test data. to the processing pipeline or to clean up test data.
@@ -1184,7 +1106,6 @@ def delete_document_chunks(doc_name: str) -> DeleteResult:
- success: True if deletion succeeded (even if no objects found) - success: True if deletion succeeded (even if no objects found)
- deleted_chunks: Number of Chunk objects deleted - deleted_chunks: Number of Chunk objects deleted
- deleted_summaries: Number of Summary objects deleted - deleted_summaries: Number of Summary objects deleted
- deleted_document: True if Document object was deleted
- error: Error message (if failed) - error: Error message (if failed)
Example: Example:
@@ -1227,23 +1148,12 @@ def delete_document_chunks(doc_name: str) -> DeleteResult:
except Exception as e: except Exception as e:
logger.warning(f"Erreur suppression summaries: {e}") logger.warning(f"Erreur suppression summaries: {e}")
# Supprimer le document
try:
doc_collection: Collection[Any, Any] = client.collections.get("Document")
result = doc_collection.data.delete_many(
where=wvq.Filter.by_property("sourceId").equal(doc_name)
)
deleted_document = result.successful > 0
except Exception as e:
logger.warning(f"Erreur suppression document: {e}")
logger.info(f"Suppression: {deleted_chunks} chunks, {deleted_summaries} summaries pour {doc_name}") logger.info(f"Suppression: {deleted_chunks} chunks, {deleted_summaries} summaries pour {doc_name}")
return DeleteResult( return DeleteResult(
success=True, success=True,
deleted_chunks=deleted_chunks, deleted_chunks=deleted_chunks,
deleted_summaries=deleted_summaries, deleted_summaries=deleted_summaries,
deleted_document=deleted_document,
) )
except Exception as e: except Exception as e: