53f6a92365
feat: Remove Document collection from schema
...
BREAKING CHANGE: Document collection removed from Weaviate schema
Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)
Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2
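The 3-collection check in verify_schema() can be sketched as follows (a minimal sketch: the real function reads collection names from the Weaviate client; only the expected names come from this commit):

```python
# Expected collections after this change (schema.py).
EXPECTED_COLLECTIONS = {"Work", "Chunk_v2", "Summary_v2"}

def verify_schema(existing: set) -> tuple:
    """Return (ok, missing) for the set of collection names found in Weaviate."""
    missing = EXPECTED_COLLECTIONS - existing
    return (not missing, missing)

ok, missing = verify_schema({"Work", "Chunk_v2", "Summary_v2"})
```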
Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict
Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid
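The renamed result type can be sketched like this (only the work_uuid rename is from the commit; the other field names are illustrative):

```python
from typing import TypedDict

class WeaviateIngestResult(TypedDict):
    work_uuid: str           # renamed from document_uuid in this commit
    chunks_ingested: int     # illustrative field
    summaries_ingested: int  # illustrative field

result: WeaviateIngestResult = {
    "work_uuid": "00000000-0000-0000-0000-000000000000",
    "chunks_ingested": 9,
    "summaries_ingested": 2,
}
```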
Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization
Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)
Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint
Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-09 14:13:51 +01:00
17dfe213ed
feat: Migrate Weaviate ingestion to Python GPU embedder (30-70x faster)
...
Note: No breaking changes - zero-data-loss migration
Core Changes:
- Added manual GPU vectorization in weaviate_ingest.py (~100 lines)
- New vectorize_chunks_batch() function using BAAI/bge-m3 on RTX 4070
- Modified ingest_document() and ingest_summaries() for GPU vectors
- Updated docker-compose.yml with healthchecks
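The batched vectorization can be sketched as below (a sketch only: in the real vectorize_chunks_batch() the encoder is the BAAI/bge-m3 model running on the RTX 4070; here a stand-in encoder is injected so the batching loop is runnable):

```python
from typing import Callable, Sequence

def vectorize_chunks_batch(
    texts: Sequence[str],
    encode: Callable[[Sequence[str]], list],
    batch_size: int = 32,
) -> list:
    """Encode chunk texts in batches; in the real pipeline `encode`
    wraps the BAAI/bge-m3 GPU model (1024-dim vectors)."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(encode(texts[i : i + batch_size]))
    return vectors

# Stand-in encoder so the sketch runs without a GPU.
fake_encode = lambda batch: [[0.0] * 1024 for _ in batch]
vecs = vectorize_chunks_batch(["chunk one", "chunk two", "chunk three"],
                              fake_encode, batch_size=2)
```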
Performance:
- Ingestion: 500-1000ms/chunk → 15ms/chunk (30-70x faster)
- VRAM usage: 2.6 GB peak (well under 8 GB available)
- No degradation on search/chat (already using GPU embedder)
Data Safety:
- All 5355 existing chunks preserved (100% compatible vectors)
- Same model (BAAI/bge-m3), same dimensions (1024)
- Docker text2vec-transformers optional (can be removed later)
Tests (All Passed):
✅ Ingestion: 9 chunks in 1.2s
✅ Search: 16 results, GPU embedder confirmed
✅ Chat: 11 chunks across 5 sections, hierarchical search OK
Architecture:
Before: Hybrid (Docker CPU for ingestion, Python GPU for queries)
After: Unified (Python GPU for everything)
Files Modified:
- generations/library_rag/utils/weaviate_ingest.py (GPU vectorization)
- generations/library_rag/.claude/CLAUDE.md (documentation)
- generations/library_rag/docker-compose.yml (healthchecks)
Documentation:
- MIGRATION_GPU_EMBEDDER_SUCCESS.md (detailed report)
- TEST_FINAL_GPU_EMBEDDER.md (ingestion + search tests)
- TEST_CHAT_GPU_EMBEDDER.md (chat test)
- TESTS_COMPLETS_GPU_EMBEDDER.md (complete summary)
- BUG_REPORT_WEAVIATE_CONNECTION.md (initial bug analysis)
- DIAGNOSTIC_ARCHITECTURE_EMBEDDINGS.md (technical analysis)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-09 11:44:10 +01:00
0c3b6c5fea
feat: Auto-create Work entries during document ingestion
...
Adds automatic Work object creation to ensure all uploaded documents
appear on the /documents page. Previously, chunks were ingested but
Work entries were missing, causing documents to be invisible in the UI.
Changes:
- Add create_or_get_work() function to weaviate_ingest.py
- Checks for existing Work by sourceId (prevents duplicates)
- Creates new Work with metadata (title, author, year, pages)
- Returns UUID for potential future reference
- Integrate Work creation into ingest_document() flow
- Add helper scripts for retroactive fixes and verification:
- create_missing_works.py: Create Works for already-ingested documents
- reingest_batch_documents.py: Re-ingest documents after bug fixes
- check_batch_results.py: Verify batch upload results in Weaviate
This completes the batch upload feature - documents now properly appear
on /documents page immediately after ingestion.
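The duplicate-safe lookup can be sketched as follows (a dict stands in for the Weaviate Work collection; the real create_or_get_work() filters by sourceId through the client):

```python
import uuid

def create_or_get_work(store: dict, source_id: str, metadata: dict) -> str:
    """Return the UUID of the Work with this sourceId, creating it if absent.
    `store` is a dict standing in for the Weaviate Work collection."""
    for work_uuid, props in store.items():
        if props.get("sourceId") == source_id:
            return work_uuid  # existing Work found: no duplicate created
    new_uuid = str(uuid.uuid4())
    store[new_uuid] = {"sourceId": source_id, **metadata}
    return new_uuid

store: dict = {}
u1 = create_or_get_work(store, "doc-001", {"title": "T", "author": "A"})
u2 = create_or_get_work(store, "doc-001", {"title": "T", "author": "A"})
# u1 == u2: the second call returns the existing Work's UUID
```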
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 23:34:06 +01:00
b8d94576de
fix: Correct Weaviate ingestion for Chunk_v2 schema compatibility
...
Fixes batch upload ingestion that was failing silently due to schema mismatches:
Schema Fixes:
- Update collection names from "Chunk" to "Chunk_v2"
- Update collection names from "Summary" to "Summary_v2"
Object Structure Fixes:
- Replace nested objects (work: {title, author}) with flat fields
- Use workTitle and workAuthor instead of nested work object
- Add year field to chunks
- Remove document nested object (not used in current schema)
- Disable nested objects validation for flat schema
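The flat object shape can be sketched like this (workTitle, workAuthor, and year are the field names from this fix; the content property name is illustrative):

```python
def chunk_properties(text: str, work: dict, year: int) -> dict:
    """Build flat Chunk_v2 properties: workTitle/workAuthor replace the
    former nested `work` object."""
    return {
        "content": text,  # property name illustrative
        "workTitle": work["title"],
        "workAuthor": work["author"],
        "year": year,
    }

props = chunk_properties("Sample text",
                         {"title": "Essais", "author": "Montaigne"}, 1580)
```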
Impact:
- Batch upload now successfully ingests chunks to Weaviate
- Single-file upload also benefits from fixes
- All new documents will be properly indexed and searchable
Testing:
- Verified with 2-file batch upload (7 + 11 chunks = 18 total)
- Total chunks increased from 5,304 to 5,322
- All chunks properly searchable with workTitle/workAuthor filters
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 23:25:36 +01:00
04ee3f9e39
feat: Add data quality verification & cleanup scripts
...
## Data Quality & Cleanup (Priorities 1-6)
Added comprehensive data quality verification and cleanup system:
**Scripts added**:
- verify_data_quality.py: Full work-by-work quality analysis
- clean_duplicate_documents.py: Removes duplicate Documents
- populate_work_collection.py/clean.py: Populates the Work collection
- fix_chunks_count.py: Fixes inconsistent chunksCount values
- manage_orphan_chunks.py: Manages orphan chunks (3 options)
- clean_orphan_works.py: Deletes Works without chunks
- add_missing_work.py: Creates a missing Work
- generate_schema_stats.py: Auto-generates schema stats
- migrate_add_work_collection.py: Safe Work collection migration
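The orphan-chunk detection in manage_orphan_chunks.py could look like this (a sketch under assumed flat fields; the actual matching key and signature may differ):

```python
def find_orphan_chunks(chunks: list, works: list) -> list:
    """Return chunks whose workTitle matches no existing Work."""
    titles = {w["title"] for w in works}
    return [c for c in chunks if c.get("workTitle") not in titles]

orphans = find_orphan_chunks(
    [{"workTitle": "Essais"}, {"workTitle": "Unknown"}],
    [{"title": "Essais"}],
)
```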
**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: Complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: Quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: Session cleanup report
- ANALYSE_QUALITE_DONNEES.md: Initial data quality analysis
- rapport_qualite_donnees.txt: Raw verification output
**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated + cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: Fixed (declared 231 → 5,230 = actual)
- Full consistency: 9 Works = 9 Documents = 9 source works
**Code changes**:
- schema.py: Add Work collection with vectorization
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: Disable concepts (issue with .lower())
- utils/word_toc_extractor.py: Correct Word metadata
- .gitignore: Exclude temporary files (*.wav, output/*, NUL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-01 11:57:26 +01:00
d2f7165120
Add Library RAG project and cleanup root directory
...
- Add complete Library RAG application (Flask + MCP server)
- PDF processing pipeline with OCR and LLM extraction
- Weaviate vector database integration (BGE-M3 embeddings)
- Flask web interface with search and document management
- MCP server for Claude Desktop integration
- Comprehensive test suite (134 tests)
- Clean up root directory
- Remove obsolete documentation files
- Remove backup and temporary files
- Update autonomous agent configuration
- Update prompts
- Enhance initializer bis prompt with better instructions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-30 11:57:12 +01:00