Commit Graph

3 Commits

Author SHA1 Message Date
53f6a92365 feat: Remove Document collection from schema
BREAKING CHANGE: Document collection removed from Weaviate schema

Architecture simplification:
- Removed Document collection (unused by Flask app)
- All metadata now in Work collection or file-based (chunks.json)
- Simplified from 4 collections to 3 (Work, Chunk_v2, Summary_v2)

Schema changes (schema.py):
- Removed create_document_collection() function
- Updated verify_schema() to expect 3 collections
- Updated display_schema() and print_summary()
- Updated documentation to reflect Chunk_v2/Summary_v2
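
Illustration of the 3-collection check described above, as a minimal sketch. The helper name check_collections and its signature are hypothetical; the real verify_schema() in schema.py may be structured differently. Only the three collection names come from this commit.

```python
# Hypothetical sketch: not the project's actual verify_schema().
# Only the three collection names are taken from the commit message.
EXPECTED_COLLECTIONS = frozenset({"Work", "Chunk_v2", "Summary_v2"})


def check_collections(existing):
    """Compare existing collection names against the expected set.

    Returns a (missing, unexpected) pair of sorted name lists.
    """
    existing = set(existing)
    missing = sorted(EXPECTED_COLLECTIONS - existing)
    unexpected = sorted(existing - EXPECTED_COLLECTIONS)
    return missing, unexpected
```

A leftover Document collection would show up in the second list, e.g. check_collections(["Work", "Chunk_v2", "Summary_v2", "Document"]).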

Ingestion changes (weaviate_ingest.py):
- Removed ingest_document_metadata() function
- Removed ingest_document_collection parameter
- Updated IngestResult to use work_uuid instead of document_uuid
- Removed Document deletion from delete_document_chunks()
- Updated DeleteResult TypedDict

Type changes (types.py):
- WeaviateIngestResult: document_uuid → work_uuid
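
Sketch of what the renamed result type could look like for callers. Only the type name and the work_uuid key come from this commit; the other fields and the get_work_uuid helper are assumptions added for illustration.

```python
from typing import TypedDict


class WeaviateIngestResult(TypedDict, total=False):
    # work_uuid replaces the former document_uuid key (from the commit).
    work_uuid: str
    # The fields below are hypothetical, included only to make the sketch concrete.
    success: bool
    chunks_ingested: int


def get_work_uuid(result: dict) -> str:
    """Read work_uuid, failing loudly if a caller still passes the old key."""
    if "document_uuid" in result:
        raise KeyError("document_uuid was removed; read work_uuid instead")
    return result["work_uuid"]
```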

Documentation updates (.claude/CLAUDE.md):
- Updated schema diagram (4 → 3 collections)
- Removed Document references
- Updated to reflect manual GPU vectorization

Database changes:
- Deleted Document collection (13 objects)
- Deleted Chunk collection (0 objects, old schema)

Benefits:
- Simpler architecture (3 collections vs 4)
- No redundant data storage
- All metadata available via Work or file-based storage
- Reduced Weaviate memory footprint

Migration:
- See DOCUMENT_COLLECTION_ANALYSIS.md for detailed analysis
- See migrate_chunk_v2_to_none_vectorizer.py for vectorizer migration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-09 14:13:51 +01:00
17dfe213ed feat: Migrate Weaviate ingestion to Python GPU embedder (30-70x faster)
No breaking changes: zero data loss migration

Core Changes:
- Added manual GPU vectorization in weaviate_ingest.py (~100 lines)
- New vectorize_chunks_batch() function using BAAI/bge-m3 on RTX 4070
- Modified ingest_document() and ingest_summaries() for GPU vectors
- Updated docker-compose.yml with healthchecks
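
The batching idea behind manual vectorization can be sketched as below. This is not the project's vectorize_chunks_batch(): the function name vectorize_in_batches, its signature, and the default batch size are assumptions. The encoder is passed in as a callable so the sketch runs without a GPU or a model download.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def vectorize_in_batches(
    texts: Sequence[str],
    encode: Callable[[List[str]], List[Vector]],
    batch_size: int = 32,  # assumed default, not taken from the commit
) -> List[Vector]:
    """Encode texts in fixed-size batches and return one vector per text, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = encode(batch)
        # Guard against an encoder that drops or duplicates rows.
        if len(batch_vectors) != len(batch):
            raise ValueError("encoder returned a different number of vectors than texts")
        vectors.extend(batch_vectors)
    return vectors
```

In a real setup the encode callable would wrap whatever embedding library loads BAAI/bge-m3; which library the project uses is not stated here.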

Performance:
- Ingestion: 500-1000ms/chunk → 15ms/chunk (30-70x faster)
- VRAM usage: 2.6 GB peak (well under 8 GB available)
- No degradation on search/chat (already using GPU embedder)
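
The 30-70x figure is a rounding of the per-chunk timings above, as this small calculation shows. The 5355-chunk total is the existing chunk count reported in this commit; the re-ingestion estimate is illustrative arithmetic, not a measured result.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of old to new per-chunk latency."""
    return before_ms / after_ms


low = speedup(500, 15)    # about 33x
high = speedup(1000, 15)  # about 67x

# Estimated time to ingest 5355 chunks at each per-chunk rate, in seconds.
chunks = 5355
gpu_seconds = chunks * 15 / 1000
cpu_seconds_best = chunks * 500 / 1000
cpu_seconds_worst = chunks * 1000 / 1000
```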

Data Safety:
- All 5355 existing chunks preserved (100% compatible vectors)
- Same model (BAAI/bge-m3), same dimensions (1024)
- Docker text2vec-transformers optional (can be removed later)
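
Vector compatibility depends on the new embeddings having the same dimensionality as the stored ones. A minimal sketch of such a check follows; the helper name find_bad_dimensions is hypothetical, and only the 1024 dimension comes from this commit.

```python
from typing import List, Sequence

EXPECTED_DIM = 1024  # BAAI/bge-m3 dimension stated in the commit


def find_bad_dimensions(
    vectors: Sequence[Sequence[float]],
    expected_dim: int = EXPECTED_DIM,
) -> List[int]:
    """Return the indices of vectors whose length differs from expected_dim."""
    return [i for i, v in enumerate(vectors) if len(v) != expected_dim]
```

An empty result means every vector matches the stored dimension; it does not by itself prove the vectors come from the same model.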

Tests (All Passed):
- Ingestion: 9 chunks in 1.2s
- Search: 16 results, GPU embedder confirmed
- Chat: 11 chunks across 5 sections, hierarchical search OK

Architecture:
Before: Hybrid (Docker CPU for ingestion, Python GPU for queries)
After:  Unified (Python GPU for everything)

Files Modified:
- generations/library_rag/utils/weaviate_ingest.py (GPU vectorization)
- generations/library_rag/.claude/CLAUDE.md (documentation)
- generations/library_rag/docker-compose.yml (healthchecks)

Documentation:
- MIGRATION_GPU_EMBEDDER_SUCCESS.md (detailed report)
- TEST_FINAL_GPU_EMBEDDER.md (ingestion + search tests)
- TEST_CHAT_GPU_EMBEDDER.md (chat test)
- TESTS_COMPLETS_GPU_EMBEDDER.md (complete summary)
- BUG_REPORT_WEAVIATE_CONNECTION.md (initial bug analysis)
- DIAGNOSTIC_ARCHITECTURE_EMBEDDINGS.md (technical analysis)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-09 11:44:10 +01:00
d2f7165120 Add Library RAG project and cleanup root directory
- Add complete Library RAG application (Flask + MCP server)
  - PDF processing pipeline with OCR and LLM extraction
  - Weaviate vector database integration (BGE-M3 embeddings)
  - Flask web interface with search and document management
  - MCP server for Claude Desktop integration
  - Comprehensive test suite (134 tests)

- Clean up root directory
  - Remove obsolete documentation files
  - Remove backup and temporary files
  - Update autonomous agent configuration

- Update prompts
  - Enhance initializer bis prompt with better instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-30 11:57:12 +01:00