linear-coding-agent/CHANGELOG.md
David Blanc Brioir 187ba4854e chore: Major cleanup - archive migration scripts and remove temp files
CLEANUP ACTIONS:
- Archived 11 migration/optimization scripts to archive/migration_scripts/
- Archived 11 phase documentation files to archive/documentation/
- Moved backups/, docs/, scripts/ to archive/
- Deleted 30+ temporary debug/test/fix scripts
- Cleaned Python cache (__pycache__/, *.pyc)
- Cleaned log files (*.log)

NEW FILES:
- CHANGELOG.md: Consolidated project history and migration documentation
- Updated .gitignore: Added *.log, *.pyc, archive/ exclusions

FINAL ROOT STRUCTURE (19 items):
- Core framework: agent.py, autonomous_agent_demo.py, client.py, security.py, progress.py, prompts.py
- Config: requirements.txt, package.json, .gitignore
- Docs: README.md, CHANGELOG.md, project_progress.md
- Directories: archive/, generations/, memory/, prompts/, utils/

ARCHIVED SCRIPTS (in archive/migration_scripts/):
01-11: Migration & optimization scripts (migrate, schema, rechunk, vectorize, etc.)

ARCHIVED DOCS (in archive/documentation/):
PHASE_0-8: Detailed phase summaries
MIGRATION_README.md, PLAN_MIGRATION_WEAVIATE_GPU.md

Repository is now clean and production-ready with all important files preserved in archive/.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-08 18:05:43 +01:00


Changelog - Library RAG Project

2026-01-08 - Chunking Optimization & Vectorization

Chunking Improvements

  • Strict chunk size limits: Max 1000 words (down from 1500-2000)
  • Overlap implementation: 100-word overlap between consecutive chunks
  • Triple fallback system: Ensures robust chunking even on LLM failures
  • New module: llm_chunker_improved.py with overlap functionality
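
The size limit and overlap described above can be illustrated with a plain word-window splitter. This is a minimal sketch, not the contents of llm_chunker_improved.py: the function name `chunk_with_overlap` and its parameters are hypothetical, and the real module is LLM-driven, with something like this plausibly serving only as a last-resort fallback.

```python
from typing import List


def chunk_with_overlap(text: str, max_words: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into windows of at most `max_words` words.

    Each window after the first starts `overlap` words before the end of
    the previous one, so consecutive chunks share that many words.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")

    words = text.split()
    if not words:
        return []

    step = max_words - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # this window reached the end
            break
        start += step
    return chunks
```

With the defaults, a 2,000-word text yields three chunks (words 0-999, 900-1899, 1800-1999), each sharing 100 words with its neighbour.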

Re-chunking Results

  • Identified 31 oversized chunks (>2000 tokens, max 7,158)
  • Split into 92 optimally-sized chunks
  • Result: 0 chunks > 2000 tokens (100% within BGE-M3 limits)
  • Preserved all metadata during split (workTitle, workAuthor, sectionPath, orderIndex)
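
Preserving metadata during a split amounts to copying the parent chunk's fields onto every piece. The sketch below assumes chunks are plain dictionaries and that the text lives under a `text` key; both are assumptions, as is the helper name. How 09_rechunk_oversized.py actually handles orderIndex among sibling pieces is not documented here, so the value is simply copied.

```python
from typing import Dict, List

# Fields the changelog lists as preserved during the split.
METADATA_KEYS = ("workTitle", "workAuthor", "sectionPath", "orderIndex")


def split_preserving_metadata(chunk: Dict, pieces: List[str]) -> List[Dict]:
    """Build one new chunk per text piece, copying the parent's metadata.

    `pieces` is the already-split text of the oversized chunk.
    The "text" key is an assumed property name.
    """
    metadata = {key: chunk[key] for key in METADATA_KEYS if key in chunk}
    return [{**metadata, "text": piece} for piece in pieces]
```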

Vectorization

  • Created a manual vectorization system for Chunk_v2, since the collection has no vectorizer configured
  • Successfully vectorized 92 new chunks via text2vec-transformers API
  • Result: 5,304/5,304 chunks with vectors (100% coverage)
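
Manual vectorization means asking the text2vec-transformers inference container for an embedding directly. The sketch below assumes the container is reachable on the host port from the Docker configuration (8090) and answers `POST /vectors` with a JSON body containing a `vector` field; the function names and the 600-second timeout (chosen to mirror WORKER_TIMEOUT) are illustrative, not taken from 11_vectorize_missing_chunks.py.

```python
import json
import urllib.request
from typing import List


def build_vector_request(text: str, base_url: str = "http://localhost:8090") -> urllib.request.Request:
    """Prepare the POST request for the inference container's /vectors endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/vectors",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, base_url: str = "http://localhost:8090", timeout: float = 600.0) -> List[float]:
    """Send the request and return the embedding (assumes a "vector" key in the reply)."""
    request = build_vector_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["vector"]
```

The returned vector would then be written to the corresponding Chunk_v2 object; that step is omitted here.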

Docker Configuration

  • Exposed text2vec-transformers port (8090:8080) for external vectorization
  • Added cluster configuration to fix "No private IP address found" error
  • Increased WORKER_TIMEOUT to 600s for very large chunks

Search Quality

  • Created comprehensive test suite (10_test_search_quality.py)
  • Tests: distribution, overlap detection, semantic search (4 queries)
  • Search now uses near_vector() with manual query vectorization
  • Issue identified: Collected papers dominates results (95.8% of chunks)
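
Search with manual query vectorization follows a two-step pattern: embed the query client-side, then pass the vector to `near_vector()`. The sketch below is written against a Weaviate v4-style collection object and takes the embedding function as a parameter; `search_chunks` and its arguments are hypothetical names, not the project's actual route code.

```python
from typing import Any, Callable, Dict, List, Sequence


def search_chunks(
    collection: Any,
    embed: Callable[[str], Sequence[float]],
    query: str,
    limit: int = 5,
) -> List[Dict]:
    """Embed the query ourselves, then run a near_vector search.

    `collection` is expected to behave like a Weaviate v4 collection
    (for example the Chunk_v2 collection); `embed` is whatever function
    produces the query embedding.
    """
    query_vector = list(embed(query))
    response = collection.query.near_vector(near_vector=query_vector, limit=limit)
    return [obj.properties for obj in response.objects]
```

Passing the embedder in as an argument keeps the function independent of how embeddings are produced (HTTP API or local GPU model).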

Database Stats (Post-Optimization)

  • Total chunks: 5,304
  • Average size: 289 tokens (optimal for BGE-M3)
  • Distribution: 84.6% < 500 tokens, 11.5% 500-1000, 3.0% 1000-1500, remaining ~0.9% 1500-2000
  • Works: 8 (Collected papers: 5,080 chunks, Mind Design III: 61, Platon Ménon: 56, etc.)
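
The per-work shares quoted in this changelog follow directly from the chunk counts above. A small sketch of the arithmetic, using only the three works whose counts are listed (the helper name is illustrative):

```python
from typing import Dict


def work_shares(chunk_counts: Dict[str, int], total: int) -> Dict[str, float]:
    """Return each work's share of the corpus as a percentage, rounded to one decimal."""
    return {work: round(100 * count / total, 1) for work, count in chunk_counts.items()}


counts = {"Collected papers": 5080, "Mind Design III": 61, "Platon Ménon": 56}
shares = work_shares(counts, total=5304)
print(shares)
# {'Collected papers': 95.8, 'Mind Design III': 1.2, 'Platon Ménon': 1.1}
```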

2026-01 - Weaviate v2 Migration & GPU Integration

Phase 1-3: Schema Migration (Complete)

  • Migrated from Chunk/Summary/Document to Chunk_v2/Summary_v2/Work
  • Removed nested document object, added direct properties (workTitle, workAuthor, year, language)
  • Work collection with sourceId for documents
  • Fixed 114 summaries missing properties
  • Deleted vL-jepa chunks (17), fixed null workTitles

Phase 4: Memory System (Complete)

  • Added Thought/Message/Conversation collections to Weaviate
  • 9 MCP tools for memory management (add_thought, search_thoughts, etc.)
  • GPU embeddings integration (BAAI/bge-m3, RTX 4070)
  • Data: 102 Thoughts, 377 Messages, 12 Conversations

Phase 5: Backend Integration (Complete)

  • Integrated GPU embedder into Flask app (singleton pattern)
  • All search routes now use manual vectorization with near_vector()
  • Updated all routes: simple_search, hierarchical_search, summary_only_search, rag_search
  • Fixed Work → Chunk/Summary property mapping (v2 schema)
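
The singleton pattern mentioned above ensures the embedding model is loaded onto the GPU once per process rather than once per request. A minimal, thread-safe sketch follows; the class name and the `factory` argument are hypothetical, and in the real app the factory would be whatever loads BAAI/bge-m3.

```python
import threading
from typing import Callable, Optional


class EmbedderSingleton:
    """Process-wide holder so the (expensive) model is created only once."""

    _instance: Optional[object] = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, factory: Callable[[], object]) -> object:
        """Return the shared embedder, building it with `factory` on first use."""
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock: another thread may have built it
                # while we were waiting.
                if cls._instance is None:
                    cls._instance = factory()
        return cls._instance
```

Every search route can then call `EmbedderSingleton.get(load_model)` and receive the same object, where `load_model` stands in for the project's actual loader.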

Phase 6-7: Testing & Optimization

  • Comprehensive testing of search routes
  • MCP tools validation
  • Performance optimization with GPU embeddings
  • Documentation updates (README.md, CLAUDE.md)

Phase 8: Documentation Cleanup

  • Consolidated all phase documentation
  • Updated README with Memory MCP tools section
  • Cleaned up temporary files and scripts

Archive Structure

archive/
├── migration_scripts/        # Migration & optimization scripts (01-11)
│   ├── 01_migrate_document_to_work.py
│   ├── 02_create_schema_v2.py
│   ├── 03_migrate_chunks_v2.py
│   ├── 04_migrate_summaries_v2.py
│   ├── 05_validate_migration.py
│   ├── 07_cleanup.py
│   ├── 08_fix_summaries_properties.py
│   ├── 09_rechunk_oversized.py
│   ├── 10_test_search_quality.py
│   ├── 11_vectorize_missing_chunks.py
│   └── old_scripts/          # ChromaDB migration scripts
├── migration_docs/           # Detailed migration documentation
│   ├── PLAN_MIGRATION_V2_SANS_DOCUMENT.md
│   ├── PHASE5_BACKEND_INTEGRATION.md
│   └── WEAVIATE_RETRIEVAL_ARCHITECTURE.md
├── documentation/            # Phase summaries
│   ├── PHASE_0_PYTORCH_CUDA.md
│   ├── PHASE_2_MIGRATION_SUMMARY.md
│   ├── PHASE_3_CONVERSATIONS_SUMMARY.md
│   ├── PHASE_4_MIGRATION_CHROMADB.md
│   ├── PHASE_5_MCP_TOOLS.md
│   ├── PHASE_6_TESTS_OPTIMISATION.md
│   ├── PHASE_7_INTEGRATION_BACKEND.md
│   ├── PHASE_8_DOCUMENTATION_CLEANUP.md
│   └── MIGRATION_README.md
└── backups/                  # Pre-migration data backups
    └── pre_migration_20260108_152033/

Technology Stack

  • Vector Database: Weaviate 1.34.4 with BAAI/bge-m3 embeddings (1024-dim)
  • Embedder: PyTorch 2.6.0+cu124, GPU RTX 4070
  • Backend: Flask 3.0 with Server-Sent Events
  • MCP Integration: 9 memory tools + 6 RAG tools for Claude Desktop
  • OCR: Mistral OCR API
  • LLM: Ollama (local) or Mistral API

Known Issues

  1. Chunk_v2 has no vectorizer: All new chunks require manual vectorization via 11_vectorize_missing_chunks.py
  2. Data imbalance: Collected papers represents 95.8% of chunks, dominating search results
  3. Mind Design III underrepresented: Only 61 chunks (1.2%) vs 5,080 for Collected papers

Recommendations

  1. Add more diverse works to balance corpus
  2. Consider re-ranking with per-work boosting for diversity
  3. Recreate Chunk_v2 with text2vec-transformers vectorizer for auto-vectorization (requires full data reload)
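
One way to approach the re-ranking recommendation is a greedy pass that discounts a hit each time its work has already been selected. This is only a sketch of the idea under stated assumptions: hits are dictionaries with a `workTitle` and a positive, higher-is-better `score` (a hypothetical field, for example a similarity derived from the vector distance), and the `decay` factor is an arbitrary illustrative value.

```python
from collections import Counter
from typing import Dict, List


def rerank_for_diversity(hits: List[Dict], limit: int = 5, decay: float = 0.8) -> List[Dict]:
    """Greedily pick hits, multiplying a hit's score by `decay` once for
    every already-picked hit from the same work."""
    remaining = list(hits)
    picked: List[Dict] = []
    seen: Counter = Counter()  # how many picked hits each work has so far

    while remaining and len(picked) < limit:
        best = max(remaining, key=lambda h: h["score"] * decay ** seen[h["workTitle"]])
        remaining.remove(best)
        picked.append(best)
        seen[best["workTitle"]] += 1
    return picked
```

With `decay=1.0` the function reduces to plain top-k by score; lower values push results from underrepresented works such as Mind Design III further up the list.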

For detailed implementation notes, see .claude/CLAUDE.md and archive/ directories.