CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Overview
Library RAG is a production-grade RAG system specialized in indexing and semantic search of philosophical and academic texts. It provides a complete pipeline from PDF upload through OCR and intelligent LLM-based extraction to vectorized search in Weaviate.
Core Architecture:
- Vector Database: Weaviate 1.34.4 with text2vec-transformers (BAAI/bge-m3, 1024-dim)
- OCR: Mistral OCR API (~0.003€/page)
- LLM: Ollama (local, free) or Mistral API (fast, paid)
- Web Interface: Flask 3.0 with Server-Sent Events for real-time progress
- Infrastructure: Docker Compose (Weaviate + transformers with GPU support)
Migration Note (Dec 2024): Migrated from MiniLM-L6 (384-dim) to BGE-M3 (1024-dim) for superior multilingual support (Greek, Latin, French, English) and 8192 token context window.
Common Commands
Development Setup
# Windows
init.bat
# Linux/macOS
./init.sh
# Manual setup
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
docker compose up -d
python schema.py
Running the Application
# Start Weaviate (must be running first)
docker compose up -d
# Create schema (first time only)
python schema.py
# Start Flask web interface
python flask_app.py
# Access at http://localhost:5000
# Process a PDF programmatically
python -c "from utils.pdf_pipeline import process_pdf; from pathlib import Path; process_pdf(Path('input/document.pdf'))"
Type Checking and Testing
# Run mypy strict type checking
mypy .
mypy utils/pdf_pipeline.py # Check specific module
# Run all tests
pytest
# Run specific test file
pytest tests/utils/test_ocr_schemas.py -v
# Run with coverage
pytest --cov=utils --cov-report=html
Docker Operations
# Start services
docker compose up -d
# Check Weaviate status
curl http://localhost:8080/v1/.well-known/ready
# View logs
docker compose logs weaviate
docker compose logs text2vec-transformers
# Stop services
docker compose down
# Remove volumes (WARNING: deletes all data)
docker compose down -v
MCP Server (Claude Desktop Integration)
# Run MCP server
python mcp_server.py
# Test MCP tools
python -c "from mcp_tools.parse_pdf import parse_pdf_tool; parse_pdf_tool({'pdf_path': 'input/test.pdf'})"
High-Level Architecture
PDF Processing Pipeline (10 Steps)
The core of the application is utils/pdf_pipeline.py, which orchestrates a 10-step intelligent pipeline:
[1] OCR (ocr_processor.py)
↓ Extract text + images via Mistral OCR (~0.003€/page)
[2] Markdown (markdown_builder.py)
↓ Build structured markdown from OCR
[3] Images (image_extractor.py)
↓ Save images to output/images/
[4] Metadata (llm_metadata.py)
↓ LLM extracts title, author, year, language
[5] TOC (llm_toc.py)
↓ LLM extracts hierarchical table of contents
[6] Classify (llm_classifier.py)
↓ Classify sections (main_content, preface, bibliography, etc.)
[7] Chunking (llm_chunker.py)
↓ LLM semantic chunking into argumentative units
[8] Cleaning (llm_cleaner.py)
↓ Remove OCR artifacts, validate chunk length
[9] Validation (llm_validator.py)
↓ LLM validation + concept/keyword extraction
[10] Ingestion (weaviate_ingest.py)
↓ Batch insert + auto-vectorization in Weaviate
Key Parameters:
- `skip_ocr=True` - Reuse existing markdown (avoid OCR cost)
- `use_llm=True` - Enable LLM processing steps
- `llm_provider="ollama" | "mistral"` - Choose LLM provider
- `use_semantic_chunking=True` - LLM-based chunking (slower, higher quality)
- `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC)
- `ingest_to_weaviate=True` - Insert chunks into Weaviate
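A typical call combining these options (a minimal sketch; the actual defaults and return keys are defined in `utils/pdf_pipeline.py` and shown in the docstring example later in this file):
from pathlib import Path

from utils.pdf_pipeline import process_pdf

# Re-run the LLM steps on an already-OCRed document and ingest the result
result = process_pdf(
    Path("input/document.pdf"),
    skip_ocr=True,               # reuse existing markdown, no OCR cost
    use_llm=True,
    llm_provider="ollama",       # local and free; "mistral" for the paid API
    use_semantic_chunking=True,  # slower, higher-quality chunks
    ingest_to_weaviate=True,
)
print(result["chunks_count"], result["cost_total"])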
Weaviate Schema (4 Collections)
Defined in schema.py, the database uses four normalized collections, with parent metadata denormalized into nested objects on the children instead of cross-references:
Work (no vectorizer)
title, author, year, language, genre
├─► Document (no vectorizer)
│ sourceId, edition, pages, toc, hierarchy
│ work: {title, author} (nested)
│
│ ├─► Chunk (text2vec-transformers) ⭐ PRIMARY
│ │ text (VECTORIZED)
│ │ keywords (VECTORIZED)
│ │ sectionPath, chapterTitle, unitType, orderIndex
│ │ work: {title, author} (nested)
│ │ document: {sourceId, edition} (nested)
│ │
│ └─► Summary (text2vec-transformers)
│ text (VECTORIZED)
│ concepts (VECTORIZED)
│ sectionPath, title, level, chunksCount
│ document: {sourceId} (nested)
Vectorization Strategy:
- Only `Chunk.text`, `Chunk.keywords`, `Summary.text`, `Summary.concepts` are vectorized
- Metadata fields use `skip_vectorization=True` for filtering performance
- Nested objects avoid joins for efficient single-query retrieval
- BAAI/bge-m3 model: 1024 dimensions, 8192 token context
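A hedged sketch of a semantic search against the `Chunk` collection with the v4 Python client (collection and property names come from the schema above; the query text is illustrative, and the project's own query code lives in `flask_app.py` and `mcp_tools/search.py`):
import weaviate

client = weaviate.connect_to_local()
try:
    chunks = client.collections.get("Chunk")
    # near_text vectorizes the query through the text2vec-transformers (BGE-M3) module
    response = chunks.query.near_text(
        query="memory and interiority in Augustine",
        limit=5,
        return_properties=["text", "sectionPath", "chapterTitle"],
    )
    for obj in response.objects:
        print(obj.properties["sectionPath"], "-", obj.properties["text"][:80])
finally:
    client.close()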
Module Organization
library_rag/
├── flask_app.py # Flask web app (38 KB) - routes, SSE, job queue
├── schema.py # Weaviate schema definition + management
├── docker-compose.yml # Weaviate + text2vec-transformers config
│
├── utils/ # Pipeline modules (all strictly typed)
│ ├── types.py # Central TypedDict definitions (31 KB)
│ ├── pdf_pipeline.py # Main orchestration (64 KB)
│ │
│ ├── mistral_client.py # Mistral OCR API client
│ ├── ocr_processor.py # OCR processing logic
│ ├── ocr_schemas.py # OCR response types
│ │
│ ├── llm_structurer.py # LLM infrastructure (Ollama/Mistral)
│ ├── llm_metadata.py # Step 4: Metadata extraction
│ ├── llm_toc.py # Step 5: TOC extraction
│ ├── llm_classifier.py # Step 6: Section classification
│ ├── llm_chunker.py # Step 7: Semantic chunking
│ ├── llm_cleaner.py # Step 8: Chunk cleaning
│ ├── llm_validator.py # Step 9: Validation + concepts
│ │
│ ├── markdown_builder.py # Step 2: Markdown construction
│ ├── image_extractor.py # Step 3: Image extraction
│ ├── hierarchy_parser.py # Hierarchical TOC parsing
│ ├── weaviate_ingest.py # Step 10: Database ingestion
│ │
│ └── toc_extractor*.py # Alternative TOC strategies
│
├── mcp_server.py # MCP server for Claude Desktop
├── mcp_tools/ # MCP tool implementations
│ ├── parse_pdf.py
│ └── search.py
│
├── templates/ # Jinja2 templates
│ ├── upload.html # PDF upload form
│ ├── upload_progress.html # SSE progress display
│ ├── search.html # Semantic search interface
│ └── ...
│
└── tests/ # Unit tests
└── utils/
Type Safety System
Critical Rule: All code MUST pass mypy --strict. This is non-negotiable.
Type Definitions (utils/types.py):
- `Metadata` - Document metadata (title, author, year, language)
- `TOCEntry` - Hierarchical table of contents entries
- `ChunkData` - Processed chunk with metadata
- `PipelineResult` - Complete pipeline result dict
- `LLMProvider` - Literal["ollama", "mistral"]
- `SectionType` - 12 section classification types
- `UnitType` - 10 chunk unit types (argument, definition, etc.)
- `ProgressCallback` - Protocol for progress reporting
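The shapes below are purely illustrative (the definitions in `utils/types.py` are authoritative); they show the TypedDict/Protocol pattern the pipeline relies on:
from typing import Literal, Protocol, TypedDict

LLMProvider = Literal["ollama", "mistral"]

class Metadata(TypedDict):
    """Document metadata from step 4 (field types here are assumptions)."""
    title: str
    author: str
    year: int
    language: str

class ProgressCallback(Protocol):
    """Callable the pipeline uses to report step progress (signature is an assumption)."""
    def __call__(self, step: int, total: int, message: str) -> None: ...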
Configuration (mypy.ini):
- Strict mode enabled globally
- Google-style docstrings required
- Per-module overrides for gradual migration
- Third-party libraries (weaviate, mistralai) have ignore_missing_imports
When adding new code:
- Define types in `utils/types.py` first
- Add type annotations to all functions/methods
- Run `mypy .` before committing
- Write Google-style docstrings
Flask Application Routes
| Route | Method | Purpose |
|---|---|---|
| `/` | GET | Homepage with collection statistics |
| `/passages` | GET | Browse all chunks (paginated) |
| `/search` | GET | Semantic search interface |
| `/upload` | GET | PDF upload form |
| `/upload` | POST | Start PDF processing job |
| `/upload/progress/<job_id>` | GET | SSE stream for job progress |
| `/upload/status/<job_id>` | GET | JSON job status |
| `/upload/result/<job_id>` | GET | Processing results page |
| `/documents` | GET | List all processed documents |
| `/documents/<doc>/view` | GET | Document details view |
| `/documents/delete/<doc>` | POST | Delete document + chunks |
| `/output/<filepath>` | GET | Download processed files |
Server-Sent Events (SSE): The upload progress route streams real-time updates for each pipeline step using Flask's stream_with_context.
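A minimal, self-contained sketch of that SSE pattern (the `job_updates` generator here is a hypothetical stand-in for the real job queue in `flask_app.py`):
import time

from flask import Flask, Response, stream_with_context

app = Flask(__name__)  # flask_app.py defines its own app and job tracking

def job_updates(job_id: str):
    """Hypothetical stand-in for the real job tracker."""
    for step in range(1, 11):
        yield f'{{"job_id": "{job_id}", "step": {step}, "total": 10}}'
        time.sleep(0.1)

@app.route("/upload/progress/<job_id>")
def upload_progress(job_id: str) -> Response:
    def event_stream():
        for update in job_updates(job_id):
            yield f"data: {update}\n\n"  # one SSE frame per progress update
    return Response(stream_with_context(event_stream()), mimetype="text/event-stream")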
Cost Management
OCR Costs (Mistral API):
- Standard: ~0.001-0.003€/page
- With annotations: ~0.009€/page (3x, better TOC)
LLM Costs:
- Ollama (local): FREE (slower, requires GPU/powerful CPU)
- Mistral API: Variable (fast, production-ready)
Best Practices:
- Use `skip_ocr=True` when re-processing existing documents
- Use Ollama for development/testing
- Use Mistral API for production
- Check `<doc>_chunks.json` for cost tracking: `cost_ocr`, `cost_llm`, `cost_total`
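For example, totals across processed documents can be read back from those files (this file lists both `cost_total` and `total_cost`, so the sketch checks both):
import json
from pathlib import Path

total = 0.0
for chunks_file in Path("output").glob("*/*_chunks.json"):
    data = json.loads(chunks_file.read_text(encoding="utf-8"))
    cost = data.get("cost_total", data.get("total_cost", 0.0))
    print(f"{chunks_file.parent.name}: {cost:.3f}€")
    total += cost
print(f"Total: {total:.3f}€")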
Development Workflow
Adding a New Pipeline Step
- Create module in `utils/` (e.g., `llm_summarizer.py`)
- Define types in `utils/types.py`
- Add step to `pdf_pipeline.py` orchestration
- Update progress callback to report step
- Add tests in `tests/utils/test_<module>.py`
- Run `mypy .` to verify types
- Update this CLAUDE.md if user-facing
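A hypothetical skeleton for such a step (module name, function, and docstring are examples only, not existing code):
# utils/llm_summarizer.py -- hypothetical skeleton
from utils.types import ChunkData, LLMProvider

def summarize_sections(
    chunks: list[ChunkData],
    *,
    llm_provider: LLMProvider = "ollama",
) -> list[str]:
    """Summarize classified chunks section by section.

    Args:
        chunks: Chunks produced by the semantic chunking step.
        llm_provider: LLM provider ("ollama" for local, "mistral" for API).

    Returns:
        One summary string per section, in document order.
    """
    # Call the shared LLM infrastructure (utils/llm_structurer.py) here and
    # report progress through the pipeline's ProgressCallback.
    raise NotImplementedError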
Modifying Weaviate Schema
- IMPORTANT: Schema changes require data migration
- Edit `schema.py` collection definitions
- Test on empty Weaviate instance first
- Create migration script if data exists
- Update `utils/weaviate_ingest.py` ingestion logic
- Update `utils/types.py` if object shapes change
- Document migration in README.md
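When testing on an empty instance, the v4 client can confirm collections were dropped and recreated (a sketch; only run the delete against a disposable instance):
import weaviate

client = weaviate.connect_to_local()
try:
    # WARNING: deleting a collection drops its data
    if client.collections.exists("Chunk"):
        client.collections.delete("Chunk")
    # Recreate collections, e.g. by re-running: python schema.py
    print(list(client.collections.list_all().keys()))
finally:
    client.close()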
Adding New Flask Routes
- Add route handler in `flask_app.py`
- Create Jinja2 template in `templates/` if needed
- Update navigation in `templates/base.html`
- Add route to table in this CLAUDE.md
- Consider adding to MCP server if useful for Claude Desktop
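The handler pattern itself is plain Flask (illustrative only; real handlers live in `flask_app.py` and use its `app` object and templates):
from flask import Flask, render_template

app = Flask(__name__)

@app.route("/documents/recent")  # hypothetical route
def recent_documents() -> str:
    docs = []  # a real handler would query Weaviate here
    return render_template("documents.html", documents=docs)  # hypothetical template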
Common Debugging Scenarios
"Weaviate connection failed"
docker compose ps # Check if running
docker compose up -d # Start if needed
docker compose logs weaviate # Check errors
curl http://localhost:8080/v1/.well-known/ready # Test readiness
"OCR cost too high"
# Reuse existing markdown
result = process_pdf(
    Path("input/document.pdf"),
    skip_ocr=True,  # Avoids OCR cost
    use_llm=True,
)
"LLM timeout (Ollama)"
# Use lighter model
export STRUCTURE_LLM_MODEL=qwen2.5:7b # Instead of deepseek-r1:14b
# Or switch to Mistral API
result = process_pdf(..., llm_provider="mistral")
"Empty chunks after cleaning"
- Check `output/<doc>/<doc>_chunks.json`
- Look at `classified_sections` - may have classified main content as "ignore"
- Adjust classification prompts in `llm_classifier.py`
- Lower `min_chars` threshold in `llm_cleaner.py` if needed
"TOC extraction failed"
# Use OCR annotations (more reliable but 3x cost)
result = process_pdf(
    Path("input/document.pdf"),
    use_ocr_annotations=True,
)
Type errors from mypy
# Check specific module
mypy utils/pdf_pipeline.py
# Ignore specific errors (last resort)
# Add to mypy.ini:
# [mypy-module_name]
# ignore_errors = True
Output Files Structure
For each processed document <doc_name>:
output/<doc_name>/
├── <doc_name>.md # Structured markdown with hierarchy
├── <doc_name>_ocr.json # Raw OCR response from Mistral
├── <doc_name>_chunks.json # Processed chunks + metadata + costs
├── <doc_name>_weaviate.json # Ingestion results (UUIDs, counts)
└── images/ # Extracted images
├── page_001_image_0.png
└── ...
chunks.json structure:
{
  "metadata": {...},
  "classified_sections": [...],
  "chunks": [...],
  "cost_ocr": 0.12,
  "cost_llm": 0.03,
  "total_cost": 0.15,
  "pages": 40,
  "chunks_count": 127
}
Important Implementation Notes
LLM Provider Configuration
Set in .env file:
# Ollama (local)
STRUCTURE_LLM_MODEL=qwen2.5:7b
OLLAMA_BASE_URL=http://localhost:11434
# Mistral API
MISTRAL_API_KEY=your_key_here
STRUCTURE_LLM_MODEL=mistral-large-latest
# OCR (required)
MISTRAL_API_KEY=your_key_here
Weaviate Connection
Default: localhost:8080 (HTTP), localhost:50051 (gRPC)
Python client v4 uses connect_to_local():
import weaviate

client = weaviate.connect_to_local()  # call client.close() when done
Nested Objects vs Cross-References
We use nested objects instead of Weaviate cross-references because:
- Single-query retrieval (no joins needed)
- Denormalized for read performance
- Simplified query logic
- Trade-off: Small data duplication acceptable
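For illustration, a Chunk's properties at ingestion time (property names per the schema above; the values are made up) carry parent metadata inline rather than a reference:
chunk_properties = {
    "text": "Memory, for Augustine, is the vast court where the self meets itself...",
    "keywords": ["memory", "interiority", "Augustine"],
    "sectionPath": "Book X > Chapter 8",
    "chapterTitle": "The Fields and Palaces of Memory",
    "unitType": "argument",
    "orderIndex": 42,
    "work": {"title": "Confessions", "author": "Augustine of Hippo"},  # nested, denormalized
    "document": {"sourceId": "confessions-1991", "edition": "Oxford World's Classics"},  # nested, denormalized
}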
Google-Style Docstrings Required
def process_pdf(
    pdf_path: Path,
    *,
    use_llm: bool = True,
    llm_provider: LLMProvider = "ollama",
) -> PipelineResult:
    """Process a PDF through the complete RAG pipeline.

    Args:
        pdf_path: Path to the PDF file to process.
        use_llm: Enable LLM processing steps (metadata, TOC, chunking).
        llm_provider: LLM provider ("ollama" for local, "mistral" for API).

    Returns:
        Dictionary containing processing results with keys:
            - success: Whether processing succeeded
            - document_name: Name of the processed document
            - chunks_count: Number of chunks created
            - cost_ocr: OCR cost in euros
            - cost_llm: LLM cost in euros
            - cost_total: Total cost in euros

    Raises:
        FileNotFoundError: If PDF file does not exist.
        OCRError: If OCR processing fails.
        LLMStructureError: If LLM processing fails.
    """
Testing Strategy
Current test coverage: Partial (OCR schemas, TOC extraction)
Priority areas needing tests:
- End-to-end pipeline tests with mock OCR/LLM
- Weaviate ingestion tests with test collections
- Flask route tests with test client
- LLM module tests with fixed prompts
When writing tests:
# tests/utils/test_my_module.py
import pytest
from unittest.mock import Mock, patch

from utils.my_module import my_function


def test_my_function_success():
    """Test my_function with valid input."""
    result = my_function("valid_input")
    assert result["success"] is True


@patch("utils.my_module.expensive_api_call")
def test_my_function_with_mock(mock_api):
    """Test my_function with mocked API."""
    mock_api.return_value = {"data": "test"}
    result = my_function("input")
    mock_api.assert_called_once()
Run tests: pytest tests/ -v
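For the Flask routes, a test-client sketch might look like this (it assumes flask_app.py exposes a module-level `app`; route behaviour depends on Weaviate, so such tests usually mock or stub the client):
# tests/test_flask_routes.py -- sketch, assumes flask_app exposes `app`
import pytest

from flask_app import app


@pytest.fixture
def client():
    app.config["TESTING"] = True
    with app.test_client() as test_client:
        yield test_client


def test_homepage_renders(client):
    """Homepage should respond even with an empty collection."""
    response = client.get("/")
    assert response.status_code == 200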