Add Library RAG project and cleanup root directory
- Add complete Library RAG application (Flask + MCP server)
- PDF processing pipeline with OCR and LLM extraction
- Weaviate vector database integration (BGE-M3 embeddings)
- Flask web interface with search and document management
- MCP server for Claude Desktop integration
- Comprehensive test suite (134 tests)
- Clean up root directory
- Remove obsolete documentation files
- Remove backup and temporary files
- Update autonomous agent configuration
- Update prompts
- Enhance initializer bis prompt with better instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
489
generations/library_rag/.claude/CLAUDE.md
Normal file
@@ -0,0 +1,489 @@
|
||||
# CLAUDE.md
|
||||
|
||||
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
||||
|
||||
## Overview
|
||||
|
||||
**Library RAG** is a production-grade RAG system specialized in indexing and semantic search of philosophical and academic texts. It provides a complete pipeline from PDF upload through OCR, intelligent LLM-based extraction, to vectorized search in Weaviate.
|
||||
|
||||
**Core Architecture:**
|
||||
- **Vector Database**: Weaviate 1.34.4 with text2vec-transformers (BAAI/bge-m3, 1024-dim)
|
||||
- **OCR**: Mistral OCR API (~0.003€/page)
|
||||
- **LLM**: Ollama (local, free) or Mistral API (fast, paid)
|
||||
- **Web Interface**: Flask 3.0 with Server-Sent Events for real-time progress
|
||||
- **Infrastructure**: Docker Compose (Weaviate + transformers with GPU support)
|
||||
|
||||
**Migration Note (Dec 2024):** Migrated from MiniLM-L6 (384-dim) to BGE-M3 (1024-dim) for superior multilingual support (Greek, Latin, French, English) and 8192 token context window.
|
||||
|
||||
## Common Commands
|
||||
|
||||
### Development Setup
|
||||
|
||||
```bash
|
||||
# Windows
|
||||
init.bat
|
||||
|
||||
# Linux/macOS
|
||||
./init.sh
|
||||
|
||||
# Manual setup
|
||||
python -m venv venv
|
||||
source venv/bin/activate # or venv\Scripts\activate on Windows
|
||||
pip install -r requirements.txt
|
||||
docker compose up -d
|
||||
python schema.py
|
||||
```
|
||||
|
||||
### Running the Application
|
||||
|
||||
```bash
|
||||
# Start Weaviate (must be running first)
|
||||
docker compose up -d
|
||||
|
||||
# Create schema (first time only)
|
||||
python schema.py
|
||||
|
||||
# Start Flask web interface
|
||||
python flask_app.py
|
||||
# Access at http://localhost:5000
|
||||
|
||||
# Process a PDF programmatically
|
||||
python -c "from utils.pdf_pipeline import process_pdf; from pathlib import Path; process_pdf(Path('input/document.pdf'))"
|
||||
```
|
||||
|
||||
### Type Checking and Testing
|
||||
|
||||
```bash
|
||||
# Run mypy strict type checking
|
||||
mypy .
|
||||
mypy utils/pdf_pipeline.py # Check specific module
|
||||
|
||||
# Run all tests
|
||||
pytest
|
||||
|
||||
# Run specific test file
|
||||
pytest tests/utils/test_ocr_schemas.py -v
|
||||
|
||||
# Run with coverage
|
||||
pytest --cov=utils --cov-report=html
|
||||
```
|
||||
|
||||
### Docker Operations
|
||||
|
||||
```bash
|
||||
# Start services
|
||||
docker compose up -d
|
||||
|
||||
# Check Weaviate status
|
||||
curl http://localhost:8080/v1/.well-known/ready
|
||||
|
||||
# View logs
|
||||
docker compose logs weaviate
|
||||
docker compose logs text2vec-transformers
|
||||
|
||||
# Stop services
|
||||
docker compose down
|
||||
|
||||
# Remove volumes (WARNING: deletes all data)
|
||||
docker compose down -v
|
||||
```
|
||||
|
||||
### MCP Server (Claude Desktop Integration)
|
||||
|
||||
```bash
|
||||
# Run MCP server
|
||||
python mcp_server.py
|
||||
|
||||
# Test MCP tools
|
||||
python -c "from mcp_tools.parse_pdf import parse_pdf_tool; parse_pdf_tool({'pdf_path': 'input/test.pdf'})"
|
||||
```
|
||||
|
||||
## High-Level Architecture
|
||||
|
||||
### PDF Processing Pipeline (10 Steps)
|
||||
|
||||
The core of the application is `utils/pdf_pipeline.py`, which orchestrates a 10-step intelligent pipeline:
|
||||
|
||||
```
|
||||
[1] OCR (ocr_processor.py)
|
||||
↓ Extract text + images via Mistral OCR (~0.003€/page)
|
||||
[2] Markdown (markdown_builder.py)
|
||||
↓ Build structured markdown from OCR
|
||||
[3] Images (image_extractor.py)
|
||||
↓ Save images to output/images/
|
||||
[4] Metadata (llm_metadata.py)
|
||||
↓ LLM extracts title, author, year, language
|
||||
[5] TOC (llm_toc.py)
|
||||
↓ LLM extracts hierarchical table of contents
|
||||
[6] Classify (llm_classifier.py)
|
||||
↓ Classify sections (main_content, preface, bibliography, etc.)
|
||||
[7] Chunking (llm_chunker.py)
|
||||
↓ LLM semantic chunking into argumentative units
|
||||
[8] Cleaning (llm_cleaner.py)
|
||||
↓ Remove OCR artifacts, validate chunk length
|
||||
[9] Validation (llm_validator.py)
|
||||
↓ LLM validation + concept/keyword extraction
|
||||
[10] Ingestion (weaviate_ingest.py)
|
||||
↓ Batch insert + auto-vectorization in Weaviate
|
||||
```
|
||||
|
||||
**Key Parameters:**
|
||||
- `skip_ocr=True` - Reuse existing markdown (avoid OCR cost)
|
||||
- `use_llm=True` - Enable LLM processing steps
|
||||
- `llm_provider="ollama"|"mistral"` - Choose LLM provider
|
||||
- `use_semantic_chunking=True` - LLM-based chunking (slower, higher quality)
|
||||
- `use_ocr_annotations=True` - OCR with annotations (3x cost, better TOC)
|
||||
- `ingest_to_weaviate=True` - Insert chunks into Weaviate
|
||||
|
||||
### Weaviate Schema (4 Collections)
|
||||
|
||||
Defined in `schema.py`, the database uses four collections; parent fields are denormalized into nested objects on the child collections:
|
||||
|
||||
```
|
||||
Work (no vectorizer)
|
||||
title, author, year, language, genre
|
||||
|
||||
├─► Document (no vectorizer)
|
||||
│ sourceId, edition, pages, toc, hierarchy
|
||||
│ work: {title, author} (nested)
|
||||
│
|
||||
│ ├─► Chunk (text2vec-transformers) ⭐ PRIMARY
|
||||
│ │ text (VECTORIZED)
|
||||
│ │ keywords (VECTORIZED)
|
||||
│ │ sectionPath, chapterTitle, unitType, orderIndex
|
||||
│ │ work: {title, author} (nested)
|
||||
│ │ document: {sourceId, edition} (nested)
|
||||
│ │
|
||||
│ └─► Summary (text2vec-transformers)
|
||||
│ text (VECTORIZED)
|
||||
│ concepts (VECTORIZED)
|
||||
│ sectionPath, title, level, chunksCount
|
||||
│ document: {sourceId} (nested)
|
||||
```
|
||||
|
||||
**Vectorization Strategy:**
|
||||
- Only `Chunk.text`, `Chunk.keywords`, `Summary.text`, `Summary.concepts` are vectorized
|
||||
- Metadata fields use `skip_vectorization=True` for filtering performance
|
||||
- Nested objects avoid joins for efficient single-query retrieval
|
||||
- BAAI/bge-m3 model: 1024 dimensions, 8192 token context
|
||||
|
||||
### Module Organization
|
||||
|
||||
```
|
||||
library_rag/
|
||||
├── flask_app.py # Flask web app (38 KB) - routes, SSE, job queue
|
||||
├── schema.py # Weaviate schema definition + management
|
||||
├── docker-compose.yml # Weaviate + text2vec-transformers config
|
||||
│
|
||||
├── utils/ # Pipeline modules (all strictly typed)
|
||||
│ ├── types.py # Central TypedDict definitions (31 KB)
|
||||
│ ├── pdf_pipeline.py # Main orchestration (64 KB)
|
||||
│ │
|
||||
│ ├── mistral_client.py # Mistral OCR API client
|
||||
│ ├── ocr_processor.py # OCR processing logic
|
||||
│ ├── ocr_schemas.py # OCR response types
|
||||
│ │
|
||||
│ ├── llm_structurer.py # LLM infrastructure (Ollama/Mistral)
|
||||
│ ├── llm_metadata.py # Step 4: Metadata extraction
|
||||
│ ├── llm_toc.py # Step 5: TOC extraction
|
||||
│ ├── llm_classifier.py # Step 6: Section classification
|
||||
│ ├── llm_chunker.py # Step 7: Semantic chunking
|
||||
│ ├── llm_cleaner.py # Step 8: Chunk cleaning
|
||||
│ ├── llm_validator.py # Step 9: Validation + concepts
|
||||
│ │
|
||||
│ ├── markdown_builder.py # Step 2: Markdown construction
|
||||
│ ├── image_extractor.py # Step 3: Image extraction
|
||||
│ ├── hierarchy_parser.py # Hierarchical TOC parsing
|
||||
│ ├── weaviate_ingest.py # Step 10: Database ingestion
|
||||
│ │
|
||||
│ └── toc_extractor*.py # Alternative TOC strategies
|
||||
│
|
||||
├── mcp_server.py # MCP server for Claude Desktop
|
||||
├── mcp_tools/ # MCP tool implementations
|
||||
│ ├── parse_pdf.py
|
||||
│ └── search.py
|
||||
│
|
||||
├── templates/ # Jinja2 templates
|
||||
│ ├── upload.html # PDF upload form
|
||||
│ ├── upload_progress.html # SSE progress display
|
||||
│ ├── search.html # Semantic search interface
|
||||
│ └── ...
|
||||
│
|
||||
└── tests/ # Unit tests
|
||||
└── utils/
|
||||
```
|
||||
|
||||
### Type Safety System
|
||||
|
||||
**Critical Rule:** All code MUST pass `mypy --strict`. This is non-negotiable.
|
||||
|
||||
**Type Definitions (`utils/types.py`):**
|
||||
- `Metadata` - Document metadata (title, author, year, language)
|
||||
- `TOCEntry` - Hierarchical table of contents entries
|
||||
- `ChunkData` - Processed chunk with metadata
|
||||
- `PipelineResult` - Complete pipeline result dict
|
||||
- `LLMProvider` - Literal["ollama", "mistral"]
|
||||
- `SectionType` - 12 section classification types
|
||||
- `UnitType` - 10 chunk unit types (argument, definition, etc.)
|
||||
- `ProgressCallback` - Protocol for progress reporting
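
As a rough illustration of the kinds of definitions listed above, the sketch below shows how a `Literal`, a `TypedDict`, and a `Protocol` might be declared. The field types, the optional `year`, and the `ProgressCallback` signature are assumptions made for this example; the authoritative definitions are the ones in `utils/types.py`.

```python
from typing import Literal, Optional, Protocol, TypedDict

# Hypothetical shapes for illustration; see utils/types.py for the real ones.
LLMProvider = Literal["ollama", "mistral"]


class Metadata(TypedDict):
    """Document metadata (field names taken from the list above)."""

    title: str
    author: str
    year: Optional[int]  # assumption: the year may be unknown
    language: str


class ProgressCallback(Protocol):
    """Callable used to report pipeline progress (signature is assumed)."""

    def __call__(self, step: str, status: str) -> None: ...


def print_progress(step: str, status: str) -> None:
    """A trivial callback that structurally satisfies ProgressCallback."""
    print(f"[{step}] {status}")


meta: Metadata = {
    "title": "Collected papers",
    "author": "Charles Sanders PEIRCE",
    "year": 1931,
    "language": "en",
}
callback: ProgressCallback = print_progress
callback("metadata", f"extracted {meta['title']!r}")
```

Because `ProgressCallback` is a `Protocol`, any function with a matching signature type-checks without inheriting from it.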
|
||||
|
||||
**Configuration (`mypy.ini`):**
|
||||
- Strict mode enabled globally
|
||||
- Google-style docstrings required
|
||||
- Per-module overrides for gradual migration
|
||||
- Third-party libraries (weaviate, mistralai) have ignore_missing_imports
|
||||
|
||||
**When adding new code:**
|
||||
1. Define types in `utils/types.py` first
|
||||
2. Add type annotations to all functions/methods
|
||||
3. Run `mypy .` before committing
|
||||
4. Write Google-style docstrings
|
||||
|
||||
## Flask Application Routes
|
||||
|
||||
| Route | Method | Purpose |
|-------|--------|---------|
| `/` | GET | Homepage with collection statistics |
| `/passages` | GET | Browse all chunks (paginated) |
| `/search` | GET | Semantic search interface |
| `/upload` | GET | PDF upload form |
| `/upload` | POST | Start PDF processing job |
| `/upload/progress/<job_id>` | GET | SSE stream for job progress |
| `/upload/status/<job_id>` | GET | JSON job status |
| `/upload/result/<job_id>` | GET | Processing results page |
| `/documents` | GET | List all processed documents |
| `/documents/<doc>/view` | GET | Document details view |
| `/documents/delete/<doc>` | POST | Delete document + chunks |
| `/output/<filepath>` | GET | Download processed files |
|
||||
|
||||
**Server-Sent Events (SSE):** The upload progress route streams real-time updates for each pipeline step using Flask's `stream_with_context`.
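
To make the SSE wire format concrete, here is a minimal sketch of a generator that emits one message per pipeline step. The function names, event names, and payload keys are hypothetical and are not taken from `flask_app.py`; only the `event:` / `data:` framing with a blank-line terminator is standard SSE.

```python
import json
from typing import Dict, Iterable, Iterator


def format_sse(payload: Dict[str, object], event: str = "message") -> str:
    """Serialize one SSE message: event name, JSON data, blank-line terminator."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def progress_stream(steps: Iterable[str]) -> Iterator[str]:
    """Yield one 'progress' message per step, then a final 'done' message."""
    names = list(steps)
    for index, name in enumerate(names, start=1):
        yield format_sse(
            {"step": index, "total": len(names), "name": name},
            event="progress",
        )
    yield format_sse({"status": "complete"}, event="done")


# In a Flask view, a generator like this would typically be returned as
#   Response(stream_with_context(progress_stream(...)),
#            mimetype="text/event-stream")
messages = list(progress_stream(["OCR", "Markdown", "Images"]))
```

Keeping the formatting in a pure function makes it testable without starting the Flask app.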
|
||||
|
||||
## Cost Management
|
||||
|
||||
**OCR Costs (Mistral API):**
|
||||
- Standard: ~0.001-0.003€/page
|
||||
- With annotations: ~0.009€/page (3x, better TOC)
|
||||
|
||||
**LLM Costs:**
|
||||
- Ollama (local): FREE (slower, requires GPU/powerful CPU)
|
||||
- Mistral API: Variable (fast, production-ready)
|
||||
|
||||
**Best Practices:**
|
||||
1. Use `skip_ocr=True` when re-processing existing documents
|
||||
2. Use Ollama for development/testing
|
||||
3. Use Mistral API for production
|
||||
4. Check `<doc>_chunks.json` for cost tracking: `cost_ocr`, `cost_llm`, `cost_total`
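
The per-page rates above lend themselves to a quick estimate before running OCR. The sketch below is a hedged illustration: the helper names are hypothetical, and the standard rate assumes the upper bound of the quoted range (0.003€/page).

```python
from typing import Dict

# Per-page OCR rates quoted above, in euros (standard rate = upper bound).
OCR_RATE_STANDARD = 0.003
OCR_RATE_ANNOTATED = 0.009


def estimate_ocr_cost(pages: int, use_ocr_annotations: bool = False) -> float:
    """Estimate the OCR cost in euros for a document of `pages` pages."""
    rate = OCR_RATE_ANNOTATED if use_ocr_annotations else OCR_RATE_STANDARD
    return round(pages * rate, 3)


def summarize_costs(result: Dict[str, float]) -> str:
    """Format the cost fields of a loaded `<doc>_chunks.json` dictionary."""
    return (
        f"OCR {result['cost_ocr']:.3f}€ + LLM {result['cost_llm']:.3f}€ "
        f"= {result['cost_total']:.3f}€"
    )


standard = estimate_ocr_cost(40)
annotated = estimate_ocr_cost(40, use_ocr_annotations=True)
summary = summarize_costs({"cost_ocr": 0.12, "cost_llm": 0.03, "cost_total": 0.15})
```

For a 40-page document the standard estimate comes to 0.12€, which is the `cost_ocr` value shown in the sample `chunks.json` later in this file.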
|
||||
|
||||
## Development Workflow
|
||||
|
||||
### Adding a New Pipeline Step
|
||||
|
||||
1. Create module in `utils/` (e.g., `llm_summarizer.py`)
|
||||
2. Define types in `utils/types.py`
|
||||
3. Add step to `pdf_pipeline.py` orchestration
|
||||
4. Update progress callback to report step
|
||||
5. Add tests in `tests/utils/test_<module>.py`
|
||||
6. Run `mypy .` to verify types
|
||||
7. Update this CLAUDE.md if user-facing
|
||||
|
||||
### Modifying Weaviate Schema
|
||||
|
||||
1. **IMPORTANT:** Schema changes require data migration
|
||||
2. Edit `schema.py` collection definitions
|
||||
3. Test on empty Weaviate instance first
|
||||
4. Create migration script if data exists
|
||||
5. Update `utils/weaviate_ingest.py` ingestion logic
|
||||
6. Update `utils/types.py` if object shapes change
|
||||
7. Document migration in README.md
|
||||
|
||||
### Adding New Flask Routes
|
||||
|
||||
1. Add route handler in `flask_app.py`
|
||||
2. Create Jinja2 template in `templates/` if needed
|
||||
3. Update navigation in `templates/base.html`
|
||||
4. Add route to table in this CLAUDE.md
|
||||
5. Consider adding to MCP server if useful for Claude Desktop
|
||||
|
||||
## Common Debugging Scenarios
|
||||
|
||||
### "Weaviate connection failed"
|
||||
```bash
|
||||
docker compose ps # Check if running
|
||||
docker compose up -d # Start if needed
|
||||
docker compose logs weaviate # Check errors
|
||||
curl http://localhost:8080/v1/.well-known/ready # Test readiness
|
||||
```
|
||||
|
||||
### "OCR cost too high"
|
||||
```python
|
||||
# Reuse existing markdown
|
||||
result = process_pdf(
|
||||
Path("input/document.pdf"),
|
||||
skip_ocr=True, # Avoids OCR cost
|
||||
use_llm=True,
|
||||
)
|
||||
```
|
||||
|
||||
### "LLM timeout (Ollama)"
|
||||
```bash
|
||||
# Use lighter model
|
||||
export STRUCTURE_LLM_MODEL=qwen2.5:7b # Instead of deepseek-r1:14b
|
||||
|
||||
# Or switch to Mistral API
|
||||
# (in Python) result = process_pdf(..., llm_provider="mistral")
|
||||
```
|
||||
|
||||
### "Empty chunks after cleaning"
|
||||
1. Check `output/<doc>/<doc>_chunks.json`
|
||||
2. Inspect `classified_sections`: the classifier may have labeled main content as "ignore"
|
||||
3. Adjust classification prompts in `llm_classifier.py`
|
||||
4. Lower `min_chars` threshold in `llm_cleaner.py` if needed
|
||||
|
||||
### "TOC extraction failed"
|
||||
```python
|
||||
# Use OCR annotations (more reliable but 3x cost)
|
||||
result = process_pdf(
|
||||
Path("input/document.pdf"),
|
||||
use_ocr_annotations=True,
|
||||
)
|
||||
```
|
||||
|
||||
### Type errors from mypy
|
||||
```bash
|
||||
# Check specific module
|
||||
mypy utils/pdf_pipeline.py
|
||||
|
||||
# Ignore specific errors (last resort)
|
||||
# Add to mypy.ini:
|
||||
# [mypy-module_name]
|
||||
# ignore_errors = True
|
||||
```
|
||||
|
||||
## Output Files Structure
|
||||
|
||||
For each processed document `<doc_name>`:
|
||||
|
||||
```
|
||||
output/<doc_name>/
|
||||
├── <doc_name>.md # Structured markdown with hierarchy
|
||||
├── <doc_name>_ocr.json # Raw OCR response from Mistral
|
||||
├── <doc_name>_chunks.json # Processed chunks + metadata + costs
|
||||
├── <doc_name>_weaviate.json # Ingestion results (UUIDs, counts)
|
||||
└── images/ # Extracted images
|
||||
├── page_001_image_0.png
|
||||
└── ...
|
||||
```
|
||||
|
||||
**chunks.json structure:**
|
||||
```json
|
||||
{
  "metadata": {...},
  "classified_sections": [...],
  "chunks": [...],
  "cost_ocr": 0.12,
  "cost_llm": 0.03,
  "cost_total": 0.15,
  "pages": 40,
  "chunks_count": 127
}
|
||||
```
|
||||
|
||||
## Important Implementation Notes
|
||||
|
||||
### LLM Provider Configuration
|
||||
|
||||
Set in `.env` file:
|
||||
```env
|
||||
# Ollama (local)
|
||||
STRUCTURE_LLM_MODEL=qwen2.5:7b
|
||||
OLLAMA_BASE_URL=http://localhost:11434
|
||||
|
||||
# Mistral API (use instead of the Ollama model above)
STRUCTURE_LLM_MODEL=mistral-large-latest

# Mistral API key (required for OCR; also used by the Mistral LLM provider)
MISTRAL_API_KEY=your_key_here
|
||||
```
|
||||
|
||||
### Weaviate Connection
|
||||
|
||||
Default: `localhost:8080` (HTTP), `localhost:50051` (gRPC)
|
||||
|
||||
Python client v4 uses `connect_to_local()`:
|
||||
```python
|
||||
import weaviate
|
||||
client = weaviate.connect_to_local()
|
||||
```
|
||||
|
||||
### Nested Objects vs Cross-References
|
||||
|
||||
We use **nested objects** instead of Weaviate cross-references because:
|
||||
- Single-query retrieval (no joins needed)
|
||||
- Denormalized for read performance
|
||||
- Simplified query logic
|
||||
- Trade-off: Small data duplication acceptable
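
A minimal sketch of the denormalization described above, assuming plain dictionaries as inputs: the parent fields a query needs are copied into each chunk at ingestion time. The helper name `build_chunk_object` is hypothetical and does not reflect the actual code in `utils/weaviate_ingest.py`.

```python
from typing import Any, Dict


def build_chunk_object(
    chunk: Dict[str, Any],
    work: Dict[str, Any],
    document: Dict[str, Any],
) -> Dict[str, Any]:
    """Copy selected parent fields into the chunk as nested objects."""
    return {
        "text": chunk["text"],
        "keywords": chunk.get("keywords", []),
        "orderIndex": chunk["orderIndex"],
        # Nested copies: duplicated on every chunk, so no join at query time.
        "work": {"title": work["title"], "author": work["author"]},
        "document": {
            "sourceId": document["sourceId"],
            "edition": document.get("edition"),
        },
    }


chunk_object = build_chunk_object(
    {"text": "To erect a philosophical edifice...", "orderIndex": 0},
    {"title": "Collected papers", "author": "Charles Sanders PEIRCE"},
    {"sourceId": "peirce_collected_papers_fixed", "edition": "Harvard University Press"},
)
```

A search hit built this way already carries its author and source, at the cost of repeating a few short strings per chunk.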
|
||||
|
||||
### Google-Style Docstrings Required
|
||||
|
||||
```python
|
||||
def process_pdf(
|
||||
pdf_path: Path,
|
||||
*,
|
||||
use_llm: bool = True,
|
||||
llm_provider: LLMProvider = "ollama",
|
||||
) -> PipelineResult:
|
||||
"""Process a PDF through the complete RAG pipeline.
|
||||
|
||||
Args:
|
||||
pdf_path: Path to the PDF file to process.
|
||||
use_llm: Enable LLM processing steps (metadata, TOC, chunking).
|
||||
llm_provider: LLM provider ("ollama" for local, "mistral" for API).
|
||||
|
||||
Returns:
|
||||
Dictionary containing processing results with keys:
|
||||
- success: Whether processing succeeded
|
||||
- document_name: Name of the processed document
|
||||
- chunks_count: Number of chunks created
|
||||
- cost_ocr: OCR cost in euros
|
||||
- cost_llm: LLM cost in euros
|
||||
- cost_total: Total cost in euros
|
||||
|
||||
Raises:
|
||||
FileNotFoundError: If PDF file does not exist.
|
||||
OCRError: If OCR processing fails.
|
||||
LLMStructureError: If LLM processing fails.
|
||||
"""
|
||||
```
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
**Current test coverage:** Partial (OCR schemas, TOC extraction)
|
||||
|
||||
**Priority areas needing tests:**
|
||||
- End-to-end pipeline tests with mock OCR/LLM
|
||||
- Weaviate ingestion tests with test collections
|
||||
- Flask route tests with test client
|
||||
- LLM module tests with fixed prompts
|
||||
|
||||
**When writing tests:**
|
||||
```python
|
||||
# tests/utils/test_my_module.py
|
||||
import pytest
|
||||
from unittest.mock import Mock, patch
|
||||
from utils.my_module import my_function
|
||||
|
||||
def test_my_function_success() -> None:
|
||||
"""Test my_function with valid input."""
|
||||
result = my_function("valid_input")
|
||||
assert result["success"] is True
|
||||
|
||||
@patch("utils.my_module.expensive_api_call")
|
||||
def test_my_function_with_mock(mock_api: Mock) -> None:
|
||||
"""Test my_function with mocked API."""
|
||||
mock_api.return_value = {"data": "test"}
|
||||
result = my_function("input")
|
||||
mock_api.assert_called_once()
|
||||
```
|
||||
|
||||
Run tests: `pytest tests/ -v`
|
||||
42
generations/library_rag/.claude_settings.json
Normal file
@@ -0,0 +1,42 @@
|
||||
{
|
||||
"sandbox": {
|
||||
"enabled": true,
|
||||
"autoAllowBashIfSandboxed": true
|
||||
},
|
||||
"permissions": {
|
||||
"defaultMode": "acceptEdits",
|
||||
"allow": [
|
||||
"Read(./**)",
|
||||
"Write(./**)",
|
||||
"Edit(./**)",
|
||||
"Glob(./**)",
|
||||
"Grep(./**)",
|
||||
"Bash(*)",
|
||||
"mcp__puppeteer__puppeteer_navigate",
|
||||
"mcp__puppeteer__puppeteer_screenshot",
|
||||
"mcp__puppeteer__puppeteer_click",
|
||||
"mcp__puppeteer__puppeteer_fill",
|
||||
"mcp__puppeteer__puppeteer_select",
|
||||
"mcp__puppeteer__puppeteer_hover",
|
||||
"mcp__puppeteer__puppeteer_evaluate",
|
||||
"mcp__linear__list_teams",
|
||||
"mcp__linear__get_team",
|
||||
"mcp__linear__list_projects",
|
||||
"mcp__linear__get_project",
|
||||
"mcp__linear__create_project",
|
||||
"mcp__linear__update_project",
|
||||
"mcp__linear__list_issues",
|
||||
"mcp__linear__get_issue",
|
||||
"mcp__linear__create_issue",
|
||||
"mcp__linear__update_issue",
|
||||
"mcp__linear__list_my_issues",
|
||||
"mcp__linear__list_comments",
|
||||
"mcp__linear__create_comment",
|
||||
"mcp__linear__list_issue_statuses",
|
||||
"mcp__linear__get_issue_status",
|
||||
"mcp__linear__list_issue_labels",
|
||||
"mcp__linear__list_users",
|
||||
"mcp__linear__get_user"
|
||||
]
|
||||
}
|
||||
}
|
||||
7
generations/library_rag/.cursor/mcp.json
Normal file
@@ -0,0 +1,7 @@
|
||||
{
|
||||
"mcpServers": {
|
||||
"archon": {
|
||||
"serverUrl": "http://localhost:8051/mcp"
|
||||
}
|
||||
}
|
||||
}
|
||||
51
generations/library_rag/.env.example
Normal file
@@ -0,0 +1,51 @@
|
||||
# ============================================================================
|
||||
# Library RAG MCP Server - Environment Configuration
|
||||
# ============================================================================
|
||||
# Copy this file to .env and fill in your values.
|
||||
# Required variables are marked with [REQUIRED].
|
||||
# ============================================================================
|
||||
|
||||
# [REQUIRED] Mistral API Key for OCR and LLM services
|
||||
# Get your key at: https://console.mistral.ai/
|
||||
MISTRAL_API_KEY=your-mistral-api-key-here
|
||||
|
||||
# ============================================================================
|
||||
# LLM Configuration
|
||||
# ============================================================================
|
||||
|
||||
# Ollama base URL for local LLM (default: http://localhost:11434)
|
||||
OLLAMA_BASE_URL=http://localhost:11434
|
||||
|
||||
# LLM model for structure extraction (default: deepseek-r1:14b)
|
||||
STRUCTURE_LLM_MODEL=deepseek-r1:14b
|
||||
|
||||
# Temperature for LLM generation (0.0-2.0, default: 0.2)
|
||||
STRUCTURE_LLM_TEMPERATURE=0.2
|
||||
|
||||
# Default LLM provider: "ollama" (local, free) or "mistral" (API, paid)
|
||||
# The MCP server always uses "mistral" with mistral-medium-latest
|
||||
DEFAULT_LLM_PROVIDER=ollama
|
||||
|
||||
# ============================================================================
|
||||
# Weaviate Configuration
|
||||
# ============================================================================
|
||||
|
||||
# Weaviate server hostname (default: localhost)
|
||||
WEAVIATE_HOST=localhost
|
||||
|
||||
# Weaviate server port (default: 8080)
|
||||
WEAVIATE_PORT=8080
|
||||
|
||||
# ============================================================================
|
||||
# Logging
|
||||
# ============================================================================
|
||||
|
||||
# Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
|
||||
LOG_LEVEL=INFO
|
||||
|
||||
# ============================================================================
|
||||
# File System
|
||||
# ============================================================================
|
||||
|
||||
# Base directory for processed files (default: output)
|
||||
OUTPUT_DIR=output
|
||||
87
generations/library_rag/.gitignore
vendored
Normal file
@@ -0,0 +1,87 @@
|
||||
# Python
|
||||
__pycache__/
|
||||
*.py[cod]
|
||||
*$py.class
|
||||
*.so
|
||||
.Python
|
||||
build/
|
||||
develop-eggs/
|
||||
dist/
|
||||
downloads/
|
||||
eggs/
|
||||
.eggs/
|
||||
lib/
|
||||
lib64/
|
||||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheels/
|
||||
*.egg-info/
|
||||
.installed.cfg
|
||||
*.egg
|
||||
|
||||
# Virtual environments
|
||||
venv/
|
||||
ENV/
|
||||
env/
|
||||
.venv/
|
||||
|
||||
# IDE
|
||||
.idea/
|
||||
.vscode/
|
||||
*.swp
|
||||
*.swo
|
||||
*~
|
||||
|
||||
# Environment variables
|
||||
.env
|
||||
.env.local
|
||||
|
||||
# Logs
|
||||
*.log
|
||||
logs/
|
||||
|
||||
# OS
|
||||
.DS_Store
|
||||
Thumbs.db
|
||||
|
||||
# Output files (large generated files)
|
||||
output/*/images/
|
||||
output/*/*.json
|
||||
output/*/*.md
|
||||
|
||||
# Keep output folder structure
|
||||
!output/.gitkeep
|
||||
|
||||
# Temporary files
|
||||
*.tmp
|
||||
*.bak
|
||||
*.backup
|
||||
temp_*.py
|
||||
cleanup_*.py
|
||||
|
||||
# Type checking outputs
|
||||
mypy_errors.txt
|
||||
*_errors.txt
|
||||
|
||||
# Test PDFs (keep input/ folder but ignore PDFs)
|
||||
input/*.pdf
|
||||
|
||||
# Node artifacts (not a Node.js project)
|
||||
package-lock.json
|
||||
|
||||
# Linear backup files
|
||||
.linear_project.json.backup
|
||||
|
||||
# PRPs directory (project request proposals - temporary)
|
||||
PRPs/
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
# Markdown working directory (conversion scripts + large source files)
|
||||
md/
|
||||
13
generations/library_rag/.linear_project.json
Normal file
@@ -0,0 +1,13 @@
|
||||
{
|
||||
"initialized": true,
|
||||
"created_at": "2025-12-24T14:34:02.791Z",
|
||||
"team_id": "55c48386-0ed4-4f41-a963-c3d4bd2b9233",
|
||||
"team_name": "linear_rag_philo",
|
||||
"project_id": "c04bad18-7bda-44d0-a25c-51d5bf671734",
|
||||
"project_name": "library_rag",
|
||||
"project_url": "https://linear.app/philosophiatech/project/library-rag-mcp-server-pdf-ingestion-and-semantic-retrieval-5172487a22fc",
|
||||
"meta_issue_id": "8ef2b8d5-662b-49da-83b3-ee8a98028d7f",
|
||||
"meta_issue_identifier": "LRP-95",
|
||||
"total_issues": 55,
|
||||
"notes": "Project initialized by initializer agent. Extended by initializer bis with 12 new conversation interface issues (2025-12-29)."
|
||||
}
|
||||
296
generations/library_rag/CHUNK_JSON_FIELDS_EXPLAINED.md
Normal file
@@ -0,0 +1,296 @@
|
||||
# Chunk JSON Format - Complete Explanation

## Comparison: Current Format vs Complete Format

### ❌ CURRENT Format (Peirce chunks - INCOMPLETE)
|
||||
|
||||
```json
|
||||
{
|
||||
"chunk_id": "chunk_00000",
|
||||
"text": "To erect a philosophical edifice...",
|
||||
"section": "1. PREFACE",
|
||||
"section_level": 2,
|
||||
"type": "main_content",
|
||||
"concepts": []
|
||||
}
|
||||
```
|
||||
|
||||
**Missing fields:** `canonicalReference`, `chapterTitle`, `sectionPath`, `orderIndex`, `keywords`, `unitType`, `confidence`

### ✅ COMPLETE Format (required for enriched Weaviate ingestion)
|
||||
|
||||
```json
|
||||
{
|
||||
"chunk_id": "chunk_00000",
|
||||
"text": "To erect a philosophical edifice...",
|
||||
|
||||
"section": "1. PREFACE",
|
||||
"section_level": 2,
|
||||
"type": "main_content",
|
||||
|
||||
"canonicalReference": "CP 1.1",
|
||||
"chapterTitle": "Peirce: CP 1.1",
|
||||
"sectionPath": "Peirce: CP 1.1 > 1. PREFACE",
|
||||
"orderIndex": 0,
|
||||
|
||||
"keywords": ["philosophical edifice", "Aristotle", "matter and form"],
|
||||
"concepts": ["philosophy as architecture", "Aristotelian foundations"],
|
||||
|
||||
"unitType": "argument",
|
||||
"confidence": 0.95
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Field Descriptions

### 🔵 BASE Fields (generated by the chunker)

| Field | Type | Required | Description | Example |
|-------|------|----------|-------------|---------|
| `chunk_id` | string | ✅ Yes | Unique chunk identifier | `"chunk_00000"` |
| `text` | string | ✅ Yes | Full chunk text (VECTORIZED) | `"To erect a philosophical..."` |
| `section` | string | ✅ Yes | Title of the source section | `"1. PREFACE"` |
| `section_level` | int | ✅ Yes | Hierarchy level (1-6) | `2` |
| `type` | string | ✅ Yes | Section type | `"main_content"` |
|
||||
|
||||
**Possible section types:**
- `main_content`: Main content
- `preface`: Preface
- `introduction`: Introduction
- `conclusion`: Conclusion
- `bibliography`: Bibliography
- `appendix`: Appendices
- `notes`: Notes
- `table_of_contents`: Table of contents
- `index`: Index
- `acknowledgments`: Acknowledgments
- `abstract`: Abstract
- `ignore`: To be ignored
|
||||
|
||||
### 🟢 TOC ENRICHMENT Fields (added by toc_enricher)

| Field | Type | Required | Description | Example |
|-------|------|----------|-------------|---------|
| `canonicalReference` | string | ⭐ **CRITICAL** | Standard academic reference | `"CP 1.628"` |
| `chapterTitle` | string | ⭐ **CRITICAL** | Title of the parent chapter | `"Peirce: CP 1.1"` |
| `sectionPath` | string | ⭐ **CRITICAL** | Full hierarchical path | `"Peirce: CP 1.628 > 628. It is..."` |
| `orderIndex` | int | ⭐ **CRITICAL** | Sequential index (0-based) | `627` |

**Why they matter:** these fields enable:
- Precise academic citation (`canonicalReference`)
- Navigation within the document structure
- Sorting and organizing search results
- Reconstructing the original order of the text
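
As a small illustration of how these fields can be used, the sketch below sorts search hits back into reading order with `orderIndex` and builds a `sectionPath` by joining chapter and section titles with `" > "`, as in the examples in this file. Both helper names are hypothetical; the real logic belongs to the enrichment step and may differ.

```python
from typing import Any, Dict, List


def build_section_path(chapter_title: str, section: str) -> str:
    """Join chapter and section titles the way the sectionPath examples do."""
    return f"{chapter_title} > {section}"


def in_reading_order(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort chunks (e.g. search hits) back into original document order."""
    return sorted(chunks, key=lambda chunk: chunk["orderIndex"])


hits = [
    {"orderIndex": 627, "canonicalReference": "CP 1.628"},
    {"orderIndex": 0, "canonicalReference": "CP 1.1"},
]
ordered = in_reading_order(hits)
path = build_section_path("Peirce: CP 1.1", "1. PREFACE")
```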
|
||||
|
||||
### 🟡 LLM Fields (added by llm_validator)

| Field | Type | Required | Description | Example |
|-------|------|----------|-------------|---------|
| `keywords` | string[] | 🔶 Important | Extracted keywords (VECTORIZED) | `["instincts", "sentiments", "soul"]` |
| `concepts` | string[] | 🔶 Important | Philosophical concepts (VECTORIZED) | `["soul as instinct", "depth psychology"]` |
| `unitType` | string | 🔶 Important | Type of argumentative unit | `"argument"` |
| `confidence` | float | ⚪ Optional | LLM confidence (0-1) | `0.95` |

**Unit types (`unitType`):**
- `argument`: Complete argument
- `definition`: Definition of a concept
- `example`: Illustrative example
- `citation`: Quotation from another author
- `question`: Philosophical question
- `objection`: Objection to an argument
- `response`: Response to an objection
- `analysis`: Analysis of a concept
- `synthesis`: Synthesis of ideas
- `transition`: Transition between sections
|
||||
|
||||
### 🔴 METADATA Fields (document level)
|
||||
|
||||
```json
|
||||
{
|
||||
"metadata": {
|
||||
"title": "Collected papers",
|
||||
"author": "Charles Sanders PEIRCE",
|
||||
"year": 1931,
|
||||
"language": "en",
|
||||
"genre": "Philosophy"
|
||||
},
|
||||
"toc": [...],
|
||||
"hierarchy": {...},
|
||||
"pages": 548,
|
||||
"chunks_count": 5180,
|
||||
"chunks": [...]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Weaviate Mapping

### `Chunk` Collection

| JSON Field | Weaviate Property | Vectorized | Indexed | Type |
|------------|-------------------|------------|---------|------|
| `text` | `text` | ✅ Yes | ✅ Yes | text |
| `keywords` | `keywords` | ✅ Yes | ✅ Yes | text[] |
| `concepts` | `concepts` | ✅ Yes | ✅ Yes | text[] |
| `canonicalReference` | `canonicalReference` | ❌ No | ✅ Yes | text |
| `chapterTitle` | `chapterTitle` | ❌ No | ✅ Yes | text |
| `sectionPath` | `sectionPath` | ❌ No | ✅ Yes | text |
| `orderIndex` | `orderIndex` | ❌ No | ✅ Yes | int |
| `unitType` | `unitType` | ❌ No | ✅ Yes | text |
| `section` | `section` | ❌ No | ✅ Yes | text |
| `type` | `type` | ❌ No | ✅ Yes | text |
|
||||
|
||||
**Nested Objects** (denormalized for performance):
|
||||
|
||||
```json
|
||||
{
|
||||
"work": {
|
||||
"title": "Collected papers",
|
||||
"author": "Charles Sanders PEIRCE",
|
||||
"year": 1931,
|
||||
"language": "en",
|
||||
"genre": "Philosophy"
|
||||
},
|
||||
"document": {
|
||||
"sourceId": "peirce_collected_papers_fixed",
|
||||
"edition": "Harvard University Press"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Field Validation

### Validation Rules

1. **text**:
   - Min: 100 characters (after cleaning)
   - Max: 8000 characters (stays within BGE-M3's 8192-token context window)
   - No empty or whitespace-only text

2. **canonicalReference**:
   - Peirce format: `CP X.YYY` (e.g. `CP 1.628`)
   - Stephanus format: `Work NNNx` (e.g. `Ménon 80a`)
   - May be `null` if not applicable

3. **orderIndex**:
   - Integer >= 0
   - Sequential (no gaps)
   - Unique per document

4. **keywords** and **concepts**:
   - Array of strings
   - Min: 0 items (may be empty)
   - Max: 20 items recommended
   - No duplicates

5. **unitType**:
   - Must be one of the enum values
   - Default: `"argument"` if not specified
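
The rules above could be checked with a small validator along the following lines. This is a sketch under stated assumptions: the two regular expressions are a guess at the reference formats (including the `a`-`e` letter range for Stephanus references), the function names are hypothetical, and the recommended limit of 20 items is deliberately not enforced because it is only a recommendation.

```python
import re
from typing import Any, Dict, List

# The ten unit types listed in this document.
UNIT_TYPES = {
    "argument", "definition", "example", "citation", "question",
    "objection", "response", "analysis", "synthesis", "transition",
}
# Assumed patterns for the two reference formats described above.
PEIRCE_REF = re.compile(r"^CP \d+\.\d+$")
STEPHANUS_REF = re.compile(r"^.+ \d+[a-e]$")


def validate_chunk(chunk: Dict[str, Any]) -> List[str]:
    """Return the list of rule violations for one chunk (empty if valid)."""
    errors: List[str] = []
    text = str(chunk.get("text", "")).strip()
    if not 100 <= len(text) <= 8000:
        errors.append(f"text length {len(text)} outside 100-8000")
    ref = chunk.get("canonicalReference")
    if ref is not None and not (PEIRCE_REF.match(ref) or STEPHANUS_REF.match(ref)):
        errors.append(f"unrecognized canonicalReference: {ref!r}")
    for field in ("keywords", "concepts"):
        values = chunk.get(field, [])
        if len(values) != len(set(values)):
            errors.append(f"{field} contains duplicates")
    # unitType falls back to "argument" when the field is missing.
    unit_type = chunk.get("unitType", "argument")
    if unit_type not in UNIT_TYPES:
        errors.append(f"unknown unitType: {unit_type!r}")
    return errors


def validate_order(chunks: List[Dict[str, Any]]) -> bool:
    """True if the orderIndex values are exactly 0..n-1 (unique, no gaps)."""
    return sorted(c["orderIndex"] for c in chunks) == list(range(len(chunks)))


sample = {"text": "A" * 120, "canonicalReference": "CP 1.628", "keywords": ["soul"]}
issues = validate_chunk(sample)
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a chunk at once.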

---

## Complete Example for Peirce

```json
{
  "metadata": {
    "title": "Collected papers",
    "author": "Charles Sanders PEIRCE",
    "year": 1931,
    "language": "en",
    "genre": "Philosophy"
  },
  "toc": [
    {"title": "Peirce: CP 1.1", "level": 1},
    {"title": "1. PREFACE", "level": 2},
    {"title": "Peirce: CP 1.628", "level": 1},
    {"title": "628. It is the instincts...", "level": 2}
  ],
  "hierarchy": {"type": "flat"},
  "pages": 548,
  "chunks_count": 5180,
  "chunks": [
    {
      "chunk_id": "chunk_00627",
      "text": "It is the instincts, the sentiments, that make the substance of the soul. Cognition is only its surface, its locus of contact with what is external to it. All that is admirable in it is not only ours by nature, every creature has it; but all consciousness of it, and all that makes it valuable to us, comes to us from without, through the senses.",

      "section": "628. It is the instincts, the sentiments, that make the substance of the soul",
      "section_level": 2,
      "type": "main_content",

      "canonicalReference": "CP 1.628",
      "chapterTitle": "Peirce: CP 1.1",
      "sectionPath": "Peirce: CP 1.628 > 628. It is the instincts, the sentiments, that make the substance of the soul",
      "orderIndex": 627,

      "keywords": [
        "instincts",
        "sentiments",
        "soul",
        "substance",
        "cognition",
        "surface",
        "external",
        "consciousness",
        "senses"
      ],
      "concepts": [
        "soul as instinct and sentiment",
        "cognition as surface phenomenon",
        "external origin of consciousness",
        "sensory foundation of value"
      ],

      "unitType": "argument",
      "confidence": 0.94
    }
  ],
  "cost_ocr": 1.644,
  "cost_llm": 0.523,
  "cost_total": 2.167
}
```

---

## Checklist before Weaviate Ingestion

✅ Required fields present:
- [ ] `text` (non-empty, 100-8000 chars)
- [ ] `orderIndex` (sequential, unique)
- [ ] `section` and `section_level`
- [ ] `type` (valid enum value)

✅ Enrichment fields:
- [ ] `canonicalReference` (valid format or null)
- [ ] `chapterTitle` (present if a TOC is available)
- [ ] `sectionPath` (full hierarchy)

✅ LLM fields (if `use_llm=True`):
- [ ] `keywords` (array of strings)
- [ ] `concepts` (array of strings)
- [ ] `unitType` (valid enum value)

✅ Document metadata:
- [ ] `metadata.author` (present and valid)
- [ ] `metadata.title` (present and valid)
- [ ] TOC with at least 1 entry
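
The presence checks in this list could be automated along the following lines. This is a rough sketch only: `missing_fields` is a hypothetical helper, it checks presence rather than validity, and it covers the required chunk fields plus the document metadata items, not the enrichment or LLM fields.

```python
from typing import Any

# Required chunk fields, taken from the first group of the checklist.
REQUIRED_CHUNK_FIELDS = ("text", "orderIndex", "section", "section_level", "type")


def missing_fields(document: dict[str, Any]) -> list[str]:
    """List the checklist items that are absent from a chunks JSON document."""
    problems: list[str] = []

    metadata = document.get("metadata", {})
    for key in ("author", "title"):
        if not metadata.get(key):
            problems.append(f"metadata.{key}")

    # The TOC must contain at least one entry.
    if not document.get("toc"):
        problems.append("toc")

    for position, chunk in enumerate(document.get("chunks", [])):
        for field in REQUIRED_CHUNK_FIELDS:
            if field not in chunk:
                problems.append(f"chunks[{position}].{field}")

    return problems
```

An empty list means every item covered by the sketch is present.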

---

## Command to Check a File

```bash
python check_chunk_fields.py
```

It displays:
- Fields present in the chunks
- Fields missing for Weaviate
- State of the TOC and hierarchy
- An example of the first chunk
128
generations/library_rag/EXAMPLE_CHUNK_JSON_FORMAT.json
Normal file
@@ -0,0 +1,128 @@
{
  "metadata": {
    "title": "Collected papers",
    "author": "Charles Sanders PEIRCE",
    "year": 1931,
    "language": "en",
    "genre": "Philosophy"
  },
  "toc": [
    {
      "title": "Peirce: CP 1.1",
      "level": 1
    },
    {
      "title": "1. PREFACE",
      "level": 2
    },
    {
      "title": "Peirce: CP 1.2",
      "level": 1
    },
    {
      "title": "2. But before all else...",
      "level": 2
    }
  ],
  "hierarchy": {
    "type": "flat"
  },
  "pages": 548,
  "chunks_count": 5180,
  "chunks": [
    {
      "chunk_id": "chunk_00000",
      "text": "To erect a philosophical edifice that shall outlast the vicissitudes of time, my care must be, not so much to set each brick with nicest accuracy, as to lay the foundations deep and massive...",

      "section": "1. PREFACE",
      "section_level": 2,
      "type": "main_content",

      "canonicalReference": "CP 1.1",
      "chapterTitle": "Peirce: CP 1.1",
      "sectionPath": "Peirce: CP 1.1 > 1. PREFACE",
      "orderIndex": 0,

      "keywords": [
        "philosophical edifice",
        "Aristotle",
        "matter and form",
        "act and power",
        "peripatetic",
        "Descartes",
        "Kant",
        "comprehensive theory"
      ],
      "concepts": [
        "philosophy as architecture",
        "Aristotelian foundations",
        "modern philosophy needs",
        "comprehensive theory construction"
      ],

      "unitType": "argument",
      "confidence": 0.95
    },
    {
      "chunk_id": "chunk_00001",
      "text": "But before all else, let me make the acquaintance of my reader, and express my sincere esteem for him and the deep pleasure it is to me to address one so wise and so patient...",

      "section": "2. But before all else, let me make the acquaintance of my reader, and",
      "section_level": 2,
      "type": "main_content",

      "canonicalReference": "CP 1.2",
      "chapterTitle": "Peirce: CP 1.1",
      "sectionPath": "Peirce: CP 1.2 > 2. But before all else, let me make the acquaintance of my reader, and",
      "orderIndex": 1,

      "keywords": [
        "reader acquaintance",
        "preconceived opinions",
        "patient reader",
        "fundamental objections"
      ],
      "concepts": [
        "reader as critical thinker",
        "philosophy requires patience",
        "openness to new ideas"
      ],

      "unitType": "introduction",
      "confidence": 0.92
    },
    {
      "chunk_id": "chunk_00627",
      "text": "It is the instincts, the sentiments, that make the substance of the soul. Cognition is only its surface, its locus of contact with what is external to it...",

      "section": "628. It is the instincts, the sentiments, that make the substance of the soul",
      "section_level": 2,
      "type": "main_content",

      "canonicalReference": "CP 1.628",
      "chapterTitle": "Peirce: CP 1.1",
      "sectionPath": "Peirce: CP 1.628 > 628. It is the instincts, the sentiments, that make the substance of the soul",
      "orderIndex": 627,

      "keywords": [
        "instincts",
        "sentiments",
        "soul",
        "cognition",
        "surface",
        "external contact"
      ],
      "concepts": [
        "soul as instinct and sentiment",
        "cognition as surface phenomenon",
        "depth psychology"
      ],

      "unitType": "argument",
      "confidence": 0.94
    }
  ],
  "cost_ocr": 1.644,
  "cost_llm": 0.523,
  "cost_total": 2.167
}
845
generations/library_rag/README.md
Normal file
@@ -0,0 +1,845 @@
# Library RAG - Philosophical Text Database

A production-grade RAG (Retrieval-Augmented Generation) system specialized in indexing and semantic search of philosophical and academic texts. It provides a complete pipeline covering OCR, metadata extraction, intelligent chunking, and automatic vectorization.

> **Technical note (Dec 2024):** Migrated to BAAI/bge-m3 (1024-dim, 8192-token context) for stronger multilingual support (Greek, Latin, French, English) and better performance on philosophical texts. See [Appendix: BGE-M3 Migration](#annexe-migration-bge-m3).

---

## 🚀 Quick Start

```bash
# 1. Configure the environment variables
cp .env.example .env
# Edit .env and add your MISTRAL_API_KEY

# 2. Start Weaviate + transformers
docker compose up -d

# 3. Install the Python dependencies
pip install -r requirements.txt

# 4. Create the Weaviate schema
python schema.py

# 5. Launch the Flask web interface
python flask_app.py
```

Then open http://localhost:5000 in your browser.

---

## 📖 Table of Contents

- [Architecture](#-architecture)
- [PDF Processing Pipeline](#-pdf-processing-pipeline-10-steps)
- [Configuration](#%EF%B8%8F-configuration)
- [Flask Interface](#-flask-interface)
- [Weaviate Schema](#-weaviate-schema-4-collections)
- [Query Examples](#-query-examples)
- [MCP Server (Claude Desktop)](#-mcp-server-claude-desktop)
- [Cost Management](#-cost-management)
- [Tests](#-tests)
- [Debugging](#-debugging)
- [Production](#-production)
- [Appendices](#-annexes)

---

## 🏗️ Architecture

```mermaid
flowchart TB
    subgraph Docker["🐳 Docker Compose"]
        subgraph Weaviate["Weaviate 1.34.4"]
            direction TB
            Work["📚 Work<br/><i>no vectorizer</i>"]
            Document["📄 Document<br/><i>no vectorizer</i>"]
            Chunk["📝 Chunk<br/><i>text2vec-transformers</i>"]
            Summary["📋 Summary<br/><i>text2vec-transformers</i>"]

            Work --> Document
            Document --> Chunk
            Document --> Summary
        end

        Transformers["🤖 Transformers API<br/>BAAI/bge-m3 (1024-dim)"]
    end

    subgraph Flask["🌐 Flask App"]
        Parser["📄 PDF Pipeline<br/>10 steps"]
        OCR["🔍 Mistral OCR"]
        LLM["🧠 LLM<br/>Ollama / Mistral"]
        Web["🎨 Web Interface<br/>SSE Progress"]
    end

    Client["🐍 Python Client"]

    Client -->|"REST :8080<br/>gRPC :50051"| Weaviate
    Chunk -.->|vectorization| Transformers
    Summary -.->|vectorization| Transformers
    Parser --> OCR
    Parser --> LLM
    Parser --> Client
```

**Key components:**
- **Weaviate 1.34.4**: Vector database with 4 collections (Work, Document, Chunk, Summary)
- **BAAI/bge-m3**: Multilingual embedding model (1024 dimensions, 8192-token context)
- **Mistral OCR**: Text/image extraction (~0.003€/page)
- **LLM**: Ollama (local, free) or Mistral API (fast, paid)
- **Flask 3.0**: Web interface with Server-Sent Events (SSE)

---

## 📄 PDF Processing Pipeline (10 Steps)

The system implements an intelligent pipeline orchestrated by `utils/pdf_pipeline.py`:

```mermaid
flowchart TD
    PDF["📄 PDF Upload"] --> Step1["[1] OCR Mistral<br/>~0.003€/page"]
    Step1 --> Step2["[2] Markdown Builder<br/>Structures the text"]
    Step2 --> Step3["[3] Image Extractor<br/>Saves images"]
    Step3 --> Step4["[4] LLM Metadata<br/>Title, author, year"]
    Step4 --> Step5["[5] LLM TOC<br/>Table of contents"]
    Step5 --> Step6["[6] LLM Classifier<br/>Section classification"]
    Step6 --> Step7["[7] LLM Chunker<br/>Semantic chunking"]
    Step7 --> Step8["[8] Cleaner<br/>OCR cleanup"]
    Step8 --> Step9["[9] LLM Validator<br/>Validation + concepts"]
    Step9 --> Step10["[10] Weaviate Ingest<br/>Vectorization"]
    Step10 --> DB[("🗄️ Weaviate<br/>4 Collections")]
```

### Pipeline Details

| Step | Module | Function | Cost |
|------|--------|----------|------|
| **1** | `ocr_processor.py` | Text/image extraction via Mistral OCR | ~0.003€/page |
| **2** | `markdown_builder.py` | Builds structured Markdown | Free |
| **3** | `image_extractor.py` | Saves images to `output/images/` | Free |
| **4** | `llm_metadata.py` | Metadata extraction (title, author, language, year) | Variable (LLM) |
| **5** | `llm_toc.py` | Hierarchical table-of-contents extraction | Variable (LLM) |
| **6** | `llm_classifier.py` | Section classification (main_content, preamble, etc.) | Variable (LLM) |
| **7** | `llm_chunker.py` | Semantic splitting into argumentative units | Variable (LLM) |
| **8** | `llm_cleaner.py` | OCR artifact cleanup, length validation | Free |
| **9** | `llm_validator.py` | Chunk validation + concept/keyword extraction | Variable (LLM) |
| **10** | `weaviate_ingest.py` | Batch ingestion + automatic vectorization | Free |

**Real-time progress:** Server-Sent Events (SSE) let you follow each processing step from the web interface.

---

## ⚙️ Configuration

### Environment Variables

Create a `.env` file at the project root:

```env
# Mistral API (required for OCR)
MISTRAL_API_KEY=your_mistral_api_key_here

# LLM Configuration
STRUCTURE_LLM_MODEL=qwen2.5:7b              # Ollama model (or a Mistral model)
OLLAMA_BASE_URL=http://localhost:11434      # Ollama server URL
STRUCTURE_LLM_TEMPERATURE=0.2               # LLM temperature (0=deterministic, 1=creative)

# Optional APIs (not currently used)
ANTHROPIC_API_KEY=your_anthropic_key        # Optional
OPENAI_API_KEY=your_openai_key              # Optional

# Weaviate (defaults)
WEAVIATE_HOST=localhost
WEAVIATE_PORT=8080

# Linear Integration (for development within the framework)
LINEAR_TEAM=LRP                             # Linear team identifier
```

### Processing Options

When uploading a PDF, you can configure:

| Option | Default | Description |
|--------|---------|-------------|
| `skip_ocr` | `False` | Reuse existing Markdown (avoids the OCR cost) |
| `use_llm` | `True` | Enable the LLM steps (metadata, TOC, chunking) |
| `llm_provider` | `"ollama"` | `"ollama"` (local, free) or `"mistral"` (API, fast) |
| `llm_model` | `None` | Model name (auto-detected from `.env` if `None`) |
| `use_ocr_annotations` | `False` | OCR with annotations (3x cost, better TOC) |
| `use_semantic_chunking` | `False` | LLM chunking (slow but precise) |
| `ingest_to_weaviate` | `True` | Insert the chunks into Weaviate |
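
As a way to picture these options and their defaults, they can be grouped into a dataclass. This is a sketch under assumptions: `ProcessingOptions` and `resolve_llm_model` are hypothetical names, and the fallback to the `STRUCTURE_LLM_MODEL` environment variable is one plausible reading of "auto-detected from `.env`", not the project's confirmed behavior.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingOptions:
    """Hypothetical container mirroring the defaults in the options table."""
    skip_ocr: bool = False
    use_llm: bool = True
    llm_provider: str = "ollama"
    llm_model: Optional[str] = None
    use_ocr_annotations: bool = False
    use_semantic_chunking: bool = False
    ingest_to_weaviate: bool = True


def resolve_llm_model(options: ProcessingOptions, default: str = "qwen2.5:7b") -> str:
    """Pick the model name: explicit option first, then the environment, then a default."""
    if options.llm_model is not None:
        return options.llm_model
    # Assumption: the .env value is exposed as the STRUCTURE_LLM_MODEL variable.
    return os.environ.get("STRUCTURE_LLM_MODEL", default)
```

For example, `ProcessingOptions(skip_ocr=True)` keeps every other default and only disables the OCR step.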

---

## 📊 Weaviate Schema (4 Collections)

### Simplified Architecture

```
Work (no vectorizer)
├─ title, author, year, language, genre
│
└─► Document (no vectorizer)
    ├─ sourceId, edition, language, pages, chunksCount
    ├─ toc (JSON), hierarchy (JSON), createdAt
    ├─ work: {title, author} (nested)
    │
    ├─► Chunk (VECTORIZED ⭐)
    │   ├─ text (vectorized), keywords (vectorized)
    │   ├─ sectionPath, chapterTitle, unitType, orderIndex, language
    │   ├─ work: {title, author} (nested)
    │   └─ document: {sourceId, edition} (nested)
    │
    └─► Summary (VECTORIZED ⭐)
        ├─ text (vectorized), concepts (vectorized)
        ├─ sectionPath, title, level, chunksCount
        └─ document: {sourceId} (nested)
```

### Collections

**Work** (no vectorizer)
- Represents a philosophical work (e.g. Plato's Meno)
- Properties: `title`, `author`, `originalTitle`, `year`, `language`, `genre`
- Not vectorized (metadata only)

**Document** (no vectorizer)
- Represents a specific edition of a work (PDF, translation)
- Properties: `sourceId`, `edition`, `language`, `pages`, `chunksCount`, `toc`, `hierarchy`, `createdAt`
- Nested reference: `work: {title, author}`
- Not vectorized (metadata only)

**Chunk ⭐** (text2vec-transformers)
- Text fragment optimized for semantic search (200-800 characters)
- Vectorized properties: `text`, `keywords`
- Non-vectorized properties: `sectionPath`, `chapterTitle`, `unitType`, `orderIndex`, `language`
- Nested references: `work: {title, author}`, `document: {sourceId, edition}`

**Summary** (text2vec-transformers)
- LLM-generated summaries of chapters/sections for high-level search
- Vectorized properties: `text`, `concepts`
- Non-vectorized properties: `sectionPath`, `title`, `level`, `chunksCount`
- Nested reference: `document: {sourceId}`

### Design Patterns

**Nested Objects vs Cross-References:**
- Uses nested objects to avoid JOINs
- A single query returns the chunk together with its Work/Document metadata
- Trade-off: a small, controlled amount of duplication in exchange for maximum performance

**Selective Vectorization:**
- Only `Chunk.text/keywords` and `Summary.text/concepts` are vectorized
- Metadata properties use `skip_vectorization=True` for fast filtering
- Benefit: about 6× less computation than vectorizing every property

---

## 🌐 Flask Interface

### Available Routes

| Route | Method | Description |
|-------|--------|-------------|
| `/` | GET | 🏛️ Home: collection statistics |
| `/passages` | GET | 📚 Browse: paginated list of all chunks |
| `/search` | GET | 🔍 Search: semantic vector search |
| `/upload` | GET | 📤 Form: PDF upload page |
| `/upload` | POST | 🚀 Process: starts PDF processing in the background |
| `/upload/progress/<job_id>` | GET | 📊 SSE: real-time progress stream |
| `/upload/status/<job_id>` | GET | ℹ️ Status: JSON state of the processing job |
| `/upload/result/<job_id>` | GET | ✅ Results: processing results page |
| `/documents` | GET | 📁 Documents: list of processed documents |
| `/documents/<doc>/view` | GET | 👁️ Details: detailed view of a document |
| `/documents/delete/<doc>` | POST | 🗑️ Delete: removes the document and its chunks from Weaviate |
| `/output/<filepath>` | GET | 💾 Download: downloads processed files (MD, JSON) |

### Server-Sent Events (SSE)

The interface uses SSE to track processing in real time:

```javascript
// Example SSE stream
event: step
data: {"step": 1, "total": 10, "message": "OCR Mistral en cours...", "progress": 10}

event: step
data: {"step": 4, "total": 10, "message": "Extraction métadonnées (LLM)...", "progress": 40}

event: complete
data: {"success": true, "document": "platon-menon", "chunks": 127, "cost_ocr": 0.12, "cost_llm": 0.03}

event: error
data: {"error": "OCR failed: API timeout"}
```
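
To make the wire format concrete, here is a minimal sketch of how such messages could be serialized on the server side. The helpers `format_sse` and `step_event` are hypothetical and are not taken from `flask_app.py`; the percentage formula is an assumption that happens to reproduce the values in the stream above (step 1 of 10 gives 10, step 4 of 10 gives 40).

```python
import json
from typing import Any


def format_sse(event: str, data: dict[str, Any]) -> str:
    """Serialize one Server-Sent Events message (event name + JSON payload)."""
    payload = json.dumps(data, ensure_ascii=False)
    # A blank line terminates each SSE message.
    return f"event: {event}\ndata: {payload}\n\n"


def step_event(step: int, total: int, message: str) -> str:
    """Build a 'step' message with a derived integer progress percentage."""
    progress = step * 100 // total
    return format_sse(
        "step",
        {"step": step, "total": total, "message": message, "progress": progress},
    )
```

In a Flask view, strings like these would typically be yielded from a generator wrapped in a response with the `text/event-stream` MIME type.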

---

## 🔍 Query Examples

### Semantic Search (Chunk Collection)

```python
import weaviate
import weaviate.classes.query as wvq

client = weaviate.connect_to_local()

try:
    chunks = client.collections.get("Chunk")

    # Simple semantic search
    result = chunks.query.near_text(
        query="la mort et la valeur de la vie",
        limit=5,
        return_metadata=wvq.MetadataQuery(distance=True),
    )

    for obj in result.objects:
        work = obj.properties['work']
        doc = obj.properties['document']
        print(f"[{work['title']} - {work['author']}]")
        print(f"  Edition: {doc['edition']}")
        print(f"  Section: {obj.properties['sectionPath']}")
        print(f"  {obj.properties['text'][:200]}...")
        # Convert the returned distance into a similarity percentage
        print(f"  Similarity: {(1 - obj.metadata.distance) * 100:.1f}%\n")

finally:
    client.close()
```

### Search with Filters

```python
# Assumes `chunks` and `wvq` from the previous example.

# Search only within works by Plato.
# NOTE: this chained filter on the nested `work.author` property is kept as written;
# check that your Weaviate client version supports filtering on nested objects.
result = chunks.query.near_text(
    query="justice et vérité",
    limit=10,
    filters=wvq.Filter.by_property("work").by_property("author").equal("Platon"),
    return_metadata=wvq.MetadataQuery(distance=True),
)

# Filter by language
result = chunks.query.near_text(
    query="âme immortelle",
    limit=5,
    filters=wvq.Filter.by_property("language").equal("fr"),
)

# Filter by unit type (arguments only)
result = chunks.query.near_text(
    query="connaissance",
    filters=wvq.Filter.by_property("unitType").equal("argument"),
)
```

### Hybrid Search (Semantic + BM25)

```python
# Combines vector search and keyword search
result = chunks.query.hybrid(
    query="réminiscence et connaissance",
    alpha=0.75,  # 0 = BM25 only, 1 = vector only, 0.75 = favors vector search
    limit=10,
)
```
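
As a simplified mental model of what `alpha` does, the two scores can be thought of as a weighted blend. The sketch below is an illustration only: `blend_scores` is a hypothetical function, it assumes both scores are already normalized to the range 0 to 1, and Weaviate's actual fusion algorithms differ in detail.

```python
def blend_scores(vector_score: float, bm25_score: float, alpha: float = 0.75) -> float:
    """Weighted blend of a vector score and a keyword score (illustrative only)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # alpha weights the vector side; the remainder goes to the BM25 side.
    return alpha * vector_score + (1.0 - alpha) * bm25_score
```

With `alpha=0.75`, a passage that matches semantically but shares few exact keywords still ranks well, which suits paraphrased or translated philosophical vocabulary.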

### Search in Summaries (High-Level)

```python
summaries = client.collections.get("Summary")

# Search chapters/sections by concept
result = summaries.query.near_text(
    query="dialectique et maïeutique",
    limit=5,
)

for obj in result.objects:
    print(f"Section: {obj.properties['title']}")
    print(f"Level: {obj.properties['level']}")
    print(f"Summary: {obj.properties['text']}")
    print(f"Concepts: {', '.join(obj.properties['concepts'])}\n")
```

---

## 🤖 MCP Server (Claude Desktop)

Library RAG exposes its features through an MCP (Model Context Protocol) server for integration with Claude Desktop.

### MCP Installation

```bash
# Install the MCP dependencies
pip install -r requirements.txt

# Test the server
python mcp_server.py
```

### Claude Desktop Configuration

Add the following to your Claude Desktop configuration:

**Windows:** `%APPDATA%\Claude\claude_desktop_config.json`
**macOS:** `~/Library/Application Support/Claude/claude_desktop_config.json`
**Linux:** `~/.config/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "library-rag": {
      "command": "python",
      "args": ["C:/path/to/library_rag/mcp_server.py"],
      "env": {
        "MISTRAL_API_KEY": "your-mistral-api-key"
      }
    }
  }
}
```

### Available MCP Tools

**1. parse_pdf** - Processes a PDF with optimal parameters
```
parse_pdf(pdf_path="/docs/platon-menon.pdf")
```

**2. search_chunks** - Semantic search across chunks
```
search_chunks(query="la vertu", limit=10, author_filter="Platon")
```

**3. search_summaries** - Search across chapter summaries
```
search_summaries(query="dialectique", min_level=1, max_level=2)
```

**4. get_document** - Retrieves a document by ID
```
get_document(source_id="platon-menon", include_chunks=true)
```

**5. list_documents** - Lists all documents
```
list_documents(author_filter="Platon", language_filter="fr")
```

**6. get_chunks_by_document** - Retrieves the chunks of a document
```
get_chunks_by_document(source_id="platon-menon", limit=50)
```

**7. filter_by_author** - All works by an author
```
filter_by_author(author="Platon")
```

**8. delete_document** - Deletes a document (requires confirmation)
```
delete_document(source_id="platon-menon", confirm=true)
```

For more details, see the full documentation in `.claude/CLAUDE.md`.

---

## 💰 Cost Management

### OCR Costs (Mistral API)

| Mode | Cost per page | Usage |
|------|---------------|-------|
| **Standard** | ~0.001-0.003€ | Text + image extraction |
| **With annotations** | ~0.009€ (3x) | + Structural annotations (better TOC) |

**Optimization:** Use `skip_ocr=True` to reuse the existing Markdown and avoid OCR costs when reprocessing.
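
The per-page rates above translate into a simple estimate. This is a back-of-the-envelope sketch: `estimate_ocr_cost` is a hypothetical helper, it assumes the upper standard rate of 0.003€ per page and a flat 3x multiplier for annotations, and actual billing may differ.

```python
STANDARD_RATE_EUR = 0.003      # assumed upper bound of the standard per-page rate
ANNOTATION_MULTIPLIER = 3      # annotations are described as roughly 3x the cost


def estimate_ocr_cost(pages: int, with_annotations: bool = False, skip_ocr: bool = False) -> float:
    """Estimate the OCR cost in euros for a document of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if skip_ocr:
        # Reusing existing Markdown means no OCR call is billed.
        return 0.0
    rate = STANDARD_RATE_EUR * (ANNOTATION_MULTIPLIER if with_annotations else 1)
    return round(pages * rate, 3)
```

Under these assumptions a 40-page PDF comes to about 0.12€ in standard mode and about 0.36€ with annotations.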

### LLM Costs

| Provider | Cost | Performance |
|----------|------|-------------|
| **Ollama** (local) | Free | Slower (~30s/doc), requires a powerful GPU/CPU |
| **Mistral API** | Variable | Fast (~5s/doc), billed per token |

**Recommendation:**
- Development/testing: Ollama (free)
- Production: Mistral API (fast, scalable)

### Cost Tracking

Each processing run generates a `<doc>_chunks.json` file containing:

```json
{
  "cost_ocr": 0.12,
  "cost_llm": 0.03,
  "total_cost": 0.15,
  "pages": 40,
  "chunks": 127
}
```

---

## 🔧 Docker Configuration

The `docker-compose.yml` file configures:

### Weaviate 1.34.4
- **Ports:** 8080 (HTTP), 50051 (gRPC)
- **Modules:** `text2vec-transformers`
- **Persistence:** `weaviate_data` volume
- **Authentication:** Disabled (local dev)

### text2vec-transformers
- **Model:** `baai-bge-m3` (BAAI/bge-m3)
- **Dimensions:** 1024 (2.7x more than MiniLM-L6)
- **Context window:** 8192 tokens (16x longer than MiniLM-L6)
- **Mode:** GPU accelerated (CUDA) with CPU fallback
- **Multilingual:** Stronger support for Greek, Latin, French, English

```yaml
# GPU configuration (optional)
text2vec-transformers:
  environment:
    - ENABLE_CUDA=1
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
```

---

## 🧪 Tests

```bash
# Run all tests
pytest

# Specific tests
pytest tests/utils/test_ocr_schemas.py -v

# With coverage
pytest --cov=utils --cov-report=html

# Strict type checking
mypy .
```

**Available tests:**
- `test_ocr_schemas.py`: OCR schema validation
- `test_toc.py`: Table-of-contents extraction
- `test_mistral_client.py`: Mistral API client

---

## 🐛 Debugging

### Common Problems

**1. "Weaviate connection failed"**
```bash
# Check that the containers are running
docker compose ps

# Start them if needed
docker compose up -d

# Check the logs
docker compose logs weaviate
```

**2. "OCR cost too high"**
```python
from pathlib import Path

# Reuse the existing Markdown
# (process_pdf is the pipeline entry point; import it from your pipeline module)
result = process_pdf(
    Path("input/document.pdf"),
    skip_ocr=True,  # ← Skips OCR
    use_llm=True,
)
```

**3. "LLM timeout (Ollama)"**
```env
# Increase the timeout or use a lighter model
STRUCTURE_LLM_MODEL=qwen2.5:7b  # Instead of deepseek-r1:14b
```

**4. "Empty chunks after cleaning"**
```python
# Inspect the classified sections
import json

with open("output/<doc>/<doc>_chunks.json") as f:
    data = json.load(f)
    print(data["classified_sections"])
```

**5. "TOC extraction failed"**
```python
# Use OCR annotations (more reliable, but 3x the cost)
result = process_pdf(
    Path("input/document.pdf"),
    use_ocr_annotations=True,  # ← Better TOC
)
```

**6. "Is the _ocr.json file used?"**

The `<doc>_ocr.json` file is always created, but:
- **Normal pipeline:** ❌ Not used (the OCR response is held in memory and converted to Markdown)
- **`skip_ocr=True` mode:** ✅ Read only to retrieve the page count

**Purpose:** An archive in production, and a cache in development to avoid API costs.

### Logs

```python
import logging

# Enable detailed logs
logging.basicConfig(level=logging.DEBUG)

# Pipeline logs
logger = logging.getLogger("utils.pdf_pipeline")
logger.setLevel(logging.DEBUG)
```

---

## 🚀 Production

### Deployment Checklist

- [ ] **Security:** Add Flask authentication (Flask-Login, OAuth)
- [ ] **Rate limiting:** Limit uploads (Flask-Limiter)
- [ ] **Secrets:** Use a secrets manager (AWS Secrets Manager, Vault)
- [ ] **HTTPS:** Configure a reverse proxy (nginx + Let's Encrypt)
- [ ] **CORS:** Configure CORS if the API is hosted separately
- [ ] **Monitoring:** Centralized logging (Sentry, CloudWatch)
- [ ] **Costs:** OCR/LLM cost-tracking dashboard
- [ ] **Backup:** Weaviate backup strategy (Docker volumes)
- [ ] **Tests:** Full test suite (pytest + coverage >80%)
- [ ] **CI/CD:** Automated pipeline (GitHub Actions, GitLab CI)

### Nginx Example

```nginx
server {
    listen 80;
    server_name library-rag.example.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # SSE requires long timeouts
    location /upload/progress {
        proxy_pass http://127.0.0.1:5000;
        proxy_buffering off;
        proxy_read_timeout 600s;
    }
}
```

### Production WSGI

```bash
# Install Gunicorn
pip install gunicorn

# Launch with workers
gunicorn -w 4 -b 0.0.0.0:5000 --timeout 600 flask_app:app
```

---

## 📁 Project Structure

```
library_rag/
├── .env                          # Environment variables (API keys, LLM config)
├── .env.example                  # Example configuration
├── docker-compose.yml            # Weaviate + text2vec-transformers
├── requirements.txt              # Python dependencies
├── mypy.ini                      # mypy configuration (strict mode)
├── pytest.ini                    # pytest configuration
│
├── schema.py                     # ⚙️ Weaviate schema definition (4 collections)
├── flask_app.py                  # 🌐 Main Flask application (38 KB)
├── mcp_server.py                 # 🤖 MCP server for Claude Desktop
├── query_test.py                 # 🔍 Semantic query examples
│
├── utils/                        # 📦 PDF pipeline modules
│   ├── __init__.py
│   ├── types.py                  # Centralized TypedDicts (31 KB)
│   ├── pdf_pipeline.py           # 10-step pipeline orchestration (64 KB)
│   ├── mistral_client.py         # Mistral OCR API client
│   ├── pdf_uploader.py           # PDF upload to Mistral
│   ├── ocr_processor.py          # OCR processing
│   ├── ocr_schemas.py            # Types for OCR responses
│   ├── markdown_builder.py       # Markdown construction
│   ├── image_extractor.py        # Image extraction
│   ├── hierarchy_parser.py       # Hierarchy parsing
│   ├── llm_structurer.py         # LLM infrastructure (Ollama/Mistral)
│   ├── llm_metadata.py           # LLM: metadata extraction
│   ├── llm_toc.py                # LLM: TOC extraction
│   ├── llm_classifier.py         # LLM: section classification
│   ├── llm_chunker.py            # LLM: semantic chunking
│   ├── llm_cleaner.py            # Chunk cleanup
│   ├── llm_validator.py          # LLM: validation + concepts
│   ├── weaviate_ingest.py        # Weaviate batch ingestion
│   ├── toc_extractor.py          # TOC extraction (alternative strategies)
│   ├── toc_extractor_markdown.py
│   └── toc_extractor_visual.py
│
├── mcp_tools/                    # 🔧 MCP tool implementations
│   ├── parse_pdf.py
│   └── search.py
│
├── templates/                    # 🎨 Jinja2 templates
│   ├── base.html                 # Base template (navigation, CSS)
│   ├── index.html                # Home page (statistics)
│   ├── passages.html             # Paginated list of chunks
│   ├── search.html               # Semantic search interface
│   ├── upload.html               # PDF upload form
│   ├── upload_progress.html      # Real-time SSE progress
│   ├── upload_result.html        # Processing results
│   ├── documents.html            # List of processed documents
│   └── document_view.html        # Detailed view of a document
│
├── static/
│   └── rag-philo-charte.css      # 🎨 Style guide
│
├── input/                        # 📄 PDFs to process
│   └── (your PDF files)
│
├── output/                       # 💾 Processing results
│   └── <nom_document>/
│       ├── <nom_document>.md                # Structured Markdown
│       ├── <nom_document>_chunks.json       # Chunks + metadata
│       ├── <nom_document>_ocr.json          # Raw OCR response
│       ├── <nom_document>_weaviate.json     # Ingestion result
│       └── images/                          # Extracted images
│           ├── page_001_image_0.png
│           └── ...
│
├── tests/                        # 🧪 Unit tests
│   └── utils/
│       ├── test_ocr_schemas.py
│       ├── test_toc.py
│       └── test_mistral_client.py
│
├── .claude/                      # 🤖 Instructions for Claude Code
│   └── CLAUDE.md
│
└── README.md                     # 📖 This file
```
|
||||
|
||||
---
|
||||
|
||||
## 📚 Resources
|
||||
|
||||
### Documentation
|
||||
|
||||
- [Weaviate Documentation](https://weaviate.io/developers/weaviate)
|
||||
- [Weaviate Python Client v4](https://weaviate.io/developers/weaviate/client-libraries/python)
|
||||
- [text2vec-transformers](https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-transformers)
|
||||
- [Mistral AI API](https://docs.mistral.ai/)
|
||||
- [Ollama Documentation](https://ollama.ai/)
|
||||
- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
|
||||
|
||||
### Development

- `.claude/CLAUDE.md` - Development instructions for Claude Code
- `utils/types.py` - Centralized TypedDict definitions (31 KB)
- `mypy.ini` - Strict type-checking configuration
|
||||
|
||||
### Models

- **BAAI/bge-m3:** Multilingual embedding model (1024 dimensions, 8192-token context)
- **Qwen 2.5:** Recommended LLM for extraction (via Ollama)
- **Mistral API:** Cloud OCR + LLM (fast, paid)
|
||||
|
||||
---
|
||||
|
||||
## 📝 License

This project is an academic research tool. Refer to the license that applies to your copy.
|
||||
|
||||
---
|
||||
|
||||
## 🤝 Contributing

To contribute:

1. **Type Safety:** All functions must have type annotations
2. **Docstrings:** Google-style docstrings are mandatory
3. **Tests:** Add unit tests for new features
4. **mypy:** Code must pass `mypy --strict`
5. **Simplicity:** Follow the KISS and YAGNI principles
|
||||
|
||||
```bash
# Check types
mypy .

# Check docstrings
pydocstyle utils/

# Run tests
pytest
```
|
||||
|
||||
---
|
||||
|
||||
## 📌 Appendices

### Appendix: BGE-M3 Migration

**Date:** December 2024

**Reason:** Migration from MiniLM-L6 (384-dim) to BAAI/bge-m3 (1024-dim) for:
- ~2.7× more embedding dimensions (1024 vs 384), giving a richer semantic representation
- 8192-token context (vs 512)
- Superior multilingual support (Greek, Latin, French, English)
- Better performance on philosophical/academic texts

**Impact:**
- **No change** to the pipeline (steps 1-9)
- **Change** to vectorization (step 10): now uses BGE-M3
- **Weaviate collections**: recreated with 1024-dim vectors
- **Existing documents**: must be re-ingested
|
||||
|
||||
**Migration:**
```bash
# 1. Stop the containers
docker compose down

# 2. Start with the new configuration
docker compose up -d

# 3. Recreate the schema
python schema.py

# 4. Re-ingest documents from cache
python reingest_from_cache.py
```
|
||||
|
||||
**Rollback:** Restore `docker-compose.yml.backup` if needed (~15 min).

**GPU:** BGE-M3 uses ~2 GB of VRAM. Compatible with an RTX 4070 (12 GB) running Ollama/Qwen in parallel.
|
||||
|
||||
---
|
||||
|
||||
**Library RAG** - Production-grade RAG system for philosophical and academic texts.
|
||||
70
generations/library_rag/docker-compose.yml
Normal file
@@ -0,0 +1,70 @@
|
||||
# Library RAG - Weaviate + BGE-M3 Embeddings
# ===========================================
#
# This docker-compose runs Weaviate with BAAI/bge-m3 embedding model.
#
# BGE-M3 Advantages:
#   - 1024 dimensions (vs 384 for MiniLM-L6) - 2.7x richer representation
#   - 8192 token context (vs 512) - 16x longer sequences
#   - Superior multilingual support (Greek, Latin, French, English)
#   - Better trained on academic/philosophical texts
#
# GPU Configuration:
#   - ENABLE_CUDA="1" - Uses NVIDIA GPU for faster vectorization
#   - ENABLE_CUDA="0" - Uses CPU only (slower but functional)
#   - This file currently sets ENABLE_CUDA="0" and defines no GPU device
#     mapping (see the GPU LIMITATION note on the transformers service)
#
# Migration Note (2024-12):
#   Migrated from sentence-transformers-multi-qa-MiniLM-L6-cos-v1 (384-dim)
#   to BAAI/bge-m3 (1024-dim). All collections were deleted and recreated.
#   See MIGRATION_BGE_M3.md for details.

services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.34.4
    restart: on-failure:0
    ports:
      - "8080:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: "25"
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"  # OK for dev/local
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      CLUSTER_HOSTNAME: "node1"
      DEFAULT_VECTORIZER_MODULE: "text2vec-transformers"
      ENABLE_MODULES: "text2vec-transformers"
      TRANSFORMERS_INFERENCE_API: "http://text2vec-transformers:8080"
      # Limits to prevent OOM crashes
      GOMEMLIMIT: "6GiB"
      GOGC: "100"
    volumes:
      - weaviate_data:/var/lib/weaviate
    mem_limit: 8g
    memswap_limit: 10g
    cpus: 4

  text2vec-transformers:
    # BAAI/bge-m3: Multilingual embedding model (1024 dimensions)
    # Superior for philosophical texts (Greek, Latin, French, English)
    # 8192 token context window (16x longer than MiniLM-L6)
    # Using ONNX version (only available format in Weaviate registry)
    #
    # GPU LIMITATION (Dec 2024):
    # - Weaviate only provides ONNX version of BGE-M3 (no PyTorch)
    # - ONNX runtime is CPU-optimized (no native CUDA support)
    # - GPU acceleration would require NVIDIA NIM (different architecture)
    # - Current setup: CPU-only with AVX2 optimization (functional but slower)
    image: cr.weaviate.io/semitechnologies/transformers-inference:baai-bge-m3-onnx-latest
    restart: on-failure:0
    environment:
      # ONNX runtime - CPU only (CUDA not supported in ONNX version)
      ENABLE_CUDA: "0"
      # Increased timeouts for very long chunks (e.g., Peirce CP 3.403, CP 8.388, Menon chunk 10)
      # Default is 60s, increased to 600s (10 minutes) for exceptionally large texts (e.g., CP 8.388: 218k chars)
      WORKER_TIMEOUT: "600"
    mem_limit: 10g
    memswap_limit: 12g
    cpus: 3

volumes:
  weaviate_data:
|
||||
@@ -0,0 +1,625 @@
|
||||
# MCP Client Specification for a Python Application

## Overview

This document specifies how to implement an MCP client in your Python application so that your LLM can use the Library RAG tools through the MCP server.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────────┐
│                     YOUR PYTHON APPLICATION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────┐        ┌──────────────┐      ┌──────────────┐   │
│  │    LLM     │───────▶│  MCP Client  │─────▶│ Tool Executor│   │
│  │ (Mistral,  │◀───────│ (your code)  │◀─────│              │   │
│  │  Claude,   │        └──────────────┘      └──────────────┘   │
│  │  etc.)     │               │ ▲                               │
│  └────────────┘               │ │                               │
│                               │ │ stdio (JSON-RPC)              │
└───────────────────────────────┼─┼───────────────────────────────┘
                                │ │
                         ┌──────┴─┴───────┐
                         │   MCP Server   │
                         │  (subprocess)  │
                         │                │
                         │ library_rag/   │
                         │ mcp_server.py  │
                         └────────────────┘
                                 │
                          ┌──────┴──────┐
                          │  Weaviate   │
                          │  Database   │
                          └─────────────┘
```
|
||||
|
||||
## Components to Implement

### 1. MCP Client Manager

**File:** `mcp_client.py`

**Responsibilities:**
- Start the MCP server as a subprocess
- Communicate over stdin/stdout (JSON-RPC 2.0)
- Manage the server lifecycle
- Expose the available tools to the LLM

**Interface:**
|
||||
|
||||
```python
from __future__ import annotations  # annotations stay unevaluated, so the stub loads as-is

from typing import Any

# ToolDefinition and ToolResult are application-defined types.


class MCPClient:
    """Client for communicating with the Library RAG MCP server."""

    def __init__(self, server_script_path: str, env: dict[str, str] | None = None):
        """
        Args:
            server_script_path: Path to mcp_server.py
            env: Environment variables (MISTRAL_API_KEY, etc.)
        """
        pass

    async def start(self) -> None:
        """Start the MCP server subprocess."""
        pass

    async def stop(self) -> None:
        """Stop the MCP server subprocess."""
        pass

    async def list_tools(self) -> list[ToolDefinition]:
        """Get the list of available tools."""
        pass

    async def call_tool(
        self,
        tool_name: str,
        arguments: dict[str, Any]
    ) -> ToolResult:
        """Call an MCP tool.

        Args:
            tool_name: Tool name (e.g. "search_chunks")
            arguments: JSON arguments

        Returns:
            The tool result
        """
        pass
```
|
||||
|
||||
### 2. JSON-RPC Communication
|
||||
|
||||
**Message format:**

**Client → Server (tool call):**
|
||||
```json
|
||||
{
|
||||
"jsonrpc": "2.0",
|
||||
"id": 1,
|
||||
"method": "tools/call",
|
||||
"params": {
|
||||
"name": "search_chunks",
|
||||
"arguments": {
|
||||
"query": "nominalism and realism",
|
||||
"limit": 10
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Server → Client (result):**
|
||||
```json
|
||||
{
|
||||
"jsonrpc": "2.0",
|
||||
"id": 1,
|
||||
"result": {
|
||||
"content": [
|
||||
{
|
||||
"type": "text",
|
||||
"text": "{\"results\": [...], \"total_count\": 10}"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3. LLM Integration
|
||||
|
||||
**File:** `llm_with_tools.py`

**Responsibilities:**
- Convert the MCP tools into a format the LLM can use
- Manage the reasoning + tool-calling loop
- Parse LLM responses to extract tool calls

**Interface:**
|
||||
|
||||
```python
from __future__ import annotations  # annotations stay unevaluated, so the stub loads as-is


class LLMWithMCPTools:
    """LLM able to use the MCP tools."""

    def __init__(
        self,
        llm_client,  # Mistral, Anthropic, OpenAI client
        mcp_client: MCPClient
    ):
        """
        Args:
            llm_client: LLM client (Mistral, Claude, GPT)
            mcp_client: Initialized MCP client
        """
        pass

    async def chat(
        self,
        user_message: str,
        max_iterations: int = 5
    ) -> str:
        """
        Chat with an LLM that can use the MCP tools.

        Flow:
        1. Send the message to the LLM with the list of tools
        2. If the LLM requests a tool → execute it via MCP
        3. Send the result back to the LLM
        4. Repeat until a final answer is produced

        Args:
            user_message: The user's question
            max_iterations: Limit on tool-calling rounds

        Returns:
            The LLM's final answer
        """
        pass

    async def _convert_mcp_tools_to_llm_format(
        self,
        mcp_tools: list[ToolDefinition]
    ) -> list[dict]:
        """Convert the MCP tools to the LLM's format."""
        pass
```
|
||||
|
||||
## Detailed Communication Protocol

### Phase 1: Initialization
|
||||
|
||||
```python
# 1. Start the subprocess
process = await asyncio.create_subprocess_exec(
    "python", "mcp_server.py",
    stdin=asyncio.subprocess.PIPE,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
    env=environment_variables
)

# 2. Send the initialize request
initialize_request = {
    "jsonrpc": "2.0",
    "id": 0,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {
            "tools": {}
        },
        "clientInfo": {
            "name": "my-python-app",
            "version": "1.0.0"
        }
    }
}

# 3. Receive the initialize response
# The server returns its protocol version, capabilities, and server info
# (the tool list itself is fetched separately with tools/list, see Phase 2)

# 4. Send the initialized notification
initialized_notification = {
    "jsonrpc": "2.0",
    "method": "notifications/initialized"
}
```
|
||||
|
||||
### Phase 2: Tool Discovery
|
||||
|
||||
```python
# List the available tools
tools_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list"
}

# Expected response:
{
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "search_chunks",
                "description": "Search for text chunks using semantic similarity",
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"},
                        "limit": {"type": "integer", "default": 10},
                        "author_filter": {"type": "string"}
                    },
                    "required": ["query"]
                }
            },
            {
                "name": "parse_pdf",
                "description": "Process a PDF with OCR and ingest to Weaviate",
                "inputSchema": {
                    "type": "object",
                    "properties": {
                        "pdf_path": {"type": "string"}
                    },
                    "required": ["pdf_path"]
                }
            }
            # ... other tools
        ]
    }
}
```
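
Because each tool advertises an `inputSchema`, a client can reject malformed arguments before they reach the subprocess. The sketch below checks only `required` and the top-level `type` of each property. It is a deliberately partial illustration rather than a full JSON Schema validator, and `check_arguments` is a hypothetical helper name.

```python
from typing import Any

# JSON Schema type names mapped to the Python types that json.loads produces
_JSON_TYPES: dict[str, tuple[type, ...]] = {
    "string": (str,),
    "integer": (int,),
    "number": (int, float),
    "boolean": (bool,),
    "object": (dict,),
    "array": (list,),
}


def check_arguments(input_schema: dict[str, Any], arguments: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems (empty when the call looks valid)."""
    problems: list[str] = []
    properties = input_schema.get("properties", {})

    for name in input_schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")

    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"unknown argument: {name}")
            continue
        type_name = properties[name].get("type")
        if not isinstance(type_name, str) or type_name not in _JSON_TYPES:
            continue  # no simple type declared: nothing to check in this sketch
        # bool is a subclass of int in Python, so reject it explicitly for other types
        is_bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if is_bool_mismatch or not isinstance(value, _JSON_TYPES[type_name]):
            problems.append(f"argument {name} should be of type {type_name}")

    return problems
```

Returning a list of problems instead of raising lets the caller feed the message back to the LLM, which can then retry the tool call with corrected arguments.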
|
||||
|
||||
### Phase 3: Tool Call
|
||||
|
||||
```python
# Tool call
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_chunks",
        "arguments": {
            "query": "What is nominalism?",
            "limit": 5,
            "author_filter": "Charles Sanders Peirce"
        }
    }
}

# Response
{
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "content": [
            {
                "type": "text",
                "text": "{\"results\": [{\"text\": \"...\", \"similarity\": 0.89}], \"total_count\": 5}"
            }
        ]
    }
}
```
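
The response above shows only the success path. A call can fail in two distinct ways: a JSON-RPC `error` member (protocol-level failure), or a successful response whose result is flagged with `isError` (the tool itself failed). The following sketch separates the two. It assumes the `isError` flag described in the MCP specification, and `MCPError` and `unwrap_response` are hypothetical names.

```python
from typing import Any


class MCPError(Exception):
    """Raised when the server or the tool reports a failure."""


def unwrap_response(response: dict[str, Any]) -> dict[str, Any]:
    """Return the `result` member, raising MCPError on either kind of failure."""
    # Protocol-level failure: JSON-RPC error object with code and message
    if "error" in response:
        error = response["error"]
        raise MCPError(f"JSON-RPC error {error.get('code')}: {error.get('message')}")

    result = response.get("result", {})

    # Tool-level failure: the call succeeded but the tool flagged an error
    if result.get("isError"):
        texts = [
            item.get("text", "")
            for item in result.get("content", [])
            if item.get("type") == "text"
        ]
        raise MCPError("tool reported an error: " + " ".join(texts))

    return result
```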
|
||||
|
||||
## Python Dependencies

```toml
# pyproject.toml
[project]
dependencies = [
    "anyio>=4.0.0",       # Async I/O
    "pydantic>=2.0.0",    # Validation
    "httpx>=0.27.0",      # HTTP client (if downloading PDFs)

    # LLM client (choose one):
    "anthropic>=0.39.0",  # For Claude
    "mistralai>=1.2.0",   # For Mistral
    "openai>=1.54.0",     # For GPT
]
```
|
||||
|
||||
## Minimal Implementation Example

### mcp_client.py (skeleton)
|
||||
|
||||
```python
import asyncio
import json
import os
from typing import Any
from dataclasses import dataclass


@dataclass
class ToolDefinition:
    name: str
    description: str
    input_schema: dict[str, Any]


class MCPClient:
    def __init__(self, server_path: str, env: dict[str, str] | None = None):
        self.server_path = server_path
        self.env = env or {}
        self.process = None
        self.request_id = 0

    async def start(self):
        """Start the MCP server."""
        # Note: stderr is piped but never read in this skeleton; a real client
        # should drain it so the server cannot block on a full pipe.
        self.process = await asyncio.create_subprocess_exec(
            "python", self.server_path,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            env={**os.environ, **self.env}
        )

        # Initialize
        await self._send_request("initialize", {
            "protocolVersion": "2024-11-05",
            "capabilities": {"tools": {}},
            "clientInfo": {"name": "my-app", "version": "1.0"}
        })

        # Initialized notification
        await self._send_notification("notifications/initialized", {})

    async def _send_request(self, method: str, params: dict) -> dict:
        """Send a JSON-RPC request and wait for the response."""
        self.request_id += 1
        request = {
            "jsonrpc": "2.0",
            "id": self.request_id,
            "method": method,
            "params": params
        }

        # Write to stdin
        request_json = json.dumps(request) + "\n"
        self.process.stdin.write(request_json.encode())
        await self.process.stdin.drain()

        # Read from stdout (an empty read means the server exited)
        response_line = await self.process.stdout.readline()
        if not response_line:
            raise RuntimeError("MCP server closed stdout before responding")
        response = json.loads(response_line.decode())

        # Surface JSON-RPC errors instead of silently returning None
        if "error" in response:
            raise RuntimeError(f"MCP error: {response['error']}")

        return response.get("result")

    async def _send_notification(self, method: str, params: dict):
        """Send a notification (no response expected)."""
        notification = {
            "jsonrpc": "2.0",
            "method": method,
            "params": params
        }
        notification_json = json.dumps(notification) + "\n"
        self.process.stdin.write(notification_json.encode())
        await self.process.stdin.drain()

    async def list_tools(self) -> list[ToolDefinition]:
        """Get the list of tools."""
        result = await self._send_request("tools/list", {})
        tools = result.get("tools", [])

        return [
            ToolDefinition(
                name=tool["name"],
                description=tool.get("description", ""),
                input_schema=tool["inputSchema"]
            )
            for tool in tools
        ]

    async def call_tool(self, tool_name: str, arguments: dict) -> Any:
        """Call a tool."""
        result = await self._send_request("tools/call", {
            "name": tool_name,
            "arguments": arguments
        })

        # Extract the text content
        content = result.get("content", [])
        if content and content[0].get("type") == "text":
            return json.loads(content[0]["text"])

        return result

    async def stop(self):
        """Stop the server."""
        if self.process:
            self.process.terminate()
            await self.process.wait()
```
|
||||
|
||||
### llm_agent.py (example with Mistral)
|
||||
|
||||
```python
import json
import os

from mistralai import Mistral

from mcp_client import MCPClient  # the skeleton above, saved as mcp_client.py


class LLMAgent:
    def __init__(self, mcp_client: MCPClient):
        self.mcp_client = mcp_client
        self.mistral = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))
        self.tools = None
        self.messages = []

    async def initialize(self):
        """Load the MCP tools."""
        mcp_tools = await self.mcp_client.list_tools()

        # Convert to the Mistral format
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.input_schema
                }
            }
            for tool in mcp_tools
        ]

    async def chat(self, user_message: str) -> str:
        """Chat with tool calling."""
        self.messages.append({
            "role": "user",
            "content": user_message
        })

        max_iterations = 10

        for _ in range(max_iterations):
            # LLM call (synchronous: it blocks the event loop while waiting)
            response = self.mistral.chat.complete(
                model="mistral-large-latest",
                messages=self.messages,
                tools=self.tools,
                tool_choice="auto"
            )

            assistant_message = response.choices[0].message
            self.messages.append(assistant_message)

            # No tool calls → final answer
            if not assistant_message.tool_calls:
                return assistant_message.content

            # Execute the tool calls
            for tool_call in assistant_message.tool_calls:
                tool_name = tool_call.function.name
                arguments = json.loads(tool_call.function.arguments)

                # Call through MCP
                result = await self.mcp_client.call_tool(tool_name, arguments)

                # Append the result
                self.messages.append({
                    "role": "tool",
                    "name": tool_name,
                    "content": json.dumps(result),
                    "tool_call_id": tool_call.id
                })

        return "Max iterations reached"
```
|
||||
|
||||
### main.py (usage example)
|
||||
|
||||
```python
import asyncio
import os

from mcp_client import MCPClient  # skeleton above
from llm_agent import LLMAgent    # example above


async def main():
    # 1. Create the MCP client
    # Subprocess env values must be strings, so skip variables that are not set
    env = {}
    for name in ("MISTRAL_API_KEY", "LINEAR_API_KEY"):  # LINEAR_API_KEY only if needed
        value = os.getenv(name)
        if value is not None:
            env[name] = value

    mcp_client = MCPClient(
        server_path="path/to/library_rag/mcp_server.py",
        env=env
    )

    # 2. Start the server
    await mcp_client.start()

    try:
        # 3. Create the LLM agent
        agent = LLMAgent(mcp_client)
        await agent.initialize()

        # 4. Chat
        response = await agent.chat(
            "What did Peirce say about nominalism versus realism? "
            "Search the database and summarize the key points."
        )

        print(response)

    finally:
        # 5. Stop the server
        await mcp_client.stop()


if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
## Complete Flow
|
||||
|
||||
```
User: "What did Peirce say about nominalism?"
    │
    ▼
LLM Agent
    │
    ├─ Calls Mistral with the available tools
    │
    ▼
Mistral decides: "I need to use search_chunks"
    │
    ▼
LLM Agent → MCP Client
    │
    ├─ call_tool("search_chunks", {
    │      "query": "Peirce nominalism realism",
    │      "limit": 10
    │  })
    │
    ▼
MCP Server (subprocess)
    │
    ├─ Runs search_chunks_handler
    │
    ├─ Queries Weaviate
    │
    ├─ Returns JSON results
    │
    ▼
MCP Client receives the result
    │
    ▼
LLM Agent sends the result back to Mistral
    │
    ▼
Mistral synthesizes the final answer
    │
    ▼
User receives: "Peirce was a realist who believed that universals..."
```
|
||||
|
||||
## Required Environment Variables

```bash
# .env
MISTRAL_API_KEY=your_mistral_key     # For both the LLM and OCR
WEAVIATE_URL=http://localhost:8080   # Optional (default: localhost)
PYTHONPATH=/path/to/library_rag      # For imports
```
|
||||
|
||||
## References
|
||||
|
||||
- **MCP Protocol**: https://spec.modelcontextprotocol.io/
|
||||
- **JSON-RPC 2.0**: https://www.jsonrpc.org/specification
|
||||
- **Mistral Tool Use**: https://docs.mistral.ai/capabilities/function_calling/
|
||||
- **Anthropic Tool Use**: https://docs.anthropic.com/en/docs/tool-use
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Implement `MCPClient` with full protocol handling
2. Implement `LLMAgent` with your LLM of choice
3. Test with a simple tool (`search_chunks`)
4. Add error handling and retry logic
5. Add logging for debugging
6. Add unit tests
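
For step 4, retry logic can stay small. The following is one possible sketch using exponential backoff. `call_with_retry` is a hypothetical helper, and the choice of which exceptions are worth retrying is an assumption to adapt to your client.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError,),
) -> T:
    """Run `operation`, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: propagate the last failure
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A call would then look like `await call_with_retry(lambda: client.call_tool("search_chunks", {"query": "..."}))`. Retrying is only safe for idempotent tools. A search can be repeated freely, whereas re-running an ingestion tool may duplicate work.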
|
||||
|
||||
## Important Notes

- The MCP server uses **stdio** (stdin/stdout) for communication
- Each JSON-RPC message must be on **a single line** terminated by `\n`
- The server may write logs to **stderr** (not to be confused with stdout)
- Tool calls can be **long-running** (parse_pdf takes several minutes)
- Implement appropriate **timeouts**
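
One way to apply per-tool timeouts is to wrap each call in `asyncio.wait_for`. This is a minimal sketch: `with_timeout` and `TOOL_TIMEOUTS` are hypothetical, and the durations are placeholder assumptions, not measured values.

```python
import asyncio
from typing import Any, Awaitable

# Hypothetical per-tool budgets in seconds; tune them to your documents
TOOL_TIMEOUTS: dict[str, float] = {
    "search_chunks": 30.0,
    "parse_pdf": 900.0,
}
DEFAULT_TIMEOUT = 60.0


async def with_timeout(awaitable: Awaitable[Any], seconds: float, label: str) -> Any:
    """Await `awaitable`, raising TimeoutError with a readable message on expiry."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        raise TimeoutError(f"{label} did not finish within {seconds}s") from None


def timeout_for(tool_name: str) -> float:
    """Look up the budget for a tool, falling back to the default."""
    return TOOL_TIMEOUTS.get(tool_name, DEFAULT_TIMEOUT)
```

With a line-based client like the skeleton in this document, a timed-out request leaves an unread response in the pipe. After a timeout it is safer to restart the subprocess than to send the next request on the same stream.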
|
||||
386
generations/library_rag/docs_techniques/SCHEMA_V2_RATIONALE.md
Normal file
@@ -0,0 +1,386 @@
|
||||
# Weaviate Schema v2 - Design Rationale

## Overview

Schema v2 fixes the major problems of schema v1 and optimizes the database for:
- **Performance** (targeted vectorization)
- **Integrity** (normalization, no duplication)
- **Scalability** (cross-references)
- **Efficiency** (optimized queries)
|
||||
|
||||
---
|
||||
|
||||
## v1 vs v2 Comparison

### Schema v1 (Problematic)
|
||||
|
||||
```
Work (0 objects)              Document (auto-schema)
├── title                     ├── author    ❌ duplicated
├── author                    ├── title     ❌ duplicated
├── year                      └── toc       (empty)
└── ... (unused)
Passage (50 objects)
├── chunk     ✓
├── author    ❌ duplicated 50×
├── work      ❌ duplicated 50×
└── ... (auto-added properties)
```
|
||||
|
||||
**Problems**:
- ❌ Work is unused (0 objects)
- ❌ author/work duplicated 50 times in Passage
- ❌ No cross-references
- ❌ Uncontrolled auto-schema

### Schema v2 (Optimized)
|
||||
|
||||
```
Work (single source of truth)
├── title
├── author
└── year
    │
    ├──> Document (nested reference)
    │      ├── sourceId
    │      ├── edition
    │      ├── work → {title, author} ✓
    │      └── toc
    │
    └──> Passage (nested reference)
           ├── chunk (vectorized)
           ├── work → {title, author} ✓
           ├── document → {sourceId, edition} ✓
           └── keywords (vectorized)
```
|
||||
|
||||
**Advantages**:
- ✅ Work is the single source of truth
- ✅ Minimal, controlled duplication (nested references)
- ✅ Strict schema (no auto-added properties)
- ✅ Controlled vectorization
|
||||
|
||||
---
|
||||
|
||||
## Design Principles

### 1. Normalization with Partial Denormalization

**Principle**: Normalize the data, but partially denormalize through **nested objects** for performance.

#### Why Nested Objects Rather Than References?

**Option A: True References** (not used)
|
||||
```python
# Requires an extra lookup to retrieve Work
wvc.ReferenceProperty(
    name="work_ref",
    target_collection="Work",
)
```
❌ Requires a JOIN-like lookup → 2 queries instead of 1
|
||||
|
||||
**Option B: Nested Objects** (used ✓)
```python
# Essential Work fields embedded in Passage
wvc.Property(
    name="work",
    data_type=wvc.DataType.OBJECT,
    nested_properties=[
        wvc.Property(name="title", data_type=wvc.DataType.TEXT),
        wvc.Property(name="author", data_type=wvc.DataType.TEXT),
    ],
)
```
✅ A single query, with the essential data embedded

**Accepted trade-off**:
- `work.title` and `work.author` are duplicated in every Passage
- **BUT** the duplication is controlled and minimal (2 fields vs 10+ in v1)
- **GAIN**: 1 query instead of 2, i.e. half as many round-trips per lookup
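
The trade-off can be enforced in code by copying only a whitelisted subset of fields into each Passage. This is a sketch, not the project's ingestion code: `build_passage`, `pick`, and the two field tuples are hypothetical names that mirror the nested properties described above.

```python
from typing import Any

# Only these fields are denormalized into each Passage
WORK_FIELDS = ("title", "author")
DOCUMENT_FIELDS = ("sourceId", "edition")


def pick(source: dict[str, Any], fields: tuple[str, ...]) -> dict[str, Any]:
    """Copy the listed fields from `source` (raises KeyError if one is missing)."""
    return {name: source[name] for name in fields}


def build_passage(
    chunk: str,
    work: dict[str, Any],
    document: dict[str, Any],
    **extra: Any,
) -> dict[str, Any]:
    """Assemble a Passage payload with minimal nested copies of Work and Document."""
    return {
        "chunk": chunk,
        "work": pick(work, WORK_FIELDS),
        "document": pick(document, DOCUMENT_FIELDS),
        **extra,
    }
```

Keeping the whitelist in one place means that adding a third duplicated field is an explicit decision rather than an accident of whatever the caller passes in.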
|
||||
|
||||
---
|
||||
|
||||
### 2. Selective Vectorization

**Principle**: Only the fields relevant to semantic search are vectorized.

| Collection | Vectorizer | Vectorized Fields | Why |
|------------|-----------|-------------------|----------|
| **Work** | NONE | None | Metadata only, no semantic search |
| **Document** | NONE | None | Metadata only |
| **Passage** | text2vec | `chunk`, `keywords` | Main semantic search |
| **Section** | text2vec | `summary` | Summaries for an overview |

**Performance Impact**:
- v1: ~12 fields vectorized per Passage (including author, work, section...)
- v2: 2 fields vectorized (`chunk` + `keywords`)
- **Gain**: 6× fewer fields fed to the vectorizer
|
||||
|
||||
---
|
||||
|
||||
### 3. Explicit Skip Vectorization

**Principle**: Explicitly mark non-vectorizable fields to prevent auto-vectorization.

```python
wvc.Property(
    name="sectionPath",
    data_type=wvc.DataType.TEXT,
    skip_vectorization=True,  # ← Explicit
)
```

**Fields with skip_vectorization**:
- `sectionPath` → For exact filtering, not semantic search
- `chapterTitle` → For display, not search
- `unitType` → Category, not semantic
- `language` → Metadata, not semantic
- `document.sourceId` → Technical identifier
- `work.author` → Proper noun (exact filtering)

**Why?**
- Vectorizing "Platon" carries little semantic value
- Filtering with `author == "Platon"` is faster with an index
|
||||
|
||||
---
|
||||
|
||||
### 4. Strict Data Types

**Principle**: Use the correct Weaviate types to avoid implicit conversions.

| v1 (Auto-Schema) | v2 (Strict) | Impact |
|------------------|-------------|--------|
| `pages: NUMBER` | `pages: INT` | Validation + optimized index |
| `createdAt: TEXT` | `createdAt: DATE` | Native temporal queries |
| `chunksCount: NUMBER` | `passagesCount: INT` | Efficient aggregations |

**Concrete example**:
```python
# v1 (auto-schema): pages stored as 0.0 (float)
"pages": 0.0  # ❌ Wrong type (float instead of integer)

# v2 (strict): pages as INT
"pages": 42  # ✓ Correct type, validated
```
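
With a strict schema, values are best normalized before ingestion rather than left to implicit conversion. The sketch below assumes that `DATE` properties take RFC 3339 timestamps and that naive datetimes should be treated as UTC. `normalize_document`, `to_int`, and `to_rfc3339` are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import Any


def to_rfc3339(value: datetime) -> str:
    """Format a datetime as RFC 3339, treating naive values as UTC."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)
    return value.isoformat()


def to_int(value: Any) -> int:
    """Convert to int, rejecting booleans and non-integral floats."""
    if isinstance(value, bool):
        raise TypeError("booleans are not accepted as integers")
    if isinstance(value, float) and not value.is_integer():
        raise ValueError(f"{value!r} is not a whole number")
    return int(value)


def normalize_document(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `raw` with strictly typed pages, passagesCount, createdAt."""
    clean = dict(raw)
    clean["pages"] = to_int(raw["pages"])
    clean["passagesCount"] = to_int(raw["passagesCount"])
    clean["createdAt"] = to_rfc3339(raw["createdAt"])
    return clean
```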
|
||||
|
||||
---
|
||||
|
||||
### 5. Collection Hierarchy

**Principle**: Strict dependency order for references.

```
1. Work (independent)
   ↓
2. Document (references Work)
   ↓
3. Passage (references Document + Work)
   ↓
4. Section (references Document, optional)
```

**During ingestion**:
1. Create/retrieve the Work
2. Create the Document with `work: {title, author}`
3. Create the Passages with `document: {...}` and `work: {...}`
4. (Optional) Create the Sections
|
||||
|
||||
---
|
||||
|
||||
## Optimized Queries

### Simple Semantic Search
|
||||
|
||||
```python
|
||||
# Search for "la vertu" in the passages
|
||||
passages.query.near_text(
|
||||
query="la vertu",
|
||||
limit=10,
|
||||
return_properties=["chunk", "work.title", "work.author", "sectionPath"]
|
||||
)
|
||||
```
|
||||
|
||||
**v2 advantage**:
- A single query returns everything (nested work)
- No JOIN with Work needed

### Filtering by Author
|
||||
|
||||
```python
|
||||
# Find passages by Platon about justice
|
||||
passages.query.near_text(
|
||||
query="justice",
|
||||
filters=wvq.Filter.by_property("work.author").equal("Platon"),
|
||||
limit=10
|
||||
)
|
||||
```
|
||||
|
||||
**Avantage v2**:
|
||||
- Index sur `work.author` (skip_vectorization)
|
||||
- Filtrage exact rapide
|
||||
|
||||
### Navigation Hiérarchique
|
||||
|
||||
```python
|
||||
# Trouver tous les passages d'un chapitre
|
||||
passages.query.fetch_objects(
|
||||
filters=wvq.Filter.by_property("chapterTitle").equal("La vertu s'enseigne-t-elle?"),
|
||||
limit=100
|
||||
)
|
||||
```
|
||||
|
||||
**Avantage v2**:
|
||||
- `chapterTitle` indexé (skip_vectorization)
|
||||
- Pas de vectorisation inutile
|
||||
|
||||
---
|
||||
|
||||
## Handling Use Cases

### Case 1: Adding a new document

```python
# 1. Create/fetch the Work (only once)
work_data = {"title": "Ménon", "author": "Platon", "year": -380}

# 2. Create the Document
doc_data = {
    "sourceId": "menon_cousin_1850",
    "edition": "trad. Cousin",
    "work": {"title": "Ménon", "author": "Platon"},  # Nested
    "pages": 42,
    "passagesCount": 50,
}

# 3. Create the Passages
passage_data = {
    "chunk": "...",
    "work": {"title": "Ménon", "author": "Platon"},  # Nested
    "document": {"sourceId": "menon_cousin_1850", "edition": "trad. Cousin"},
    # ...
}
```

### Case 2: Deleting a document

```python
# Delete all related objects
delete_passages(sourceId="menon_cousin_1850")
delete_sections(sourceId="menon_cousin_1850")
delete_document(sourceId="menon_cousin_1850")
# The Work stays (it may be used by other Documents)
```

### Case 3: Multi-edition search

```python
# Compare two translations of the Meno
passages.query.near_text(
    query="réminiscence",
    filters=wvq.Filter.by_property("work.title").equal("Ménon"),
)
# Returns passages from all editions
```

---

## Migration v1 → v2

### Step 1: Back up the v1 data

```bash
python toutweaviate.py  # Full export
```

### Step 2: Recreate the v2 schema

```bash
python schema_v2.py
```

### Step 3: Adapt the ingestion code

Modify `weaviate_ingest.py`:

```python
# BEFORE (v1):
passage_obj = {
    "chunk": text,
    "work": title,      # ❌ duplicated STRING
    "author": author,   # ❌ duplicated STRING
    # ...
}

# AFTER (v2):
passage_obj = {
    "chunk": text,
    "work": {           # ✓ nested OBJECT
        "title": title,
        "author": author,
    },
    "document": {       # ✓ nested OBJECT
        "sourceId": doc_name,
        "edition": edition,
    },
    # ...
}
```
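
For data that has already been exported, the same before/after mapping can be applied mechanically. The sketch below assumes each exported v1 passage is a plain dict with string `work` and `author` fields; the function name `to_v2_passage` and its parameters are hypothetical and do not come from `weaviate_ingest.py`.

```python
def to_v2_passage(v1_passage: dict, source_id: str, edition: str) -> dict:
    """Convert a flat v1 passage dict into the nested v2 layout."""
    # Keep every property except the two flat strings being replaced.
    v2_passage = {
        key: value
        for key, value in v1_passage.items()
        if key not in ("work", "author")
    }
    # Flat strings become a nested "work" object.
    v2_passage["work"] = {
        "title": v1_passage["work"],
        "author": v1_passage["author"],
    }
    # The parent document is referenced through a nested object as well.
    v2_passage["document"] = {"sourceId": source_id, "edition": edition}
    return v2_passage
```

The input dict is left untouched, so the exported v1 data can be kept as a rollback copy while the converted objects are re-ingested.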

### Step 4: Re-ingest the data

```bash
# Process the PDF again with the new schema
python flask_app.py
# Upload via the web interface
```

---

## Performance Metrics

### Data Size

| Metric | v1 | v2 | Gain |
|--------|----|----|------|
| author/work duplication | 50× | 1× (Work) + 50× nested (controlled) | 30% space |
| Auto-added properties | 12 | 0 | 100% control |
| Vectorized fields | ~8 | 2 | 75% compute |

### Queries

| Operation | v1 | v2 | Gain |
|-----------|----|----|------|
| Search + metadata | 2 queries (Passage + JOIN) | 1 query (nested) | 50% latency |
| Filtering by author | Vector scan | Exact index | 10× speed |
| Hierarchical navigation | N/A (no Section) | Index + nested | ∞ |

---

## Conclusion

### Key Choices in Schema v2

1. ✅ **Nested Objects** for performance (1 query instead of 2)
2. ✅ **Skip Vectorization** on metadata (performance, exact filtering)
3. ✅ **Strict Types** (INT, DATE, TEXT, OBJECT)
4. ✅ **Selective Vectorization** (chunk + keywords only)
5. ✅ **Work as the Single Source** (no duplication)

### Accepted Trade-offs

1. ⚠️ Slight duplication via nested objects (acceptable)
2. ⚠️ No true references (for performance)
3. ⚠️ Section is optional (for simplicity)

### Next Steps

1. Test `schema_v2.py`
2. Adapt `weaviate_ingest.py` for nested objects
3. Migrate the existing data
4. Validate the queries

---

**Schema v2 = Production-Ready ✓**

@@ -0,0 +1,113 @@

# BGE-M3 Search Quality Validation Results

**Generated:** (Run `python test_bge_m3_quality.py --output SEARCH_QUALITY_RESULTS.md` to populate)

**Weaviate Version:** TBD

## Database Statistics

- **Total Documents:** TBD
- **Total Chunks:** TBD
- **Vector Dimensions:** TBD (expected: 1024)

## Vector Dimension Verification

Run the validation script to confirm BGE-M3 (1024-dim) vectors are properly configured.

Expected output: **BGE-M3 (1024-dim) vectors confirmed.**

## Test Categories

### 1. Multilingual Queries

Tests the model's ability to understand philosophical terms in multiple languages:

| Language | Test Terms |
|----------|------------|
| French | justice, vertu, liberte, verite, connaissance |
| English | virtue, knowledge, ethics, wisdom, justice |
| Greek | arete, telos, psyche, logos, eudaimonia |
| Latin | virtus, sapientia, forma, anima, ratio |

### 2. Semantic Understanding

Tests concept mapping for philosophical questions:

| Query | Expected Topics |
|-------|----------------|
| "What is the nature of reality?" | ontology, metaphysics, being |
| "How should we live?" | ethics, virtue, good life |
| "What can we know?" | epistemology, knowledge, truth |
| "What is the meaning of life?" | purpose, existence, value |
| "What is beauty?" | aesthetics, art, form |

### 3. Long Query Handling

Tests the extended 8192 token context (vs MiniLM-L6's 512 tokens):

- Uses a 100+ word query about Plato's Meno
- Verifies no truncation occurs
- Measures semantic accuracy of results

### 4. Performance Metrics

Performance targets:
- **Query Latency:** < 500ms average
- **Throughput:** Measured across 10 iterations per query
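
A latency measurement of this kind can be sketched with the standard library alone. The helper below is an illustration, not the code of `test_bge_m3_quality.py`: the name `measure_latency_ms` and the shape of the returned dict are assumptions, and `run_query` stands for whatever callable issues one search.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(
    run_query: Callable[[], object],
    iterations: int = 10,
    target_ms: float = 500.0,
) -> dict:
    """Time `run_query` repeatedly and compare the mean to a target."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        # perf_counter returns seconds; convert to milliseconds.
        samples.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.mean(samples)
    return {
        "iterations": iterations,
        "mean_ms": mean_ms,
        "max_ms": max(samples),
        "within_target": mean_ms < target_ms,
    }
```

In practice `run_query` would wrap a real search call; the first iteration is often slower (connection setup, cold caches), so reporting the maximum alongside the mean helps spot that effect.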

## Running the Tests

```bash
# Run all tests with verbose output
python test_bge_m3_quality.py --verbose

# Generate markdown report
python test_bge_m3_quality.py --output SEARCH_QUALITY_RESULTS.md

# Output as JSON
python test_bge_m3_quality.py --json
```

## Prerequisites

1. Weaviate must be running:
   ```bash
   docker-compose up -d
   ```

2. Documents must be ingested with BGE-M3 vectorizer

3. Schema must be created with 1024-dim vectors

## Expected Improvements over MiniLM-L6

| Feature | MiniLM-L6 | BGE-M3 |
|---------|-----------|--------|
| Vector Dimensions | 384 | 1024 (2.7x richer) |
| Context Window | 512 tokens | 8192 tokens (16x larger) |
| Multilingual | Limited | Excellent (Greek, Latin, French, English) |
| Academic Texts | Good | Superior (trained on research papers) |

## Troubleshooting

### "Connection error: Failed to connect to Weaviate"

Ensure Weaviate is running:
```bash
docker-compose up -d
docker-compose ps  # Check status
```

### "No vectors found in Chunk collection"

Ensure documents have been ingested:
```bash
python reingest_from_cache.py
```

### Vector dimensions show 384 instead of 1024

The BGE-M3 migration is incomplete. Re-run:
```bash
python migrate_to_bge_m3.py
```
196
generations/library_rag/docs_techniques/TOC_EXTRACTION.md
Normal file
@@ -0,0 +1,196 @@

# 📑 Table of Contents (TOC) Extraction

## Overview

The Philosophia system offers **two methods** for extracting the table of contents from PDF documents:

1. **Classic LLM extraction** (default): semantic analysis by a language model
2. **Extraction with indentation analysis** (recommended): visual detection of the hierarchy

## 🎯 Recommended Method: Indentation Analysis

### How It Works

This method analyzes the **markdown produced by the OCR** and detects the hierarchy by counting leading spaces:

```
Présentation                  → 0-2 spaces = level 1
    Qu'est-ce que la vertu ?  → 3-6 spaces = level 2
    Modèles de définition     → 3-6 spaces = level 2
Ménon ou de la vertu          → 0-2 spaces = level 1
```

### Advantages

- ✅ **Reliable**: Detection is based on the actual position of the text
- ✅ **Fast**: No additional API call
- ✅ **Economical**: Zero cost (reuses the OCR already performed)
- ✅ **Hierarchical**: Builds the parent/child structure correctly

### Activation

In the Flask interface, check **"Extraction TOC améliorée (analyse indentation)"** when uploading:

```python
# Via the API
process_pdf(
    pdf_path,
    use_ocr_annotations=True,  # Enables indentation analysis
)
```

### Algorithm

1. **TOC detection**: Searches for "Table des matières" in the markdown
2. **Entry extraction**: Regex pattern `Titre.....PageNumber`
3. **Space counting**:
   - `0-2 spaces` → level 1 (main heading)
   - `3-6 spaces` → level 2 (subsection)
   - `7+ spaces` → level 3 (sub-subsection)
4. **Hierarchy construction**: Uses a stack to organize parent/child entries
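
Steps 2 to 4 above can be sketched as follows. This is an illustration written for this document, not the code of `utils/toc_extractor_markdown.py`: the names `build_toc` and `indentation_level` are hypothetical, entries are assumed to use dot leaders, and step 1 (locating the TOC heading) is assumed to have been done already.

```python
import re

# Leading spaces, title, dot leaders (at least two dots), page number.
TOC_LINE = re.compile(r"^(\s*)(.+?)\s*\.{2,}\s*(\d+)\s*$")


def indentation_level(leading_spaces: int) -> int:
    """Map a leading-space count to a level using the thresholds above."""
    if leading_spaces <= 2:
        return 1
    if leading_spaces <= 6:
        return 2
    return 3


def build_toc(lines):
    """Build a nested TOC (list of root entries) from raw TOC lines."""
    roots = []
    stack = []  # chain of ancestors of the current entry, outermost first
    for line in lines:
        match = TOC_LINE.match(line)
        if match is None:
            continue  # not a TOC entry
        indent, title, page = match.groups()
        entry = {
            "title": title.strip(),
            "level": indentation_level(len(indent)),
            "page": int(page),
            "children": [],
        }
        # Pop until the top of the stack is a strictly shallower entry.
        while stack and stack[-1]["level"] >= entry["level"]:
            stack.pop()
        if stack:
            stack[-1]["children"].append(entry)
        else:
            roots.append(entry)
        stack.append(entry)
    return roots
```

The stack always holds the chain of open ancestors, so each entry is attached to the nearest preceding entry with a smaller level, which is what produces the parent/child structure from a flat list of lines.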

### Source Code

- **Main module**: `utils/toc_extractor_markdown.py`
- **Pipeline integration**: `utils/pdf_pipeline.py` (around line 290)
- **Key function**: `extract_toc_from_markdown()`

## 📊 Alternative Method: LLM Extraction

### How It Works

Sends the full markdown to an LLM (Mistral or Ollama), which analyzes the structure semantically.

### Advantages

- Understands the logical structure even without clear indentation
- Can infer the hierarchy from context

### Drawbacks

- ❌ **Less reliable**: May misinterpret the structure
- ❌ **Slower**: Additional LLM call
- ❌ **More expensive**: Consumes tokens
- ❌ **Sometimes flattens**: Tends to put everything at the same level

### Activation

This is the default method when the "Extraction TOC améliorée" option is **not** checked.

## 🔧 Advanced Configuration

### Customizable Parameters

```python
# In toc_extractor_markdown.py (excerpt)
def extract_toc_from_markdown(
    markdown_text: str,
    max_lines: int = 200,  # Number of lines scanned to find the TOC
):
    # Customizable indentation thresholds
    # (`leading_spaces` is computed for each TOC line)
    if leading_spaces <= 2:
        level = 1  # Adjust to your format
    elif leading_spaces <= 6:
        level = 2
    else:
        level = 3
```

### Customizable TOC Pattern

The regex pattern is meant to detect the following formats:

- `Titre.....3` (with dot leaders)
- `Titre 3` (with spaces)
- `Titre..3` (with a few dots)

To change it, edit the regex in `toc_extractor_markdown.py`. Note that the expression as shown here requires at least two dots, so the space-only form needs an adjusted pattern:

```python
match = re.match(r'^(.+?)\s*\.{2,}\s*(\d+)\s*$', line)
```
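
To make the difference concrete, the sketch below compares the expression quoted above with a hypothetical variant that also accepts plain whitespace between the title and the page number. `FLEXIBLE` and `parse_entry` are names invented for this example and are not taken from `toc_extractor_markdown.py`.

```python
import re

# The pattern quoted above: needs at least two dots before the page number.
DOTTED = re.compile(r"^(.+?)\s*\.{2,}\s*(\d+)\s*$")

# Hypothetical variant: dot leaders OR plain whitespace before the page.
FLEXIBLE = re.compile(r"^(.+?)(?:\s*\.{2,}\s*|\s+)(\d+)\s*$")


def parse_entry(line: str, pattern: re.Pattern = FLEXIBLE):
    """Return (title, page) for a TOC line, or None if it does not match."""
    match = pattern.match(line)
    if match is None:
        return None
    return match.group(1).strip(), int(match.group(2))


for sample in ("Titre.....3", "Titre..3", "Titre 3"):
    # Show how each pattern treats the three documented formats.
    print(sample, parse_entry(sample, DOTTED), parse_entry(sample, FLEXIBLE))
```

The looser variant has a cost: any line that simply ends in a number, such as a heading like "Livre 1", would be read as an entry on page 1, so it is best applied only inside the detected TOC block.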

## 📈 Comparative Results

### Test document: Plato's Meno (107 pages)

| Method | Entries | Levels | Hierarchy | Time | Cost |
|--------|---------|--------|-----------|------|------|
| **Classic LLM** | 11 | All level 1 | ❌ Flat | ~15s | +0.002€ |
| **Indentation analysis** | 11 | 2 levels | ✅ Correct | <1s | 0€ |

### Example of the resulting structure

```json
{
  "title": "Présentation",
  "level": 1,
  "children": [
    {"title": "Qu'est-ce que la vertu ?", "level": 2},
    {"title": "Modèles de définition", "level": 2},
    {"title": "Définition de la vertu", "level": 2},
    ...
  ]
},
{
  "title": "Ménon ou de la vertu",
  "level": 1,
  "children": []
}
```

## 🐛 Troubleshooting

### The TOC is not detected

**Problem**: The message "Table des matières introuvable" appears

**Solutions**:
1. Check that the PDF actually contains an explicit TOC
2. Increase `max_lines` if the TOC sits far into the document
3. Check that the TOC contains the text "Table des matières" or a variant

### All headings are at level 1

**Problem**: No hierarchy detected

**Solutions**:
1. Check that the headings have **visual indentation** in the original PDF
2. Adjust the space thresholds in the code (around lines 90-95 of `toc_extractor_markdown.py`)
3. Inspect the `.md` file to see how the OCR preserved the indentation

### Missing entries

**Problem**: Some headings do not appear

**Solutions**:
1. Check the regex pattern (it may not match the format of your TOC)
2. Look at the logs: `logger.debug()` prints every line analyzed
3. Increase the limit on the number of lines analyzed

## 🔬 Debug Mode

To enable detailed logs:

```python
import logging
logging.getLogger('utils.toc_extractor_markdown').setLevel(logging.DEBUG)
```

You will see:
```
Extraction TOC depuis markdown (analyse indentation)
TOC trouvée à la ligne 42
  'Présentation' → 0 espaces → level 1 (page 3)
  'Qu'est-ce que la vertu ?' → 4 espaces → level 2 (page 3)
  ...
✅ 11 entrées extraites depuis markdown
```

## 📚 References

- **Source code**: `utils/toc_extractor_markdown.py`
- **Tests**: Tested on Platon - Ménon and Tiercelin - La pensée-signe
- **Supported format**: PDF with an indented, text-based TOC
- **Languages**: French; works with any language whose TOC is indented with spaces

267
generations/library_rag/docs_techniques/TOC_EXTRACTION_UTILS2.md
Normal file
@@ -0,0 +1,267 @@

# Hierarchical TOC Extraction Pipeline (utils2/) - Complete Documentation

**Date**: 2025-12-09
**Version**: 1.0.0
**Status**: ✅ **Implementation Complete and Tested**

---

## 📋 Executive Summary

A simplified pipeline in `utils2/` that extracts the table of contents (TOC) from PDFs with an accurate hierarchy derived from bounding-box analysis. **91 unit tests** validate the implementation (100% passing).

### Main Features

- ✅ **Automatic multilingual detection** (FR, EN, ES, DE, IT)
- ✅ **Accurate hierarchy** from X positions (bounding boxes)
- ✅ **Optimized 2-pass pipeline** (about 64% cost savings in the 200-page example below)
- ✅ **Multi-page support** (TOC spanning several pages)
- ✅ **Dual output**: Markdown on the console + structured JSON
- ✅ **Simple CLI**: `python recherche_toc.py fichier.pdf`

---

## 🎯 Problem Solved: Plato's Meno

### Before (Simple OCR)

```
TOC detected ✓
Titles extracted ✓
Hierarchy ❌ → Everything at level 1 (indentation lost in OCR)
```

**Result**: Flat structure, visual hierarchy lost.

### After (Bounding Boxes)

```
TOC detected ✓
Bboxes retrieved ✓ (x, y of each line)
X position analyzed ✓
Hierarchy ✓ → Levels 1, 2, 3 correct
```

**Result**: Accurate hierarchy preserved.

---

## 🏗️ Architecture

### 2-Pass Pipeline

```
┌─────────────────────────────────────────────────────────────┐
│ PASS 1: Fast Detection (Simple OCR)                         │
│ • Cost: 0.001€/page                                         │
│ • Scans the whole document                                  │
│ • Detects the pages containing the TOC                      │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│ PASS 2: Precise Extraction (OCR with Bounding Boxes)        │
│ • Cost: 0.003€/page (TOC pages only)                        │
│ • Retrieves the X, Y position of each line                  │
│ • Derives the hierarchy level from the X position           │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│ Hierarchy Construction + Output                             │
│ • Parent-child structure                                    │
│ • Markdown on the console                                   │
│ • Structured JSON                                           │
└─────────────────────────────────────────────────────────────┘
```

### Hierarchy Detection

**Key Principle**: X position → hierarchy level

```
x = 100px → Level 1 (no indentation)
x = 130px → Level 2 (indented by 30px)
x = 160px → Level 3 (indented by 60px)
x = 190px → Level 4 (indented by 90px)
x = 220px → Level 5 (indented by 120px)
```

**Tolerance**: ±10px to absorb alignment variations
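
The mapping from X position to level can be sketched as a rounding step followed by a tolerance check. The function below is an illustration based only on the figures above (base at 100px, 30px per level, ±10px tolerance, 5 levels); `level_from_x` and its parameters are hypothetical names, and in practice the base position would presumably be taken from the leftmost TOC line rather than hard-coded.

```python
from typing import Optional


def level_from_x(
    x: float,
    base_x: float = 100.0,
    indent_step: float = 30.0,
    tolerance: float = 10.0,
    max_level: int = 5,
) -> Optional[int]:
    """Return the hierarchy level for an X position, or None if ambiguous."""
    # Nearest whole number of indentation steps, clamped to the valid range.
    steps = round((x - base_x) / indent_step)
    steps = max(0, min(steps, max_level - 1))

    # Reject positions that fall between two levels (outside the tolerance).
    expected_x = base_x + steps * indent_step
    if abs(x - expected_x) > tolerance:
        return None
    return steps + 1
```

Returning `None` for positions that sit between two levels leaves the decision to the caller, for example falling back to the level of the previous entry, instead of silently guessing.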

---

## 📁 Files Created

### Core Modules (`utils2/`)

| File | Lines | Description |
|------|-------|-------------|
| `pdf_uploader.py` | 35 | Uploads the PDF to the Mistral API |
| `ocr_schemas.py` | 31 | Pydantic schemas (OCRPage, OCRResponse, TOCBoundingBox) |
| `toc.py` | 420 | ⭐ Extraction and hierarchy logic |
| `recherche_toc.py` | 181 | 🚀 Main CLI script (6 steps) |
| `README.md` | 287 | Complete documentation |

**Total**: 954 lines (including the README)

### Tests (`tests/utils2/`)

| File | Tests | Description |
|------|-------|-------------|
| `test_toc.py` | 40 | Extraction, parsing, and hierarchy tests |
| `test_ocr_schemas.py` | 23 | Pydantic validation tests |
| `test_mistral_client.py` | 28 | Configuration and cost tests |

**Total**: 91 tests (100% passing)

---

## 💰 Costs and Optimization

### Mistral OCR Pricing

| Type | Cost | Usage |
|------|------|-------|
| Simple OCR | 0.001€/page | Pass 1 (detection) |
| OCR with bbox | 0.003€/page | Pass 2 (extraction) |

### Worked Examples

**50-page document, TOC on 3 pages:**
```
Pass 1: 50 × 0.001€ = 0.050€
Pass 2:  3 × 0.003€ = 0.009€
─────────────────────────────
Total:                0.059€
```

**200-page document, TOC on 5 pages:**
```
Pass 1: 200 × 0.001€ = 0.200€
Pass 2:   5 × 0.003€ = 0.015€
─────────────────────────────
Total:                 0.215€
```

### Savings vs the Naive Approach

**Naive approach**: bbox OCR on every page
```
200 pages × 0.003€ = 0.600€
```

**2-pass pipeline**: simple OCR + targeted bbox OCR
```
0.215€
```

**💰 Savings: 64%**
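
The arithmetic above generalizes to any page count. The sketch below uses only the two per-page prices from the pricing table; the function names are made up for this example.

```python
SIMPLE_OCR_EUR_PER_PAGE = 0.001  # pass 1, every page
BBOX_OCR_EUR_PER_PAGE = 0.003    # pass 2, TOC pages only


def two_pass_cost(total_pages: int, toc_pages: int) -> float:
    """Cost of simple OCR on all pages plus bbox OCR on the TOC pages."""
    return (
        total_pages * SIMPLE_OCR_EUR_PER_PAGE
        + toc_pages * BBOX_OCR_EUR_PER_PAGE
    )


def naive_cost(total_pages: int) -> float:
    """Cost of running bbox OCR on every page."""
    return total_pages * BBOX_OCR_EUR_PER_PAGE


def savings_ratio(total_pages: int, toc_pages: int) -> float:
    """Fraction of the naive cost saved by the 2-pass pipeline."""
    return 1.0 - two_pass_cost(total_pages, toc_pages) / naive_cost(total_pages)


for pages, toc in ((50, 3), (200, 5)):
    # Print the two-pass cost and the saving for both documented examples.
    print(pages, toc, round(two_pass_cost(pages, toc), 3),
          f"{savings_ratio(pages, toc):.0%}")
```

With these prices the ratio simplifies to 2/3 - t/N, where t is the number of TOC pages and N the total page count. The savings therefore stay just under 67% and shrink as the TOC takes up a larger share of the document, which is why the 50-page example saves slightly less than the 200-page one.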

---

## 🚀 Usage

### Installation

```bash
pip install mistralai python-dotenv pydantic
```

### Configuration

```bash
# .env at the project root
MISTRAL_API_KEY=votre_clé_api
```

### Commands

**Simple extraction:**
```bash
python utils2/recherche_toc.py document.pdf
```

**With options:**
```bash
# Specify the JSON output file
python utils2/recherche_toc.py document.pdf --output ma_toc.json

# Display only (no JSON)
python utils2/recherche_toc.py document.pdf --no-json

# Explicit API key
python utils2/recherche_toc.py document.pdf --api-key sk-xxx
```

---

## 🧪 Tests and Validation

### Statistics

- **91 unit tests** (100% passing)
- **Execution time**: ~2.76 seconds
- **Coverage**: Core functions, schemas, costs, edge cases

### Test Commands

```bash
# All tests
python -m pytest tests/utils2/ -v

# Quick run
python -m pytest tests/utils2/ -q

# Specific tests
python -m pytest tests/utils2/test_toc.py -v
```

---

## ✅ Success Criteria (All Met)

- [x] Mistral OCR works in utils2/
- [x] 2-pass pipeline implemented
- [x] Bounding boxes retrieved
- [x] **Hierarchy detected from X position** ← CRITICAL
- [x] Multilingual TOC detection (FR, EN, ES, DE, IT)
- [x] Multi-page TOC support
- [x] Working CLI
- [x] Complete documentation
- [x] Passing tests (91 tests, 100%)
- [x] Optimized cost (< 0.10€ for 50 pages)

---

## 📊 Final Metrics

| Metric | Value |
|--------|-------|
| **Files created** | 10 (5 modules + 3 tests + 2 docs) |
| **Lines of code** | 954 (modules) + 800 (tests) |
| **Unit tests** | 91 tests |
| **Pass rate** | 100% |
| **Test time** | 2.76s |
| **Cost savings** | ~64% (200-page example) |
| **Languages** | 5 |

---

## 🎉 Conclusion

The TOC extraction pipeline in `utils2/` is **complete, tested, and ready for production**.

**Strengths**:
- ✅ Optimized 2-pass architecture (~64% savings in the 200-page example)
- ✅ Accurate hierarchy from X positions
- ✅ 91 tests validating all use cases
- ✅ Complete documentation

**Status**: ✅ Production Ready

---

**Author**: Pipeline utils2 - TOC Extraction
**Date**: 2025-12-09
**Version**: 1.0.0

465
generations/library_rag/docs_techniques/analyse_collections.md
Normal file
@@ -0,0 +1,465 @@

# Consistency Analysis of the Weaviate Collections

**Date**: 2025-12-09
**Analyzed**: 3 collections, 51 objects

---

## Executive Summary

### Critical Issues Identified

1. **Defined schema out of sync with the actual schema**: the schema in `schema.py` does NOT match the schema currently in Weaviate
2. **Section collection missing**: defined in `schema.py` but absent from Weaviate
3. **Work collection unused**: 0 objects, redundant with the other collections
4. **Massive data duplication**: author/work repeated 50 times instead of using references
5. **Empty metadata**: TOC and hierarchy are not populated
6. **Uncontrolled auto-schema**: properties added automatically without validation

---

## 1. Document Collection

### Current Configuration
- **Vectorizer**: `TEXT2VEC_TRANSFORMERS` ⚠️
- **Objects**: 1
- **Auto-generated**: YES (all properties)

### ❌ Issues Identified

#### 1.1 Auto-Generated Schema
```
"This property was generated by Weaviate's auto-schema feature on Fri Dec 5 16:10:30 2025"
```
- The actual schema was **NOT created** via `schema.py`
- Weaviate auto-generated the schema at insertion time
- **Consequence**: Loss of control over types and configuration

#### 1.2 Incorrect Vectorizer
**Expected** (schema.py:21):
```python
vectorizer_config=wvc.Configure.Vectorizer.none()
```

**Actual**:
```
Vectorizer: TEXT2VEC_TRANSFORMERS
```

**Impact**: Metadata is vectorized needlessly, which wastes resources

#### 1.3 Skip Vectorization Ignored
**Expected** (schema.py:85-86):
```python
skip_vectorization=True  # For sectionPath and title
```

**Actual**:
```
All properties: Skip Vectorization = ❌
```

**Impact**: All metadata is vectorized needlessly

#### 1.4 Empty/Invalid Data
```json
{
  "toc": "[]",           // ❌ Empty although the document has a TOC
  "hierarchy": "{}",     // ❌ Empty although the document has a hierarchy
  "pages": 0.0,          // ❌ Should be > 0
  "chunksCount": 50.0    // ⚠️ Float instead of INT
}
```

#### 1.5 DATE Type Lost
**Expected** (schema.py:66):
```python
data_type=wvc.DataType.DATE
```

**Actual**:
```
createdAt: TEXT
```

**Impact**: Dates cannot be filtered efficiently

---

## 2. Passage Collection

### Current Configuration
- **Vectorizer**: `TEXT2VEC_TRANSFORMERS` ✅
- **Objects**: 50
- **Description**: Correct

### ⚠️ Issues Identified

#### 2.1 Undefined Properties Added
The schema in `schema.py` defines 9 properties, but Weaviate has **12**:

**Additional auto-generated properties**:
- `chapterTitle` (TEXT)
- `chapterConcepts` (TEXT_ARRAY)
- `sectionLevel` (NUMBER)

**Problem**: These properties are not in the original schema and were added automatically without validation.

#### 2.2 Skip Vectorization (No Discrepancy)
According to `schema.py`, NO Passage property should have `skip_vectorization=True`.

**Actual**: All properties are vectorized ✅ (correct)

#### 2.3 Massive Data Duplication

**author** repeated 50 times:
```json
"author": "Platon"  // x50 passages
```

**work** repeated 50 times:
```json
"work": "Ménon ou de la vertu"  // x50 passages
```

**Impact**:
- Wasted space (50 × ~20 bytes = 1 KB for author alone)
- No normalization
- The author cannot be changed in one place
- No relation to the Work collection

#### 2.4 Inconsistent Data

**orderIndex**:
- Min: 1, Max: 49 (expected: 0-49 for 50 chunks)
- ⚠️ Index 0 OR index 50 is missing

**keywords**:
- Sometimes empty `[]` (11 passages)
- No normalization

**chapterConcepts**:
- **ALWAYS empty** `[]` for every passage
- Feature unused → the property serves no purpose

**unitType**:
- 5 values: `exposition`, `main_content`, `argument`, `transition`, `définition`
- No validation (could contain anything)

**section**:
- 13 unique values for 50 passages
- Highly variable: `"SOCRATE"`, `"MENON"`, `"Qu'est-ce que la vertu?"`, etc.
- No standard format
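
The `orderIndex` anomaly noted above (50 passages but indices ranging only from 1 to 49) can be pinned down with a small check on the exported values. This is a sketch on plain Python lists; the function name `order_index_report` is hypothetical, and fetching the indices from Weaviate is left out.

```python
from collections import Counter


def order_index_report(indices, expected_count: int, start: int = 0) -> dict:
    """Report missing, unexpected, and duplicated order indices."""
    expected = set(range(start, start + expected_count))
    counts = Counter(indices)
    seen = set(counts)
    return {
        "missing": sorted(expected - seen),
        "unexpected": sorted(seen - expected),
        "duplicates": sorted(i for i, n in counts.items() if n > 1),
    }
```

Because 49 distinct values cannot cover 50 passages, such a report would show either a missing index together with a duplicate, or an off-by-one in the numbering, which narrows down where the ingestion code assigns `orderIndex`.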

---

## 3. Work Collection

### Current Configuration
- **Vectorizer**: `NONE` ✅
- **Objects**: **0** ❌
- **Schema**: Correct

### 🚨 Critical Issues

#### 3.1 Collection Completely Unused
```
Number of objects: 0
```

**Why does it exist?**
- Defined in `schema.py`
- Never used by `weaviate_ingest.py`

#### 3.2 Total Redundancy
The Work information is **duplicated** in:
1. **Document.author** + **Document.title**
2. **Passage.author** + **Passage.work** (x50)

**Expected solution**: Use Work as the single source, with cross-references.

#### 3.3 Unused Properties
```python
year: INT             # Never populated
edition: TEXT         # Never populated
referenceSystem: TEXT # Never populated
```

---

## 4. Section Collection (Missing!)

### 🚨 Defined but Nonexistent

**In schema.py** (lines 74-120):
```python
client.collections.create(
    name="Section",
    description="A section/chapter with its summary and key concepts...",
    # ...
)
```

**In Weaviate**:
```
Collections: Document, Passage, Work
```

**Section is ABSENT!**

### Impact
- Vectorized chapter summaries are not possible
- The structured hierarchy is lost
- An entire feature is not implemented

---

## 5. Architectural Design Issues

### 5.1 No Relations Between Collections

**Expected** (normalized architecture):
```
Work (1) ──< Document (N) ──< Passage (N)
                          └──< Section (N) ──< Passage (N)
```

**Actual**:
```
Document (1)   [no link]
Passage (50)   [no link]
Work (0)       [empty]
Section        [missing]
```

**Consequence**: It is impossible to navigate between collections

### 5.2 No Cross-References
The Weaviate Python client v4 supports cross-references, but they are **not used**:

```python
# What Passage should declare (wvc is assumed to be weaviate.classes.config):
wvc.ReferenceProperty(
    name="document",
    target_collection="Document",
)
```

### 5.3 Duplication vs Normalization

**Current size (estimated)**:
- Document: 1 × ~500 bytes = 500 B
- Passage: 50 × ~600 bytes = 30 KB
- **Total duplicated**: author (50×) + work (50×) ≈ 2 KB of redundancy

**With normalization**:
- Work: 1 object with author + title
- Passage: UUID reference to Work
- **Savings**: ~1.5 KB + better integrity

---

## 6. Data Analysis

### 6.1 Document "Platon_-_Menon_trad._Cousin"

```json
{
  "title": "Ménon ou de la vertu",
  "author": "Platon",
  "sourceId": "Platon_-_Menon_trad._Cousin",
  "language": "fr",
  "pages": 0.0,          // ❌ Invalid
  "chunksCount": 50.0,   // ✅ But should be INT
  "toc": "[]",           // ❌ Empty
  "hierarchy": "{}",     // ❌ Empty
  "createdAt": "2025-12-09T09:20:30.970580"
}
```

**Problems**:
1. `pages: 0` → The PDF necessarily had pages
2. `toc: "[]"` → The system does extract a TOC (see `llm_toc.py`), so why is it empty?
3. `hierarchy: "{}"` → Same issue: the hierarchy should be populated
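
The three problems above could be caught automatically before a Document is written. The following is a minimal sketch, assuming `toc` and `hierarchy` are stored as JSON-encoded strings as in the dump above; `check_document_metadata` is a hypothetical helper, not a function from the project.

```python
import json


def check_document_metadata(doc: dict) -> list[str]:
    """Return a list of problems found in a Document's metadata (empty if none)."""
    problems = []

    # A processed PDF must report at least one page.
    if not doc.get("pages") or doc["pages"] <= 0:
        problems.append("pages must be > 0")

    # toc and hierarchy are JSON strings; they must parse and must not be empty.
    for field in ("toc", "hierarchy"):
        raw = doc.get(field) or ""
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f"{field} is not valid JSON")
            continue
        if not value:
            problems.append(f"{field} is empty")

    return problems


# Values copied from the Document shown above.
menon = {"pages": 0.0, "chunksCount": 50.0, "toc": "[]", "hierarchy": "{}"}
print(check_document_metadata(menon))
```

Running such a check at ingestion time would turn silent metadata loss into an explicit error.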
|
||||
|
||||
### 6.2 Passage Distribution

**By unitType**:
- main_content: ~25
- argument: ~15
- exposition: ~5
- transition: ~3
- définition: ~2

**By section (top 5)**:
- "SOCRATE": 8 passages
- "MENON": 7 passages
- "Qu'est-ce que la vertu?": 6 passages
- "Vérification de la réminiscence": 5 passages
- "La vertu s'enseigne-t-elle?": 8 passages

**By chapterTitle (top 3)**:
- "Ménon ou de la vertu": 7 passages
- "Présentation": 6 passages
- "La vertu s'enseigne-t-elle?": 8 passages

⚠️ **Confusion**: `section` and `chapterTitle` overlap with no clear logic
|
||||
|
||||
---
|
||||
|
||||
## 7. Gap Between schema.py and the Actual Weaviate State

| Aspect | schema.py | Actual Weaviate | Status |
|--------|-----------|-----------------|--------|
| **Collections** | 4 (Document, Section, Passage, Work) | 3 (Document, Passage, Work) | ❌ Section missing |
| **Document.vectorizer** | NONE | TEXT2VEC_TRANSFORMERS | ❌ Incorrect |
| **Document.createdAt** | DATE | TEXT | ❌ Type lost |
| **Document.skip_vectorization** | Defined | Ignored | ❌ Not applied |
| **Passage properties** | 9 | 12 | ⚠️ 3 added automatically |
| **Section** | Defined | Absent | ❌ Not created |
| **Work objects** | N/A | 0 | ⚠️ Unused |

**Likely cause**: The schema was **never applied** correctly. The collections were created by auto-schema on the first insertion.
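
One way to catch this kind of drift early is to compare the expected schema with what the server reports before any ingestion runs. The sketch below works on plain dictionaries; `diff_schema` and the sample property sets are hypothetical illustrations, and fetching the real schema from Weaviate is deliberately left out.

```python
def diff_schema(expected: dict[str, set[str]], actual: dict[str, set[str]]) -> list[str]:
    """Return readable differences between two {collection: property names} maps."""
    issues = []
    for name in sorted(expected.keys() - actual.keys()):
        issues.append(f"missing collection: {name}")
    for name in sorted(actual.keys() - expected.keys()):
        issues.append(f"unexpected collection: {name}")
    for name in sorted(expected.keys() & actual.keys()):
        for prop in sorted(expected[name] - actual[name]):
            issues.append(f"{name}: missing property {prop}")
        for prop in sorted(actual[name] - expected[name]):
            issues.append(f"{name}: property {prop} not declared in schema")
    return issues


# Illustrative subset only, not the full schema from schema.py.
expected = {
    "Document": {"title", "author"},
    "Section": {"title"},
    "Passage": {"text", "section"},
    "Work": {"title", "author"},
}
actual = {
    "Document": {"title", "author"},
    "Passage": {"text", "section", "chapterTitle"},
    "Work": {"title", "author"},
}

for issue in diff_schema(expected, actual):
    print(issue)
```

Failing fast on a non-empty result would prevent auto-schema from quietly filling in the gaps.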
|
||||
|
||||
---
|
||||
|
||||
## 8. Recommendations

### 8.1 Immediate Actions (Critical)

1. **Drop and recreate the schema**
   ```bash
   python schema.py  # Recreate cleanly
   ```

2. **Verify that Section is created**
   - Add logging to `schema.py`
   - Check with `client.collections.list_all()`

3. **Repair the Document metadata**
   - Populate `toc` with the real data
   - Populate `hierarchy` with the structure
   - Fix `pages` (actual page count of the PDF)

4. **Clean up orphan properties**
   - Either define `chapterTitle`, `chapterConcepts`, `sectionLevel` in the schema
   - Or remove them from the data
|
||||
|
||||
### 8.2 Architectural Improvements

1. **Normalize with Work**
   ```python
   # In Passage, replace author/work with a reference
   # (assuming `import weaviate.classes.config as wvc`):
   wvc.ReferenceProperty(
       name="work_ref",
       target_collection="Work",
   )
   ```

2. **Add a Passage → Document reference (declared on Passage)**
   ```python
   wvc.ReferenceProperty(
       name="document_ref",
       target_collection="Document",
   )
   ```

3. **Implement Section**
   - Create a Section object for each chapter
   - Link Section ← Passage through a reference
   - Add LLM-generated summaries to the sections
|
||||
|
||||
### 8.3 Data Validation

1. **Add constraints**
   - `unitType` → validated enum
   - `orderIndex` → must run from 0 to chunksCount-1
   - `pages` > 0

2. **Normalize keywords**
   - Avoid duplicates
   - Normalize casing
   - Drop empty arrays when unused

3. **Standardize section/chapterTitle**
   - Settle on a single format
   - Separate chapter titles from speaker names
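
The constraints listed above can be expressed as a small pre-insertion check. This is a sketch under stated assumptions: the allowed `unitType` values are simply those observed in section 6.2, and `normalize_keywords` and `validate_passages` are hypothetical helper names.

```python
# Assumption: the enum is the set of unitType values observed in the current data.
ALLOWED_UNIT_TYPES = {"main_content", "argument", "exposition", "transition", "définition"}


def normalize_keywords(keywords: list[str]) -> list[str]:
    """Trim, lowercase, and de-duplicate keywords, keeping first-seen order."""
    seen = set()
    result = []
    for keyword in keywords:
        key = keyword.strip().lower()
        if key and key not in seen:
            seen.add(key)
            result.append(key)
    return result


def validate_passages(passages: list[dict], chunks_count: int) -> list[str]:
    """Check unitType values and that orderIndex covers 0..chunks_count-1 exactly once."""
    errors = []
    for passage in passages:
        if passage.get("unitType") not in ALLOWED_UNIT_TYPES:
            errors.append(
                f"orderIndex {passage.get('orderIndex')}: "
                f"invalid unitType {passage.get('unitType')!r}"
            )
    indexes = sorted(passage.get("orderIndex", -1) for passage in passages)
    if indexes != list(range(chunks_count)):
        errors.append(f"orderIndex values must cover 0..{chunks_count - 1} exactly once")
    return errors


sample = [
    {"unitType": "argument", "orderIndex": 0},
    {"unitType": "quote", "orderIndex": 2},  # wrong type and a gap at index 1
]
print(validate_passages(sample, chunks_count=2))
print(normalize_keywords(["Vertu", " vertu ", "Réminiscence", ""]))
```

Keeping these checks in plain Python means they can run before any object reaches Weaviate.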
|
||||
|
||||
### 8.4 Ingestion Pipeline

**Modify `weaviate_ingest.py`**:

1. Create a **Work** object first
2. Create a **Document** object with a reference to Work
3. Create **Section** objects with references
4. Create **Passages** with references to Document + Section
5. Validate the data before insertion
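
The ordering above matters because each object needs the UUID of the object it points to. The sketch below builds an insertion plan in memory only; `build_ingestion_plan`, the reference names, and the dictionary layout are hypothetical and do not reflect the actual `weaviate_ingest.py`.

```python
import uuid


def build_ingestion_plan(work: dict, document: dict,
                         sections: list[dict], passages: list[dict]) -> list[dict]:
    """Return objects in insertion order, each carrying the UUIDs it references."""
    work_id = str(uuid.uuid4())
    doc_id = str(uuid.uuid4())
    plan = [
        {"collection": "Work", "uuid": work_id, "properties": work, "references": {}},
        {"collection": "Document", "uuid": doc_id, "properties": document,
         "references": {"work": work_id}},
    ]

    # Sections are created before passages so passages can point at them.
    section_ids = {}
    for section in sections:
        section_id = str(uuid.uuid4())
        section_ids[section["title"]] = section_id
        plan.append({"collection": "Section", "uuid": section_id, "properties": section,
                     "references": {"document": doc_id}})

    for passage in passages:
        title = passage["section"]
        # Validation step: refuse passages that point at a section we never created.
        if title not in section_ids:
            raise ValueError(f"passage refers to unknown section {title!r}")
        plan.append({"collection": "Passage", "uuid": str(uuid.uuid4()),
                     "properties": passage,
                     "references": {"document": doc_id, "section": section_ids[title]}})
    return plan


plan = build_ingestion_plan(
    work={"title": "Ménon", "author": "Platon"},
    document={"sourceId": "Platon_-_Menon_trad._Cousin"},
    sections=[{"title": "Présentation"}],
    passages=[{"text": "...", "section": "Présentation"}],
)
print([obj["collection"] for obj in plan])
```

Building the plan first and inserting afterwards keeps validation failures from leaving half-written data behind.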
|
||||
|
||||
---
|
||||
|
||||
## 9. Business Impact

### Current Problems

| Problem | User Impact | Severity |
|---------|-------------|----------|
| Section missing | No navigation by chapter | 🔴 High |
| Empty TOC | The structure cannot be viewed | 🔴 High |
| Work unused | Duplication, no filtering by work | 🟡 Medium |
| Auto-schema | Unpredictable schema, future bugs | 🔴 High |
| Incorrect orderIndex | Passage order may be wrong | 🟡 Medium |

### Benefits of the Fix

1. **Structured navigation** through Section
2. **Optimized search** with cross-references
3. **Rich metadata** (TOC, hierarchy)
4. **Data integrity** with a strict schema
5. **Performance** (less duplication)
|
||||
|
||||
---
|
||||
|
||||
## 10. Proposed Action Plan

### Phase 1: Full Diagnosis (1h)
- [ ] Find out why `schema.py` was not applied
- [ ] Review the insertion logs in `weaviate_ingest.py`
- [ ] Identify when auto-schema was triggered

### Phase 2: Schema Fix (2h)
- [ ] Delete all collections
- [ ] Re-run `schema.py` with logging
- [ ] Verify that all 4 collections exist with the correct schema
- [ ] Test the insertion of a test document

### Phase 3: Data Migration (3h)
- [ ] Export the 50 current passages
- [ ] Create a Work object for "Ménon"
- [ ] Create a Document with TOC/hierarchy populated
- [ ] Create Sections per chapter
- [ ] Re-insert the Passages with references

### Phase 4: Validation (1h)
- [ ] Test queries that use references
- [ ] Verify data integrity
- [ ] Document the new schema
- [ ] Update `README.md`

**Total estimated time**: ~7 hours
|
||||
|
||||
---
|
||||
|
||||
## Conclusion

The current system suffers from a **major desynchronization** between the defined schema and what actually exists in Weaviate. The collections were created by auto-schema instead of the explicit schema, which led to:

1. ❌ Loss of control over types and vectorization
2. ❌ A completely missing Section collection
3. ❌ Massive data duplication
4. ❌ Empty and invalid metadata
5. ❌ No relations between collections

**Priority**: Recreate the schema cleanly and migrate the data to take full advantage of the vector architecture.
|
||||
71
generations/library_rag/examples/KNOWN_ISSUES.md
Normal file
@@ -0,0 +1,71 @@
|
||||
# Known Issues - MCP Client
|
||||
|
||||
## 1. Author/Work Filters Not Supported (Weaviate Limitation)
|
||||
|
||||
**Status:** Known limitation
|
||||
**Affects:** `search_chunks` and `search_summaries` tools
|
||||
**Error:** Results in server error when using `author_filter` or `work_filter` parameters
|
||||
|
||||
**Root Cause:**
|
||||
Weaviate (queried here through the v4 Python client) does not support filtering on nested object properties. The `work` field in the Chunk schema is defined as:

```python
wvc.Property(
    name="work",
    data_type=wvc.DataType.OBJECT,
    nested_properties=[
        wvc.Property(name="title", data_type=wvc.DataType.TEXT),
        wvc.Property(name="author", data_type=wvc.DataType.TEXT),
    ],
)
```

Attempts to filter on `work.author` or `work.title` result in:
```
data type "object" not supported in query
```
|
||||
|
||||
**Workaround:**
|
||||
|
||||
Use the `filter_by_author` tool instead:
|
||||
|
||||
```python
# Instead of:
search_chunks(
    query="nominalism",
    author_filter="Charles Sanders Peirce"  # ❌ Doesn't work
)

# Use:
filter_by_author(
    author="Charles Sanders Peirce"  # ✓ Works
)
```
|
||||
|
||||
Or search without filters and filter client-side:
|
||||
|
||||
```python
results = await client.call_tool("search_chunks", {
    "query": "nominalism",
    "limit": 50  # Fetch more
})

# Filter in Python
filtered = [
    r for r in results["results"]
    if r["work_author"] == "Charles Sanders Peirce"
]
```
|
||||
|
||||
**Future Fix:**
|
||||
|
||||
- Option 1: Add flat properties `workAuthor` and `workTitle` to Chunk schema (requires migration)
- Option 2: Implement post-filtering in Python on the server side
- Option 3: Wait for Weaviate to support nested object filtering
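
Option 2 could look roughly like the sketch below: the server over-fetches from Weaviate, then filters in Python before returning. `post_filter` is a hypothetical helper; the `work_author` key matches the client-side example above, while `work_title` is an assumed counterpart.

```python
from typing import Optional


def post_filter(results: list[dict],
                author: Optional[str] = None,
                work: Optional[str] = None,
                limit: int = 10) -> list[dict]:
    """Keep results matching the optional author/work filters, up to `limit` items."""

    def keep(result: dict) -> bool:
        if author is not None and result.get("work_author") != author:
            return False
        if work is not None and result.get("work_title") != work:
            return False
        return True

    return [r for r in results if keep(r)][:limit]


# Hypothetical search results, as the server might hold them after over-fetching.
raw_results = [
    {"text": "a", "work_author": "Charles Sanders Peirce", "work_title": "Collected Papers"},
    {"text": "b", "work_author": "Platon", "work_title": "Ménon"},
    {"text": "c", "work_author": "Charles Sanders Peirce", "work_title": "Collected Papers"},
]
print(post_filter(raw_results, author="Charles Sanders Peirce", limit=1))
```

The trade-off is that the server must request more than `limit` results up front, otherwise a selective filter can return fewer items than asked for.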
|
||||
|
||||
**Tests Affected:**
|
||||
|
||||
- `test_mcp_client.py::test_search_chunks` - Works without filters
|
||||
- Search with `author_filter` - Currently fails
|
||||
|
||||
**Last Updated:** 2025-12-25
|
||||
165
generations/library_rag/examples/README.md
Normal file
@@ -0,0 +1,165 @@
|
||||
# Library RAG - MCP Client Examples

This folder contains example MCP client implementations for using Library RAG from your Python application.

## MCP Clients with an LLM

### 1. `mcp_client_claude.py` ⭐ RECOMMENDED

**MCP client with Claude (Anthropic)**

**Model:** Claude Sonnet 4.5 (`claude-sonnet-4-5-20250929`)

**Features:**
- Automatic loading of API keys from `.env`
- Automatic tool calling
- Multi-turn conversation handling
- Natural-language synthesis of the results

**Usage:**
```bash
# Make sure .env contains:
# ANTHROPIC_API_KEY=your_key
# MISTRAL_API_KEY=your_key

python examples/mcp_client_claude.py
```

**Example:**
```
User: "What did Peirce say about nominalism?"

Claude → search_chunks(query="Peirce nominalism")
       → Weaviate (BGE-M3 embeddings)
       → 10 chunks returned
Claude → "Peirce characterized nominalism as a 'tidal wave'..."
```

### 2. `mcp_client_reference.py`

**MCP client with Mistral AI**

**Model:** Mistral Large (`mistral-large-latest`)

Same features as the Claude client, but backed by Mistral AI.

**Usage:**
```bash
python examples/mcp_client_reference.py
```
|
||||
|
||||
## Tests

### `test_mcp_quick.py`

Quick test (< 5 seconds) of the MCP features:
- ✅ search_chunks (semantic search)
- ✅ list_documents
- ✅ filter_by_author

```bash
python examples/test_mcp_quick.py
```

### `test_mcp_client.py`

Full test suite for the MCP client (unit tests for the 9 tools).

## Examples without MCP (direct pipeline)

### `example_python_usage.py`

Calls the MCP handlers directly (no subprocess):
```python
from mcp_tools import search_chunks_handler, SearchChunksInput

# Must run inside an async function
result = await search_chunks_handler(
    SearchChunksInput(query="nominalism", limit=10)
)
```

### `example_direct_pipeline.py`

Uses the PDF pipeline directly:
```python
from pathlib import Path

from utils.pdf_pipeline import process_pdf

result = process_pdf(
    Path("document.pdf"),
    use_llm=True,
    ingest_to_weaviate=True
)
```
|
||||
|
||||
## Architecture

```
┌─────────────────────────────────────────┐
│  Your application                       │
│                                         │
│  Claude/Mistral (conversational LLM)    │
│        ↓                                │
│  MCPClient (stdio JSON-RPC)             │
└────────────┬────────────────────────────┘
             ↓
┌─────────────────────────────────────────┐
│  MCP Server (subprocess)                │
│  - 9 tools available                    │
│  - search_chunks, parse_pdf, etc.       │
└────────────┬────────────────────────────┘
             ↓
┌─────────────────────────────────────────┐
│  Weaviate + BGE-M3 embeddings           │
│  - 5,180 Peirce chunks                  │
│  - Semantic search                      │
└─────────────────────────────────────────┘
```
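
The "stdio JSON-RPC" hop in the diagram is newline-delimited JSON written to the server's stdin and read back from its stdout. The sketch below mirrors the framing used by the example clients; the helper names `encode_request`, `encode_notification`, and `decode_response` are hypothetical and exist only for illustration.

```python
import json


def encode_request(request_id: int, method: str, params: dict) -> bytes:
    """Frame a JSON-RPC 2.0 request as one newline-terminated line."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return (json.dumps(message) + "\n").encode()


def encode_notification(method: str, params: dict) -> bytes:
    """Frame a notification: same shape as a request but without an id, so no reply."""
    message = {"jsonrpc": "2.0", "method": method, "params": params}
    return (json.dumps(message) + "\n").encode()


def decode_response(line: bytes) -> dict:
    """Parse one response line and return its result, raising on a JSON-RPC error."""
    message = json.loads(line.decode())
    if "error" in message:
        raise RuntimeError(f"MCP error: {message['error']}")
    return message.get("result", {})


print(encode_request(1, "tools/list", {}))
print(decode_response(b'{"jsonrpc": "2.0", "id": 1, "result": {"tools": []}}\n'))
```

Keeping framing separate from process handling makes the protocol layer testable without starting the MCP server.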
|
||||
|
||||
## Embeddings vs LLM

**Important:** Three distinct models are involved:

1. **BGE-M3** (text2vec-transformers in Weaviate)
   - Role: Vectorization (1024-dim embeddings)
   - When: Ingestion + search
   - Cannot be changed without a migration

2. **Claude/Mistral** (conversational agent)
   - Role: Understand questions + synthesize answers
   - When: Every user conversation
   - Swappable (your choice)

3. **Mistral OCR** (pixtral-12b)
   - Role: Text extraction from PDFs
   - When: PDF ingestion (via the parse_pdf tool)
   - Fixed by the MCP server
|
||||
|
||||
## Available MCP Tools

| Tool | Description |
|------|-------------|
| `search_chunks` | Semantic search (500 max) |
| `search_summaries` | Search within summaries |
| `list_documents` | List all documents |
| `get_document` | Retrieve a specific document |
| `get_chunks_by_document` | Chunks of a document |
| `filter_by_author` | Filter by author |
| `parse_pdf` | Ingest a PDF/Markdown file |
| `delete_document` | Delete a document |
| `ping` | Health check |

## Known Limitations

See `KNOWN_ISSUES.md` for details:
- ⚠️ `author_filter` and `work_filter` do not work (Weaviate nested-object limitation)
- ✅ Workaround: use the `filter_by_author` tool instead

## Requirements

```bash
pip install anthropic python-dotenv  # For Claude
# OR
pip install mistralai                # For Mistral
```

All dependencies are listed in the parent project's `requirements.txt`.
|
||||
91
generations/library_rag/examples/example_direct_pipeline.py
Normal file
@@ -0,0 +1,91 @@
|
||||
#!/usr/bin/env python3
"""
Example of using the PDF pipeline DIRECTLY (without MCP).

Simpler, and gives more control over the parameters!
"""

from pathlib import Path
from utils.pdf_pipeline import process_pdf, process_pdf_bytes
import weaviate
from weaviate.classes.query import Filter


def example_process_local_file():
    """Process a local file (PDF or Markdown)."""

    result = process_pdf(
        pdf_path=Path("md/peirce_collected_papers_fixed.md"),
        output_dir=Path("output"),

        # Customizable parameters
        skip_ocr=True,  # Already Markdown
        use_llm=False,  # No LLM needed for Peirce
        use_semantic_chunking=False,  # Basic chunking (fast)
        ingest_to_weaviate=True,  # Ingest into Weaviate
    )

    if result.get("success"):
        print(f"✓ {result['document_name']}: {result['chunks_count']} chunks")
        print(f"  Total cost: {result['cost_total']:.4f}€")
    else:
        print(f"✗ Error: {result.get('error')}")


def example_process_from_url():
    """Download and process a document from a URL."""

    import httpx

    url = "https://example.com/document.pdf"

    # Download (fail early on HTTP errors instead of processing an error page)
    response = httpx.get(url, follow_redirects=True)
    response.raise_for_status()
    pdf_bytes = response.content

    # Process
    result = process_pdf_bytes(
        file_bytes=pdf_bytes,
        filename="document.pdf",
        output_dir=Path("output"),

        # Recommended parameters
        use_llm=True,
        llm_provider="mistral",  # Or "ollama"
        use_semantic_chunking=True,
        ingest_to_weaviate=True,
    )

    return result


def example_search():
    """Search Weaviate directly."""

    client = weaviate.connect_to_local()

    try:
        collection = client.collections.get('Chunk')

        # Semantic search
        response = collection.query.near_text(
            query="nominalism and realism",
            limit=10,
        )

        print(f"Found {len(response.objects)} results:")
        for obj in response.objects[:3]:
            props = obj.properties
            print(f"\n- {props.get('sectionPath', 'N/A')}")
            print(f"  {props.get('text', '')[:150]}...")

    finally:
        client.close()


if __name__ == "__main__":
    # Pick an example

    # example_process_local_file()
    # example_process_from_url()
    example_search()
|
||||
78
generations/library_rag/examples/example_python_usage.py
Normal file
@@ -0,0 +1,78 @@
|
||||
#!/usr/bin/env python3
"""
Example of using Library RAG from a Python application.

The MCP server is intended for Claude Desktop.
From Python, call the handlers directly!
"""

import asyncio
from pathlib import Path

# Import the handlers directly
from mcp_tools import (
    parse_pdf_handler,
    ParsePdfInput,
    search_chunks_handler,
    SearchChunksInput,
)


async def example_parse_pdf():
    """Example: process a PDF or Markdown file."""

    # From a local path
    input_data = ParsePdfInput(
        pdf_path="C:/Users/david/Documents/platon.pdf"
    )

    # OR from a URL
    # input_data = ParsePdfInput(
    #     pdf_path="https://example.com/aristotle.pdf"
    # )

    # OR a Markdown file
    # input_data = ParsePdfInput(
    #     pdf_path="/path/to/peirce.md"
    # )

    result = await parse_pdf_handler(input_data)

    if result.success:
        print(f"✓ Document processed: {result.document_name}")
        print(f"  Pages: {result.pages}")
        print(f"  Chunks: {result.chunks_count}")
        print(f"  Cost: {result.cost_total:.4f}€")
    else:
        print(f"✗ Error: {result.error}")


async def example_search():
    """Example: search the chunks."""

    input_data = SearchChunksInput(
        query="nominalism and realism",
        limit=10,
        # author_filter is left out on purpose: it currently causes a
        # server error (see examples/KNOWN_ISSUES.md).
    )

    result = await search_chunks_handler(input_data)

    print(f"Found {result.total_count} results:")
    for i, chunk in enumerate(result.results[:5], 1):
        print(f"\n[{i}] Similarity: {chunk.similarity:.3f}")
        print(f"    {chunk.text[:200]}...")


async def main():
    """Main entry point."""

    # Example 1: process a PDF
    # await example_parse_pdf()

    # Example 2: search
    await example_search()


if __name__ == "__main__":
    asyncio.run(main())
|
||||
359
generations/library_rag/examples/mcp_client_claude.py
Normal file
@@ -0,0 +1,359 @@
|
||||
#!/usr/bin/env python3
"""
MCP client for Library RAG with Claude (Anthropic).

Implements an MCP client that lets Claude use
the Library RAG tools through tool calling.

Usage:
    python mcp_client_claude.py

Requirements:
    pip install anthropic python-dotenv
"""

import asyncio
import json
import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any

# Load environment variables from .env
try:
    from dotenv import load_dotenv
    # Load from the parent project's .env
    env_path = Path(__file__).parent.parent / ".env"
    load_dotenv(env_path)
    print(f"[ENV] Loaded environment from {env_path}")
except ImportError:
    print("[ENV] python-dotenv not installed, using system environment variables")
    print("      Install with: pip install python-dotenv")


@dataclass
class ToolDefinition:
    """Definition of an MCP tool."""

    name: str
    description: str
    input_schema: dict[str, Any]
|
||||
|
||||
|
||||
class MCPClient:
    """Client for communicating with the Library RAG MCP server."""

    def __init__(self, server_path: str, env: dict[str, str] | None = None):
        """
        Args:
            server_path: Path to mcp_server.py
            env: Additional environment variables
        """
        self.server_path = server_path
        self.env = env or {}
        self.process = None
        self.request_id = 0

    async def start(self) -> None:
        """Start the MCP server subprocess."""
        print(f"[MCP] Starting server: {self.server_path}")

        # Prepare the environment
        full_env = {**os.environ, **self.env}

        # Start the subprocess
        self.process = await asyncio.create_subprocess_exec(
            sys.executable,
            self.server_path,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            env=full_env,
        )

        # Phase 1: Initialize
        init_result = await self._send_request(
            "initialize",
            {
                "protocolVersion": "2024-11-05",
                "capabilities": {"tools": {}},
                "clientInfo": {"name": "library-rag-client-claude", "version": "1.0.0"},
            },
        )

        print(f"[MCP] Server initialized: {init_result.get('serverInfo', {}).get('name')}")

        # Phase 2: Initialized notification
        await self._send_notification("notifications/initialized", {})

        print("[MCP] Client ready")

    async def _send_request(self, method: str, params: dict) -> dict:
        """Send a JSON-RPC request and wait for the response."""
        self.request_id += 1
        request = {
            "jsonrpc": "2.0",
            "id": self.request_id,
            "method": method,
            "params": params,
        }

        # Send
        request_json = json.dumps(request) + "\n"
        self.process.stdin.write(request_json.encode())
        await self.process.stdin.drain()

        # Receive (assumes the next stdout line is the reply to this request)
        response_line = await self.process.stdout.readline()
        if not response_line:
            raise RuntimeError("MCP server closed connection")

        response = json.loads(response_line.decode())

        # Check for errors
        if "error" in response:
            raise RuntimeError(f"MCP error: {response['error']}")

        return response.get("result", {})

    async def _send_notification(self, method: str, params: dict) -> None:
        """Send a notification (no response expected)."""
        notification = {"jsonrpc": "2.0", "method": method, "params": params}

        notification_json = json.dumps(notification) + "\n"
        self.process.stdin.write(notification_json.encode())
        await self.process.stdin.drain()

    async def list_tools(self) -> list[ToolDefinition]:
        """Get the list of available tools."""
        result = await self._send_request("tools/list", {})
        tools = result.get("tools", [])

        tool_defs = [
            ToolDefinition(
                name=tool["name"],
                description=tool["description"],
                input_schema=tool["inputSchema"],
            )
            for tool in tools
        ]

        print(f"[MCP] Found {len(tool_defs)} tools")
        return tool_defs

    async def call_tool(self, tool_name: str, arguments: dict) -> Any:
        """Call an MCP tool."""
        print(f"[MCP] Calling tool: {tool_name}")
        print(f"      Arguments: {json.dumps(arguments, indent=2)[:200]}...")

        result = await self._send_request(
            "tools/call", {"name": tool_name, "arguments": arguments}
        )

        # Extract the content
        content = result.get("content", [])
        if content and content[0].get("type") == "text":
            text_content = content[0]["text"]
            try:
                return json.loads(text_content)
            except json.JSONDecodeError:
                return text_content

        return result

    async def stop(self) -> None:
        """Stop the MCP server."""
        if self.process:
            print("[MCP] Stopping server...")
            self.process.terminate()
            await self.process.wait()
            print("[MCP] Server stopped")
|
||||
|
||||
|
||||
class ClaudeWithMCP:
    """Claude with the ability to use MCP tools."""

    def __init__(self, mcp_client: MCPClient, anthropic_api_key: str):
        """
        Args:
            mcp_client: Initialized MCP client
            anthropic_api_key: Anthropic API key
        """
        self.mcp_client = mcp_client
        self.anthropic_api_key = anthropic_api_key
        self.tools = None
        self.messages = []

        # Import Claude
        try:
            from anthropic import Anthropic

            self.client = Anthropic(api_key=anthropic_api_key)
        except ImportError as exc:
            raise ImportError("Install anthropic: pip install anthropic") from exc

    async def initialize(self) -> None:
        """Load the MCP tools and convert them for Claude."""
        mcp_tools = await self.mcp_client.list_tools()

        # Convert to Claude's format (same fields as MCP, with input_schema in snake_case)
        self.tools = [
            {
                "name": tool.name,
                "description": tool.description,
                "input_schema": tool.input_schema,
            }
            for tool in mcp_tools
        ]

        print(f"[Claude] Loaded {len(self.tools)} tools")

    async def chat(self, user_message: str, max_iterations: int = 10) -> str:
        """
        Converse with Claude, which may call the MCP tools.

        Args:
            user_message: The user's message
            max_iterations: Upper bound on tool-calling rounds

        Returns:
            Claude's final response
        """
        print(f"\n[USER] {user_message}\n")

        self.messages.append({"role": "user", "content": user_message})

        for iteration in range(max_iterations):
            print(f"[Claude] Iteration {iteration + 1}/{max_iterations}")

            # Call Claude with tools
            response = self.client.messages.create(
                model="claude-sonnet-4-5-20250929",  # Claude Sonnet 4.5
                max_tokens=4096,
                messages=self.messages,
                tools=self.tools,
            )

            # Append Claude's response
            assistant_message = {
                "role": "assistant",
                "content": response.content,
            }
            self.messages.append(assistant_message)

            # Check whether Claude wants to use tools
            tool_uses = [block for block in response.content if block.type == "tool_use"]

            # No tool use → final response
            if not tool_uses:
                # Extract the text of the response
                text_blocks = [block for block in response.content if block.type == "text"]
                if text_blocks:
                    print("[Claude] Final response")
                    return text_blocks[0].text
                return ""

            # Execute the tool uses
            print(f"[Claude] Tool uses: {len(tool_uses)}")

            tool_results = []

            for tool_use in tool_uses:
                tool_name = tool_use.name
                arguments = tool_use.input

                # Call through MCP
                try:
                    result = await self.mcp_client.call_tool(tool_name, arguments)
                    # Serialize dicts and lists as JSON; fall back to str() for plain text
                    result_str = json.dumps(result) if isinstance(result, (dict, list)) else str(result)
                    print(f"[MCP] Result: {result_str[:200]}...")

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tool_use.id,
                        "content": result_str,
                    })

                except Exception as e:
                    print(f"[MCP] Error: {e}")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tool_use.id,
                        "content": json.dumps({"error": str(e)}),
                        "is_error": True,
                    })

            # Append the tool results
            self.messages.append({
                "role": "user",
                "content": tool_results,
            })

        return "Max iterations reached"
|
||||
|
||||
|
||||
async def main():
    """Example usage of the MCP client with Claude."""

    # Configuration
    library_rag_path = Path(__file__).parent.parent
    server_path = library_rag_path / "mcp_server.py"

    anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
    if not anthropic_api_key:
        print("ERROR: ANTHROPIC_API_KEY not found in .env file")
        print("Please add to .env: ANTHROPIC_API_KEY=your_key")
        return

    mistral_api_key = os.getenv("MISTRAL_API_KEY")
    if not mistral_api_key:
        print("ERROR: MISTRAL_API_KEY not found in .env file")
        print("The MCP server needs Mistral API for OCR functionality")
        return

    # 1. Create and start the MCP client
    mcp_client = MCPClient(
        server_path=str(server_path),
        env={
            "MISTRAL_API_KEY": mistral_api_key or "",
        },
    )

    try:
        await mcp_client.start()

        # 2. Create the Claude agent
        agent = ClaudeWithMCP(mcp_client, anthropic_api_key)
        await agent.initialize()

        # 3. Example conversations
        print("\n" + "=" * 80)
        print("EXAMPLE 1: Search in Peirce")
        print("=" * 80)

        response = await agent.chat(
            "What did Charles Sanders Peirce say about the philosophical debate "
            "between nominalism and realism? Search the database and provide "
            "a detailed summary with specific quotes."
        )

        print(f"\n[CLAUDE]\n{response}\n")

        print("\n" + "=" * 80)
        print("EXAMPLE 2: Explore database")
        print("=" * 80)

        response = await agent.chat(
            "What documents are available in the database? "
            "Give me an overview of the authors and topics covered."
        )

        print(f"\n[CLAUDE]\n{response}\n")

    finally:
        await mcp_client.stop()


if __name__ == "__main__":
    asyncio.run(main())
|
||||
347
generations/library_rag/examples/mcp_client_reference.py
Normal file
@@ -0,0 +1,347 @@
#!/usr/bin/env python3
"""
Reference MCP client for Library RAG.

Complete implementation of an MCP client that lets an LLM
use the Library RAG tools.

Usage:
    python mcp_client_reference.py

Requirements:
    pip install mistralai anyio
"""

import asyncio
import json
import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class ToolDefinition:
    """Definition of an MCP tool."""

    name: str
    description: str
    input_schema: dict[str, Any]


class MCPClient:
    """Client for communicating with the Library RAG MCP server."""

    def __init__(self, server_path: str, env: dict[str, str] | None = None):
        """
        Args:
            server_path: Path to mcp_server.py
            env: Additional environment variables
        """
        self.server_path = server_path
        self.env = env or {}
        self.process = None
        self.request_id = 0

    async def start(self) -> None:
        """Start the MCP server subprocess."""
        print(f"[MCP] Starting server: {self.server_path}")

        # Prepare the environment
        full_env = {**os.environ, **self.env}

        # Start the subprocess
        self.process = await asyncio.create_subprocess_exec(
            sys.executable,  # Python executable
            self.server_path,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            env=full_env,
        )

        # Phase 1: Initialize
        init_result = await self._send_request(
            "initialize",
            {
                "protocolVersion": "2024-11-05",
                "capabilities": {"tools": {}},
                "clientInfo": {"name": "library-rag-client", "version": "1.0.0"},
            },
        )

        print(f"[MCP] Server initialized: {init_result.get('serverInfo', {}).get('name')}")

        # Phase 2: Initialized notification
        await self._send_notification("notifications/initialized", {})

        print("[MCP] Client ready")

    async def _send_request(self, method: str, params: dict) -> dict:
        """Send a JSON-RPC request and wait for the response."""
        self.request_id += 1
        request = {
            "jsonrpc": "2.0",
            "id": self.request_id,
            "method": method,
            "params": params,
        }

        # Send (one JSON message per line)
        request_json = json.dumps(request) + "\n"
        self.process.stdin.write(request_json.encode())
        await self.process.stdin.drain()

        # Receive
        response_line = await self.process.stdout.readline()
        if not response_line:
            raise RuntimeError("MCP server closed connection")

        response = json.loads(response_line.decode())

        # Check for errors
        if "error" in response:
            raise RuntimeError(f"MCP error: {response['error']}")

        return response.get("result", {})

    async def _send_notification(self, method: str, params: dict) -> None:
        """Send a notification (no response expected)."""
        notification = {"jsonrpc": "2.0", "method": method, "params": params}

        notification_json = json.dumps(notification) + "\n"
        self.process.stdin.write(notification_json.encode())
        await self.process.stdin.drain()

    async def list_tools(self) -> list[ToolDefinition]:
        """Get the list of available tools."""
        result = await self._send_request("tools/list", {})
        tools = result.get("tools", [])

        tool_defs = [
            ToolDefinition(
                name=tool["name"],
                description=tool["description"],
                input_schema=tool["inputSchema"],
            )
            for tool in tools
        ]

        print(f"[MCP] Found {len(tool_defs)} tools")
        return tool_defs

    async def call_tool(self, tool_name: str, arguments: dict) -> Any:
        """Call an MCP tool."""
        print(f"[MCP] Calling tool: {tool_name}")
        print(f" Arguments: {json.dumps(arguments, indent=2)}")

        result = await self._send_request(
            "tools/call", {"name": tool_name, "arguments": arguments}
        )

        # Extract the content: parse the first text item as JSON when possible
        content = result.get("content", [])
        if content and content[0].get("type") == "text":
            text_content = content[0]["text"]
            try:
                return json.loads(text_content)
            except json.JSONDecodeError:
                return text_content

        return result

    async def stop(self) -> None:
        """Stop the MCP server."""
        if self.process:
            print("[MCP] Stopping server...")
            self.process.terminate()
            await self.process.wait()
            print("[MCP] Server stopped")


class LLMWithMCP:
    """LLM able to use the MCP tools."""

    def __init__(self, mcp_client: MCPClient, mistral_api_key: str):
        """
        Args:
            mcp_client: Initialized MCP client
            mistral_api_key: Mistral API key
        """
        self.mcp_client = mcp_client
        self.mistral_api_key = mistral_api_key
        self.tools = None
        self.messages = []

        # Import Mistral lazily so the MCP client can be used without it
        try:
            from mistralai import Mistral

            self.mistral = Mistral(api_key=mistral_api_key)
        except ImportError as exc:
            raise ImportError("Install mistralai: pip install mistralai") from exc

    async def initialize(self) -> None:
        """Load the MCP tools and convert them for Mistral."""
        mcp_tools = await self.mcp_client.list_tools()

        # Convert to the Mistral function-calling format
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": tool.input_schema,
                },
            }
            for tool in mcp_tools
        ]

        print(f"[LLM] Loaded {len(self.tools)} tools for Mistral")

    async def chat(self, user_message: str, max_iterations: int = 10) -> str:
        """
        Chat with the LLM, which can use the MCP tools.

        Args:
            user_message: User message
            max_iterations: Maximum number of LLM round trips (tool-call iterations)

        Returns:
            Final LLM response
        """
        print(f"\n[USER] {user_message}\n")

        self.messages.append({"role": "user", "content": user_message})

        for iteration in range(max_iterations):
            print(f"[LLM] Iteration {iteration + 1}/{max_iterations}")

            # LLM call with tools
            response = self.mistral.chat.complete(
                model="mistral-large-latest",
                messages=self.messages,
                tools=self.tools,
                tool_choice="auto",
            )

            assistant_message = response.choices[0].message

            # Append the assistant message
            self.messages.append(
                {
                    "role": "assistant",
                    "content": assistant_message.content or "",
                    "tool_calls": (
                        [
                            {
                                "id": tc.id,
                                "type": "function",
                                "function": {
                                    "name": tc.function.name,
                                    "arguments": tc.function.arguments,
                                },
                            }
                            for tc in assistant_message.tool_calls
                        ]
                        if assistant_message.tool_calls
                        else None
                    ),
                }
            )

            # No tool calls means this is the final response
            if not assistant_message.tool_calls:
                print("[LLM] Final response")
                return assistant_message.content

            # Execute the tool calls
            print(f"[LLM] Tool calls: {len(assistant_message.tool_calls)}")

            for tool_call in assistant_message.tool_calls:
                tool_name = tool_call.function.name
                arguments = json.loads(tool_call.function.arguments)

                # Call through MCP
                try:
                    result = await self.mcp_client.call_tool(tool_name, arguments)
                    result_str = json.dumps(result)
                    print(f"[MCP] Result: {result_str[:200]}...")

                except Exception as e:
                    result_str = json.dumps({"error": str(e)})
                    print(f"[MCP] Error: {e}")

                # Append the tool result
                self.messages.append(
                    {
                        "role": "tool",
                        "name": tool_name,
                        "content": result_str,
                        "tool_call_id": tool_call.id,
                    }
                )

        return "Max iterations reached"


async def main():
    """Example usage of the MCP client."""

    # Configuration
    library_rag_path = Path(__file__).parent.parent
    server_path = library_rag_path / "mcp_server.py"

    mistral_api_key = os.getenv("MISTRAL_API_KEY")
    if not mistral_api_key:
        print("ERROR: MISTRAL_API_KEY not set")
        return

    # 1. Create and start the MCP client
    mcp_client = MCPClient(
        server_path=str(server_path),
        env={
            "MISTRAL_API_KEY": mistral_api_key,
            # Add other variables if needed
        },
    )

    try:
        await mcp_client.start()

        # 2. Create the LLM agent
        agent = LLMWithMCP(mcp_client, mistral_api_key)
        await agent.initialize()

        # 3. Example conversations
        print("\n" + "=" * 80)
        print("EXAMPLE 1: Search")
        print("=" * 80)

        response = await agent.chat(
            "What did Charles Sanders Peirce say about the debate between "
            "nominalism and realism? Search the database and give me a summary "
            "with specific quotes."
        )

        print(f"\n[ASSISTANT]\n{response}\n")

        print("\n" + "=" * 80)
        print("EXAMPLE 2: List documents")
        print("=" * 80)

        response = await agent.chat(
            "List all the documents in the database. "
            "How many are there and who are the authors?"
        )

        print(f"\n[ASSISTANT]\n{response}\n")

    finally:
        await mcp_client.stop()


if __name__ == "__main__":
    asyncio.run(main())
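The client above sends each JSON-RPC message as a single newline-terminated JSON line on the server's stdin and reads one line back per request. A minimal sketch of that framing, separated from the subprocess handling, might look like the following. The helper names `encode_request`, `encode_notification`, and `decode_response` are hypothetical and introduced only for illustration; they are not part of this commit.

```python
import json
from typing import Any


def encode_request(request_id: int, method: str, params: dict[str, Any]) -> bytes:
    """Serialize a JSON-RPC 2.0 request as one newline-terminated line."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return (json.dumps(message) + "\n").encode()


def encode_notification(method: str, params: dict[str, Any]) -> bytes:
    """Serialize a notification: same shape, but no "id", so no reply is expected."""
    message = {"jsonrpc": "2.0", "method": method, "params": params}
    return (json.dumps(message) + "\n").encode()


def decode_response(line: bytes) -> dict[str, Any]:
    """Parse one response line and raise if it carries a JSON-RPC error object."""
    response = json.loads(line.decode())
    if "error" in response:
        raise RuntimeError(f"MCP error: {response['error']}")
    return response.get("result", {})
```

Keeping the framing in pure functions like these would let it be checked without starting the server; the reference client inlines the same logic in `_send_request` and `_send_notification`.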
192
generations/library_rag/examples/test_mcp_client.py
Normal file
@@ -0,0 +1,192 @@
#!/usr/bin/env python3
"""
Simple test of the MCP client (without an LLM).

Tests direct communication with the MCP server.

Usage:
    python test_mcp_client.py
"""

import asyncio
import json
import os
import sys
from pathlib import Path

# Add this script's directory to sys.path so mcp_client_reference can be imported
sys.path.insert(0, str(Path(__file__).parent))

from mcp_client_reference import MCPClient


async def test_basic_communication():
    """Test: basic communication with the server."""
    print("TEST 1: Basic Communication")
    print("-" * 80)

    library_rag_path = Path(__file__).parent.parent
    server_path = library_rag_path / "mcp_server.py"

    client = MCPClient(
        server_path=str(server_path),
        env={"MISTRAL_API_KEY": os.getenv("MISTRAL_API_KEY", "")},
    )

    try:
        await client.start()
        print("[OK] Server started\n")

        # List the tools
        tools = await client.list_tools()
        print(f"[OK] Found {len(tools)} tools:")
        for tool in tools:
            print(f" - {tool.name}: {tool.description}")

        print("\n[OK] Test passed")

    finally:
        await client.stop()


async def test_search_chunks():
    """Test: semantic search."""
    print("\n\nTEST 2: Search Chunks")
    print("-" * 80)

    library_rag_path = Path(__file__).parent.parent
    server_path = library_rag_path / "mcp_server.py"

    client = MCPClient(
        server_path=str(server_path),
        env={"MISTRAL_API_KEY": os.getenv("MISTRAL_API_KEY", "")},
    )

    try:
        await client.start()

        # Search
        # Note: test_mcp_quick.py reports that author_filter is not supported
        # (see examples/KNOWN_ISSUES.md), so this call may fail on the server side.
        result = await client.call_tool(
            "search_chunks",
            {
                "query": "nominalism and realism",
                "limit": 3,
                "author_filter": "Charles Sanders Peirce",
            },
        )

        print("[OK] Query: nominalism and realism")
        print(f"[OK] Found {result['total_count']} results")

        for i, chunk in enumerate(result["results"][:3], 1):
            print(f"\n [{i}] Similarity: {chunk['similarity']:.3f}")
            print(f" Section: {chunk['section_path']}")
            print(f" Preview: {chunk['text'][:150]}...")

        print("\n[OK] Test passed")

    finally:
        await client.stop()


async def test_list_documents():
    """Test: list documents."""
    print("\n\nTEST 3: List Documents")
    print("-" * 80)

    library_rag_path = Path(__file__).parent.parent
    server_path = library_rag_path / "mcp_server.py"

    client = MCPClient(
        server_path=str(server_path),
        env={"MISTRAL_API_KEY": os.getenv("MISTRAL_API_KEY", "")},
    )

    try:
        await client.start()

        result = await client.call_tool("list_documents", {"limit": 10})

        print(f"[OK] Total documents: {result['total_count']}")

        for doc in result["documents"][:5]:
            print(f"\n - {doc['source_id']}")
            print(f" Author: {doc['author']}")
            print(f" Chunks: {doc['chunks_count']}")

        print("\n[OK] Test passed")

    finally:
        await client.stop()


async def test_get_document():
    """Test: retrieve a specific document."""
    print("\n\nTEST 4: Get Document")
    print("-" * 80)

    library_rag_path = Path(__file__).parent.parent
    server_path = library_rag_path / "mcp_server.py"

    client = MCPClient(
        server_path=str(server_path),
        env={"MISTRAL_API_KEY": os.getenv("MISTRAL_API_KEY", "")},
    )

    try:
        await client.start()

        # List first to find a document
        list_result = await client.call_tool("list_documents", {"limit": 1})

        if list_result["documents"]:
            doc_id = list_result["documents"][0]["source_id"]

            # Retrieve the document
            result = await client.call_tool(
                "get_document",
                {"source_id": doc_id, "include_chunks": True, "chunk_limit": 5},
            )

            print(f"[OK] Document: {result['source_id']}")
            print(f" Author: {result['author']}")
            print(f" Pages: {result['pages']}")
            print(f" Chunks: {result['chunks_count']}")

            if result.get("chunks"):
                print("\n First chunk preview:")
                print(f" {result['chunks'][0]['text'][:200]}...")

            print("\n[OK] Test passed")
        else:
            print("[WARN] No documents in database")

    finally:
        await client.stop()


async def main():
    """Run all tests."""
    print("=" * 80)
    print("MCP CLIENT TESTS")
    print("=" * 80)

    try:
        await test_basic_communication()
        await test_search_chunks()
        await test_list_documents()
        await test_get_document()

        print("\n" + "=" * 80)
        print("ALL TESTS PASSED [OK]")
        print("=" * 80)

    except Exception as e:
        print(f"\n[ERROR] Test failed: {e}")
        import traceback

        traceback.print_exc()


if __name__ == "__main__":
    asyncio.run(main())
62
generations/library_rag/examples/test_mcp_quick.py
Normal file
@@ -0,0 +1,62 @@
import asyncio
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent))

from mcp_client_reference import MCPClient

async def main():
    client = MCPClient(server_path=str(Path(__file__).parent.parent / "mcp_server.py"), env={})

    await client.start()

    try:
        print("=" * 70)
        print("MCP CLIENT - FUNCTIONAL TESTS")
        print("=" * 70)

        # Test 1: Search chunks
        print("\n[TEST 1] Search chunks (semantic search)")
        result = await client.call_tool("search_chunks", {
            "query": "nominalism realism debate",
            "limit": 2
        })

        print(f"Results: {result['total_count']}")
        for i, chunk in enumerate(result['results'], 1):
            print(f" [{i}] {chunk['work_author']} - Similarity: {chunk['similarity']:.3f}")
            print(f" {chunk['text'][:80]}...")
        print("[OK]")

        # Test 2: List documents
        print("\n[TEST 2] List documents")
        result = await client.call_tool("list_documents", {"limit": 5})

        print(f"Total: {result['total_count']} documents")
        for doc in result['documents'][:3]:
            print(f" - {doc['source_id']} ({doc['work_author']}): {doc['chunks_count']} chunks")
        print("[OK]")

        # Test 3: Filter by author
        print("\n[TEST 3] Filter by author")
        result = await client.call_tool("filter_by_author", {
            "author": "Charles Sanders Peirce"
        })

        print(f"Author: {result['author']}")
        print(f"Works: {result['total_works']}")
        print(f"Documents: {result['total_documents']}")
        if 'total_chunks' in result:
            print(f"Chunks: {result['total_chunks']}")
        print("[OK]")

        print("\n" + "=" * 70)
        print("ALL TESTS PASSED - MCP CLIENT IS WORKING!")
        print("=" * 70)
        print("\nNote: author_filter and work_filter parameters are not supported")
        print(" due to Weaviate v4 limitation. See examples/KNOWN_ISSUES.md")

    finally:
        await client.stop()

# Guard the entry point so importing this module (e.g. by a test collector)
# does not start the server.
if __name__ == "__main__":
    asyncio.run(main())
154
generations/library_rag/init.bat
Normal file
@@ -0,0 +1,154 @@
@echo off
REM ============================================================================
REM Library RAG MCP Server - Development Environment Setup (Windows)
REM ============================================================================
REM This script sets up and starts the development environment for the
REM Library RAG MCP Server project.
REM
REM Usage:
REM   init.bat          - Full setup (venv, deps, docker, verify)
REM   init.bat --quick  - Quick start (docker only, assumes deps installed)
REM
REM Requirements:
REM   - Python 3.10+
REM   - Docker Desktop
REM   - Git
REM ============================================================================

setlocal enabledelayedexpansion

echo.
echo ============================================
echo Library RAG MCP Server - Setup
echo ============================================
echo.

REM Check for quick mode
set QUICK_MODE=false
if "%1"=="--quick" set QUICK_MODE=true

REM Check prerequisites
echo Checking prerequisites...

where python >nul 2>&1
if %errorlevel% neq 0 (
    echo [ERROR] Python is not installed or not in PATH
    exit /b 1
)
echo [OK] Python is installed

where docker >nul 2>&1
if %errorlevel% neq 0 (
    echo [ERROR] Docker is not installed or not in PATH
    exit /b 1
)
echo [OK] Docker is installed

REM Check Docker is running
docker info >nul 2>&1
if %errorlevel% neq 0 (
    echo [ERROR] Docker is not running. Please start Docker Desktop.
    exit /b 1
)
echo [OK] Docker is running

if "%QUICK_MODE%"=="false" (
    echo.
    echo Setting up Python virtual environment...

    if not exist "venv" (
        echo Creating virtual environment...
        python -m venv venv
    )
    echo [OK] Virtual environment exists

    REM Activate venv
    call venv\Scripts\activate.bat
    echo [OK] Virtual environment activated

    echo.
    echo Installing Python dependencies...
    pip install --upgrade pip -q
    pip install -r requirements.txt -q
    echo [OK] Dependencies installed
)

REM Check for .env file
echo.
echo Checking environment configuration...
if not exist ".env" (
    if exist ".env.example" (
        echo [WARN] .env file not found. Copying from .env.example...
        copy .env.example .env >nul
        echo [WARN] Please edit .env and add your MISTRAL_API_KEY
    ) else (
        echo [ERROR] .env file not found. Create it with MISTRAL_API_KEY=your-key
        exit /b 1
    )
) else (
    echo [OK] .env file exists
)

REM Start Docker services
echo.
echo Starting Docker services (Weaviate + Transformers)...
docker compose up -d

REM Wait for Weaviate to be ready
echo.
echo Waiting for Weaviate to be ready...
set RETRY_COUNT=0
set MAX_RETRIES=30

:wait_loop
curl -s http://localhost:8080/v1/.well-known/ready >nul 2>&1
if %errorlevel% equ 0 (
    echo [OK] Weaviate is ready!
    goto weaviate_ready
)

set /a RETRY_COUNT+=1
if %RETRY_COUNT% geq %MAX_RETRIES% (
    echo [ERROR] Weaviate failed to start. Check docker compose logs.
    exit /b 1
)

echo|set /p="."
timeout /t 2 /nobreak >nul
goto wait_loop

:weaviate_ready

REM Initialize Weaviate schema
if exist "schema_v2.py" (
    echo.
    echo Initializing Weaviate schema...
    python schema_v2.py 2>nul
    echo [OK] Schema initialized
)

echo.
echo ============================================
echo Setup Complete!
echo ============================================
echo.
echo Services running:
echo   - Weaviate: http://localhost:8080
echo   - Transformers: Running (internal)
echo.
echo Quick commands:
echo   - Run MCP server: python mcp_server.py
echo   - Run Flask app: python flask_app.py
echo   - Run tests: pytest tests\ -v
echo   - Type check: mypy . --strict
echo   - Stop services: docker compose down
echo.
echo Configuration:
echo   - Edit .env file to configure API keys and settings
echo   - See MCP_README.md for MCP server documentation
echo.
echo Note: For MCP server, configure Claude Desktop with:
echo   %%APPDATA%%\Claude\claude_desktop_config.json
echo.

endlocal
131
generations/library_rag/init.sh
Normal file
@@ -0,0 +1,131 @@
#!/bin/bash
# Library RAG - Type Safety & Documentation Enhancement
# Initialization script for development environment

set -e # Exit on error

echo "=========================================="
echo "Library RAG - Development Environment Setup"
echo "=========================================="
echo ""

# Check Python version
echo "[1/5] Checking Python version..."
python_version=$(python --version 2>&1)
echo " Found: $python_version"
if ! python -c "import sys; assert sys.version_info >= (3, 10), 'Python 3.10+ required'" 2>/dev/null; then
    echo " ERROR: Python 3.10 or higher is required"
    exit 1
fi
echo " OK"
echo ""

# Create virtual environment if it doesn't exist
echo "[2/5] Setting up virtual environment..."
if [ ! -d "venv" ]; then
    echo " Creating virtual environment..."
    python -m venv venv
    echo " Created venv/"
else
    echo " Virtual environment already exists"
fi

# Activate virtual environment
if [[ "$OSTYPE" == "msys" || "$OSTYPE" == "win32" ]]; then
    source venv/Scripts/activate
else
    source venv/bin/activate
fi
echo " Activated virtual environment"
echo ""

# Install dependencies
echo "[3/5] Installing Python dependencies..."
pip install --quiet --upgrade pip
pip install --quiet -r requirements.txt
# Install type checking tools
pip install --quiet mypy types-Flask pydocstyle
echo " Installed all dependencies"
echo ""

# Start Docker containers
echo "[4/5] Starting Weaviate with Docker Compose..."
if command -v docker &> /dev/null; then
    if docker compose version &> /dev/null; then
        docker compose up -d
        echo " Weaviate is starting..."
        echo " Waiting for Weaviate to be ready..."
        sleep 5

        # Check if Weaviate is ready
        max_attempts=30
        attempt=0
        while [ $attempt -lt $max_attempts ]; do
            if curl -s http://localhost:8080/v1/.well-known/ready > /dev/null 2>&1; then
                echo " Weaviate is ready!"
                break
            fi
            attempt=$((attempt + 1))
            sleep 2
        done

        if [ $attempt -eq $max_attempts ]; then
            echo " WARNING: Weaviate may not be ready yet. Check 'docker compose logs'"
        fi
    else
        echo " WARNING: Docker Compose not found. Please install Docker Desktop."
        echo " Weaviate will need to be started manually: docker compose up -d"
    fi
else
    echo " WARNING: Docker not found. Please install Docker Desktop."
    echo " Weaviate will need to be started manually: docker compose up -d"
fi
echo ""

# Create Weaviate schema if needed
echo "[5/5] Initializing Weaviate schema..."
python -c "
import weaviate
try:
    client = weaviate.connect_to_local()
    collections = client.collections.list_all()
    if 'Chunk' in collections:
        print(' Schema already exists')
    else:
        print(' Creating schema...')
        import schema
        print(' Schema created successfully')
    client.close()
except Exception as e:
    print(f' Note: Could not connect to Weaviate: {e}')
    print(' Run this script again after Weaviate is ready, or run: python schema.py')
" 2>/dev/null || echo " Schema setup will be done on first Flask app run"
echo ""

# Print summary
echo "=========================================="
echo "Setup Complete!"
echo "=========================================="
echo ""
echo "To start the Flask application:"
echo " python flask_app.py"
echo ""
echo "The application will be available at:"
echo " http://localhost:5000"
echo ""
echo "Useful commands:"
echo " - Run type checks: mypy --strict ."
echo " - Run tests: pytest tests/ -v"
echo " - Check docstrings: pydocstyle --convention=google ."
echo " - View Weaviate: http://localhost:8080/v1"
echo ""
echo "For development with Ollama (free LLM):"
echo " 1. Install Ollama: https://ollama.ai"
echo " 2. Pull model: ollama pull qwen2.5:7b"
echo " 3. Start Ollama: ollama serve"
echo ""
echo "For MCP Server (Claude Desktop integration):"
echo " 1. Run MCP server: python mcp_server.py"
echo " 2. Configure Claude Desktop (see MCP_README.md)"
echo " 3. Use parse_pdf and search tools from Claude"
echo ""
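Both init scripts wait for Weaviate by polling `http://localhost:8080/v1/.well-known/ready` up to 30 times with a 2-second pause. The same loop can be sketched in Python, for instance for use from a test fixture. This is only an illustration under stated assumptions: `weaviate_probe` and `wait_until_ready` are hypothetical names that do not exist in the repository, and the probe is passed in as a parameter so the loop can be exercised without a running server.

```python
import time
import urllib.request
from typing import Callable


def weaviate_probe(url: str = "http://localhost:8080/v1/.well-known/ready") -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_attempts: int = 30,
    delay: float = 2.0,
) -> bool:
    """Call `probe` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < max_attempts - 1:
            time.sleep(delay)
    return False
```

A caller could use `wait_until_ready(weaviate_probe)` and treat a `False` result the way `init.bat` does (abort) or the way `init.sh` does (warn and continue).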
123
generations/library_rag/mcp_config.py
Normal file
@@ -0,0 +1,123 @@
"""
Configuration management for Library RAG MCP Server.

Loads and validates environment variables for the MCP server.
"""

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Literal

from dotenv import load_dotenv


@dataclass
class MCPConfig:
    """
    Configuration for Library RAG MCP Server.

    Attributes:
        mistral_api_key: API key for Mistral OCR and LLM services.
        ollama_base_url: Base URL for Ollama local LLM server.
        structure_llm_model: Model name for LLM processing.
        structure_llm_temperature: Temperature for LLM generation.
        weaviate_host: Weaviate server hostname.
        weaviate_port: Weaviate server port.
        log_level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
        default_llm_provider: Default LLM provider ("ollama" or "mistral").
        output_dir: Base directory for processed files.
    """

    # Required
    mistral_api_key: str

    # LLM Configuration
    ollama_base_url: str = "http://localhost:11434"
    structure_llm_model: str = "deepseek-r1:14b"
    structure_llm_temperature: float = 0.2
    default_llm_provider: Literal["ollama", "mistral"] = "ollama"

    # Weaviate Configuration
    weaviate_host: str = "localhost"
    weaviate_port: int = 8080

    # Logging
    log_level: str = "INFO"

    # File System
    output_dir: Path = Path("output")

    @classmethod
    def from_env(cls) -> "MCPConfig":
        """
        Load configuration from environment variables.

        Returns:
            MCPConfig instance populated from .env file.

        Raises:
            ValueError: If required environment variables are missing.
        """
        # Load .env file
        load_dotenv()

        # Required variables
        mistral_api_key = os.getenv("MISTRAL_API_KEY")
        if not mistral_api_key:
            raise ValueError(
                "MISTRAL_API_KEY environment variable is required. "
                "Please set it in your .env file."
            )

        # Optional variables with defaults
        return cls(
            mistral_api_key=mistral_api_key,
            ollama_base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
            structure_llm_model=os.getenv("STRUCTURE_LLM_MODEL", "deepseek-r1:14b"),
            structure_llm_temperature=float(
                os.getenv("STRUCTURE_LLM_TEMPERATURE", "0.2")
            ),
            default_llm_provider=os.getenv("DEFAULT_LLM_PROVIDER", "ollama"),  # type: ignore
            weaviate_host=os.getenv("WEAVIATE_HOST", "localhost"),
            weaviate_port=int(os.getenv("WEAVIATE_PORT", "8080")),
            log_level=os.getenv("LOG_LEVEL", "INFO"),
            output_dir=Path(os.getenv("OUTPUT_DIR", "output")),
        )

    @property
    def weaviate_url(self) -> str:
        """Get full Weaviate URL."""
        return f"http://{self.weaviate_host}:{self.weaviate_port}"

    def validate(self) -> None:
        """
        Validate configuration values.

        Raises:
            ValueError: If configuration is invalid.
        """
        # Validate LLM provider
        if self.default_llm_provider not in ("ollama", "mistral"):
            raise ValueError(
                f"Invalid LLM provider: {self.default_llm_provider}. "
                "Must be 'ollama' or 'mistral'."
            )

        # Validate log level
        valid_log_levels = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")
        if self.log_level.upper() not in valid_log_levels:
            raise ValueError(
                f"Invalid log level: {self.log_level}. "
                f"Must be one of {valid_log_levels}."
            )

        # Validate temperature
        if not 0.0 <= self.structure_llm_temperature <= 2.0:
            raise ValueError(
                f"Invalid temperature: {self.structure_llm_temperature}. "
                "Must be between 0.0 and 2.0."
            )

        # Create output directory if it doesn't exist
        self.output_dir.mkdir(parents=True, exist_ok=True)
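In `from_env`, the raw `DEFAULT_LLM_PROVIDER` string is passed to a `Literal["ollama", "mistral"]` field behind a `# type: ignore`, so a bad value is only caught if `validate()` is called afterwards. One possible alternative, sketched here as an assumption rather than as the project's approach, is to narrow the value when it is read. The names `LLMProvider` and `parse_provider` are hypothetical, and the case and whitespace normalization is an added design choice that the original code does not make.

```python
from typing import Literal, cast

# Hypothetical alias mirroring the Literal used on MCPConfig.default_llm_provider.
LLMProvider = Literal["ollama", "mistral"]


def parse_provider(raw: str) -> LLMProvider:
    """Normalize and validate a provider name read from the environment."""
    value = raw.strip().lower()
    if value not in ("ollama", "mistral"):
        raise ValueError(
            f"Invalid LLM provider: {raw!r}. Must be 'ollama' or 'mistral'."
        )
    # The membership check above justifies the cast for the type checker.
    return cast(LLMProvider, value)
```

With a helper of this kind, the call in `from_env` could read `parse_provider(os.getenv("DEFAULT_LLM_PROVIDER", "ollama"))`, which would surface a misconfigured provider at load time and remove the need for the ignore comment.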
592
generations/library_rag/mcp_server.py
Normal file
@@ -0,0 +1,592 @@
"""
Library RAG MCP Server - PDF Ingestion & Semantic Retrieval.

This module provides an MCP (Model Context Protocol) server that exposes
Library RAG capabilities as tools for LLMs. It provides:

- 1 parsing tool: parse_pdf (PDF ingestion with optimal parameters)
- 7 retrieval tools: semantic search and document management

The server uses stdio transport for communication with LLM clients
like Claude Desktop.

Example:
    Run the server directly::

        python mcp_server.py

    Or configure in Claude Desktop claude_desktop_config.json::

        {
            "mcpServers": {
                "library-rag": {
                    "command": "python",
                    "args": ["path/to/mcp_server.py"],
                    "env": {"MISTRAL_API_KEY": "your-key"}
                }
            }
        }
"""

import logging
import signal
import sys
from contextlib import asynccontextmanager
from pathlib import Path
from typing import Any, AsyncIterator, Dict

from mcp.server.fastmcp import FastMCP

from mcp_config import MCPConfig
from mcp_tools import (
    ParsePdfInput,
    parse_pdf_handler,
    SearchChunksInput,
    search_chunks_handler,
    SearchSummariesInput,
    search_summaries_handler,
    GetDocumentInput,
    get_document_handler,
    ListDocumentsInput,
    list_documents_handler,
    GetChunksByDocumentInput,
    get_chunks_by_document_handler,
    FilterByAuthorInput,
    filter_by_author_handler,
    DeleteDocumentInput,
    delete_document_handler,
    # Logging utilities
    setup_mcp_logging,
    # Exception types for error handling
    WeaviateConnectionError,
    PDFProcessingError,
)

# =============================================================================
# Logging Configuration
# =============================================================================

# Note: We use setup_mcp_logging from mcp_tools.logging_config for structured
# JSON logging. The function is imported at the top of this file.


# =============================================================================
# Global State
# =============================================================================

# Configuration loaded at startup
config: MCPConfig | None = None
logger: logging.Logger | None = None
||||
|
||||
|
||||
# =============================================================================
|
||||
# Server Lifecycle
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@asynccontextmanager
|
||||
async def server_lifespan(server: FastMCP) -> AsyncIterator[None]:
|
||||
"""
|
||||
Manage server lifecycle - startup and shutdown.
|
||||
|
||||
This context manager handles:
|
||||
- Loading configuration from environment
|
||||
- Validating configuration
|
||||
- Setting up logging
|
||||
- Graceful shutdown cleanup
|
||||
|
||||
Args:
|
||||
server: The FastMCP server instance.
|
||||
|
||||
Yields:
|
||||
None during server runtime.
|
||||
|
||||
Raises:
|
||||
ValueError: If configuration is invalid or missing required values.
|
||||
"""
|
||||
global config, logger
|
||||
|
||||
# Startup
|
||||
try:
|
||||
# Load and validate configuration
|
||||
config = MCPConfig.from_env()
|
||||
config.validate()
|
||||
|
||||
# Setup structured JSON logging with configured level
|
||||
logger = setup_mcp_logging(
|
||||
log_level=config.log_level,
|
||||
log_dir=Path("logs"),
|
||||
json_format=True,
|
||||
)
|
||||
logger.info(
|
||||
"Library RAG MCP Server starting",
|
||||
extra={
|
||||
"event": "server_startup",
|
||||
"weaviate_url": config.weaviate_url,
|
||||
"output_dir": str(config.output_dir),
|
||||
"llm_provider": config.default_llm_provider,
|
||||
"log_level": config.log_level,
|
||||
},
|
||||
)
|
||||
|
||||
yield
|
||||
|
||||
except ValueError as e:
|
||||
# Configuration error - log and re-raise
|
||||
if logger:
|
||||
logger.error(
|
||||
"Configuration error",
|
||||
extra={
|
||||
"event": "config_error",
|
||||
"error_message": str(e),
|
||||
},
|
||||
)
|
||||
else:
|
||||
print(f"Configuration error: {e}", file=sys.stderr)
|
||||
raise
|
||||
|
||||
finally:
|
||||
# Shutdown
|
||||
if logger:
|
||||
logger.info(
|
||||
"Library RAG MCP Server shutting down",
|
||||
extra={"event": "server_shutdown"},
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# MCP Server Initialization
|
||||
# =============================================================================
|
||||
|
||||
# Create the MCP server with lifespan management
|
||||
mcp = FastMCP(
|
||||
name="library-rag",
|
||||
lifespan=server_lifespan,
|
||||
)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
 # Tool Registration (thin wrappers; handlers are implemented in mcp_tools)
|
||||
# =============================================================================
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def ping() -> str:
|
||||
"""
|
||||
Health check tool to verify server is running.
|
||||
|
||||
Returns:
|
||||
Success message with server status.
|
||||
"""
|
||||
return "Library RAG MCP Server is running!"
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def parse_pdf(pdf_path: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Process a PDF document with optimal pre-configured parameters.
|
||||
|
||||
Ingests a PDF file into the Library RAG system using Mistral OCR and LLM
|
||||
for intelligent processing. The document is automatically chunked,
|
||||
vectorized, and stored in Weaviate for semantic search.
|
||||
|
||||
Fixed optimal parameters used:
|
||||
- LLM: Mistral API (mistral-medium-latest)
|
||||
- OCR: With annotations (better TOC extraction)
|
||||
- Chunking: Semantic LLM-based (argumentative units)
|
||||
- Ingestion: Automatic Weaviate vectorization
|
||||
|
||||
Args:
|
||||
pdf_path: Local file path or URL to the PDF document.
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- success: Whether processing succeeded
|
||||
- document_name: Name of the processed document
|
||||
- source_id: Unique identifier for retrieval
|
||||
- pages: Number of pages processed
|
||||
- chunks_count: Number of chunks created
|
||||
- cost_ocr: OCR cost in EUR
|
||||
- cost_llm: LLM cost in EUR
|
||||
- cost_total: Total processing cost
|
||||
- output_dir: Directory with output files
|
||||
- metadata: Extracted document metadata
|
||||
- error: Error message if failed
|
||||
"""
|
||||
input_data = ParsePdfInput(pdf_path=pdf_path)
|
||||
result = await parse_pdf_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def search_chunks(
|
||||
query: str,
|
||||
limit: int = 10,
|
||||
min_similarity: float = 0.0,
|
||||
author_filter: str | None = None,
|
||||
work_filter: str | None = None,
|
||||
language_filter: str | None = None,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Search for text chunks using semantic similarity.
|
||||
|
||||
Performs a near_text query on the Weaviate Chunk collection to find
|
||||
semantically similar text passages from the indexed philosophical texts.
|
||||
|
||||
Args:
|
||||
query: The search query text (e.g., "la justice et la vertu").
|
||||
limit: Maximum number of results to return (1-100, default 10).
|
||||
min_similarity: Minimum similarity threshold 0-1 (default 0).
|
||||
author_filter: Filter by author name (e.g., "Platon").
|
||||
work_filter: Filter by work title (e.g., "La Republique").
|
||||
language_filter: Filter by language code (e.g., "fr", "en").
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- results: List of matching chunks with text and metadata
|
||||
- total_count: Number of results returned
|
||||
- query: The original search query
|
||||
"""
|
||||
input_data = SearchChunksInput(
|
||||
query=query,
|
||||
limit=limit,
|
||||
min_similarity=min_similarity,
|
||||
author_filter=author_filter,
|
||||
work_filter=work_filter,
|
||||
language_filter=language_filter,
|
||||
)
|
||||
result = await search_chunks_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def search_summaries(
|
||||
query: str,
|
||||
limit: int = 10,
|
||||
min_level: int | None = None,
|
||||
max_level: int | None = None,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Search for chapter/section summaries using semantic similarity.
|
||||
|
||||
Performs a near_text query on the Weaviate Summary collection to find
|
||||
semantically similar summaries from indexed philosophical texts.
|
||||
|
||||
Hierarchy levels:
|
||||
- Level 1: Chapters (highest level)
|
||||
- Level 2: Sections
|
||||
- Level 3: Subsections
|
||||
- etc.
|
||||
|
||||
Args:
|
||||
query: The search query text (e.g., "la vertu et l'education").
|
||||
limit: Maximum number of results to return (1-100, default 10).
|
||||
min_level: Minimum hierarchy level filter (1=chapter, optional).
|
||||
max_level: Maximum hierarchy level filter (optional).
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- results: List of matching summaries with text and metadata
|
||||
- total_count: Number of results returned
|
||||
- query: The original search query
|
||||
|
||||
Example:
|
||||
Search for summaries about virtue at chapter level only::
|
||||
|
||||
search_summaries(
|
||||
query="la vertu",
|
||||
limit=5,
|
||||
min_level=1,
|
||||
max_level=1
|
||||
)
|
||||
"""
|
||||
input_data = SearchSummariesInput(
|
||||
query=query,
|
||||
limit=limit,
|
||||
min_level=min_level,
|
||||
max_level=max_level,
|
||||
)
|
||||
result = await search_summaries_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def get_document(
|
||||
source_id: str,
|
||||
include_chunks: bool = False,
|
||||
chunk_limit: int = 50,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Retrieve a document by its source ID with optional chunks.
|
||||
|
||||
Fetches complete document metadata and optionally related text chunks
|
||||
from the Weaviate database.
|
||||
|
||||
Args:
|
||||
source_id: Document source ID (e.g., "platon-menon").
|
||||
include_chunks: Include document chunks in response (default False).
|
||||
chunk_limit: Maximum chunks to return if include_chunks=True (1-500, default 50).
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- document: Document metadata (title, author, pages, TOC, hierarchy)
|
||||
- chunks: List of chunks (if include_chunks=True)
|
||||
- chunks_total: Total number of chunks in document
|
||||
- found: Whether document was found
|
||||
- error: Error message if not found
|
||||
|
||||
Example:
|
||||
Get document metadata only::
|
||||
|
||||
get_document(source_id="platon-menon")
|
||||
|
||||
Get document with first 20 chunks::
|
||||
|
||||
get_document(
|
||||
source_id="platon-menon",
|
||||
include_chunks=True,
|
||||
chunk_limit=20
|
||||
)
|
||||
"""
|
||||
input_data = GetDocumentInput(
|
||||
source_id=source_id,
|
||||
include_chunks=include_chunks,
|
||||
chunk_limit=chunk_limit,
|
||||
)
|
||||
result = await get_document_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def list_documents(
|
||||
author_filter: str | None = None,
|
||||
work_filter: str | None = None,
|
||||
language_filter: str | None = None,
|
||||
limit: int = 50,
|
||||
offset: int = 0,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
List all documents with filtering and pagination support.
|
||||
|
||||
Retrieves a list of all documents stored in the Library RAG system.
|
||||
Supports filtering by author, work title, and language, as well as
|
||||
pagination with limit and offset parameters.
|
||||
|
||||
Args:
|
||||
author_filter: Filter by author name (e.g., "Platon").
|
||||
work_filter: Filter by work title (e.g., "La Republique").
|
||||
language_filter: Filter by language code (e.g., "fr", "en").
|
||||
limit: Maximum number of results to return (1-250, default 50).
|
||||
offset: Number of results to skip for pagination (default 0).
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- documents: List of document summaries (source_id, title, author, pages, chunks_count, language)
|
||||
- total_count: Total number of documents matching filters
|
||||
- limit: Applied limit value
|
||||
- offset: Applied offset value
|
||||
|
||||
Example:
|
||||
List all French documents::
|
||||
|
||||
list_documents(language_filter="fr")
|
||||
|
||||
Paginate through results::
|
||||
|
||||
list_documents(limit=10, offset=0) # First 10
|
||||
list_documents(limit=10, offset=10) # Next 10
|
||||
"""
|
||||
input_data = ListDocumentsInput(
|
||||
author_filter=author_filter,
|
||||
work_filter=work_filter,
|
||||
language_filter=language_filter,
|
||||
limit=limit,
|
||||
offset=offset,
|
||||
)
|
||||
result = await list_documents_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def get_chunks_by_document(
|
||||
source_id: str,
|
||||
limit: int = 50,
|
||||
offset: int = 0,
|
||||
section_filter: str | None = None,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Retrieve all chunks for a document in sequential order.
|
||||
|
||||
Fetches all text chunks belonging to a specific document, ordered by
|
||||
their position in the document (orderIndex). Supports pagination and
|
||||
optional filtering by section path.
|
||||
|
||||
Args:
|
||||
source_id: Document source ID (e.g., "platon-menon").
|
||||
limit: Maximum number of chunks to return (1-500, default 50).
|
||||
offset: Number of chunks to skip for pagination (default 0).
|
||||
section_filter: Filter by section path prefix (e.g., "Chapter 1").
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- chunks: List of chunks in document order
|
||||
- total_count: Total number of chunks in document
|
||||
- document_source_id: The queried document source ID
|
||||
- limit: Applied limit value
|
||||
- offset: Applied offset value
|
||||
|
||||
Example:
|
||||
Get first 20 chunks::
|
||||
|
||||
get_chunks_by_document(source_id="platon-menon", limit=20)
|
||||
|
||||
Get chunks from a specific section::
|
||||
|
||||
get_chunks_by_document(
|
||||
source_id="platon-menon",
|
||||
section_filter="Chapter 3"
|
||||
)
|
||||
|
||||
Paginate through chunks::
|
||||
|
||||
get_chunks_by_document(source_id="platon-menon", limit=50, offset=0)
|
||||
get_chunks_by_document(source_id="platon-menon", limit=50, offset=50)
|
||||
"""
|
||||
input_data = GetChunksByDocumentInput(
|
||||
source_id=source_id,
|
||||
limit=limit,
|
||||
offset=offset,
|
||||
section_filter=section_filter,
|
||||
)
|
||||
result = await get_chunks_by_document_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def filter_by_author(
|
||||
author: str,
|
||||
include_chunk_counts: bool = True,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Get all works and documents by a specific author.
|
||||
|
||||
Retrieves all works associated with an author, along with their related
|
||||
documents. Optionally includes total chunk counts for each work.
|
||||
|
||||
Args:
|
||||
author: The author name to search for (e.g., "Platon", "Aristotle").
|
||||
include_chunk_counts: Whether to include chunk counts (default True).
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- author: The searched author name
|
||||
- works: List of works with work info and documents
|
||||
- total_works: Total number of works by this author
|
||||
- total_documents: Total number of documents across all works
|
||||
- total_chunks: Total number of chunks (if include_chunk_counts=True)
|
||||
|
||||
Example:
|
||||
Get all works by Platon::
|
||||
|
||||
filter_by_author(author="Platon")
|
||||
|
||||
Get works without chunk counts (faster)::
|
||||
|
||||
filter_by_author(author="Platon", include_chunk_counts=False)
|
||||
"""
|
||||
input_data = FilterByAuthorInput(
|
||||
author=author,
|
||||
include_chunk_counts=include_chunk_counts,
|
||||
)
|
||||
result = await filter_by_author_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
async def delete_document(
|
||||
source_id: str,
|
||||
confirm: bool = False,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Delete a document and all its chunks/summaries from Weaviate.
|
||||
|
||||
Removes all data associated with a document: the Document object itself,
|
||||
all Chunk objects, and all Summary objects. Requires explicit confirmation
|
||||
to prevent accidental deletions.
|
||||
|
||||
IMPORTANT: This operation is irreversible. Use with caution.
|
||||
|
||||
Args:
|
||||
source_id: Document source ID to delete (e.g., "platon-menon").
|
||||
confirm: Must be True to confirm deletion (safety check, default False).
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- success: Whether deletion succeeded
|
||||
- source_id: The deleted document source ID
|
||||
- chunks_deleted: Number of chunks deleted
|
||||
- summaries_deleted: Number of summaries deleted
|
||||
- error: Error message if failed
|
||||
|
||||
Example:
|
||||
Delete a document (requires confirmation)::
|
||||
|
||||
delete_document(
|
||||
source_id="platon-menon",
|
||||
confirm=True
|
||||
)
|
||||
|
||||
Without confirm=True, the operation will fail with an error message::
|
||||
|
||||
delete_document(source_id="platon-menon")
|
||||
# Returns: {"success": false, "error": "Confirmation required..."}
|
||||
"""
|
||||
input_data = DeleteDocumentInput(
|
||||
source_id=source_id,
|
||||
confirm=confirm,
|
||||
)
|
||||
result = await delete_document_handler(input_data)
|
||||
return result.model_dump(mode='json')
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Signal Handlers
|
||||
# =============================================================================
|
||||
|
||||
|
||||
def handle_shutdown(signum: int, frame: object) -> None:
|
||||
"""
|
||||
Handle shutdown signals gracefully.
|
||||
|
||||
Args:
|
||||
signum: Signal number received.
|
||||
frame: Current stack frame (unused).
|
||||
"""
|
||||
if logger:
|
||||
logger.info(f"Received signal {signum}, initiating graceful shutdown...")
|
||||
sys.exit(0)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Main Entry Point
|
||||
# =============================================================================
|
||||
|
||||
|
||||
def main() -> None:
|
||||
"""
|
||||
Main entry point for the MCP server.
|
||||
|
||||
Sets up signal handlers and runs the server with stdio transport.
|
||||
"""
|
||||
# Register signal handlers for graceful shutdown
|
||||
signal.signal(signal.SIGINT, handle_shutdown)
|
||||
signal.signal(signal.SIGTERM, handle_shutdown)
|
||||
|
||||
# Run the server with stdio transport (default for MCP)
|
||||
mcp.run(transport="stdio")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
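The startup and shutdown ordering of `server_lifespan` can be observed without FastMCP. The sketch below uses only the standard library and follows the same `try` / `except ValueError` / `finally` shape around a `yield`. The names `make_lifespan`, `run_once`, and `demo`, and the recorded event strings, are hypothetical and exist only for this illustration.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List


def make_lifespan(events: List[str], fail_startup: bool = False):
    """Build a lifespan context manager that records what happens (hypothetical)."""

    @asynccontextmanager
    async def lifespan() -> AsyncIterator[None]:
        try:
            if fail_startup:
                # Stands in for a failing config.validate() call.
                raise ValueError("bad config")
            events.append("startup")
            yield
        except ValueError as exc:
            events.append(f"error: {exc}")
            raise
        finally:
            # Runs on normal exit and on failure alike.
            events.append("shutdown")

    return lifespan


async def run_once(events: List[str], fail_startup: bool = False) -> None:
    try:
        async with make_lifespan(events, fail_startup)():
            events.append("serving")
    except ValueError:
        pass  # already recorded by the lifespan


def demo(fail_startup: bool = False) -> List[str]:
    events: List[str] = []
    asyncio.run(run_once(events, fail_startup))
    return events
```

One consequence of this shape, at least in the sketch: because the `yield` sits inside the `try`, a `ValueError` raised while the body is running is also routed through the `except` branch, not only one raised during startup.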
106
generations/library_rag/mcp_tools/__init__.py
Normal file
@@ -0,0 +1,106 @@
|
||||
"""
|
||||
MCP Tools for Library RAG Server.
|
||||
|
||||
This package contains all tool implementations for the Library RAG MCP server:
|
||||
- Parsing tools: PDF ingestion with optimal parameters
|
||||
- Retrieval tools: Semantic search and document management
|
||||
- Exceptions: Custom exception classes for structured error handling
|
||||
- Logging: Structured JSON logging configuration
|
||||
"""
|
||||
|
||||
from mcp_tools.schemas import (
|
||||
ParsePdfInput,
|
||||
ParsePdfOutput,
|
||||
SearchChunksInput,
|
||||
SearchChunksOutput,
|
||||
SearchSummariesInput,
|
||||
SearchSummariesOutput,
|
||||
GetDocumentInput,
|
||||
GetDocumentOutput,
|
||||
ListDocumentsInput,
|
||||
ListDocumentsOutput,
|
||||
GetChunksByDocumentInput,
|
||||
GetChunksByDocumentOutput,
|
||||
FilterByAuthorInput,
|
||||
FilterByAuthorOutput,
|
||||
DeleteDocumentInput,
|
||||
DeleteDocumentOutput,
|
||||
)
|
||||
|
||||
from mcp_tools.exceptions import (
|
||||
MCPToolError,
|
||||
WeaviateConnectionError,
|
||||
PDFProcessingError,
|
||||
DocumentNotFoundError,
|
||||
ValidationError,
|
||||
LLMProcessingError,
|
||||
DownloadError,
|
||||
)
|
||||
|
||||
from mcp_tools.logging_config import (
|
||||
setup_mcp_logging,
|
||||
get_tool_logger,
|
||||
ToolInvocationLogger,
|
||||
log_tool_invocation,
|
||||
log_weaviate_query,
|
||||
redact_sensitive_data,
|
||||
redact_dict,
|
||||
)
|
||||
|
||||
from mcp_tools.parsing_tools import parse_pdf_handler
|
||||
from mcp_tools.retrieval_tools import (
|
||||
search_chunks_handler,
|
||||
search_summaries_handler,
|
||||
get_document_handler,
|
||||
list_documents_handler,
|
||||
get_chunks_by_document_handler,
|
||||
filter_by_author_handler,
|
||||
delete_document_handler,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Parsing tools
|
||||
"parse_pdf_handler",
|
||||
# Retrieval tools
|
||||
"search_chunks_handler",
|
||||
"search_summaries_handler",
|
||||
"get_document_handler",
|
||||
"list_documents_handler",
|
||||
"get_chunks_by_document_handler",
|
||||
"filter_by_author_handler",
|
||||
"delete_document_handler",
|
||||
# Parsing schemas
|
||||
"ParsePdfInput",
|
||||
"ParsePdfOutput",
|
||||
# Retrieval schemas
|
||||
"SearchChunksInput",
|
||||
"SearchChunksOutput",
|
||||
"SearchSummariesInput",
|
||||
"SearchSummariesOutput",
|
||||
"GetDocumentInput",
|
||||
"GetDocumentOutput",
|
||||
"ListDocumentsInput",
|
||||
"ListDocumentsOutput",
|
||||
"GetChunksByDocumentInput",
|
||||
"GetChunksByDocumentOutput",
|
||||
"FilterByAuthorInput",
|
||||
"FilterByAuthorOutput",
|
||||
"DeleteDocumentInput",
|
||||
"DeleteDocumentOutput",
|
||||
# Exceptions
|
||||
"MCPToolError",
|
||||
"WeaviateConnectionError",
|
||||
"PDFProcessingError",
|
||||
"DocumentNotFoundError",
|
||||
"ValidationError",
|
||||
"LLMProcessingError",
|
||||
"DownloadError",
|
||||
# Logging
|
||||
"setup_mcp_logging",
|
||||
"get_tool_logger",
|
||||
"ToolInvocationLogger",
|
||||
"log_tool_invocation",
|
||||
"log_weaviate_query",
|
||||
"redact_sensitive_data",
|
||||
"redact_dict",
|
||||
]
|
||||
297
generations/library_rag/mcp_tools/exceptions.py
Normal file
297
generations/library_rag/mcp_tools/exceptions.py
Normal file
@@ -0,0 +1,297 @@
|
||||
"""Custom exception classes for Library RAG MCP Server.
|
||||
|
||||
This module defines custom exception classes used throughout the MCP server
|
||||
for structured error handling and consistent error responses.
|
||||
|
||||
Exception Hierarchy:
|
||||
MCPToolError (base)
|
||||
├── WeaviateConnectionError - Database connection failures
|
||||
├── PDFProcessingError - PDF parsing/OCR failures
|
||||
├── DocumentNotFoundError - Document/chunk retrieval failures
|
||||
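 ├── ValidationError - Input validation failures
 ├── LLMProcessingError - LLM (Mistral/Ollama) processing failures
 └── DownloadError - File download failures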
|
||||
|
||||
Example:
|
||||
Raise and catch custom exceptions::
|
||||
|
||||
from mcp_tools.exceptions import WeaviateConnectionError
|
||||
|
||||
try:
|
||||
client = connect_to_weaviate()
|
||||
except Exception as e:
|
||||
raise WeaviateConnectionError("Failed to connect") from e
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any, Dict, Optional
|
||||
|
||||
|
||||
class MCPToolError(Exception):
|
||||
"""Base exception for all MCP tool errors.
|
||||
|
||||
This is the base class for all custom exceptions in the MCP server.
|
||||
It provides structured error information that can be converted to
|
||||
MCP error responses.
|
||||
|
||||
Attributes:
|
||||
message: Human-readable error description.
|
||||
error_code: Machine-readable error code for categorization.
|
||||
details: Additional context about the error.
|
||||
original_error: The underlying exception if this wraps another error.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str,
|
||||
*,
|
||||
error_code: str = "MCP_ERROR",
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize the MCPToolError.
|
||||
|
||||
Args:
|
||||
message: Human-readable error description.
|
||||
error_code: Machine-readable error code (default: "MCP_ERROR").
|
||||
details: Additional context about the error (optional).
|
||||
original_error: The underlying exception if wrapping (optional).
|
||||
"""
|
||||
super().__init__(message)
|
||||
self.message = message
|
||||
self.error_code = error_code
|
||||
self.details = details or {}
|
||||
self.original_error = original_error
|
||||
|
||||
def to_dict(self) -> Dict[str, Any]:
|
||||
"""Convert exception to a dictionary for JSON serialization.
|
||||
|
||||
Returns:
|
||||
Dictionary with error information suitable for MCP responses.
|
||||
"""
|
||||
result: Dict[str, Any] = {
|
||||
"error": True,
|
||||
"error_code": self.error_code,
|
||||
"message": self.message,
|
||||
}
|
||||
if self.details:
|
||||
result["details"] = self.details
|
||||
if self.original_error:
|
||||
result["original_error"] = str(self.original_error)
|
||||
return result
|
||||
|
||||
def __str__(self) -> str:
|
||||
"""Return string representation of the error."""
|
||||
if self.original_error:
|
||||
return f"[{self.error_code}] {self.message} (caused by: {self.original_error})"
|
||||
return f"[{self.error_code}] {self.message}"
|
||||
|
||||
|
||||
class WeaviateConnectionError(MCPToolError):
|
||||
"""Raised when Weaviate database connection fails.
|
||||
|
||||
This exception is raised when the MCP server cannot establish or
|
||||
maintain a connection to the Weaviate vector database.
|
||||
|
||||
Example:
|
||||
>>> raise WeaviateConnectionError(
|
||||
... "Cannot connect to Weaviate at localhost:8080",
|
||||
... details={"host": "localhost", "port": 8080}
|
||||
... )
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str = "Failed to connect to Weaviate",
|
||||
*,
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize WeaviateConnectionError.
|
||||
|
||||
Args:
|
||||
message: Error description (default: "Failed to connect to Weaviate").
|
||||
details: Additional context (host, port, etc.).
|
||||
original_error: The underlying connection exception.
|
||||
"""
|
||||
super().__init__(
|
||||
message,
|
||||
error_code="WEAVIATE_CONNECTION_ERROR",
|
||||
details=details,
|
||||
original_error=original_error,
|
||||
)
|
||||
|
||||
|
||||
class PDFProcessingError(MCPToolError):
|
||||
"""Raised when PDF processing fails.
|
||||
|
||||
This exception is raised when the MCP server encounters an error
|
||||
during PDF parsing, OCR, or any step in the PDF ingestion pipeline.
|
||||
|
||||
Example:
|
||||
>>> raise PDFProcessingError(
|
||||
... "OCR failed for page 5",
|
||||
... details={"page": 5, "pdf_path": "/docs/test.pdf"}
|
||||
... )
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str = "PDF processing failed",
|
||||
*,
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize PDFProcessingError.
|
||||
|
||||
Args:
|
||||
message: Error description (default: "PDF processing failed").
|
||||
details: Additional context (pdf_path, page, step, etc.).
|
||||
original_error: The underlying processing exception.
|
||||
"""
|
||||
super().__init__(
|
||||
message,
|
||||
error_code="PDF_PROCESSING_ERROR",
|
||||
details=details,
|
||||
original_error=original_error,
|
||||
)
|
||||
|
||||
|
||||
class DocumentNotFoundError(MCPToolError):
|
||||
"""Raised when a requested document or chunk is not found.
|
||||
|
||||
This exception is raised when a retrieval operation cannot find
|
||||
the requested document, chunk, or summary in Weaviate.
|
||||
|
||||
Example:
|
||||
>>> raise DocumentNotFoundError(
|
||||
... "Document not found",
|
||||
... details={"source_id": "platon-menon"}
|
||||
... )
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str = "Document not found",
|
||||
*,
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize DocumentNotFoundError.
|
||||
|
||||
Args:
|
||||
message: Error description (default: "Document not found").
|
||||
details: Additional context (source_id, query, etc.).
|
||||
original_error: The underlying exception if any.
|
||||
"""
|
||||
super().__init__(
|
||||
message,
|
||||
error_code="DOCUMENT_NOT_FOUND",
|
||||
details=details,
|
||||
original_error=original_error,
|
||||
)
|
||||
|
||||
|
||||
class ValidationError(MCPToolError):
|
||||
"""Raised when input validation fails.
|
||||
|
||||
This exception is raised when user input does not meet the
|
||||
required validation criteria (e.g., invalid paths, bad parameters).
|
||||
|
||||
Example:
|
||||
>>> raise ValidationError(
|
||||
... "Invalid PDF path",
|
||||
... details={"path": "/nonexistent/file.pdf", "reason": "File not found"}
|
||||
... )
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str = "Validation failed",
|
||||
*,
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize ValidationError.
|
||||
|
||||
Args:
|
||||
message: Error description (default: "Validation failed").
|
||||
details: Additional context (field, value, reason, etc.).
|
||||
original_error: The underlying validation exception.
|
||||
"""
|
||||
super().__init__(
|
||||
message,
|
||||
error_code="VALIDATION_ERROR",
|
||||
details=details,
|
||||
original_error=original_error,
|
||||
)
|
||||
|
||||
|
||||
class LLMProcessingError(MCPToolError):
|
||||
"""Raised when LLM processing fails.
|
||||
|
||||
This exception is raised when the LLM (Mistral or Ollama) fails
|
||||
to process content during metadata extraction, chunking, or other
|
||||
LLM-based operations.
|
||||
|
||||
Example:
|
||||
>>> raise LLMProcessingError(
|
||||
... "LLM timeout during metadata extraction",
|
||||
... details={"provider": "ollama", "model": "mistral", "step": "metadata"}
|
||||
... )
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str = "LLM processing failed",
|
||||
*,
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize LLMProcessingError.
|
||||
|
||||
Args:
|
||||
message: Error description (default: "LLM processing failed").
|
||||
details: Additional context (provider, model, step, etc.).
|
||||
original_error: The underlying LLM exception.
|
||||
"""
|
||||
super().__init__(
|
||||
message,
|
||||
error_code="LLM_PROCESSING_ERROR",
|
||||
details=details,
|
||||
original_error=original_error,
|
||||
)
|
||||
|
||||
|
||||
class DownloadError(MCPToolError):
|
||||
"""Raised when file download from URL fails.
|
||||
|
||||
This exception is raised when the MCP server cannot download
|
||||
a PDF file from a provided URL.
|
||||
|
||||
Example:
|
||||
>>> raise DownloadError(
|
||||
... "Failed to download PDF",
|
||||
... details={"url": "https://example.com/doc.pdf", "status_code": 404}
|
||||
... )
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
message: str = "File download failed",
|
||||
*,
|
||||
details: Optional[Dict[str, Any]] = None,
|
||||
original_error: Optional[Exception] = None,
|
||||
) -> None:
|
||||
"""Initialize DownloadError.
|
||||
|
||||
Args:
|
||||
message: Error description (default: "File download failed").
|
||||
details: Additional context (url, status_code, etc.).
|
||||
original_error: The underlying HTTP exception.
|
||||
"""
|
||||
super().__init__(
|
||||
message,
|
||||
error_code="DOWNLOAD_ERROR",
|
||||
details=details,
|
||||
original_error=original_error,
|
||||
)
|
||||
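A short sketch of how `to_dict()` might be used to turn a raised exception into a JSON-friendly response. The `MCPToolError` class here is a condensed copy of the base class above so that the example runs standalone. `safe_call` and `failing_lookup` are hypothetical helpers and are not part of the module.

```python
from typing import Any, Callable, Dict, Optional


class MCPToolError(Exception):
    """Condensed copy of the base class above, so the sketch runs standalone."""

    def __init__(self, message: str, *, error_code: str = "MCP_ERROR",
                 details: Optional[Dict[str, Any]] = None,
                 original_error: Optional[Exception] = None) -> None:
        super().__init__(message)
        self.message = message
        self.error_code = error_code
        self.details = details or {}
        self.original_error = original_error

    def to_dict(self) -> Dict[str, Any]:
        result: Dict[str, Any] = {"error": True, "error_code": self.error_code,
                                  "message": self.message}
        if self.details:
            result["details"] = self.details
        if self.original_error:
            result["original_error"] = str(self.original_error)
        return result


def safe_call(func: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: run a tool body and turn MCPToolError into a dict."""
    try:
        return func()
    except MCPToolError as exc:
        return exc.to_dict()


def failing_lookup() -> Dict[str, Any]:
    """Hypothetical tool body that wraps a low-level KeyError."""
    try:
        return {"document": {}["platon-menon"]}
    except KeyError as exc:
        raise MCPToolError("Document not found", error_code="DOCUMENT_NOT_FOUND",
                           details={"source_id": "platon-menon"},
                           original_error=exc) from exc
```

Calling `safe_call(failing_lookup)` yields a dictionary carrying `error`, `error_code`, `message`, `details`, and `original_error`, while a body that succeeds is passed through unchanged.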
462
generations/library_rag/mcp_tools/logging_config.py
Normal file
@@ -0,0 +1,462 @@
"""Structured JSON logging configuration for Library RAG MCP Server.

This module provides structured JSON logging with sensitive data filtering
and tool invocation tracking.

Features:
    - JSON-formatted log output for machine parsing
    - Sensitive data filtering (API keys, passwords)
    - Tool invocation logging with timing
    - Configurable log levels via environment variable

Example:
    Configure logging at server startup::

        from mcp_tools.logging_config import setup_mcp_logging, get_tool_logger

        # Setup logging
        logger = setup_mcp_logging(log_level="INFO")

        # Get tool-specific logger
        tool_logger = get_tool_logger("search_chunks")
        tool_logger.info("Processing query", extra={"query": "justice"})
"""

from __future__ import annotations

import json
import logging
import os
import re
import sys
import time
from contextlib import contextmanager
from datetime import datetime, timezone
from functools import wraps
from pathlib import Path
from typing import Any, Callable, Dict, Generator, Literal, Optional, TypeVar, cast

# Type variable for decorator return type preservation
F = TypeVar("F", bound=Callable[..., Any])

# =============================================================================
# Sensitive Data Patterns
# =============================================================================

# Patterns to detect sensitive data in log messages
SENSITIVE_PATTERNS = [
    # API keys
    (re.compile(r'(api[_-]?key\s*[=:]\s*)["\']?[\w-]{20,}["\']?', re.I), r"\1***REDACTED***"),
    (re.compile(r'(bearer\s+)[\w-]{20,}', re.I), r"\1***REDACTED***"),
    (re.compile(r'(authorization\s*[=:]\s*)["\']?[\w-]{20,}["\']?', re.I), r"\1***REDACTED***"),
    # Mistral API key format
    (re.compile(r'(MISTRAL_API_KEY\s*[=:]\s*)["\']?[\w-]+["\']?', re.I), r"\1***REDACTED***"),
    # Generic secrets
    (re.compile(r'(password\s*[=:]\s*)["\']?[^\s"\']+["\']?', re.I), r"\1***REDACTED***"),
    (re.compile(r'(secret\s*[=:]\s*)["\']?[\w-]+["\']?', re.I), r"\1***REDACTED***"),
    (re.compile(r'(token\s*[=:]\s*)["\']?[\w-]{20,}["\']?', re.I), r"\1***REDACTED***"),
]


def redact_sensitive_data(message: str) -> str:
    """Remove sensitive data from log messages.

    Args:
        message: The log message to sanitize.

    Returns:
        Sanitized message with sensitive data redacted.

    Example:
        The API key pattern only matches values of at least 20 characters:

        >>> redact_sensitive_data("api_key=sk-1234567890abcdef1234567890")
        'api_key=***REDACTED***'
    """
    result = message
    for pattern, replacement in SENSITIVE_PATTERNS:
        result = pattern.sub(replacement, result)
    return result


def redact_dict(data: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively redact sensitive data from a dictionary.

    Args:
        data: Dictionary that may contain sensitive data.

    Returns:
        New dictionary with sensitive values redacted.
    """
    # Keys are compared after lowercasing and replacing "-" with "_",
    # so every entry here must already be in that normalized form.
    sensitive_keys = {
        "api_key", "apikey",
        "password", "passwd", "pwd",
        "secret", "token", "auth",
        "authorization", "bearer",
        "mistral_api_key",
    }

    result: Dict[str, Any] = {}
    for key, value in data.items():
        key_lower = key.lower().replace("-", "_")

        if key_lower in sensitive_keys or any(s in key_lower for s in ["key", "secret", "token", "password"]):
            result[key] = "***REDACTED***"
        elif isinstance(value, dict):
            result[key] = redact_dict(value)
        elif isinstance(value, str):
            result[key] = redact_sensitive_data(value)
        else:
            result[key] = value

    return result


# =============================================================================
# JSON Log Formatter
# =============================================================================


class JSONLogFormatter(logging.Formatter):
    """JSON formatter for structured logging.

    Outputs log records as single-line JSON objects with consistent structure.
    Automatically redacts sensitive data from messages and extra fields.

    JSON Structure:
        {
            "timestamp": "2024-12-24T10:30:00.123456+00:00",
            "level": "INFO",
            "logger": "library-rag-mcp.search_chunks",
            "message": "Processing query",
            "tool": "search_chunks",
            "duration_ms": 123,
            ...extra fields...
        }
    """

    def format(self, record: logging.LogRecord) -> str:
        """Format the log record as JSON.

        Args:
            record: The log record to format.

        Returns:
            JSON-formatted log string.
        """
        # Base log structure. The timestamp comes from the record itself so it
        # reflects when the event was logged, not when it was formatted.
        log_entry: Dict[str, Any] = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": redact_sensitive_data(record.getMessage()),
        }

        # Add exception info if present
        if record.exc_info:
            log_entry["exception"] = self.formatException(record.exc_info)

        # Add extra fields (excluding standard LogRecord attributes)
        standard_attrs = {
            "name", "msg", "args", "levelname", "levelno", "pathname",
            "filename", "module", "lineno", "funcName", "created",
            "msecs", "relativeCreated", "thread", "threadName",
            "processName", "process", "exc_info", "exc_text", "stack_info",
            "message", "taskName",
        }

        for key, value in record.__dict__.items():
            if key not in standard_attrs and not key.startswith("_"):
                if isinstance(value, dict):
                    log_entry[key] = redact_dict(value)
                elif isinstance(value, str):
                    log_entry[key] = redact_sensitive_data(value)
                else:
                    log_entry[key] = value

        return json.dumps(log_entry, default=str, ensure_ascii=False)


# =============================================================================
# Logging Setup
# =============================================================================


def setup_mcp_logging(
    log_level: str = "INFO",
    log_dir: Optional[Path] = None,
    json_format: bool = True,
) -> logging.Logger:
    """Configure structured logging for the MCP server.

    Sets up logging with JSON formatting to both file and stderr.
    Uses stderr for console output since stdout is used for MCP communication.

    Args:
        log_level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL).
            Unknown names fall back to INFO.
        log_dir: Directory for log files. Defaults to "logs".
        json_format: Use JSON formatting (default True).

    Returns:
        Configured logger instance for the MCP server.

    Example:
        >>> logger = setup_mcp_logging(log_level="DEBUG")
        >>> logger.info("Server started", extra={"port": 8080})
    """
    # Determine log directory
    if log_dir is None:
        log_dir = Path("logs")
    log_dir.mkdir(parents=True, exist_ok=True)

    # Get or create the root MCP logger
    logger = logging.getLogger("library-rag-mcp")
    logger.setLevel(getattr(logging, log_level.upper(), logging.INFO))

    # Clear existing handlers to avoid duplicates
    logger.handlers.clear()

    # Create formatters
    if json_format:
        formatter: logging.Formatter = JSONLogFormatter()
    else:
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )

    # File handler (same formatter as stderr)
    file_handler = logging.FileHandler(
        log_dir / "mcp_server.log",
        encoding="utf-8",
    )
    # No extra handler-level filtering; the logger level set above still applies,
    # so DEBUG records reach the file only when log_level is DEBUG.
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    # Stderr handler (for console output - stdout is for MCP)
    stderr_handler = logging.StreamHandler(sys.stderr)
    stderr_handler.setLevel(getattr(logging, log_level.upper(), logging.INFO))
    stderr_handler.setFormatter(formatter)
    logger.addHandler(stderr_handler)

    # Prevent propagation to root logger
    logger.propagate = False

    return logger


def get_tool_logger(tool_name: str) -> logging.Logger:
    """Get a logger for a specific MCP tool.

    Creates a child logger under the main MCP logger, so the tool name
    appears in the logger name of every entry it emits.

    Args:
        tool_name: Name of the MCP tool (e.g., "search_chunks", "parse_pdf").

    Returns:
        Logger instance for the tool.

    Example:
        >>> logger = get_tool_logger("search_chunks")
        >>> logger.info("Query processed", extra={"results": 10})
    """
    return logging.getLogger(f"library-rag-mcp.{tool_name}")


# =============================================================================
# Tool Invocation Logging
# =============================================================================


class ToolInvocationLogger:
    """Context manager for logging tool invocations with timing.

    Automatically logs tool start, success/failure, and duration.
    Handles exception logging and provides structured output.

    Example:
        >>> with ToolInvocationLogger("search_chunks", {"query": "justice"}) as inv:
        ...     result = do_search()
        ...     inv.set_result({"total_count": 10})
    """

    def __init__(
        self,
        tool_name: str,
        inputs: Dict[str, Any],
        logger: Optional[logging.Logger] = None,
    ) -> None:
        """Initialize the invocation logger.

        Args:
            tool_name: Name of the tool being invoked.
            inputs: Tool input parameters (will be redacted).
            logger: Logger to use. Defaults to tool-specific logger.
        """
        self.tool_name = tool_name
        self.inputs = redact_dict(inputs)
        self.logger = logger or get_tool_logger(tool_name)
        self.start_time: float = 0.0
        self.result: Optional[Dict[str, Any]] = None
        self.error: Optional[Exception] = None

    def __enter__(self) -> "ToolInvocationLogger":
        """Start timing and log invocation start."""
        self.start_time = time.perf_counter()
        self.logger.info(
            f"Tool invocation started: {self.tool_name}",
            extra={
                "tool": self.tool_name,
                "event": "invocation_start",
                "inputs": self.inputs,
            },
        )
        return self

    def __exit__(
        self,
        exc_type: Optional[type],
        exc_val: Optional[BaseException],
        exc_tb: Any,
    ) -> Literal[False]:
        """Log invocation completion with timing."""
        duration_ms = (time.perf_counter() - self.start_time) * 1000

        if exc_val is not None:
            # Log error
            self.logger.error(
                f"Tool invocation failed: {self.tool_name}",
                extra={
                    "tool": self.tool_name,
                    "event": "invocation_error",
                    "duration_ms": round(duration_ms, 2),
                    "error_type": exc_type.__name__ if exc_type else "Unknown",
                    "error_message": str(exc_val),
                },
                exc_info=True,
            )
            # Don't suppress the exception
            return False

        # Log success
        extra: Dict[str, Any] = {
            "tool": self.tool_name,
            "event": "invocation_success",
            "duration_ms": round(duration_ms, 2),
        }
        if self.result:
            extra["result_summary"] = self._summarize_result()

        self.logger.info(
            f"Tool invocation completed: {self.tool_name}",
            extra=extra,
        )
        return False

    def set_result(self, result: Dict[str, Any]) -> None:
        """Set the result for logging summary.

        Args:
            result: The tool result dictionary.
        """
        self.result = result

    def _summarize_result(self) -> Dict[str, Any]:
        """Create a summary of the result for logging.

        Only a fixed set of well-known keys is copied; any other key in the
        result is ignored.

        Returns:
            Dictionary with key result metrics (counts, success status, etc.)
        """
        if not self.result:
            return {}

        summary: Dict[str, Any] = {}

        # Common summary fields
        if "success" in self.result:
            summary["success"] = self.result["success"]
        if "total_count" in self.result:
            summary["total_count"] = self.result["total_count"]
        if "results" in self.result and isinstance(self.result["results"], list):
            summary["result_count"] = len(self.result["results"])
        if "chunks_count" in self.result:
            summary["chunks_count"] = self.result["chunks_count"]
        if "cost_total" in self.result:
            summary["cost_total"] = self.result["cost_total"]
        if "found" in self.result:
            summary["found"] = self.result["found"]
        if "error" in self.result and self.result["error"]:
            summary["error"] = self.result["error"]

        return summary


@contextmanager
def log_tool_invocation(
    tool_name: str,
    inputs: Dict[str, Any],
) -> Generator[ToolInvocationLogger, None, None]:
    """Context manager for logging tool invocations.

    Convenience function that creates and manages a ToolInvocationLogger.

    Args:
        tool_name: Name of the tool being invoked.
        inputs: Tool input parameters.

    Yields:
        ToolInvocationLogger instance for setting results.

    Example:
        >>> with log_tool_invocation("search_chunks", {"query": "test"}) as inv:
        ...     result = search(query)
        ...     inv.set_result(result)
    """
    logger_instance = ToolInvocationLogger(tool_name, inputs)
    with logger_instance as inv:
        yield inv


def log_weaviate_query(
    operation: str,
    collection: str,
    filters: Optional[Dict[str, Any]] = None,
    result_count: Optional[int] = None,
    duration_ms: Optional[float] = None,
) -> None:
    """Log a Weaviate query operation.

    Utility function for logging Weaviate database queries with consistent
    structure.

    Args:
        operation: Query operation type (fetch, near_text, aggregate, etc.).
        collection: Weaviate collection name.
        filters: Query filters applied (optional).
        result_count: Number of results returned (optional).
        duration_ms: Query duration in milliseconds (optional).

    Example:
        >>> log_weaviate_query(
        ...     operation="near_text",
        ...     collection="Chunk",
        ...     filters={"author": "Platon"},
        ...     result_count=10,
        ...     duration_ms=45.2
        ... )
    """
    logger = logging.getLogger("library-rag-mcp.weaviate")

    extra: Dict[str, Any] = {
        "event": "weaviate_query",
        "operation": operation,
        "collection": collection,
    }

    if filters:
        extra["filters"] = redact_dict(filters)
    if result_count is not None:
        extra["result_count"] = result_count
    if duration_ms is not None:
        extra["duration_ms"] = round(duration_ms, 2)

    logger.debug(f"Weaviate {operation} on {collection}", extra=extra)
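The logging module above combines two ideas: regex-based redaction of messages, and a formatter that turns each record (plus any `extra=` fields) into one JSON line. The following is a minimal standalone sketch of that combination using only the standard library. The names `API_KEY_PATTERN`, `redact`, `MiniJSONFormatter`, and the `demo-mcp` logger are hypothetical and exist only for this illustration; the single pattern is a simplified stand-in for `SENSITIVE_PATTERNS`, not a replacement for it.

```python
import io
import json
import logging
import re

# Assumption: one simplified pattern standing in for SENSITIVE_PATTERNS.
API_KEY_PATTERN = re.compile(r"(api[_-]?key\s*[=:]\s*)[\w-]{20,}", re.I)


def redact(text: str) -> str:
    """Replace anything that looks like a long API key with a placeholder."""
    return API_KEY_PATTERN.sub(r"\1***REDACTED***", text)


class MiniJSONFormatter(logging.Formatter):
    """Hypothetical, trimmed-down cousin of JSONLogFormatter."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": redact(record.getMessage()),
        }
        # Fields passed via `extra=` become attributes on the record object.
        if hasattr(record, "tool"):
            entry["tool"] = record.tool
        return json.dumps(entry, ensure_ascii=False)


# Log into an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(MiniJSONFormatter())

demo_logger = logging.getLogger("demo-mcp.search_chunks")
demo_logger.handlers.clear()
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

demo_logger.info(
    "calling API with api_key=sk-1234567890abcdef1234567890",
    extra={"tool": "search_chunks"},
)

# Each record is a single JSON line, so it can be parsed back directly.
parsed = json.loads(buffer.getvalue())
```

Because the pattern requires at least 20 key characters, short values such as `api_key=short` pass through unchanged; that threshold trades a few missed short secrets for fewer false positives on ordinary text.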
335
generations/library_rag/mcp_tools/parsing_tools.py
Normal file
@@ -0,0 +1,335 @@
"""Parsing tools for Library RAG MCP Server.

This module implements the parse_pdf tool with optimal pre-configured parameters
for PDF ingestion into the Library RAG system.

The tool uses fixed optimal parameters:
    - llm_provider: "mistral" (API-based, fast)
    - llm_model: "mistral-medium-latest" (best quality/cost ratio)
    - use_semantic_chunking: True (LLM-based intelligent chunking)
    - use_ocr_annotations: True (3x cost but better TOC extraction)
    - ingest_to_weaviate: True (automatic vectorization and storage)

Example:
    The parse_pdf tool can be invoked via MCP with a simple path::

        {
            "tool": "parse_pdf",
            "arguments": {
                "pdf_path": "/path/to/document.pdf"
            }
        }

    Or with a URL::

        {
            "tool": "parse_pdf",
            "arguments": {
                "pdf_path": "https://example.com/document.pdf"
            }
        }
"""

from __future__ import annotations

import logging
import tempfile
from pathlib import Path
from typing import Any, Dict, Literal
from urllib.parse import urlparse

import httpx

from mcp_tools.schemas import ParsePdfInput, ParsePdfOutput

# Import pdf_pipeline for PDF processing
from utils.pdf_pipeline import process_pdf, process_pdf_bytes
from utils.types import LLMProvider

# Logger for this module
logger = logging.getLogger(__name__)

# =============================================================================
# Constants - Fixed Optimal Parameters
# =============================================================================

# LLM provider configuration (Mistral API for best results)
FIXED_LLM_PROVIDER: LLMProvider = "mistral"
FIXED_LLM_MODEL = "mistral-medium-latest"

# Processing options (optimal settings for quality)
FIXED_USE_SEMANTIC_CHUNKING = True
FIXED_USE_OCR_ANNOTATIONS = True
FIXED_INGEST_TO_WEAVIATE = True

# Additional processing flags
FIXED_USE_LLM = True
# Note: The following flags are not supported by process_pdf() and should not be used
# FIXED_CLEAN_CHUNKS = True
# FIXED_EXTRACT_CONCEPTS = True
# FIXED_VALIDATE_OUTPUT = True


# =============================================================================
# Helper Functions
# =============================================================================


def is_url(path: str) -> bool:
    """Check if a path is a URL.

    Args:
        path: The path or URL string to check.

    Returns:
        True if the path is a valid HTTP/HTTPS URL, False otherwise.

    Example:
        >>> is_url("https://example.com/doc.pdf")
        True
        >>> is_url("/path/to/doc.pdf")
        False
    """
    try:
        result = urlparse(path)
        return result.scheme in ("http", "https")
    except ValueError:
        return False


async def download_pdf(url: str, timeout: float = 60.0) -> bytes:
    """Download a PDF file from a URL.

    A response whose Content-Type is not ``application/pdf`` (and whose URL
    does not end in ``.pdf``) only triggers a warning; the bytes are still
    returned.

    Args:
        url: The URL to download from. Must be HTTP or HTTPS.
        timeout: Maximum time in seconds to wait for download.
            Defaults to 60 seconds.

    Returns:
        Raw bytes content of the downloaded PDF file.

    Raises:
        httpx.HTTPError: If the download fails (network error, HTTP error, etc.).

    Example:
        >>> pdf_bytes = await download_pdf("https://example.com/document.pdf")
        >>> len(pdf_bytes) > 0
        True
    """
    logger.info(f"Downloading PDF from: {url}")

    async with httpx.AsyncClient(timeout=timeout, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()

        content_type = response.headers.get("content-type", "")
        if "application/pdf" not in content_type.lower() and not url.lower().endswith(
            ".pdf"
        ):
            logger.warning(
                f"URL may not be a PDF (Content-Type: {content_type}), proceeding anyway"
            )

        logger.info(f"Downloaded {len(response.content)} bytes from {url}")
        return response.content


def extract_filename_from_url(url: str) -> str:
    """Extract a filename from a URL.

    Args:
        url: The URL to extract filename from.

    Returns:
        The last path component of the URL. If that component has no
        extension, ".pdf" is appended. Falls back to "downloaded.pdf"
        if no filename can be extracted.

    Example:
        >>> extract_filename_from_url("https://example.com/documents/kant.pdf")
        'kant.pdf'
        >>> extract_filename_from_url("https://example.com/api/download")
        'download.pdf'
        >>> extract_filename_from_url("https://example.com/")
        'downloaded.pdf'
    """
    parsed = urlparse(url)
    path = parsed.path

    if path:
        # Get the last path component
        filename = path.split("/")[-1]
        if filename and "." in filename:
            return filename
        if filename:
            return f"{filename}.pdf"

    return "downloaded.pdf"


# =============================================================================
# Main Tool Implementation
# =============================================================================


async def parse_pdf_handler(input_data: ParsePdfInput) -> ParsePdfOutput:
    """Process a PDF document with optimal pre-configured parameters.

    This is the main handler for the parse_pdf MCP tool. It processes PDFs
    through the Library RAG pipeline with the following fixed optimal settings:

    - LLM: Mistral API (mistral-medium-latest) for fast, high-quality processing
    - OCR: Mistral OCR with annotations (better TOC extraction, 3x cost)
    - Chunking: Semantic LLM-based chunking (argumentative units)
    - Ingestion: Automatic Weaviate vectorization and storage

    The tool accepts either a local file path or a URL. URLs are automatically
    downloaded before processing.

    Args:
        input_data: Validated input containing pdf_path (local path or URL).

    Returns:
        ParsePdfOutput containing processing results including:
            - success: Whether processing completed successfully
            - document_name: Name of the processed document
            - source_id: Unique identifier for retrieval
            - pages: Number of pages processed
            - chunks_count: Number of chunks created
            - cost_ocr: OCR cost in EUR
            - cost_llm: LLM cost in EUR
            - cost_total: Total processing cost
            - output_dir: Directory containing output files
            - metadata: Extracted document metadata
            - error: Error message if processing failed

    Example:
        >>> input_data = ParsePdfInput(pdf_path="/docs/aristotle.pdf")
        >>> result = await parse_pdf_handler(input_data)
        >>> result.success
        True
        >>> result.chunks_count > 0
        True
    """
    pdf_path = input_data.pdf_path
    logger.info(f"parse_pdf called with: {pdf_path}")

    try:
        # Determine if input is a URL or local path
        if is_url(pdf_path):
            # Download PDF from URL
            logger.info(f"Detected URL input, downloading: {pdf_path}")
            pdf_bytes = await download_pdf(pdf_path)
            filename = extract_filename_from_url(pdf_path)

            # Process from bytes
            result = process_pdf_bytes(
                file_bytes=pdf_bytes,
                filename=filename,
                output_dir=Path("output"),
                llm_provider=FIXED_LLM_PROVIDER,
                use_llm=FIXED_USE_LLM,
                llm_model=FIXED_LLM_MODEL,
                use_semantic_chunking=FIXED_USE_SEMANTIC_CHUNKING,
                use_ocr_annotations=FIXED_USE_OCR_ANNOTATIONS,
                ingest_to_weaviate=FIXED_INGEST_TO_WEAVIATE,
            )
        else:
            # Process local file
            local_path = Path(pdf_path)
            if not local_path.exists():
                logger.error(f"PDF file not found: {pdf_path}")
                return ParsePdfOutput(
                    success=False,
                    document_name="",
                    source_id="",
                    pages=0,
                    chunks_count=0,
                    cost_ocr=0.0,
                    cost_llm=0.0,
                    cost_total=0.0,
                    output_dir="",
                    metadata={},
                    error=f"PDF file not found: {pdf_path}",
                )

            logger.info(f"Processing local file: {local_path}")
            result = process_pdf(
                pdf_path=local_path,
                output_dir=Path("output"),
                use_llm=FIXED_USE_LLM,
                llm_provider=FIXED_LLM_PROVIDER,
                llm_model=FIXED_LLM_MODEL,
                use_semantic_chunking=FIXED_USE_SEMANTIC_CHUNKING,
                use_ocr_annotations=FIXED_USE_OCR_ANNOTATIONS,
                ingest_to_weaviate=FIXED_INGEST_TO_WEAVIATE,
            )

        # Convert pipeline result to output schema
        success = result.get("success", False)
        document_name = result.get("document_name", "")
        source_id = result.get("source_id", document_name)

        # Extract costs
        cost_ocr = result.get("cost_ocr", 0.0)
        cost_llm = result.get("cost_llm", 0.0)
        cost_total = result.get("cost_total", cost_ocr + cost_llm)

        # Extract metadata
        metadata_raw = result.get("metadata", {})
        if metadata_raw is None:
            metadata_raw = {}

        # Build output
        output = ParsePdfOutput(
            success=success,
            document_name=document_name,
            source_id=source_id,
            pages=result.get("pages", 0),
            chunks_count=result.get("chunks_count", 0),
            cost_ocr=cost_ocr,
            cost_llm=cost_llm,
            cost_total=cost_total,
            output_dir=str(result.get("output_dir", "")),
            metadata=metadata_raw,
            error=result.get("error"),
        )

        if success:
            logger.info(
                f"Successfully processed {document_name}: "
                f"{output.chunks_count} chunks, {output.cost_total:.4f} EUR"
            )
        else:
            logger.error(f"Failed to process {pdf_path}: {output.error}")

        return output

    except httpx.HTTPError as e:
        logger.error(f"HTTP error downloading PDF: {e}")
        return ParsePdfOutput(
            success=False,
            document_name="",
            source_id="",
            pages=0,
            chunks_count=0,
            cost_ocr=0.0,
            cost_llm=0.0,
            cost_total=0.0,
            output_dir="",
            metadata={},
            error=f"Failed to download PDF: {e}",
        )
    except Exception as e:
        logger.error(f"Error processing PDF: {e}", exc_info=True)
        return ParsePdfOutput(
            success=False,
            document_name="",
            source_id="",
            pages=0,
            chunks_count=0,
            cost_ocr=0.0,
            cost_llm=0.0,
            cost_total=0.0,
            output_dir="",
            metadata={},
            error=f"Processing error: {str(e)}",
        )
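In the handler above, `process_pdf` and `process_pdf_bytes` are called without `await` inside an `async` function, so the event loop is blocked for as long as the pipeline runs. One possible way to avoid that, sketched here under assumptions, is to hand the blocking call to a worker thread with `asyncio.to_thread` (Python 3.9+). The functions `slow_pipeline` and `heartbeat` are hypothetical stand-ins written for this example; whether the real pipeline is safe to run in a thread is not something this diff establishes.

```python
import asyncio
import time
from typing import Any, Dict, List, Tuple


def slow_pipeline(filename: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a blocking call such as process_pdf()."""
    time.sleep(0.2)  # simulate OCR / LLM work
    return {"success": True, "document_name": filename}


async def heartbeat(ticks: List[float]) -> None:
    """Record three ticks, 50 ms apart, to show other coroutines still run."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main() -> Tuple[Dict[str, Any], List[float]]:
    ticks: List[float] = []
    # asyncio.to_thread runs the blocking function in a worker thread and
    # returns an awaitable, so heartbeat() is scheduled concurrently.
    result, _ = await asyncio.gather(
        asyncio.to_thread(slow_pipeline, "kant.pdf"),
        heartbeat(ticks),
    )
    return result, ticks


result, ticks = asyncio.run(main())
```

Applied to the handler, the same pattern would wrap the pipeline call, for example `await asyncio.to_thread(process_pdf, pdf_path=local_path, ...)`, leaving the rest of the result handling unchanged.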
1552
generations/library_rag/mcp_tools/retrieval_tools.py
Normal file
File diff suppressed because it is too large
361
generations/library_rag/mcp_tools/schemas.py
Normal file
@@ -0,0 +1,361 @@
"""
Pydantic schemas for MCP tool inputs and outputs.

All schemas use strict validation and include field descriptions
for automatic JSON schema generation in MCP tool definitions.
"""

from typing import Any, Dict, List, Optional
from pydantic import BaseModel, Field


# =============================================================================
# Parsing Tool Schemas
# =============================================================================


class ParsePdfInput(BaseModel):
    """Input schema for parse_pdf tool."""

    pdf_path: str = Field(
        ...,
        description="Path to the PDF file to process, or URL to download",
        min_length=1,
    )


class ParsePdfOutput(BaseModel):
    """Output schema for parse_pdf tool."""

    success: bool = Field(..., description="Whether processing succeeded")
    document_name: str = Field(..., description="Name of the processed document")
    source_id: str = Field(..., description="Unique identifier for the document")
    pages: int = Field(..., description="Number of pages processed")
    chunks_count: int = Field(..., description="Number of chunks created")
    cost_ocr: float = Field(..., description="OCR processing cost in EUR")
    cost_llm: float = Field(..., description="LLM processing cost in EUR")
    cost_total: float = Field(..., description="Total processing cost in EUR")
    output_dir: str = Field(..., description="Directory containing output files")
    metadata: Dict[str, Any] = Field(
        default_factory=dict,
        description="Extracted metadata (title, author, language, year)",
    )
    error: Optional[str] = Field(None, description="Error message if failed")


# =============================================================================
# Retrieval Tool Schemas
# =============================================================================


class ChunkResult(BaseModel):
    """A single chunk result from search."""

    text: str = Field(..., description="Chunk text content")
    similarity: float = Field(..., description="Similarity score (0-1)")
    source_id: str = Field(..., description="Source document ID (e.g., 'peirce_collected_papers')")
    canonical_reference: Optional[str] = Field(None, description="Academic citation reference (e.g., 'CP 5.628', 'Ménon 80a')")
    section_path: str = Field(..., description="Hierarchical section path")
    chapter_title: Optional[str] = Field(None, description="Chapter title if available")
    work_title: str = Field(..., description="Title of the work")
    work_author: str = Field(..., description="Author of the work")
    order_index: int = Field(..., description="Position in document")


class SearchChunksInput(BaseModel):
    """Input schema for search_chunks tool."""

    query: str = Field(
        ...,
        description="Semantic search query",
        min_length=1,
        max_length=1000,
    )
    limit: int = Field(
        default=10,
        description="Maximum number of results to return",
        ge=1,
        le=500,
    )
    min_similarity: float = Field(
        default=0.0,
        description="Minimum similarity threshold (0-1)",
        ge=0.0,
        le=1.0,
    )
    author_filter: Optional[str] = Field(
        None,
        description="Filter by author name",
    )
    work_filter: Optional[str] = Field(
        None,
        description="Filter by work title",
    )
    language_filter: Optional[str] = Field(
        None,
        description="Filter by language code (e.g., 'fr', 'en')",
    )


class SearchChunksOutput(BaseModel):
    """Output schema for search_chunks tool."""

    results: List[ChunkResult] = Field(
        default_factory=list,
        description="List of matching chunks",
    )
    total_count: int = Field(..., description="Total number of results")
    query: str = Field(..., description="Original query")


class SummaryResult(BaseModel):
    """A single summary result from search."""

    text: str = Field(..., description="Summary text")
    similarity: float = Field(..., description="Similarity score (0-1)")
    title: str = Field(..., description="Section title")
    section_path: str = Field(..., description="Hierarchical section path")
    level: int = Field(..., description="Hierarchy level (1=chapter, 2=section, etc.)")
    concepts: List[str] = Field(default_factory=list, description="Key concepts")
    document_source_id: str = Field(..., description="Source document ID")


class SearchSummariesInput(BaseModel):
    """Input schema for search_summaries tool."""

    query: str = Field(
        ...,
        description="Semantic search query",
        min_length=1,
        max_length=1000,
    )
    limit: int = Field(
        default=10,
        description="Maximum number of results to return",
        ge=1,
        le=100,
    )
    min_level: Optional[int] = Field(
        None,
        description="Minimum hierarchy level (1=chapter)",
        ge=1,
        le=5,
    )
    max_level: Optional[int] = Field(
        None,
        description="Maximum hierarchy level",
        ge=1,
        le=5,
    )


class SearchSummariesOutput(BaseModel):
    """Output schema for search_summaries tool."""

    results: List[SummaryResult] = Field(
        default_factory=list,
        description="List of matching summaries",
    )
    total_count: int = Field(..., description="Total number of results")
    query: str = Field(..., description="Original query")


class GetDocumentInput(BaseModel):
    """Input schema for get_document tool."""

    source_id: str = Field(
        ...,
        description="Document source ID (e.g., 'platon-menon')",
        min_length=1,
    )
    include_chunks: bool = Field(
        default=False,
        description="Include document chunks in response",
    )
    chunk_limit: int = Field(
        default=50,
        description="Maximum chunks to return if include_chunks=True",
        ge=1,
        le=500,
    )


class DocumentInfo(BaseModel):
    """Document information."""

    source_id: str = Field(..., description="Unique document identifier")
    work_title: str = Field(..., description="Title of the work")
    work_author: str = Field(..., description="Author of the work")
|
||||
edition: Optional[str] = Field(None, description="Edition information")
|
||||
pages: int = Field(..., description="Number of pages")
|
||||
language: str = Field(..., description="Document language")
|
||||
toc: Optional[Dict[str, Any]] = Field(None, description="Table of contents")
|
||||
hierarchy: Optional[Dict[str, Any]] = Field(None, description="Document hierarchy")
|
||||
|
||||
|
||||
class GetDocumentOutput(BaseModel):
|
||||
"""Output schema for get_document tool."""
|
||||
|
||||
document: Optional[DocumentInfo] = Field(None, description="Document information")
|
||||
chunks: List[ChunkResult] = Field(
|
||||
default_factory=list,
|
||||
description="Document chunks (if requested)",
|
||||
)
|
||||
chunks_total: int = Field(
|
||||
default=0,
|
||||
description="Total number of chunks in document",
|
||||
)
|
||||
found: bool = Field(..., description="Whether document was found")
|
||||
error: Optional[str] = Field(None, description="Error message if not found")
|
||||
|
||||
|
||||
class ListDocumentsInput(BaseModel):
|
||||
"""Input schema for list_documents tool."""
|
||||
|
||||
author_filter: Optional[str] = Field(None, description="Filter by author name")
|
||||
work_filter: Optional[str] = Field(None, description="Filter by work title")
|
||||
language_filter: Optional[str] = Field(None, description="Filter by language code")
|
||||
limit: int = Field(
|
||||
default=50,
|
||||
description="Maximum number of results",
|
||||
ge=1,
|
||||
le=250,
|
||||
)
|
||||
offset: int = Field(
|
||||
default=0,
|
||||
description="Offset for pagination",
|
||||
ge=0,
|
||||
)
|
||||
|
||||
|
||||
class DocumentSummary(BaseModel):
|
||||
"""Summary of a document for listing."""
|
||||
|
||||
source_id: str = Field(..., description="Unique document identifier")
|
||||
work_title: str = Field(..., description="Title of the work")
|
||||
work_author: str = Field(..., description="Author of the work")
|
||||
pages: int = Field(..., description="Number of pages")
|
||||
chunks_count: int = Field(..., description="Number of chunks")
|
||||
language: str = Field(..., description="Document language")
|
||||
|
||||
|
||||
class ListDocumentsOutput(BaseModel):
|
||||
"""Output schema for list_documents tool."""
|
||||
|
||||
documents: List[DocumentSummary] = Field(
|
||||
default_factory=list,
|
||||
description="List of documents",
|
||||
)
|
||||
total_count: int = Field(..., description="Total number of documents")
|
||||
limit: int = Field(..., description="Applied limit")
|
||||
offset: int = Field(..., description="Applied offset")
|
||||
|
||||
|
||||
class GetChunksByDocumentInput(BaseModel):
|
||||
"""Input schema for get_chunks_by_document tool."""
|
||||
|
||||
source_id: str = Field(
|
||||
...,
|
||||
description="Document source ID",
|
||||
min_length=1,
|
||||
)
|
||||
limit: int = Field(
|
||||
default=50,
|
||||
description="Maximum number of chunks to return",
|
||||
ge=1,
|
||||
le=500,
|
||||
)
|
||||
offset: int = Field(
|
||||
default=0,
|
||||
description="Offset for pagination",
|
||||
ge=0,
|
||||
)
|
||||
section_filter: Optional[str] = Field(
|
||||
None,
|
||||
description="Filter by section path prefix",
|
||||
)
|
||||
|
||||
|
||||
class GetChunksByDocumentOutput(BaseModel):
|
||||
"""Output schema for get_chunks_by_document tool."""
|
||||
|
||||
chunks: List[ChunkResult] = Field(
|
||||
default_factory=list,
|
||||
description="Ordered list of chunks",
|
||||
)
|
||||
total_count: int = Field(..., description="Total chunks in document")
|
||||
document_source_id: str = Field(..., description="Document source ID")
|
||||
limit: int = Field(..., description="Applied limit")
|
||||
offset: int = Field(..., description="Applied offset")
|
||||
|
||||
|
||||
class WorkInfo(BaseModel):
|
||||
"""Information about a work."""
|
||||
|
||||
title: str = Field(..., description="Work title")
|
||||
author: str = Field(..., description="Author name")
|
||||
year: Optional[int] = Field(None, description="Publication year")
|
||||
language: str = Field(..., description="Language code")
|
||||
genre: Optional[str] = Field(None, description="Genre classification")
|
||||
|
||||
|
||||
class AuthorWorkResult(BaseModel):
|
||||
"""Work with its documents for author filtering."""
|
||||
|
||||
work: WorkInfo = Field(..., description="Work information")
|
||||
documents: List[DocumentSummary] = Field(
|
||||
default_factory=list,
|
||||
description="Documents for this work",
|
||||
)
|
||||
total_chunks: int = Field(..., description="Total chunks across all documents")
|
||||
|
||||
|
||||
class FilterByAuthorInput(BaseModel):
|
||||
"""Input schema for filter_by_author tool."""
|
||||
|
||||
author: str = Field(
|
||||
...,
|
||||
description="Author name to search for",
|
||||
min_length=1,
|
||||
)
|
||||
include_chunk_counts: bool = Field(
|
||||
default=True,
|
||||
description="Include chunk counts in results",
|
||||
)
|
||||
|
||||
|
||||
class FilterByAuthorOutput(BaseModel):
|
||||
"""Output schema for filter_by_author tool."""
|
||||
|
||||
author: str = Field(..., description="Searched author name")
|
||||
works: List[AuthorWorkResult] = Field(
|
||||
default_factory=list,
|
||||
description="Works by this author",
|
||||
)
|
||||
total_works: int = Field(..., description="Total number of works")
|
||||
total_documents: int = Field(..., description="Total number of documents")
|
||||
total_chunks: int = Field(..., description="Total number of chunks")
|
||||
|
||||
|
||||
class DeleteDocumentInput(BaseModel):
|
||||
"""Input schema for delete_document tool."""
|
||||
|
||||
source_id: str = Field(
|
||||
...,
|
||||
description="Document source ID to delete",
|
||||
min_length=1,
|
||||
)
|
||||
confirm: bool = Field(
|
||||
default=False,
|
||||
description="Must be True to confirm deletion",
|
||||
)
|
||||
|
||||
|
||||
class DeleteDocumentOutput(BaseModel):
|
||||
"""Output schema for delete_document tool."""
|
||||
|
||||
success: bool = Field(..., description="Whether deletion succeeded")
|
||||
source_id: str = Field(..., description="Deleted document source ID")
|
||||
chunks_deleted: int = Field(..., description="Number of chunks deleted")
|
||||
summaries_deleted: int = Field(..., description="Number of summaries deleted")
|
||||
error: Optional[str] = Field(None, description="Error message if failed")
|
||||
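The input schemas above rely on Pydantic `Field` constraints (`min_length`, `ge`, `le`) to reject malformed tool arguments before any query runs. A minimal sketch of that behaviour, assuming Pydantic is installed: the model below is a trimmed copy of `SearchChunksInput`, redefined here only so the snippet runs on its own, and `try_parse` is a hypothetical helper that is not part of the repository.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class SearchChunksInput(BaseModel):
    """Trimmed copy of the schema above, for illustration only."""

    query: str = Field(..., min_length=1, max_length=1000)
    limit: int = Field(default=10, ge=1, le=500)
    min_similarity: float = Field(default=0.0, ge=0.0, le=1.0)
    author_filter: Optional[str] = Field(None)


def try_parse(**kwargs: object) -> Optional[SearchChunksInput]:
    """Hypothetical helper: return the parsed input, or None if validation fails."""
    try:
        return SearchChunksInput(**kwargs)
    except ValidationError as exc:
        # Each error entry names the offending field in its 'loc' tuple.
        print(f"rejected: {[err['loc'] for err in exc.errors()]}")
        return None


ok = try_parse(query="What is virtue?", author_filter="Platon")
too_many = try_parse(query="virtue", limit=501)  # violates le=500
empty_query = try_parse(query="")                # violates min_length=1
```

Valid input comes back with its defaults filled in (`limit=10`, `min_similarity=0.0`), while out-of-range values never reach the search code.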
112
generations/library_rag/mypy.ini
Normal file
@@ -0,0 +1,112 @@
|
||||
[mypy]
|
||||
# Library RAG - Strict Type Checking Configuration
|
||||
# This configuration enforces strict type safety across all modules.
|
||||
|
||||
# Python version
|
||||
python_version = 3.10
|
||||
|
||||
# Strict mode settings
|
||||
strict = True
|
||||
|
||||
# These are implied by strict=True, but listed explicitly for clarity:
|
||||
check_untyped_defs = True
|
||||
disallow_untyped_defs = True
|
||||
disallow_incomplete_defs = True
|
||||
disallow_untyped_calls = True
|
||||
disallow_untyped_decorators = True
|
||||
disallow_any_generics = True
|
||||
disallow_subclassing_any = True
|
||||
|
||||
# Warning settings
|
||||
warn_return_any = True
|
||||
warn_redundant_casts = True
|
||||
warn_unused_ignores = True
|
||||
warn_unused_configs = True
|
||||
warn_unreachable = True
|
||||
|
||||
# Strictness settings
|
||||
strict_equality = True
|
||||
strict_optional = True
|
||||
no_implicit_optional = True
|
||||
no_implicit_reexport = True
|
||||
|
||||
# Error reporting
|
||||
show_error_codes = True
|
||||
show_column_numbers = True
|
||||
show_error_context = True
|
||||
pretty = True
|
||||
|
||||
# Cache settings
|
||||
cache_dir = .mypy_cache
|
||||
incremental = True
|
||||
|
||||
# Exclude legacy directories and utility scripts from type checking
|
||||
exclude = (?x)(
|
||||
^utils2/
|
||||
| ^tests/utils2/
|
||||
| ^toutweaviate\.py$
|
||||
| ^query_test\.py$
|
||||
| ^update_docstrings\.py$
|
||||
| ^schema_v2\.py$
|
||||
)
|
||||
|
||||
# =============================================================================
|
||||
# Per-module overrides for gradual migration
|
||||
# =============================================================================
|
||||
# These overrides allow gradual adoption of strict typing.
|
||||
# Remove these sections as modules are fully typed.
|
||||
|
||||
# Third-party libraries without stubs
|
||||
[mypy-weaviate.*]
|
||||
ignore_missing_imports = True
|
||||
|
||||
[mypy-mistralai.*]
|
||||
ignore_missing_imports = True
|
||||
|
||||
[mypy-werkzeug.*]
|
||||
ignore_missing_imports = True
|
||||
|
||||
[mypy-requests]
|
||||
ignore_missing_imports = True
|
||||
|
||||
[mypy-ollama.*]
|
||||
ignore_missing_imports = True
|
||||
|
||||
# =============================================================================
|
||||
# Legacy modules - excluded from strict typing
|
||||
# =============================================================================
|
||||
# These modules are legacy code that will be typed in future issues.
|
||||
|
||||
# utils2/ - Legacy module directory (to be deprecated)
|
||||
[mypy-utils2.*]
|
||||
ignore_errors = True
|
||||
|
||||
# tests/utils2/ - Tests for legacy modules
|
||||
[mypy-tests.utils2.*]
|
||||
ignore_errors = True
|
||||
|
||||
# Standalone utility scripts
|
||||
[mypy-toutweaviate]
|
||||
ignore_errors = True
|
||||
|
||||
[mypy-query_test]
|
||||
ignore_errors = True
|
||||
|
||||
[mypy-update_docstrings]
|
||||
ignore_errors = True
|
||||
|
||||
# =============================================================================
|
||||
# Modules with relaxed typing (gradual migration)
|
||||
# =============================================================================
|
||||
|
||||
# llm_structurer.py - Complex legacy module with threading
|
||||
[mypy-utils.llm_structurer]
|
||||
disallow_untyped_defs = False
|
||||
disallow_untyped_calls = False
|
||||
warn_return_any = False
|
||||
check_untyped_defs = False
|
||||
warn_unreachable = False
|
||||
|
||||
# llm_classifier.py - Uses lowercase dict syntax
|
||||
[mypy-utils.llm_classifier]
|
||||
warn_return_any = False
|
||||
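The `exclude` option in this file is a verbose-mode (`(?x)`) regular expression, so whitespace and newlines inside the pattern are ignored and each `|` branch is an alternative. A small sketch of how such a pattern behaves, using Python's `re` module directly; `is_excluded` and the sample paths are hypothetical, and mypy's own path handling may differ in detail.

```python
import re

# Same pattern as the `exclude` option in mypy.ini, written as a Python string.
EXCLUDE = re.compile(
    r"""(?x)(
        ^utils2/
        | ^tests/utils2/
        | ^toutweaviate\.py$
        | ^query_test\.py$
        | ^update_docstrings\.py$
        | ^schema_v2\.py$
    )"""
)


def is_excluded(path: str) -> bool:
    """Hypothetical helper: True if a repo-relative path matches the pattern."""
    return EXCLUDE.search(path) is not None


for path in [
    "utils2/old.py",
    "tests/utils2/test_old.py",
    "schema_v2.py",
    "schema.py",
    "utils/llm_structurer.py",
]:
    status = "excluded" if is_excluded(path) else "checked"
    print(f"{path}: {status}")
```

Note that `utils/llm_structurer.py` is not excluded: it is still type-checked, only with the relaxed per-module settings listed in its own section.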
0
generations/library_rag/output/.gitkeep
Normal file
6
generations/library_rag/pytest.ini
Normal file
@@ -0,0 +1,6 @@
|
||||
[pytest]
|
||||
testpaths = tests
|
||||
python_files = test_*.py
|
||||
python_classes = Test*
|
||||
python_functions = test_*
|
||||
addopts = -v --tb=short
|
||||
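With these settings pytest collects files named `test_*.py` under `tests/`, classes named `Test*`, and functions named `test_*`. A rough sketch of the name matching only, using `fnmatchcase` from the standard library; `would_collect` and the sample names are hypothetical, and pytest's real collection applies further rules (for example, test classes must not define `__init__`).

```python
from fnmatch import fnmatchcase

# Patterns copied from pytest.ini.
PYTHON_FILES = "test_*.py"
PYTHON_CLASSES = "Test*"
PYTHON_FUNCTIONS = "test_*"


def would_collect(filename: str, class_name: str, function_name: str) -> bool:
    """Hypothetical helper: True if all three names match the configured patterns."""
    return (
        fnmatchcase(filename, PYTHON_FILES)
        and fnmatchcase(class_name, PYTHON_CLASSES)
        and fnmatchcase(function_name, PYTHON_FUNCTIONS)
    )


print(would_collect("test_schema.py", "TestSchema", "test_create"))
print(would_collect("schema_test.py", "TestSchema", "test_create"))
```

A file called `schema_test.py` would be skipped silently, which is a common reason for a test suite reporting fewer tests than expected.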
403
generations/library_rag/rag-philo-charte.css
Normal file
@@ -0,0 +1,403 @@
|
||||
|
||||
/* =========================================================
|
||||
Style guide: RAG Philosophy site
Ready-to-use CSS file
Beige palette + soft contrasts
Typography: DM Sans (headings) + Lato (body text)
|
||||
========================================================= */
|
||||
|
||||
/* -------------------------
|
||||
1. Basic reset
|
||||
------------------------- */
|
||||
|
||||
*,
|
||||
*::before,
|
||||
*::after {
|
||||
box-sizing: border-box;
|
||||
margin: 0;
|
||||
padding: 0;
|
||||
}
|
||||
|
||||
html {
|
||||
font-size: 16px;
|
||||
scroll-behavior: smooth;
|
||||
}
|
||||
|
||||
body {
|
||||
font-family: 'Lato', system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
|
||||
background-color: #F8F4EE; /* light beige */
|
||||
color: #2B2B2B; /* charcoal gray */
|
||||
line-height: 1.6;
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
2. CSS variables
|
||||
------------------------- */
|
||||
|
||||
:root {
|
||||
--color-bg-main: #F8F4EE;
|
||||
--color-bg-secondary: #EAE5E0;
|
||||
--color-text-main: #2B2B2B;
|
||||
--color-text-strong: #1F1F1F;
|
||||
--color-accent: #7D6E58;
|
||||
--color-accent-alt: #556B63;
|
||||
|
||||
--max-content-width: 1100px;
|
||||
|
||||
--font-title: 'DM Sans', system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
|
||||
--font-body: 'Lato', system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
3. Main container
|
||||
------------------------- */
|
||||
|
||||
.wrapper {
|
||||
max-width: var(--max-content-width);
|
||||
margin: 0 auto;
|
||||
padding: 1.5rem 1.5rem;
|
||||
}
|
||||
|
||||
@media (min-width: 992px) {
|
||||
.wrapper {
|
||||
padding: 2.5rem 3rem;
|
||||
}
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
4. Headings & text
|
||||
------------------------- */
|
||||
|
||||
h1, h2, h3, h4, h5, h6 {
|
||||
font-family: var(--font-title);
|
||||
color: var(--color-text-strong);
|
||||
margin-bottom: 0.75rem;
|
||||
}
|
||||
|
||||
h1 {
|
||||
font-size: 2.5rem;
|
||||
font-weight: 700;
|
||||
letter-spacing: 0.02em;
|
||||
}
|
||||
|
||||
h2 {
|
||||
font-size: 2rem;
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
h3 {
|
||||
font-size: 1.5rem;
|
||||
font-weight: 500;
|
||||
}
|
||||
|
||||
p {
|
||||
margin-bottom: 1.5rem;
|
||||
font-family: var(--font-body);
|
||||
font-size: 1rem;
|
||||
color: var(--color-text-main);
|
||||
}
|
||||
|
||||
strong {
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
/* Long-form text (e.g., article) */
|
||||
|
||||
.article-body {
|
||||
line-height: 1.65;
|
||||
font-size: 1rem;
|
||||
}
|
||||
|
||||
/* Intro / lead paragraph */
|
||||
|
||||
.lead {
|
||||
font-size: 1.1rem;
|
||||
font-style: italic;
|
||||
color: var(--color-accent-alt);
|
||||
margin-bottom: 2rem;
|
||||
}
|
||||
|
||||
/* Captions / annotations */
|
||||
|
||||
.caption {
|
||||
font-size: 0.9rem;
|
||||
font-weight: 300;
|
||||
color: var(--color-accent);
|
||||
margin-top: 0.25rem;
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
5. Links
|
||||
------------------------- */
|
||||
|
||||
a {
|
||||
color: var(--color-accent);
|
||||
text-decoration: none;
|
||||
transition: color 0.2s ease, text-decoration-color 0.2s ease;
|
||||
}
|
||||
|
||||
a:hover,
|
||||
a:focus {
|
||||
text-decoration: underline;
|
||||
text-decoration-thickness: 1px;
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
6. Buttons
|
||||
------------------------- */
|
||||
|
||||
.btn {
|
||||
display: inline-flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
padding: 0.65rem 1.6rem;
|
||||
border-radius: 8px;
|
||||
border: 1.5px solid var(--color-accent);
|
||||
background-color: transparent;
|
||||
color: var(--color-accent);
|
||||
font-family: var(--font-body);
|
||||
font-size: 0.95rem;
|
||||
font-weight: 500;
|
||||
cursor: pointer;
|
||||
text-decoration: none;
|
||||
transition: background-color 0.2s ease, color 0.2s ease, border-color 0.2s ease;
|
||||
}
|
||||
|
||||
.btn:hover,
|
||||
.btn:focus {
|
||||
background-color: var(--color-accent);
|
||||
color: var(--color-bg-main);
|
||||
}
|
||||
|
||||
/* Filled variant */
|
||||
|
||||
.btn-primary {
|
||||
background-color: var(--color-accent);
|
||||
color: var(--color-bg-main);
|
||||
}
|
||||
|
||||
.btn-primary:hover,
|
||||
.btn-primary:focus {
|
||||
background-color: var(--color-accent-alt);
|
||||
border-color: var(--color-accent-alt);
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
7. Blocks & sections
|
||||
------------------------- */
|
||||
|
||||
.section {
|
||||
padding: 3.5rem 0;
|
||||
}
|
||||
|
||||
.section--alt {
|
||||
background-color: var(--color-bg-secondary);
|
||||
}
|
||||
|
||||
/* Generic cards / boxes */
|
||||
|
||||
.card {
|
||||
background-color: #FFFFFF0F;
|
||||
border-radius: 12px;
|
||||
padding: 1.75rem;
|
||||
border: 1px solid rgba(125, 110, 88, 0.25);
|
||||
box-shadow: 0 14px 40px rgba(0, 0, 0, 0.03);
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
8. Quotes & callout boxes
|
||||
------------------------- */
|
||||
|
||||
blockquote {
|
||||
margin: 2.5rem 0;
|
||||
padding: 1.75rem 2rem;
|
||||
background-color: var(--color-bg-secondary);
|
||||
border-left: 3px solid var(--color-accent);
|
||||
font-style: italic;
|
||||
color: var(--color-accent-alt);
|
||||
}
|
||||
|
||||
blockquote p {
|
||||
margin-bottom: 0;
|
||||
}
|
||||
|
||||
.quote-box {
|
||||
margin: 2.5rem 0;
|
||||
padding: 1.75rem 2rem;
|
||||
background-color: var(--color-bg-secondary);
|
||||
border: 1px solid var(--color-accent);
|
||||
border-radius: 10px;
|
||||
font-style: italic;
|
||||
color: var(--color-accent-alt);
|
||||
}
|
||||
|
||||
/* Quote attribution */
|
||||
|
||||
.quote-author {
|
||||
margin-top: 1rem;
|
||||
font-style: normal;
|
||||
font-size: 0.9rem;
|
||||
color: var(--color-accent);
|
||||
text-align: right;
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
9. Layout for RAG pages
|
||||
------------------------- */
|
||||
|
||||
/* Typical layout for a search / results page */
|
||||
|
||||
.rag-layout {
|
||||
display: grid;
|
||||
gap: 2rem;
|
||||
}
|
||||
|
||||
@media (min-width: 992px) {
|
||||
.rag-layout {
|
||||
grid-template-columns: minmax(0, 2fr) minmax(260px, 1fr);
|
||||
align-items: flex-start;
|
||||
}
|
||||
}
|
||||
|
||||
/* Main content column */
|
||||
|
||||
.rag-main {
|
||||
background-color: #FFFFFF0A;
|
||||
border-radius: 12px;
|
||||
padding: 2rem;
|
||||
border: 1px solid rgba(125, 110, 88, 0.2);
|
||||
}
|
||||
|
||||
/* Sidebar column: context, filters, sources */
|
||||
|
||||
.rag-sidebar {
|
||||
background-color: var(--color-bg-secondary);
|
||||
border-radius: 12px;
|
||||
padding: 1.75rem;
|
||||
border: 1px solid rgba(125, 110, 88, 0.25);
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
10. RAG-specific elements
|
||||
------------------------- */
|
||||
|
||||
/* User question area */
|
||||
|
||||
.rag-query {
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
|
||||
.rag-query-label {
|
||||
font-family: var(--font-title);
|
||||
font-size: 0.9rem;
|
||||
letter-spacing: 0.08em;
|
||||
text-transform: uppercase;
|
||||
color: var(--color-accent-alt);
|
||||
margin-bottom: 0.4rem;
|
||||
}
|
||||
|
||||
.rag-query-input {
|
||||
width: 100%;
|
||||
padding: 0.8rem 1rem;
|
||||
border-radius: 8px;
|
||||
border: 1px solid rgba(0, 0, 0, 0.08);
|
||||
background-color: #fff;
|
||||
font-family: var(--font-body);
|
||||
font-size: 0.95rem;
|
||||
}
|
||||
|
||||
/* RAG results */
|
||||
|
||||
.rag-result {
|
||||
padding: 1.5rem 0;
|
||||
border-bottom: 1px solid rgba(0, 0, 0, 0.06);
|
||||
}
|
||||
|
||||
.rag-result:last-child {
|
||||
border-bottom: none;
|
||||
}
|
||||
|
||||
.rag-result-title {
|
||||
font-family: var(--font-title);
|
||||
font-size: 1.2rem;
|
||||
font-weight: 600;
|
||||
color: var(--color-text-strong);
|
||||
margin-bottom: 0.5rem;
|
||||
}
|
||||
|
||||
.rag-result-meta {
|
||||
font-size: 0.85rem;
|
||||
color: var(--color-accent);
|
||||
margin-bottom: 0.6rem;
|
||||
}
|
||||
|
||||
.rag-result-snippet {
|
||||
font-size: 0.95rem;
|
||||
color: var(--color-text-main);
|
||||
}
|
||||
|
||||
/* Badges (tags, sources, document type, etc.) */
|
||||
|
||||
.badge {
|
||||
display: inline-flex;
|
||||
align-items: center;
|
||||
padding: 0.25rem 0.6rem;
|
||||
border-radius: 999px;
|
||||
border: 1px solid rgba(85, 107, 99, 0.4);
|
||||
font-size: 0.75rem;
|
||||
color: var(--color-accent-alt);
|
||||
margin-right: 0.4rem;
|
||||
margin-bottom: 0.3rem;
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
11. Header & footer
|
||||
------------------------- */
|
||||
|
||||
.site-header {
|
||||
padding: 1.5rem 0;
|
||||
border-bottom: 1px solid rgba(0, 0, 0, 0.04);
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
|
||||
.site-title {
|
||||
font-family: var(--font-title);
|
||||
font-size: 1.4rem;
|
||||
font-weight: 600;
|
||||
letter-spacing: 0.06em;
|
||||
text-transform: uppercase;
|
||||
color: var(--color-text-strong);
|
||||
}
|
||||
|
||||
.site-subtitle {
|
||||
font-family: var(--font-body);
|
||||
font-size: 0.9rem;
|
||||
color: var(--color-accent-alt);
|
||||
margin-top: 0.2rem;
|
||||
}
|
||||
|
||||
.site-footer {
|
||||
margin-top: 4rem;
|
||||
padding: 2rem 0 1.5rem;
|
||||
border-top: 1px solid rgba(0, 0, 0, 0.04);
|
||||
font-size: 0.85rem;
|
||||
color: rgba(43, 43, 43, 0.7);
|
||||
}
|
||||
|
||||
/* -------------------------
|
||||
12. Utilities
|
||||
------------------------- */
|
||||
|
||||
.mt-1 { margin-top: 0.5rem; }
|
||||
.mt-2 { margin-top: 1rem; }
|
||||
.mt-3 { margin-top: 1.5rem; }
|
||||
.mt-4 { margin-top: 2rem; }
|
||||
|
||||
.mb-1 { margin-bottom: 0.5rem; }
|
||||
.mb-2 { margin-bottom: 1rem; }
|
||||
.mb-3 { margin-bottom: 1.5rem; }
|
||||
.mb-4 { margin-bottom: 2rem; }
|
||||
|
||||
.text-center { text-align: center; }
|
||||
.text-right { text-align: right; }
|
||||
.text-muted { color: rgba(43, 43, 43, 0.7); }
|
||||
27
generations/library_rag/requirements.txt
Normal file
@@ -0,0 +1,27 @@
|
||||
# Core dependencies
|
||||
weaviate-client>=4.18.0
|
||||
flask>=3.0.0
|
||||
mistralai>=1.0.0
|
||||
python-dotenv>=1.0.0
|
||||
requests>=2.31.0
|
||||
werkzeug>=3.0.0
|
||||
|
||||
# MCP Server dependencies
|
||||
mcp>=1.0.0
|
||||
pydantic>=2.0.0
|
||||
|
||||
# Type checking and static analysis
|
||||
mypy>=1.8.0
|
||||
# Flask >= 2.0 ships inline type hints, so no separate Flask stub package is needed
|
||||
types-requests>=2.31.0
|
||||
|
||||
# Documentation validation
|
||||
pydocstyle>=6.3.0
|
||||
|
||||
# Testing dependencies
|
||||
pytest>=7.0.0
|
||||
pytest-asyncio>=0.23.0
|
||||
httpx>=0.25.0
|
||||
|
||||
# MCP Client (for Python agent applications)
|
||||
anthropic>=0.39.0 # Claude API
|
||||
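Every entry in this file is a lower bound of the form `name>=version`. A simplified sketch of reading those minimums into comparable tuples, using only the standard library; `parse_minimums` is a hypothetical helper, and it deliberately handles only the `>=` form used here rather than the full requirement-specifier grammar.

```python
import re
from typing import Dict, Tuple

# Matches only the simple "name>=1.2.3" form used in this requirements file.
REQUIREMENT = re.compile(r"^([A-Za-z0-9_.-]+)>=([0-9]+(?:\.[0-9]+)*)")


def parse_minimums(text: str) -> Dict[str, Tuple[int, ...]]:
    """Hypothetical helper: map package name to its minimum version tuple."""
    minimums: Dict[str, Tuple[int, ...]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = REQUIREMENT.match(line)
        if match:
            name, version = match.groups()
            minimums[name.lower()] = tuple(int(p) for p in version.split("."))
    return minimums


sample = """
# Core dependencies
weaviate-client>=4.18.0
flask>=3.0.0
anthropic>=0.39.0  # Claude API
"""
mins = parse_minimums(sample)
print(mins)
```

Tuples compare element by element, so `(4, 18, 0) > (4, 9, 0)` holds even though the string `"4.18.0"` sorts before `"4.9.0"`.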
535
generations/library_rag/schema.py
Normal file
@@ -0,0 +1,535 @@
|
||||
"""Weaviate schema definition for Library RAG - Philosophical Texts Database.
|
||||
|
||||
This module defines and manages the Weaviate vector database schema for the
|
||||
Library RAG application. It provides functions to create, verify, and display
|
||||
the schema configuration for indexing and searching philosophical texts.
|
||||
|
||||
Schema Architecture:
|
||||
The schema follows a normalized design with denormalized nested objects
|
||||
for efficient querying. The hierarchy is::
|
||||
|
||||
Work (metadata only)
|
||||
└── Document (edition/translation instance)
|
||||
├── Chunk (vectorized text fragments)
|
||||
└── Summary (vectorized chapter summaries)
|
||||
|
||||
Collections:
|
||||
**Work** (no vectorization):
|
||||
Represents a philosophical or scholarly work (e.g., Plato's Meno).
|
||||
Stores canonical metadata: title, author, year, language, genre.
|
||||
Not vectorized - used only for metadata and relationships.
|
||||
|
||||
**Document** (no vectorization):
|
||||
Represents a specific edition or translation of a Work.
|
||||
Contains: sourceId, edition, language, pages, TOC, hierarchy.
|
||||
Includes nested Work reference for denormalized access.
|
||||
|
||||
**Chunk** (vectorized with text2vec-transformers):
|
||||
Text fragments optimized for semantic search (200-800 chars).
|
||||
Vectorized fields: text, keywords.
|
||||
Non-vectorized fields: sectionPath, chapterTitle, unitType, orderIndex.
|
||||
Includes nested Document and Work references.
|
||||
|
||||
**Summary** (vectorized with text2vec-transformers):
|
||||
LLM-generated chapter/section summaries for high-level search.
|
||||
Vectorized fields: text, concepts.
|
||||
Includes nested Document reference.
|
||||
|
||||
Vectorization Strategy:
|
||||
- Only Chunk.text, Chunk.keywords, Summary.text, and Summary.concepts are vectorized
|
||||
- Uses text2vec-transformers (BAAI/bge-m3 with 1024-dim via Docker)
|
||||
- Metadata fields use skip_vectorization=True for filtering only
|
||||
- Work and Document collections have no vectorizer (metadata only)
|
||||
|
||||
Migration Note (2024-12):
|
||||
Migrated from MiniLM-L6 (384-dim) to BAAI/bge-m3 (1024-dim) for:
|
||||
- 2.7x more embedding dimensions (1024 vs 384) for a richer semantic representation
|
||||
- 8192 token context (vs 512)
|
||||
- Superior multilingual support (Greek, Latin, French, English)
|
||||
- Better performance on philosophical/academic texts
|
||||
|
||||
Nested Objects:
|
||||
Instead of using Weaviate cross-references, we use nested objects for
|
||||
denormalized data access. This allows single-query retrieval of chunk
|
||||
data with its Work/Document metadata without joins::
|
||||
|
||||
Chunk.work = {title, author}
|
||||
Chunk.document = {sourceId, edition}
|
||||
Document.work = {title, author}
|
||||
Summary.document = {sourceId}
|
||||
|
||||
Usage:
|
||||
From command line::
|
||||
|
||||
$ python schema.py
|
||||
|
||||
Programmatically::
|
||||
|
||||
import weaviate
|
||||
from schema import create_schema, verify_schema
|
||||
|
||||
with weaviate.connect_to_local() as client:
|
||||
create_schema(client, delete_existing=True)
|
||||
verify_schema(client)
|
||||
|
||||
Check existing schema::
|
||||
|
||||
from schema import display_schema
|
||||
with weaviate.connect_to_local() as client:
|
||||
display_schema(client)
|
||||
|
||||
Dependencies:
|
||||
- Weaviate Python client v4+
|
||||
- Running Weaviate instance with text2vec-transformers module
|
||||
- Docker Compose setup from docker-compose.yml
|
||||
|
||||
See Also:
|
||||
- utils/weaviate_ingest.py : Functions to ingest data into this schema
|
||||
- utils/types.py : TypedDict definitions matching schema structure
|
||||
- docker-compose.yml : Weaviate + transformers container setup
|
||||
"""
|
||||
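The nested-object design described in the module docstring means each chunk carries a small copy of its Work and Document metadata, so a result can be displayed or filtered without a second lookup. A plain-Python sketch of that shape, with no Weaviate calls; the sample values (including the second work and its `sourceId`) and the `chunks_by_author` helper are hypothetical.

```python
from typing import Any, Dict, List

# Two chunk payloads shaped like the Chunk collection: flat fields plus
# nested `work` and `document` objects holding denormalized metadata.
chunks: List[Dict[str, Any]] = [
    {
        "text": "(chunk text)",
        "orderIndex": 0,
        "work": {"title": "Ménon", "author": "Platon"},
        "document": {"sourceId": "platon-menon", "edition": "trad. Cousin"},
    },
    {
        "text": "(chunk text)",
        "orderIndex": 0,
        "work": {"title": "Collected Papers", "author": "Peirce"},
        "document": {"sourceId": "peirce-cp", "edition": "(edition)"},
    },
]


def chunks_by_author(items: List[Dict[str, Any]], author: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: filter on the nested work.author value, no join needed."""
    return [chunk for chunk in items if chunk["work"]["author"] == author]


for chunk in chunks_by_author(chunks, "Platon"):
    print(chunk["work"]["title"], "/", chunk["document"]["sourceId"])
```

The trade-off is the usual one for denormalization: reads are a single query, but a change to a work's title or author has to be written to every chunk that embeds it.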
|
||||
import sys
|
||||
from typing import List, Set
|
||||
|
||||
import weaviate
|
||||
import weaviate.classes.config as wvc
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Schema Creation Functions
|
||||
# =============================================================================
|
||||
|
||||
|
||||
def create_work_collection(client: weaviate.WeaviateClient) -> None:
|
||||
"""Create the Work collection for philosophical works metadata.
|
||||
|
||||
Args:
|
||||
client: Connected Weaviate client.
|
||||
|
||||
Note:
|
||||
This collection has no vectorization - used only for metadata.
|
||||
"""
|
||||
client.collections.create(
|
||||
name="Work",
|
||||
description="A philosophical or scholarly work (e.g., Meno, Republic, Apology).",
|
||||
vectorizer_config=wvc.Configure.Vectorizer.none(),
|
||||
properties=[
|
||||
wvc.Property(
|
||||
name="title",
|
||||
description="Title of the work.",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="author",
|
||||
description="Author of the work.",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="originalTitle",
|
||||
description="Original title in source language (optional).",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="year",
|
||||
description="Year of composition or publication (negative for BCE).",
|
||||
data_type=wvc.DataType.INT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="language",
|
||||
description="Original language (e.g., 'gr', 'la', 'fr').",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="genre",
|
||||
description="Genre or type (e.g., 'dialogue', 'treatise', 'commentary').",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
],
|
||||
)
|
||||
|
||||
|
||||
def create_document_collection(client: weaviate.WeaviateClient) -> None:
|
||||
"""Create the Document collection for edition/translation instances.
|
||||
|
||||
Args:
|
||||
client: Connected Weaviate client.
|
||||
|
||||
Note:
|
||||
Contains nested Work reference for denormalized access.
|
||||
"""
|
||||
client.collections.create(
|
||||
name="Document",
|
||||
description="A specific edition or translation of a work (PDF, ebook, etc.).",
|
||||
vectorizer_config=wvc.Configure.Vectorizer.none(),
|
||||
properties=[
|
||||
wvc.Property(
|
||||
name="sourceId",
|
||||
description="Unique identifier for this document (filename without extension).",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="edition",
|
||||
description="Edition or translator (e.g., 'trad. Cousin', 'Loeb Classical Library').",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="language",
|
||||
description="Language of this edition (e.g., 'fr', 'en').",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="pages",
|
||||
description="Number of pages in the PDF/document.",
|
||||
data_type=wvc.DataType.INT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="chunksCount",
|
||||
description="Total number of chunks extracted from this document.",
|
||||
data_type=wvc.DataType.INT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="toc",
|
||||
description="Table of contents as JSON string [{title, level, page}, ...].",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="hierarchy",
|
||||
description="Full hierarchical structure as JSON string.",
|
||||
data_type=wvc.DataType.TEXT,
|
||||
),
|
||||
wvc.Property(
|
||||
name="createdAt",
|
||||
description="Timestamp when this document was ingested.",
|
||||
data_type=wvc.DataType.DATE,
|
||||
),
|
||||
# Nested Work reference
|
||||
wvc.Property(
|
||||
name="work",
|
||||
description="Reference to the Work this document is an instance of.",
|
||||
data_type=wvc.DataType.OBJECT,
|
||||
nested_properties=[
|
||||
wvc.Property(name="title", data_type=wvc.DataType.TEXT),
|
||||
wvc.Property(name="author", data_type=wvc.DataType.TEXT),
|
||||
],
|
||||
),
|
||||
],
|
||||
)


def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
    """Create the Chunk collection for vectorized text fragments.

    Args:
        client: Connected Weaviate client.

    Note:
        Uses text2vec-transformers for vectorizing 'text' and 'keywords' fields.
        Other fields have skip_vectorization=True for filtering only.
    """
    client.collections.create(
        name="Chunk",
        description="A text chunk (paragraph, argument, etc.) vectorized for semantic search.",
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        properties=[
            # Main content (vectorized)
            wvc.Property(
                name="text",
                description="The text content to be vectorized (200-800 chars optimal).",
                data_type=wvc.DataType.TEXT,
            ),
            # Hierarchical context (not vectorized, for filtering)
            wvc.Property(
                name="sectionPath",
                description="Full hierarchical path (e.g., 'Présentation > Qu'est-ce que la vertu?').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="sectionLevel",
                description="Depth in hierarchy (1=top-level, 2=subsection, etc.).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="chapterTitle",
                description="Title of the top-level chapter/section.",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="canonicalReference",
                description="Canonical academic reference (e.g., 'CP 1.628', 'Ménon 80a').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            # Classification (not vectorized, for filtering)
            wvc.Property(
                name="unitType",
                description="Type of logical unit (main_content, argument, exposition, transition, définition).",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="keywords",
                description="Key concepts extracted from this chunk (vectorized for semantic search).",
                data_type=wvc.DataType.TEXT_ARRAY,
            ),
            # Technical metadata (not vectorized)
            wvc.Property(
                name="orderIndex",
                description="Sequential position in the document (0-based).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="language",
                description="Language of this chunk (e.g., 'fr', 'en', 'gr').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            # Cross references (nested objects)
            wvc.Property(
                name="document",
                description="Reference to parent Document with essential metadata.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="sourceId", data_type=wvc.DataType.TEXT),
                    wvc.Property(name="edition", data_type=wvc.DataType.TEXT),
                ],
            ),
            wvc.Property(
                name="work",
                description="Reference to the Work with essential metadata.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="title", data_type=wvc.DataType.TEXT),
                    wvc.Property(name="author", data_type=wvc.DataType.TEXT),
                ],
            ),
        ],
    )
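
The Chunk schema above mixes flat fields with two nested objects (`document` and `work`). As a rough illustration of what an object matching that shape could look like, here is a minimal sketch in plain Python. The helper name `build_chunk_properties` is hypothetical and not part of this commit, and deriving `sectionLevel` and `chapterTitle` from the ` > ` separators in `sectionPath` is an assumption made only for the example.

```python
from typing import Any, Dict, List, Optional


def build_chunk_properties(
    text: str,
    section_path: str,
    order_index: int,
    source_id: str,
    work_title: str,
    work_author: str,
    keywords: Optional[List[str]] = None,
    edition: str = "",
    unit_type: str = "main_content",
    language: str = "fr",
) -> Dict[str, Any]:
    """Hypothetical helper: assemble one dict shaped like the Chunk schema."""
    # Assumption: path segments are separated by '>' as in the schema example.
    parts = [p.strip() for p in section_path.split(">") if p.strip()]
    return {
        "text": text,
        "sectionPath": section_path,
        "sectionLevel": len(parts),  # assumption: depth = number of segments
        "chapterTitle": parts[0] if parts else "",
        "unitType": unit_type,
        "keywords": keywords or [],
        "orderIndex": order_index,
        "language": language,
        # Nested objects carry only the fields declared in nested_properties.
        "document": {"sourceId": source_id, "edition": edition},
        "work": {"title": work_title, "author": work_author},
    }
```

A dict like this is the kind of payload one would pass as the object properties when inserting into the collection; the actual ingestion code in the repository may build it differently.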


def create_summary_collection(client: weaviate.WeaviateClient) -> None:
    """Create the Summary collection for chapter/section summaries.

    Args:
        client: Connected Weaviate client.

    Note:
        Uses text2vec-transformers for vectorizing summary text.
    """
    client.collections.create(
        name="Summary",
        description="Chapter or section summary, vectorized for high-level semantic search.",
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        properties=[
            wvc.Property(
                name="sectionPath",
                description="Hierarchical path (e.g., 'Chapter 1 > Section 2').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="title",
                description="Title of the section.",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="level",
                description="Hierarchy depth (1=chapter, 2=section, 3=subsection).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="text",
                description="LLM-generated summary of the section content (VECTORIZED).",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="concepts",
                description="Key philosophical concepts in this section.",
                data_type=wvc.DataType.TEXT_ARRAY,
            ),
            wvc.Property(
                name="chunksCount",
                description="Number of chunks in this section.",
                data_type=wvc.DataType.INT,
            ),
            # Reference to Document
            wvc.Property(
                name="document",
                description="Reference to parent Document.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="sourceId", data_type=wvc.DataType.TEXT),
                ],
            ),
        ],
    )


def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True) -> None:
    """Create the complete Weaviate schema for Library RAG.

    Creates all four collections: Work, Document, Chunk, Summary.

    Args:
        client: Connected Weaviate client.
        delete_existing: If True, delete all existing collections first.

    Raises:
        Exception: If collection creation fails.
    """
    if delete_existing:
        print("\n[1/4] Suppression des collections existantes...")
        client.collections.delete_all()
        print(" ✓ Collections supprimées")

    print("\n[2/4] Création des collections...")

    print(" → Work (métadonnées œuvre)...")
    create_work_collection(client)

    print(" → Document (métadonnées édition)...")
    create_document_collection(client)

    print(" → Chunk (fragments vectorisés)...")
    create_chunk_collection(client)

    print(" → Summary (résumés de chapitres)...")
    create_summary_collection(client)

    print(" ✓ 4 collections créées")
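
`create_schema` wipes every collection when `delete_existing=True`, and with `delete_existing=False` a creation call would presumably fail for a collection that already exists. A non-destructive alternative could create only what is missing. The sketch below is an illustration, not part of this commit: the helper name `create_missing_collections` and the `creators` mapping are hypothetical, and it relies only on `client.collections.exists(name)` from the Weaviate Python client v4.

```python
from typing import Any, Callable, Dict, List


def create_missing_collections(
    client: Any,
    creators: Dict[str, Callable[[Any], None]],
) -> List[str]:
    """Hypothetical helper: call each creator only if its collection is absent.

    Args:
        client: Connected Weaviate client (anything exposing collections.exists).
        creators: Mapping of collection name to a function that creates it.

    Returns:
        Names of the collections that were actually created, in mapping order.
    """
    created: List[str] = []
    for name, creator in creators.items():
        if not client.collections.exists(name):
            creator(client)
            created.append(name)
    return created
```

With the functions from this script it might be called as `create_missing_collections(client, {"Work": create_work_collection, "Document": create_document_collection, "Chunk": create_chunk_collection, "Summary": create_summary_collection})`. Note that this does not detect a collection that exists with an outdated schema.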


def verify_schema(client: weaviate.WeaviateClient) -> bool:
    """Verify that all expected collections exist.

    Args:
        client: Connected Weaviate client.

    Returns:
        True if all expected collections exist, False otherwise.
    """
    print("\n[3/4] Vérification des collections...")
    collections = client.collections.list_all()

    expected: Set[str] = {"Work", "Document", "Chunk", "Summary"}
    actual: Set[str] = set(collections.keys())

    if expected == actual:
        print(f" ✓ Toutes les collections créées: {sorted(actual)}")
        return True
    else:
        missing: Set[str] = expected - actual
        extra: Set[str] = actual - expected
        if missing:
            print(f" ✗ Collections manquantes: {missing}")
        if extra:
            print(f" ⚠ Collections inattendues: {extra}")
        return False
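
The check in `verify_schema` is a strict set comparison: it fails both when a collection is missing and when an unexpected one is present. The set arithmetic can be isolated and exercised without a running Weaviate instance. The function name `diff_collections` below is hypothetical and only illustrates the comparison.

```python
from typing import Set, Tuple


def diff_collections(expected: Set[str], actual: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, extra) collection names.

    missing: expected names that are absent from the database.
    extra:   names present in the database that were not expected.
    """
    return expected - actual, actual - expected


# Example: one collection missing, one unexpected leftover.
missing, extra = diff_collections(
    {"Work", "Document", "Chunk", "Summary"},
    {"Work", "Document", "Chunk", "Passage"},
)
```

Both returned sets being empty corresponds to the `expected == actual` branch above.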


def display_schema(client: weaviate.WeaviateClient) -> None:
    """Display detailed information about schema collections.

    Args:
        client: Connected Weaviate client.
    """
    print("\n[4/4] Détail des collections créées:")
    print("=" * 80)

    collections = client.collections.list_all()

    for name in ["Work", "Document", "Chunk", "Summary"]:
        if name not in collections:
            continue

        config = collections[name]
        print(f"\n📦 {name}")
        print("─" * 80)
        print(f"Description: {config.description}")

        # Vectorizer
        vectorizer_str: str = str(config.vectorizer)
        if "text2vec" in vectorizer_str.lower():
            print("Vectorizer: text2vec-transformers ✓")
        else:
            print("Vectorizer: none")

        # Properties
        print("\nPropriétés:")
        for prop in config.properties:
            # Data type
            dtype: str = str(prop.data_type).split('.')[-1]

            # Skip vectorization flag
            skip: str = ""
            if hasattr(prop, 'skip_vectorization') and prop.skip_vectorization:
                skip = " [skip_vec]"

            # Nested properties
            nested: str = ""
            if hasattr(prop, 'nested_properties') and prop.nested_properties:
                nested_names: List[str] = [p.name for p in prop.nested_properties]
                nested = f" → {{{', '.join(nested_names)}}}"

            print(f" • {prop.name:<20} {dtype:<15} {skip}{nested}")


def print_summary() -> None:
    """Print a summary of the schema architecture."""
    print("\n" + "=" * 80)
    print("SCHÉMA CRÉÉ AVEC SUCCÈS!")
    print("=" * 80)
    print("\n✓ Architecture:")
    print(" - Work: Source unique pour author/title")
    print(" - Document: Métadonnées d'édition avec référence vers Work")
    print(" - Chunk: Fragments vectorisés (text + keywords)")
    print(" - Summary: Résumés de chapitres vectorisés (text)")
    print("\n✓ Vectorisation:")
    print(" - Work: NONE")
    print(" - Document: NONE")
    print(" - Chunk: text2vec (text + keywords)")
    print(" - Summary: text2vec (text)")
    print("=" * 80)


# =============================================================================
# Main Script Execution
# =============================================================================


def main() -> None:
    """Main entry point for schema creation script."""
    # Fix encoding for Windows console
    if sys.platform == "win32" and hasattr(sys.stdout, 'reconfigure'):
        sys.stdout.reconfigure(encoding='utf-8')

    print("=" * 80)
    print("CRÉATION DU SCHÉMA WEAVIATE - BASE DE TEXTES PHILOSOPHIQUES")
    print("=" * 80)

    # Connect to local Weaviate
    client: weaviate.WeaviateClient = weaviate.connect_to_local(
        host="localhost",
        port=8080,
        grpc_port=50051,
    )

    try:
        create_schema(client, delete_existing=True)
        verify_schema(client)
        display_schema(client)
        print_summary()
    finally:
        client.close()
        print("\n✓ Connexion fermée\n")


if __name__ == "__main__":
    main()
785
generations/library_rag/templates/base.html
Normal file
@@ -0,0 +1,785 @@
<!DOCTYPE html>
<html lang="fr">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{% block title %}Philosophia{% endblock %} – Visualiseur Weaviate</title>

    <!-- Google Fonts: DM Sans + Lato -->
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,wght@0,400;0,500;0,600;0,700;1,400&family=Lato:ital,wght@0,300;0,400;0,700;1,400&display=swap" rel="stylesheet">

    <style>
        /* =========================================================
           Style guide: Philosophy RAG site
           Beige palette with soft contrasts
           Type: DM Sans (headings) + Lato (body text)
           ========================================================= */

        /* Basic reset */
        *,
        *::before,
        *::after {
            box-sizing: border-box;
            margin: 0;
            padding: 0;
        }

        html {
            font-size: 16px;
            scroll-behavior: smooth;
        }

        body {
            font-family: 'Lato', system-ui, -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
            background-color: #F8F4EE;
            color: #2B2B2B;
            line-height: 1.6;
        }

        /* CSS variables */
        :root {
            --color-bg-main: #F8F4EE;
            --color-bg-secondary: #EAE5E0;
            --color-text-main: #2B2B2B;
            --color-text-strong: #1F1F1F;
            --color-accent: #7D6E58;
            --color-accent-alt: #556B63;
            --max-content-width: 1100px;
            --font-title: 'DM Sans', system-ui, sans-serif;
            --font-body: 'Lato', system-ui, sans-serif;
        }

        /* Main container */
        .wrapper {
            max-width: var(--max-content-width);
            margin: 0 auto;
            padding: 1.5rem 1.5rem;
        }

        @media (min-width: 992px) {
            .wrapper {
                padding: 2.5rem 3rem;
            }
        }

        /* Headings & text */
        h1, h2, h3, h4, h5, h6 {
            font-family: var(--font-title);
            color: var(--color-text-strong);
            margin-bottom: 0.75rem;
        }

        h1 {
            font-size: 2.5rem;
            font-weight: 700;
            letter-spacing: 0.02em;
        }

        h2 {
            font-size: 2rem;
            font-weight: 600;
        }

        h3 {
            font-size: 1.5rem;
            font-weight: 500;
        }

        p {
            margin-bottom: 1.5rem;
            font-family: var(--font-body);
            font-size: 1rem;
            color: var(--color-text-main);
        }

        strong {
            font-weight: 600;
        }

        .lead {
            font-size: 1.1rem;
            font-style: italic;
            color: var(--color-accent-alt);
            margin-bottom: 2rem;
        }

        .caption {
            font-size: 0.9rem;
            font-weight: 300;
            color: var(--color-accent);
            margin-top: 0.25rem;
        }

        /* Links */
        a {
            color: var(--color-accent);
            text-decoration: none;
            transition: color 0.2s ease, text-decoration-color 0.2s ease;
        }

        a:hover,
        a:focus {
            text-decoration: underline;
            text-decoration-thickness: 1px;
        }

        /* Buttons */
        .btn {
            display: inline-flex;
            align-items: center;
            justify-content: center;
            padding: 0.65rem 1.6rem;
            border-radius: 8px;
            border: 1.5px solid var(--color-accent);
            background-color: transparent;
            color: var(--color-accent);
            font-family: var(--font-body);
            font-size: 0.95rem;
            font-weight: 500;
            cursor: pointer;
            text-decoration: none;
            transition: background-color 0.2s ease, color 0.2s ease, border-color 0.2s ease;
        }

        .btn:hover,
        .btn:focus {
            background-color: var(--color-accent);
            color: var(--color-bg-main);
            text-decoration: none;
        }

        .btn-primary {
            background-color: var(--color-accent);
            color: var(--color-bg-main);
        }

        .btn-primary:hover,
        .btn-primary:focus {
            background-color: var(--color-accent-alt);
            border-color: var(--color-accent-alt);
        }

        .btn-sm {
            padding: 0.4rem 1rem;
            font-size: 0.85rem;
        }

        /* Sidebar Navigation */
        .nav-sidebar {
            position: fixed;
            left: 0;
            top: 0;
            width: 260px;
            height: 100vh;
            background-color: #fff;
            border-right: 1px solid rgba(125, 110, 88, 0.15);
            box-shadow: 2px 0 8px rgba(0, 0, 0, 0.03);
            transform: translateX(-100%);
            transition: transform 0.3s ease;
            z-index: 1000;
            overflow-y: auto;
        }

        .nav-sidebar.visible {
            transform: translateX(0);
        }

        .sidebar-header {
            padding: 1.75rem 1.5rem;
            border-bottom: 1px solid rgba(125, 110, 88, 0.1);
            display: flex;
            justify-content: space-between;
            align-items: flex-start;
        }

        .sidebar-header-text {
            flex: 1;
        }

        .sidebar-title {
            font-family: var(--font-title);
            font-size: 1.3rem;
            font-weight: 600;
            letter-spacing: 0.06em;
            text-transform: uppercase;
            color: var(--color-text-strong);
            margin-bottom: 0.25rem;
        }

        .sidebar-subtitle {
            font-family: var(--font-body);
            font-size: 0.85rem;
            color: var(--color-accent-alt);
        }

        .sidebar-close-btn {
            background: none;
            border: none;
            font-size: 1.5rem;
            color: var(--color-accent);
            cursor: pointer;
            padding: 0.25rem;
            line-height: 1;
            transition: transform 0.2s ease, color 0.2s ease;
        }

        .sidebar-close-btn:hover {
            transform: scale(1.1);
            color: var(--color-accent-alt);
        }

        .sidebar-nav {
            padding: 1rem 0;
        }

        .sidebar-nav a {
            display: flex;
            align-items: center;
            gap: 0.75rem;
            padding: 0.85rem 1.5rem;
            font-family: var(--font-title);
            font-size: 0.95rem;
            font-weight: 500;
            color: var(--color-accent-alt);
            text-decoration: none;
            border-left: 3px solid transparent;
            transition: background-color 0.2s ease, border-color 0.2s ease, color 0.2s ease;
        }

        .sidebar-nav a:hover {
            background-color: rgba(125, 110, 88, 0.05);
            color: var(--color-accent);
        }

        .sidebar-nav a.active {
            background-color: rgba(125, 110, 88, 0.08);
            border-left-color: var(--color-accent);
            color: var(--color-accent);
        }

        .sidebar-nav a .icon {
            font-size: 1.1rem;
            width: 20px;
            text-align: center;
        }

        /* Sidebar overlay */
        .sidebar-overlay {
            position: fixed;
            inset: 0;
            background-color: rgba(0, 0, 0, 0.3);
            opacity: 0;
            visibility: hidden;
            transition: opacity 0.3s ease, visibility 0.3s ease;
            z-index: 999;
        }

        .sidebar-overlay.visible {
            opacity: 1;
            visibility: visible;
        }

        /* Hamburger button */
        .hamburger-btn {
            position: fixed;
            left: 1rem;
            top: 1rem;
            width: 48px;
            height: 48px;
            background-color: #fff;
            border: 1px solid rgba(125, 110, 88, 0.2);
            border-radius: 8px;
            display: flex;
            flex-direction: column;
            justify-content: center;
            align-items: center;
            gap: 5px;
            cursor: pointer;
            z-index: 1001;
            transition: background-color 0.2s ease, transform 0.2s ease, opacity 0.3s ease, visibility 0.3s ease;
            box-shadow: 0 2px 8px rgba(0, 0, 0, 0.08);
        }

        .hamburger-btn.active {
            opacity: 0;
            visibility: hidden;
        }

        .hamburger-btn:hover {
            background-color: rgba(125, 110, 88, 0.05);
            transform: scale(1.05);
        }

        .hamburger-btn span {
            width: 22px;
            height: 2px;
            background-color: var(--color-accent);
            border-radius: 2px;
            transition: transform 0.3s ease, opacity 0.3s ease;
        }

        /* Header */
        .site-header {
            padding: 1.5rem 0;
            border-bottom: 1px solid rgba(0, 0, 0, 0.04);
            margin-bottom: 1.5rem;
            margin-left: 64px; /* Space for hamburger button */
        }

        .site-title {
            font-family: var(--font-title);
            font-size: 1.4rem;
            font-weight: 600;
            letter-spacing: 0.06em;
            text-transform: uppercase;
            color: var(--color-text-strong);
        }

        .site-subtitle {
            font-family: var(--font-body);
            font-size: 0.9rem;
            color: var(--color-accent-alt);
            margin-top: 0.2rem;
        }

        .header-inner {
            display: flex;
            justify-content: space-between;
            align-items: center;
            flex-wrap: wrap;
            gap: 1rem;
        }

        /* Hide old navigation in header */
        .nav-links {
            display: none;
        }

        /* Sections */
        .section {
            padding: 2.5rem 0;
        }

        .section--alt {
            background-color: var(--color-bg-secondary);
        }

        /* Cards */
        .card {
            background-color: rgba(255, 255, 255, 0.06);
            border-radius: 12px;
            padding: 1.75rem;
            border: 1px solid rgba(125, 110, 88, 0.25);
            box-shadow: 0 14px 40px rgba(0, 0, 0, 0.03);
        }

        /* Passage cards */
        .passage-card {
            background-color: rgba(255, 255, 255, 0.06);
            border-radius: 12px;
            padding: 1.75rem;
            border: 1px solid rgba(125, 110, 88, 0.25);
            border-left: 3px solid var(--color-accent);
            box-shadow: 0 14px 40px rgba(0, 0, 0, 0.03);
            margin-bottom: 1.25rem;
            transition: transform 0.2s ease, box-shadow 0.2s ease;
        }

        .passage-card:hover {
            transform: translateX(4px);
            box-shadow: 0 18px 50px rgba(0, 0, 0, 0.06);
        }

        .passage-header {
            display: flex;
            justify-content: space-between;
            align-items: flex-start;
            margin-bottom: 1rem;
            flex-wrap: wrap;
            gap: 0.5rem;
        }

        .passage-text {
            font-family: var(--font-body);
            font-size: 1rem;
            line-height: 1.65;
            color: var(--color-text-main);
            font-style: italic;
            margin-bottom: 1rem;
        }

        .passage-meta {
            font-family: var(--font-body);
            font-size: 0.85rem;
            font-weight: 300;
            color: var(--color-accent);
            padding-top: 0.75rem;
            border-top: 1px solid rgba(125, 110, 88, 0.15);
        }

        /* Badges */
        .badge {
            display: inline-flex;
            align-items: center;
            padding: 0.25rem 0.6rem;
            border-radius: 999px;
            border: 1px solid rgba(85, 107, 99, 0.4);
            font-size: 0.75rem;
            color: var(--color-accent-alt);
            margin-right: 0.4rem;
            margin-bottom: 0.3rem;
        }

        .badge-author {
            background-color: var(--color-accent);
            color: var(--color-bg-main);
            border: none;
            font-family: var(--font-title);
            font-weight: 500;
        }

        .badge-work {
            background-color: transparent;
            border-color: var(--color-accent-alt);
            color: var(--color-accent-alt);
        }

        .badge-similarity {
            background-color: var(--color-accent-alt);
            color: var(--color-bg-main);
            border: none;
        }

        .keyword-tag {
            display: inline-flex;
            align-items: center;
            padding: 0.2rem 0.5rem;
            border-radius: 999px;
            border: 1px solid rgba(85, 107, 99, 0.3);
            font-size: 0.7rem;
            color: var(--color-accent-alt);
            margin: 0.1rem;
        }

        /* Stats grid */
        .stats-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(180px, 1fr));
            gap: 1.25rem;
            margin-bottom: 2rem;
        }

        .stat-box {
            background-color: var(--color-bg-secondary);
            border: 1px solid rgba(125, 110, 88, 0.25);
            border-radius: 12px;
            padding: 1.75rem;
            text-align: center;
            box-shadow: 0 14px 40px rgba(0, 0, 0, 0.03);
        }

        .stat-number {
            font-family: var(--font-title);
            font-size: 2.5rem;
            font-weight: 700;
            color: var(--color-accent);
            line-height: 1;
        }

        .stat-label {
            font-family: var(--font-title);
            font-size: 0.85rem;
            font-weight: 500;
            letter-spacing: 0.08em;
            text-transform: uppercase;
            color: var(--color-accent-alt);
            margin-top: 0.5rem;
        }

        /* Forms */
        .form-group {
            margin-bottom: 1.25rem;
        }

        .form-label {
            display: block;
            font-family: var(--font-title);
            font-size: 0.9rem;
            letter-spacing: 0.05em;
            text-transform: uppercase;
            color: var(--color-accent-alt);
            margin-bottom: 0.4rem;
        }

        .form-control {
            width: 100%;
            padding: 0.8rem 1rem;
            border-radius: 8px;
            border: 1px solid rgba(0, 0, 0, 0.08);
            background-color: #fff;
            font-family: var(--font-body);
            font-size: 0.95rem;
            color: var(--color-text-main);
            transition: border-color 0.2s ease, box-shadow 0.2s ease;
        }

        .form-control:focus {
            outline: none;
            border-color: var(--color-accent);
            box-shadow: 0 0 0 3px rgba(125, 110, 88, 0.1);
        }

        select.form-control {
            cursor: pointer;
        }

        .form-row {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
            gap: 1rem;
        }

        /* Search box */
        .search-box {
            background-color: var(--color-bg-secondary);
            border-radius: 12px;
            padding: 2rem;
            margin-bottom: 2rem;
            border: 1px solid rgba(125, 110, 88, 0.2);
        }

        .search-input {
            font-size: 1.1rem;
            padding: 1rem 1.25rem;
        }

        /* Pagination */
        .pagination {
            display: flex;
            justify-content: center;
            align-items: center;
            gap: 1rem;
            margin-top: 2rem;
            padding-top: 1.5rem;
            border-top: 1px solid rgba(125, 110, 88, 0.15);
        }

        .pagination-info {
            font-family: var(--font-body);
            font-size: 0.9rem;
            color: var(--color-accent-alt);
        }

        /* Footer */
        .site-footer {
            margin-top: 4rem;
            padding: 2rem 0 1.5rem;
            border-top: 1px solid rgba(0, 0, 0, 0.04);
            text-align: center;
        }

        .footer-quote {
            font-family: var(--font-body);
            font-style: italic;
            font-size: 0.9rem;
            color: var(--color-accent-alt);
        }

        /* Utilities */
        .mt-1 { margin-top: 0.5rem; }
        .mt-2 { margin-top: 1rem; }
        .mt-3 { margin-top: 1.5rem; }
        .mt-4 { margin-top: 2rem; }
        .mb-1 { margin-bottom: 0.5rem; }
        .mb-2 { margin-bottom: 1rem; }
        .mb-3 { margin-bottom: 1.5rem; }
        .mb-4 { margin-bottom: 2rem; }
        .text-center { text-align: center; }
        .text-muted { color: rgba(43, 43, 43, 0.7); }

        /* Divider */
        .divider {
            border: none;
            height: 1px;
            background: linear-gradient(90deg, transparent, rgba(125, 110, 88, 0.3), transparent);
            margin: 2rem 0;
        }

        /* Empty state */
        .empty-state {
            text-align: center;
            padding: 3rem 1rem;
            color: var(--color-accent-alt);
        }

        .empty-state-icon {
            font-size: 3rem;
            margin-bottom: 1rem;
            opacity: 0.5;
        }

        /* Lists */
        .list-inline {
            display: flex;
            flex-wrap: wrap;
            gap: 0.5rem;
            list-style: none;
        }

        .list-inline li {
            font-family: var(--font-body);
            color: var(--color-text-main);
        }

        /* Alert */
        .alert {
            padding: 1rem 1.25rem;
            border-radius: 8px;
            margin-bottom: 1.5rem;
            font-size: 0.95rem;
        }

        .alert-warning {
            background-color: rgba(125, 110, 88, 0.1);
            border: 1px solid rgba(125, 110, 88, 0.3);
            color: var(--color-accent);
        }

        .alert-success {
            background-color: rgba(85, 107, 99, 0.1);
            border: 1px solid rgba(85, 107, 99, 0.3);
            color: var(--color-accent-alt);
        }

        /* Ornament */
        .ornament {
            text-align: center;
            color: var(--color-accent);
            font-size: 1rem;
            letter-spacing: 0.5em;
            opacity: 0.4;
            margin: 1.5rem 0;
        }
    </style>
</head>
<body>
    <!-- Hamburger button -->
    <button class="hamburger-btn" id="hamburger-btn" aria-label="Toggle navigation">
        <span></span>
        <span></span>
        <span></span>
    </button>

    <!-- Sidebar overlay -->
    <div class="sidebar-overlay" id="sidebar-overlay"></div>

    <!-- Sidebar navigation -->
    <nav class="nav-sidebar" id="nav-sidebar">
        <div class="sidebar-header">
            <div class="sidebar-header-text">
                <div class="sidebar-title">Philosophia</div>
                <div class="sidebar-subtitle">Base Weaviate</div>
            </div>
            <button class="sidebar-close-btn" id="sidebar-close-btn" aria-label="Fermer le menu">✕</button>
        </div>
        <div class="sidebar-nav">
            <a href="/" class="{{ 'active' if request.endpoint == 'index' else '' }}">
                <span class="icon">🏠</span>
                <span>Accueil</span>
            </a>
            <a href="/passages" class="{{ 'active' if request.endpoint == 'passages' else '' }}">
                <span class="icon">📄</span>
                <span>Passages</span>
            </a>
            <a href="/search" class="{{ 'active' if request.endpoint == 'search' else '' }}">
                <span class="icon">🔍</span>
                <span>Recherche</span>
            </a>
            <a href="/chat" class="{{ 'active' if request.endpoint == 'chat' else '' }}">
                <span class="icon">💬</span>
                <span>Conversation</span>
            </a>
            <a href="/upload" class="{{ 'active' if request.endpoint == 'upload' else '' }}">
                <span class="icon">📤</span>
                <span>Parser PDF</span>
            </a>
            <a href="/documents" class="{{ 'active' if request.endpoint == 'documents' else '' }}">
                <span class="icon">📚</span>
                <span>Documents</span>
            </a>
        </div>
    </nav>

    <div class="wrapper">
        <!-- Header -->
        <header class="site-header">
            <div class="header-inner">
                <div>
                    <a href="/" style="text-decoration: none;">
                        <div class="site-title">Philosophia</div>
                        <div class="site-subtitle">Visualiseur de base Weaviate</div>
                    </a>
                </div>
                <nav class="nav-links">
                    <a href="/" class="{{ 'active' if request.endpoint == 'index' else '' }}">Accueil</a>
                    <a href="/passages" class="{{ 'active' if request.endpoint == 'passages' else '' }}">Passages</a>
                    <a href="/search" class="{{ 'active' if request.endpoint == 'search' else '' }}">Recherche</a>
                    <a href="/chat" class="{{ 'active' if request.endpoint == 'chat' else '' }}">Conversation</a>
                    <a href="/upload" class="{{ 'active' if request.endpoint == 'upload' else '' }}">Parser PDF</a>
                    <a href="/documents" class="{{ 'active' if request.endpoint == 'documents' else '' }}">Documents</a>
                </nav>
            </div>
        </header>

        <!-- Main Content -->
        <main>
            {% block content %}{% endblock %}
        </main>

        <!-- Footer -->
        <footer class="site-footer">
            <p class="footer-quote">« La philosophie est la médecine de l'âme. » — Cicéron</p>
        </footer>
    </div>

    <!-- Sidebar toggle script -->
    <script>
        const hamburgerBtn = document.getElementById('hamburger-btn');
        const navSidebar = document.getElementById('nav-sidebar');
        const sidebarOverlay = document.getElementById('sidebar-overlay');
        const sidebarCloseBtn = document.getElementById('sidebar-close-btn');

        function toggleSidebar() {
            hamburgerBtn.classList.toggle('active');
            navSidebar.classList.toggle('visible');
            sidebarOverlay.classList.toggle('visible');
        }

        function closeSidebar() {
            hamburgerBtn.classList.remove('active');
            navSidebar.classList.remove('visible');
            sidebarOverlay.classList.remove('visible');
        }

        hamburgerBtn.addEventListener('click', toggleSidebar);
        sidebarOverlay.addEventListener('click', closeSidebar);
        sidebarCloseBtn.addEventListener('click', closeSidebar);

        // Close sidebar on link click
        const sidebarLinks = navSidebar.querySelectorAll('a');
        sidebarLinks.forEach(link => {
            link.addEventListener('click', closeSidebar);
        });

        // Close sidebar on Escape key
        document.addEventListener('keydown', (e) => {
            if (e.key === 'Escape' && navSidebar.classList.contains('visible')) {
                closeSidebar();
            }
        });
    </script>
</body>
</html>
541
generations/library_rag/templates/document_view.html
Normal file
@@ -0,0 +1,541 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}{{ result.document_name }} - Détails{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
{# Lookup table mapping chunk types to their display label, icon, description, and color #}
|
||||
{% set chunk_types = {
|
||||
'main_content': {'label': 'Contenu principal', 'icon': '📄', 'desc': 'Paragraphe de contenu substantiel', 'color': 'rgba(125, 110, 88, 0.15)'},
|
||||
'exposition': {'label': 'Exposition', 'icon': '📖', 'desc': 'Présentation d\'idées ou de contexte', 'color': 'rgba(85, 107, 99, 0.15)'},
|
||||
'argument': {'label': 'Argument', 'icon': '💭', 'desc': 'Raisonnement ou argumentation', 'color': 'rgba(164, 132, 92, 0.15)'},
|
||||
'definition': {'label': 'Définition', 'icon': '📌', 'desc': 'Définition de concept ou terme', 'color': 'rgba(125, 110, 88, 0.2)'},
|
||||
'example': {'label': 'Exemple', 'icon': '💡', 'desc': 'Illustration ou cas pratique', 'color': 'rgba(218, 188, 134, 0.2)'},
|
||||
'citation': {'label': 'Citation', 'icon': '💬', 'desc': 'Citation d\'auteur ou référence', 'color': 'rgba(85, 107, 99, 0.2)'},
|
||||
'abstract': {'label': 'Résumé', 'icon': '📋', 'desc': 'Résumé ou synthèse', 'color': 'rgba(164, 132, 92, 0.2)'},
|
||||
'preface': {'label': 'Préface', 'icon': '✍️', 'desc': 'Préface, avant-propos ou avertissement', 'color': 'rgba(85, 107, 99, 0.15)'},
|
||||
'conclusion': {'label': 'Conclusion', 'icon': '🎯', 'desc': 'Conclusion d\'une argumentation', 'color': 'rgba(125, 110, 88, 0.2)'}
|
||||
} %}
|
||||
|
||||
<style>
|
||||
/* Hierarchical TOC */
|
||||
.toc-tree {
|
||||
list-style: none;
|
||||
padding-left: 0;
|
||||
margin: 0;
|
||||
}
|
||||
.toc-tree ul {
|
||||
list-style: none;
|
||||
padding-left: 1.5rem;
|
||||
margin: 0;
|
||||
display: none;
|
||||
}
|
||||
.toc-tree ul.expanded {
|
||||
display: block;
|
||||
}
|
||||
.toc-item {
|
||||
padding: 0.4rem 0;
|
||||
border-bottom: 1px solid rgba(125, 110, 88, 0.1);
|
||||
}
|
||||
.toc-item-header {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 0.5rem;
|
||||
cursor: pointer;
|
||||
}
|
||||
.toc-toggle {
|
||||
width: 20px;
|
||||
height: 20px;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
color: var(--color-accent);
|
||||
font-size: 0.8rem;
|
||||
transition: transform 0.2s;
|
||||
}
|
||||
.toc-toggle.expanded {
|
||||
transform: rotate(90deg);
|
||||
}
|
||||
.toc-toggle.no-children {
|
||||
visibility: hidden;
|
||||
}
|
||||
.toc-level-1 { font-weight: bold; color: var(--color-text-main); }
|
||||
.toc-level-2 { color: var(--color-accent-alt); padding-left: 0.5rem; }
|
||||
.toc-level-3 { color: var(--color-text-muted); font-size: 0.9rem; padding-left: 0.5rem; }
|
||||
.toc-level-4 { color: var(--color-text-muted); font-size: 0.85rem; font-style: italic; padding-left: 0.5rem; }
|
||||
|
||||
/* Collapsible passages */
|
||||
.passage-card {
|
||||
background: var(--color-bg-secondary);
|
||||
border-radius: 8px;
|
||||
margin-bottom: 0.75rem;
|
||||
border-left: 3px solid var(--color-accent);
|
||||
overflow: hidden;
|
||||
}
|
||||
.passage-header {
|
||||
padding: 1rem;
|
||||
cursor: pointer;
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: flex-start;
|
||||
transition: background-color 0.2s;
|
||||
}
|
||||
.passage-header:hover {
|
||||
background-color: rgba(125, 110, 88, 0.05);
|
||||
}
|
||||
.passage-toggle {
|
||||
color: var(--color-accent);
|
||||
font-size: 1.2rem;
|
||||
transition: transform 0.2s;
|
||||
}
|
||||
.passage-toggle.expanded {
|
||||
transform: rotate(180deg);
|
||||
}
|
||||
.passage-content {
|
||||
display: none;
|
||||
padding: 0 1rem 1rem 1rem;
|
||||
border-top: 1px solid rgba(125, 110, 88, 0.1);
|
||||
}
|
||||
.passage-content.expanded {
|
||||
display: block;
|
||||
}
|
||||
.passage-text {
|
||||
font-style: italic;
|
||||
color: var(--color-text-main);
|
||||
font-size: 0.9rem;
|
||||
line-height: 1.6;
|
||||
background: var(--color-bg-main);
|
||||
padding: 1rem;
|
||||
border-radius: 6px;
|
||||
margin-top: 0.75rem;
|
||||
max-height: 300px;
|
||||
overflow-y: auto;
|
||||
}
|
||||
.passage-meta {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
|
||||
gap: 0.5rem;
|
||||
margin-top: 0.75rem;
|
||||
}
|
||||
.passage-meta-item {
|
||||
display: flex;
|
||||
gap: 0.5rem;
|
||||
font-size: 0.85rem;
|
||||
}
|
||||
.passage-meta-label {
|
||||
color: var(--color-text-muted);
|
||||
min-width: 80px;
|
||||
}
|
||||
.concepts-list {
|
||||
display: flex;
|
||||
flex-wrap: wrap;
|
||||
gap: 0.3rem;
|
||||
margin-top: 0.5rem;
|
||||
}
|
||||
.concept-tag {
|
||||
background: var(--color-accent);
|
||||
color: var(--color-bg-main);
|
||||
padding: 0.2rem 0.5rem;
|
||||
border-radius: 4px;
|
||||
font-size: 0.75rem;
|
||||
}
|
||||
|
||||
/* Expand/Collapse all */
|
||||
.toolbar {
|
||||
display: flex;
|
||||
gap: 0.5rem;
|
||||
margin-bottom: 1rem;
|
||||
}
|
||||
.toolbar button {
|
||||
padding: 0.4rem 0.8rem;
|
||||
font-size: 0.8rem;
|
||||
background: var(--color-bg-secondary);
|
||||
border: 1px solid rgba(125, 110, 88, 0.3);
|
||||
border-radius: 4px;
|
||||
cursor: pointer;
|
||||
color: var(--color-text-main);
|
||||
}
|
||||
.toolbar button:hover {
|
||||
background: var(--color-accent);
|
||||
color: var(--color-bg-main);
|
||||
}
|
||||
</style>
|
||||
|
||||
<section class="section">
|
||||
<h1>📄 {{ result.document_name }}</h1>
|
||||
<p class="lead">Détails du document traité</p>
|
||||
|
||||
<div class="ornament">· · ·</div>
|
||||
|
||||
<!-- Statistics -->
|
||||
<div class="stats-grid">
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.pages or 0 }}</div>
|
||||
<div class="stat-label">Pages</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.chunks_count or 0 }}</div>
|
||||
<div class="stat-label">Chunks</div>
|
||||
</div>
|
||||
{% if result.weaviate_ingest and result.weaviate_ingest.success %}
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.weaviate_ingest.count }}</div>
|
||||
<div class="stat-label">Dans Weaviate</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
{% if result.toc %}
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.flat_toc|length if result.flat_toc else result.toc|length }}</div>
|
||||
<div class="stat-label">Entrées TOC</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
<hr class="divider">
|
||||
|
||||
<!-- Document metadata -->
|
||||
<div class="card">
|
||||
<h3>📖 Informations du document</h3>
|
||||
<div class="mt-2">
|
||||
<table style="width: 100%; border-collapse: collapse;">
|
||||
{% if result.metadata.title %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0; width: 150px;"><strong>Titre</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.metadata.title }}</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.metadata.author %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Auteur</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<span class="badge badge-author">{{ result.metadata.author }}</span>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.metadata.publisher %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Éditeur</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.metadata.publisher }}</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.metadata.year %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Année</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.metadata.year }}</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.metadata.doi %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>DOI</strong></td>
|
||||
<td style="padding: 0.75rem 0;"><code>{{ result.metadata.doi }}</code></td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.metadata.isbn %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>ISBN</strong></td>
|
||||
<td style="padding: 0.75rem 0;"><code>{{ result.metadata.isbn }}</code></td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Pages</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.pages or 0 }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="padding: 0.75rem 0;"><strong>Chunks</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.chunks_count or 0 }} segments de texte</td>
|
||||
</tr>
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Hierarchical table of contents -->
|
||||
{% if result.toc and result.toc|length > 0 %}
|
||||
<div class="card mt-3">
|
||||
<h3>📑 Table des matières ({{ result.flat_toc|length if result.flat_toc else result.toc|length }} entrées)</h3>
|
||||
|
||||
<div class="toolbar">
|
||||
<button onclick="expandAllToc()">▼ Tout déplier</button>
|
||||
<button onclick="collapseAllToc()">▲ Tout replier</button>
|
||||
</div>
|
||||
|
||||
<div class="mt-2">
|
||||
<ul class="toc-tree" id="toc-tree">
|
||||
{% macro render_toc(items) %}
|
||||
{% for item in items %}
|
||||
<li class="toc-item">
|
||||
<div class="toc-item-header" onclick="toggleTocItem(this)">
|
||||
<span class="toc-toggle {% if not item.children or item.children|length == 0 %}no-children{% endif %}">▶</span>
|
||||
<span class="toc-level-{{ item.level }}">{{ item.title }}</span>
|
||||
</div>
|
||||
{% if item.children and item.children|length > 0 %}
|
||||
<ul>
|
||||
{{ render_toc(item.children) }}
|
||||
</ul>
|
||||
{% endif %}
|
||||
</li>
|
||||
{% endfor %}
|
||||
{% endmacro %}
|
||||
{{ render_toc(result.toc) }}
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<hr class="divider">
|
||||
|
||||
<!-- Generated files -->
|
||||
<div class="card">
|
||||
<h3>📁 Fichiers générés</h3>
|
||||
<div class="mt-2">
|
||||
<table style="width: 100%; border-collapse: collapse;">
|
||||
{% if result.files.markdown %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Markdown</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}.md" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.files.chunks %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Chunks JSON</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_chunks.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>OCR brut</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_ocr.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% if result.files.weaviate %}
|
||||
<tr>
|
||||
<td style="padding: 0.75rem 0;"><strong>Weaviate JSON</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_weaviate.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- All passages with metadata -->
|
||||
{% if result.chunks and result.chunks|length > 0 %}
|
||||
<div class="card mt-3">
|
||||
<h3>📝 Passages ({{ result.chunks|length }})</h3>
|
||||
|
||||
<div class="toolbar">
|
||||
<button onclick="expandAllPassages()">▼ Tout déplier</button>
|
||||
<button onclick="collapseAllPassages()">▲ Tout replier</button>
|
||||
</div>
|
||||
|
||||
<div class="mt-2" id="passages-container">
|
||||
{% for chunk in result.chunks %}
|
||||
{% set level = chunk.section_level or chunk.sectionLevel or 1 %}
|
||||
<div class="passage-card" data-index="{{ loop.index0 }}" style="{% if level > 1 %}margin-left: {{ (level - 1) * 1 }}rem; border-left: 3px solid {% if level == 2 %}var(--color-accent-alt){% else %}rgba(125, 110, 88, 0.3){% endif %};{% endif %}">
|
||||
<div class="passage-header" onclick="togglePassage(this)">
|
||||
<div style="flex: 1;">
|
||||
<!-- Visual hierarchy -->
|
||||
<div style="display: flex; align-items: center; gap: 0.5rem; flex-wrap: wrap;">
|
||||
{% if chunk.chapter_title and chunk.chapter_title != chunk.section and level > 1 %}
|
||||
<span style="font-size: 0.75rem; color: var(--color-text-muted);">{{ chunk.chapter_title }} ›</span>
|
||||
{% endif %}
|
||||
{% if chunk.subsection_title and chunk.subsection_title != chunk.chapter_title and chunk.subsection_title != chunk.section %}
|
||||
<span style="font-size: 0.75rem; color: var(--color-accent-alt);">{{ chunk.subsection_title }} ›</span>
|
||||
{% endif %}
|
||||
{% if chunk.paragraph_number %}
|
||||
<span class="badge" style="background-color: var(--color-accent); color: white; font-weight: bold;">
|
||||
§ {{ chunk.paragraph_number }}
|
||||
</span>
|
||||
{% endif %}
|
||||
<span class="badge badge-work" style="{% if level == 1 %}background-color: var(--color-accent); color: white;{% endif %}">
|
||||
{% if level == 1 %}📚{% elif level == 2 %}📖{% else %}📄{% endif %}
|
||||
{{ chunk.section or 'Sans section' }}
|
||||
</span>
|
||||
{% if chunk.type %}
|
||||
{% set type_info = chunk_types.get(chunk.type, {'label': chunk.type, 'icon': '📝', 'desc': 'Type de contenu', 'color': 'rgba(125, 110, 88, 0.15)'}) %}
|
||||
<span class="type-badge" style="background: {{ type_info.color }};" title="{{ type_info.desc }}">
|
||||
{{ type_info.icon }} {{ type_info.label }}
|
||||
</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% if chunk.summary %}
|
||||
<div style="margin-top: 0.3rem; font-size: 0.85rem; color: var(--color-text-muted);">
|
||||
{{ chunk.summary[:100] }}{% if chunk.summary|length > 100 %}...{% endif %}
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
<div style="display: flex; align-items: center; gap: 0.5rem;">
|
||||
<span class="caption">{{ chunk.chunk_id or 'chunk_' ~ loop.index0 }}</span>
|
||||
<span class="passage-toggle">▼</span>
|
||||
</div>
|
||||
</div>
|
||||
<div class="passage-content">
|
||||
|
||||
|
||||
<!-- Simplified metadata -->
|
||||
<div style="display: flex; gap: 1rem; flex-wrap: wrap; align-items: center; padding: 0.75rem 0; border-bottom: 1px solid rgba(125, 110, 88, 0.1);">
|
||||
<!-- Hierarchy -->
|
||||
{% if chunk.chapter_title and chunk.chapter_title != chunk.section %}
|
||||
<span style="font-size: 0.85rem; color: var(--color-text-muted);">
|
||||
📚 {{ chunk.chapter_title }}
|
||||
</span>
|
||||
<span style="color: var(--color-text-muted);">›</span>
|
||||
{% endif %}
|
||||
|
||||
{% if chunk.section %}
|
||||
<span style="font-size: 0.85rem; {% if level == 1 %}font-weight: 600; color: var(--color-accent);{% else %}color: var(--color-accent-alt);{% endif %}">
|
||||
{% if level == 1 %}📖{% elif level == 2 %}📄{% else %}📃{% endif %} {{ chunk.section }}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
<!-- Type -->
|
||||
{% if chunk.type %}
|
||||
{% set type_info = chunk_types.get(chunk.type, {'label': chunk.type, 'icon': '📝', 'desc': 'Type de contenu', 'color': 'rgba(125, 110, 88, 0.15)'}) %}
|
||||
<span style="font-size: 0.75rem; padding: 0.2rem 0.5rem; border-radius: 4px; background: {{ type_info.color }};" title="{{ type_info.desc }}">
|
||||
{{ type_info.icon }} {{ type_info.label }}
|
||||
</span>
|
||||
{% endif %}
|
||||
|
||||
<!-- Level -->
|
||||
<span style="font-size: 0.75rem; padding: 0.2rem 0.5rem; border-radius: 4px;
|
||||
{% if level == 1 %}background-color: var(--color-accent); color: white;
|
||||
{% elif level == 2 %}background-color: var(--color-accent-alt); color: white;
|
||||
{% else %}background-color: rgba(125, 110, 88, 0.2);{% endif %}">
|
||||
Niv. {{ level }}
|
||||
</span>
|
||||
|
||||
<!-- Paragraph -->
|
||||
{% if chunk.paragraph_number %}
|
||||
<span style="font-size: 0.75rem; padding: 0.2rem 0.5rem; border-radius: 4px; background-color: var(--color-accent); color: white;">
|
||||
§ {{ chunk.paragraph_number }}
|
||||
</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
<!-- Concepts, if present -->
|
||||
{% if chunk.concepts and chunk.concepts|length > 0 %}
|
||||
<div style="padding: 0.5rem 0;">
|
||||
<div class="concepts-list">
|
||||
{% for concept in chunk.concepts %}
|
||||
<span class="concept-tag">{{ concept }}</span>
|
||||
{% endfor %}
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<!-- Full text -->
|
||||
<div class="passage-text">
|
||||
{{ chunk.text }}
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
{% endfor %}
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<!-- Weaviate data -->
|
||||
{% if result.weaviate_ingest %}
|
||||
<div class="card mt-3">
|
||||
<h3>🗄️ Ingestion Weaviate</h3>
|
||||
<div class="mt-2">
|
||||
{% if result.weaviate_ingest.success %}
|
||||
<div class="alert alert-success" style="background-color: rgba(85, 107, 99, 0.1); border: 1px solid rgba(85, 107, 99, 0.3); color: var(--color-accent-alt); padding: 1rem; border-radius: 8px;">
|
||||
<strong>✓ Ingestion réussie :</strong> {{ result.weaviate_ingest.count }} passages insérés dans la collection <code>Passage</code>
|
||||
</div>
|
||||
{% else %}
|
||||
<div class="alert alert-warning" style="background-color: rgba(125, 110, 88, 0.1); border: 1px solid rgba(125, 110, 88, 0.3); color: var(--color-accent); padding: 1rem; border-radius: 8px;">
|
||||
<strong>⚠️ Erreur d'ingestion :</strong> {{ result.weaviate_ingest.error }}
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<!-- Extracted images -->
|
||||
{% if result.files.images %}
|
||||
<div class="card mt-3">
|
||||
<h3>🖼️ Images extraites ({{ result.files.images|length }})</h3>
|
||||
<div class="mt-2" style="display: grid; grid-template-columns: repeat(auto-fill, minmax(150px, 1fr)); gap: 1rem;">
|
||||
{% for img in result.files.images[:12] %}
|
||||
<div style="text-align: center;">
|
||||
<a href="/output/{{ result.document_name }}/images/{{ img.split('/')[-1].split('\\')[-1] }}" target="_blank">
|
||||
<img
|
||||
src="/output/{{ result.document_name }}/images/{{ img.split('/')[-1].split('\\')[-1] }}"
|
||||
alt="Image"
|
||||
style="max-width: 100%; max-height: 120px; border-radius: 8px; border: 1px solid rgba(125, 110, 88, 0.2);"
|
||||
>
|
||||
</a>
|
||||
<div class="caption">{{ img.split('/')[-1].split('\\')[-1] }}</div>
|
||||
</div>
|
||||
{% endfor %}
|
||||
{% if result.files.images|length > 12 %}
|
||||
<div style="display: flex; align-items: center; justify-content: center;">
|
||||
<span class="text-muted">+ {{ result.files.images|length - 12 }} autres</span>
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<div class="text-center mt-4">
|
||||
<a href="/documents" class="btn btn-primary">← Retour aux documents</a>
|
||||
<a href="/upload" class="btn" style="margin-left: 0.5rem;">Analyser un autre PDF</a>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<script>
|
||||
// TOC toggle
|
||||
function toggleTocItem(header) {
|
||||
const item = header.parentElement;
|
||||
const toggle = header.querySelector('.toc-toggle');
|
||||
const children = item.querySelector('ul');
|
||||
|
||||
if (children) {
|
||||
children.classList.toggle('expanded');
|
||||
toggle.classList.toggle('expanded');
|
||||
}
|
||||
}
|
||||
|
||||
function expandAllToc() {
|
||||
document.querySelectorAll('.toc-tree ul').forEach(ul => ul.classList.add('expanded'));
|
||||
document.querySelectorAll('.toc-toggle').forEach(t => t.classList.add('expanded'));
|
||||
}
|
||||
|
||||
function collapseAllToc() {
|
||||
document.querySelectorAll('.toc-tree ul').forEach(ul => ul.classList.remove('expanded'));
|
||||
document.querySelectorAll('.toc-toggle').forEach(t => t.classList.remove('expanded'));
|
||||
}
|
||||
|
||||
// Passages toggle
|
||||
function togglePassage(header) {
|
||||
const card = header.parentElement;
|
||||
const content = card.querySelector('.passage-content');
|
||||
const toggle = header.querySelector('.passage-toggle');
|
||||
|
||||
content.classList.toggle('expanded');
|
||||
toggle.classList.toggle('expanded');
|
||||
}
|
||||
|
||||
function expandAllPassages() {
|
||||
document.querySelectorAll('.passage-content').forEach(c => c.classList.add('expanded'));
|
||||
document.querySelectorAll('.passage-toggle').forEach(t => t.classList.add('expanded'));
|
||||
}
|
||||
|
||||
function collapseAllPassages() {
|
||||
document.querySelectorAll('.passage-content').forEach(c => c.classList.remove('expanded'));
|
||||
document.querySelectorAll('.passage-toggle').forEach(t => t.classList.remove('expanded'));
|
||||
}
|
||||
</script>
|
||||
{% endblock %}
|
||||
171
generations/library_rag/templates/documents.html
Normal file
@@ -0,0 +1,171 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Documents{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<!-- Flash messages -->
|
||||
{% with messages = get_flashed_messages(with_categories=true) %}
|
||||
{% if messages %}
|
||||
<div style="max-width: 900px; margin: 0 auto 2rem auto;">
|
||||
{% for category, message in messages %}
|
||||
<div class="alert alert-{{ category }}" style="
|
||||
padding: 1rem 1.5rem;
|
||||
border-radius: 8px;
|
||||
margin-bottom: 1rem;
|
||||
border-left: 4px solid;
|
||||
{% if category == 'success' %}
|
||||
background-color: rgba(85, 107, 99, 0.1);
|
||||
border-color: var(--color-accent-alt);
|
||||
color: var(--color-accent-alt);
|
||||
{% elif category == 'warning' %}
|
||||
background-color: rgba(218, 188, 134, 0.15);
|
||||
border-color: #dabc86;
|
||||
color: #a89159;
|
||||
{% elif category == 'error' %}
|
||||
background-color: rgba(160, 82, 82, 0.1);
|
||||
border-color: #a05252;
|
||||
color: #a05252;
|
||||
{% else %}
|
||||
background-color: rgba(125, 110, 88, 0.1);
|
||||
border-color: var(--color-accent);
|
||||
color: var(--color-text-main);
|
||||
{% endif %}
|
||||
">
|
||||
{{ message }}
|
||||
</div>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
{% endwith %}
|
||||
|
||||
<section class="section">
|
||||
<h1>📚 Documents traités</h1>
|
||||
<p class="lead">Liste des documents analysés par le parser PDF</p>
|
||||
|
||||
{% if documents %}
|
||||
<div class="stats-grid mb-4">
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ documents|length }}</div>
|
||||
<div class="stat-label">Documents</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ documents|sum(attribute='summaries_count') }}</div>
|
||||
<div class="stat-label">Résumés totaux</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ documents|sum(attribute='chunks_count') }}</div>
|
||||
<div class="stat-label">Chunks totaux</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{% for doc in documents %}
|
||||
<div class="passage-card">
|
||||
<div class="passage-header">
|
||||
<div>
|
||||
<span class="badge badge-author">{{ doc.name }}</span>
|
||||
{% if doc.has_structured %}
|
||||
<span class="badge badge-similarity">LLM</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
<div>
|
||||
{% if doc.summaries_count %}
|
||||
<span class="badge">{{ doc.summaries_count }} résumés</span>
|
||||
{% endif %}
|
||||
{% if doc.authors_count %}
|
||||
<span class="badge">{{ doc.authors_count }} auteur{{ 's' if doc.authors_count > 1 else '' }}</span>
|
||||
{% endif %}
|
||||
{% if doc.chunks_count %}
|
||||
<span class="badge">{{ doc.chunks_count }} chunks</span>
|
||||
{% endif %}
|
||||
{% if doc.has_images %}
|
||||
<span class="badge">{{ doc.image_count }} images</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Metadata -->
|
||||
{% if doc.title or doc.author %}
|
||||
<div class="mt-2 mb-2">
|
||||
{% if doc.title %}
|
||||
<div><strong>Titre :</strong> {{ doc.title }}</div>
|
||||
{% endif %}
|
||||
{% if doc.author %}
|
||||
<div><strong>Auteur :</strong> {{ doc.author }}</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<!-- Table of contents (preview) -->
|
||||
{% if doc.toc and doc.toc|length > 0 %}
|
||||
<div class="mt-2 mb-2" style="background: var(--color-bg-secondary); padding: 0.75rem 1rem; border-radius: 8px;">
|
||||
<strong style="font-size: 0.85rem; color: var(--color-accent-alt);">Table des matières :</strong>
|
||||
<ul style="list-style: none; padding-left: 0; margin: 0.5rem 0 0 0; font-size: 0.9rem;">
|
||||
{% for item in doc.toc[:5] %}
|
||||
<li style="padding: 0.2rem 0; padding-left: {{ (item.level - 1) * 1 }}rem;">
|
||||
{% if item.level == 1 %}
|
||||
<strong>{{ item.title }}</strong>
|
||||
{% else %}
|
||||
<span style="color: var(--color-accent-alt);">{{ item.title }}</span>
|
||||
{% endif %}
|
||||
</li>
|
||||
{% endfor %}
|
||||
{% if doc.toc|length > 5 %}
|
||||
<li style="padding: 0.2rem 0; color: var(--color-accent); font-style: italic;">
|
||||
... et {{ doc.toc|length - 5 }} autres sections
|
||||
</li>
|
||||
{% endif %}
|
||||
</ul>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<!-- File access buttons -->
|
||||
<div class="mt-2">
|
||||
<div style="display: flex; flex-wrap: wrap; gap: 0.5rem;">
|
||||
<a href="/documents/{{ doc.name }}/view" class="btn btn-sm btn-primary">
|
||||
👁️ Voir détails
|
||||
</a>
|
||||
{% if doc.has_markdown %}
|
||||
<a href="/output/{{ doc.name }}/{{ doc.name }}.md" target="_blank" class="btn btn-sm">
|
||||
📄 Markdown
|
||||
</a>
|
||||
{% endif %}
|
||||
{% if doc.has_chunks %}
|
||||
<a href="/output/{{ doc.name }}/{{ doc.name }}_chunks.json" target="_blank" class="btn btn-sm">
|
||||
📊 Chunks
|
||||
</a>
|
||||
{% endif %}
|
||||
{% if doc.has_structured %}
|
||||
<a href="/output/{{ doc.name }}/{{ doc.name }}_structured.json" target="_blank" class="btn btn-sm">
|
||||
🧠 Structure LLM
|
||||
</a>
|
||||
{% endif %}
|
||||
<a href="/output/{{ doc.name }}/{{ doc.name }}_ocr.json" target="_blank" class="btn btn-sm">
|
||||
🔍 OCR brut
|
||||
</a>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="passage-meta mt-2" style="display: flex; justify-content: space-between; align-items: center;">
|
||||
<span><strong>Dossier :</strong> output/{{ doc.name }}/</span>
|
||||
<form action="/documents/delete/{{ doc.name }}" method="post" style="margin: 0;" data-doc-name="{{ doc.name }}" onsubmit="return confirm('⚠️ Supprimer le document « ' + this.dataset.docName + ' » ?\n\n• Fichiers locaux (markdown, chunks, images)\n• Chunks dans Weaviate ({{ doc.chunks_count or 0 }} passages)\n\n⚠️ Cette action est IRRÉVERSIBLE.');">
|
||||
<button type="submit" class="btn btn-sm" style="color: #a05252; border-color: #a05252; padding: 0.3rem 0.6rem;" title="Supprimer ce document et ses données Weaviate">
|
||||
🗑️ Supprimer
|
||||
</button>
|
||||
</form>
|
||||
</div>
|
||||
</div>
|
||||
{% endfor %}
|
||||
|
||||
{% else %}
|
||||
<div class="empty-state">
|
||||
<div class="empty-state-icon">📭</div>
|
||||
<h3>Aucun document traité</h3>
|
||||
<p class="text-muted">Uploadez un PDF pour commencer.</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<div class="text-center mt-4">
|
||||
<a href="/upload" class="btn btn-primary">Analyser un PDF</a>
|
||||
</div>
|
||||
</section>
|
||||
{% endblock %}
|
||||
106
generations/library_rag/templates/index.html
Normal file
@@ -0,0 +1,106 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Accueil{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<section class="section">
|
||||
<h1 class="text-center">Bienvenue sur Philosophia</h1>
|
||||
<p class="lead text-center">Explorez les textes philosophiques indexés dans Weaviate</p>
|
||||
|
||||
<div class="ornament">· · ·</div>
|
||||
|
||||
{% if stats %}
|
||||
<!-- Statistics -->
|
||||
<div class="stats-grid">
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ stats.passages }}</div>
|
||||
<div class="stat-label">Passages</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ stats.works }}</div>
|
||||
<div class="stat-label">Œuvres</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ stats.authors }}</div>
|
||||
<div class="stat-label">Auteurs</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ stats.languages }}</div>
|
||||
<div class="stat-label">Langues</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<hr class="divider">
|
||||
|
||||
<div class="form-row">
|
||||
<!-- Works -->
|
||||
<div class="card">
|
||||
<h3>📖 Œuvres disponibles</h3>
|
||||
{% if stats.work_list %}
|
||||
<ul class="mt-2" style="list-style: none;">
|
||||
{% for work in stats.work_list %}
|
||||
<li style="padding: 0.3rem 0;">
|
||||
<a href="/passages?work={{ work | urlencode }}">{{ work }}</a>
|
||||
</li>
|
||||
{% endfor %}
|
||||
</ul>
|
||||
{% else %}
|
||||
<p class="text-muted">Aucune œuvre trouvée</p>
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
<!-- Authors -->
|
||||
<div class="card">
|
||||
<h3>✍️ Auteurs</h3>
|
||||
{% if stats.author_list %}
|
||||
<ul class="mt-2" style="list-style: none;">
|
||||
{% for author in stats.author_list %}
|
||||
<li style="padding: 0.3rem 0;">
|
||||
<a href="/passages?author={{ author | urlencode }}">{{ author }}</a>
|
||||
</li>
|
||||
{% endfor %}
|
||||
</ul>
|
||||
{% else %}
|
||||
<p class="text-muted">Aucun auteur trouvé</p>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<hr class="divider">
|
||||
|
||||
<!-- How to use -->
|
||||
<div class="card">
|
||||
<h3>💡 Comment utiliser Philosophia ?</h3>
|
||||
<div class="mt-2">
|
||||
<p><strong>1. Parcourir les passages</strong> — Consultez tous les passages indexés avec filtres par auteur ou œuvre</p>
|
||||
<p><strong>2. Recherche sémantique</strong> — Posez une question en langage naturel pour trouver des passages pertinents</p>
|
||||
<p class="mb-1">Exemples de recherches :</p>
|
||||
<ul class="list-inline">
|
||||
<li><span class="badge">Qu'est-ce que la vertu ?</span></li>
|
||||
<li><span class="badge">La mort est-elle à craindre ?</span></li>
|
||||
<li><span class="badge">Comment vivre une vie juste ?</span></li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="text-center mt-4">
|
||||
<a href="/search" class="btn btn-primary">Commencer une recherche</a>
|
||||
<a href="/passages" class="btn" style="margin-left: 0.5rem;">Parcourir les passages</a>
|
||||
</div>
|
||||
|
||||
{% else %}
|
||||
<div class="alert alert-warning">
|
||||
<strong>⚠️ Base de données non disponible</strong><br>
|
||||
Assurez-vous que Weaviate est démarré et que les données sont chargées.
|
||||
</div>
|
||||
<div class="card">
|
||||
<h3>Pour démarrer :</h3>
|
||||
<pre style="background: var(--color-bg-secondary); padding: 1rem; border-radius: 8px; margin-top: 1rem; overflow-x: auto;">docker-compose up -d
|
||||
python schema.py
|
||||
python ingest_test.py</pre>
|
||||
</div>
|
||||
{% endif %}
|
||||
</section>
|
||||
{% endblock %}
|
||||
|
||||
|
||||
117
generations/library_rag/templates/passages.html
Normal file
@@ -0,0 +1,117 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Passages{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<section class="section">
|
||||
<h1>📚 Parcourir les passages</h1>
|
||||
<p class="lead">Explorez tous les passages indexés dans la base de données</p>
|
||||
|
||||
<!-- Filters -->
|
||||
<div class="search-box">
|
||||
<form method="get" action="/passages">
|
||||
<div class="form-row">
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="author">Auteur</label>
|
||||
<select name="author" id="author" class="form-control">
|
||||
<option value="">Tous les auteurs</option>
|
||||
{% if stats and stats.author_list %}
|
||||
{% for author in stats.author_list %}
|
||||
<option value="{{ author }}" {{ 'selected' if author_filter == author else '' }}>{{ author }}</option>
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
</select>
|
||||
</div>
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="work">Œuvre</label>
|
||||
<select name="work" id="work" class="form-control">
|
||||
<option value="">Toutes les œuvres</option>
|
||||
{% if stats and stats.work_list %}
|
||||
{% for work in stats.work_list %}
|
||||
<option value="{{ work }}" {{ 'selected' if work_filter == work else '' }}>{{ work }}</option>
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
</select>
|
||||
</div>
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="per_page">Par page</label>
|
||||
<select name="per_page" id="per_page" class="form-control">
|
||||
<option value="10" {{ 'selected' if per_page == 10 else '' }}>10</option>
|
||||
<option value="20" {{ 'selected' if per_page == 20 else '' }}>20</option>
|
||||
<option value="50" {{ 'selected' if per_page == 50 else '' }}>50</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
<div class="mt-2">
|
||||
<button type="submit" class="btn btn-primary">Filtrer</button>
|
||||
<a href="/passages" class="btn" style="margin-left: 0.5rem;">Réinitialiser</a>
|
||||
</div>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<!-- Active filters -->
|
||||
{% if author_filter or work_filter %}
|
||||
<div class="mb-3">
|
||||
<span class="text-muted">Filtres actifs :</span>
|
||||
{% if author_filter %}
|
||||
<span class="badge badge-author">{{ author_filter }}</span>
|
||||
{% endif %}
|
||||
{% if work_filter %}
|
||||
<span class="badge badge-work">{{ work_filter }}</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<!-- Chunks list -->
|
||||
{% if chunks %}
|
||||
{% for chunk in chunks %}
|
||||
<div class="passage-card">
|
||||
<div class="passage-header">
|
||||
<div>
|
||||
<span class="badge badge-work">{{ chunk.work.title if chunk.work else '?' }} {{ chunk.sectionPath or '' }}</span>
|
||||
<span class="badge badge-author">{{ chunk.work.author if chunk.work else 'Anonyme' }}</span>
|
||||
</div>
|
||||
</div>
|
||||
<div class="passage-text">"{{ chunk.text }}"</div>
|
||||
<div class="passage-meta">
|
||||
<strong>Type :</strong> {{ chunk.unitType or '—' }} │
|
||||
<strong>Langue :</strong> {{ (chunk.language or '—') | upper }} │
|
||||
<strong>Index :</strong> {{ '—' if chunk.orderIndex is undefined or chunk.orderIndex is none else chunk.orderIndex }}
|
||||
</div>
|
||||
{% if chunk.keywords %}
|
||||
<div class="mt-2">
|
||||
{% for kw in chunk.keywords %}
|
||||
<span class="keyword-tag">{{ kw }}</span>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endfor %}
|
||||
|
||||
<!-- Pagination -->
|
||||
<div class="pagination">
|
||||
{% if page > 1 %}
|
||||
<a href="/passages?page={{ page - 1 }}&per_page={{ per_page }}{% if author_filter %}&author={{ author_filter | urlencode }}{% endif %}{% if work_filter %}&work={{ work_filter | urlencode }}{% endif %}" class="btn btn-sm">← Précédent</a>
|
||||
{% else %}
|
||||
<span class="btn btn-sm" style="opacity: 0.4; cursor: not-allowed;">← Précédent</span>
|
||||
{% endif %}
|
||||
|
||||
<span class="pagination-info">Page {{ page }}</span>
|
||||
|
||||
{% if chunks | length >= per_page %}
|
||||
<a href="/passages?page={{ page + 1 }}&per_page={{ per_page }}{% if author_filter %}&author={{ author_filter | urlencode }}{% endif %}{% if work_filter %}&work={{ work_filter | urlencode }}{% endif %}" class="btn btn-sm">Suivant →</a>
|
||||
{% else %}
|
||||
<span class="btn btn-sm" style="opacity: 0.4; cursor: not-allowed;">Suivant →</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% else %}
|
||||
<div class="empty-state">
|
||||
<div class="empty-state-icon">📭</div>
|
||||
<h3>Aucun passage trouvé</h3>
|
||||
<p class="text-muted">Essayez de modifier vos filtres ou <a href="/passages">réinitialisez</a>.</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
</section>
|
||||
{% endblock %}
|
||||
|
||||
|
||||
134
generations/library_rag/templates/search.html
Normal file
@@ -0,0 +1,134 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Recherche{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<section class="section">
|
||||
<h1>🔍 Recherche sémantique</h1>
|
||||
<p class="lead">Posez une question en langage naturel pour trouver des passages pertinents</p>
|
||||
|
||||
<!-- Search form -->
|
||||
<div class="search-box">
|
||||
<form method="get" action="/search">
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="q">Votre question</label>
|
||||
<input
|
||||
type="text"
|
||||
name="q"
|
||||
id="q"
|
||||
class="form-control search-input"
|
||||
value="{{ query }}"
|
||||
placeholder="Ex: Qu'est-ce que la sagesse ? Pourquoi philosopher ?"
|
||||
autofocus
|
||||
>
|
||||
</div>
|
||||
<div class="form-row">
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="author">Auteur</label>
|
||||
<select name="author" id="author" class="form-control">
|
||||
<option value="">Tous les auteurs</option>
|
||||
{% if stats and stats.author_list %}
|
||||
{% for author in stats.author_list %}
|
||||
<option value="{{ author }}" {{ 'selected' if author_filter == author else '' }}>{{ author }}</option>
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
</select>
|
||||
</div>
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="work">Œuvre</label>
|
||||
<select name="work" id="work" class="form-control">
|
||||
<option value="">Toutes les œuvres</option>
|
||||
{% if stats and stats.work_list %}
|
||||
{% for work in stats.work_list %}
|
||||
<option value="{{ work }}" {{ 'selected' if work_filter == work else '' }}>{{ work }}</option>
|
||||
{% endfor %}
|
||||
{% endif %}
|
||||
</select>
|
||||
</div>
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="limit">Résultats</label>
|
||||
<select name="limit" id="limit" class="form-control">
|
||||
<option value="5" {{ 'selected' if limit == 5 else '' }}>5</option>
|
||||
<option value="10" {{ 'selected' if limit == 10 else '' }}>10</option>
|
||||
<option value="20" {{ 'selected' if limit == 20 else '' }}>20</option>
|
||||
</select>
|
||||
</div>
|
||||
</div>
|
||||
<div class="mt-2">
|
||||
<button type="submit" class="btn btn-primary">Rechercher</button>
|
||||
<a href="/search" class="btn" style="margin-left: 0.5rem;">Réinitialiser</a>
|
||||
</div>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<!-- Results -->
|
||||
{% if query %}
|
||||
<div class="ornament">·</div>
|
||||
|
||||
{% if results %}
|
||||
<div class="mb-3">
|
||||
<strong>{{ results | length }}</strong> passage{% if results | length > 1 %}s{% endif %} trouvé{% if results | length > 1 %}s{% endif %}
|
||||
{% if author_filter or work_filter %}
|
||||
<span class="text-muted">—</span>
|
||||
{% if author_filter %}
|
||||
<span class="badge badge-author">{{ author_filter }}</span>
|
||||
{% endif %}
|
||||
{% if work_filter %}
|
||||
<span class="badge badge-work">{{ work_filter }}</span>
|
||||
{% endif %}
|
||||
{% endif %}
|
||||
</div>
|
||||
|
||||
{% for result in results %}
|
||||
<div class="passage-card">
|
||||
<div class="passage-header">
|
||||
<div>
|
||||
<span class="badge badge-work">{{ result.work.title if result.work else '?' }} {{ result.sectionPath or '' }}</span>
|
||||
<span class="badge badge-author">{{ result.work.author if result.work else 'Anonyme' }}</span>
|
||||
</div>
|
||||
{% if result.similarity %}
|
||||
<span class="badge badge-similarity">⚡ {{ result.similarity }}% similaire</span>
|
||||
{% endif %}
|
||||
</div>
|
||||
<div class="passage-text">"{{ result.text }}"</div>
|
||||
<div class="passage-meta">
|
||||
<strong>Type :</strong> {{ result.unitType or '—' }} │
|
||||
<strong>Langue :</strong> {{ (result.language or '—') | upper }} │
|
||||
<strong>Index :</strong> {{ '—' if result.orderIndex is undefined or result.orderIndex is none else result.orderIndex }}
|
||||
</div>
|
||||
{% if result.keywords %}
|
||||
<div class="mt-2">
|
||||
{% for kw in result.keywords %}
|
||||
<span class="keyword-tag">{{ kw }}</span>
|
||||
{% endfor %}
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
{% endfor %}
|
||||
{% else %}
|
||||
<div class="empty-state">
|
||||
<div class="empty-state-icon">🔮</div>
|
||||
<h3>Aucun résultat trouvé</h3>
|
||||
<p class="text-muted">Essayez une autre formulation ou modifiez vos filtres.</p>
|
||||
</div>
|
||||
{% endif %}
|
||||
{% else %}
|
||||
<!-- Suggestions -->
|
||||
<div class="card">
|
||||
<h3>💡 Suggestions de recherche</h3>
|
||||
<div class="mt-2">
|
||||
<p class="mb-2">Voici quelques exemples de questions que vous pouvez poser :</p>
|
||||
<div>
|
||||
<a href="/search?q=Qu%27est-ce%20que%20la%20vertu%20%3F" class="badge" style="cursor: pointer;">Qu'est-ce que la vertu ?</a>
|
||||
<a href="/search?q=La%20mort%20est-elle%20%C3%A0%20craindre%20%3F" class="badge" style="cursor: pointer;">La mort est-elle à craindre ?</a>
|
||||
<a href="/search?q=Comment%20atteindre%20le%20bonheur%20%3F" class="badge" style="cursor: pointer;">Comment atteindre le bonheur ?</a>
|
||||
<a href="/search?q=Qu%27est-ce%20que%20la%20justice%20%3F" class="badge" style="cursor: pointer;">Qu'est-ce que la justice ?</a>
|
||||
<a href="/search?q=L%27%C3%A2me%20est-elle%20immortelle%20%3F" class="badge" style="cursor: pointer;">L'âme est-elle immortelle ?</a>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
</section>
|
||||
{% endblock %}
|
||||
|
||||
|
||||
285
generations/library_rag/templates/test_chat_backend.html
Normal file
@@ -0,0 +1,285 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="fr">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>Test Chat Backend</title>
|
||||
<style>
|
||||
body {
|
||||
font-family: system-ui, -apple-system, sans-serif;
|
||||
max-width: 1000px;
|
||||
margin: 2rem auto;
|
||||
padding: 0 1rem;
|
||||
background: #f5f5f5;
|
||||
}
|
||||
h1 {
|
||||
color: #333;
|
||||
}
|
||||
.container {
|
||||
background: white;
|
||||
padding: 2rem;
|
||||
border-radius: 8px;
|
||||
box-shadow: 0 2px 4px rgba(0,0,0,0.1);
|
||||
}
|
||||
.form-group {
|
||||
margin-bottom: 1.5rem;
|
||||
}
|
||||
label {
|
||||
display: block;
|
||||
font-weight: 500;
|
||||
margin-bottom: 0.5rem;
|
||||
color: #555;
|
||||
}
|
||||
input, select, textarea {
|
||||
width: 100%;
|
||||
padding: 0.75rem;
|
||||
border: 1px solid #ddd;
|
||||
border-radius: 4px;
|
||||
font-size: 1rem;
|
||||
}
|
||||
textarea {
|
||||
min-height: 100px;
|
||||
font-family: inherit;
|
||||
}
|
||||
button {
|
||||
background: #007bff;
|
||||
color: white;
|
||||
border: none;
|
||||
padding: 0.75rem 2rem;
|
||||
border-radius: 4px;
|
||||
font-size: 1rem;
|
||||
cursor: pointer;
|
||||
}
|
||||
button:hover {
|
||||
background: #0056b3;
|
||||
}
|
||||
button:disabled {
|
||||
background: #ccc;
|
||||
cursor: not-allowed;
|
||||
}
|
||||
.output {
|
||||
margin-top: 2rem;
|
||||
padding: 1.5rem;
|
||||
background: #f9f9f9;
|
||||
border-radius: 4px;
|
||||
border: 1px solid #ddd;
|
||||
}
|
||||
.output h3 {
|
||||
margin-top: 0;
|
||||
}
|
||||
.context {
|
||||
background: #e3f2fd;
|
||||
padding: 1rem;
|
||||
border-radius: 4px;
|
||||
margin-bottom: 1rem;
|
||||
}
|
||||
.context-item {
|
||||
background: white;
|
||||
padding: 0.75rem;
|
||||
margin-bottom: 0.5rem;
|
||||
border-radius: 4px;
|
||||
border-left: 3px solid #2196f3;
|
||||
}
|
||||
.response {
|
||||
background: white;
|
||||
padding: 1.5rem;
|
||||
border-radius: 4px;
|
||||
white-space: pre-wrap;
|
||||
font-family: -apple-system, system-ui, sans-serif;
|
||||
line-height: 1.6;
|
||||
min-height: 100px;
|
||||
}
|
||||
.status {
|
||||
display: inline-block;
|
||||
padding: 0.25rem 0.75rem;
|
||||
border-radius: 4px;
|
||||
font-size: 0.85rem;
|
||||
font-weight: 500;
|
||||
}
|
||||
.status-searching { background: #fff3cd; color: #856404; }
|
||||
.status-generating { background: #d1ecf1; color: #0c5460; }
|
||||
.status-complete { background: #d4edda; color: #155724; }
|
||||
.status-error { background: #f8d7da; color: #721c24; }
|
||||
.log {
|
||||
font-family: monospace;
|
||||
font-size: 0.85rem;
|
||||
color: #666;
|
||||
margin-top: 0.5rem;
|
||||
}
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<div class="container">
|
||||
<h1>🧪 Test Chat Backend RAG</h1>
|
||||
|
||||
<div class="form-group">
|
||||
<label for="question">Question :</label>
|
||||
<textarea id="question" placeholder="Qu'est-ce que la vertu ?">Qu'est-ce que la vertu ?</textarea>
|
||||
</div>
|
||||
|
||||
<div class="form-group">
|
||||
<label for="provider">Provider :</label>
|
||||
<select id="provider">
|
||||
<option value="mistral">Mistral API</option>
|
||||
<option value="anthropic">Anthropic (Claude)</option>
|
||||
<option value="openai">OpenAI</option>
|
||||
<option value="ollama">Ollama (local)</option>
|
||||
</select>
|
||||
</div>
|
||||
|
||||
<div class="form-group">
|
||||
<label for="model">Model :</label>
|
||||
<select id="model">
|
||||
<!-- Mistral models -->
|
||||
<option value="mistral-small-latest" data-provider="mistral">mistral-small-latest</option>
|
||||
<option value="mistral-large-latest" data-provider="mistral">mistral-large-latest</option>
|
||||
|
||||
<!-- Anthropic models -->
|
||||
<option value="claude-sonnet-4-5-20250929" data-provider="anthropic">claude-sonnet-4-5</option>
|
||||
<option value="claude-opus-4-5-20251101" data-provider="anthropic">claude-opus-4-5</option>
|
||||
|
||||
<!-- OpenAI models -->
|
||||
<option value="gpt-5.2" data-provider="openai">ChatGPT 5.2</option>
|
||||
<option value="gpt-4o" data-provider="openai">GPT-4o</option>
|
||||
<option value="gpt-4o-mini" data-provider="openai">GPT-4o Mini</option>
|
||||
<option value="o1-preview" data-provider="openai">o1-preview</option>
|
||||
|
||||
<!-- Ollama models -->
|
||||
<option value="qwen2.5:7b" data-provider="ollama">qwen2.5:7b</option>
|
||||
</select>
|
||||
</div>
|
||||
|
||||
<div class="form-group">
|
||||
<label for="limit">Nombre de contextes RAG :</label>
|
||||
<input type="number" id="limit" value="3" min="1" max="10">
|
||||
</div>
|
||||
|
||||
<button id="sendBtn" onclick="sendQuestion()">Envoyer</button>
|
||||
|
||||
<div class="output" id="output" style="display: none;">
|
||||
<h3>Résultat :</h3>
|
||||
<div class="log" id="log"></div>
|
||||
<div id="contextSection" style="display: none;">
|
||||
<h4>📚 Contexte RAG :</h4>
|
||||
<div class="context" id="context"></div>
|
||||
</div>
|
||||
<div id="responseSection" style="display: none;">
|
||||
<h4>💬 Réponse :</h4>
|
||||
<div class="response" id="response"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<script>
|
||||
// Auto-select model when provider changes
|
||||
document.getElementById('provider').addEventListener('change', function() {
|
||||
const provider = this.value;
|
||||
const modelSelect = document.getElementById('model');
|
||||
const options = modelSelect.querySelectorAll('option');
|
||||
|
||||
for (let option of options) {
|
||||
if (option.dataset.provider === provider) {
|
||||
option.selected = true;
|
||||
break;
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
async function sendQuestion() {
|
||||
const question = document.getElementById('question').value.trim();
|
||||
const provider = document.getElementById('provider').value;
|
||||
const model = document.getElementById('model').value;
|
||||
const limit = parseInt(document.getElementById('limit').value, 10);
|
||||
|
||||
if (!question) {
|
||||
alert('Veuillez entrer une question');
|
||||
return;
|
||||
}
|
||||
|
||||
// Reset UI
|
||||
document.getElementById('output').style.display = 'block';
|
||||
document.getElementById('contextSection').style.display = 'none';
|
||||
document.getElementById('responseSection').style.display = 'none';
|
||||
document.getElementById('context').innerHTML = '';
|
||||
document.getElementById('response').textContent = '';
|
||||
document.getElementById('sendBtn').disabled = true;
|
||||
|
||||
// Log
|
||||
const logDiv = document.getElementById('log');
|
||||
logDiv.innerHTML = `<span class="status status-searching">Envoi...</span>`;
|
||||
|
||||
try {
|
||||
// Step 1: POST /chat/send
|
||||
const response = await fetch('/chat/send', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ question, provider, model, limit })
|
||||
});
|
||||
|
||||
if (!response.ok) {
|
||||
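// The error body may not be JSON (e.g. an HTML 500 page), so fall back to the status code
const error = await response.json().catch(() => ({}));
throw new Error(error.error || `Erreur HTTP ${response.status}`);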
|
||||
}
|
||||
|
||||
const data = await response.json();
|
||||
const sessionId = data.session_id;
|
||||
|
||||
logDiv.innerHTML = `<span class="status status-generating">Session: ${sessionId}</span>`;
|
||||
|
||||
// Step 2: SSE /chat/stream/<session_id>
|
||||
const eventSource = new EventSource(`/chat/stream/${sessionId}`);
|
||||
|
||||
eventSource.onmessage = function(event) {
|
||||
try {
|
||||
const data = JSON.parse(event.data);
|
||||
|
||||
if (data.type === 'context') {
|
||||
// Show RAG context
|
||||
document.getElementById('contextSection').style.display = 'block';
|
||||
const contextDiv = document.getElementById('context');
|
||||
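// Escape untrusted values before interpolating them into innerHTML
const esc = (value) => String(value ?? '').replace(/[&<>"']/g, (c) => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }[c]));
contextDiv.innerHTML = data.chunks.map((chunk, i) => `
<div class="context-item">
<strong>Passage ${i + 1}</strong> (${esc(chunk.similarity)}%) - ${esc(chunk.author)} - ${esc(chunk.work)}<br>
<small>${esc(chunk.section)}</small><br>
<div style="margin-top: 0.5rem; font-size: 0.9rem;">${esc((chunk.text || '').substring(0, 200))}...</div>
</div>
`).join('');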
|
||||
}
|
||||
else if (data.type === 'token') {
|
||||
// Stream tokens
|
||||
document.getElementById('responseSection').style.display = 'block';
|
||||
const responseDiv = document.getElementById('response');
|
||||
responseDiv.textContent += data.content;
|
||||
}
|
||||
else if (data.type === 'complete') {
|
||||
// Complete
|
||||
logDiv.innerHTML = `<span class="status status-complete">✓ Terminé</span>`;
|
||||
eventSource.close();
|
||||
document.getElementById('sendBtn').disabled = false;
|
||||
}
|
||||
else if (data.type === 'error') {
|
||||
// Error
|
||||
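// Set the server-provided message via textContent so it is never parsed as HTML
logDiv.innerHTML = `<span class="status status-error"></span>`;
logDiv.firstChild.textContent = `✗ ${data.message}`;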
|
||||
eventSource.close();
|
||||
document.getElementById('sendBtn').disabled = false;
|
||||
}
|
||||
} catch (e) {
|
||||
console.error('Parse error:', e);
|
||||
}
|
||||
};
|
||||
|
||||
eventSource.onerror = function(error) {
|
||||
console.error('SSE error:', error);
|
||||
logDiv.innerHTML = `<span class="status status-error">✗ Erreur de connexion SSE</span>`;
|
||||
eventSource.close();
|
||||
document.getElementById('sendBtn').disabled = false;
|
||||
};
|
||||
|
||||
} catch (error) {
|
||||
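// Set the error message via textContent so it is never parsed as HTML
logDiv.innerHTML = `<span class="status status-error"></span>`;
logDiv.firstChild.textContent = `✗ ${error.message}`;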
|
||||
document.getElementById('sendBtn').disabled = false;
|
||||
}
|
||||
}
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
178
generations/library_rag/templates/upload.html
Normal file
@@ -0,0 +1,178 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Upload Document{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<section class="section">
|
||||
<h1>📄 Parser PDF/Markdown</h1>
|
||||
<p class="lead">Uploadez un fichier PDF ou Markdown pour l'analyser et structurer son contenu</p>
|
||||
|
||||
{% if error %}
|
||||
<div class="alert alert-warning">
|
||||
<strong>Erreur :</strong> {{ error }}
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<div class="search-box">
|
||||
<form method="post" enctype="multipart/form-data">
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="file">Fichier PDF ou Markdown</label>
|
||||
<input
|
||||
type="file"
|
||||
name="file"
|
||||
id="file"
|
||||
class="form-control"
|
||||
accept=".pdf,.md"
|
||||
required
|
||||
>
|
||||
<div class="caption mt-1">Taille maximale : 50 MB</div>
|
||||
<div class="caption" style="color: var(--color-accent); margin-top: 0.25rem;">
|
||||
💡 Pour retester un document existant sans refaire l'OCR payant, cochez "Skip OCR"
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="form-row mt-3">
|
||||
<div class="form-group">
|
||||
<label class="form-label">Options</label>
|
||||
<div style="display: flex; flex-direction: column; gap: 0.5rem;">
|
||||
<div style="display: flex; align-items: center; gap: 0.5rem;">
|
||||
<input
|
||||
type="checkbox"
|
||||
name="skip_ocr"
|
||||
id="skip_ocr"
|
||||
style="width: auto;"
|
||||
>
|
||||
<label for="skip_ocr" style="margin: 0; font-size: 0.95rem; text-transform: none; letter-spacing: 0;">
|
||||
⚡ Skip OCR (réutiliser markdown existant)
|
||||
</label>
|
||||
</div>
|
||||
<div style="display: flex; align-items: center; gap: 0.5rem;">
|
||||
<input
|
||||
type="checkbox"
|
||||
name="use_llm"
|
||||
id="use_llm"
|
||||
checked
|
||||
style="width: auto;"
|
||||
>
|
||||
<label for="use_llm" style="margin: 0; font-size: 0.95rem; text-transform: none; letter-spacing: 0;">
|
||||
Activer la structuration LLM (Mistral ou Ollama)
|
||||
</label>
|
||||
</div>
|
||||
<div style="display: flex; align-items: center; gap: 0.5rem;">
|
||||
<input
|
||||
type="checkbox"
|
||||
name="ingest_weaviate"
|
||||
id="ingest_weaviate"
|
||||
checked
|
||||
style="width: auto;"
|
||||
>
|
||||
<label for="ingest_weaviate" style="margin: 0; font-size: 0.95rem; text-transform: none; letter-spacing: 0;">
|
||||
Insérer dans Weaviate (vectorisation)
|
||||
</label>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="llm_provider">Provider LLM</label>
|
||||
<select name="llm_provider" id="llm_provider" class="form-control" onchange="updateModelOptions()">
|
||||
<option value="mistral" selected>⚡ Mistral API (rapide)</option>
|
||||
<option value="ollama">🖥️ Ollama (local, lent)</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="form-group">
|
||||
<label class="form-label" for="llm_model">Modèle LLM</label>
|
||||
<select name="llm_model" id="llm_model" class="form-control">
|
||||
<!-- Options Mistral API -->
|
||||
<option value="mistral-small-latest" selected>mistral-small (rapide, économique)</option>
|
||||
<option value="mistral-medium-latest">mistral-medium (équilibré)</option>
|
||||
<option value="mistral-large-latest">mistral-large (puissant)</option>
|
||||
</select>
|
||||
</div>
|
||||
<script>
|
||||
function updateModelOptions() {
|
||||
const provider = document.getElementById('llm_provider').value;
|
||||
const modelSelect = document.getElementById('llm_model');
|
||||
|
||||
if (provider === 'mistral') {
|
||||
modelSelect.innerHTML = `
|
||||
<option value="mistral-small-latest" selected>mistral-small (rapide, économique)</option>
|
||||
<option value="mistral-medium-latest">mistral-medium (équilibré)</option>
|
||||
<option value="mistral-large-latest">mistral-large (puissant)</option>
|
||||
`;
|
||||
} else {
|
||||
modelSelect.innerHTML = `
|
||||
<option value="qwen2.5:7b" selected>qwen2.5:7b (recommandé)</option>
|
||||
<option value="qwen2.5:14b">qwen2.5:14b</option>
|
||||
<option value="llama3.2:3b">llama3.2:3b (rapide)</option>
|
||||
<option value="mistral:7b">mistral:7b</option>
|
||||
`;
|
||||
}
|
||||
}
|
||||
</script>
|
||||
</div>
|
||||
|
||||
<!-- Options Extraction TOC améliorée -->
|
||||
<div class="card mt-4" style="border-left: 3px solid #4CAF50;">
|
||||
<h4 style="color: #4CAF50;">📑 Extraction TOC améliorée (Recommandé)</h4>
|
||||
<p style="font-size: 0.9rem; color: #666;">
|
||||
Analyse l'indentation du texte pour détecter automatiquement la hiérarchie de la table des matières.
|
||||
<br><strong style="color: #4CAF50;">✅ Fiable, rapide et sans coût supplémentaire</strong>
|
||||
</p>
|
||||
<div style="display: flex; align-items: center; gap: 0.5rem; margin-top: 1rem;">
|
||||
<input
|
||||
type="checkbox"
|
||||
name="use_ocr_annotations"
|
||||
id="use_ocr_annotations"
|
||||
style="width: auto;"
|
||||
checked
|
||||
>
|
||||
<label for="use_ocr_annotations" style="margin: 0; font-size: 0.95rem; font-weight: 600;">
|
||||
Activer l'analyse d'indentation pour la TOC
|
||||
</label>
|
||||
</div>
|
||||
<div style="margin-top: 0.75rem; padding: 0.75rem; background: #f0f9f0; border-radius: 4px; font-size: 0.85rem;">
|
||||
<strong>Fonctionnement :</strong> Détecte les niveaux hiérarchiques en comptant les espaces d'indentation dans la table des matières.
|
||||
<br>
|
||||
<em>Idéal pour les documents académiques avec TOC structurée.</em>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="mt-3">
|
||||
<button type="submit" class="btn btn-primary">
|
||||
Analyser le document
|
||||
</button>
|
||||
</div>
|
||||
</form>
|
||||
</div>
|
||||
|
||||
<hr class="divider">
|
||||
|
||||
<div class="card">
|
||||
<h3>📋 Pipeline de traitement</h3>
|
||||
<div class="mt-2">
|
||||
<p><strong>1. OCR Mistral</strong> — Extraction du texte et des images via l'API Mistral</p>
|
||||
<p><strong>2. Markdown</strong> — Construction du document Markdown avec images</p>
|
||||
<p><strong>3. Hiérarchie</strong> — Analyse des titres pour créer une structure arborescente</p>
|
||||
<p><strong>4. LLM (optionnel)</strong> — Amélioration de la structure via Ollama</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="card mt-3">
|
||||
<h3>📁 Fichiers générés</h3>
|
||||
<div class="mt-2">
|
||||
<ul style="list-style: none;">
|
||||
<li class="mb-1"><span class="badge">document.md</span> Texte Markdown OCR</li>
|
||||
<li class="mb-1"><span class="badge">document_chunks.json</span> Chunks hiérarchiques</li>
|
||||
<li class="mb-1"><span class="badge">document_structured.json</span> Structure LLM</li>
|
||||
<li class="mb-1"><span class="badge">document_ocr.json</span> Réponse OCR brute</li>
|
||||
<li><span class="badge">images/</span> Images extraites</li>
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="text-center mt-4">
|
||||
<a href="/documents" class="btn">Voir les documents traités</a>
|
||||
</div>
|
||||
</section>
|
||||
{% endblock %}
|
||||
|
||||
447
generations/library_rag/templates/upload_progress.html
Normal file
@@ -0,0 +1,447 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Traitement en cours{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
<style>
|
||||
.progress-container {
|
||||
max-width: 600px;
|
||||
margin: 0 auto;
|
||||
padding: 2rem;
|
||||
}
|
||||
|
||||
.progress-header {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 1rem;
|
||||
margin-bottom: 2rem;
|
||||
}
|
||||
|
||||
.progress-icon {
|
||||
width: 48px;
|
||||
height: 48px;
|
||||
border-radius: 50%;
|
||||
background: var(--color-accent);
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
font-size: 1.5rem;
|
||||
}
|
||||
|
||||
.progress-icon.error {
|
||||
background: #c0392b;
|
||||
}
|
||||
|
||||
.progress-icon.success {
|
||||
background: var(--color-accent-alt);
|
||||
}
|
||||
|
||||
.progress-title {
|
||||
flex: 1;
|
||||
}
|
||||
|
||||
.progress-title h2 {
|
||||
margin: 0;
|
||||
font-size: 1.3rem;
|
||||
}
|
||||
|
||||
.progress-title .subtitle {
|
||||
color: var(--color-text-muted);
|
||||
font-size: 0.9rem;
|
||||
margin-top: 0.25rem;
|
||||
}
|
||||
|
||||
/* Barre de progression globale */
|
||||
.overall-progress {
|
||||
margin-bottom: 2rem;
|
||||
}
|
||||
|
||||
.overall-progress-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
margin-bottom: 0.5rem;
|
||||
font-size: 0.85rem;
|
||||
color: var(--color-text-muted);
|
||||
}
|
||||
|
||||
.progress-bar-container {
|
||||
height: 8px;
|
||||
background: rgba(125, 110, 88, 0.2);
|
||||
border-radius: 4px;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
.progress-bar {
|
||||
height: 100%;
|
||||
background: linear-gradient(90deg, var(--color-accent), var(--color-accent-alt));
|
||||
border-radius: 4px;
|
||||
transition: width 0.3s ease;
|
||||
width: 0%;
|
||||
}
|
||||
|
||||
/* Liste des étapes */
|
||||
.steps-list {
|
||||
list-style: none;
|
||||
padding: 0;
|
||||
margin: 0;
|
||||
}
|
||||
|
||||
.step-item {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
padding: 0.75rem 0;
|
||||
border-bottom: 1px solid rgba(125, 110, 88, 0.1);
|
||||
opacity: 0.5;
|
||||
transition: all 0.3s ease;
|
||||
}
|
||||
|
||||
.step-item.active {
|
||||
opacity: 1;
|
||||
}
|
||||
|
||||
.step-item.completed {
|
||||
opacity: 0.8;
|
||||
}
|
||||
|
||||
.step-item.error {
|
||||
opacity: 1;
|
||||
color: #c0392b;
|
||||
}
|
||||
|
||||
.step-icon {
|
||||
width: 32px;
|
||||
height: 32px;
|
||||
border-radius: 50%;
|
||||
background: rgba(125, 110, 88, 0.1);
|
||||
display: flex;
|
||||
align-items: center;
|
||||
justify-content: center;
|
||||
margin-right: 1rem;
|
||||
font-size: 0.9rem;
|
||||
transition: all 0.3s ease;
|
||||
}
|
||||
|
||||
.step-item.active .step-icon {
|
||||
background: var(--color-accent);
|
||||
color: var(--color-bg-main);
|
||||
}
|
||||
|
||||
.step-item.completed .step-icon {
|
||||
background: var(--color-accent-alt);
|
||||
color: var(--color-bg-main);
|
||||
}
|
||||
|
||||
.step-item.error .step-icon {
|
||||
background: #c0392b;
|
||||
color: white;
|
||||
}
|
||||
|
||||
.step-content {
|
||||
flex: 1;
|
||||
}
|
||||
|
||||
.step-name {
|
||||
font-weight: 500;
|
||||
font-size: 0.95rem;
|
||||
}
|
||||
|
||||
.step-detail {
|
||||
font-size: 0.8rem;
|
||||
color: var(--color-text-muted);
|
||||
margin-top: 0.2rem;
|
||||
}
|
||||
|
||||
.step-progress {
|
||||
font-size: 0.85rem;
|
||||
color: var(--color-text-muted);
|
||||
min-width: 40px;
|
||||
text-align: right;
|
||||
}
|
||||
|
||||
/* Spinner */
|
||||
@keyframes spin {
|
||||
0% { transform: rotate(0deg); }
|
||||
100% { transform: rotate(360deg); }
|
||||
}
|
||||
|
||||
.spinner {
|
||||
width: 16px;
|
||||
height: 16px;
|
||||
border: 2px solid rgba(125, 110, 88, 0.3);
|
||||
border-top-color: var(--color-accent);
|
||||
border-radius: 50%;
|
||||
animation: spin 1s linear infinite;
|
||||
}
|
||||
|
||||
.step-item.active .spinner {
|
||||
border-top-color: var(--color-bg-main);
|
||||
}
|
||||
|
||||
/* Message d'erreur */
|
||||
.error-message {
|
||||
background: rgba(192, 57, 43, 0.1);
|
||||
border: 1px solid rgba(192, 57, 43, 0.3);
|
||||
color: #c0392b;
|
||||
padding: 1rem;
|
||||
border-radius: 8px;
|
||||
margin-top: 1rem;
|
||||
display: none;
|
||||
}
|
||||
|
||||
.error-message.visible {
|
||||
display: block;
|
||||
}
|
||||
|
||||
/* Footer */
|
||||
.progress-footer {
|
||||
margin-top: 2rem;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
.progress-footer .caption {
|
||||
margin-bottom: 1rem;
|
||||
}
|
||||
</style>
|
||||
|
||||
<section class="section">
|
||||
<div class="progress-container">
|
||||
<div class="progress-header">
|
||||
<div class="progress-icon" id="main-icon">
|
||||
<div class="spinner"></div>
|
||||
</div>
|
||||
<div class="progress-title">
|
||||
<h2 id="main-title">Traitement en cours...</h2>
|
||||
<div class="subtitle" id="main-subtitle">{{ filename }}</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="overall-progress">
|
||||
<div class="overall-progress-header">
|
||||
<span>Progression globale</span>
|
||||
<span id="progress-percent">0%</span>
|
||||
</div>
|
||||
<div class="progress-bar-container">
|
||||
<div class="progress-bar" id="progress-bar"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<ul class="steps-list" id="steps-list">
|
||||
<li class="step-item" data-step="ocr">
|
||||
<div class="step-icon">📄</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">OCR Mistral</div>
|
||||
<div class="step-detail">Extraction du texte et des images</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="markdown">
|
||||
<div class="step-icon">📝</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Construction Markdown</div>
|
||||
<div class="step-detail">Génération du document structuré</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="metadata">
|
||||
<div class="step-icon">📖</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Extraction métadonnées</div>
|
||||
<div class="step-detail">Titre, auteur, éditeur via LLM</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="toc">
|
||||
<div class="step-icon">📑</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Table des matières</div>
|
||||
<div class="step-detail">Extraction de la structure via LLM</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="classify">
|
||||
<div class="step-icon">🏷️</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Classification sections</div>
|
||||
<div class="step-detail">Identification des types de contenu</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="chunking">
|
||||
<div class="step-icon">✂️</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Chunking sémantique</div>
|
||||
<div class="step-detail">Découpage intelligent du texte</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="cleaning">
|
||||
<div class="step-icon">🧹</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Nettoyage</div>
|
||||
<div class="step-detail">Correction des artefacts OCR</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="validation">
|
||||
<div class="step-icon">✓</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Validation</div>
|
||||
<div class="step-detail">Vérification de la qualité</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
<li class="step-item" data-step="weaviate">
|
||||
<div class="step-icon">🗄️</div>
|
||||
<div class="step-content">
|
||||
<div class="step-name">Ingestion Weaviate</div>
|
||||
<div class="step-detail">Vectorisation et stockage</div>
|
||||
</div>
|
||||
<div class="step-progress">—</div>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
<div class="error-message" id="error-message"></div>
|
||||
|
||||
<div class="progress-footer">
|
||||
<p class="caption" id="footer-message">Le traitement peut prendre quelques minutes selon la taille du document...</p>
|
||||
<a href="/upload" class="btn" id="back-btn" style="display: none;">← Retour</a>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<script>
|
||||
const jobId = {{ job_id|tojson }};
const steps = [
    { id: "ocr", weight: 20 },
    { id: "markdown", weight: 5 },
    { id: "metadata", weight: 10 },
    { id: "toc", weight: 15 },
    { id: "classify", weight: 10 },
    { id: "chunking", weight: 20 },
    { id: "cleaning", weight: 10 },
    { id: "validation", weight: 5 },
    { id: "weaviate", weight: 5 }
];

let currentStepIndex = -1;

function updateStep(stepId, status, detail = null) {
    const stepItem = document.querySelector(`[data-step="${stepId}"]`);
    if (!stepItem) return;

    const stepIndex = steps.findIndex(s => s.id === stepId);

    // Mark earlier steps as completed
    steps.forEach((s, i) => {
        if (i < stepIndex) {
            const item = document.querySelector(`[data-step="${s.id}"]`);
            if (item && !item.classList.contains('completed') && !item.classList.contains('error')) {
                item.classList.remove('active');
                item.classList.add('completed');
                item.querySelector('.step-icon').innerHTML = '✓';
                item.querySelector('.step-progress').textContent = '100%';
            }
        }
    });

    // Update the current step
    stepItem.classList.remove('active', 'completed', 'error');

    if (status === 'active') {
        stepItem.classList.add('active');
        stepItem.querySelector('.step-icon').innerHTML = '<div class="spinner"></div>';
        stepItem.querySelector('.step-progress').textContent = '...';
        currentStepIndex = stepIndex;
    } else if (status === 'completed') {
        stepItem.classList.add('completed');
        stepItem.querySelector('.step-icon').innerHTML = '✓';
        stepItem.querySelector('.step-progress').textContent = '100%';
    } else if (status === 'error') {
        stepItem.classList.add('error');
        stepItem.querySelector('.step-icon').innerHTML = '✗';
        stepItem.querySelector('.step-progress').textContent = '—';
    } else if (status === 'skipped') {
        stepItem.classList.add('completed');
        stepItem.querySelector('.step-icon').innerHTML = '⚡';
        stepItem.querySelector('.step-progress').textContent = 'skip';
        stepItem.querySelector('.step-detail').textContent = 'Réutilisation du cache';
    }

    if (detail) {
        stepItem.querySelector('.step-detail').textContent = detail;
    }

    // Refresh the overall progress bar
    updateProgressBar();
}

function updateProgressBar() {
    const totalWeight = steps.reduce((sum, s) => sum + s.weight, 0);
    // Derive the completed weight from the DOM so that steps completed implicitly
    // are counted and a repeated 'completed' event is not counted twice
    const completedWeight = steps.reduce((sum, s) => {
        const item = document.querySelector(`[data-step="${s.id}"]`);
        return item && item.classList.contains('completed') ? sum + s.weight : sum;
    }, 0);
    const percent = Math.round((completedWeight / totalWeight) * 100);
    document.getElementById('progress-bar').style.width = percent + '%';
    document.getElementById('progress-percent').textContent = percent + '%';
}

function showError(message) {
    document.getElementById('main-icon').classList.add('error');
    document.getElementById('main-icon').innerHTML = '✗';
    document.getElementById('main-title').textContent = 'Erreur de traitement';
    document.getElementById('error-message').textContent = message;
    document.getElementById('error-message').classList.add('visible');
    document.getElementById('footer-message').style.display = 'none';
    document.getElementById('back-btn').style.display = 'inline-block';
}

function showSuccess(redirectUrl) {
    document.getElementById('main-icon').classList.add('success');
    document.getElementById('main-icon').innerHTML = '✓';
    document.getElementById('main-title').textContent = 'Traitement terminé !';
    document.getElementById('progress-bar').style.width = '100%';
    document.getElementById('progress-percent').textContent = '100%';
    document.getElementById('footer-message').textContent = 'Redirection vers les résultats...';

    setTimeout(() => {
        window.location.href = redirectUrl;
    }, 1000);
}

// Open the SSE connection to receive progress updates
const eventSource = new EventSource('/upload/progress/' + jobId);

eventSource.onmessage = function(event) {
    const data = JSON.parse(event.data);

    if (data.type === 'step') {
        updateStep(data.step, data.status, data.detail);
    } else if (data.type === 'error') {
        showError(data.message);
        eventSource.close();
    } else if (data.type === 'complete') {
        showSuccess(data.redirect);
        eventSource.close();
    }
};

eventSource.onerror = function() {
    // Check whether processing has already finished
    fetch('/upload/status/' + jobId)
        .then(r => r.json())
        .then(data => {
            if (data.status === 'complete') {
                showSuccess(data.redirect);
            } else if (data.status === 'error') {
                showError(data.message);
            }
        })
        .catch(() => {
            showError('Connexion perdue avec le serveur');
        });
    eventSource.close();
};
|
||||
</script>
|
||||
{% endblock %}
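The script in this template parses each Server-Sent Event as JSON and dispatches on a `type` of `step`, `error`, or `complete`. As a rough illustration of what a compatible server side might emit, the sketch below formats such payloads as SSE frames. Only the field names (`type`, `step`, `status`, `detail`, `redirect`) come from the client code above; the helper names and the redirect URL are hypothetical and are not taken from this commit.

```python
import json
from typing import Any, Dict, Iterator, Optional


def sse_frame(payload: Dict[str, Any]) -> str:
    """Serialize one payload as an SSE frame: a 'data:' line followed by a blank line."""
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"


def step_event(step: str, status: str, detail: Optional[str] = None) -> Dict[str, Any]:
    """Build a payload of the shape the client passes to updateStep()."""
    return {"type": "step", "step": step, "status": status, "detail": detail}


def demo_stream() -> Iterator[str]:
    """Yield a short, made-up sequence of frames for one pipeline step."""
    yield sse_frame(step_event("ocr", "active"))
    yield sse_frame(step_event("ocr", "completed", "10 pages"))
    # Hypothetical redirect target, used here only for illustration
    yield sse_frame({"type": "complete", "redirect": "/upload/result/demo"})
```

In a Flask view, a generator like this could be returned as `Response(demo_stream(), mimetype="text/event-stream")`; how the actual application produces its events is not shown in this part of the diff.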
|
||||
|
||||
|
||||
297
generations/library_rag/templates/upload_result.html
Normal file
@@ -0,0 +1,297 @@
|
||||
{% extends "base.html" %}
|
||||
|
||||
{% block title %}Résultat - {{ result.document_name }}{% endblock %}
|
||||
|
||||
{% block content %}
|
||||
{# Lookup table mapping chunk types to display labels #}
|
||||
{% set chunk_types = {
|
||||
'main_content': {'label': 'Contenu principal', 'icon': '📄', 'desc': 'Paragraphe de contenu substantiel'},
|
||||
'exposition': {'label': 'Exposition', 'icon': '📖', 'desc': 'Présentation d\'idées ou de contexte'},
|
||||
'argument': {'label': 'Argument', 'icon': '💭', 'desc': 'Raisonnement ou argumentation'},
|
||||
'définition': {'label': 'Définition', 'icon': '📌', 'desc': 'Définition de concept ou terme'},
|
||||
'example': {'label': 'Exemple', 'icon': '💡', 'desc': 'Illustration ou cas pratique'},
|
||||
'citation': {'label': 'Citation', 'icon': '💬', 'desc': 'Citation d\'auteur ou référence'},
|
||||
'abstract': {'label': 'Résumé', 'icon': '📋', 'desc': 'Résumé ou synthèse'},
|
||||
'preface': {'label': 'Préface', 'icon': '✍️', 'desc': 'Préface, avant-propos ou avertissement'},
|
||||
'conclusion': {'label': 'Conclusion', 'icon': '🎯', 'desc': 'Conclusion d\'une argumentation'}
|
||||
} %}
|
||||
|
||||
<section class="section">
|
||||
<h1>✅ Traitement terminé</h1>
|
||||
<p class="lead">Le document <strong>{{ result.document_name }}</strong> a été analysé avec succès</p>
|
||||
|
||||
<div class="ornament">· · ·</div>
|
||||
|
<!-- Statistics -->
|
||||
<div class="stats-grid">
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.pages }}</div>
|
||||
<div class="stat-label">Pages</div>
|
||||
</div>
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.chunks_count or 0 }}</div>
|
||||
<div class="stat-label">Chunks</div>
|
||||
</div>
|
||||
{% if result.files.images %}
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ result.files.images|length }}</div>
|
||||
<div class="stat-label">Images</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
<div class="stat-box">
|
||||
<div class="stat-number">{{ "%.4f"|format(result.cost_total or result.cost or 0) }}€</div>
|
||||
<div class="stat-label">Coût Total</div>
|
||||
</div>
|
||||
</div>
|
||||
|
<!-- Cost breakdown when the Mistral API is used -->
|
||||
{% if result.llm_stats %}
|
||||
<div class="card mt-3">
|
||||
<h3>💰 Détail des coûts</h3>
|
||||
<div class="mt-2">
|
||||
<table style="width: 100%; border-collapse: collapse;">
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.5rem 0;"><strong>OCR Mistral</strong></td>
|
||||
<td style="padding: 0.5rem 0; text-align: right;">{{ "%.4f"|format(result.cost_ocr or 0) }}€</td>
|
||||
</tr>
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.5rem 0;"><strong>LLM Mistral API</strong></td>
|
||||
<td style="padding: 0.5rem 0; text-align: right;">{{ "%.4f"|format(result.cost_llm or 0) }}€</td>
|
||||
</tr>
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.5rem 0; color: var(--color-text-muted);">└ {{ result.llm_stats.calls_count }} appels</td>
|
||||
<td style="padding: 0.5rem 0; text-align: right; color: var(--color-text-muted);">
|
||||
{{ result.llm_stats.total_input_tokens + result.llm_stats.total_output_tokens }} tokens
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="padding: 0.5rem 0;"><strong>Total</strong></td>
|
||||
<td style="padding: 0.5rem 0; text-align: right; font-weight: bold; color: var(--color-accent);">
|
||||
{{ "%.4f"|format(result.cost_total or 0) }}€
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<hr class="divider">
|
||||
|
<!-- Document metadata -->
|
||||
<div class="card">
|
||||
<h3>📖 Informations du document</h3>
|
||||
<div class="mt-2">
|
||||
<table style="width: 100%; border-collapse: collapse;">
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0; width: 150px;"><strong>Œuvre</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<span class="badge badge-author">{{ result.metadata.work or result.document_name }}</span>
|
||||
</td>
|
||||
</tr>
|
||||
{% if result.metadata.title %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Titre</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.metadata.title }}</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.metadata.author %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Auteur</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<span class="badge badge-author">{{ result.metadata.author }}</span>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Pages</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.pages }}</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="padding: 0.75rem 0;"><strong>Chunks</strong></td>
|
||||
<td style="padding: 0.75rem 0;">{{ result.chunks_count or result.metadata.chunks_count or 0 }} segments de texte</td>
|
||||
</tr>
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
|
<!-- Table of contents -->
|
||||
{% if result.metadata.toc and result.metadata.toc|length > 0 %}
|
||||
<div class="card mt-3">
|
||||
<h3>📑 Table des matières</h3>
|
||||
<div class="mt-2">
|
||||
<ul style="list-style: none; padding-left: 0;">
|
||||
{% for item in result.metadata.toc[:20] %}
|
||||
<li style="padding: 0.4rem 0; padding-left: {{ (item.level - 1) * 1.5 }}rem; border-bottom: 1px solid rgba(125, 110, 88, 0.1);">
|
||||
{% if item.level == 1 %}
|
||||
<strong>{{ item.title }}</strong>
|
||||
{% else %}
|
||||
<span style="color: var(--color-accent-alt);">{{ item.title }}</span>
|
||||
{% endif %}
|
||||
</li>
|
||||
{% endfor %}
|
||||
{% if result.metadata.toc|length > 20 %}
|
||||
<li style="padding: 0.5rem 0; color: var(--color-accent);">
|
||||
<em>... et {{ result.metadata.toc|length - 20 }} autres sections</em>
|
||||
</li>
|
||||
{% endif %}
|
||||
</ul>
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<hr class="divider">
|
||||
|
<!-- Generated files -->
|
||||
<div class="card">
|
||||
<h3>📁 Fichiers générés</h3>
|
||||
<div class="mt-2">
|
||||
<table style="width: 100%; border-collapse: collapse;">
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Markdown</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}.md" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Chunks JSON</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_chunks.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% if result.files.structured %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Structure LLM</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_structured.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>OCR brut</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_ocr.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% if result.files.weaviate %}
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.75rem 0;"><strong>Weaviate JSON</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
<a href="/output/{{ result.document_name }}/{{ result.document_name }}_weaviate.json" target="_blank" class="btn btn-sm">
|
||||
Voir le fichier
|
||||
</a>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
{% if result.files.images %}
|
||||
<tr>
|
||||
<td style="padding: 0.75rem 0;"><strong>Images</strong></td>
|
||||
<td style="padding: 0.75rem 0;">
|
||||
{{ result.files.images|length }} image(s) dans <code>images/</code>
|
||||
</td>
|
||||
</tr>
|
||||
{% endif %}
|
||||
</table>
|
||||
</div>
|
||||
</div>
|
||||
|
<!-- Data inserted into Weaviate -->
|
||||
{% if result.weaviate_ingest %}
|
||||
<div class="card mt-3">
|
||||
<h3>🗄️ Données insérées dans Weaviate</h3>
|
||||
<div class="mt-2">
|
||||
{% if result.weaviate_ingest.success %}
|
||||
<div class="alert alert-success" style="background-color: rgba(85, 107, 99, 0.1); border: 1px solid rgba(85, 107, 99, 0.3); color: var(--color-accent-alt); padding: 1rem; border-radius: 8px; margin-bottom: 1rem;">
|
||||
<strong>✓ Ingestion réussie :</strong> {{ result.weaviate_ingest.count }} passages insérés dans la collection <code>Passage</code>
|
||||
</div>
|
||||
|
||||
<table style="width: 100%; border-collapse: collapse; margin-bottom: 1rem;">
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.5rem 0; width: 120px;"><strong>Œuvre</strong></td>
|
||||
<td style="padding: 0.5rem 0;"><span class="badge badge-author">{{ result.weaviate_ingest.work }}</span></td>
|
||||
</tr>
|
||||
<tr style="border-bottom: 1px solid rgba(125, 110, 88, 0.2);">
|
||||
<td style="padding: 0.5rem 0;"><strong>Auteur</strong></td>
|
||||
<td style="padding: 0.5rem 0;"><span class="badge badge-author">{{ result.weaviate_ingest.author }}</span></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="padding: 0.5rem 0;"><strong>Passages</strong></td>
|
||||
<td style="padding: 0.5rem 0;">{{ result.weaviate_ingest.count }} objets vectorisés</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<h4 style="font-size: 1rem; margin-top: 1.5rem; margin-bottom: 0.75rem;">Aperçu des passages insérés :</h4>
|
||||
|
||||
{% for passage in result.weaviate_ingest.inserted[:5] %}
|
||||
<div style="background: var(--color-bg-secondary); padding: 1rem; border-radius: 8px; margin-bottom: 0.75rem; border-left: 3px solid var(--color-accent);">
|
||||
<div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 0.5rem;">
|
||||
<div style="display: flex; gap: 0.5rem; align-items: center; flex-wrap: wrap;">
|
||||
<span style="font-size: 0.85rem; color: var(--color-accent);">📄 {{ passage.section }}</span>
|
||||
{% set type_info = chunk_types.get(passage.unitType, {'label': passage.unitType, 'icon': '📝', 'desc': 'Type de contenu'}) %}
|
||||
<span style="font-size: 0.75rem; padding: 0.2rem 0.5rem; border-radius: 4px; background: rgba(125, 110, 88, 0.15);" title="{{ type_info.desc }}">
|
||||
{{ type_info.icon }} {{ type_info.label }}
|
||||
</span>
|
||||
</div>
|
||||
<span style="font-size: 0.7rem; color: var(--color-text-muted);">{{ passage.chunk_id }}</span>
|
||||
</div>
|
||||
<div style="font-style: italic; color: var(--color-text-main); font-size: 0.9rem; line-height: 1.5;">
|
||||
"{{ passage.text_preview }}"
|
||||
</div>
|
||||
</div>
|
||||
{% endfor %}
|
||||
|
||||
{% if result.weaviate_ingest.count > 5 %}
|
||||
<p class="text-muted text-center" style="margin-top: 1rem;">
|
||||
<em>... et {{ result.weaviate_ingest.count - 5 }} autres passages</em>
|
||||
</p>
|
||||
{% endif %}
|
||||
|
||||
{% else %}
|
||||
<div class="alert alert-warning" style="background-color: rgba(125, 110, 88, 0.1); border: 1px solid rgba(125, 110, 88, 0.3); color: var(--color-accent); padding: 1rem; border-radius: 8px;">
|
||||
<strong>⚠️ Erreur d'ingestion :</strong> {{ result.weaviate_ingest.error }}
|
||||
</div>
|
||||
<p class="text-muted">Vérifiez que Weaviate est démarré (<code>docker compose up -d</code>) et que le schéma est initialisé (<code>python schema.py</code>).</p>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
<!-- Extracted images -->
|
||||
{% if result.files.images %}
|
||||
<div class="card mt-3">
|
||||
<h3>🖼️ Images extraites</h3>
|
||||
<div class="mt-2" style="display: grid; grid-template-columns: repeat(auto-fill, minmax(150px, 1fr)); gap: 1rem;">
|
||||
{% for img in result.files.images[:12] %}
|
||||
<div style="text-align: center;">
|
||||
<a href="/output/{{ result.document_name }}/images/{{ img.split('/')[-1].split('\\')[-1] }}" target="_blank">
|
||||
<img
|
||||
src="/output/{{ result.document_name }}/images/{{ img.split('/')[-1].split('\\')[-1] }}"
|
||||
alt="Image"
|
||||
style="max-width: 100%; max-height: 120px; border-radius: 8px; border: 1px solid rgba(125, 110, 88, 0.2);"
|
||||
>
|
||||
</a>
|
||||
<div class="caption">{{ img.split('/')[-1].split('\\')[-1] }}</div>
|
||||
</div>
|
||||
{% endfor %}
|
||||
{% if result.files.images|length > 12 %}
|
||||
<div style="display: flex; align-items: center; justify-content: center;">
|
||||
<span class="text-muted">+ {{ result.files.images|length - 12 }} autres</span>
|
||||
</div>
|
||||
{% endif %}
|
||||
</div>
|
||||
</div>
|
||||
{% endif %}
|
||||
|
||||
<div class="text-center mt-4">
|
||||
<a href="/upload" class="btn btn-primary">Analyser un autre PDF</a>
|
||||
<a href="/documents" class="btn" style="margin-left: 0.5rem;">Voir tous les documents</a>
|
||||
</div>
|
||||
</section>
|
||||
{% endblock %}
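The image gallery in this template derives each file name with `img.split('/')[-1].split('\\')[-1]`, which keeps the last path component whether the stored path uses POSIX or Windows separators. The same logic is sketched below in Python so it can be checked in isolation; the function name `image_basename` and the sample paths are hypothetical.

```python
def image_basename(path: str) -> str:
    """Return the last component of a path that may use '/' or '\\' separators."""
    # Split on forward slashes first, then on backslashes, as the template does
    return path.split("/")[-1].split("\\")[-1]


# Illustrative inputs only; real paths come from result.files.images
posix_name = image_basename("output/doc/images/img_1.png")
windows_name = image_basename("output\\doc\\images\\img_1.png")
```

Both calls yield the bare file name, so the generated `/output/<document>/images/<name>` URL does not depend on the platform that wrote the path.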
|
||||
1
generations/library_rag/tests/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""Test suite for Philosophia project."""
|
||||
1
generations/library_rag/tests/mcp/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""MCP server unit tests."""
|
||||
196
generations/library_rag/tests/mcp/conftest.py
Normal file
@@ -0,0 +1,196 @@
|
"""
Pytest fixtures for MCP server tests.

Provides common fixtures for mocking dependencies and test data.
"""

import os
from pathlib import Path
from typing import Any, Dict, Generator
from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from mcp_config import MCPConfig


@pytest.fixture
def mock_env_with_api_key() -> Generator[Dict[str, str], None, None]:
    """
    Provide environment with MISTRAL_API_KEY set.

    Yields:
        Dictionary of environment variables.
    """
    env = {"MISTRAL_API_KEY": "test-api-key-12345"}
    with patch.dict(os.environ, env, clear=True):
        yield env


@pytest.fixture
def valid_config() -> MCPConfig:
    """
    Provide a valid MCPConfig instance for testing.

    Returns:
        MCPConfig with valid test values.
    """
    return MCPConfig(
        mistral_api_key="test-api-key",
        ollama_base_url="http://localhost:11434",
        structure_llm_model="test-model",
        structure_llm_temperature=0.2,
        default_llm_provider="ollama",
        weaviate_host="localhost",
        weaviate_port=8080,
        log_level="INFO",
        output_dir=Path("test_output"),
    )


@pytest.fixture
def mock_weaviate_client() -> Generator[MagicMock, None, None]:
    """
    Provide a mocked Weaviate client.

    Yields:
        MagicMock configured as a Weaviate client.
    """
    with patch("weaviate.connect_to_local") as mock_connect:
        mock_client = MagicMock()
        mock_connect.return_value = mock_client
        yield mock_client
|
||||
|
||||
|
||||
# =============================================================================
# Parsing Tools Fixtures
# =============================================================================


@pytest.fixture
def sample_pdf_bytes() -> bytes:
    """
    Provide minimal valid PDF bytes for testing.

    Returns:
        Bytes representing a minimal valid PDF file.
    """
    # Minimal valid PDF structure; startxref is the byte offset of "xref" (186)
    return b"""%PDF-1.4
1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >> endobj
xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
trailer << /Size 4 /Root 1 0 R >>
startxref
186
%%EOF"""
|
||||
|
||||
|
||||
@pytest.fixture
def successful_pipeline_result() -> Dict[str, Any]:
    """
    Provide a successful pipeline result for mocking.

    Returns:
        Dictionary mimicking a successful process_pdf result.
    """
    return {
        "success": True,
        "document_name": "test-document",
        "source_id": "test-document",
        "pages": 10,
        "chunks_count": 25,
        "cost_ocr": 0.03,
        "cost_llm": 0.05,
        "cost_total": 0.08,
        "output_dir": Path("output/test-document"),
        "metadata": {
            "title": "Test Document Title",
            "author": "Test Author",
            "language": "en",
            "year": 2023,
        },
        "error": None,
    }


@pytest.fixture
def failed_pipeline_result() -> Dict[str, Any]:
    """
    Provide a failed pipeline result for mocking.

    Returns:
        Dictionary mimicking a failed process_pdf result.
    """
    return {
        "success": False,
        "document_name": "failed-document",
        "source_id": "failed-document",
        "pages": 0,
        "chunks_count": 0,
        "cost_ocr": 0.0,
        "cost_llm": 0.0,
        "cost_total": 0.0,
        "output_dir": "",
        "metadata": {},
        "error": "OCR processing failed: Invalid PDF structure",
    }


@pytest.fixture
def mock_process_pdf() -> Generator[MagicMock, None, None]:
    """
    Provide a mocked process_pdf function.

    Yields:
        MagicMock for utils.pdf_pipeline.process_pdf.
    """
    with patch("mcp_tools.parsing_tools.process_pdf") as mock:
        yield mock


@pytest.fixture
def mock_process_pdf_bytes() -> Generator[MagicMock, None, None]:
    """
    Provide a mocked process_pdf_bytes function.

    Yields:
        MagicMock for utils.pdf_pipeline.process_pdf_bytes.
    """
    with patch("mcp_tools.parsing_tools.process_pdf_bytes") as mock:
        yield mock


@pytest.fixture
def mock_download_pdf() -> Generator[AsyncMock, None, None]:
    """
    Provide a mocked download_pdf function.

    Yields:
        AsyncMock for mcp_tools.parsing_tools.download_pdf.
    """
    with patch("mcp_tools.parsing_tools.download_pdf", new_callable=AsyncMock) as mock:
        yield mock


@pytest.fixture
def temp_pdf_file(tmp_path: Path, sample_pdf_bytes: bytes) -> Path:
    """
    Create a temporary PDF file for testing.

    Args:
        tmp_path: Pytest tmp_path fixture.
        sample_pdf_bytes: Sample PDF content.

    Returns:
        Path to the temporary PDF file.
    """
    pdf_path = tmp_path / "test_document.pdf"
    pdf_path.write_bytes(sample_pdf_bytes)
    return pdf_path
|
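The fixtures above isolate tests from the real environment with `patch.dict(os.environ, ..., clear=True)`. The following sketch shows that pattern on its own, assuming a hypothetical `read_api_key` helper that stands in for configuration code; it is not the project's `MCPConfig.from_env()`, whose implementation is not part of this excerpt.

```python
import os
from unittest.mock import patch


def read_api_key() -> str:
    """Hypothetical stand-in for configuration code that requires the key."""
    key = os.environ.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError("MISTRAL_API_KEY is required")
    return key


def key_seen_under_patch(value: str) -> str:
    """Read the key while os.environ holds only the patched entry."""
    with patch.dict(os.environ, {"MISTRAL_API_KEY": value}, clear=True):
        return read_api_key()


def fails_without_key() -> bool:
    """Return True if reading the key fails in an emptied environment."""
    with patch.dict(os.environ, {}, clear=True):
        try:
            read_api_key()
        except ValueError:
            return True
    return False
```

Because `patch.dict` restores the original mapping when the `with` block exits, a test written this way leaves no trace in the variables seen by later tests.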
||||
133
generations/library_rag/tests/mcp/test_config.py
Normal file
@@ -0,0 +1,133 @@
|
"""
Unit tests for MCP configuration management.

Tests the MCPConfig class for proper loading, validation, and defaults.
"""

import os
import pytest
from pathlib import Path
from unittest.mock import patch

from mcp_config import MCPConfig


class TestMCPConfigFromEnv:
    """Test MCPConfig.from_env() method."""

    def test_loads_with_required_key(self) -> None:
        """Test config loads when MISTRAL_API_KEY is present."""
        with patch.dict(os.environ, {"MISTRAL_API_KEY": "test-key-123"}, clear=True):
            config = MCPConfig.from_env()
            assert config.mistral_api_key == "test-key-123"

    def test_raises_without_api_key(self) -> None:
        """Test ValueError is raised when MISTRAL_API_KEY is missing."""
        with patch("mcp_config.load_dotenv"):  # Prevent reading .env file
            with patch.dict(os.environ, {}, clear=True):
                with pytest.raises(ValueError) as exc_info:
                    MCPConfig.from_env()
                assert "MISTRAL_API_KEY" in str(exc_info.value)

    def test_default_values_applied(self) -> None:
        """Test all default values are applied correctly."""
        with patch.dict(os.environ, {"MISTRAL_API_KEY": "test-key"}, clear=True):
            config = MCPConfig.from_env()

            # Check all defaults
            assert config.ollama_base_url == "http://localhost:11434"
            assert config.structure_llm_model == "deepseek-r1:14b"
            assert config.structure_llm_temperature == 0.2
            assert config.default_llm_provider == "ollama"
            assert config.weaviate_host == "localhost"
            assert config.weaviate_port == 8080
            assert config.log_level == "INFO"
            assert config.output_dir == Path("output")

    def test_custom_values_loaded(self) -> None:
        """Test custom environment values are loaded correctly."""
        custom_env = {
            "MISTRAL_API_KEY": "custom-key",
            "OLLAMA_BASE_URL": "http://custom:1234",
            "STRUCTURE_LLM_MODEL": "custom-model",
            "STRUCTURE_LLM_TEMPERATURE": "0.7",
            "DEFAULT_LLM_PROVIDER": "mistral",
            "WEAVIATE_HOST": "weaviate.example.com",
            "WEAVIATE_PORT": "9999",
            "LOG_LEVEL": "DEBUG",
            "OUTPUT_DIR": "/custom/output",
        }
        with patch.dict(os.environ, custom_env, clear=True):
            config = MCPConfig.from_env()

            assert config.mistral_api_key == "custom-key"
            assert config.ollama_base_url == "http://custom:1234"
            assert config.structure_llm_model == "custom-model"
            assert config.structure_llm_temperature == 0.7
            assert config.default_llm_provider == "mistral"
            assert config.weaviate_host == "weaviate.example.com"
            assert config.weaviate_port == 9999
            assert config.log_level == "DEBUG"
            assert config.output_dir == Path("/custom/output")
|
||||
|
||||
|
||||
class TestMCPConfigValidation:
    """Test MCPConfig.validate() method."""

    def test_valid_config_passes(self) -> None:
        """Test valid configuration passes validation."""
        config = MCPConfig(
            mistral_api_key="test-key",
            default_llm_provider="ollama",
            log_level="INFO",
            structure_llm_temperature=0.5,
        )
        # Should not raise
        config.validate()

    def test_invalid_llm_provider_fails(self) -> None:
        """Test invalid LLM provider raises ValueError."""
        config = MCPConfig(
            mistral_api_key="test-key",
            default_llm_provider="invalid",  # type: ignore
        )
        with pytest.raises(ValueError) as exc_info:
            config.validate()
        assert "Invalid LLM provider" in str(exc_info.value)

    def test_invalid_log_level_fails(self) -> None:
        """Test invalid log level raises ValueError."""
        config = MCPConfig(
            mistral_api_key="test-key",
            log_level="INVALID",
        )
        with pytest.raises(ValueError) as exc_info:
            config.validate()
        assert "Invalid log level" in str(exc_info.value)

    def test_invalid_temperature_fails(self) -> None:
        """Test temperature outside 0-2 range raises ValueError."""
        config = MCPConfig(
            mistral_api_key="test-key",
            structure_llm_temperature=2.5,
        )
        with pytest.raises(ValueError) as exc_info:
            config.validate()
        assert "Invalid temperature" in str(exc_info.value)


class TestMCPConfigProperties:
    """Test MCPConfig properties."""

    def test_weaviate_url_property(self) -> None:
        """Test weaviate_url property returns correct URL."""
        config = MCPConfig(
            mistral_api_key="test-key",
            weaviate_host="my-host",
            weaviate_port=9090,
        )
        assert config.weaviate_url == "http://my-host:9090"


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
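The configuration tests above exercise three behaviors: a required MISTRAL_API_KEY, defaults for the optional settings, and a temperature check over the 0-2 range. The sketch below shows one way such a loader could look. ExampleConfig is a hypothetical stand-in written for illustration, not the project's MCPConfig; its field subset, the optional `env` argument, and the error messages are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for an env-driven config (not the real MCPConfig)."""

    mistral_api_key: str
    weaviate_host: str = "localhost"
    weaviate_port: int = 8080
    structure_llm_temperature: float = 0.2

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "ExampleConfig":
        # Accepting a mapping (an assumption) lets callers pass a dict
        # instead of patching os.environ.
        source = os.environ if env is None else env
        api_key = source.get("MISTRAL_API_KEY")
        if not api_key:
            raise ValueError("MISTRAL_API_KEY is required")
        return cls(
            mistral_api_key=api_key,
            weaviate_host=source.get("WEAVIATE_HOST", "localhost"),
            # Environment values are strings, so numeric fields are converted.
            weaviate_port=int(source.get("WEAVIATE_PORT", "8080")),
            structure_llm_temperature=float(
                source.get("STRUCTURE_LLM_TEMPERATURE", "0.2")
            ),
        )

    def validate(self) -> None:
        # Mirrors the 0-2 temperature range checked by the tests above.
        if not 0.0 <= self.structure_llm_temperature <= 2.0:
            raise ValueError(
                f"Invalid temperature: {self.structure_llm_temperature}"
            )

    @property
    def weaviate_url(self) -> str:
        return f"http://{self.weaviate_host}:{self.weaviate_port}"
```

Passing a plain dict to `from_env` keeps such a loader testable without `patch.dict`; the tests in this commit patch `os.environ` instead, which works equally well.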
673
generations/library_rag/tests/mcp/test_parsing_tools.py
Normal file
@@ -0,0 +1,673 @@
"""
Unit tests for MCP parsing tools.

Tests the parse_pdf tool handler with mocked dependencies to ensure:
- Local file processing works correctly
- URL-based PDF downloads work correctly
- Error handling is comprehensive
- Fixed parameters are used correctly
- Cost tracking is accurate

Uses asyncio for async test support.
"""

import asyncio
from pathlib import Path
from typing import Any, Dict
from unittest.mock import AsyncMock, MagicMock, patch

import httpx
import pytest

from mcp_tools.parsing_tools import (
    FIXED_LLM_MODEL,
    FIXED_LLM_PROVIDER,
    FIXED_USE_LLM,
    FIXED_USE_OCR_ANNOTATIONS,
    FIXED_USE_SEMANTIC_CHUNKING,
    download_pdf,
    extract_filename_from_url,
    is_url,
    parse_pdf_handler,
)
from mcp_tools.schemas import ParsePdfInput, ParsePdfOutput


# =============================================================================
# Test is_url Helper Function
# =============================================================================


class TestIsUrl:
    """Tests for the is_url helper function."""

    def test_https_url(self) -> None:
        """Test that HTTPS URLs are recognized."""
        assert is_url("https://example.com/document.pdf") is True

    def test_http_url(self) -> None:
        """Test that HTTP URLs are recognized."""
        assert is_url("http://example.com/document.pdf") is True

    def test_local_path_unix(self) -> None:
        """Test that Unix local paths are not recognized as URLs."""
        assert is_url("/path/to/document.pdf") is False

    def test_local_path_windows(self) -> None:
        """Test that Windows local paths are not recognized as URLs."""
        assert is_url("C:\\Documents\\document.pdf") is False

    def test_relative_path(self) -> None:
        """Test that relative paths are not recognized as URLs."""
        assert is_url("./documents/document.pdf") is False

    def test_ftp_url_not_supported(self) -> None:
        """Test that FTP URLs are not recognized (only HTTP/HTTPS supported)."""
        assert is_url("ftp://example.com/document.pdf") is False

    def test_empty_string(self) -> None:
        """Test that empty strings are not recognized as URLs."""
        assert is_url("") is False


# =============================================================================
# Test extract_filename_from_url Helper Function
# =============================================================================


class TestExtractFilenameFromUrl:
    """Tests for the extract_filename_from_url helper function."""

    def test_url_with_pdf_filename(self) -> None:
        """Test extraction when URL has a .pdf filename."""
        result = extract_filename_from_url("https://example.com/docs/aristotle.pdf")
        assert result == "aristotle.pdf"

    def test_url_with_filename_no_extension(self) -> None:
        """Test extraction when URL has a filename without extension."""
        result = extract_filename_from_url("https://example.com/docs/aristotle")
        assert result == "aristotle.pdf"

    def test_url_without_path(self) -> None:
        """Test extraction when URL has no path."""
        result = extract_filename_from_url("https://example.com/")
        assert result == "downloaded.pdf"

    def test_url_with_api_endpoint(self) -> None:
        """Test extraction when URL is an API endpoint."""
        result = extract_filename_from_url("https://api.example.com/download")
        assert result == "download.pdf"

    def test_url_with_query_params(self) -> None:
        """Test extraction when URL has query parameters."""
        result = extract_filename_from_url(
            "https://example.com/docs/kant.pdf?token=abc"
        )
        assert result == "kant.pdf"


# =============================================================================
# Test download_pdf Function
# =============================================================================


class TestDownloadPdf:
    """Tests for the download_pdf async function."""

    def test_successful_download(self) -> None:
        """Test successful PDF download from URL."""

        async def run_test() -> None:
            mock_response = MagicMock()
            mock_response.content = b"%PDF-1.4 test content"
            mock_response.headers = {"content-type": "application/pdf"}
            mock_response.raise_for_status = MagicMock()

            with patch(
                "mcp_tools.parsing_tools.httpx.AsyncClient"
            ) as mock_client_class:
                mock_client = AsyncMock()
                mock_client.get = AsyncMock(return_value=mock_response)
                mock_client_class.return_value.__aenter__ = AsyncMock(
                    return_value=mock_client
                )
                mock_client_class.return_value.__aexit__ = AsyncMock(return_value=None)

                result = await download_pdf("https://example.com/document.pdf")

                assert result == b"%PDF-1.4 test content"
                mock_client.get.assert_called_once_with(
                    "https://example.com/document.pdf"
                )

        asyncio.run(run_test())

    def test_download_with_non_pdf_content_type(self) -> None:
        """Test download proceeds with warning when content-type is not PDF."""

        async def run_test() -> None:
            mock_response = MagicMock()
            mock_response.content = b"%PDF-1.4 test content"
            mock_response.headers = {"content-type": "application/octet-stream"}
            mock_response.raise_for_status = MagicMock()

            with patch(
                "mcp_tools.parsing_tools.httpx.AsyncClient"
            ) as mock_client_class:
                mock_client = AsyncMock()
                mock_client.get = AsyncMock(return_value=mock_response)
                mock_client_class.return_value.__aenter__ = AsyncMock(
                    return_value=mock_client
                )
                mock_client_class.return_value.__aexit__ = AsyncMock(return_value=None)

                # Should still succeed, just logs a warning
                result = await download_pdf("https://example.com/document.pdf")
                assert result == b"%PDF-1.4 test content"

        asyncio.run(run_test())

    def test_download_http_error(self) -> None:
        """Test that HTTP errors are propagated."""

        async def run_test() -> None:
            with patch(
                "mcp_tools.parsing_tools.httpx.AsyncClient"
            ) as mock_client_class:
                mock_client = AsyncMock()
                mock_client.get = AsyncMock(
                    side_effect=httpx.HTTPStatusError(
                        "Not Found",
                        request=MagicMock(),
                        response=MagicMock(status_code=404),
                    )
                )
                mock_client_class.return_value.__aenter__ = AsyncMock(
                    return_value=mock_client
                )
                mock_client_class.return_value.__aexit__ = AsyncMock(return_value=None)

                with pytest.raises(httpx.HTTPStatusError):
                    await download_pdf("https://example.com/nonexistent.pdf")

        asyncio.run(run_test())


# =============================================================================
# Test parse_pdf_handler - Local Files
# =============================================================================


class TestParsePdfHandlerLocalFile:
    """Tests for parse_pdf_handler with local file paths."""

    def test_successful_local_file_processing(
        self,
        temp_pdf_file: Path,
        successful_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test successful processing of a local PDF file."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = successful_pipeline_result

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.success is True
                assert result.document_name == "test-document"
                assert result.pages == 10
                assert result.chunks_count == 25
                assert result.cost_ocr == 0.03
                assert result.cost_llm == 0.05
                assert result.cost_total == 0.08
                assert result.metadata["title"] == "Test Document Title"
                assert result.error is None

        asyncio.run(run_test())

    def test_local_file_uses_fixed_parameters(
        self,
        temp_pdf_file: Path,
        successful_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test that local file processing uses the fixed optimal parameters."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = successful_pipeline_result

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                await parse_pdf_handler(input_data)

                # Verify fixed parameters are passed
                mock_process_pdf.assert_called_once()
                call_kwargs = mock_process_pdf.call_args.kwargs

                assert call_kwargs["use_llm"] == FIXED_USE_LLM
                assert call_kwargs["llm_provider"] == FIXED_LLM_PROVIDER
                assert call_kwargs["llm_model"] == FIXED_LLM_MODEL
                assert call_kwargs["use_semantic_chunking"] == FIXED_USE_SEMANTIC_CHUNKING
                assert call_kwargs["use_ocr_annotations"] == FIXED_USE_OCR_ANNOTATIONS

        asyncio.run(run_test())

    def test_file_not_found_error(self) -> None:
        """Test error handling when local file does not exist."""

        async def run_test() -> None:
            input_data = ParsePdfInput(pdf_path="/nonexistent/path/document.pdf")
            result = await parse_pdf_handler(input_data)

            assert result.success is False
            assert "not found" in result.error.lower()
            assert result.pages == 0
            assert result.chunks_count == 0

        asyncio.run(run_test())

    def test_pipeline_failure(
        self,
        temp_pdf_file: Path,
        failed_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test handling when the pipeline returns a failure."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = failed_pipeline_result

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.success is False
                assert "OCR processing failed" in result.error
                assert result.pages == 0

        asyncio.run(run_test())

    def test_pipeline_exception(
        self,
        temp_pdf_file: Path,
    ) -> None:
        """Test handling when the pipeline raises an exception."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.side_effect = RuntimeError("Unexpected error")

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.success is False
                assert "Processing error" in result.error
                assert "Unexpected error" in result.error

        asyncio.run(run_test())


# =============================================================================
# Test parse_pdf_handler - URL Downloads
# =============================================================================


class TestParsePdfHandlerUrl:
    """Tests for parse_pdf_handler with URL inputs."""

    def test_successful_url_processing(
        self,
        sample_pdf_bytes: bytes,
        successful_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test successful processing of a PDF from URL."""

        async def run_test() -> None:
            with patch(
                "mcp_tools.parsing_tools.download_pdf", new_callable=AsyncMock
            ) as mock_download:
                with patch(
                    "mcp_tools.parsing_tools.process_pdf_bytes"
                ) as mock_process:
                    mock_download.return_value = sample_pdf_bytes
                    mock_process.return_value = successful_pipeline_result

                    input_data = ParsePdfInput(
                        pdf_path="https://example.com/philosophy/kant.pdf"
                    )
                    result = await parse_pdf_handler(input_data)

                    assert result.success is True
                    assert result.document_name == "test-document"
                    mock_download.assert_called_once_with(
                        "https://example.com/philosophy/kant.pdf"
                    )

        asyncio.run(run_test())

    def test_url_uses_extracted_filename(
        self,
        sample_pdf_bytes: bytes,
        successful_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test that filename is extracted from URL for processing."""

        async def run_test() -> None:
            with patch(
                "mcp_tools.parsing_tools.download_pdf", new_callable=AsyncMock
            ) as mock_download:
                with patch(
                    "mcp_tools.parsing_tools.process_pdf_bytes"
                ) as mock_process:
                    mock_download.return_value = sample_pdf_bytes
                    mock_process.return_value = successful_pipeline_result

                    input_data = ParsePdfInput(
                        pdf_path="https://example.com/docs/aristotle-metaphysics.pdf"
                    )
                    await parse_pdf_handler(input_data)

                    # Verify filename was extracted and passed
                    mock_process.assert_called_once()
                    call_kwargs = mock_process.call_args.kwargs
                    assert call_kwargs["filename"] == "aristotle-metaphysics.pdf"

        asyncio.run(run_test())

    def test_url_uses_fixed_parameters(
        self,
        sample_pdf_bytes: bytes,
        successful_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test that URL processing uses the fixed optimal parameters."""

        async def run_test() -> None:
            with patch(
                "mcp_tools.parsing_tools.download_pdf", new_callable=AsyncMock
            ) as mock_download:
                with patch(
                    "mcp_tools.parsing_tools.process_pdf_bytes"
                ) as mock_process:
                    mock_download.return_value = sample_pdf_bytes
                    mock_process.return_value = successful_pipeline_result

                    input_data = ParsePdfInput(
                        pdf_path="https://example.com/document.pdf"
                    )
                    await parse_pdf_handler(input_data)

                    call_kwargs = mock_process.call_args.kwargs
                    assert call_kwargs["llm_provider"] == FIXED_LLM_PROVIDER
                    assert call_kwargs["llm_model"] == FIXED_LLM_MODEL
                    assert (
                        call_kwargs["use_semantic_chunking"]
                        == FIXED_USE_SEMANTIC_CHUNKING
                    )
                    assert (
                        call_kwargs["use_ocr_annotations"] == FIXED_USE_OCR_ANNOTATIONS
                    )

        asyncio.run(run_test())

    def test_url_download_http_error(self) -> None:
        """Test error handling when URL download fails with HTTP error."""

        async def run_test() -> None:
            with patch(
                "mcp_tools.parsing_tools.download_pdf", new_callable=AsyncMock
            ) as mock_download:
                mock_download.side_effect = httpx.HTTPStatusError(
                    "Not Found",
                    request=MagicMock(),
                    response=MagicMock(status_code=404),
                )

                input_data = ParsePdfInput(
                    pdf_path="https://example.com/nonexistent.pdf"
                )
                result = await parse_pdf_handler(input_data)

                assert result.success is False
                assert "Failed to download PDF" in result.error

        asyncio.run(run_test())

    def test_url_download_network_error(self) -> None:
        """Test error handling when URL download fails with network error."""

        async def run_test() -> None:
            with patch(
                "mcp_tools.parsing_tools.download_pdf", new_callable=AsyncMock
            ) as mock_download:
                mock_download.side_effect = httpx.ConnectError("Connection refused")

                input_data = ParsePdfInput(
                    pdf_path="https://example.com/document.pdf"
                )
                result = await parse_pdf_handler(input_data)

                assert result.success is False
                assert "Failed to download PDF" in result.error

        asyncio.run(run_test())


# =============================================================================
# Test Cost Tracking
# =============================================================================


class TestCostTracking:
    """Tests for cost tracking in parse_pdf output."""

    def test_costs_are_tracked_correctly(
        self,
        temp_pdf_file: Path,
    ) -> None:
        """Test that OCR and LLM costs are correctly tracked."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = {
                    "success": True,
                    "document_name": "test-doc",
                    "source_id": "test-doc",
                    "pages": 50,
                    "chunks_count": 100,
                    "cost_ocr": 0.15,  # 50 pages * 0.003€
                    "cost_llm": 0.25,
                    "cost_total": 0.40,
                    "output_dir": Path("output/test-doc"),
                    "metadata": {},
                    "error": None,
                }

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.cost_ocr == 0.15
                assert result.cost_llm == 0.25
                assert result.cost_total == 0.40

        asyncio.run(run_test())

    def test_cost_total_calculated_when_missing(
        self,
        temp_pdf_file: Path,
    ) -> None:
        """Test that cost_total is calculated if not provided."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = {
                    "success": True,
                    "document_name": "test-doc",
                    "source_id": "test-doc",
                    "pages": 10,
                    "chunks_count": 20,
                    "cost_ocr": 0.03,
                    "cost_llm": 0.05,
                    # cost_total intentionally missing
                    "output_dir": Path("output/test-doc"),
                    "metadata": {},
                    "error": None,
                }

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                # The total is a computed float sum, so compare with a tolerance
                assert result.cost_total == pytest.approx(0.08)  # 0.03 + 0.05

        asyncio.run(run_test())

    def test_zero_costs_on_failure(
        self,
        temp_pdf_file: Path,
    ) -> None:
        """Test that costs are zero when processing fails early."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.side_effect = RuntimeError("Early failure")

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.success is False
                assert result.cost_ocr == 0.0
                assert result.cost_llm == 0.0
                assert result.cost_total == 0.0

        asyncio.run(run_test())


# =============================================================================
# Test Metadata Handling
# =============================================================================


class TestMetadataHandling:
    """Tests for metadata extraction and handling."""

    def test_metadata_extracted_correctly(
        self,
        temp_pdf_file: Path,
    ) -> None:
        """Test that metadata is correctly passed through."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = {
                    "success": True,
                    "document_name": "platon-menon",
                    "source_id": "platon-menon",
                    "pages": 80,
                    "chunks_count": 150,
                    "cost_ocr": 0.24,
                    "cost_llm": 0.30,
                    "cost_total": 0.54,
                    "output_dir": Path("output/platon-menon"),
                    "metadata": {
                        "title": "Ménon",
                        "author": "Platon",
                        "language": "fr",
                        "year": -380,
                        "genre": "dialogue",
                    },
                    "error": None,
                }

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.metadata["title"] == "Ménon"
                assert result.metadata["author"] == "Platon"
                assert result.metadata["language"] == "fr"
                assert result.metadata["year"] == -380
                assert result.metadata["genre"] == "dialogue"

        asyncio.run(run_test())

    def test_empty_metadata_handled(
        self,
        temp_pdf_file: Path,
    ) -> None:
        """Test that empty/None metadata is handled gracefully."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = {
                    "success": True,
                    "document_name": "test-doc",
                    "source_id": "test-doc",
                    "pages": 10,
                    "chunks_count": 20,
                    "cost_ocr": 0.03,
                    "cost_llm": 0.05,
                    "cost_total": 0.08,
                    "output_dir": Path("output/test-doc"),
                    "metadata": None,  # Explicitly None
                    "error": None,
                }

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                assert result.metadata == {}

        asyncio.run(run_test())


# =============================================================================
# Test Output Schema Validation
# =============================================================================


class TestOutputSchemaValidation:
    """Tests for ParsePdfOutput schema compliance."""

    def test_output_is_valid_schema(
        self,
        temp_pdf_file: Path,
        successful_pipeline_result: Dict[str, Any],
    ) -> None:
        """Test that output conforms to ParsePdfOutput schema."""

        async def run_test() -> None:
            with patch("mcp_tools.parsing_tools.process_pdf") as mock_process_pdf:
                mock_process_pdf.return_value = successful_pipeline_result

                input_data = ParsePdfInput(pdf_path=str(temp_pdf_file))
                result = await parse_pdf_handler(input_data)

                # Verify it's the correct type
                assert isinstance(result, ParsePdfOutput)

                # Verify all required fields are present
                assert hasattr(result, "success")
                assert hasattr(result, "document_name")
                assert hasattr(result, "source_id")
                assert hasattr(result, "pages")
                assert hasattr(result, "chunks_count")
                assert hasattr(result, "cost_ocr")
                assert hasattr(result, "cost_llm")
                assert hasattr(result, "cost_total")
                assert hasattr(result, "output_dir")
                assert hasattr(result, "metadata")
                assert hasattr(result, "error")

        asyncio.run(run_test())

    def test_error_output_is_valid_schema(self) -> None:
        """Test that error output conforms to ParsePdfOutput schema."""

        async def run_test() -> None:
            input_data = ParsePdfInput(pdf_path="/nonexistent/file.pdf")
            result = await parse_pdf_handler(input_data)

            assert isinstance(result, ParsePdfOutput)
            assert result.success is False
            assert result.error is not None
            assert isinstance(result.error, str)

        asyncio.run(run_test())
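The TestIsUrl and TestExtractFilenameFromUrl cases above pin down the expected behavior of the two URL helpers fairly precisely. As a reading aid, here is a minimal sketch of helpers that would satisfy those cases. The names is_url_sketch and filename_from_url_sketch are hypothetical; the actual implementations in mcp_tools.parsing_tools are not shown in this diff and may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_url_sketch(path: str) -> bool:
    """Return True only for http:// and https:// inputs (hypothetical helper)."""
    # A prefix check keeps Windows paths such as C:\Documents\x.pdf,
    # ftp:// links, and empty strings out of the URL branch.
    return path.startswith(("http://", "https://"))


def filename_from_url_sketch(url: str) -> str:
    """Derive a .pdf filename from a URL (hypothetical helper)."""
    # urlparse separates the query string, so "?token=abc" never
    # ends up in the filename.
    name = PurePosixPath(urlparse(url).path).name
    if not name:
        # No usable path segment: fall back to the default the tests expect.
        return "downloaded.pdf"
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return name
```

The fallback name "downloaded.pdf" and the rule of appending ".pdf" to extensionless names are taken from the test expectations above, not from the helper source.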
1336
generations/library_rag/tests/mcp/test_retrieval_tools.py
Normal file
File diff suppressed because it is too large
256
generations/library_rag/tests/mcp/test_schemas.py
Normal file
@@ -0,0 +1,256 @@
"""
Unit tests for MCP Pydantic schemas.

Tests schema validation, field constraints, and JSON schema generation.
"""

import pytest
from pydantic import ValidationError

from mcp_tools.schemas import (
    ParsePdfInput,
    ParsePdfOutput,
    SearchChunksInput,
    SearchChunksOutput,
    SearchSummariesInput,
    GetDocumentInput,
    ListDocumentsInput,
    GetChunksByDocumentInput,
    FilterByAuthorInput,
    DeleteDocumentInput,
    ChunkResult,
    DocumentInfo,
)


class TestParsePdfInput:
    """Test ParsePdfInput schema validation."""

    def test_valid_path(self) -> None:
        """Test valid PDF path is accepted."""
        input_data = ParsePdfInput(pdf_path="/path/to/document.pdf")
        assert input_data.pdf_path == "/path/to/document.pdf"

    def test_valid_url(self) -> None:
        """Test valid URL is accepted."""
        input_data = ParsePdfInput(pdf_path="https://example.com/doc.pdf")
        assert input_data.pdf_path == "https://example.com/doc.pdf"

    def test_empty_path_rejected(self) -> None:
        """Test empty path raises validation error."""
        with pytest.raises(ValidationError) as exc_info:
            ParsePdfInput(pdf_path="")
        assert "string_too_short" in str(exc_info.value).lower()


class TestParsePdfOutput:
    """Test ParsePdfOutput schema."""

    def test_full_output(self) -> None:
        """Test creating complete output."""
        output = ParsePdfOutput(
            success=True,
            document_name="test-doc",
            source_id="test-doc-v1",
            pages=10,
            chunks_count=25,
            cost_ocr=0.03,
            cost_llm=0.01,
            cost_total=0.04,
            output_dir="/output/test-doc",
            metadata={"title": "Test", "author": "Unknown"},
        )
        assert output.success is True
        assert output.cost_total == 0.04
        assert output.metadata["title"] == "Test"

    def test_output_with_error(self) -> None:
        """Test output with error field set."""
        output = ParsePdfOutput(
            success=False,
            document_name="failed-doc",
            source_id="",
            pages=0,
            chunks_count=0,
            cost_ocr=0.0,
            cost_llm=0.0,
            cost_total=0.0,
            output_dir="",
            error="PDF processing failed: corrupted file",
        )
        assert output.success is False
        assert "corrupted" in output.error  # type: ignore


class TestSearchChunksInput:
    """Test SearchChunksInput schema validation."""

    def test_minimal_input(self) -> None:
        """Test minimal valid input."""
        input_data = SearchChunksInput(query="test query")
        assert input_data.query == "test query"
        assert input_data.limit == 10  # default
        assert input_data.min_similarity == 0.0  # default

    def test_full_input(self) -> None:
        """Test input with all fields."""
        input_data = SearchChunksInput(
            query="What is justice?",
            limit=20,
            min_similarity=0.5,
            author_filter="Platon",
            work_filter="Republic",
            language_filter="fr",
        )
        assert input_data.limit == 20
        assert input_data.author_filter == "Platon"

    def test_empty_query_rejected(self) -> None:
        """Test empty query raises error."""
        with pytest.raises(ValidationError):
            SearchChunksInput(query="")

    def test_query_too_long_rejected(self) -> None:
        """Test query over 1000 chars is rejected."""
        with pytest.raises(ValidationError):
            SearchChunksInput(query="a" * 1001)

    def test_limit_bounds(self) -> None:
        """Test limit validation bounds."""
        with pytest.raises(ValidationError):
            SearchChunksInput(query="test", limit=0)
        with pytest.raises(ValidationError):
            SearchChunksInput(query="test", limit=101)

    def test_similarity_bounds(self) -> None:
        """Test similarity validation bounds."""
        with pytest.raises(ValidationError):
            SearchChunksInput(query="test", min_similarity=-0.1)
        with pytest.raises(ValidationError):
            SearchChunksInput(query="test", min_similarity=1.1)


class TestSearchSummariesInput:
    """Test SearchSummariesInput schema validation."""

    def test_level_filters(self) -> None:
        """Test min/max level filters."""
        input_data = SearchSummariesInput(
            query="test",
            min_level=1,
            max_level=3,
        )
        assert input_data.min_level == 1
        assert input_data.max_level == 3

    def test_level_bounds(self) -> None:
        """Test level validation bounds."""
        with pytest.raises(ValidationError):
            SearchSummariesInput(query="test", min_level=0)
        with pytest.raises(ValidationError):
            SearchSummariesInput(query="test", max_level=6)


class TestGetDocumentInput:
    """Test GetDocumentInput schema validation."""

    def test_defaults(self) -> None:
        """Test default values."""
        input_data = GetDocumentInput(source_id="doc-123")
        assert input_data.include_chunks is False
        assert input_data.chunk_limit == 50

    def test_with_chunks(self) -> None:
        """Test requesting chunks."""
        input_data = GetDocumentInput(
            source_id="doc-123",
            include_chunks=True,
            chunk_limit=100,
        )
        assert input_data.include_chunks is True
        assert input_data.chunk_limit == 100


class TestDeleteDocumentInput:
    """Test DeleteDocumentInput schema validation."""

    def test_requires_confirmation(self) -> None:
        """Test confirm defaults to False."""
        input_data = DeleteDocumentInput(source_id="doc-to-delete")
        assert input_data.confirm is False

    def test_with_confirmation(self) -> None:
        """Test explicit confirmation."""
        input_data = DeleteDocumentInput(
            source_id="doc-to-delete",
            confirm=True,
        )
        assert input_data.confirm is True


class TestChunkResult:
|
||||
"""Test ChunkResult model."""
|
||||
|
||||
def test_full_chunk(self) -> None:
|
||||
"""Test creating full chunk result."""
|
||||
chunk = ChunkResult(
|
||||
text="This is the chunk content.",
|
||||
similarity=0.85,
|
||||
section_path="Chapter 1 > Section 1",
|
||||
chapter_title="Introduction",
|
||||
work_title="The Republic",
|
||||
work_author="Platon",
|
||||
order_index=5,
|
||||
)
|
||||
assert chunk.similarity == 0.85
|
||||
assert chunk.order_index == 5
|
||||
|
||||
|
||||
class TestDocumentInfo:
|
||||
"""Test DocumentInfo model."""
|
||||
|
||||
def test_with_optional_fields(self) -> None:
|
||||
"""Test DocumentInfo with all fields."""
|
||||
doc = DocumentInfo(
|
||||
source_id="platon-republic",
|
||||
work_title="The Republic",
|
||||
work_author="Platon",
|
||||
edition="GF Flammarion",
|
||||
pages=500,
|
||||
language="fr",
|
||||
toc={"chapters": ["I", "II", "III"]},
|
||||
hierarchy={"level": 1},
|
||||
)
|
||||
assert doc.toc is not None
|
||||
assert doc.hierarchy is not None
|
||||
|
||||
|
||||
class TestJsonSchemaGeneration:
|
||||
"""Test JSON schema generation from Pydantic models."""
|
||||
|
||||
def test_schemas_have_descriptions(self) -> None:
|
||||
"""Test all fields have descriptions for JSON schema."""
|
||||
schema = SearchChunksInput.model_json_schema()
|
||||
|
||||
# Check field descriptions exist
|
||||
properties = schema["properties"]
|
||||
assert "description" in properties["query"]
|
||||
assert "description" in properties["limit"]
|
||||
assert "description" in properties["min_similarity"]
|
||||
|
||||
def test_schema_includes_constraints(self) -> None:
|
||||
"""Test validation constraints are in JSON schema."""
|
||||
schema = SearchChunksInput.model_json_schema()
|
||||
props = schema["properties"]
|
||||
|
||||
# Check minLength constraint
|
||||
assert props["query"].get("minLength") == 1
|
||||
assert props["query"].get("maxLength") == 1000
|
||||
|
||||
# Check numeric constraints
|
||||
assert props["limit"].get("minimum") == 1
|
||||
assert props["limit"].get("maximum") == 100
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
pytest.main([__file__, "-v"])
|
||||
429
generations/library_rag/tests/utils/test_toc_enricher.py
Normal file
@@ -0,0 +1,429 @@
|
||||
"""Unit tests for TOC enrichment module.
|
||||
|
||||
Tests the enrichment of chunk metadata with hierarchical information
|
||||
from the table of contents (TOC).
|
||||
"""
|
||||
|
||||
from typing import Any, Dict, List
|
||||
|
||||
import pytest
|
||||
|
||||
from utils.toc_enricher import (
|
||||
enrich_chunks_with_toc,
|
||||
extract_paragraph_number,
|
||||
find_matching_toc_entry,
|
||||
flatten_toc_with_paths,
|
||||
)
|
||||
from utils.types import FlatTOCEntryEnriched
|
||||
|
||||
|
||||
class TestFlattenTocWithPaths:
|
||||
"""Tests for flatten_toc_with_paths function."""
|
||||
|
||||
def test_flatten_simple_toc(self) -> None:
|
||||
"""Test flattening a simple two-level TOC."""
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Chapter 1",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{"title": "Section 1.1", "level": 2, "children": []},
|
||||
{"title": "Section 1.2", "level": 2, "children": []},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
flat_toc = flatten_toc_with_paths(toc, {})
|
||||
|
||||
assert len(flat_toc) == 3
|
||||
assert flat_toc[0]["title"] == "Chapter 1"
|
||||
assert flat_toc[0]["level"] == 1
|
||||
assert flat_toc[0]["full_path"] == "Chapter 1"
|
||||
assert flat_toc[1]["title"] == "Section 1.1"
|
||||
assert flat_toc[1]["full_path"] == "Chapter 1 > Section 1.1"
|
||||
assert flat_toc[1]["chapter_title"] == "Chapter 1"
|
||||
|
||||
def test_flatten_peirce_toc_with_cp_references(self) -> None:
|
||||
"""Test flattening Peirce TOC with CP references."""
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Peirce: CP 1.628",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "628. It is the instincts...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
flat_toc = flatten_toc_with_paths(toc, {})
|
||||
|
||||
assert len(flat_toc) == 2
|
||||
# Level 1 entry should extract CP reference
|
||||
assert flat_toc[0]["canonical_ref"] == "CP 1.628"
|
||||
# Level 2 entry should inherit CP reference
|
||||
assert flat_toc[1]["canonical_ref"] == "CP 1.628"
|
||||
assert flat_toc[1]["full_path"] == "Peirce: CP 1.628 > 628. It is the instincts..."
|
||||
assert flat_toc[1]["chapter_title"] == "Peirce: CP 1.628"
|
||||
|
||||
def test_flatten_empty_toc(self) -> None:
|
||||
"""Test flattening an empty TOC."""
|
||||
flat_toc = flatten_toc_with_paths([], {})
|
||||
assert flat_toc == []
|
||||
|
||||
def test_flatten_nested_hierarchy(self) -> None:
|
||||
"""Test flattening a deeply nested hierarchy."""
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Part I",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "Chapter 1",
|
||||
"level": 2,
|
||||
"children": [
|
||||
{
|
||||
"title": "Section 1.1",
|
||||
"level": 3,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
flat_toc = flatten_toc_with_paths(toc, {})
|
||||
|
||||
assert len(flat_toc) == 3
|
||||
assert flat_toc[2]["full_path"] == "Part I > Chapter 1 > Section 1.1"
|
||||
assert flat_toc[2]["parent_titles"] == ["Part I", "Chapter 1"]
|
||||
assert flat_toc[2]["chapter_title"] == "Part I"
|
||||
|
||||
def test_flatten_stephanus_pagination(self) -> None:
|
||||
"""Test flattening TOC with Stephanus pagination (e.g., Plato)."""
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Ménon 80a",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "80a. MÉNON : Socrate...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
flat_toc = flatten_toc_with_paths(toc, {})
|
||||
|
||||
assert flat_toc[0]["canonical_ref"] == "Ménon 80a"
|
||||
assert flat_toc[1]["canonical_ref"] == "Ménon 80a"
|
||||
|
||||
|
||||
class TestExtractParagraphNumber:
|
||||
"""Tests for extract_paragraph_number function."""
|
||||
|
||||
def test_extract_standard_paragraph(self) -> None:
|
||||
"""Test extracting standard paragraph number."""
|
||||
assert extract_paragraph_number("628. It is the instincts...") == "628"
|
||||
assert extract_paragraph_number("42. On the nature of...") == "42"
|
||||
|
||||
def test_extract_stephanus_paragraph(self) -> None:
|
||||
"""Test extracting Stephanus-style paragraph (with letter)."""
|
||||
assert extract_paragraph_number("80a. SOCRATE: Sais-tu...") == "80a"
|
||||
assert extract_paragraph_number("215c. Text here") == "215c"
|
||||
|
||||
def test_extract_section_symbol(self) -> None:
|
||||
"""Test extracting paragraph with section symbol."""
|
||||
assert extract_paragraph_number("§42 On the nature of...") == "42"
|
||||
assert extract_paragraph_number("§ 628 Text") == "628"
|
||||
|
||||
def test_extract_cp_reference(self) -> None:
|
||||
"""Test extracting paragraph from CP reference."""
|
||||
assert extract_paragraph_number("CP 5.628. Text") == "628"
|
||||
assert extract_paragraph_number("CP 1.42. My philosophy") == "42"
|
||||
|
||||
def test_extract_no_paragraph(self) -> None:
|
||||
"""Test extraction when no paragraph number present."""
|
||||
assert extract_paragraph_number("Introduction") is None
|
||||
assert extract_paragraph_number("") is None
|
||||
assert extract_paragraph_number("Chapter One") is None
|
||||
|
||||
|
||||
class TestFindMatchingTocEntry:
|
||||
"""Tests for find_matching_toc_entry function."""
|
||||
|
||||
def setup_method(self) -> None:
|
||||
"""Set up test fixtures."""
|
||||
self.flat_toc: List[FlatTOCEntryEnriched] = [
|
||||
{
|
||||
"title": "Peirce: CP 1.628",
|
||||
"level": 1,
|
||||
"full_path": "Peirce: CP 1.628",
|
||||
"chapter_title": "Peirce: CP 1.628",
|
||||
"canonical_ref": "CP 1.628",
|
||||
"parent_titles": [],
|
||||
"index_in_flat_list": 0,
|
||||
},
|
||||
{
|
||||
"title": "628. It is the instincts...",
|
||||
"level": 2,
|
||||
"full_path": "Peirce: CP 1.628 > 628. It is the instincts...",
|
||||
"chapter_title": "Peirce: CP 1.628",
|
||||
"canonical_ref": "CP 1.628",
|
||||
"parent_titles": ["Peirce: CP 1.628"],
|
||||
"index_in_flat_list": 1,
|
||||
},
|
||||
{
|
||||
"title": "Peirce: CP 1.42",
|
||||
"level": 1,
|
||||
"full_path": "Peirce: CP 1.42",
|
||||
"chapter_title": "Peirce: CP 1.42",
|
||||
"canonical_ref": "CP 1.42",
|
||||
"parent_titles": [],
|
||||
"index_in_flat_list": 2,
|
||||
},
|
||||
{
|
||||
"title": "42. My philosophy resuscitates Hegel",
|
||||
"level": 2,
|
||||
"full_path": "Peirce: CP 1.42 > 42. My philosophy resuscitates Hegel",
|
||||
"chapter_title": "Peirce: CP 1.42",
|
||||
"canonical_ref": "CP 1.42",
|
||||
"parent_titles": ["Peirce: CP 1.42"],
|
||||
"index_in_flat_list": 3,
|
||||
},
|
||||
]
|
||||
|
||||
def test_exact_title_match(self) -> None:
|
||||
"""Test exact title matching."""
|
||||
chunk: Dict[str, Any] = {
|
||||
"section": "628. It is the instincts...",
|
||||
"order_index": 0,
|
||||
}
|
||||
|
||||
entry = find_matching_toc_entry(chunk, self.flat_toc)
|
||||
|
||||
assert entry is not None
|
||||
assert entry["title"] == "628. It is the instincts..."
|
||||
assert entry["canonical_ref"] == "CP 1.628"
|
||||
|
||||
def test_paragraph_number_match(self) -> None:
|
||||
"""Test paragraph number matching with text similarity."""
|
||||
chunk: Dict[str, Any] = {
|
||||
"section": "42. My philosophy resuscitates Hegel",
|
||||
"order_index": 1,
|
||||
}
|
||||
|
||||
entry = find_matching_toc_entry(chunk, self.flat_toc)
|
||||
|
||||
assert entry is not None
|
||||
assert entry["canonical_ref"] == "CP 1.42"
|
||||
|
||||
def test_no_match_empty_toc(self) -> None:
|
||||
"""Test behavior with empty TOC."""
|
||||
chunk: Dict[str, Any] = {"section": "Some section", "order_index": 0}
|
||||
|
||||
entry = find_matching_toc_entry(chunk, [])
|
||||
|
||||
assert entry is None
|
||||
|
||||
def test_no_match_empty_section(self) -> None:
|
||||
"""Test behavior with chunk having no section."""
|
||||
chunk: Dict[str, Any] = {"text": "Some text", "order_index": 0}
|
||||
|
||||
entry = find_matching_toc_entry(chunk, self.flat_toc)
|
||||
|
||||
# Without section field, function returns None (doesn't guess)
|
||||
# This is correct behavior - we don't want to match without text
|
||||
assert entry is None
|
||||
|
||||
def test_proximity_match_fallback(self) -> None:
|
||||
"""Test proximity matching when no text match found."""
|
||||
chunk: Dict[str, Any] = {
|
||||
"section": "Unknown section",
|
||||
"order_index": 1,
|
||||
}
|
||||
|
||||
entry = find_matching_toc_entry(chunk, self.flat_toc)
|
||||
|
||||
# Should return entry with closest index_in_flat_list to order_index=1
|
||||
assert entry is not None
|
||||
assert entry["index_in_flat_list"] == 1
|
||||
|
||||
|
||||
class TestEnrichChunksWithToc:
|
||||
"""Tests for enrich_chunks_with_toc function."""
|
||||
|
||||
def test_enrich_chunks_no_toc(self) -> None:
|
||||
"""Test graceful fallback when TOC is absent."""
|
||||
chunks: List[Dict[str, Any]] = [
|
||||
{"text": "test", "section": "Intro"},
|
||||
]
|
||||
|
||||
enriched = enrich_chunks_with_toc(chunks, [], {})
|
||||
|
||||
assert enriched == chunks # Unchanged
|
||||
|
||||
def test_enrich_chunks_with_match(self) -> None:
|
||||
"""Test enrichment with successful TOC matching."""
|
||||
chunks: List[Dict[str, Any]] = [
|
||||
{"text": "test", "section": "628. It is the instincts..."},
|
||||
]
|
||||
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Peirce: CP 1.628",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "628. It is the instincts...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
enriched = enrich_chunks_with_toc(chunks, toc, {})
|
||||
|
||||
assert len(enriched) == 1
|
||||
assert "Peirce: CP 1.628" in enriched[0]["sectionPath"]
|
||||
assert enriched[0]["chapterTitle"] == "Peirce: CP 1.628"
|
||||
assert enriched[0]["canonical_reference"] == "CP 1.628"
|
||||
|
||||
def test_enrich_chunks_partial_match(self) -> None:
|
||||
"""Test enrichment when only some chunks match."""
|
||||
chunks: List[Dict[str, Any]] = [
|
||||
{"text": "test1", "section": "628. It is the instincts...", "order_index": 0},
|
||||
{"text": "test2", "section": "Unknown section", "order_index": 1},
|
||||
]
|
||||
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Peirce: CP 1.628",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "628. It is the instincts...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
enriched = enrich_chunks_with_toc(chunks, toc, {})
|
||||
|
||||
# First chunk should be enriched
|
||||
assert "Peirce: CP 1.628" in enriched[0]["sectionPath"]
|
||||
assert enriched[0]["canonical_reference"] == "CP 1.628"
|
||||
|
||||
# Second chunk doesn't match, so uses proximity fallback
|
||||
# Proximity matching will assign it to the closest TOC entry
|
||||
assert "sectionPath" in enriched[1] # Should get proximity match
|
||||
|
||||
def test_enrich_chunks_preserves_original_fields(self) -> None:
|
||||
"""Test that enrichment preserves other chunk fields."""
|
||||
chunks: List[Dict[str, Any]] = [
|
||||
{
|
||||
"text": "test",
|
||||
"section": "628. It is the instincts...",
|
||||
"order_index": 42,
|
||||
"keywords": ["test"],
|
||||
},
|
||||
]
|
||||
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Peirce: CP 1.628",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "628. It is the instincts...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
enriched = enrich_chunks_with_toc(chunks, toc, {})
|
||||
|
||||
# Original fields should be preserved
|
||||
assert enriched[0]["text"] == "test"
|
||||
assert enriched[0]["order_index"] == 42
|
||||
assert enriched[0]["keywords"] == ["test"]
|
||||
# New fields should be added
|
||||
assert "canonical_reference" in enriched[0]
|
||||
|
||||
def test_enrich_chunks_empty_chunks_list(self) -> None:
|
||||
"""Test behavior with empty chunks list."""
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{"title": "Chapter 1", "level": 1, "children": []},
|
||||
]
|
||||
|
||||
enriched = enrich_chunks_with_toc([], toc, {})
|
||||
|
||||
assert enriched == []
|
||||
|
||||
|
||||
# Integration test combining multiple functions
|
||||
class TestTocEnricherIntegration:
|
||||
"""Integration tests for the complete enrichment pipeline."""
|
||||
|
||||
def test_full_peirce_enrichment_pipeline(self) -> None:
|
||||
"""Test complete enrichment pipeline with Peirce data."""
|
||||
# Realistic Peirce TOC structure
|
||||
toc: List[Dict[str, Any]] = [
|
||||
{
|
||||
"title": "Peirce: CP 6.628",
|
||||
"level": 1,
|
||||
"children": [
|
||||
{
|
||||
"title": "628. I think we need to reflect...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
{
|
||||
"title": "629. The next point is...",
|
||||
"level": 2,
|
||||
"children": [],
|
||||
},
|
||||
],
|
||||
},
|
||||
]
|
||||
|
||||
# Realistic chunks from pdf_pipeline
|
||||
chunks: List[Dict[str, Any]] = [
|
||||
{
|
||||
"text": "I think we need to reflect on the nature of signs...",
|
||||
"section": "628. I think we need to reflect...",
|
||||
"order_index": 0,
|
||||
},
|
||||
{
|
||||
"text": "The next point is about interpretation...",
|
||||
"section": "629. The next point is...",
|
||||
"order_index": 1,
|
||||
},
|
||||
]
|
||||
|
||||
# Run enrichment
|
||||
enriched = enrich_chunks_with_toc(chunks, toc, {})
|
||||
|
||||
# Verify results
|
||||
assert len(enriched) == 2
|
||||
|
||||
# First chunk
|
||||
assert enriched[0]["sectionPath"] == "Peirce: CP 6.628 > 628. I think we need to reflect..."
|
||||
assert enriched[0]["chapterTitle"] == "Peirce: CP 6.628"
|
||||
assert enriched[0]["canonical_reference"] == "CP 6.628"
|
||||
|
||||
# Second chunk
|
||||
assert enriched[1]["sectionPath"] == "Peirce: CP 6.628 > 629. The next point is..."
|
||||
assert enriched[1]["chapterTitle"] == "Peirce: CP 6.628"
|
||||
assert enriched[1]["canonical_reference"] == "CP 6.628"
|
||||
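The behaviour pinned down by the tests above can be illustrated with a minimal, self-contained sketch. It is not the code in `utils/toc_enricher.py`: the names `sketch_extract_paragraph_number` and `sketch_flatten_toc` and the three regular expressions are hypothetical, inferred only from the test cases. The second argument that the real `flatten_toc_with_paths` accepts, and the `canonical_ref` and `index_in_flat_list` fields, are deliberately left out.

```python
import re
from typing import Any, Dict, List, Optional

# Hypothetical patterns, inferred only from the test cases above.
_CP_RE = re.compile(r"^CP\s+\d+\.(\d+)")        # "CP 5.628. Text" -> "628"
_SECTION_RE = re.compile(r"^§\s*(\d+[a-z]?)")   # "§ 628 Text"     -> "628"
_NUMBERED_RE = re.compile(r"^(\d+[a-z]?)\.\s")  # "80a. SOCRATE"   -> "80a"


def sketch_extract_paragraph_number(title: str) -> Optional[str]:
    """Return a leading paragraph number such as '628' or '80a', else None."""
    text = title.strip()
    for pattern in (_CP_RE, _SECTION_RE, _NUMBERED_RE):
        match = pattern.match(text)
        if match:
            return match.group(1)
    return None


def sketch_flatten_toc(
    toc: List[Dict[str, Any]],
    parents: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Depth-first flatten of a nested TOC, recording each entry's ancestry."""
    parents = parents or []
    flat: List[Dict[str, Any]] = []
    for entry in toc:
        title = entry["title"]
        flat.append(
            {
                "title": title,
                "level": entry["level"],
                "full_path": " > ".join(parents + [title]),
                # The tests treat the top-level ancestor as the chapter.
                "chapter_title": parents[0] if parents else title,
                "parent_titles": list(parents),
            }
        )
        # Children inherit the current entry as their nearest parent.
        flat.extend(sketch_flatten_toc(entry.get("children", []), parents + [title]))
    return flat
```

Passing the ancestor titles down the recursion keeps `full_path`, `parent_titles`, and `chapter_title` consistent with one another, which is the property the nested-hierarchy test relies on.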
74
generations/library_rag/utils/__init__.py
Normal file
@@ -0,0 +1,74 @@
|
||||
"""Utils - PDF parsing pipeline with Mistral OCR and LLM structuring.
|
||||
|
||||
Version 2.0: Intelligent pipeline with LLM extraction of metadata,
|
||||
TOC, section classification, semantic chunking, and validation.
|
||||
"""
|
||||
|
||||
from .mistral_client import create_client, get_api_key, estimate_ocr_cost
|
||||
from .pdf_uploader import upload_pdf
|
||||
from .ocr_processor import run_ocr, serialize_ocr_response
|
||||
from .markdown_builder import build_markdown
|
||||
from .image_extractor import extract_images, create_image_writer
|
||||
from .hierarchy_parser import build_hierarchy
|
||||
from .llm_structurer import structure_with_llm, LLMStructureError
|
||||
|
||||
# New LLM v2 modules
|
||||
from .llm_metadata import extract_metadata
|
||||
from .llm_toc import extract_toc
|
||||
from .llm_classifier import classify_sections, filter_indexable_sections
|
||||
from .llm_cleaner import clean_chunk, clean_page_markers, is_chunk_valid
|
||||
from .llm_chunker import chunk_section_with_llm, simple_chunk_by_paragraphs, extract_concepts_from_chunk, extract_paragraph_number
|
||||
from .llm_validator import validate_document, apply_corrections, enrich_chunks_with_concepts
|
||||
|
||||
# Pipeline
|
||||
from .pdf_pipeline import process_pdf, process_pdf_v2, process_pdf_bytes
|
||||
from .weaviate_ingest import ingest_document, delete_document_chunks
|
||||
|
||||
__all__ = [
|
||||
# Mistral client
|
||||
"create_client",
|
||||
"get_api_key",
|
||||
"estimate_ocr_cost",
|
||||
# Upload
|
||||
"upload_pdf",
|
||||
# OCR
|
||||
"run_ocr",
|
||||
"serialize_ocr_response",
|
||||
# Markdown
|
||||
"build_markdown",
|
||||
# Images
|
||||
"extract_images",
|
||||
"create_image_writer",
|
||||
# Hierarchy
|
||||
"build_hierarchy",
|
||||
# LLM Legacy
|
||||
"structure_with_llm",
|
||||
"LLMStructureError",
|
||||
# LLM v2 - Metadata
|
||||
"extract_metadata",
|
||||
# LLM v2 - TOC
|
||||
"extract_toc",
|
||||
# LLM v2 - Classification
|
||||
"classify_sections",
|
||||
"filter_indexable_sections",
|
||||
# LLM v2 - Cleaning
|
||||
"clean_chunk",
|
||||
"clean_page_markers",
|
||||
"is_chunk_valid",
|
||||
# LLM v2 - Chunking
|
||||
"chunk_section_with_llm",
|
||||
"simple_chunk_by_paragraphs",
|
||||
"extract_concepts_from_chunk",
|
||||
"extract_paragraph_number",
|
||||
# LLM v2 - Validation
|
||||
"validate_document",
|
||||
"apply_corrections",
|
||||
"enrich_chunks_with_concepts",
|
||||
# Pipeline
|
||||
"process_pdf",
|
||||
"process_pdf_v2",
|
||||
"process_pdf_bytes",
|
||||
# Weaviate
|
||||
"ingest_document",
|
||||
"delete_document_chunks",
|
||||
]
|
||||
267
generations/library_rag/utils/hierarchy_parser.py
Normal file
@@ -0,0 +1,267 @@
|
||||
"""Hierarchical Markdown document parser for semantic chunking.
|
||||
|
||||
This module provides utilities for parsing Markdown documents into
|
||||
hierarchical structures based on heading levels (# to ######). It is
|
||||
a key component of the RAG pipeline, enabling:
|
||||
|
||||
1. **Structure Extraction**: Parse Markdown into a tree of sections
|
||||
2. **Context Preservation**: Maintain hierarchical context (part > chapter > section)
|
||||
3. **Semantic Chunking**: Flatten hierarchy into chunks with full path context
|
||||
|
||||
The parser uses a stack-based algorithm to build nested section trees,
|
||||
preserving the document's logical structure for downstream processing.
|
||||
|
||||
Architecture:
|
||||
Input: Raw Markdown text with headings
|
||||
↓
|
||||
build_hierarchy() → DocumentHierarchy (tree structure)
|
||||
↓
|
||||
flatten_hierarchy() → List[FlatChunk] (with hierarchical context)
|
||||
|
||||
TypedDict Definitions:
|
||||
- HierarchyPath: Hierarchical path (part/chapter/section/subsection)
|
||||
- HierarchyNode: Tree node with title, level, content, children
|
||||
- DocumentHierarchy: Complete document structure
|
||||
- FlatChunk: Flattened chunk with context for RAG ingestion
|
||||
|
||||
Algorithm:
|
||||
The build_hierarchy() function uses a stack-based approach:
|
||||
1. Initialize a virtual root node at level 0
|
||||
2. For each line in the document:
|
||||
- If heading: pop stack until parent level found, then push new node
|
||||
- If content: append to current node's content
|
||||
3. Finalize nodes by joining content lines
|
||||
|
||||
Example:
|
||||
>>> markdown = '''
|
||||
... # Introduction
|
||||
... This is the intro.
|
||||
...
|
||||
... ## Background
|
||||
... Some background text.
|
||||
...
|
||||
... ## Methodology
|
||||
... Methods used here.
|
||||
... '''
|
||||
>>> hierarchy = build_hierarchy(markdown)
|
||||
>>> print(hierarchy["sections"][0]["title"])
|
||||
Introduction
|
||||
>>> chunks = flatten_hierarchy(hierarchy)
|
||||
>>> for chunk in chunks:
|
||||
... print(f"{chunk['chunk_id']}: {chunk['title']}")
|
||||
chunk_00001: Introduction
|
||||
chunk_00002: Background
|
||||
chunk_00003: Methodology
|
||||
|
||||
See Also:
|
||||
- utils.llm_chunker: Semantic chunking using LLM
|
||||
- utils.markdown_builder: Markdown generation from OCR
|
||||
- utils.weaviate_ingest: Ingestion of chunks into Weaviate
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from typing import List, Optional, Pattern, TypedDict
|
||||
|
||||
# Import type definitions from central types module
|
||||
from utils.types import (
|
||||
DocumentHierarchy,
|
||||
FlatChunk,
|
||||
HierarchyNode,
|
||||
HierarchyPath,
|
||||
)
|
||||
|
||||
|
||||
class _BuildNode(TypedDict):
|
||||
"""Internal node used while building the hierarchy."""
|
||||
|
||||
title: Optional[str]
|
||||
level: int
|
||||
content: List[str]
|
||||
children: List[_BuildNode]
|
||||
|
||||
|
||||
def build_hierarchy(markdown_text: str) -> DocumentHierarchy:
|
||||
"""Build a hierarchical structure from Markdown headings.
|
||||
|
||||
Parses headings (# to ######) and builds a tree of sections
|
||||
with their text content.
|
||||
|
||||
Args:
|
||||
markdown_text: Markdown text to parse
|
||||
|
||||
Returns:
|
||||
Dictionary with:
|
||||
- preamble: Text before the first heading
|
||||
- sections: List of nested sections
|
||||
|
||||
Each section contains:
|
||||
- title: Section title
|
||||
- level: Level (1-6)
|
||||
- content: Text content
|
||||
- children: Subsections
|
||||
"""
|
||||
# Regex for Markdown headings
|
||||
heading_re: Pattern[str] = re.compile(r"^(#{1,6})\s+(.*)$")
|
||||
|
||||
lines: List[str] = markdown_text.splitlines()
|
||||
|
||||
# Root node (level 0, virtual)
|
||||
root: _BuildNode = {
|
||||
"title": None,
|
||||
"level": 0,
|
||||
"content": [],
|
||||
"children": [],
|
||||
}
|
||||
|
||||
# Stack tracking the current hierarchy
|
||||
stack: List[_BuildNode] = [root]
|
||||
|
||||
for line in lines:
|
||||
stripped: str = line.rstrip()
|
||||
match: Optional[re.Match[str]] = heading_re.match(stripped)
|
||||
|
||||
if match:
|
||||
# This is a heading
|
||||
level: int = len(match.group(1))
|
||||
title: str = match.group(2).strip()
|
||||
|
||||
# Pop the stack until the appropriate parent is on top
|
||||
while stack and stack[-1]["level"] >= level:
|
||||
stack.pop()
|
||||
|
||||
# Create the new node
|
||||
node: _BuildNode = {
|
||||
"title": title,
|
||||
"level": level,
|
||||
"content": [],
|
||||
"children": [],
|
||||
}
|
||||
|
||||
# Attach to the parent
|
||||
parent: _BuildNode = stack[-1]
|
||||
parent["children"].append(node)
|
||||
|
||||
# Push the new node
|
||||
stack.append(node)
|
||||
else:
|
||||
# This is content; append it to the current node
|
||||
stack[-1]["content"].append(stripped)
|
||||
|
||||
# Finalize nodes (join content lines)
|
||||
def finalize(node: _BuildNode) -> HierarchyNode:
|
||||
"""Convert a build node into a final node."""
|
||||
return HierarchyNode(
|
||||
title=node["title"],
|
||||
level=node["level"],
|
||||
content="\n".join(node["content"]).strip(),
|
||||
children=[finalize(child) for child in node["children"]],
|
||||
)
|
||||
|
||||
# Extract the preamble and the sections
|
||||
preamble: str = "\n".join(root["content"]).strip()
|
||||
sections: List[HierarchyNode] = [finalize(child) for child in root["children"]]
|
||||
|
||||
return DocumentHierarchy(
|
||||
preamble=preamble,
|
||||
sections=sections,
|
||||
)
|
||||
|
||||
|
||||
def flatten_hierarchy(hierarchy: DocumentHierarchy) -> List[FlatChunk]:
|
||||
"""Flatten the hierarchy into a list of chunks.
|
||||
|
||||
Args:
|
||||
hierarchy: Hierarchical structure (output of build_hierarchy)
|
||||
|
||||
Returns:
|
||||
List of chunks with their hierarchical context
|
||||
"""
|
||||
chunks: List[FlatChunk] = []
|
||||
|
||||
# Preamble as the first chunk
|
||||
if hierarchy.get("preamble"):
|
||||
preamble_chunk: FlatChunk = {
|
||||
"chunk_id": "chunk_00000",
|
||||
"text": hierarchy["preamble"],
|
||||
"hierarchy": HierarchyPath(
|
||||
part=None,
|
||||
chapter=None,
|
||||
section=None,
|
||||
subsection=None,
|
||||
),
|
||||
"type": "preamble",
|
||||
"level": 0,
|
||||
"title": None,
|
||||
}
|
||||
chunks.append(preamble_chunk)
|
||||
|
||||
def process_section(
|
||||
section: HierarchyNode,
|
||||
path: HierarchyPath,
|
||||
index: int,
|
||||
) -> int:
|
||||
"""Recursively process a section.
|
||||
|
||||
Args:
|
||||
section: Section node to process
|
||||
path: Current hierarchical path
|
||||
index: Index of the next chunk
|
||||
|
||||
Returns:
|
||||
New index after processing
|
||||
"""
|
||||
level: int = section["level"]
|
||||
title: Optional[str] = section["title"]
|
||||
|
||||
# Update the hierarchical path
|
||||
current_path: HierarchyPath = path.copy()
|
||||
if level == 1:
|
||||
current_path = HierarchyPath(
|
||||
part=title,
|
||||
chapter=None,
|
||||
section=None,
|
||||
subsection=None,
|
||||
)
|
||||
elif level == 2:
|
||||
current_path["chapter"] = title
|
||||
current_path["section"] = None
|
||||
current_path["subsection"] = None
|
||||
elif level == 3:
|
||||
current_path["section"] = title
|
||||
current_path["subsection"] = None
|
||||
elif level >= 4:
|
||||
current_path["subsection"] = title
|
||||
|
||||
# Create the chunk if the section has content
|
||||
if section["content"]:
|
||||
chunk: FlatChunk = {
|
||||
"chunk_id": f"chunk_{index:05d}",
|
||||
"text": section["content"],
|
||||
"hierarchy": current_path.copy(),
|
||||
"type": "main_content",
|
||||
"level": level,
|
||||
"title": title,
|
||||
}
|
||||
chunks.append(chunk)
|
||||
index += 1
|
||||
|
||||
# Process the children
|
||||
for child in section["children"]:
|
||||
index = process_section(child, current_path, index)
|
||||
|
||||
return index
|
||||
|
||||
# Process all sections
|
||||
idx: int = 1
|
||||
initial_path: HierarchyPath = HierarchyPath(
|
||||
part=None,
|
||||
chapter=None,
|
||||
section=None,
|
||||
subsection=None,
|
||||
)
|
||||
for section in hierarchy.get("sections", []):
|
||||
idx = process_section(section, initial_path, idx)
|
||||
|
||||
return chunks
|
||||
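The stack-based step described in the module docstring ("pop stack until parent level found, then push new node") can be seen in isolation with a short sketch. This is not the module's code: `sketch_outline` is a hypothetical helper that reuses the same heading regex but ignores body text and returns one `A > B > C` path per heading, so the effect of the pop loop is easy to inspect.

```python
import re
from typing import Any, Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def sketch_outline(markdown_text: str) -> List[str]:
    """Return one 'A > B > C' path per heading, using a stack of open headings."""
    stack: List[Dict[str, Any]] = []  # open headings, outermost first
    paths: List[str] = []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line.rstrip())
        if not match:
            continue  # body text is ignored in this sketch
        level = len(match.group(1))
        # Close every open heading at the same or a deeper level.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        stack.append({"level": level, "title": match.group(2).strip()})
        paths.append(" > ".join(node["title"] for node in stack))
    return paths


if __name__ == "__main__":
    sample = "# Introduction\nIntro.\n## Background\n## Methodology\n# Results"
    for path in sketch_outline(sample):
        print(path)
```

Because a heading closes every open heading at its own level or deeper, `## Methodology` replaces `## Background` on the stack while `# Introduction` stays open, and `# Results` then closes both.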
192
generations/library_rag/utils/image_extractor.py
Normal file
@@ -0,0 +1,192 @@
|
||||
"""Image extraction and storage from OCR API responses.
|
||||
|
||||
This module provides utilities for extracting and saving images from
|
||||
Mistral OCR API responses. It is a companion module to markdown_builder,
|
||||
handling the image-specific aspects of document processing.
|
||||
|
||||
Features:
|
||||
- **Image Writer Factory**: Creates reusable callbacks for image saving
|
||||
- **Batch Extraction**: Processes all images from an OCR response
|
||||
- **Protocol-based Design**: Flexible interface for custom implementations
|
||||
|
||||
Pipeline Position:
|
||||
OCR Response → **Image Extractor** → Saved images + paths for Markdown
|
||||
|
||||
Components:
|
||||
1. ImageWriterProtocol: Interface definition for image saving
|
||||
2. create_image_writer(): Factory for standard file-based writers
|
||||
3. extract_images(): Batch extraction from OCR responses
|
||||
|
||||
Integration:
|
||||
The image writer is designed to integrate with markdown_builder:
|
||||
|
||||
>>> from utils.image_extractor import create_image_writer
|
||||
>>> from utils.markdown_builder import build_markdown
|
||||
>>>
|
||||
>>> writer = create_image_writer(Path("output/doc/images"))
|
||||
>>> markdown = build_markdown(ocr_response, image_writer=writer)
|
||||
|
||||
Standalone Usage:
|
||||
>>> from pathlib import Path
|
||||
>>> from utils.image_extractor import extract_images
|
||||
>>>
|
||||
>>> # Extract all images from OCR response
|
||||
>>> paths = extract_images(ocr_response, Path("output/my_doc"))
|
||||
>>> print(f"Extracted {len(paths)} images")
|
||||
|
||||
File Naming Convention:
|
||||
Images are named: page{N}_img{M}.png
|
||||
- N: Page number (1-based)
|
||||
- M: Image index within page (1-based)
|
||||
- Format: Always saved with a .png extension (the base64 payload is assumed to be PNG)
|
||||
|
||||
Note:
|
||||
- All indices are 1-based for consistency with page numbering
|
||||
- The images subdirectory is created automatically if needed
|
||||
- Base64 data without proper encoding is silently skipped
|
||||
- Large documents may produce many images; monitor disk space
|
||||
|
||||
See Also:
|
||||
- utils.markdown_builder: Uses ImageWriter for markdown generation
|
||||
- utils.mistral_client: Source of OCR responses with image data
|
||||
"""
|
||||
|
||||
import base64
|
||||
from pathlib import Path
|
||||
from typing import Any, Callable, List, Optional, Protocol
|
||||
|
||||
|
||||
class ImageWriterProtocol(Protocol):
|
||||
"""Protocol for image writing callbacks.
|
||||
|
||||
This protocol defines the interface for functions that save
|
||||
images extracted from OCR responses and return a relative
|
||||
path for markdown references.
|
||||
|
||||
The protocol expects:
|
||||
- page_idx: 1-based page number
|
||||
- img_idx: 1-based image index within the page
|
||||
- image_b64: Base64-encoded image data
|
||||
|
||||
Returns:
|
||||
Relative path to the saved image for markdown inclusion.
|
||||
|
||||
Example:
|
||||
>>> def my_writer(page_idx: int, img_idx: int, image_b64: str) -> str:
|
||||
... # Custom saving logic
|
||||
... return f"images/page{page_idx}_img{img_idx}.png"
|
||||
"""
|
||||
|
||||
def __call__(self, page_idx: int, img_idx: int, image_b64: str) -> str:
|
||||
"""Save image and return relative path for markdown reference."""
|
||||
...
|
||||
|
||||
|
||||
# Type alias for image writer callables
|
||||
ImageWriter = Callable[[int, int, str], str]
|
||||
|
||||
|
||||
def create_image_writer(images_dir: Path) -> ImageWriter:
|
||||
"""Create a function for saving images to disk.
|
||||
|
||||
This factory function creates a closure that saves base64-encoded
|
||||
images to the specified directory and returns relative paths
|
||||
suitable for markdown image references.
|
||||
|
||||
Args:
|
||||
images_dir: Directory path where images will be saved.
|
||||
The directory will be created if it doesn't exist.
|
||||
|
||||
Returns:
|
||||
A callable that accepts (page_idx, img_idx, image_b64) and
|
||||
returns the relative path to the saved image.
|
||||
|
||||
Example:
|
||||
>>> from pathlib import Path
|
||||
>>> writer = create_image_writer(Path("output/images"))
|
||||
>>> path = writer(1, 1, "iVBORw0KGgoAAAANS...")
>>> print(path)
images/page1_img1.png
|
||||
"""
|
||||
# Create directory if it doesn't exist
|
||||
images_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
def writer(page_idx: int, img_idx: int, image_b64: str) -> str:
|
||||
"""Save an image and return its relative path.
|
||||
|
||||
Args:
|
||||
page_idx: Page number (1-based).
|
||||
img_idx: Image index within the page (1-based).
|
||||
image_b64: Base64-encoded image data.
|
||||
|
||||
Returns:
|
||||
Relative path to the saved image file.
|
||||
"""
|
||||
filename: str = f"page{page_idx}_img{img_idx}.png"
|
||||
filepath: Path = images_dir / filename
|
||||
|
||||
# Decode and save
|
||||
image_data: bytes = base64.b64decode(image_b64)
|
||||
filepath.write_bytes(image_data)
|
||||
|
||||
# Return relative path for markdown
|
||||
return f"images/{filename}"
|
||||
|
||||
return writer
|
||||
|
||||
|
||||
def extract_images(ocr_response: Any, output_dir: Path) -> List[str]:
|
||||
"""Extract all images from an OCR response.
|
||||
|
||||
Iterates through all pages in the OCR response, extracts any
|
||||
embedded images, decodes them from base64, and saves them
|
||||
to the output directory.
|
||||
|
||||
Args:
|
||||
ocr_response: OCR response object from Mistral API.
|
||||
Expected to have a pages attribute, where each page
|
||||
may have an images list containing objects with
|
||||
image_base64 attributes.
|
||||
output_dir: Base output directory. Images will be saved
|
||||
to a subdirectory named "images".
|
||||
|
||||
Returns:
|
||||
List of file paths (as strings) to the extracted images. Paths are built from output_dir, so they are absolute only if output_dir is absolute.
|
||||
|
||||
Example:
|
||||
>>> from pathlib import Path
|
||||
>>> paths = extract_images(ocr_response, Path("output/my_doc"))
|
||||
>>> for path in paths:
|
||||
... print(path)
|
||||
output/my_doc/images/page1_img1.png
output/my_doc/images/page2_img1.png
|
||||
|
||||
Note:
|
||||
- Pages and images are 1-indexed in filenames
|
||||
- Images without base64 data are silently skipped
|
||||
- The images subdirectory is created automatically
|
||||
"""
|
||||
images_dir: Path = output_dir / "images"
|
||||
images_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
extracted: List[str] = []
|
||||
|
||||
for page_index, page in enumerate(ocr_response.pages, start=1):
|
||||
if not getattr(page, "images", None):
|
||||
continue
|
||||
|
||||
for img_idx, img in enumerate(page.images, start=1):
|
||||
image_b64: Optional[str] = getattr(img, "image_base64", None)
|
||||
if not image_b64:
|
||||
continue
|
||||
|
||||
filename: str = f"page{page_index}_img{img_idx}.png"
|
||||
filepath: Path = images_dir / filename
|
||||
|
||||
# Decode and save
|
||||
image_data: bytes = base64.b64decode(image_b64)
|
||||
filepath.write_bytes(image_data)
|
||||
|
||||
extracted.append(str(filepath))
|
||||
|
||||
return extracted
|
||||
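The loop in `extract_images` skips images whose `image_base64` is missing or empty, and passes everything else straight to `base64.b64decode`. A minimal sketch of a stricter decoding step follows. `decode_image_payload` is a hypothetical helper, not part of this commit, and the data-URI branch rests on the unverified assumption that some payloads may arrive with a `data:...;base64,` prefix.

```python
import base64
import binascii
from typing import Optional


def decode_image_payload(image_b64: Optional[str]) -> Optional[bytes]:
    """Hypothetical helper: return decoded bytes, or None if the payload is unusable."""
    if not image_b64:
        # Mirrors the existing behaviour: missing or empty data is skipped.
        return None

    # Assumption: a payload may be a data URI such as "data:image/png;base64,<data>".
    if image_b64.startswith("data:") and "," in image_b64:
        image_b64 = image_b64.split(",", 1)[1]

    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(image_b64, validate=True)
    except (binascii.Error, ValueError):
        return None
```

A caller could then write `if data is None: continue` inside the image loop, so that malformed payloads are skipped rather than raising or producing corrupt files.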
319
generations/library_rag/utils/llm_chat.py
Normal file
@@ -0,0 +1,319 @@
|
||||
"""Multi-LLM Integration Module for Chat Conversation.
|
||||
|
||||
Provides a unified interface for calling different LLM providers with streaming support:
|
||||
- Ollama (local, free)
|
||||
- Mistral API
|
||||
- Anthropic API (Claude)
|
||||
- OpenAI API
|
||||
|
||||
Example:
|
||||
>>> for token in call_llm("Hello world", "ollama", "qwen2.5:7b"):
|
||||
... print(token, end="", flush=True)
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
from typing import Iterator, Optional
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv()
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class LLMError(Exception):
|
||||
"""Base exception for LLM errors."""
|
||||
pass
|
||||
|
||||
|
||||
def call_llm(
|
||||
prompt: str,
|
||||
provider: str,
|
||||
model: str,
|
||||
stream: bool = True,
|
||||
temperature: float = 0.7,
|
||||
max_tokens: int = 16384,
|
||||
) -> Iterator[str]:
|
||||
"""Call an LLM provider with unified interface.
|
||||
|
||||
Args:
|
||||
prompt: The prompt to send to the LLM.
|
||||
provider: Provider name ("ollama", "mistral", "anthropic", "openai").
|
||||
model: Model name (e.g., "qwen2.5:7b", "mistral-small-latest", "claude-sonnet-4-5").
|
||||
stream: Whether to stream tokens (default: True).
|
||||
temperature: Temperature for generation (0-1).
|
||||
max_tokens: Maximum tokens to generate (default 16384 for philosophical discussions). Not forwarded to the Ollama backend.
|
||||
|
||||
Yields:
|
||||
Tokens as strings when streaming; otherwise a single string containing the full response.
|
||||
|
||||
Raises:
|
||||
LLMError: If provider is invalid or API call fails.
|
||||
|
||||
Example:
|
||||
>>> for token in call_llm("Test", "ollama", "qwen2.5:7b"):
|
||||
... print(token, end="")
|
||||
"""
|
||||
provider = provider.lower()
|
||||
|
||||
logger.info(f"[LLM Call] Provider: {provider}, Model: {model}, Stream: {stream}")
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
if provider == "ollama":
|
||||
yield from _call_ollama(prompt, model, temperature, stream)
|
||||
elif provider == "mistral":
|
||||
yield from _call_mistral(prompt, model, temperature, max_tokens, stream)
|
||||
elif provider == "anthropic":
|
||||
yield from _call_anthropic(prompt, model, temperature, max_tokens, stream)
|
||||
elif provider == "openai":
|
||||
yield from _call_openai(prompt, model, temperature, max_tokens, stream)
|
||||
else:
|
||||
raise LLMError(f"Provider '{provider}' is not supported. Use: ollama, mistral, anthropic, openai")
|
||||
|
||||
except Exception as e:
|
||||
elapsed = time.time() - start_time
|
||||
logger.error(f"[LLM Call] Error after {elapsed:.2f}s: {e}")
|
||||
raise
|
||||
|
||||
elapsed = time.time() - start_time
|
||||
logger.info(f"[LLM Call] Completed in {elapsed:.2f}s")
|
||||
|
||||
|
||||
def _call_ollama(prompt: str, model: str, temperature: float, stream: bool) -> Iterator[str]:
|
||||
"""Call Ollama API with streaming support.
|
||||
|
||||
Args:
|
||||
prompt: The prompt text.
|
||||
model: Ollama model name.
|
||||
temperature: Temperature (0-1).
|
||||
stream: Whether to stream.
|
||||
|
||||
Yields:
|
||||
Tokens from the model.
|
||||
"""
|
||||
import requests
|
||||
|
||||
base_url = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
|
||||
url = f"{base_url}/api/generate"
|
||||
|
||||
payload = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": stream,
|
||||
"options": {
|
||||
"temperature": temperature,
|
||||
}
|
||||
}
|
||||
|
||||
try:
|
||||
response = requests.post(url, json=payload, stream=stream, timeout=120)
|
||||
response.raise_for_status()
|
||||
|
||||
if stream:
|
||||
# Stream mode: each line is a JSON object with "response" field
|
||||
for line in response.iter_lines():
|
||||
if line:
|
||||
try:
|
||||
data = json.loads(line)
|
||||
token = data.get("response", "")
|
||||
if token:
|
||||
yield token
|
||||
|
||||
# Check if done
|
||||
if data.get("done", False):
|
||||
break
|
||||
except json.JSONDecodeError:
|
||||
continue
|
||||
else:
|
||||
# Non-stream mode
|
||||
data = response.json()
|
||||
yield data.get("response", "")
|
||||
|
||||
except requests.exceptions.RequestException as e:
|
||||
raise LLMError(f"Ollama API error: {e}")
|
||||
|
||||
|
||||
def _call_mistral(prompt: str, model: str, temperature: float, max_tokens: int, stream: bool) -> Iterator[str]:
|
||||
"""Call Mistral API with streaming support.
|
||||
|
||||
Args:
|
||||
prompt: The prompt text.
|
||||
model: Mistral model name.
|
||||
temperature: Temperature (0-1).
|
||||
max_tokens: Max tokens to generate.
|
||||
stream: Whether to stream.
|
||||
|
||||
Yields:
|
||||
Tokens from the model.
|
||||
"""
|
||||
api_key = os.getenv("MISTRAL_API_KEY")
|
||||
if not api_key:
|
||||
raise LLMError("MISTRAL_API_KEY not set in environment")
|
||||
|
||||
try:
|
||||
from mistralai import Mistral
|
||||
except ImportError:
|
||||
raise LLMError("mistralai package not installed. Run: pip install mistralai")
|
||||
|
||||
client = Mistral(api_key=api_key)
|
||||
|
||||
messages = [{"role": "user", "content": prompt}]
|
||||
|
||||
try:
|
||||
if stream:
|
||||
# Streaming mode
|
||||
stream_response = client.chat.stream(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=temperature,
|
||||
max_tokens=max_tokens,
|
||||
)
|
||||
|
||||
for chunk in stream_response:
|
||||
if chunk.data.choices:
|
||||
delta = chunk.data.choices[0].delta
|
||||
if hasattr(delta, 'content') and delta.content:
|
||||
yield delta.content
|
||||
else:
|
||||
# Non-streaming mode
|
||||
response = client.chat.complete(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=temperature,
|
||||
max_tokens=max_tokens,
|
||||
)
|
||||
if response.choices:
|
||||
yield response.choices[0].message.content or ""
|
||||
|
||||
except Exception as e:
|
||||
raise LLMError(f"Mistral API error: {e}")
|
||||
|
||||
|
||||
def _call_anthropic(prompt: str, model: str, temperature: float, max_tokens: int, stream: bool) -> Iterator[str]:
|
||||
"""Call Anthropic API (Claude) with streaming support.
|
||||
|
||||
Args:
|
||||
prompt: The prompt text.
|
||||
model: Claude model name.
|
||||
temperature: Temperature (0-1).
|
||||
max_tokens: Max tokens to generate.
|
||||
stream: Whether to stream.
|
||||
|
||||
Yields:
|
||||
Tokens from the model.
|
||||
"""
|
||||
api_key = os.getenv("ANTHROPIC_API_KEY")
|
||||
if not api_key:
|
||||
raise LLMError("ANTHROPIC_API_KEY not set in environment")
|
||||
|
||||
try:
|
||||
from anthropic import Anthropic
|
||||
except ImportError:
|
||||
raise LLMError("anthropic package not installed. Run: pip install anthropic")
|
||||
|
||||
client = Anthropic(api_key=api_key)
|
||||
|
||||
try:
|
||||
if stream:
|
||||
# Streaming mode
|
||||
with client.messages.stream(
|
||||
model=model,
|
||||
max_tokens=max_tokens,
|
||||
temperature=temperature,
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
) as response_stream:  # distinct name so the `stream` flag is not shadowed
for text in response_stream.text_stream:
|
||||
yield text
|
||||
else:
|
||||
# Non-streaming mode
|
||||
response = client.messages.create(
|
||||
model=model,
|
||||
max_tokens=max_tokens,
|
||||
temperature=temperature,
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
)
|
||||
if response.content:
|
||||
yield response.content[0].text
|
||||
|
||||
except Exception as e:
|
||||
raise LLMError(f"Anthropic API error: {e}")
|
||||
|
||||
|
||||
def _call_openai(prompt: str, model: str, temperature: float, max_tokens: int, stream: bool) -> Iterator[str]:
|
||||
"""Call OpenAI API with streaming support.
|
||||
|
||||
Args:
|
||||
prompt: The prompt text.
|
||||
model: OpenAI model name.
|
||||
temperature: Temperature (0-1).
|
||||
max_tokens: Max tokens to generate.
|
||||
stream: Whether to stream.
|
||||
|
||||
Yields:
|
||||
Tokens from the model.
|
||||
"""
|
||||
api_key = os.getenv("OPENAI_API_KEY")
|
||||
if not api_key:
|
||||
raise LLMError("OPENAI_API_KEY not set in environment")
|
||||
|
||||
try:
|
||||
from openai import OpenAI
|
||||
except ImportError:
|
||||
raise LLMError("openai package not installed. Run: pip install openai")
|
||||
|
||||
client = OpenAI(api_key=api_key)
|
||||
|
||||
messages = [{"role": "user", "content": prompt}]
|
||||
|
||||
# Detect if model uses max_completion_tokens (o1, gpt-5.x) instead of max_tokens
|
||||
uses_completion_tokens = model.startswith("o1") or model.startswith("gpt-5")
|
||||
|
||||
try:
|
||||
if stream:
|
||||
# Streaming mode
|
||||
if uses_completion_tokens:
|
||||
stream_response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
max_completion_tokens=max_tokens,
|
||||
stream=True,
|
||||
)
|
||||
else:
|
||||
stream_response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=temperature,
|
||||
max_tokens=max_tokens,
|
||||
stream=True,
|
||||
)
|
||||
|
||||
for chunk in stream_response:
|
||||
if chunk.choices:
|
||||
delta = chunk.choices[0].delta
|
||||
if hasattr(delta, 'content') and delta.content:
|
||||
yield delta.content
|
||||
else:
|
||||
# Non-streaming mode
|
||||
if uses_completion_tokens:
|
||||
response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
max_completion_tokens=max_tokens,
|
||||
stream=False,
|
||||
)
|
||||
else:
|
||||
response = client.chat.completions.create(
|
||||
model=model,
|
||||
messages=messages,
|
||||
temperature=temperature,
|
||||
max_tokens=max_tokens,
|
||||
stream=False,
|
||||
)
|
||||
if response.choices:
|
||||
yield response.choices[0].message.content or ""
|
||||
|
||||
except Exception as e:
|
||||
raise LLMError(f"OpenAI API error: {e}")
|
||||
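The streaming branch of `_call_ollama` treats each response line as a standalone JSON object, yields its `response` field, and stops once `done` is true. The sketch below isolates that parsing loop so it can be run without a server. `iter_ollama_tokens` and the sample lines are illustrative assumptions, not part of the module.

```python
import json
from typing import Iterable, Iterator


def iter_ollama_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield tokens from newline-delimited JSON lines (hypothetical helper)."""
    for line in lines:
        if not line:
            # Keep-alive or blank lines carry no data.
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            # Skip malformed lines, as the original loop does.
            continue

        token = data.get("response", "")
        if token:
            yield token

        # The final object is flagged with "done": true.
        if data.get("done", False):
            break


# Made-up sample lines standing in for response.iter_lines()
sample_lines = [
    b'{"response": "Hel", "done": false}',
    b"",
    b"not json",
    b'{"response": "lo", "done": false}',
    b'{"response": "", "done": true}',
    b'{"response": "ignored"}',
]
full_text = "".join(iter_ollama_tokens(sample_lines))
```

Keeping the parser separate from the HTTP call makes the stop condition and the error handling testable with plain byte strings.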
495
generations/library_rag/utils/llm_chunker.py
Normal file
@@ -0,0 +1,495 @@
|
||||
"""Semantic chunking of documents via LLM.
|
||||
|
||||
This module provides intelligent semantic chunking capabilities for academic and
|
||||
philosophical texts, using Large Language Models (LLM) to identify coherent units
|
||||
of meaning (argumentative units, definitions, examples, citations, etc.).
|
||||
|
||||
Overview:
|
||||
The module offers two chunking strategies:
|
||||
|
||||
1. **LLM-based semantic chunking** (chunk_section_with_llm):
|
||||
Uses an LLM to identify semantic boundaries and create chunks that preserve
|
||||
argumentative coherence. Each chunk is annotated with summary, concepts, type.
|
||||
|
||||
2. **Simple paragraph-based chunking** (simple_chunk_by_paragraphs):
|
||||
A fast fallback that splits text by paragraph boundaries.
|
||||
|
||||
Semantic Unit Types:
|
||||
- argument: A logical argument or reasoning sequence
|
||||
- definition: A definition or conceptual clarification
|
||||
- example: An illustrative example or case study
|
||||
- citation: A quoted passage from another source
|
||||
- exposition: Expository content presenting ideas
|
||||
- transition: Transitional text between sections
|
||||
|
||||
Chunk Size Guidelines:
|
||||
- Target size: 300-500 words per chunk (configurable)
|
||||
- The LLM is instructed never to split mid-sentence or mid-paragraph
|
||||
- Short sections (< 80% of target) are kept as single chunks
|
||||
|
||||
LLM Provider Support:
|
||||
- ollama: Local LLM (free, slower, default)
|
||||
- mistral: Mistral API (faster, requires API key)
|
||||
|
||||
See Also:
|
||||
utils.llm_cleaner: Chunk cleaning and validation
|
||||
utils.llm_classifier: Section type classification
|
||||
utils.pdf_pipeline: Main pipeline orchestration
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from typing import Any, Dict, List, Literal, Optional, TypedDict
|
||||
|
||||
from .llm_structurer import (
|
||||
_clean_json_string,
|
||||
_get_default_mistral_model,
|
||||
_get_default_model,
|
||||
call_llm,
|
||||
)
|
||||
from .llm_cleaner import clean_page_markers, is_chunk_valid
|
||||
from .types import LLMProvider, SemanticChunk
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Type Definitions for LLM Chunker
|
||||
# =============================================================================
|
||||
|
||||
#: Unit type for semantic chunking (specific to this module's LLM output)
|
||||
ChunkUnitType = Literal[
|
||||
"argument",
|
||||
"definition",
|
||||
"example",
|
||||
"citation",
|
||||
"exposition",
|
||||
"transition",
|
||||
"main_content",
|
||||
]
|
||||
|
||||
|
||||
class LLMChunkResponse(TypedDict, total=False):
|
||||
"""Individual chunk structure as returned by LLM.
|
||||
|
||||
Attributes:
|
||||
text: Chunk text content (exact copy from source)
|
||||
summary: Brief one-sentence summary
|
||||
concepts: Key concepts extracted (3-5 items)
|
||||
type: Semantic unit type
|
||||
"""
|
||||
|
||||
text: str
|
||||
summary: str
|
||||
concepts: List[str]
|
||||
type: str
|
||||
|
||||
|
||||
class LLMChunksResult(TypedDict):
|
||||
"""Complete response structure from LLM chunking.
|
||||
|
||||
Attributes:
|
||||
chunks: List of chunk objects
|
||||
"""
|
||||
|
||||
chunks: List[LLMChunkResponse]
|
||||
|
||||
|
||||
# Note: SemanticChunk is imported from utils.types
|
||||
|
||||
|
||||
def extract_paragraph_number(text: str) -> Optional[int]:
|
||||
"""Extract paragraph number from the beginning of text.
|
||||
|
||||
Many philosophical texts use numbered paragraphs. This function
|
||||
detects various numbering formats.
|
||||
|
||||
Args:
|
||||
text: Text content that may start with a paragraph number.
|
||||
|
||||
Returns:
|
||||
The paragraph number if detected, None otherwise.
|
||||
|
||||
Example:
|
||||
>>> extract_paragraph_number("9 On presente...")
|
||||
9
|
||||
>>> extract_paragraph_number("Normal text")
|
||||
None
|
||||
"""
|
||||
text = text.strip()
|
||||
|
||||
# Possible patterns for paragraph numbers
|
||||
patterns: List[str] = [
|
||||
r'^(\d+)\s+[A-ZÀ-Ü]', # "9 On présente..."
|
||||
r'^(\d+)[A-ZÀ-Ü]', # "10Dans la classification..."
|
||||
r'^§\s*(\d+)', # "§ 15 ..."
|
||||
r'^\[(\d+)\]', # "[9] ..."
|
||||
r'^(\d+)\.', # "9. ..."
|
||||
r'^(\d+)\)', # "9) ..."
|
||||
]
|
||||
|
||||
for pattern in patterns:
|
||||
match: Optional[re.Match[str]] = re.match(pattern, text)
|
||||
if match:
|
||||
try:
|
||||
return int(match.group(1))
|
||||
except ValueError:
|
||||
continue
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def _extract_json_from_response(text: str) -> Dict[str, Any]:
|
||||
"""Extract JSON from LLM response text.
|
||||
|
||||
Handles both wrapped JSON (in <JSON></JSON> tags) and raw JSON responses.
|
||||
Falls back to empty chunks list if parsing fails.
|
||||
|
||||
Args:
|
||||
text: Response text from LLM containing JSON.
|
||||
|
||||
Returns:
|
||||
Parsed JSON as dictionary with 'chunks' key. Returns
|
||||
{"chunks": []} if parsing fails.
|
||||
"""
|
||||
json_match: Optional[re.Match[str]] = re.search(
|
||||
r'<JSON>\s*(.*?)\s*</JSON>', text, re.DOTALL
|
||||
)
|
||||
if json_match:
|
||||
json_str: str = _clean_json_string(json_match.group(1))
|
||||
try:
|
||||
result: Dict[str, Any] = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
start: int = text.find("{")
|
||||
end: int = text.rfind("}")
|
||||
if start != -1 and end > start:
|
||||
json_str = _clean_json_string(text[start:end + 1])
|
||||
try:
|
||||
result = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"Invalid JSON: {e}")
|
||||
|
||||
return {"chunks": []}
|
||||
|
||||
|
||||
def chunk_section_with_llm(
|
||||
section_content: str,
|
||||
section_title: str,
|
||||
chapter_title: Optional[str] = None,
|
||||
subsection_title: Optional[str] = None,
|
||||
section_level: int = 1,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
temperature: float = 0.2,
|
||||
target_chunk_size: int = 400,
|
||||
) -> List[SemanticChunk]:
|
||||
"""Split a section into semantically coherent chunks using an LLM.
|
||||
|
||||
This is the main semantic chunking function. It uses an LLM to identify
|
||||
natural semantic boundaries in academic/philosophical texts, preserving
|
||||
argumentative coherence and annotating each chunk with metadata.
|
||||
|
||||
Args:
|
||||
section_content: The text content of the section to chunk.
|
||||
section_title: Title of the current section being chunked.
|
||||
chapter_title: Title of the parent chapter (level 1) for context.
|
||||
subsection_title: Title of parent subsection (level 2) if applicable.
|
||||
section_level: Hierarchy level (1=chapter, 2=section, etc.).
|
||||
model: LLM model name. If None, uses provider default.
|
||||
provider: LLM provider ("ollama" for local, "mistral" for API).
|
||||
temperature: LLM temperature (lower = more deterministic).
|
||||
target_chunk_size: Target number of words per chunk.
|
||||
|
||||
Returns:
|
||||
List of SemanticChunk dictionaries containing text, summary,
|
||||
concepts, type, section_level, and optionally paragraph_number.
|
||||
|
||||
Note:
|
||||
If section is shorter than 80% of target_chunk_size, it is returned
|
||||
as a single chunk. If LLM fails, returns section with error field.
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Clean the content
|
||||
content: str = clean_page_markers(section_content)
|
||||
|
||||
# If the content is short, do not split it
|
||||
word_count: int = len(content.split())
|
||||
if word_count < target_chunk_size * 0.8:
|
||||
para_num: Optional[int] = extract_paragraph_number(content)
|
||||
chunk: SemanticChunk = {
|
||||
"text": content,
|
||||
"summary": section_title,
|
||||
"concepts": [],
|
||||
"type": "main_content",
|
||||
"section_level": section_level,
|
||||
}
|
||||
if para_num is not None:
|
||||
chunk["paragraph_number"] = para_num
|
||||
if subsection_title and subsection_title != section_title:
|
||||
chunk["subsection_title"] = subsection_title
|
||||
return [chunk]
|
||||
|
||||
chapter_info: str = f"Chapitre: {chapter_title}\n" if chapter_title else ""
|
||||
|
||||
prompt = f"""Tu es un expert en analyse de textes académiques.
|
||||
|
||||
TÂCHE: Découper ce texte en unités sémantiques cohérentes.
|
||||
|
||||
{chapter_info}Section: {section_title}
|
||||
|
||||
RÈGLES DE DÉCOUPAGE:
|
||||
1. Chaque chunk doit avoir un SENS COMPLET (une idée, un argument)
|
||||
2. Taille idéale: {target_chunk_size - 100} à {target_chunk_size + 100} mots
|
||||
3. NE PAS couper au milieu d'une phrase ou d'un paragraphe
|
||||
4. NE PAS couper au milieu d'une citation
|
||||
5. Regrouper les paragraphes qui développent la même idée
|
||||
6. Un chunk peut être plus long si nécessaire pour préserver le sens
|
||||
|
||||
POUR CHAQUE CHUNK, INDIQUE:
|
||||
- text: le texte exact (copié, pas reformulé)
|
||||
- summary: résumé en 1 phrase courte
|
||||
- concepts: 3-5 concepts clés (mots ou expressions)
|
||||
- type: argument | définition | exemple | citation | exposition | transition
|
||||
|
||||
TEXTE À DÉCOUPER:
|
||||
{content}
|
||||
|
||||
RÉPONDS avec un JSON entre <JSON></JSON>:
|
||||
|
||||
<JSON>
|
||||
{{
|
||||
"chunks": [
|
||||
{{
|
||||
"text": "Premier paragraphe ou groupe de paragraphes...",
|
||||
"summary": "Présentation de l'idée principale",
|
||||
"concepts": ["concept1", "concept2", "concept3"],
|
||||
"type": "exposition"
|
||||
}},
|
||||
{{
|
||||
"text": "Deuxième partie du texte...",
|
||||
"summary": "Développement de l'argument",
|
||||
"concepts": ["concept4", "concept5"],
|
||||
"type": "argument"
|
||||
}}
|
||||
]
|
||||
}}
|
||||
</JSON>
|
||||
"""
|
||||
|
||||
logger.info(f"Semantic chunking of '{section_title}' ({word_count} words) via {provider.upper()}")
|
||||
|
||||
try:
|
||||
response: str = call_llm(
|
||||
prompt, model=model, provider=provider, temperature=temperature, timeout=300
|
||||
)
|
||||
result: Dict[str, Any] = _extract_json_from_response(response)
|
||||
chunks: List[Dict[str, Any]] = result.get("chunks", [])
|
||||
|
||||
# Validate the chunks and extract paragraph numbers
|
||||
valid_chunks: List[SemanticChunk] = []
|
||||
for raw_chunk in chunks:
|
||||
text: str = raw_chunk.get("text", "")
|
||||
if is_chunk_valid(text):
|
||||
# Extract the paragraph number if present
|
||||
para_num = extract_paragraph_number(text)
|
||||
|
||||
chunk_data: SemanticChunk = {
|
||||
"text": text,
|
||||
"summary": raw_chunk.get("summary", ""),
|
||||
"concepts": raw_chunk.get("concepts", []),
|
||||
"type": raw_chunk.get("type", "main_content"),
|
||||
"section_level": section_level,
|
||||
}
|
||||
|
||||
# Add the paragraph number if detected
|
||||
if para_num is not None:
|
||||
chunk_data["paragraph_number"] = para_num
|
||||
|
||||
# Add the full hierarchy
|
||||
if subsection_title and subsection_title != section_title:
|
||||
chunk_data["subsection_title"] = subsection_title
|
||||
|
||||
valid_chunks.append(chunk_data)
|
||||
|
||||
# If no chunk is valid, return the full content
|
||||
if not valid_chunks:
|
||||
logger.warning(f"No valid chunk for '{section_title}', returning full content")
|
||||
para_num = extract_paragraph_number(content)
|
||||
fallback: SemanticChunk = {
|
||||
"text": content,
|
||||
"summary": section_title,
|
||||
"concepts": [],
|
||||
"type": "main_content",
|
||||
"section_level": section_level,
|
||||
}
|
||||
if para_num is not None:
|
||||
fallback["paragraph_number"] = para_num
|
||||
return [fallback]
|
||||
|
||||
logger.info(f"Section '{section_title}' split into {len(valid_chunks)} chunks")
|
||||
return valid_chunks
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"LLM chunking error: {e}")
|
||||
# Fallback: return the full content
|
||||
para_num = extract_paragraph_number(content)
|
||||
fallback_err: SemanticChunk = {
|
||||
"text": content,
|
||||
"summary": section_title,
|
||||
"concepts": [],
|
||||
"type": "main_content",
|
||||
"section_level": section_level,
|
||||
"error": str(e),
|
||||
}
|
||||
if para_num is not None:
|
||||
fallback_err["paragraph_number"] = para_num
|
||||
return [fallback_err]
|
||||
|
||||
|
||||
def simple_chunk_by_paragraphs(
|
||||
content: str,
|
||||
max_words: int = 500,
|
||||
min_words: int = 100,
|
||||
) -> List[str]:
|
||||
"""Split text into chunks by paragraph boundaries (no LLM required).
|
||||
|
||||
This is a fast fallback chunking method that respects paragraph and
|
||||
sentence boundaries. Use when LLM processing is not desired.
|
||||
|
||||
The algorithm:
|
||||
1. Split by double newlines (paragraph boundaries)
|
||||
2. Merge small paragraphs until max_words is reached
|
||||
3. Split long paragraphs at sentence boundaries
|
||||
4. Filter chunks below min_words threshold
|
||||
|
||||
Args:
|
||||
content: Text content to split into chunks.
|
||||
max_words: Maximum words per chunk. Defaults to 500.
|
||||
min_words: Minimum words per chunk. Defaults to 100.
|
||||
|
||||
Returns:
|
||||
List of text chunks as strings.
|
||||
|
||||
Example:
|
||||
>>> chunks = simple_chunk_by_paragraphs(text, max_words=400)
|
||||
>>> len(chunks)
|
||||
3
|
||||
"""
|
||||
content = clean_page_markers(content)
|
||||
|
||||
# Split by paragraphs (double newline)
|
||||
paragraphs: List[str] = re.split(r'\n\n+', content)
|
||||
|
||||
chunks: List[str] = []
|
||||
current_chunk: List[str] = []
|
||||
current_words: int = 0
|
||||
|
||||
for para in paragraphs:
|
||||
para = para.strip()
|
||||
if not para:
|
||||
continue
|
||||
|
||||
para_words: int = len(para.split())
|
||||
|
||||
# If a single paragraph is too long, split it by sentences
|
||||
if para_words > max_words:
|
||||
if current_chunk:
|
||||
chunks.append('\n\n'.join(current_chunk))
|
||||
current_chunk = []
|
||||
current_words = 0
|
||||
|
||||
# Split by sentences
|
||||
sentences: List[str] = re.split(r'(?<=[.!?])\s+', para)
|
||||
for sentence in sentences:
|
||||
sentence_words: int = len(sentence.split())
|
||||
if current_words + sentence_words > max_words and current_chunk:
|
||||
chunks.append('\n\n'.join(current_chunk))
|
||||
current_chunk = [sentence]
|
||||
current_words = sentence_words
|
||||
else:
|
||||
current_chunk.append(sentence)
|
||||
current_words += sentence_words
|
||||
|
||||
# If adding this paragraph would exceed the limit
|
||||
elif current_words + para_words > max_words:
|
||||
if current_chunk:
|
||||
chunks.append('\n\n'.join(current_chunk))
|
||||
current_chunk = [para]
|
||||
current_words = para_words
|
||||
|
||||
else:
|
||||
current_chunk.append(para)
|
||||
current_words += para_words
|
||||
|
||||
# Last chunk
|
||||
if current_chunk:
|
||||
chunks.append('\n\n'.join(current_chunk))
|
||||
|
||||
# Filter out chunks that are too short
|
||||
return [c for c in chunks if len(c.split()) >= min_words or len(chunks) == 1]
|
||||
|
||||
|
||||
def extract_concepts_from_chunk(
|
||||
chunk_text: str,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
) -> List[str]:
|
||||
"""Extract key concepts from a text chunk using an LLM.
|
||||
|
||||
Useful for enriching chunks created without LLM processing or for
|
||||
extracting additional concepts from existing chunks.
|
||||
|
||||
Args:
|
||||
chunk_text: The text content to analyze for concepts.
|
||||
model: LLM model name. If None, uses provider default.
|
||||
provider: LLM provider ("ollama" or "mistral").
|
||||
|
||||
Returns:
|
||||
List of 3-5 key concepts (words or short phrases). Returns
|
||||
empty list if extraction fails or text is too short (< 100 chars).
|
||||
|
||||
Example:
|
||||
>>> concepts = extract_concepts_from_chunk("L'etre-pour-la-mort...")
|
||||
>>> concepts
|
||||
['etre-pour-la-mort', 'structure existentiale', 'Dasein']
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
if len(chunk_text) < 100:
|
||||
return []
|
||||
|
||||
prompt: str = f"""Extrait les 3-5 concepts clés de ce texte.
|
||||
Un concept = un mot ou une expression courte (2-3 mots max).
|
||||
|
||||
Texte:
|
||||
{chunk_text[:1500]}
|
||||
|
||||
Réponds avec une liste JSON simple:
|
||||
["concept1", "concept2", "concept3"]
|
||||
"""
|
||||
|
||||
try:
|
||||
response: str = call_llm(prompt, model=model, provider=provider, temperature=0.1, timeout=60)
|
||||
|
||||
# Look for the JSON list
|
||||
match: Optional[re.Match[str]] = re.search(r'\[.*?\]', response, re.DOTALL)
|
||||
if match:
|
||||
concepts: List[str] = json.loads(match.group())
|
||||
return concepts[:5] # Max 5 concepts
|
||||
|
||||
return []
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Concept extraction error: {e}")
|
||||
return []
|
||||
|
||||
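`_extract_json_from_response` tries a `<JSON>...</JSON>` wrapper first, then falls back to the outermost braces, and finally returns an empty chunk list. The sketch below reproduces that fallback order in a self-contained form. `extract_json_block` is a hypothetical name, and the cleaning step performed by `_clean_json_string` is omitted because its implementation is not shown here.

```python
import json
import re
from typing import Any, Dict, List


def extract_json_block(text: str) -> Dict[str, Any]:
    """Parse a JSON object from LLM output, or return {"chunks": []} (sketch)."""
    candidates: List[str] = []

    # 1. Preferred form: JSON wrapped in <JSON></JSON> tags.
    tagged = re.search(r"<JSON>\s*(.*?)\s*</JSON>", text, re.DOTALL)
    if tagged:
        candidates.append(tagged.group(1))

    # 2. Fallback: everything between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    # 3. Nothing parseable: the caller treats this as "no valid chunks".
    return {"chunks": []}
```

Returning a dictionary with an empty `chunks` list, rather than raising, lets the caller reuse its existing "no valid chunk" fallback path.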
582
generations/library_rag/utils/llm_classifier.py
Normal file
@@ -0,0 +1,582 @@
|
||||
"""LLM-based section classification module for document structure analysis.
|
||||
|
||||
This module provides functionality to classify document sections by type
|
||||
(front_matter, chapter, appendix, etc.) using Large Language Models and
|
||||
determine which sections should be indexed for semantic search.
|
||||
|
||||
Key Features:
|
||||
- Section classification via LLM (classify_sections)
|
||||
- Automatic TOC/metadata section exclusion (is_excluded_section)
|
||||
- Post-classification validation (validate_classified_sections)
|
||||
- Filtering for indexable content (filter_indexable_sections)
|
||||
|
||||
Section Types:
|
||||
The following section types are recognized:
|
||||
|
||||
**Indexable Content (should_index=True):**
|
||||
- chapter: Main document content, essays, articles, book reviews
|
||||
- introduction: Document introductions
|
||||
- conclusion: Document conclusions
|
||||
- preface: Prefaces, forewords, warnings (intellectual content)
|
||||
- abstract: Summaries, abstracts
|
||||
|
||||
**Non-Indexable Content (should_index=False):**
|
||||
- front_matter: Title pages, copyright, credits, colophon
|
||||
- toc_display: Table of contents display (not content)
|
||||
- appendix: Document appendices
|
||||
- bibliography: References, bibliography
|
||||
- index: Document index
|
||||
- notes: End notes
|
||||
- ignore: Ads, empty pages, technical metadata
|
||||
|
||||
Classification Strategy:
|
||||
1. LLM analyzes section titles and content previews
|
||||
2. Automatic exclusion rules catch common TOC/metadata patterns
|
||||
3. Post-classification validation detects false positives
|
||||
4. Filtering extracts only indexable content
|
||||
|
||||
Typical Usage:
|
||||
>>> from utils.llm_classifier import classify_sections, filter_indexable_sections
|
||||
>>> sections = [
|
||||
... {"title": "Table of Contents", "content": "...", "level": 1},
|
||||
... {"title": "Introduction", "content": "...", "level": 1},
|
||||
... {"title": "Chapter 1", "content": "...", "level": 1}
|
||||
... ]
|
||||
>>> classified = classify_sections(sections, provider="ollama")
|
||||
>>> indexable = filter_indexable_sections(classified)
|
||||
>>> print([s["title"] for s in indexable])
|
||||
['Introduction', 'Chapter 1']
|
||||
|
||||
LLM Provider Options:
|
||||
- "ollama": Local processing, free but slower
|
||||
- "mistral": Cloud API, faster but incurs costs
|
||||
|
||||
Note:
|
||||
The classifier is designed to handle edge cases like:
|
||||
- Book reviews with analytical content (classified as chapter)
|
||||
- Editor's notes without analysis (classified as front_matter)
|
||||
- TOC fragments embedded in content (detected and excluded)
|
||||
|
||||
See Also:
|
||||
- llm_toc: Table of contents extraction
|
||||
- llm_chunker: Semantic chunking of classified sections
|
||||
- llm_metadata: Document metadata extraction
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from typing import cast, Any, Dict, Final
|
||||
|
||||
from .llm_structurer import (
|
||||
_clean_json_string,
|
||||
_get_default_mistral_model,
|
||||
_get_default_model,
|
||||
call_llm,
|
||||
)
|
||||
from .types import LLMProvider
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# Possible section types
|
||||
SECTION_TYPES: Final[dict[str, str]] = {
|
||||
"front_matter": "Métadonnées, page de titre, copyright, crédits, NOTE DE L'ÉDITEUR, colophon",
|
||||
"toc_display": "Table des matières affichée (pas le contenu)",
|
||||
"preface": "Préface, avant-propos, avertissement (contenu intellectuel à indexer)",
|
||||
"abstract": "Résumé, abstract",
|
||||
"introduction": "Introduction de l'œuvre",
|
||||
"chapter": "Chapitre principal du document",
|
||||
"conclusion": "Conclusion de l'œuvre",
|
||||
"appendix": "Annexes",
|
||||
"bibliography": "Bibliographie, références",
|
||||
"index": "Index",
|
||||
"notes": "Notes de fin",
|
||||
"ignore": "À ignorer (publicités, pages vides, métadonnées techniques)",
|
||||
}
|
||||
|
||||
|
||||
def _extract_json_from_response(text: str) -> dict[str, Any]:
|
||||
"""Extract JSON from LLM response text.
|
||||
|
||||
Handles two formats:
|
||||
1. JSON wrapped in <JSON></JSON> tags
|
||||
2. Raw JSON object in the response
|
||||
|
||||
Args:
|
||||
text: Raw LLM response text.
|
||||
|
||||
Returns:
|
||||
Parsed JSON as dictionary. Returns {"classifications": []} on failure.
|
||||
"""
|
||||
json_match: re.Match[str] | None = re.search(
|
||||
r'<JSON>\s*(.*?)\s*</JSON>', text, re.DOTALL
|
||||
)
|
||||
if json_match:
|
||||
json_str: str = _clean_json_string(json_match.group(1))
|
||||
try:
|
||||
result: Dict[str, Any] = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
start: int = text.find("{")
|
||||
end: int = text.rfind("}")
|
||||
if start != -1 and end > start:
|
||||
json_str = _clean_json_string(text[start:end + 1])
|
||||
try:
|
||||
result = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"JSON invalide: {e}")
|
||||
|
||||
return {"classifications": []}
|
||||
|
||||
|
||||
def classify_sections(
|
||||
sections: list[dict[str, Any]],
|
||||
document_title: str | None = None,
|
||||
model: str | None = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
temperature: float = 0.1,
|
||||
) -> list[dict[str, Any]]:
|
||||
"""Classify document sections by type using LLM.
|
||||
|
||||
Uses an LLM to analyze section titles and content previews to determine
|
||||
the type of each section (chapter, front_matter, toc_display, etc.) and
|
||||
whether it should be indexed for semantic search.
|
||||
|
||||
Args:
|
||||
sections: List of section dictionaries with keys:
|
||||
- title: Section title
|
||||
- content: Section content (preview used)
|
||||
- level: Hierarchy level (1=chapter, 2=section, etc.)
|
||||
document_title: Optional document title for context.
|
||||
model: LLM model name. If None, uses provider default.
|
||||
provider: LLM provider ("ollama" or "mistral").
|
||||
temperature: Model temperature (0.0-1.0). Lower = more deterministic.
|
||||
|
||||
Returns:
|
||||
Same sections list with added classification fields:
|
||||
- type: Section type (SectionType literal)
|
||||
- should_index: Whether to include in vector index
|
||||
- chapter_number: Chapter number if applicable
|
||||
- classification_reason: Explanation for the classification
|
||||
|
||||
Example:
|
||||
>>> sections = [{"title": "Introduction", "content": "...", "level": 1}]
|
||||
>>> classified = classify_sections(sections, provider="ollama")
|
||||
>>> classified[0]["type"]
|
||||
'introduction'
|
||||
>>> classified[0]["should_index"]
|
||||
True
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Prepare the sections for the prompt
|
||||
sections_for_prompt: list[dict[str, Any]] = []
|
||||
for i, section in enumerate(sections[:50]):  # Limit to 50 sections
|
||||
sections_for_prompt.append({
|
||||
"index": i,
|
||||
"title": section.get("title", ""),
|
||||
"preview": section.get("content", "")[:200] if section.get("content") else "",
|
||||
"level": section.get("level", 1),
|
||||
})
|
||||
|
||||
types_description: str = "\n".join([f"- {k}: {v}" for k, v in SECTION_TYPES.items()])
|
||||
title_context: str = f"Titre du document: {document_title}\n" if document_title else ""
|
||||
|
||||
prompt: str = f"""Tu es un expert en analyse de structure documentaire.
|
||||
|
||||
TÂCHE: Classifier chaque section selon son type.
|
||||
|
||||
{title_context}
|
||||
TYPES DISPONIBLES:
|
||||
{types_description}
|
||||
|
||||
RÈGLES:
|
||||
1. "front_matter": UNIQUEMENT pages de titre SANS contenu, copyright, colophon (métadonnées pures)
|
||||
2. "toc_display": la TABLE DES MATIÈRES elle-même (pas son contenu)
|
||||
3. "preface": préface, avant-propos, avertissement (À INDEXER car contenu intellectuel)
|
||||
4. "chapter": TOUT contenu principal - chapitres, sections, articles, revues de livre, essais
|
||||
5. "ignore": publicités, pages vides, métadonnées techniques sans valeur
|
||||
|
||||
IMPORTANT - REVUES DE LIVRE ET ARTICLES:
|
||||
- Une REVUE DE LIVRE ("Book Review") avec analyse critique → chapter, should_index = true
|
||||
- Un ARTICLE académique avec contenu substantiel → chapter, should_index = true
|
||||
- Les métadonnées éditoriales (auteur, affiliation, journal) au début d'un article NE sont PAS un motif pour classer comme "front_matter"
|
||||
- Si le document contient un TEXTE ANALYTIQUE développé → chapter
|
||||
|
||||
CAS PARTICULIERS:
|
||||
- "NOTE DE L'ÉDITEUR" (infos édition, réimpression, SANS analyse) → front_matter, should_index = false
|
||||
- "PRÉFACE" ou "AVANT-PROPOS" (texte intellectuel) → preface, should_index = true
|
||||
- "Book Review" ou "Article" avec paragraphes d'analyse → chapter, should_index = true
|
||||
|
||||
INDEXATION:
|
||||
- should_index = true pour: preface, introduction, chapter, conclusion, abstract
|
||||
- should_index = false pour: front_matter, toc_display, appendix, bibliography, index, notes, ignore
|
||||
|
||||
⚠️ ATTENTION AUX FAUX POSITIFS - LISTE DE TITRES VS CONTENU RÉEL:
|
||||
|
||||
LISTE DE TITRES (toc_display, should_index=false):
|
||||
- Suite de titres courts sans texte explicatif
|
||||
- Lignes commençant par "Comment...", "Où...", "Les dispositions à..."
|
||||
- Énumération de sections sans phrase complète
|
||||
- Exemple: "Comment fixer la croyance?\\nOù la croyance s'oppose au savoir\\nL'idéal de rationalité"
|
||||
|
||||
CONTENU RÉEL (chapter, should_index=true):
|
||||
- Texte avec phrases complètes et verbes conjugués
|
||||
- Paragraphes développés avec arguments
|
||||
- Explications, définitions, raisonnements
|
||||
- Exemple: "Comment fixer la croyance? Cette question se pose dès lors que..."
|
||||
|
||||
SECTIONS À CLASSIFIER:
|
||||
{json.dumps(sections_for_prompt, ensure_ascii=False, indent=2)}
|
||||
|
||||
RÉPONDS avec un JSON entre <JSON></JSON>:
|
||||
|
||||
<JSON>
|
||||
{{
|
||||
"classifications": [
|
||||
{{
|
||||
"index": 0,
|
||||
"type": "front_matter",
|
||||
"should_index": false,
|
||||
"chapter_number": null,
|
||||
"reason": "Page de titre avec métadonnées éditeur"
|
||||
}},
|
||||
{{
|
||||
"index": 1,
|
||||
"type": "chapter",
|
||||
"should_index": true,
|
||||
"chapter_number": 1,
|
||||
"reason": "Premier chapitre du document"
|
||||
}}
|
||||
]
|
||||
}}
|
||||
</JSON>
|
||||
"""
|
||||
|
||||
logger.info(f"Classification de {len(sections_for_prompt)} sections via {provider.upper()} ({model})")
|
||||
|
||||
try:
|
||||
response: str = call_llm(prompt, model=model, provider=provider, temperature=temperature, timeout=300)
|
||||
result: dict[str, Any] = _extract_json_from_response(response)
|
||||
classifications: list[dict[str, Any]] = result.get("classifications", [])
|
||||
|
||||
# Build an index -> classification mapping
|
||||
class_map: dict[int, dict[str, Any]] = {
|
||||
c["index"]: c for c in classifications if "index" in c
|
||||
}
|
||||
|
||||
# Apply the classifications
|
||||
for i, section in enumerate(sections):
|
||||
if i in class_map:
|
||||
c: dict[str, Any] = class_map[i]
|
||||
section["type"] = c.get("type", "chapter")
|
||||
section["should_index"] = c.get("should_index", True)
|
||||
section["chapter_number"] = c.get("chapter_number")
|
||||
section["classification_reason"] = c.get("reason", "")
|
||||
else:
|
||||
# Default: treat as content
|
||||
section["type"] = "chapter"
|
||||
section["should_index"] = True
|
||||
section["chapter_number"] = None
|
||||
|
||||
# Stats
|
||||
types_count: dict[str, int] = {}
|
||||
for s in sections:
|
||||
t: str = s.get("type", "unknown")
|
||||
types_count[t] = types_count.get(t, 0) + 1
|
||||
|
||||
logger.info(f"Classification terminée: {types_count}")
|
||||
|
||||
return sections
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Erreur classification sections: {e}")
|
||||
# On error, mark everything as indexable
|
||||
for section in sections:
|
||||
section["type"] = "chapter"
|
||||
section["should_index"] = True
|
||||
return sections
|
||||
|
||||
|
||||
# Titles to exclude automatically (case-insensitive)
|
||||
EXCLUDED_SECTION_TITLES: Final[list[str]] = [
|
||||
"table des matières",
|
||||
"table des matieres",
|
||||
"sommaire",
|
||||
"table of contents",
|
||||
"contents",
|
||||
"toc",
|
||||
"index",
|
||||
"liste des figures",
|
||||
"liste des tableaux",
|
||||
"list of figures",
|
||||
"list of tables",
|
||||
"note de l'éditeur",
|
||||
"note de l'editeur",
|
||||
"note de la rédaction",
|
||||
"copyright",
|
||||
"mentions légales",
|
||||
"crédits",
|
||||
"colophon",
|
||||
"achevé d'imprimer",
|
||||
]
|
||||
|
||||
|
||||
def is_excluded_section(section: dict[str, Any]) -> bool:
|
||||
"""Check if a section should be automatically excluded from indexing.
|
||||
|
||||
Excludes sections based on:
|
||||
1. Title matching known TOC/metadata patterns
|
||||
2. Content analysis detecting TOC-like structure (short lines, title patterns)
|
||||
|
||||
Args:
|
||||
section: Section dictionary with optional keys:
|
||||
- title: Section title
|
||||
- chapterTitle: Parent chapter title
|
||||
- content: Section content
|
||||
|
||||
Returns:
|
||||
True if section should be excluded from indexing.
|
||||
|
||||
Example:
|
||||
>>> is_excluded_section({"title": "Table des matières"})
|
||||
True
|
||||
>>> is_excluded_section({"title": "Introduction", "content": "..."})
|
||||
False
|
||||
"""
|
||||
title: str = section.get("title", "").lower().strip()
|
||||
chapter_title: str = section.get("chapterTitle", "").lower().strip()
|
||||
|
||||
# Check the section title
|
||||
for excluded in EXCLUDED_SECTION_TITLES:
|
||||
if excluded in title:
|
||||
return True
|
||||
if excluded in chapter_title:
|
||||
return True
|
||||
|
||||
# Check whether the content looks like a list of titles (TOC)
|
||||
content: str = section.get("content", "")
|
||||
if content:
|
||||
lines: list[str] = [l.strip() for l in content.split("\n") if l.strip()]
|
||||
|
||||
# Too few lines: skip detection
|
||||
if len(lines) < 3:
|
||||
return False
|
||||
|
||||
# Criterion 1: short lines (average < 50 chars)
|
||||
avg_len: float = sum(len(l) for l in lines) / len(lines)
|
||||
|
||||
# Criterion 2: the first 10 lines are all short (< 100 chars)
|
||||
all_short: bool = all(len(l) < 100 for l in lines[:10])
|
||||
|
||||
# Criterion 3: patterns typical of section titles
|
||||
title_patterns: list[str] = [
|
||||
r'^Comment\s+.+\?', # "Comment fixer la croyance?"
|
||||
r'^Où\s+.+', # "Où la croyance s'oppose"
|
||||
r'^Les?\s+\w+\s+à\s+', # "Les dispositions à penser"
|
||||
r'^Que\s+.+\?', # "Que peut-on savoir?"
|
||||
r'^L[ae]\s+\w+\s+(de|du)\s+', # "La critique de l'intuition"
|
||||
r'^Entre\s+.+\s+et\s+', # "Entre nature et norme"
|
||||
]
|
||||
|
||||
# Count how many lines match the title patterns
|
||||
title_like_count: int = 0
|
||||
for line in lines[:10]:
|
||||
for pattern in title_patterns:
|
||||
if re.match(pattern, line, re.IGNORECASE):
|
||||
title_like_count += 1
|
||||
break
|
||||
|
||||
# Criterion 4: no conjugated verbs typical of narrative content
|
||||
narrative_verbs: list[str] = [
|
||||
r'\best\b', r'\bsont\b', r'\bétait\b', r'\bsera\b',
|
||||
r'\ba\b', r'\bont\b', r'\bavait\b', r'\bavaient\b',
|
||||
r'\bfait\b', r'\bdit\b', r'\bpense\b', r'\bexplique\b'
|
||||
]
|
||||
|
||||
has_narrative: bool = False
|
||||
for line in lines[:5]:
|
||||
for verb_pattern in narrative_verbs:
|
||||
if re.search(verb_pattern, line, re.IGNORECASE):
|
||||
has_narrative = True
|
||||
break
|
||||
if has_narrative:
|
||||
break
|
||||
|
||||
# Decision: this is a list of titles (TOC) if:
|
||||
# - at least 5 lines, short on average, the first 10 all < 100 chars, AND (many title patterns OR no narrative verbs)
|
||||
if len(lines) >= 5 and avg_len < 50 and all_short:
|
||||
if title_like_count >= len(lines) * 0.4 or not has_narrative:
|
||||
logger.debug(f"Section '{title}' exclue: ressemble à une TOC (lignes courtes, {title_like_count}/{len(lines)} titres)")
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def filter_indexable_sections(sections: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
"""Filter sections to keep only those that should be indexed.
|
||||
|
||||
Applies multiple exclusion criteria:
|
||||
1. Automatic exclusion by title pattern (TOC, index, etc.)
|
||||
2. Parent chapter exclusion (if parent is TOC)
|
||||
3. LLM classification (should_index flag)
|
||||
|
||||
Args:
|
||||
sections: List of classified section dictionaries.
|
||||
|
||||
Returns:
|
||||
Filtered list containing only indexable sections.
|
||||
|
||||
Example:
|
||||
>>> sections = [
|
||||
... {"title": "TOC", "should_index": False},
|
||||
... {"title": "Chapter 1", "should_index": True}
|
||||
... ]
|
||||
>>> filtered = filter_indexable_sections(sections)
|
||||
>>> len(filtered)
|
||||
1
|
||||
"""
|
||||
filtered: list[dict[str, Any]] = []
|
||||
excluded_count: int = 0
|
||||
|
||||
for s in sections:
|
||||
# Check automatic exclusion
|
||||
if is_excluded_section(s):
|
||||
logger.info(f"Section exclue automatiquement: '{s.get('title', 'Sans titre')}'")
|
||||
excluded_count += 1
|
||||
continue
|
||||
|
||||
# Check whether the parent chapter is a TOC
|
||||
chapter_title: str = s.get("chapterTitle", "").lower().strip()
|
||||
if any(excluded in chapter_title for excluded in EXCLUDED_SECTION_TITLES):
|
||||
logger.info(f"Section exclue (chapitre TOC): '{s.get('title', 'Sans titre')}' dans '{chapter_title}'")
|
||||
excluded_count += 1
|
||||
continue
|
||||
|
||||
# Check the LLM classification
|
||||
if s.get("should_index", True):
|
||||
filtered.append(s)
|
||||
else:
|
||||
excluded_count += 1
|
||||
|
||||
if excluded_count > 0:
|
||||
logger.info(f"Sections exclues: {excluded_count}, indexables: {len(filtered)}")
|
||||
|
||||
return filtered
|
||||
|
||||
|
||||
def validate_classified_sections(sections: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
"""Post-classification validation to detect false positives.
|
||||
|
||||
Performs additional checks on sections marked should_index=True to catch
|
||||
TOC fragments that escaped initial classification:
|
||||
1. Parent chapter is TOC -> exclude
|
||||
2. Content is mostly short title-like lines -> reclassify as toc_display
|
||||
|
||||
Args:
|
||||
sections: List of already-classified section dictionaries.
|
||||
|
||||
Returns:
|
||||
Validated sections with corrections applied. Corrections are logged
|
||||
and stored in 'validation_correction' field.
|
||||
|
||||
Example:
|
||||
>>> sections = [{"title": "Part 1", "should_index": True, "content": "..."}]
|
||||
>>> validated = validate_classified_sections(sections)
|
||||
>>> # May reclassify sections with TOC-like content
|
||||
"""
|
||||
validated: list[dict[str, Any]] = []
|
||||
fixed_count: int = 0
|
||||
|
||||
for section in sections:
|
||||
# First check whether the parent chapter title is a TOC
|
||||
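chapter_title: str = (section.get("chapterTitle") or section.get("chapter_title") or "").lower().strip()  # accept both key spellings; the other functions in this module read "chapterTitle"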
|
||||
section_title: str = section.get("title", "").lower().strip()
|
||||
|
||||
# Exclude if the parent chapter is a TOC
|
||||
is_toc_chapter: bool = False
|
||||
for excluded in EXCLUDED_SECTION_TITLES:
|
||||
if excluded in chapter_title:
|
||||
logger.warning(f"Section '{section.get('title', 'Sans titre')}' exclue: chapitre parent est '{chapter_title}'")
|
||||
section["should_index"] = False
|
||||
section["type"] = "toc_display"
|
||||
section["validation_correction"] = f"Exclue car chapitre parent = {chapter_title}"
|
||||
fixed_count += 1
|
||||
is_toc_chapter = True
|
||||
break
|
||||
|
||||
if is_toc_chapter:
|
||||
validated.append(section)
|
||||
continue
|
||||
|
||||
# Already marked non-indexable: keep as is
|
||||
if not section.get("should_index", True):
|
||||
validated.append(section)
|
||||
continue
|
||||
|
||||
content: str = section.get("content", "")
|
||||
|
||||
# Additional validation on the content
|
||||
if content:
|
||||
lines: list[str] = [l.strip() for l in content.split("\n") if l.strip()]
|
||||
|
||||
# Very few lines: probably not a problem
|
||||
if len(lines) < 3:
|
||||
validated.append(section)
|
||||
continue
|
||||
|
||||
# Count the lines that look like titles
|
||||
title_question_pattern: str = r'^(Comment|Où|Que|Quelle|Quel|Les?\s+\w+\s+(de|du|à)|Entre\s+.+\s+et)\s+'
|
||||
title_like: int = sum(1 for l in lines if re.match(title_question_pattern, l, re.IGNORECASE))
|
||||
|
||||
# If at least 50% of the lines look like titles AND lines are short (average < 55 chars)
|
||||
avg_len: float = sum(len(l) for l in lines) / len(lines)
|
||||
|
||||
if len(lines) >= 4 and title_like >= len(lines) * 0.5 and avg_len < 55:
|
||||
# Probably a list of titles extracted from the TOC
|
||||
logger.warning(f"Section '{section.get('title', 'Sans titre')}' reclassée: détectée comme liste de titres TOC")
|
||||
section["should_index"] = False
|
||||
section["type"] = "toc_display"
|
||||
section["validation_correction"] = "Reclassée comme toc_display (liste de titres)"
|
||||
fixed_count += 1
|
||||
validated.append(section)
|
||||
continue
|
||||
|
||||
validated.append(section)
|
||||
|
||||
if fixed_count > 0:
|
||||
logger.info(f"Validation post-classification: {fixed_count} section(s) reclassée(s)")
|
||||
|
||||
return validated
|
||||
|
||||
|
||||
def get_chapter_sections(sections: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
"""Filter sections to return only chapter-type content.
|
||||
|
||||
Returns sections with types that contain main document content:
|
||||
chapter, introduction, conclusion, abstract, preface.
|
||||
|
||||
Args:
|
||||
sections: List of classified section dictionaries.
|
||||
|
||||
Returns:
|
||||
Filtered list containing only chapter-type sections.
|
||||
|
||||
Example:
|
||||
>>> sections = [
|
||||
... {"title": "TOC", "type": "toc_display"},
|
||||
... {"title": "Chapter 1", "type": "chapter"}
|
||||
... ]
|
||||
>>> chapters = get_chapter_sections(sections)
|
||||
>>> len(chapters)
|
||||
1
|
||||
"""
|
||||
chapter_types: set[str] = {"chapter", "introduction", "conclusion", "abstract", "preface"}
|
||||
return [s for s in sections if s.get("type") in chapter_types]
|
||||
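The title checks in `is_excluded_section` and `filter_indexable_sections` use plain substring tests (`excluded in title`), so a short entry such as "toc" or "index" can also match inside a longer word. The sketch below shows one possible alternative, whole-word matching with a regular expression. The names `EXAMPLE_EXCLUDED_TITLES`, `compile_title_patterns`, and `matches_excluded_title` are hypothetical and are not part of this commit; the list is a shortened stand-in for `EXCLUDED_SECTION_TITLES`.

```python
import re
from typing import Iterable, List, Pattern

# Shortened, illustrative stand-in for EXCLUDED_SECTION_TITLES.
EXAMPLE_EXCLUDED_TITLES: List[str] = [
    "table des matières",
    "sommaire",
    "contents",
    "toc",
    "index",
]


def compile_title_patterns(titles: Iterable[str]) -> List[Pattern[str]]:
    """Compile one case-insensitive, whole-word pattern per excluded title."""
    # (?<!\w) and (?!\w) require that the title is not glued to other
    # word characters, so "toc" no longer matches inside "stochastic".
    return [
        re.compile(rf"(?<!\w){re.escape(title)}(?!\w)", re.IGNORECASE)
        for title in titles
    ]


def matches_excluded_title(title: str, patterns: List[Pattern[str]]) -> bool:
    """Return True if any excluded title appears as a whole word or phrase."""
    return any(pattern.search(title) for pattern in patterns)


if __name__ == "__main__":
    compiled = compile_title_patterns(EXAMPLE_EXCLUDED_TITLES)
    for candidate in ["Table des matières", "Stochastic processes", "Index"]:
        print(candidate, matches_excluded_title(candidate, compiled))
```

Whether stricter matching is desirable depends on the corpus: the substring test errs toward excluding too much, while whole-word matching errs toward indexing a few borderline sections.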
389
generations/library_rag/utils/llm_cleaner.py
Normal file
@@ -0,0 +1,389 @@
|
||||
"""Text cleaning and validation for OCR-extracted content.
|
||||
|
||||
This module provides utilities for cleaning OCR artifacts from extracted text,
|
||||
validating chunk content, and optionally using LLM for intelligent corrections.
|
||||
It handles common OCR issues like page markers, isolated page numbers,
|
||||
repeated headers/footers, and character recognition errors.
|
||||
|
||||
Overview:
|
||||
The module offers three levels of cleaning:
|
||||
|
||||
1. **Basic cleaning** (clean_page_markers, clean_ocr_artifacts):
|
||||
Fast regex-based cleaning for common issues. Always applied.
|
||||
|
||||
2. **LLM-enhanced cleaning** (clean_content_with_llm):
|
||||
Uses an LLM to correct subtle OCR errors while preserving meaning.
|
||||
Only applied when explicitly requested and for medium-length texts.
|
||||
|
||||
3. **Validation** (is_chunk_valid):
|
||||
Checks if a text chunk contains meaningful content.
|
||||
|
||||
Cleaning Operations:
|
||||
- Remove page markers (<!-- Page X -->)
|
||||
- Remove isolated page numbers
|
||||
- Remove short/repetitive header/footer lines
|
||||
- Normalize multiple spaces and blank lines
|
||||
- Correct obvious OCR character errors (LLM mode)
|
||||
- Preserve citations, technical vocabulary, paragraph structure
|
||||
|
||||
Validation Criteria:
|
||||
- Minimum character count (default: 20)
|
||||
- Minimum word count (default: 5)
|
||||
- Not pure metadata (URLs, ISBNs, DOIs, copyright notices)
|
||||
|
||||
LLM Provider Support:
|
||||
- ollama: Local LLM (free, slower, default)
|
||||
- mistral: Mistral API (faster, requires API key)
|
||||
|
||||
Example:
|
||||
>>> from utils.llm_cleaner import clean_chunk, is_chunk_valid
|
||||
>>>
|
||||
>>> # Clean a chunk with basic cleaning only
|
||||
>>> text = "<!-- Page 42 --> Some philosophical content..."
|
||||
>>> cleaned = clean_chunk(text)
|
||||
>>> print(cleaned)
|
||||
Some philosophical content...
|
||||
>>>
|
||||
>>> # Validate chunk before processing
|
||||
>>> if is_chunk_valid(cleaned):
|
||||
... process_chunk(cleaned)
|
||||
|
||||
See Also:
|
||||
utils.llm_chunker: Semantic chunking of sections
|
||||
utils.llm_validator: Document validation and concept extraction
|
||||
utils.pdf_pipeline: Main pipeline orchestration
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import re
|
||||
from typing import List, Optional, Pattern
|
||||
|
||||
from .llm_structurer import call_llm, _get_default_model, _get_default_mistral_model
|
||||
from .types import LLMProvider
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
# Type alias for compiled regex patterns
|
||||
RegexPattern = Pattern[str]
|
||||
|
||||
|
||||
def clean_page_markers(text: str) -> str:
|
||||
r"""Remove page markers and normalize blank lines from text.
|
||||
|
||||
Page markers are HTML comments inserted during OCR processing to track
|
||||
page boundaries. This function removes them along with excessive blank
|
||||
lines that may result from the removal.
|
||||
|
||||
Args:
|
||||
text: Text content potentially containing page markers like
|
||||
'<!-- Page 42 -->' and multiple consecutive newlines.
|
||||
|
||||
Returns:
|
||||
Cleaned text with page markers removed and no more than two
|
||||
consecutive newlines. Text is stripped of leading/trailing whitespace.
|
||||
|
||||
Example:
|
||||
>>> text = "<!-- Page 1 -->\nContent here\n\n\n\n<!-- Page 2 -->"
|
||||
>>> clean_page_markers(text)
|
||||
'Content here'
|
||||
"""
|
||||
# Remove <!-- Page X --> markers
|
||||
text = re.sub(r'<!--\s*Page\s*\d+\s*-->', '', text)
|
||||
|
||||
# Collapse multiple blank lines
|
||||
text = re.sub(r'\n{3,}', '\n\n', text)
|
||||
|
||||
return text.strip()
|
||||
|
||||
|
||||
def clean_ocr_artifacts(text: str) -> str:
|
||||
r"""Remove common OCR artifacts without using LLM.
|
||||
|
||||
This function performs fast, rule-based cleaning of typical OCR issues:
|
||||
- Isolated page numbers (1-4 digits on their own line)
|
||||
- Very short lines likely to be headers/footers (<=3 chars)
|
||||
- Multiple consecutive spaces
|
||||
- Excessive blank lines (>2)
|
||||
|
||||
Lines starting with '#' (markdown headers) are preserved regardless
|
||||
of length. Empty lines are preserved (single blank lines only).
|
||||
|
||||
Args:
|
||||
text: Raw OCR-extracted text potentially containing artifacts
|
||||
like isolated page numbers, repeated headers, and irregular spacing.
|
||||
|
||||
Returns:
|
||||
Cleaned text with artifacts removed and spacing normalized.
|
||||
Leading/trailing whitespace is stripped.
|
||||
|
||||
Example:
|
||||
>>> text = "42\n\nActual content here\n\n\n\n\nMore text"
|
||||
>>> clean_ocr_artifacts(text)
|
||||
'Actual content here\n\nMore text'
|
||||
|
||||
Note:
|
||||
This function is always called as part of clean_chunk() and provides
|
||||
a baseline level of cleaning even when LLM cleaning is disabled.
|
||||
"""
|
||||
# Remove isolated page numbers
|
||||
text = re.sub(r'^\d{1,4}\s*$', '', text, flags=re.MULTILINE)
|
||||
|
||||
# Remove repeated headers/footers (very short isolated lines)
|
||||
lines: List[str] = text.split('\n')
|
||||
cleaned_lines: List[str] = []
|
||||
for line in lines:
|
||||
# Keep non-empty, meaningful lines
|
||||
stripped: str = line.strip()
|
||||
if stripped and (len(stripped) > 3 or stripped.startswith('#')):
|
||||
cleaned_lines.append(line)
|
||||
elif not stripped:
|
||||
cleaned_lines.append('')  # Preserve single blank lines
|
||||
|
||||
text = '\n'.join(cleaned_lines)
|
||||
|
||||
# Normalize spaces
|
||||
text = re.sub(r' {2,}', ' ', text)
|
||||
|
||||
# Collapse multiple blank lines
|
||||
text = re.sub(r'\n{3,}', '\n\n', text)
|
||||
|
||||
return text.strip()
|
||||
|
||||
|
||||
def clean_content_with_llm(
|
||||
text: str,
|
||||
context: Optional[str] = None,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
temperature: float = 0.1,
|
||||
) -> str:
|
||||
"""Clean text content using an LLM for intelligent OCR error correction.
|
||||
|
||||
Uses a language model to correct subtle OCR errors that rule-based
|
||||
cleaning cannot handle, such as misrecognized characters in context.
|
||||
The LLM is instructed to preserve the intellectual content exactly
|
||||
while fixing obvious technical errors.
|
||||
|
||||
The function includes safeguards:
|
||||
- Texts < 50 chars (after stripping): page markers removed only (LLM skipped)
|
||||
- Texts > 3000 chars: Only basic cleaning (timeout risk)
|
||||
- If the LLM output is shorter than 50% or longer than 150% of the input length: fall back to basic cleaning
|
||||
|
||||
Args:
|
||||
text: Text content to clean. Should be between 50-3000 characters
|
||||
for LLM processing.
|
||||
context: Optional context about the document (title, subject) to
|
||||
help the LLM make better corrections. Example: "Heidegger's
|
||||
Being and Time, Chapter 2".
|
||||
model: LLM model name. If None, uses provider default
|
||||
(qwen2.5:7b for ollama, mistral-small-latest for mistral).
|
||||
provider: LLM provider to use. Options: "ollama" (local, free)
|
||||
or "mistral" (API, faster).
|
||||
temperature: LLM temperature for response generation. Lower values
|
||||
(0.1) produce more deterministic corrections. Defaults to 0.1.
|
||||
|
||||
Returns:
|
||||
Cleaned text with OCR errors corrected. If LLM fails or produces
|
||||
suspicious output (too short/long), returns basic-cleaned text.
|
||||
|
||||
Raises:
|
||||
No exceptions raised - all errors caught and handled with fallback.
|
||||
|
||||
Example:
|
||||
>>> text = "Heidegger's concept of Dase1n is central..." # '1' should be 'i'
|
||||
>>> clean_content_with_llm(text, context="Being and Time")
|
||||
"Heidegger's concept of Dasein is central..."
|
||||
|
||||
Note:
|
||||
The LLM is explicitly instructed NOT to:
|
||||
- Modify meaning or intellectual content
|
||||
- Rephrase or summarize
|
||||
- Add any new content
|
||||
- Alter citations or technical vocabulary
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Do not process texts that are too short
|
||||
if len(text.strip()) < 50:
|
||||
return clean_page_markers(text)
|
||||
|
||||
# Cap the size to avoid timeouts
|
||||
max_chars: int = 3000
|
||||
if len(text) > max_chars:
|
||||
# For long texts, clean without the LLM
|
||||
return clean_page_markers(clean_ocr_artifacts(text))
|
||||
|
||||
context_info: str = f"Contexte: {context}\n" if context else ""
|
||||
|
||||
prompt: str = f"""Tu es un expert en correction de textes OCRisés.
|
||||
|
||||
TÂCHE: Nettoyer ce texte extrait par OCR.
|
||||
|
||||
{context_info}
|
||||
ACTIONS À EFFECTUER:
|
||||
1. Supprimer les marqueurs de page (<!-- Page X -->)
|
||||
2. Corriger les erreurs OCR ÉVIDENTES (caractères mal reconnus)
|
||||
3. Supprimer les artefacts (numéros de page isolés, en-têtes répétés)
|
||||
4. Normaliser la ponctuation et les espaces
|
||||
|
||||
RÈGLES STRICTES:
|
||||
- NE PAS modifier le sens ou le contenu intellectuel
|
||||
- NE PAS reformuler ou résumer
|
||||
- NE PAS ajouter de contenu
|
||||
- Préserver les citations et le vocabulaire technique
|
||||
- Garder la structure des paragraphes
|
||||
|
||||
TEXTE À NETTOYER:
|
||||
{text}
|
||||
|
||||
RÉPONDS UNIQUEMENT avec le texte nettoyé, sans commentaires ni balises."""
|
||||
|
||||
try:
|
||||
response: str = call_llm(
|
||||
prompt, model=model, provider=provider, temperature=temperature, timeout=120
|
||||
)
|
||||
|
||||
# Check that the response is valid
|
||||
cleaned: str = response.strip()
|
||||
|
||||
# If the response length differs too much (the LLM changed too much), fall back to basic cleaning of the original
|
||||
if len(cleaned) < len(text) * 0.5 or len(cleaned) > len(text) * 1.5:
|
||||
logger.warning("LLM a trop modifié le texte, utilisation du nettoyage basique")
|
||||
return clean_page_markers(clean_ocr_artifacts(text))
|
||||
|
||||
return cleaned
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Erreur nettoyage LLM: {e}, utilisation du nettoyage basique")
|
||||
return clean_page_markers(clean_ocr_artifacts(text))
|
||||
|
||||
|
||||
def clean_chunk(
|
||||
chunk_text: str,
|
||||
use_llm: bool = False,
|
||||
context: Optional[str] = None,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
) -> str:
|
||||
r"""Clean a text chunk with optional LLM enhancement.
|
||||
|
||||
This is the main entry point for chunk cleaning. It always applies
|
||||
basic cleaning (page markers, OCR artifacts) and optionally uses
|
||||
LLM for more intelligent error correction.
|
||||
|
||||
Cleaning pipeline:
|
||||
1. Remove page markers (always)
|
||||
2. Remove OCR artifacts (always)
|
||||
3. LLM correction (if use_llm=True and text >= 50 chars)
|
||||
|
||||
Args:
|
||||
chunk_text: Raw text content of the chunk to clean.
|
||||
use_llm: Whether to use LLM for enhanced cleaning. Defaults to
|
||||
False. Set to True for higher quality but slower processing.
|
||||
context: Optional document context (title, chapter) passed to LLM
|
||||
for better corrections. Ignored if use_llm=False.
|
||||
model: LLM model name. If None, uses provider default.
|
||||
Ignored if use_llm=False.
|
||||
provider: LLM provider ("ollama" or "mistral"). Defaults to
|
||||
"ollama". Ignored if use_llm=False.
|
||||
|
||||
Returns:
|
||||
Cleaned chunk text ready for indexing or further processing.
|
||||
|
||||
Example:
|
||||
>>> # Basic cleaning only (fast)
|
||||
>>> chunk = "<!-- Page 5 -->\n42\n\nThe concept of being..."
|
||||
>>> clean_chunk(chunk)
|
||||
'The concept of being...'
|
||||
>>>
|
||||
>>> # With LLM enhancement (slower, higher quality)
|
||||
>>> clean_chunk(chunk, use_llm=True, context="Heidegger analysis")
|
||||
'The concept of being...'
|
||||
|
||||
See Also:
|
||||
is_chunk_valid: Validate cleaned chunks before processing
|
||||
clean_page_markers: Basic page marker removal
|
||||
clean_ocr_artifacts: Basic artifact removal
|
||||
"""
|
||||
# Basic cleaning is always applied
|
||||
text: str = clean_page_markers(chunk_text)
|
||||
text = clean_ocr_artifacts(text)
|
||||
|
||||
# Optional LLM cleaning
|
||||
if use_llm and len(text) >= 50:
|
||||
text = clean_content_with_llm(text, context=context, model=model, provider=provider)
|
||||
|
||||
return text
|
||||
|
||||
|
||||
def is_chunk_valid(chunk_text: str, min_chars: int = 20, min_words: int = 5) -> bool:
|
||||
"""Check if a text chunk contains meaningful content.
|
||||
|
||||
Validates that a chunk has sufficient length and is not purely
|
||||
metadata or boilerplate content. Used to filter out non-content
|
||||
chunks before indexing.
|
||||
|
||||
Validation criteria:
|
||||
1. Character count >= min_chars (after page marker removal)
|
||||
2. Word count >= min_words
|
||||
3. Not matching metadata patterns (URLs, ISBNs, DOIs, dates, copyright)
|
||||
|
||||
Args:
|
||||
chunk_text: Text content of the chunk to validate. Page markers
|
||||
are removed before validation.
|
||||
min_chars: Minimum number of characters required. Defaults to 20.
|
||||
Chunks shorter than this are considered invalid.
|
||||
min_words: Minimum number of words required. Defaults to 5.
|
||||
Chunks with fewer words are considered invalid.
|
||||
|
||||
Returns:
|
||||
True if the chunk passes all validation criteria and contains
|
||||
meaningful content suitable for indexing. False otherwise.
|
||||
|
||||
Example:
|
||||
>>> is_chunk_valid("The concept of Dasein is central to Heidegger.")
|
||||
True
|
||||
>>> is_chunk_valid("42") # Too short
|
||||
False
|
||||
>>> is_chunk_valid("ISBN 978-0-123456-78-9") # Metadata
|
||||
False
|
||||
>>> is_chunk_valid("https://example.com/page") # URL
|
||||
False
|
||||
|
||||
Note:
|
||||
Metadata patterns checked:
|
||||
- URLs (http://, https://)
|
||||
- Dates (YYYY-MM-DD format)
|
||||
- ISBN numbers
|
||||
- DOI identifiers
|
||||
- Copyright notices (©)
|
||||
"""
|
||||
text: str = clean_page_markers(chunk_text).strip()
|
||||
|
||||
# Check the length
|
||||
if len(text) < min_chars:
|
||||
return False
|
||||
|
||||
# Count the words
|
||||
words: List[str] = text.split()
|
||||
if len(words) < min_words:
|
||||
return False
|
||||
|
||||
# Reject chunks that are only metadata
|
||||
metadata_patterns: List[str] = [
|
||||
r'^https?://',
|
||||
r'^\d{4}-\d{2}-\d{2}$',
|
||||
r'^ISBN',
|
||||
r'^DOI',
|
||||
r'^©',
|
||||
]
|
||||
pattern: str
|
||||
for pattern in metadata_patterns:
|
||||
if re.match(pattern, text, re.IGNORECASE):
|
||||
return False
|
||||
|
||||
return True
|
||||
|
||||
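The length-ratio safeguard used in `clean_content_with_llm` above can be sketched in isolation. This is a minimal illustration, not the module's actual code: `guarded_clean`, `llm_clean`, and `basic_clean` are hypothetical names, and the 0.5 / 1.5 bounds simply mirror the thresholds shown in the function.

```python
from typing import Callable


def guarded_clean(
    text: str,
    llm_clean: Callable[[str], str],
    basic_clean: Callable[[str], str],
    min_ratio: float = 0.5,
    max_ratio: float = 1.5,
) -> str:
    """Return the LLM-cleaned text only if its length stays close to the input."""
    try:
        cleaned = llm_clean(text).strip()
    except Exception:
        # Any LLM failure falls back to the rule-based cleaner.
        return basic_clean(text)

    lower, upper = len(text) * min_ratio, len(text) * max_ratio
    if not (lower <= len(cleaned) <= upper):
        # The model shortened or expanded the text too much: keep the safe version.
        return basic_clean(text)
    return cleaned
```

Passing the two cleaners as callables keeps the guard testable without a running LLM; in the module itself the same roles are played by `call_llm` and `clean_page_markers(clean_ocr_artifacts(...))`.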
294
generations/library_rag/utils/llm_metadata.py
Normal file
@@ -0,0 +1,294 @@
|
||||
r"""LLM-based bibliographic metadata extraction from documents.
|
||||
|
||||
This module extracts bibliographic metadata (title, author, publisher, year, etc.)
|
||||
from document text using Large Language Models. It supports both local (Ollama)
|
||||
and cloud-based (Mistral API) LLM providers.
|
||||
|
||||
The extraction process:
|
||||
1. Takes the first N characters of the document markdown (typically first pages)
|
||||
2. Sends a structured prompt to the LLM requesting JSON-formatted metadata
|
||||
3. Parses the LLM response to extract the JSON data
|
||||
4. Applies default values and cleanup for missing/invalid fields
|
||||
|
||||
Supported metadata fields:
|
||||
- title: Document title (including subtitle if present)
|
||||
- author: Primary author name
|
||||
- collection: Series or collection name
|
||||
- publisher: Publisher name
|
||||
- year: Publication year
|
||||
- doi: Digital Object Identifier
|
||||
- isbn: ISBN number
|
||||
- language: ISO 639-1 language code (default: "fr")
|
||||
- confidence: Dict of confidence scores per field (0.0-1.0)
|
||||
|
||||
LLM Provider Differences:
|
||||
- **Ollama** (local): Free, slower, requires local installation.
|
||||
Uses models like "mistral", "llama2", "mixtral".
|
||||
- **Mistral API** (cloud): Fast, paid (~0.002€/call for small prompts).
|
||||
Uses models like "mistral-small-latest", "mistral-medium-latest".
|
||||
|
||||
Cost Implications:
|
||||
- Ollama: No API cost, only local compute resources
|
||||
- Mistral API: ~0.002€ per metadata extraction call (small prompt)
|
||||
|
||||
Example:
|
||||
>>> from utils.llm_metadata import extract_metadata
|
||||
>>>
|
||||
>>> markdown = '''
|
||||
... # La technique et le temps
|
||||
... ## Tome 1 : La faute d'Épiméthée
|
||||
...
|
||||
... Bernard Stiegler
|
||||
...
|
||||
... Éditions Galilée, 1994
|
||||
... '''
|
||||
>>>
|
||||
>>> metadata = extract_metadata(markdown, provider="ollama")
|
||||
>>> print(metadata)
|
||||
{
|
||||
'title': 'La technique et le temps. Tome 1 : La faute d\'Épiméthée',
|
||||
'author': 'Bernard Stiegler',
|
||||
'publisher': 'Éditions Galilée',
|
||||
'year': 1994,
|
||||
'language': 'fr',
|
||||
'confidence': {'title': 0.95, 'author': 0.98}
|
||||
}
|
||||
|
||||
See Also:
|
||||
- llm_toc: Table of contents extraction via LLM
|
||||
- llm_structurer: Core LLM call infrastructure
|
||||
- pdf_pipeline: Orchestration using this module (Step 4)
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from typing import Any, Dict, Optional
|
||||
|
||||
from .llm_structurer import (
|
||||
_clean_json_string,
|
||||
_get_default_mistral_model,
|
||||
_get_default_model,
|
||||
call_llm,
|
||||
)
|
||||
from .types import LLMProvider
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _extract_json_from_response(text: str) -> Dict[str, Any]:
|
||||
"""Extract JSON data from an LLM response string.
|
||||
|
||||
Attempts to parse JSON from the LLM response using two strategies:
|
||||
1. First, looks for JSON enclosed in <JSON></JSON> tags (preferred format)
|
||||
2. Falls back to finding the first {...} block in the response
|
||||
|
||||
The function applies JSON string cleaning to handle common LLM quirks
|
||||
like trailing commas, unescaped quotes, etc.
|
||||
|
||||
Args:
|
||||
text: Raw LLM response text that may contain JSON data.
|
||||
|
||||
Returns:
|
||||
Parsed JSON as a dictionary. Returns empty dict if no valid
|
||||
JSON could be extracted.
|
||||
|
||||
Example:
|
||||
>>> response = '<JSON>{"title": "Test", "author": "Smith"}</JSON>'
|
||||
>>> _extract_json_from_response(response)
|
||||
{'title': 'Test', 'author': 'Smith'}
|
||||
|
||||
>>> response = 'Here is the metadata: {"title": "Test"}'
|
||||
>>> _extract_json_from_response(response)
|
||||
{'title': 'Test'}
|
||||
"""
|
||||
# Look between <JSON> and </JSON> tags
|
||||
json_match: Optional[re.Match[str]] = re.search(r'<JSON>\s*(.*?)\s*</JSON>', text, re.DOTALL)
|
||||
if json_match:
|
||||
json_str: str = _clean_json_string(json_match.group(1))
|
||||
try:
|
||||
result: Dict[str, Any] = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
# Fallback: take the span from the first '{' to the last '}'
|
||||
start: int = text.find("{")
|
||||
end: int = text.rfind("}")
|
||||
if start != -1 and end > start:
|
||||
json_str = _clean_json_string(text[start:end + 1])
|
||||
try:
|
||||
result = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"JSON invalide: {e}")
|
||||
|
||||
return {}
|
||||
|
||||
|
||||
def extract_metadata(
|
||||
markdown: str,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
temperature: float = 0.1,
|
||||
max_chars: int = 6000,
|
||||
) -> Dict[str, Any]:
|
||||
"""Extract bibliographic metadata from a document using an LLM.
|
||||
|
||||
Analyzes the beginning of a document (typically first few pages) to extract
|
||||
bibliographic metadata including title, author, publisher, year, and more.
|
||||
Uses a structured prompt that guides the LLM to distinguish between
|
||||
document title vs. collection name vs. publisher name.
|
||||
|
||||
The LLM is instructed to return confidence scores for extracted fields,
|
||||
allowing downstream processing to handle uncertain extractions appropriately.
|
||||
|
||||
Args:
|
||||
markdown: Document text in Markdown format. For best results, provide
|
||||
at least the first 2-3 pages containing title page and colophon.
|
||||
model: LLM model name to use. If None, uses the default model for the
|
||||
selected provider (e.g., "qwen2.5:7b" for Ollama, "mistral-small-latest"
|
||||
for Mistral API).
|
||||
provider: LLM provider to use. Options are:
|
||||
- "ollama": Local LLM (free, slower, requires Ollama installation)
|
||||
- "mistral": Mistral API (fast, paid, requires API key)
|
||||
temperature: Model temperature for generation. Lower values (0.0-0.3)
|
||||
produce more consistent, deterministic results. Default 0.1.
|
||||
max_chars: Maximum number of characters to send to the LLM. Longer
|
||||
documents are truncated. Default 6000 (~2 pages).
|
||||
|
||||
Returns:
|
||||
Dictionary containing extracted metadata with the following keys:
|
||||
- title (str | None): Document title with subtitle if present
|
||||
- author (str | None): Primary author name
|
||||
- collection (str | None): Series or collection name
|
||||
- publisher (str | None): Publisher name
|
||||
- year (int | None): Publication year
|
||||
- doi (str | None): Digital Object Identifier
|
||||
- isbn (str | None): ISBN number
|
||||
- language (str): ISO 639-1 language code (default "fr")
|
||||
- confidence (dict): Confidence scores per field (0.0-1.0)
|
||||
- error (str): Error message if extraction failed (only on error)
|
||||
|
||||
Raises:
|
||||
No exceptions are raised; errors are captured in the return dict.
|
||||
|
||||
Note:
|
||||
- Cost for Mistral API: ~0.002€ per call (6000 chars input)
|
||||
- Ollama is free but requires local GPU/CPU resources
|
||||
- The prompt is in French as most processed documents are French texts
|
||||
- Low temperature (0.1) is used for consistent metadata extraction
|
||||
|
||||
Example:
|
||||
>>> # Extract from first pages of a philosophy book
|
||||
>>> markdown = Path("output/stiegler/stiegler.md").read_text()[:6000]
|
||||
>>> metadata = extract_metadata(markdown, provider="ollama")
|
||||
>>> print(f"Title: {metadata['title']}")
|
||||
Title: La technique et le temps
|
||||
|
||||
>>> # Using Mistral API for faster extraction
|
||||
>>> metadata = extract_metadata(markdown, provider="mistral")
|
||||
>>> print(f"Author: {metadata['author']} (confidence: {metadata['confidence'].get('author', 'N/A')})")
|
||||
Author: Bernard Stiegler (confidence: 0.98)
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Take the first pages (metadata usually appears at the beginning)
|
||||
content: str = markdown[:max_chars]
|
||||
if len(markdown) > max_chars:
|
||||
content += "\n\n[... document tronqué ...]"
|
||||
|
||||
prompt: str = f"""Tu es un expert en bibliographie et édition scientifique.
|
||||
|
||||
TÂCHE: Extraire les métadonnées bibliographiques de ce document.
|
||||
|
||||
ATTENTION - PIÈGES COURANTS:
|
||||
- Le titre n'est PAS forcément le premier titre H1 (peut être le nom de la collection)
|
||||
- Le sous-titre fait partie du titre
|
||||
- L'auteur peut apparaître sous le titre, dans les métadonnées éditeur, ou ailleurs
|
||||
- Distingue bien: titre de l'œuvre ≠ nom de la collection/série ≠ nom de l'éditeur
|
||||
|
||||
INDICES POUR TROUVER LE VRAI TITRE:
|
||||
- Souvent en plus grand / plus visible
|
||||
- Accompagné du nom de l'auteur juste après
|
||||
- Répété sur la page de garde et la page de titre
|
||||
- Peut contenir un sous-titre après ":"
|
||||
|
||||
IMPORTANT - FORMAT DES DONNÉES:
|
||||
- N'ajoute JAMAIS d'annotations comme "(correct)", "(à confirmer)", "(possiblement)", etc.
|
||||
- Retourne uniquement les noms propres et titres sans commentaires
|
||||
- NE METS PAS de phrases comme "À confirmer avec...", "Vérifier si...", "Possiblement..."
|
||||
- Le champ "confidence" sert à exprimer ton niveau de certitude
|
||||
- Si tu n'es pas sûr du titre, mets le titre le plus probable ET un confidence faible
|
||||
- EXEMPLE CORRECT: "title": "La pensée-signe" avec "confidence": {{"title": 0.6}}
|
||||
- EXEMPLE INCORRECT: "title": "À confirmer avec le titre exact"
|
||||
|
||||
RÉPONDS UNIQUEMENT avec un JSON entre balises <JSON></JSON>:
|
||||
|
||||
<JSON>
|
||||
{{
|
||||
"title": "Le vrai titre de l'œuvre (avec sous-titre si présent)",
|
||||
"author": "Prénom Nom de l'auteur principal",
|
||||
"collection": "Nom de la collection ou série (null si absent)",
|
||||
"publisher": "Nom de l'éditeur",
|
||||
"year": 2023,
|
||||
"doi": "10.xxxx/xxxxx (null si absent)",
|
||||
"isbn": "978-x-xxxx-xxxx-x (null si absent)",
|
||||
"language": "fr",
|
||||
"confidence": {{
|
||||
"title": 0.95,
|
||||
"author": 0.90
|
||||
}}
|
||||
}}
|
||||
</JSON>
|
||||
|
||||
DOCUMENT À ANALYSER:
|
||||
{content}
|
||||
|
||||
Réponds UNIQUEMENT avec le JSON."""
|
||||
|
||||
logger.info(f"Extraction métadonnées via {provider.upper()} ({model})")
|
||||
|
||||
try:
|
||||
response: str = call_llm(prompt, model=model, provider=provider, temperature=temperature)
|
||||
metadata: Dict[str, Any] = _extract_json_from_response(response)
|
||||
|
||||
# Default values for fields that were not found
|
||||
defaults: Dict[str, Optional[str]] = {
|
||||
"title": None,
|
||||
"author": None,
|
||||
"collection": None,
|
||||
"publisher": None,
|
||||
"year": None,
|
||||
"doi": None,
|
||||
"isbn": None,
|
||||
"language": "fr",
|
||||
}
|
||||
|
||||
for key, default in defaults.items():
|
||||
if key not in metadata or metadata[key] == "":
|
||||
metadata[key] = default
|
||||
|
||||
# Convert "null" / "None" strings to a real None
|
||||
for key in metadata:
|
||||
if metadata[key] == "null" or metadata[key] == "None":
|
||||
metadata[key] = None
|
||||
|
||||
logger.info(f"Métadonnées extraites: titre='{metadata.get('title')}', auteur='{metadata.get('author')}'")
|
||||
return metadata
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Erreur extraction métadonnées: {e}")
|
||||
return {
|
||||
"title": None,
|
||||
"author": None,
|
||||
"collection": None,
|
||||
"publisher": None,
|
||||
"year": None,
|
||||
"doi": None,
|
||||
"isbn": None,
|
||||
"language": "fr",
|
||||
"confidence": {},
|
||||
"error": str(e),
|
||||
}
|
||||
|
||||
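The two-step parsing strategy of `_extract_json_from_response` above (tagged block first, then the widest `{...}` span) can be sketched as a standalone function. This is an illustrative simplification: `parse_llm_json` is a hypothetical name, and the control-character cleanup done by `_clean_json_string` is deliberately left out.

```python
import json
import re
from typing import Any, Dict, List


def parse_llm_json(text: str) -> Dict[str, Any]:
    """Parse JSON from an LLM reply: <JSON> block first, then first '{' to last '}'."""
    candidates: List[str] = []

    # Strategy 1: content enclosed in <JSON>...</JSON> tags.
    tagged = re.search(r"<JSON>\s*(.*?)\s*</JSON>", text, re.DOTALL)
    if tagged:
        candidates.append(tagged.group(1))

    # Strategy 2: everything between the first '{' and the last '}'.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    # Nothing parseable: callers receive an empty dict rather than an exception.
    return {}
```

Returning `{}` instead of raising matches the behaviour documented for the original helper, which lets `extract_metadata` fill in defaults afterwards.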
583
generations/library_rag/utils/llm_structurer.py
Normal file
@@ -0,0 +1,583 @@
|
||||
"""Document structuring via LLM (Ollama or Mistral API)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import re
|
||||
import time
|
||||
from typing import Any, Dict, List, Optional, TypedDict, Union, cast
|
||||
|
||||
import requests
|
||||
from dotenv import load_dotenv
|
||||
import threading
|
||||
|
||||
# Import type definitions from central types module
|
||||
from utils.types import LLMCostStats
|
||||
|
||||
# Load environment variables
|
||||
load_dotenv()
|
||||
|
||||
# Logger
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
if not logging.getLogger().hasHandlers():
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="[%(asctime)s] %(levelname)s %(message)s"
|
||||
)
|
||||
|
||||
|
||||
class LLMStructureError(RuntimeError):
|
||||
"""Raised when LLM-based structuring fails."""
|
||||
pass
|
||||
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# TypedDict Definitions
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
|
||||
class MistralPricingEntry(TypedDict):
|
||||
"""Mistral API pricing per million tokens."""
|
||||
input: float
|
||||
output: float
|
||||
|
||||
|
||||
class LLMHierarchyPath(TypedDict, total=False):
|
||||
"""Hierarchy path in structured output."""
|
||||
part: Optional[str]
|
||||
chapter: Optional[str]
|
||||
section: Optional[str]
|
||||
subsection: Optional[str]
|
||||
|
||||
|
||||
class LLMChunkOutput(TypedDict, total=False):
|
||||
"""Single chunk in LLM structured output."""
|
||||
chunk_id: str
|
||||
text: str
|
||||
hierarchy: LLMHierarchyPath
|
||||
type: str
|
||||
is_toc: bool
|
||||
|
||||
|
||||
class LLMDocumentSection(TypedDict, total=False):
|
||||
"""Document section in structured output."""
|
||||
path: LLMHierarchyPath
|
||||
type: str
|
||||
page_start: int
|
||||
page_end: int
|
||||
|
||||
|
||||
class LLMStructuredResult(TypedDict, total=False):
|
||||
"""Result from LLM document structuring."""
|
||||
document_structure: List[LLMDocumentSection]
|
||||
chunks: List[LLMChunkOutput]
|
||||
|
||||
|
||||
class OllamaResultContainer(TypedDict):
|
||||
"""Container for Ollama call result (internal use)."""
|
||||
response: Optional[str]
|
||||
error: Optional[Exception]
|
||||
done: bool
|
||||
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Configuration
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
|
||||
def _get_ollama_url() -> str:
|
||||
"""Return the Ollama base URL."""
|
||||
return os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
|
||||
|
||||
|
||||
def _get_default_model() -> str:
|
||||
"""Return the default LLM model."""
|
||||
return os.getenv("STRUCTURE_LLM_MODEL", "qwen2.5:7b")
|
||||
|
||||
|
||||
def _get_mistral_api_key() -> Optional[str]:
|
||||
"""Return the Mistral API key."""
|
||||
return os.getenv("MISTRAL_API_KEY")
|
||||
|
||||
|
||||
def _get_default_mistral_model() -> str:
|
||||
"""Return the default Mistral model for LLM tasks."""
|
||||
return os.getenv("MISTRAL_LLM_MODEL", "mistral-small-latest")
|
||||
|
||||
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Mistral API call (fast, cloud) with cost tracking
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
|
||||
# Mistral API prices per million tokens (€)
|
||||
MISTRAL_PRICING: Dict[str, MistralPricingEntry] = {
|
||||
"mistral-small-latest": {"input": 0.2, "output": 0.6},
|
||||
"mistral-medium-latest": {"input": 0.8, "output": 2.4},
|
||||
"mistral-large-latest": {"input": 2.0, "output": 6.0},
|
||||
# Fallback for other models
|
||||
"default": {"input": 0.5, "output": 1.5},
|
||||
}
|
||||
|
||||
# Cost accumulator (thread-local, so each thread tracks its own totals)
|
||||
_cost_tracker: threading.local = threading.local()
|
||||
|
||||
|
||||
def reset_llm_cost() -> None:
|
||||
"""Reset the LLM cost counter."""
|
||||
_cost_tracker.total_cost = 0.0
|
||||
_cost_tracker.total_input_tokens = 0
|
||||
_cost_tracker.total_output_tokens = 0
|
||||
_cost_tracker.calls_count = 0
|
||||
|
||||
|
||||
def get_llm_cost() -> LLMCostStats:
|
||||
"""Return the accumulated LLM cost statistics."""
|
||||
return {
|
||||
"total_cost": getattr(_cost_tracker, "total_cost", 0.0),
|
||||
"total_input_tokens": getattr(_cost_tracker, "total_input_tokens", 0),
|
||||
"total_output_tokens": getattr(_cost_tracker, "total_output_tokens", 0),
|
||||
"calls_count": getattr(_cost_tracker, "calls_count", 0),
|
||||
}
|
||||
|
||||
|
||||
def _calculate_mistral_cost(model: str, input_tokens: int, output_tokens: int) -> float:
|
||||
"""Compute the cost of a Mistral API call in euros."""
|
||||
pricing: MistralPricingEntry = MISTRAL_PRICING.get(model, MISTRAL_PRICING["default"])
|
||||
cost: float = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000
|
||||
return cost
|
||||
|
||||
|
||||
def _call_mistral_api(
|
||||
prompt: str,
|
||||
model: str = "mistral-small-latest",
|
||||
temperature: float = 0.2,
|
||||
max_tokens: int = 4096,
|
||||
timeout: int = 120,
|
||||
) -> str:
|
||||
"""Call the Mistral API to generate a response.
|
||||
|
||||
Available models (from fastest to most capable):
|
||||
- mistral-small-latest: fast, inexpensive (~0.2€/M input tokens)
|
||||
- mistral-medium-latest: balanced (~0.8€/M input tokens)
|
||||
- mistral-large-latest: most capable (~2€/M input tokens)
|
||||
|
||||
Args:
|
||||
prompt: The prompt to send.
|
||||
model: Mistral model name.
|
||||
temperature: Sampling temperature (0-1).
|
||||
max_tokens: Maximum number of tokens in the response.
|
||||
timeout: Timeout in seconds.
|
||||
|
||||
Returns:
|
||||
The LLM's text response.
|
||||
"""
|
||||
api_key: Optional[str] = _get_mistral_api_key()
|
||||
if not api_key:
|
||||
raise LLMStructureError("MISTRAL_API_KEY non définie dans .env")
|
||||
|
||||
logger.info(f"Appel Mistral API - modèle: {model}")
|
||||
|
||||
url: str = "https://api.mistral.ai/v1/chat/completions"
|
||||
headers: Dict[str, str] = {
|
||||
"Authorization": f"Bearer {api_key}",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
payload: Dict[str, Any] = {
|
||||
"model": model,
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
"temperature": temperature,
|
||||
"max_tokens": max_tokens,
|
||||
}
|
||||
|
||||
try:
|
||||
start: float = time.time()
|
||||
response: requests.Response = requests.post(url, headers=headers, json=payload, timeout=timeout)
|
||||
elapsed: float = time.time() - start
|
||||
|
||||
response.raise_for_status()
|
||||
data: Dict[str, Any] = response.json()
|
||||
|
||||
content: str = data.get("choices", [{}])[0].get("message", {}).get("content", "")
|
||||
usage: Dict[str, Any] = data.get("usage", {})
|
||||
|
||||
input_tokens: int = usage.get("prompt_tokens", 0)
|
||||
output_tokens: int = usage.get("completion_tokens", 0)
|
||||
|
||||
# Compute and accumulate the cost
|
||||
call_cost: float = _calculate_mistral_cost(model, input_tokens, output_tokens)
|
||||
|
||||
# Update the tracker
|
||||
if not hasattr(_cost_tracker, "total_cost"):
|
||||
reset_llm_cost()
|
||||
|
||||
_cost_tracker.total_cost += call_cost
|
||||
_cost_tracker.total_input_tokens += input_tokens
|
||||
_cost_tracker.total_output_tokens += output_tokens
|
||||
_cost_tracker.calls_count += 1
|
||||
|
||||
logger.info(f"Mistral API terminé en {elapsed:.1f}s - {input_tokens}+{output_tokens} tokens = {call_cost:.6f}€")
|
||||
|
||||
return content
|
||||
|
||||
except requests.exceptions.Timeout:
|
||||
raise LLMStructureError(f"Timeout Mistral API ({timeout}s)")
|
||||
except requests.exceptions.HTTPError as e:
|
||||
raise LLMStructureError(f"Erreur HTTP Mistral: {e}")
|
||||
except Exception as e:
|
||||
raise LLMStructureError(f"Erreur Mistral API: {e}")
|
||||
|
||||
|
||||
def _prepare_prompt(
|
||||
markdown: str,
|
||||
hierarchy: Dict[str, Any],
|
||||
max_chars: int = 8000,
|
||||
) -> str:
|
||||
"""Build the prompt for the LLM.
|
||||
|
||||
Args:
|
||||
markdown: Markdown text of the document.
|
||||
hierarchy: Initial hierarchical structure.
|
||||
max_chars: Maximum number of Markdown characters to include.
|
||||
|
||||
Returns:
|
||||
The formatted prompt for the LLM.
|
||||
"""
|
||||
# Truncate the Markdown if necessary
|
||||
truncated: str = markdown[:max_chars]
|
||||
if len(markdown) > max_chars:
|
||||
truncated += f"\n\n... [tronqué à {max_chars} caractères]"
|
||||
|
||||
# Serialize the hierarchy
|
||||
outline_json: str = json.dumps(hierarchy, ensure_ascii=False, indent=2)
|
||||
|
||||
prompt: str = f"""Tu es un expert en édition scientifique chargé d'analyser la structure logique d'un document.
|
||||
|
||||
IMPORTANT: Réponds UNIQUEMENT avec un objet JSON valide. Pas de texte avant ou après.
|
||||
|
||||
À partir du Markdown OCRisé et d'un premier découpage hiérarchique, tu dois :
|
||||
1. Identifier les parties liminaires (préface, introduction...), le corps du document (parties, chapitres, sections) et les parties finales (conclusion, annexes, bibliographie...).
|
||||
2. Reconstruire l'organisation réelle du texte.
|
||||
3. Produire un JSON avec :
|
||||
- "document_structure": vue hiérarchique du document
|
||||
- "chunks": liste des chunks avec chunk_id, text, hierarchy, type
|
||||
|
||||
FORMAT DE RÉPONSE (entre balises <JSON></JSON>):
|
||||
<JSON>
|
||||
{{
|
||||
"document_structure": [
|
||||
{{
|
||||
"path": {{"part": "Titre"}},
|
||||
"type": "main_content",
|
||||
"page_start": 1,
|
||||
"page_end": 10
|
||||
}}
|
||||
],
|
||||
"chunks": [
|
||||
{{
|
||||
"chunk_id": "chunk_00001",
|
||||
"text": "Contenu...",
|
||||
"hierarchy": {{
|
||||
"part": "Titre partie",
|
||||
"chapter": "Titre chapitre",
|
||||
"section": null,
|
||||
"subsection": null
|
||||
}},
|
||||
"type": "main_content",
|
||||
"is_toc": false
|
||||
}}
|
||||
]
|
||||
}}
|
||||
</JSON>
|
||||
|
||||
### Hiérarchie initiale
|
||||
{outline_json}
|
||||
|
||||
### Markdown OCR
|
||||
{truncated}
|
||||
|
||||
Réponds UNIQUEMENT avec le JSON entre <JSON> et </JSON>."""
|
||||
|
||||
return prompt.strip()
|
||||
|
||||
|
||||
def _call_ollama(
|
||||
prompt: str,
|
||||
model: str,
|
||||
base_url: Optional[str] = None,
|
||||
temperature: float = 0.2,
|
||||
timeout: int = 300,
|
||||
) -> str:
|
||||
"""Call Ollama to generate a response.
|
||||
|
||||
Args:
|
||||
prompt: The prompt to send.
|
||||
model: Ollama model name.
|
||||
base_url: Ollama base URL.
|
||||
temperature: Model temperature.
|
||||
timeout: Timeout in seconds.
|
||||
|
||||
Returns:
|
||||
The LLM's text response.
|
||||
|
||||
Raises:
|
||||
LLMStructureError: If the call fails.
|
||||
"""
|
||||
# Try the ollama SDK first
|
||||
try:
|
||||
import ollama
|
||||
|
||||
logger.info(f"Appel Ollama SDK - modèle: {model}, timeout: {timeout}s")
|
||||
|
||||
# Note: the ollama SDK does not directly support a timeout,
|
||||
# so the call runs in a daemon thread and thread.join(timeout) enforces the limit
|
||||
result_container: OllamaResultContainer = {"response": None, "error": None, "done": False}
|
||||
|
||||
def _run_ollama_call() -> None:
|
||||
try:
|
||||
resp: Any
|
||||
if hasattr(ollama, "generate"):
|
||||
resp = ollama.generate(
|
||||
model=model,
|
||||
prompt=prompt,
|
||||
stream=False,
|
||||
options={"temperature": temperature}
|
||||
)
|
||||
if isinstance(resp, dict):
|
||||
result_container["response"] = resp.get("response", json.dumps(resp))
|
||||
elif hasattr(resp, "response"):
|
||||
result_container["response"] = resp.response
|
||||
else:
|
||||
result_container["response"] = str(resp)
|
||||
else:
|
||||
# Fall back to the chat endpoint
|
||||
resp = ollama.chat(
|
||||
model=model,
|
||||
messages=[{"role": "user", "content": prompt}],
|
||||
options={"temperature": temperature}
|
||||
)
|
||||
if isinstance(resp, dict):
|
||||
result_container["response"] = resp.get("message", {}).get("content", str(resp))
|
||||
else:
|
||||
result_container["response"] = str(resp)
|
||||
result_container["done"] = True
|
||||
except Exception as e:
|
||||
result_container["error"] = e
|
||||
result_container["done"] = True
|
||||
|
||||
thread: threading.Thread = threading.Thread(target=_run_ollama_call, daemon=True)
|
||||
thread.start()
|
||||
thread.join(timeout=timeout)
|
||||
|
||||
if not result_container["done"]:
|
||||
raise LLMStructureError(f"Timeout Ollama SDK après {timeout}s (modèle: {model})")
|
||||
|
||||
if result_container["error"]:
|
||||
raise result_container["error"]
|
||||
|
||||
if result_container["response"]:
|
||||
return result_container["response"]
|
||||
|
||||
raise LLMStructureError("Aucune réponse du SDK Ollama")
|
||||
|
||||
except ImportError:
|
||||
logger.info("SDK ollama non disponible, utilisation de l'API HTTP")
|
||||
except Exception as e:
|
||||
logger.warning(f"Erreur SDK ollama: {e}, fallback HTTP")
|
||||
|
||||
# Fallback HTTP
|
||||
base: str = base_url or _get_ollama_url()
|
||||
url: str = f"{base.rstrip('/')}/api/generate"
|
||||
|
||||
payload: Dict[str, Any] = {
|
||||
"model": model,
|
||||
"prompt": prompt,
|
||||
"stream": False,
|
||||
"options": {"temperature": temperature},
|
||||
}
|
||||
|
||||
# Retry with exponential backoff
|
||||
max_retries: int = 2
|
||||
backoff: float = 1.0
|
||||
|
||||
for attempt in range(max_retries + 1):
|
||||
try:
|
||||
logger.info(f"Appel HTTP Ollama (tentative {attempt + 1})")
|
||||
response: requests.Response = requests.post(url, json=payload, timeout=timeout)
|
||||
|
||||
if response.status_code != 200:
|
||||
raise LLMStructureError(
|
||||
f"Erreur Ollama ({response.status_code}): {response.text}"
|
||||
)
|
||||
|
||||
data: Dict[str, Any] = response.json()
|
||||
if "response" not in data:
|
||||
raise LLMStructureError(f"Réponse Ollama inattendue: {data}")
|
||||
|
||||
return cast(str, data["response"])
|
||||
|
||||
except requests.RequestException as e:
|
||||
if attempt < max_retries:
|
||||
time.sleep(backoff)
|
||||
backoff *= 2
|
||||
continue
|
||||
raise LLMStructureError(f"Impossible de contacter Ollama: {e}") from e
|
||||
|
||||
raise LLMStructureError("Échec après plusieurs tentatives")
|
||||
|
||||
|
||||
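The timeout handling in `_call_ollama` above (run the blocking call in a daemon thread, wait with `thread.join(timeout)`, then inspect a shared result container) can be sketched generically. `run_with_timeout` and its `box` dictionary are hypothetical names for illustration; the sketch raises the built-in `TimeoutError` where the module raises `LLMStructureError`.

```python
import threading
from typing import Any, Callable, Dict


def run_with_timeout(func: Callable[[], Any], timeout: float) -> Any:
    """Run func in a daemon thread; raise TimeoutError if it outlives timeout."""
    box: Dict[str, Any] = {"value": None, "error": None, "done": False}

    def worker() -> None:
        try:
            box["value"] = func()
        except Exception as exc:
            # Store the exception so the caller's thread can re-raise it.
            box["error"] = exc
        finally:
            box["done"] = True

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout=timeout)

    if not box["done"]:
        # The worker is not cancelled; the caller merely stops waiting for it.
        raise TimeoutError(f"call did not finish within {timeout}s")
    if box["error"] is not None:
        raise box["error"]
    return box["value"]
```

One caveat of this pattern is that the abandoned worker keeps running until the underlying call returns; marking the thread as a daemon only ensures it does not block interpreter shutdown.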
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
# Generic LLM call function
|
||||
# ═══════════════════════════════════════════════════════════════════════════════
|
||||
|
||||
def call_llm(
|
||||
prompt: str,
|
||||
model: Optional[str] = None,
|
||||
provider: str = "ollama", # "ollama" or "mistral"
|
||||
temperature: float = 0.2,
|
||||
timeout: int = 300,
|
||||
) -> str:
|
||||
"""Call an LLM (local Ollama or the Mistral API).
|
||||
|
||||
Args:
|
||||
prompt: The prompt to send.
|
||||
model: Model name (the provider default is used if None).
|
||||
provider: "ollama" (local, slow) or "mistral" (API, fast).
|
||||
temperature: Model temperature.
|
||||
timeout: Timeout in seconds.
|
||||
|
||||
Returns:
|
||||
The LLM's text response.
|
||||
"""
|
||||
resolved_model: str
|
||||
if provider == "mistral":
|
||||
# Mistral API (fast, cloud)
|
||||
resolved_model = model or _get_default_mistral_model()
|
||||
return _call_mistral_api(
|
||||
prompt,
|
||||
model=resolved_model,
|
||||
temperature=temperature,
|
||||
timeout=timeout,
|
||||
)
|
||||
else:
|
||||
# Ollama (local, lent mais gratuit)
|
||||
resolved_model = model or _get_default_model()
|
||||
return _call_ollama(
|
||||
prompt,
|
||||
model=resolved_model,
|
||||
temperature=temperature,
|
||||
timeout=timeout,
|
||||
)
|
||||
|
||||
|
||||
def _clean_json_string(json_str: str) -> str:
    """Strip invalid control characters from a JSON string.

    Robust strategy: replace ALL control characters (0x00-0x1f, which
    includes newlines and tabs) with spaces, then collapse runs of
    whitespace. This avoids the "Invalid control character" errors
    raised by json.loads().
    """
    # Replace every control character with a space
    cleaned: str = re.sub(r'[\x00-\x1f]', ' ', json_str)
    # Collapse runs of whitespace
    cleaned = re.sub(r'\s+', ' ', cleaned)
    return cleaned
|
||||
|
||||
|
||||
def _extract_json(text: str) -> LLMStructuredResult:
    """Extract the JSON payload from the LLM response.

    Args:
        text: Raw text response from the LLM.

    Returns:
        The parsed JSON dictionary.

    Raises:
        LLMStructureError: If the JSON is invalid or missing.
    """
    # Look between the <JSON> and </JSON> tags first
    json_start: int = text.find("<JSON>")
    json_end: int = text.find("</JSON>")

    if json_start != -1 and json_end != -1 and json_end > json_start:
        json_content: str = text[json_start + len("<JSON>"):json_end].strip()
        json_content = _clean_json_string(json_content)

        try:
            result: Dict[str, Any] = json.loads(json_content)
            if "chunks" not in result:
                raise LLMStructureError(
                    f"JSON sans clé 'chunks'. Clés: {list(result.keys())}"
                )
            return cast(LLMStructuredResult, result)
        except json.JSONDecodeError:
            pass  # Fall through to the brace-based fallback below

    # Fallback: take everything from the first '{' to the last '}'
    start: int = text.find("{")
    end: int = text.rfind("}")

    if start == -1 or end == -1 or end <= start:
        raise LLMStructureError(
            f"Pas de JSON trouvé dans la réponse.\nDébut: {text[:500]}"
        )

    json_str: str = _clean_json_string(text[start:end + 1])

    try:
        result = json.loads(json_str)
        if "chunks" not in result:
            raise LLMStructureError(
                f"JSON sans clé 'chunks'. Clés: {list(result.keys())}"
            )
        return cast(LLMStructuredResult, result)
    except json.JSONDecodeError as e:
        raise LLMStructureError(f"JSON invalide: {e}\nContenu: {json_str[:500]}") from e
|
||||
|
||||
|
||||
def structure_with_llm(
    markdown: str,
    hierarchy: Dict[str, Any],
    model: Optional[str] = None,
    base_url: Optional[str] = None,
    temperature: float = 0.2,
    max_chars: int = 8000,
    timeout: int = 300,
) -> LLMStructuredResult:
    """Improve a document's structure via an LLM.

    Args:
        markdown: Markdown text of the document.
        hierarchy: Initial hierarchical structure (from build_hierarchy).
        model: Ollama model to use.
        base_url: Ollama base URL.
        temperature: Model temperature.
        max_chars: Maximum number of Markdown characters to include.
        timeout: Timeout in seconds.

    Returns:
        Improved structure with document_structure and chunks.

    Raises:
        LLMStructureError: On error.
    """
    resolved_model: str = model or _get_default_model()

    logger.info(f"Structuration LLM - modèle: {resolved_model}")

    # Build the prompt
    prompt: str = _prepare_prompt(markdown, hierarchy, max_chars)

    # Call the LLM
    raw_response: str = _call_ollama(
        prompt,
        model=resolved_model,
        base_url=base_url,
        temperature=temperature,
        timeout=timeout,
    )

    # Extract the JSON
    return _extract_json(raw_response)
|
||||
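The two-step JSON recovery used by `_extract_json` (tagged block first, outermost braces second, with control characters cleaned before parsing) can be illustrated in isolation. The following is a minimal sketch, not the module's code: the names `clean_json_string` and `extract_json` are hypothetical stand-ins, and a plain `ValueError` replaces `LLMStructureError` so the example runs without the project.

```python
import json
import re
from typing import Any, Dict, List


def clean_json_string(raw: str) -> str:
    """Replace control characters with spaces, then collapse whitespace."""
    cleaned = re.sub(r"[\x00-\x1f]", " ", raw)
    return re.sub(r"\s+", " ", cleaned)


def extract_json(text: str) -> Dict[str, Any]:
    """Try the <JSON>...</JSON> block first, then the outermost braces."""
    candidates: List[str] = []
    tagged = re.search(r"<JSON>(.*?)</JSON>", text, re.DOTALL)
    if tagged:
        candidates.append(tagged.group(1))
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(clean_json_string(candidate))
        except json.JSONDecodeError:
            continue  # Try the next strategy
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in the response")


# A raw newline inside a JSON string makes json.loads fail unless cleaned.
tagged_reply = 'Voici:\n<JSON>\n{"chunks": [{"text": "a\nb"}]}\n</JSON>'
bare_reply = 'Result: {"chunks": []} done.'

tagged_result = extract_json(tagged_reply)
bare_result = extract_json(bare_reply)
```

Note that the cleaning step also rewrites newlines inside string values as spaces, so the recovered text is not byte-identical to what the model produced. That trade-off seems acceptable for structural metadata but is worth keeping in mind for chunk text.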
|
||||
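The retry loop in `_call_ollama` sleeps between attempts and doubles the delay each time. A minimal, network-free sketch of that pattern is shown below. Everything in it is illustrative: `retry_with_backoff`, `flaky`, and the injected `sleep` callable are hypothetical names, and `ConnectionError` stands in for `requests.RequestException`.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 2,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with a doubling delay."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the last error
            sleep(backoff)
            backoff *= 2
    raise RuntimeError("unreachable: the loop always returns or raises")


# Demo: an operation that fails twice, then succeeds.
delays: List[float] = []
calls: Dict[str, int] = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return "ok"


# Recording the delays instead of sleeping keeps the demo instantaneous.
result = retry_with_backoff(flaky, sleep=delays.append)
```

With the defaults above (two retries, one-second initial delay), the worst case adds three seconds of waiting on top of the per-request timeout.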
420
generations/library_rag/utils/llm_toc.py
Normal file
@@ -0,0 +1,420 @@
|
||||
"""LLM-based Table of Contents (TOC) extraction module.
|
||||
|
||||
This module provides functionality to extract hierarchical table of contents
|
||||
from markdown documents using Large Language Models. It intelligently parses
|
||||
document structure and creates both hierarchical and flat representations
|
||||
of the TOC.
|
||||
|
||||
Key Features:
|
||||
- Hierarchical TOC extraction with chapters, sections, and subsections
|
||||
- Flat TOC generation with full paths for navigation
|
||||
- Content-to-TOC matching for associating sections with TOC entries
|
||||
- Support for multiple LLM providers (Ollama local, Mistral API)
|
||||
|
||||
TOC Structure Levels:
|
||||
- Level 1: Introduction, main chapters, Conclusion, Bibliography
|
||||
- Level 2: Sections listed under a chapter (same visual level)
|
||||
- Level 3: Only if explicit indentation or subsection visible
|
||||
|
||||
Typical Usage:
|
||||
>>> from utils.llm_toc import extract_toc
|
||||
>>> result = extract_toc(
|
||||
... markdown=document_text,
|
||||
... document_title="The Republic",
|
||||
... provider="ollama"
|
||||
... )
|
||||
>>> print(result["toc"]) # Hierarchical structure
|
||||
[
|
||||
{
|
||||
"title": "Introduction",
|
||||
"level": 1,
|
||||
"children": []
|
||||
},
|
||||
{
|
||||
"title": "Book I: Justice",
|
||||
"level": 1,
|
||||
"chapter_number": 1,
|
||||
"children": [
|
||||
{"title": "The Nature of Justice", "level": 2, "children": []}
|
||||
]
|
||||
}
|
||||
]
|
||||
>>> print(result["flat_toc"]) # Flat list with paths
|
||||
[
|
||||
{"title": "Introduction", "level": 1, "path": "Introduction"},
|
||||
{"title": "Book I: Justice", "level": 1, "path": "Book I: Justice"},
|
||||
{
|
||||
"title": "The Nature of Justice",
|
||||
"level": 2,
|
||||
"path": "Book I: Justice > The Nature of Justice"
|
||||
}
|
||||
]
|
||||
|
||||
LLM Provider Options:
|
||||
- "ollama": Local processing, free but slower
|
||||
- "mistral": Cloud API, faster but incurs costs
|
||||
|
||||
Note:
|
||||
For documents without a clear TOC (short articles, book reviews),
|
||||
the module returns an empty TOC list rather than inventing structure.
|
||||
|
||||
See Also:
|
||||
- llm_metadata: Document metadata extraction
|
||||
- llm_classifier: Section classification
|
||||
- toc_extractor: Non-LLM TOC extraction alternatives
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from typing import cast, Any, Dict, List, Optional
|
||||
|
||||
from .llm_structurer import (
|
||||
_clean_json_string,
|
||||
_get_default_mistral_model,
|
||||
_get_default_model,
|
||||
call_llm,
|
||||
)
|
||||
from .types import FlatTOCEntry, LLMProvider, TOCEntry, TOCResult
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _extract_json_from_response(text: str) -> Dict[str, Any]:
|
||||
"""Extract JSON data from an LLM response.
|
||||
|
||||
Parses the LLM response to extract JSON content, handling both
|
||||
explicitly tagged JSON (between <JSON></JSON> tags) and raw JSON
|
||||
embedded in the response text.
|
||||
|
||||
Args:
|
||||
text: The raw LLM response text that may contain JSON.
|
||||
|
||||
Returns:
|
||||
A dictionary containing the parsed JSON data. Returns
|
||||
{"toc": []} if no valid JSON can be extracted.
|
||||
|
||||
Note:
|
||||
This function attempts two parsing strategies:
|
||||
1. Look for JSON between <JSON></JSON> tags
|
||||
2. Find JSON by locating first '{' and last '}'
|
||||
"""
|
||||
json_match: Optional[re.Match[str]] = re.search(r'<JSON>\s*(.*?)\s*</JSON>', text, re.DOTALL)
|
||||
if json_match:
|
||||
json_str: str = _clean_json_string(json_match.group(1))
|
||||
try:
|
||||
result: Dict[str, Any] = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
start: int = text.find("{")
|
||||
end: int = text.rfind("}")
|
||||
if start != -1 and end > start:
|
||||
json_str = _clean_json_string(text[start:end + 1])
|
||||
try:
|
||||
result = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"JSON invalide: {e}")
|
||||
|
||||
return {"toc": []}
|
||||
|
||||
|
||||
def extract_toc(
|
||||
markdown: str,
|
||||
document_title: Optional[str] = None,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
temperature: float = 0.1,
|
||||
) -> Dict[str, Any]:
|
||||
r"""Extract a structured table of contents from a document using LLM.
|
||||
|
||||
Analyzes markdown content to identify the document's hierarchical
|
||||
structure and generates both a nested TOC (with children) and a
|
||||
flat TOC (with navigation paths).
|
||||
|
||||
Args:
|
||||
markdown: Complete markdown text of the document to analyze.
|
||||
document_title: Optional title of the document for context.
|
||||
Helps the LLM better understand the document structure.
|
||||
model: LLM model name to use. If None, uses the default model
|
||||
for the specified provider.
|
||||
provider: LLM provider to use. Either "ollama" for local
|
||||
processing or "mistral" for cloud API.
|
||||
temperature: Model temperature for response generation.
|
||||
Lower values (0.1) produce more consistent results.
|
||||
|
||||
Returns:
|
||||
A dictionary containing:
|
||||
- toc: Hierarchical list of TOC entries, each with:
|
||||
- title: Section title
|
||||
- level: Hierarchy level (1, 2, or 3)
|
||||
- chapter_number: Optional chapter number
|
||||
- children: List of nested TOC entries
|
||||
- flat_toc: Flat list of all TOC entries with paths:
|
||||
- title: Section title
|
||||
- level: Hierarchy level
|
||||
- path: Full navigation path (e.g., "Chapter 1 > Section 1")
|
||||
- error: Error message string (only if extraction failed)
|
||||
|
||||
Raises:
|
||||
No exceptions are raised; errors are captured in the return dict.
|
||||
|
||||
Example:
|
||||
>>> result = extract_toc(
|
||||
... markdown="# Introduction\n...\n# Chapter 1\n## Section 1.1",
|
||||
... document_title="My Book",
|
||||
... provider="ollama"
|
||||
... )
|
||||
>>> len(result["toc"])
|
||||
2
|
||||
>>> result["toc"][0]["title"]
|
||||
'Introduction'
|
||||
|
||||
Note:
|
||||
- Documents longer than 12,000 characters are truncated
|
||||
- Short articles without clear TOC return empty lists
|
||||
- The LLM is instructed to never invent structure
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Truncate to the first max_chars characters and mark the cut
|
||||
max_chars: int = 12000
|
||||
content: str = markdown[:max_chars]
|
||||
if len(markdown) > max_chars:
|
||||
content += "\n\n[... suite du document ...]"
|
||||
|
||||
title_context: str = f"Titre du document: {document_title}\n" if document_title else ""
|
||||
|
||||
prompt: str = f"""Tu es un expert en structuration de documents académiques.
|
||||
|
||||
TÂCHE: Extraire la table des matières FIDÈLE au document fourni.
|
||||
|
||||
{title_context}
|
||||
⚠️ RÈGLES CRITIQUES:
|
||||
|
||||
1. **ANALYSER LE DOCUMENT RÉEL** - Ne JAMAIS copier les exemples ci-dessous!
|
||||
2. **DOCUMENTS SANS TOC** - Si le document est un article court, une revue de livre, ou n'a pas de table des matières explicite, retourner {{"toc": []}}
|
||||
3. **RESPECTER LA STRUCTURE PLATE** - Ne pas inventer de hiérarchie entre des lignes au même niveau
|
||||
4. **IGNORER** - Métadonnées éditoriales (DOI, ISBN, éditeur, copyright, numéros de page)
|
||||
|
||||
NIVEAUX DE STRUCTURE:
|
||||
- level 1: Introduction, Chapitres principaux, Conclusion, Bibliographie
|
||||
- level 2: Sections listées sous un chapitre (même niveau visuel)
|
||||
- level 3: UNIQUEMENT si indentation ou sous-titre explicite visible
|
||||
|
||||
FORMAT DE RÉPONSE (JSON entre balises <JSON></JSON>):
|
||||
|
||||
Pour un livre avec TOC:
|
||||
<JSON>
|
||||
{{
|
||||
"toc": [
|
||||
{{
|
||||
"title": "Titre Chapitre 1",
|
||||
"level": 1,
|
||||
"chapter_number": 1,
|
||||
"children": [
|
||||
{{"title": "Section 1.1", "level": 2, "children": []}},
|
||||
{{"title": "Section 1.2", "level": 2, "children": []}}
|
||||
]
|
||||
}}
|
||||
]
|
||||
}}
|
||||
</JSON>
|
||||
|
||||
Pour un article SANS TOC (revue de livre, article court, etc.):
|
||||
<JSON>
|
||||
{{
|
||||
"toc": []
|
||||
}}
|
||||
</JSON>
|
||||
|
||||
⚠️ NE PAS COPIER CES EXEMPLES ! Analyser uniquement le DOCUMENT RÉEL ci-dessous.
|
||||
|
||||
DOCUMENT À ANALYSER:
|
||||
{content}
|
||||
|
||||
Réponds UNIQUEMENT avec le JSON correspondant à CE document (pas aux exemples)."""
|
||||
|
||||
logger.info(f"Extraction TOC via {provider.upper()} ({model})")
|
||||
|
||||
try:
|
||||
response: str = call_llm(prompt, model=model, provider=provider, temperature=temperature, timeout=360)
|
||||
result: Dict[str, Any] = _extract_json_from_response(response)
|
||||
|
||||
toc: List[Dict[str, Any]] = result.get("toc", [])
|
||||
|
||||
# Build the flat version of the TOC
|
||||
flat_toc: List[Dict[str, Any]] = _flatten_toc(toc)
|
||||
|
||||
logger.info(f"TOC extraite: {len(toc)} entrées niveau 1, {len(flat_toc)} entrées totales")
|
||||
|
||||
return {
|
||||
"toc": toc,
|
||||
"flat_toc": flat_toc,
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Erreur extraction TOC: {e}")
|
||||
return {
|
||||
"toc": [],
|
||||
"flat_toc": [],
|
||||
"error": str(e),
|
||||
}
|
||||
|
||||
|
||||
def _flatten_toc(
|
||||
toc: List[Dict[str, Any]],
|
||||
parent_path: str = "",
|
||||
result: Optional[List[Dict[str, Any]]] = None
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""Flatten a hierarchical TOC into a list with navigation paths.
|
||||
|
||||
Recursively traverses a nested TOC structure and produces a flat
|
||||
list where each entry includes its full path from the root.
|
||||
|
||||
Args:
|
||||
toc: Hierarchical TOC list with nested children.
|
||||
parent_path: Path accumulated from parent entries. Used
|
||||
internally during recursion.
|
||||
result: Accumulator list for results. Used internally
|
||||
during recursion.
|
||||
|
||||
Returns:
|
||||
A flat list of TOC entries, each containing:
|
||||
- title: The section title
|
||||
- level: Hierarchy level (1, 2, or 3)
|
||||
- path: Full navigation path (e.g., "Chapter > Section")
|
||||
- chapter_number: Optional chapter number if present
|
||||
|
||||
Example:
|
||||
>>> hierarchical_toc = [
|
||||
... {
|
||||
... "title": "Chapter 1",
|
||||
... "level": 1,
|
||||
... "children": [
|
||||
... {"title": "Section 1.1", "level": 2, "children": []}
|
||||
... ]
|
||||
... }
|
||||
... ]
|
||||
>>> flat = _flatten_toc(hierarchical_toc)
|
||||
>>> flat[0]["path"]
|
||||
'Chapter 1'
|
||||
>>> flat[1]["path"]
|
||||
'Chapter 1 > Section 1.1'
|
||||
"""
|
||||
if result is None:
|
||||
result = []
|
||||
|
||||
for item in toc:
|
||||
title: str = item.get("title", "")
|
||||
level: int = item.get("level", 1)
|
||||
|
||||
# Build the path
|
||||
path: str
|
||||
if parent_path:
|
||||
path = f"{parent_path} > {title}"
|
||||
else:
|
||||
path = title
|
||||
|
||||
result.append({
|
||||
"title": title,
|
||||
"level": level,
|
||||
"path": path,
|
||||
"chapter_number": item.get("chapter_number"),
|
||||
})
|
||||
|
||||
# Recurse into the children
|
||||
children: List[Dict[str, Any]] = item.get("children", [])
|
||||
if children:
|
||||
_flatten_toc(children, path, result)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def match_content_to_toc(
|
||||
content_sections: List[Dict[str, Any]],
|
||||
flat_toc: List[Dict[str, Any]],
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""Match content sections to TOC entries using LLM.
|
||||
|
||||
Uses an LLM to intelligently associate extracted content sections
|
||||
with their corresponding entries in the table of contents. This
|
||||
enables navigation and context-aware content organization.
|
||||
|
||||
Args:
|
||||
content_sections: List of content sections extracted from
|
||||
the document. Each section should have a "title" key.
|
||||
flat_toc: Flat TOC list as returned by extract_toc()["flat_toc"].
|
||||
Each entry should have a "title" key.
|
||||
model: LLM model name to use. If None, uses the default
|
||||
model for the specified provider.
|
||||
provider: LLM provider to use. Either "ollama" for local
|
||||
processing or "mistral" for cloud API.
|
||||
|
||||
Returns:
|
||||
The input content_sections list with a "toc_match" key added
|
||||
to each section. The value is either:
|
||||
- The matched TOC entry dict (if a match was found)
|
||||
- None (if no match was found)
|
||||
|
||||
Example:
|
||||
>>> sections = [{"title": "Introduction"}, {"title": "Methods"}]
|
||||
>>> toc = [{"title": "Introduction", "level": 1, "path": "Introduction"}]
|
||||
>>> matched = match_content_to_toc(sections, toc)
|
||||
>>> matched[0]["toc_match"]["title"]
|
||||
'Introduction'
|
||||
>>> matched[1]["toc_match"] is None
|
||||
True
|
||||
|
||||
Note:
|
||||
- Only the first 30 content sections are processed to limit costs
|
||||
- Failed matches are silently handled (sections get toc_match=None)
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Prepare the data for the prompt
|
||||
toc_titles: List[str] = [item["title"] for item in flat_toc]
|
||||
section_titles: List[str] = [s.get("title", "") for s in content_sections[:30]]  # Cap at 30 sections
|
||||
|
||||
prompt: str = f"""Tu dois associer les sections de contenu aux entrées de la table des matières.
|
||||
|
||||
TABLE DES MATIÈRES:
|
||||
{json.dumps(toc_titles, ensure_ascii=False, indent=2)}
|
||||
|
||||
SECTIONS DE CONTENU:
|
||||
{json.dumps(section_titles, ensure_ascii=False, indent=2)}
|
||||
|
||||
Pour chaque section de contenu, indique l'index (0-based) de l'entrée TOC correspondante.
|
||||
Si pas de correspondance, indique -1.
|
||||
|
||||
RÉPONDS avec un JSON:
|
||||
<JSON>
|
||||
{{
|
||||
"matches": [0, 1, 2, -1, 3, ...]
|
||||
}}
|
||||
</JSON>
|
||||
"""
|
||||
|
||||
    try:
        response: str = call_llm(prompt, model=model, provider=provider, temperature=0.1)
        result: Dict[str, Any] = _extract_json_from_response(response)
        matches: List[int] = result.get("matches", [])

        # Apply the matches; a missing or out-of-range index means "no match"
        for i, section in enumerate(content_sections):
            if i < len(matches) and matches[i] >= 0 and matches[i] < len(flat_toc):
                section["toc_match"] = flat_toc[matches[i]]
            else:
                section["toc_match"] = None

        return content_sections

    except Exception as e:
        logger.warning(f"Erreur correspondance TOC: {e}")
        # Honor the documented contract: every section ends up with a toc_match key
        for section in content_sections:
            section.setdefault("toc_match", None)
        return content_sections
|
||||
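The index-application step of `match_content_to_toc` can be exercised without an LLM. The sketch below is an illustration under assumptions, not the module's code: `apply_matches` is a hypothetical helper, and it additionally rejects non-integer indices, which the original loop does not guard against.

```python
from typing import Any, Dict, List


def apply_matches(
    sections: List[Dict[str, Any]],
    flat_toc: List[Dict[str, Any]],
    matches: List[Any],
) -> List[Dict[str, Any]]:
    """Attach flat_toc[matches[i]] to sections[i], or None when unusable."""
    for i, section in enumerate(sections):
        index = matches[i] if i < len(matches) else -1
        usable = (
            isinstance(index, int)
            and not isinstance(index, bool)
            and 0 <= index < len(flat_toc)
        )
        section["toc_match"] = flat_toc[index] if usable else None
    return sections


flat_toc = [{"title": "Introduction", "level": 1, "path": "Introduction"}]
sections = [
    {"title": "Introduction"},
    {"title": "Methods"},
    {"title": "Appendix"},
    {"title": "Notes"},
]
# 0 is a valid index, -1 means "no match", 7 is out of range,
# and the fourth section has no entry at all.
matched = apply_matches(sections, flat_toc, [0, -1, 7])
```

Treating every unusable index the same way keeps the output shape uniform, so downstream code can rely on the `toc_match` key being present on every section.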
513
generations/library_rag/utils/llm_validator.py
Normal file
@@ -0,0 +1,513 @@
|
||||
"""Document validation and enrichment using Large Language Models.
|
||||
|
||||
This module provides comprehensive validation, correction, and enrichment
|
||||
functionality for parsed documents. It uses LLMs to verify document coherence,
|
||||
detect inconsistencies, suggest corrections, and extract key concepts from
|
||||
text chunks.
|
||||
|
||||
Overview:
|
||||
The module performs three main functions:
|
||||
|
||||
1. **Document Validation** (validate_document):
|
||||
Verifies the coherence of parsed documents by checking metadata,
|
||||
table of contents, and chunk content quality. Returns detailed
|
||||
validation results with issues, corrections, and confidence scores.
|
||||
|
||||
2. **Content Enrichment** (enrich_chunks_with_concepts, generate_section_summary):
|
||||
Enhances document content by extracting key philosophical concepts
|
||||
from chunks and generating concise summaries for sections.
|
||||
|
||||
3. **Correction Application** (apply_corrections, clean_validation_annotations):
|
||||
Applies suggested corrections from validation results and cleans
|
||||
LLM-generated annotation artifacts from text.
|
||||
|
||||
Validation Criteria:
|
||||
The validator checks several aspects of document quality:
|
||||
|
||||
- **Metadata Quality**: Verifies title and author are correctly identified
|
||||
(not collection names, not "Unknown" when visible in text)
|
||||
- **TOC Coherence**: Checks for duplicates, proper ordering, completeness
|
||||
- **Chunk Content**: Ensures chunks contain substantive content, not just
|
||||
metadata fragments or headers
|
||||
|
||||
Validation Result Structure:
|
||||
The ValidationResult TypedDict contains:
|
||||
|
||||
- valid (bool): Overall validation pass/fail
|
||||
- errors (List[str]): Critical issues requiring attention
|
||||
- warnings (List[str]): Non-critical suggestions
|
||||
- corrections (Dict[str, str]): Suggested field corrections
|
||||
- concepts (List[str]): Extracted key concepts
|
||||
- score (float): Confidence score (0.0 to 1.0)
|
||||
|
||||
LLM Provider Support:
|
||||
- ollama: Local LLM (free, slower, privacy-preserving)
|
||||
- mistral: Mistral API (faster, requires API key, ~0.001 per validation)
|
||||
|
||||
Example:
|
||||
>>> from utils.llm_validator import validate_document, apply_corrections
|
||||
>>>
|
||||
>>> # Validate a parsed document
|
||||
>>> parsed_doc = {
|
||||
... "metadata": {"title": "Phenomenologie", "author": "Hegel"},
|
||||
... "toc": [{"title": "Preface", "level": 1, "page": 1}],
|
||||
... "chunks": [{"text": "La conscience...", "section_path": "Preface"}]
|
||||
... }
|
||||
>>> result = validate_document(parsed_doc, provider="ollama")
|
||||
>>> print(f"Valid: {result['valid']}, Score: {result['score']}")
|
||||
Valid: True, Score: 0.85
|
||||
|
||||
See Also:
|
||||
utils.llm_cleaner: Text cleaning and validation
|
||||
utils.llm_chunker: Semantic chunking of sections
|
||||
utils.pdf_pipeline: Main pipeline orchestration
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import re
|
||||
from typing import Any, Dict, List, Optional, Match
|
||||
|
||||
from .llm_structurer import call_llm, _get_default_model, _get_default_mistral_model, _clean_json_string
|
||||
from .types import LLMProvider, ValidationResult, ParsedDocument, ChunkData
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _extract_json_from_response(text: str) -> Dict[str, Any]:
|
||||
"""Extract JSON from an LLM response text.
|
||||
|
||||
Attempts to parse JSON from the response using two strategies:
|
||||
1. Look for content wrapped in <JSON></JSON> tags
|
||||
2. Find the first { and last } to extract raw JSON
|
||||
|
||||
Args:
|
||||
text: LLM response text potentially containing JSON data.
|
||||
May include markdown, explanatory text, or XML-style tags.
|
||||
|
||||
Returns:
|
||||
Parsed dictionary from the JSON content. Returns an empty dict
|
||||
if no valid JSON is found or parsing fails.
|
||||
|
||||
Example:
|
||||
>>> response = '<JSON>{"valid": true, "score": 0.9}</JSON>'
|
||||
>>> _extract_json_from_response(response)
|
||||
{'valid': True, 'score': 0.9}
|
||||
"""
|
||||
json_match: Optional[Match[str]] = re.search(r'<JSON>\s*(.*?)\s*</JSON>', text, re.DOTALL)
|
||||
if json_match:
|
||||
json_str: str = _clean_json_string(json_match.group(1))
|
||||
try:
|
||||
result: Dict[str, Any] = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError:
|
||||
pass
|
||||
|
||||
start: int = text.find("{")
|
||||
end: int = text.rfind("}")
|
||||
if start != -1 and end > start:
|
||||
json_str = _clean_json_string(text[start:end + 1])
|
||||
try:
|
||||
result = json.loads(json_str)
|
||||
return result
|
||||
except json.JSONDecodeError as e:
|
||||
logger.warning(f"JSON invalide: {e}")
|
||||
|
||||
return {}
|
||||
|
||||
|
||||
def validate_document(
|
||||
parsed_doc: Dict[str, Any],
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
temperature: float = 0.1,
|
||||
) -> ValidationResult:
|
||||
"""Validate a parsed document's coherence and suggest corrections.
|
||||
|
||||
Uses an LLM to analyze the document structure and content, checking
|
||||
for common issues like incorrect metadata, inconsistent TOC, or
|
||||
low-quality chunk content.
|
||||
|
||||
Args:
|
||||
parsed_doc: Dictionary containing the parsed document with keys:
|
||||
- metadata: Dict with title, author, year, language
|
||||
- toc: List of TOC entries with title, level, page
|
||||
- chunks: List of text chunks with content and metadata
|
||||
model: LLM model name. If None, uses provider's default model.
|
||||
provider: LLM provider, either "ollama" (local) or "mistral" (API).
|
||||
temperature: Model temperature for response generation (0.0-1.0).
|
||||
Lower values produce more deterministic results.
|
||||
|
||||
Returns:
|
||||
ValidationResult TypedDict containing:
|
||||
- valid: Overall validation status (True if no critical errors)
|
||||
- errors: List of critical issues as strings
|
||||
- warnings: List of non-critical suggestions
|
||||
- corrections: Dict mapping field names to suggested corrections
|
||||
- concepts: Extracted key concepts (empty for this function)
|
||||
- score: Confidence score from 0.0 to 1.0
|
||||
|
||||
Note:
|
||||
The function always returns a valid result, even on LLM errors.
|
||||
Check the 'score' field - a score of 0.0 indicates an error occurred.
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Extract the key information
|
||||
metadata: Dict[str, Any] = parsed_doc.get("metadata", {})
|
||||
toc: List[Dict[str, Any]] = parsed_doc.get("toc", [])
|
||||
chunks: List[Dict[str, Any]] = parsed_doc.get("chunks", [])
|
||||
|
||||
# Build the document summary
|
||||
doc_summary: Dict[str, Any] = {
|
||||
"title": metadata.get("title"),
|
||||
"author": metadata.get("author"),
|
||||
"toc_count": len(toc),
|
||||
"toc_preview": [t.get("title") for t in toc[:10]] if toc else [],
|
||||
"chunks_count": len(chunks),
|
||||
"first_chunks_preview": [
|
||||
c.get("text", "")[:100] for c in chunks[:5]
|
||||
] if chunks else [],
|
||||
}
|
||||
|
||||
prompt: str = f"""Tu es un expert en validation de documents structurés.
|
||||
|
||||
TÂCHE: Vérifier la cohérence de ce document parsé et détecter les erreurs.
|
||||
|
||||
DOCUMENT PARSÉ:
|
||||
{json.dumps(doc_summary, ensure_ascii=False, indent=2)}
|
||||
|
||||
VÉRIFICATIONS À EFFECTUER:
|
||||
1. Le titre correspond-il au contenu? (pas le nom d'une collection)
|
||||
2. L'auteur est-il correctement identifié? (pas "Inconnu" si visible)
|
||||
3. La TOC est-elle cohérente? (pas de doublons, bon ordre)
|
||||
4. Les chunks contiennent-ils du vrai contenu? (pas que des métadonnées)
|
||||
|
||||
RÉPONDS avec un JSON entre <JSON></JSON>:
|
||||
|
||||
<JSON>
|
||||
{{
|
||||
"is_valid": true,
|
||||
"confidence": 0.85,
|
||||
"issues": [
|
||||
{{
|
||||
"field": "title",
|
||||
"severity": "warning",
|
||||
"message": "Le titre semble être le nom de la collection",
|
||||
"suggestion": "Vrai titre suggéré"
|
||||
}}
|
||||
],
|
||||
"corrections": {{
|
||||
"title": "Titre corrigé si nécessaire",
|
||||
"author": "Auteur corrigé si nécessaire"
|
||||
}},
|
||||
"quality_score": {{
|
||||
"metadata": 0.8,
|
||||
"toc": 0.9,
|
||||
"chunks": 0.7
|
||||
}}
|
||||
}}
|
||||
</JSON>
|
||||
"""
|
||||
|
||||
logger.info(f"Validation du document parsé via {provider.upper()}")
|
||||
|
||||
try:
|
||||
response: str = call_llm(
|
||||
prompt, model=model, provider=provider, temperature=temperature, timeout=180
|
||||
)
|
||||
result: Dict[str, Any] = _extract_json_from_response(response)
|
||||
|
||||
        # Build the ValidationResult, falling back to defaults
        is_valid: bool = result.get("is_valid", True)
        # Each issue is a dict (field, severity, message, suggestion), not a string
        issues: List[Any] = result.get("issues", [])
        corrections: Dict[str, str] = result.get("corrections", {})
        confidence: float = result.get("confidence", 0.5)
|
||||
|
||||
logger.info(f"Validation terminée: valid={is_valid}, issues={len(issues)}")
|
||||
|
||||
validation_result: ValidationResult = {
|
||||
"valid": is_valid,
|
||||
"errors": [str(issue) for issue in issues] if issues else [],
|
||||
"warnings": [],
|
||||
"corrections": corrections,
|
||||
"concepts": [],
|
||||
"score": confidence,
|
||||
}
|
||||
return validation_result
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Erreur validation document: {e}")
|
||||
error_result: ValidationResult = {
|
||||
"valid": True,
|
||||
"errors": [str(e)],
|
||||
"warnings": [],
|
||||
"corrections": {},
|
||||
"concepts": [],
|
||||
"score": 0.0,
|
||||
}
|
||||
return error_result
|
||||
|
||||
|
||||
def generate_section_summary(
|
||||
section_content: str,
|
||||
section_title: str,
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
max_words: int = 50,
|
||||
) -> str:
|
||||
"""Generate a concise summary for a document section using LLM.
|
||||
|
||||
Creates a single-sentence summary capturing the main idea of the section.
|
||||
For very short sections (< 100 characters), returns the section title
|
||||
instead of calling the LLM.
|
||||
|
||||
Args:
|
||||
section_content: Full text content of the section to summarize.
|
||||
section_title: Title of the section, used as fallback if summarization
|
||||
fails or content is too short.
|
||||
model: LLM model name. If None, uses provider's default model.
|
||||
provider: LLM provider, either "ollama" (local) or "mistral" (API).
|
||||
max_words: Maximum number of words for the generated summary.
|
||||
Defaults to 50 words.
|
||||
|
||||
Returns:
|
||||
Generated summary string, truncated to max_words if necessary.
|
||||
Returns section_title if content is too short or on error.
|
||||
|
||||
Note:
|
||||
Only the first 2000 characters of section_content are sent to the LLM
|
||||
to manage context window limits and costs.
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
if len(section_content) < 100:
|
||||
return section_title
|
||||
|
||||
prompt: str = f"""Résume cette section en maximum {max_words} mots.
|
||||
Le résumé doit capturer l'idée principale.
|
||||
|
||||
Titre: {section_title}
|
||||
Contenu:
|
||||
{section_content[:2000]}
|
||||
|
||||
Résumé (en une phrase):"""
|
||||
|
||||
try:
|
||||
response: str = call_llm(
|
||||
prompt, model=model, provider=provider, temperature=0.2, timeout=60
|
||||
)
|
||||
|
||||
# Clean up the response
|
||||
summary: str = response.strip()
|
||||
|
||||
# Cap the length at max_words
|
||||
words: List[str] = summary.split()
|
||||
if len(words) > max_words:
|
||||
summary = ' '.join(words[:max_words]) + '...'
|
||||
|
||||
return summary or section_title
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Erreur génération résumé: {e}")
|
||||
return section_title
|
||||
|
||||
|
||||
def enrich_chunks_with_concepts(
|
||||
chunks: List[Dict[str, Any]],
|
||||
model: Optional[str] = None,
|
||||
provider: LLMProvider = "ollama",
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""Enrich text chunks with extracted key concepts using LLM.
|
||||
|
||||
Processes each chunk to extract 3-5 key philosophical or thematic
|
||||
concepts, adding them to the chunk's 'concepts' field. Skips chunks
|
||||
that already have concepts or are too short (< 100 characters).
|
||||
|
||||
Args:
|
||||
chunks: List of chunk dictionaries, each containing at minimum:
|
||||
- text: The chunk's text content
|
||||
May also contain existing 'concepts' field (will be skipped).
|
||||
model: LLM model name. If None, uses provider's default model.
|
||||
provider: LLM provider, either "ollama" (local) or "mistral" (API).
|
||||
|
||||
Returns:
|
||||
The same list of chunks, modified in-place with 'concepts' field
|
||||
added to each chunk. Each concepts field is a list of 0-5 strings.
|
||||
|
||||
Note:
|
||||
- Chunks are processed individually with logging every 10 chunks.
|
||||
- Only the first 1000 characters of each chunk are analyzed.
|
||||
- The function modifies chunks in-place AND returns them.
|
||||
- On extraction error, sets concepts to an empty list.
|
||||
"""
|
||||
if model is None:
|
||||
model = _get_default_mistral_model() if provider == "mistral" else _get_default_model()
|
||||
|
||||
# Progress is logged once per batch_size chunks; each chunk still gets its own LLM call
|
||||
batch_size: int = 10
|
||||
|
||||
i: int
|
||||
chunk: Dict[str, Any]
|
||||
for i, chunk in enumerate(chunks):
|
||||
if "concepts" in chunk and chunk["concepts"]:
|
||||
continue # Already enriched
|
||||
|
||||
text: str = chunk.get("text", "")
|
||||
if len(text) < 100:
|
||||
chunk["concepts"] = []
|
||||
continue
|
||||
|
||||
# Log progress at each batch boundary
|
||||
if i % batch_size == 0:
|
||||
logger.info(f"Enrichissement concepts: chunks {i} à {min(i+batch_size, len(chunks))}")
|
||||
|
||||
prompt: str = f"""Extrait 3-5 concepts clés de ce texte.
|
||||
Réponds avec une liste JSON: ["concept1", "concept2", ...]
|
||||
|
||||
Texte:
|
||||
{text[:1000]}
|
||||
|
||||
Concepts:"""
|
||||
|
||||
try:
|
||||
response: str = call_llm(
|
||||
prompt, model=model, provider=provider, temperature=0.1, timeout=30
|
||||
)
|
||||
|
||||
# Look for the JSON list in the response
|
||||
match: Optional[Match[str]] = re.search(r'\[.*?\]', response, re.DOTALL)
|
||||
if match:
|
||||
concepts: List[str] = json.loads(match.group())
|
||||
chunk["concepts"] = concepts[:5]
|
||||
else:
|
||||
chunk["concepts"] = []
|
||||
|
||||
except Exception as e:
|
||||
logger.warning(f"Erreur extraction concepts chunk {i}: {e}")
|
||||
chunk["concepts"] = []
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def clean_validation_annotations(text: str) -> str:
|
||||
"""Remove LLM-generated validation annotations from text.
|
||||
|
||||
Cleans common annotation patterns that LLMs may add when validating
|
||||
or correcting text, such as confidence markers or verification notes.
|
||||
|
||||
Patterns removed:
|
||||
- "(correct)" or "(a confirmer)" at end of text
|
||||
- "(a confirmer comme titre principal)"
|
||||
- "(possiblement...)" or "(probablement...)"
|
||||
- Isolated "(correct)" or "(a confirmer)" mid-text
|
||||
|
||||
Args:
|
||||
text: Text potentially containing LLM annotation artifacts.
|
||||
|
||||
Returns:
|
||||
Cleaned text with annotations removed and whitespace normalized.
|
||||
Returns the original text if input is None or empty.
|
||||
|
||||
Example:
|
||||
>>> clean_validation_annotations("Phenomenologie (a confirmer)")
|
||||
'Phenomenologie'
|
||||
>>> clean_validation_annotations("G.W.F. Hegel (correct)")
|
||||
'G.W.F. Hegel'
|
||||
"""
|
||||
if not text:
|
||||
return text
|
||||
|
||||
# Remove annotations at the end of the text
|
||||
text = re.sub(
|
||||
r'\s*\([^)]*(?:correct|[àa] confirmer|possiblement|probablement)[^)]*\)\s*$',
|
||||
'',
|
||||
text,
|
||||
flags=re.IGNORECASE
|
||||
)
|
||||
|
||||
# Also clean isolated annotations in the middle of the text
|
||||
text = re.sub(r'\s*\((?:correct|[àa] confirmer)\)\s*', ' ', text, flags=re.IGNORECASE)
|
||||
|
||||
return text.strip()
|
||||
|
||||
|
||||
def apply_corrections(
|
||||
parsed_doc: Dict[str, Any],
|
||||
validation_result: Optional[Dict[str, Any]] = None,
|
||||
) -> Dict[str, Any]:
|
||||
"""Apply validation corrections to a parsed document.
|
||||
|
||||
Takes the corrections suggested by validate_document() and applies them
|
||||
to the document's metadata. Also cleans any LLM annotation artifacts
|
||||
from existing metadata fields.
|
||||
|
||||
Args:
|
||||
parsed_doc: Parsed document dictionary containing at minimum:
|
||||
- metadata: Dict with title, author, and other fields
|
||||
May also contain 'work' field as fallback title source.
|
||||
validation_result: Result from validate_document() containing:
|
||||
- corrections: Dict mapping field names to corrected values
|
||||
If None, only cleans existing metadata annotations.
|
||||
|
||||
Returns:
|
||||
The modified parsed_doc with:
|
||||
- Corrected metadata fields applied
|
||||
- Original values preserved in 'original_<field>' keys
|
||||
- LLM annotations cleaned from all text fields
|
||||
- 'validation' key added with the validation_result
|
||||
|
||||
Note:
|
||||
- Modifies parsed_doc in-place AND returns it
|
||||
- Empty correction values are ignored
|
||||
- If title contains validation phrases and 'work' field exists,
|
||||
the work field value is used as the corrected title
|
||||
"""
|
||||
corrections: Dict[str, str] = (
|
||||
validation_result.get("corrections", {}) if validation_result else {}
|
||||
)
|
||||
|
||||
metadata: Dict[str, Any] = parsed_doc.get("metadata", {})
|
||||
|
||||
# Apply metadata corrections
|
||||
if "title" in corrections and corrections["title"]:
|
||||
old_title: Optional[str] = metadata.get("title")
|
||||
# Strip validation annotations
|
||||
clean_title: str = clean_validation_annotations(corrections["title"])
|
||||
metadata["title"] = clean_title
|
||||
metadata["original_title"] = old_title
|
||||
logger.info(f"Titre corrigé: '{old_title}' -> '{clean_title}'")
|
||||
|
||||
if "author" in corrections and corrections["author"]:
|
||||
old_author: Optional[str] = metadata.get("author")
|
||||
# Strip validation annotations
|
||||
clean_author: str = clean_validation_annotations(corrections["author"])
|
||||
metadata["author"] = clean_author
|
||||
metadata["original_author"] = old_author
|
||||
logger.info(f"Auteur corrigé: '{old_author}' -> '{clean_author}'")
|
||||
|
||||
# Also clean the existing metadata (runs whether or not corrections were applied)
|
||||
if "title" in metadata and metadata["title"]:
|
||||
title: str = metadata["title"]
|
||||
# If the title contains validation phrases, fall back to the "work" field
|
||||
validation_phrases: List[str] = ["à confirmer", "confirmer avec", "vérifier"]
|
||||
if any(phrase in title.lower() for phrase in validation_phrases):
|
||||
if "work" in metadata and metadata["work"]:
|
||||
logger.info(f"Titre remplacé par 'work': '{title}' -> '{metadata['work']}'")
|
||||
metadata["original_title"] = title
|
||||
metadata["title"] = metadata["work"]
|
||||
else:
|
||||
metadata["title"] = clean_validation_annotations(title)
|
||||
|
||||
if "author" in metadata and metadata["author"]:
|
||||
metadata["author"] = clean_validation_annotations(metadata["author"])
|
||||
|
||||
parsed_doc["metadata"] = metadata
|
||||
parsed_doc["validation"] = validation_result
|
||||
|
||||
return parsed_doc
|
||||
|
||||
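The annotation cleanup performed by `clean_validation_annotations` can be tried in isolation. What follows is a minimal sketch, not the repository's code: `strip_annotations` is a hypothetical standalone helper, and its patterns are assumed to accept both the accented "à confirmer" and the unaccented "a confirmer" spelling that appears in the docstring examples.

```python
import re

# Hypothetical standalone variant of the cleanup step (illustrative names).
# Assumption: both "à confirmer" and "a confirmer" should be recognised.
_TRAILING = re.compile(
    r"\s*\([^)]*(?:correct|[àa] confirmer|possiblement|probablement)[^)]*\)\s*$",
    re.IGNORECASE,
)
_INLINE = re.compile(r"\s*\((?:correct|[àa] confirmer)\)\s*", re.IGNORECASE)


def strip_annotations(text: str) -> str:
    """Remove trailing and isolated inline validation annotations."""
    if not text:
        return text
    # First drop a parenthesised annotation sitting at the very end.
    text = _TRAILING.sub("", text)
    # Then collapse isolated "(correct)" / "(à confirmer)" markers to a space.
    text = _INLINE.sub(" ", text)
    return text.strip()
```

Keeping the two patterns separate mirrors the two-pass structure of the original function: the trailing pattern is permissive about what else the parentheses contain, while the inline pattern only matches the bare markers so that legitimate parenthetical text mid-title is left alone.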
141
generations/library_rag/utils/markdown_builder.py
Normal file
@@ -0,0 +1,141 @@
|
||||
"""Markdown document builder from OCR API responses.
|
||||
|
||||
This module transforms Mistral OCR API responses into structured Markdown text.
|
||||
It handles text extraction, page marker insertion, and image processing
|
||||
(either base64 embedding or disk-based storage with relative path references).
|
||||
|
||||
The builder is a core component of the PDF processing pipeline, sitting between
|
||||
OCR extraction and hierarchical parsing.
|
||||
|
||||
Pipeline Position:
|
||||
PDF → OCR (mistral_client) → **Markdown Builder** → Hierarchy Parser → Chunks
|
||||
|
||||
Features:
|
||||
- Page markers: Inserts HTML comments (<!-- Page N -->) for traceability
|
||||
- Image handling: Supports both inline base64 and external file references
|
||||
- Type safety: Uses Protocol-based typing for OCR response structures
|
||||
|
||||
Workflow:
|
||||
1. Iterate through pages in the OCR response
|
||||
2. Extract Markdown content from each page
|
||||
3. Process images (embed as base64 or save via ImageWriter callback)
|
||||
4. Assemble the complete Markdown document
|
||||
|
||||
Image Handling Modes:
|
||||
1. **No images**: Set embed_images=False and image_writer=None
|
||||
2. **Inline base64**: Set embed_images=True (large file size)
|
||||
3. **External files**: Provide image_writer callback (recommended)
|
||||
|
||||
Example:
|
||||
>>> from pathlib import Path
|
||||
>>> from utils.image_extractor import create_image_writer
|
||||
>>>
|
||||
>>> # Create image writer for output directory
|
||||
>>> writer = create_image_writer(Path("output/my_doc/images"))
|
||||
>>>
|
||||
>>> # Build markdown with external image references
|
||||
>>> markdown = build_markdown(
|
||||
... ocr_response,
|
||||
... embed_images=False,
|
||||
... image_writer=writer
|
||||
... )
|
||||
>>> print(markdown[:100])
|
||||
<!-- Page 1 -->
|
||||
# Document Title
|
||||
...
|
||||
|
||||
Note:
|
||||
- Page indices are 1-based for human readability
|
||||
- The OCR response must follow the Mistral API structure
|
||||
- Empty pages produce only the page marker comment
|
||||
|
||||
See Also:
|
||||
- utils.mistral_client: OCR API client for obtaining responses
|
||||
- utils.image_extractor: Image writer factory and extraction
|
||||
- utils.hierarchy_parser: Next step in pipeline (structure parsing)
|
||||
"""
|
||||
|
||||
from typing import Any, Callable, List, Optional, Protocol
|
||||
|
||||
|
||||
# Type of the image writer callback
|
||||
ImageWriterCallable = Callable[[int, int, str], Optional[str]]
|
||||
|
||||
|
||||
class OCRImage(Protocol):
|
||||
"""Protocol for an image extracted by OCR."""
|
||||
|
||||
image_base64: Optional[str]
|
||||
|
||||
|
||||
class OCRPage(Protocol):
|
||||
"""Protocol for a page extracted by OCR."""
|
||||
|
||||
markdown: Optional[str]
|
||||
images: Optional[List[OCRImage]]
|
||||
|
||||
|
||||
class OCRResponseProtocol(Protocol):
|
||||
"""Protocol for the complete response of the Mistral OCR API."""
|
||||
|
||||
pages: List[OCRPage]
|
||||
|
||||
|
||||
def build_markdown(
|
||||
ocr_response: OCRResponseProtocol,
|
||||
embed_images: bool = False,
|
||||
image_writer: Optional[ImageWriterCallable] = None,
|
||||
) -> str:
|
||||
"""Build the Markdown text from the OCR response.
|
||||
|
||||
Args:
|
||||
ocr_response: Mistral OCR API response containing the extracted pages.
|
||||
embed_images: Embed images as base64 directly in the Markdown.
|
||||
image_writer: Function that saves images to disk.
|
||||
Signature: (page_idx, img_idx, base64_data) -> relative_path.
|
||||
|
||||
Returns:
|
||||
Complete Markdown text of the document, with page markers and images.
|
||||
|
||||
Example:
|
||||
>>> markdown = build_markdown(
|
||||
... ocr_response,
|
||||
... embed_images=False,
|
||||
... image_writer=lambda p, i, b64: f"images/p{p}_i{i}.png"
|
||||
... )
|
||||
"""
|
||||
md_parts: List[str] = []
|
||||
|
||||
for page_index, page in enumerate(ocr_response.pages, start=1):
|
||||
# Page marker comment
|
||||
md_parts.append(f"<!-- Page {page_index} -->\n\n")
|
||||
|
||||
# Markdown content of the page
|
||||
page_markdown: Optional[str] = getattr(page, "markdown", None)
|
||||
if page_markdown:
|
||||
md_parts.append(page_markdown)
|
||||
md_parts.append("\n\n")
|
||||
|
||||
# Image processing
|
||||
page_images: Optional[List[OCRImage]] = getattr(page, "images", None)
|
||||
if page_images:
|
||||
for img_idx, img in enumerate(page_images, start=1):
|
||||
image_b64: Optional[str] = getattr(img, "image_base64", None)
|
||||
if not image_b64:
|
||||
continue
|
||||
|
||||
if embed_images:
|
||||
# Image embedded as base64
|
||||
data_uri: str = f"data:image/png;base64,{image_b64}"
|
||||
md_parts.append(
|
||||
f"![Page {page_index} Image {img_idx}]({data_uri})\n\n"
|
||||
)
|
||||
elif image_writer:
|
||||
# Image saved to disk
|
||||
rel_path: Optional[str] = image_writer(page_index, img_idx, image_b64)
|
||||
if rel_path:
|
||||
md_parts.append(
|
||||
f"![Page {page_index} Image {img_idx}]({rel_path})\n\n"
|
||||
)
|
||||
|
||||
return "".join(md_parts)
|
||||
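The `image_writer` callback accepted by `build_markdown` has the signature `(page_idx, img_idx, base64_data) -> Optional[str]`. The repository provides `utils.image_extractor.create_image_writer` for this, but its implementation is not shown here, so the following is only a sketch of what a conforming callback might look like. The factory name `make_image_writer`, the file naming scheme, and the `.png` extension are all assumptions made for illustration.

```python
import base64
from pathlib import Path
from typing import Callable, Optional

# Same shape as ImageWriterCallable in markdown_builder.py.
ImageWriter = Callable[[int, int, str], Optional[str]]


def make_image_writer(images_dir: Path) -> ImageWriter:
    """Return a callback that decodes base64 images into images_dir.

    Hypothetical helper: naming scheme and extension are assumptions.
    """
    images_dir.mkdir(parents=True, exist_ok=True)

    def write(page_idx: int, img_idx: int, b64_data: str) -> Optional[str]:
        # Accept either a raw base64 payload or a full data URI.
        payload = b64_data.split(",", 1)[1] if b64_data.startswith("data:") else b64_data
        try:
            raw = base64.b64decode(payload, validate=True)
        except ValueError:
            # Undecodable payload: returning None tells the caller to skip it.
            return None
        name = f"page{page_idx}_img{img_idx}.png"
        (images_dir / name).write_bytes(raw)
        # Path relative to the parent of images_dir, usable in Markdown links.
        return f"{images_dir.name}/{name}"

    return write
```

Returning `None` on failure matches the `if rel_path:` guard in `build_markdown`, so a bad image is silently skipped rather than aborting the whole document.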
169
generations/library_rag/utils/mistral_client.py
Normal file
@@ -0,0 +1,169 @@
|
||||
"""Mistral API Client Management.
|
||||
|
||||
This module provides utilities for managing the Mistral API client,
|
||||
including API key retrieval and OCR cost estimation. It serves as the
|
||||
foundation for all Mistral API interactions in the Library RAG pipeline.
|
||||
|
||||
Key Features:
|
||||
- Automatic API key discovery from multiple sources
|
||||
- Client instantiation with proper authentication
|
||||
- OCR cost estimation for budget planning
|
||||
|
||||
API Key Priority:
|
||||
The module searches for the Mistral API key in this order:
|
||||
1. Explicit argument passed to functions
|
||||
2. MISTRAL_API_KEY environment variable
|
||||
3. .env file in the project root
|
||||
|
||||
Cost Estimation:
|
||||
Mistral OCR pricing (as of 2024):
|
||||
- Standard OCR: ~1 EUR per 1000 pages (0.001 EUR/page)
|
||||
- OCR with annotations: ~3 EUR per 1000 pages (0.003 EUR/page)
|
||||
|
||||
Example:
|
||||
Basic client creation and usage::
|
||||
|
||||
from utils.mistral_client import create_client, estimate_ocr_cost
|
||||
|
||||
# Create authenticated client
|
||||
client = create_client()
|
||||
|
||||
# Estimate cost for a 100-page document
|
||||
cost = estimate_ocr_cost(100, use_annotations=False)
|
||||
print(f"Estimated cost: {cost:.2f} EUR") # Output: Estimated cost: 0.10 EUR
|
||||
|
||||
Using explicit API key::
|
||||
|
||||
client = create_client(api_key="your-api-key-here")
|
||||
|
||||
See Also:
|
||||
- :mod:`utils.ocr_processor`: OCR execution functions using this client
|
||||
- :mod:`utils.pdf_uploader`: PDF upload utilities for OCR processing
|
||||
|
||||
Note:
|
||||
Ensure MISTRAL_API_KEY is set before using this module in production.
|
||||
The API key can be obtained from the Mistral AI platform dashboard.
|
||||
"""
|
||||
|
||||
import os
|
||||
from typing import Optional
|
||||
|
||||
from dotenv import load_dotenv
|
||||
from mistralai import Mistral
|
||||
|
||||
|
||||
def get_api_key(api_key: Optional[str] = None) -> str:
|
||||
"""Retrieve the Mistral API key from available sources.
|
||||
|
||||
Searches for the API key in the following priority order:
|
||||
1. Explicit argument passed to this function
|
||||
2. MISTRAL_API_KEY environment variable
|
||||
3. .env file in the project root
|
||||
|
||||
Args:
|
||||
api_key: Optional API key to use directly. If provided and non-empty,
|
||||
this value is used without checking other sources.
|
||||
|
||||
Returns:
|
||||
The Mistral API key as a string.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If no API key is found in any of the checked sources.
|
||||
|
||||
Example:
|
||||
>>> # Using environment variable
|
||||
>>> key = get_api_key()
|
||||
>>> len(key) > 0
|
||||
True
|
||||
|
||||
>>> # Using explicit key
|
||||
>>> key = get_api_key("my-api-key")
|
||||
>>> key
|
||||
'my-api-key'
|
||||
"""
|
||||
# 1. Explicit argument
|
||||
if api_key and api_key.strip():
|
||||
return api_key.strip()
|
||||
|
||||
# 2. Environment variable
|
||||
env_key = os.getenv("MISTRAL_API_KEY", "").strip()
|
||||
if env_key:
|
||||
return env_key
|
||||
|
||||
# 3. .env file
|
||||
load_dotenv()
|
||||
env_key = os.getenv("MISTRAL_API_KEY", "").strip()
|
||||
if env_key:
|
||||
return env_key
|
||||
|
||||
raise RuntimeError(
|
||||
"MISTRAL_API_KEY manquante. "
|
||||
"Definissez la variable d environnement ou creez un fichier .env"
|
||||
)
|
||||
|
||||
|
||||
def create_client(api_key: Optional[str] = None) -> Mistral:
|
||||
"""Create and return an authenticated Mistral client.
|
||||
|
||||
This is the primary entry point for obtaining a Mistral client instance.
|
||||
The client can be used for OCR operations, chat completions, and other
|
||||
Mistral API features.
|
||||
|
||||
Args:
|
||||
api_key: Optional API key. If not provided, the key is automatically
|
||||
retrieved from environment variables or .env file.
|
||||
|
||||
Returns:
|
||||
An authenticated Mistral client instance ready for API calls.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If no API key is found (propagated from get_api_key).
|
||||
|
||||
Example:
|
||||
>>> client = create_client()
|
||||
>>> # Client is now ready for OCR or other operations
|
||||
>>> response = client.ocr.process(...) # doctest: +SKIP
|
||||
"""
|
||||
key = get_api_key(api_key)
|
||||
return Mistral(api_key=key)
|
||||
|
||||
|
||||
def estimate_ocr_cost(nb_pages: int, use_annotations: bool = False) -> float:
|
||||
"""Estimate the cost of OCR processing for a document.
|
||||
|
||||
Calculates the expected cost based on Mistral OCR pricing model.
|
||||
This is useful for budget planning before processing large document
|
||||
collections.
|
||||
|
||||
Pricing Model:
|
||||
- Standard OCR: ~1 EUR per 1000 pages (0.001 EUR/page)
|
||||
- OCR with annotations: ~3 EUR per 1000 pages (0.003 EUR/page)
|
||||
|
||||
The annotation mode is approximately 3x more expensive but provides
|
||||
additional structural information useful for TOC extraction.
|
||||
|
||||
Args:
|
||||
nb_pages: Number of pages in the document to process.
|
||||
use_annotations: If True, uses the higher annotation pricing.
|
||||
Annotations provide bounding box and structural data.
|
||||
|
||||
Returns:
|
||||
Estimated cost in euros as a float.
|
||||
|
||||
Example:
|
||||
>>> # Standard OCR for 100 pages
|
||||
>>> estimate_ocr_cost(100)
|
||||
0.1
|
||||
|
||||
>>> # OCR with annotations for 100 pages
|
||||
>>> estimate_ocr_cost(100, use_annotations=True)
|
||||
0.3
|
||||
|
||||
>>> # Large document collection
|
||||
>>> estimate_ocr_cost(10000)
|
||||
10.0
|
||||
"""
|
||||
if use_annotations:
|
||||
return nb_pages * 0.003 # 3 EUR / 1000 pages
|
||||
else:
|
||||
return nb_pages * 0.001 # 1 EUR / 1000 pages
|
||||
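For batch planning, the per-page rates quoted in the docstrings above (0.001 EUR/page standard, 0.003 EUR/page with annotations) can be applied to a whole collection at once. This is a rough sketch: `estimate_batch_cost` is a hypothetical helper that does not exist in the repository, and the rounding to four decimals is an assumption made to keep floating-point noise out of the result.

```python
from typing import Iterable

# Rates taken from the module docstring; treat them as approximate.
RATE_STANDARD_EUR = 0.001
RATE_ANNOTATED_EUR = 0.003


def estimate_batch_cost(page_counts: Iterable[int], use_annotations: bool = False) -> float:
    """Estimate the OCR cost in EUR for several documents.

    Hypothetical helper: sums the page counts and applies a single rate.
    """
    counts = list(page_counts)
    if any(n < 0 for n in counts):
        raise ValueError("page counts must be non-negative")
    rate = RATE_ANNOTATED_EUR if use_annotations else RATE_STANDARD_EUR
    # Round to 4 decimals so the result is stable for display and comparison.
    return round(sum(counts) * rate, 4)
```

For three documents of 100, 250 and 50 pages this gives 400 pages in total, so roughly 0.40 EUR in standard mode and 1.20 EUR with annotations.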
312
generations/library_rag/utils/ocr_processor.py
Normal file
@@ -0,0 +1,312 @@
|
||||
"""OCR Processing via Mistral API.
|
||||
|
||||
This module provides functions for executing OCR (Optical Character Recognition)
|
||||
on PDF documents using the Mistral API. It handles both standard OCR and advanced
|
||||
OCR with structured annotations for better document understanding.
|
||||
|
||||
Key Features:
|
||||
- Standard OCR for text extraction with optional image embedding
|
||||
- Advanced OCR with document and bounding box annotations
|
||||
- Response serialization for JSON storage and further processing
|
||||
- Support for page-by-page processing
|
||||
|
||||
OCR Modes:
|
||||
1. **Standard OCR** (run_ocr):
|
||||
- Extracts text and optionally images
|
||||
- Cost: ~1 EUR per 1000 pages (0.001 EUR/page)
|
||||
- Best for: Simple text extraction, content indexing
|
||||
|
||||
2. **OCR with Annotations** (run_ocr_with_annotations):
|
||||
- Extracts text with structural metadata (bounding boxes, document structure)
|
||||
- Cost: ~3 EUR per 1000 pages (0.003 EUR/page)
|
||||
- Best for: TOC extraction, layout analysis, structured documents
|
||||
- Document annotations limited to 8 pages max
|
||||
- Bounding box annotations have no page limit
|
||||
|
||||
Response Structure:
|
||||
The OCR response contains:
|
||||
- pages: List of page objects with text content
|
||||
- images: Optional base64-encoded images (if include_images=True)
|
||||
- annotations: Structural metadata (if using annotation mode)
|
||||
|
||||
Example:
|
||||
Basic OCR processing::
|
||||
|
||||
from utils.mistral_client import create_client
|
||||
from utils.ocr_processor import run_ocr, serialize_ocr_response
|
||||
|
||||
# Create client and read PDF
|
||||
client = create_client()
|
||||
with open("document.pdf", "rb") as f:
|
||||
pdf_bytes = f.read()
|
||||
|
||||
# Run OCR
|
||||
response = run_ocr(client, pdf_bytes, "document.pdf")
|
||||
|
||||
# Serialize for storage
|
||||
ocr_dict = serialize_ocr_response(response)
|
||||
print(f"Extracted {len(ocr_dict['pages'])} pages")
|
||||
|
||||
Cost Considerations:
|
||||
- Always estimate costs before batch processing with estimate_ocr_cost()
|
||||
- Use pages parameter to limit processing when full document is not needed
|
||||
- Annotation mode is 3x more expensive - use only when structure is needed
|
||||
- Cache OCR results to avoid reprocessing (saved in output/<doc>/<doc>.json)
|
||||
|
||||
See Also:
|
||||
- utils.mistral_client: Client creation and cost estimation
|
||||
- utils.pdf_uploader: PDF upload utilities
|
||||
- utils.pdf_pipeline: Full pipeline orchestration
|
||||
|
||||
Note:
|
||||
OCR responses are Pydantic models from the Mistral SDK. Use
|
||||
serialize_ocr_response() to convert to dictionaries before JSON storage.
|
||||
"""
|
||||
|
||||
import json
|
||||
from typing import Any, Dict, List, Optional, Type
|
||||
|
||||
from mistralai import Mistral
|
||||
from pydantic import BaseModel
|
||||
|
||||
from .pdf_uploader import upload_pdf
|
||||
from .types import OCRResponse
|
||||
|
||||
|
||||
def run_ocr(
|
||||
client: Mistral,
|
||||
file_bytes: bytes,
|
||||
filename: str,
|
||||
include_images: bool = True,
|
||||
) -> Any:
|
||||
"""Execute standard OCR on a PDF document via Mistral API.
|
||||
|
||||
Uploads the PDF to Mistral servers and runs OCR to extract text content.
|
||||
Optionally includes base64-encoded images from the document.
|
||||
|
||||
This is the most cost-effective OCR mode (~0.001 EUR/page) suitable for
|
||||
basic text extraction and content indexing.
|
||||
|
||||
Args:
|
||||
client: Authenticated Mistral client instance created via
|
||||
utils.mistral_client.create_client().
|
||||
file_bytes: Binary content of the PDF file to process.
|
||||
filename: Original filename of the PDF (used for identification).
|
||||
include_images: If True, includes base64-encoded images from each page
|
||||
in the response. Set to False to reduce response size when images
|
||||
are not needed. Defaults to True.
|
||||
|
||||
Returns:
|
||||
OCR response object from Mistral API (Pydantic model). Contains:
|
||||
- pages: List of page objects with extracted text
|
||||
- images: Base64 images if include_images=True
|
||||
|
||||
Use serialize_ocr_response() to convert to a dictionary.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If the Mistral client is not properly authenticated.
|
||||
HTTPError: If the API request fails (network issues, rate limits).
|
||||
|
||||
Example:
|
||||
>>> from utils.mistral_client import create_client
|
||||
>>> client = create_client()
|
||||
>>> with open("document.pdf", "rb") as f:
|
||||
... pdf_bytes = f.read()
|
||||
>>> response = run_ocr(client, pdf_bytes, "document.pdf")
|
||||
>>> # Access extracted text from first page
|
||||
>>> first_page_text = response.pages[0].markdown # doctest: +SKIP
|
||||
|
||||
Note:
|
||||
The PDF is first uploaded to Mistral servers via
|
||||
utils.pdf_uploader.upload_pdf(), then processed. The uploaded
|
||||
file is automatically cleaned up by Mistral after processing.
|
||||
"""
|
||||
# Upload the document
|
||||
doc_url: str = upload_pdf(client, file_bytes, filename)
|
||||
|
||||
# OCR call
|
||||
response = client.ocr.process(
|
||||
model="mistral-ocr-latest",
|
||||
document={
|
||||
"type": "document_url",
|
||||
"document_url": doc_url,
|
||||
},
|
||||
include_image_base64=include_images,
|
||||
)
|
||||
|
||||
return response
|
||||
|
||||
|
||||
def run_ocr_with_annotations(
|
||||
client: Mistral,
|
||||
file_bytes: bytes,
|
||||
filename: str,
|
||||
include_images: bool = True,
|
||||
document_annotation_format: Optional[Type[BaseModel]] = None,
|
||||
bbox_annotation_format: Optional[Type[BaseModel]] = None,
|
||||
pages: Optional[List[int]] = None,
|
||||
) -> Any:
|
||||
"""Execute OCR with structured annotations on a PDF document.
|
||||
|
||||
This advanced OCR mode extracts text along with structural metadata
|
||||
defined by Pydantic schemas. Useful for extracting structured data
|
||||
like table of contents, form fields, or document hierarchy.
|
||||
|
||||
Two annotation modes are available:
|
||||
- Document annotations: Extract document-level structure (limited to 8 pages)
|
||||
- Bounding box annotations: Extract element positions (no page limit)
|
||||
|
||||
This mode is approximately 3x more expensive than standard OCR (~0.003 EUR/page).
|
||||
|
||||
Args:
|
||||
client: Authenticated Mistral client instance created via
|
||||
utils.mistral_client.create_client().
|
||||
file_bytes: Binary content of the PDF file to process.
|
||||
filename: Original filename of the PDF (used for identification).
|
||||
include_images: If True, includes base64-encoded images from each page.
|
||||
Defaults to True.
|
||||
document_annotation_format: Optional Pydantic model defining the expected
|
||||
document-level annotation structure. The model is converted to JSON
|
||||
schema for the API. Limited to processing 8 pages maximum.
|
||||
bbox_annotation_format: Optional Pydantic model defining the expected
|
||||
bounding box annotation structure. No page limit applies.
|
||||
pages: Optional list of 0-indexed page numbers to process. If None,
|
||||
all pages are processed. Use this to limit costs and processing time.
|
||||
|
||||
Returns:
|
||||
OCR response object with annotations from Mistral API. Contains:
|
||||
- pages: List of page objects with extracted text
|
||||
- annotations: Structured data matching the provided Pydantic schema
|
||||
- images: Base64 images if include_images=True
|
||||
|
||||
Use serialize_ocr_response() to convert to a dictionary.
|
||||
|
||||
Raises:
|
||||
RuntimeError: If the Mistral client is not properly authenticated.
|
||||
HTTPError: If the API request fails (network issues, rate limits).
|
||||
ValueError: Not raised by this function itself, which does not check the page count; the 8-page limit on document_annotation_format is left to the API.
|
||||
|
||||
Example:
|
||||
Extract table of contents from first 8 pages::
|
||||
|
||||
from pydantic import BaseModel
|
||||
from typing import List, Optional
|
||||
|
||||
class TOCEntry(BaseModel):
|
||||
title: str
|
||||
page: int
|
||||
level: int
|
||||
children: Optional[List["TOCEntry"]] = None
|
||||
|
||||
response = run_ocr_with_annotations(
|
||||
client,
|
||||
pdf_bytes,
|
||||
"book.pdf",
|
||||
document_annotation_format=TOCEntry,
|
||||
pages=[0, 1, 2, 3, 4, 5, 6, 7]
|
||||
)
|
||||
|
||||
# Access annotations
|
||||
toc_data = response.annotations # doctest: +SKIP
|
||||
|
||||
Note:
|
||||
- Document annotations are more expensive but provide rich structure
|
||||
- For large documents, use pages parameter to limit processing
|
||||
- Consider caching results to avoid reprocessing costs
|
||||
"""
|
||||
from mistralai.extra import response_format_from_pydantic_model
|
||||
|
||||
# Upload the document
|
||||
doc_url: str = upload_pdf(client, file_bytes, filename)
|
||||
|
||||
# Build the arguments for the OCR call
|
||||
kwargs: Dict[str, Any] = {
|
||||
"model": "mistral-ocr-latest",
|
||||
"document": {
|
||||
"type": "document_url",
|
||||
"document_url": doc_url,
|
||||
},
|
||||
"include_image_base64": include_images,
|
||||
}
|
||||
|
||||
# Add the page selection if specified
|
||||
if pages is not None:
|
||||
kwargs["pages"] = pages
|
||||
|
||||
# Add the document annotation format if provided
|
||||
if document_annotation_format is not None:
|
||||
kwargs["document_annotation_format"] = response_format_from_pydantic_model(
|
||||
document_annotation_format
|
||||
)
|
||||
|
||||
# Add the bounding box annotation format if provided
|
||||
if bbox_annotation_format is not None:
|
||||
kwargs["bbox_annotation_format"] = response_format_from_pydantic_model(
|
||||
bbox_annotation_format
|
||||
)
|
||||
|
||||
# OCR call with annotations
|
||||
response = client.ocr.process(**kwargs)
|
||||
return response
|
||||
|
||||
|
||||
def serialize_ocr_response(response: Any) -> Dict[str, Any]:
|
||||
"""Convert an OCR response object to a JSON-serializable dictionary.
|
||||
|
||||
The Mistral OCR API returns Pydantic model objects that need to be
|
||||
converted to plain dictionaries for JSON storage or further processing.
|
||||
This function handles various response formats from different versions
|
||||
of the Mistral SDK.
|
||||
|
||||
Args:
|
||||
response: OCR response object from Mistral API. Can be any object
|
||||
that has model_dump(), dict(), or json() method.
|
||||
|
||||
Returns:
|
||||
A dictionary representation of the OCR response, suitable for:
|
||||
- JSON serialization with json.dumps()
|
||||
- Storage in files (output/<doc>/<doc>.json)
|
||||
- Further processing in the pipeline
|
||||
|
||||
The dictionary typically contains:
|
||||
- pages: List of page data with text content
|
||||
- images: Base64-encoded images (if requested)
|
||||
- model: OCR model used
|
||||
- usage: Token/page usage statistics
|
||||
|
||||
Raises:
|
||||
TypeError: If the response object cannot be serialized using any
|
||||
of the supported methods (model_dump, dict, json).
|
||||
|
||||
Example:
|
||||
>>> # Assuming response is from run_ocr()
|
||||
>>> ocr_dict = serialize_ocr_response(response) # doctest: +SKIP
|
||||
>>> import json
|
||||
>>> with open("ocr_result.json", "w") as f:
|
||||
... json.dump(ocr_dict, f, indent=2) # doctest: +SKIP
|
||||
|
||||
>>> # Access page count
|
||||
>>> num_pages = len(ocr_dict["pages"]) # doctest: +SKIP
|
||||
|
||||
Note:
|
||||
This function tries multiple serialization methods in order of
|
||||
preference:
|
||||
1. model_dump() - Pydantic v2 (preferred)
|
||||
2. dict() - Pydantic v1 compatibility
|
||||
3. json() - Fallback for other Pydantic models
|
||||
"""
|
||||
if hasattr(response, "model_dump"):
|
||||
result: Dict[str, Any] = response.model_dump()
|
||||
return result
|
||||
|
||||
if hasattr(response, "dict"):
|
||||
result = response.dict()
|
||||
return result
|
||||
|
||||
if hasattr(response, "json"):
|
||||
result = json.loads(response.json())
|
||||
return result
|
||||
|
||||
raise TypeError("Réponse OCR non sérialisable")
|
||||
|
||||
|
||||
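The module docstring recommends caching OCR results under `output/<doc>/<doc>.json` to avoid paying for the same document twice. One way this could look is sketched below, assuming a hypothetical `cached_ocr` helper (not in the repository) and a zero-argument `run` callable that returns an already serialized dictionary, for instance a small wrapper around `run_ocr` followed by `serialize_ocr_response`.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def cached_ocr(
    doc_name: str,
    output_root: Path,
    run: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the cached OCR result for doc_name, computing it at most once.

    Hypothetical helper: the cache layout follows output/<doc>/<doc>.json.
    """
    cache_path = output_root / doc_name / f"{doc_name}.json"
    if cache_path.exists():
        # Cache hit: no API call, no cost.
        return json.loads(cache_path.read_text(encoding="utf-8"))

    result = run()  # Cache miss: this is where the paid OCR call would happen.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(
        json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return result
```

Passing the OCR step in as a callable keeps the helper independent of the Mistral SDK, which also makes it straightforward to exercise with a stub.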
55
generations/library_rag/utils/ocr_schemas.py
Normal file
@@ -0,0 +1,55 @@
|
||||
"""Pydantic schemas for structured extraction via OCR with annotations.
|
||||
|
||||
Used with the document_annotation_format and bbox_annotation_format parameters of the Mistral API.
|
||||
"""
|
||||
|
||||
from typing import List, Optional
|
||||
from pydantic import BaseModel, Field
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class TocEntryType(str, Enum):
|
||||
"""Type d'entrée de table des matières."""
|
||||
CHAPTER = "chapter"
|
||||
SECTION = "section"
|
||||
SUBSECTION = "subsection"
|
||||
PREAMBLE = "preamble"
|
||||
APPENDIX = "appendix"
|
||||
|
||||
|
||||
class TocEntry(BaseModel):
|
||||
"""Entrée de table des matières avec hiérarchie."""
|
||||
title: str = Field(..., description="Titre exact de la section tel qu'il apparaît dans la table des matières")
|
||||
page_number: int = Field(..., description="Numéro de page réel tel qu'imprimé/affiché dans le livre (PAS l'index séquentiel du PDF, mais le numéro visible sur la page elle-même)")
|
||||
level: int = Field(..., description="""Niveau hiérarchique détecté VISUELLEMENT dans la mise en page de la table des matières:
|
||||
- level=1 si le titre est aligné à gauche SANS indentation (titres principaux)
|
||||
- level=2 si le titre a une PETITE indentation ou est légèrement décalé vers la droite
|
||||
- level=3 si le titre a une DOUBLE indentation ou est très décalé vers la droite
|
||||
Regardez attentivement l'alignement horizontal et les espaces avant chaque titre pour déterminer le niveau.""")
|
||||
entry_type: TocEntryType = Field(default=TocEntryType.SECTION, description="Type d'entrée: 'preamble' pour préfaces/introductions, 'chapter' pour chapitres, 'section' pour sections, 'subsection' pour sous-sections, 'appendix' pour annexes")
|
||||
parent_title: Optional[str] = Field(None, description="Si level > 1, indiquer le titre du parent direct (l'entrée de level=1 sous laquelle cette entrée est indentée)")
|
||||
|
||||
|
||||
class DocumentTOC(BaseModel):
|
||||
"""Table des matières complète du document."""
|
||||
entries: List[TocEntry] = Field(..., description="""Liste COMPLÈTE de TOUTES les entrées de la table des matières dans l'ordre d'apparition.
|
||||
IMPORTANT : Analysez attentivement l'indentation/alignement horizontal de chaque titre pour assigner le bon niveau hiérarchique:
|
||||
- Les titres alignés à gauche (non indentés) = level 1
|
||||
- Les titres légèrement indentés/décalés vers la droite = level 2 (sous-sections du titre level 1 précédent)
|
||||
- Les titres avec double indentation = level 3 (sous-sections du titre level 2 précédent)
|
||||
Chaque entrée doit avoir son vrai numéro de page tel qu'imprimé dans le livre.""")
|
||||
has_explicit_toc: bool = Field(..., description="Le document contient-il une table des matières explicite et visible ? (généralement en début de document)")
|
||||
toc_page_numbers: List[int] = Field(..., description="Liste des numéros de pages où se trouve la table des matières (généralement pages 2-5)")
|
||||
|
||||
|
||||
class DocumentMetadata(BaseModel):
|
||||
"""Métadonnées enrichies du document."""
|
||||
title: str = Field(..., description="Titre complet du document")
|
||||
author: str = Field(..., description="Auteur principal du document")
|
||||
languages: List[str] = Field(..., description="Liste des langues présentes dans le document (codes ISO 639-1, ex: ['fr', 'en'])")
|
||||
summary: str = Field(..., description="Résumé du document en 2-3 phrases maximum")
|
||||
collection: Optional[str] = Field(None, description="Nom de la collection ou série éditoriale")
|
||||
publisher: Optional[str] = Field(None, description="Nom de l'éditeur")
|
||||
year: Optional[int] = Field(None, description="Année de publication")
|
||||
total_pages: int = Field(..., description="Nombre total de pages dans le document")
|
||||
toc: DocumentTOC = Field(..., description="Table des matières structurée avec hiérarchie et numéros de page réels")
|
||||
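`TocEntryType` mixes `str` into `Enum`, which is what allows a raw string from the API to be mapped onto a member and lets the member be serialized like an ordinary string. The sketch below illustrates that behavior with the standard library only; it re-declares a trimmed copy of the enum for illustration (an assumption made to keep the example self-contained, not the module's full definition).

```python
import json
from enum import Enum


class TocEntryType(str, Enum):
    """Trimmed, illustrative copy of the enum from ocr_schemas.py."""
    CHAPTER = "chapter"
    SECTION = "section"


# A raw string value maps onto the corresponding enum member.
parsed = TocEntryType("chapter")

# Because the enum subclasses str, members compare equal to plain strings
# and json.dumps can serialize them without a custom encoder.
payload = json.dumps({"type": parsed, "fallback": TocEntryType.SECTION.value})
```

Here `parsed` is the `CHAPTER` member itself, and `payload` contains the plain values rather than enum reprs.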
1439
generations/library_rag/utils/pdf_pipeline.py
Normal file
File diff suppressed because it is too large
31
generations/library_rag/utils/pdf_uploader.py
Normal file
@@ -0,0 +1,31 @@
"""Upload PDF files to the Mistral API."""

from mistralai import Mistral


def upload_pdf(client: Mistral, file_bytes: bytes, filename: str) -> str:
    """Upload a PDF to Mistral and return the signed URL.

    Args:
        client: Authenticated Mistral client
        file_bytes: Binary content of the PDF file
        filename: Name of the file

    Returns:
        Signed URL of the uploaded document
    """
    # Upload the file
    uploaded = client.files.upload(
        file={
            "file_name": filename,
            "content": file_bytes,
        },
        purpose="ocr",
    )

    # Retrieve the signed URL
    signed = client.files.get_signed_url(file_id=uploaded.id)

    return signed.url
382
generations/library_rag/utils/toc_enricher.py
Normal file
@@ -0,0 +1,382 @@
"""TOC Enrichment Module for Chunk Metadata Enhancement.

This module provides functions to enrich chunk metadata with hierarchical
information from the table of contents (TOC). It matches chunks to their
corresponding TOC entries and extracts:
- Full hierarchical paths (e.g., "Peirce: CP 1.628 > 628. It is...")
- Chapter titles
- Canonical academic references (e.g., "CP 1.628", "Ménon 80a")

The enrichment happens before Weaviate ingestion to ensure chunks have
complete metadata for rigorous academic citation.

Usage:
    >>> from utils.toc_enricher import enrich_chunks_with_toc
    >>> enriched_chunks = enrich_chunks_with_toc(chunks, toc, hierarchy)

See Also:
    - utils.types: FlatTOCEntryEnriched type definition
    - utils.weaviate_ingest: Integration point for enrichment
"""

from __future__ import annotations

import logging
import re
from typing import Any, Dict, List, Optional

from .types import FlatTOCEntryEnriched

logger = logging.getLogger(__name__)


def flatten_toc_with_paths(
    toc: List[Dict[str, Any]],
    hierarchy: Dict[str, Any],
) -> List[FlatTOCEntryEnriched]:
    """Flatten hierarchical or flat TOC and build full paths with metadata.

    Handles both hierarchical TOCs (with 'children' keys) and flat TOCs
    (where parent-child relationships are inferred from 'level' field).

    Traverses the TOC structure and creates enriched flat entries with:
    - Full hierarchical path (e.g., "Peirce: CP 1.628 > 628. It is...")
    - Canonical reference extraction (e.g., "CP 1.628")
    - Chapter title tracking (first level 1 ancestor)
    - Parent title list for context

    Args:
        toc: TOC structure with 'title' and 'level' fields, optionally 'children'
        hierarchy: Document hierarchy (currently unused, reserved for future)

    Returns:
        List of enriched flat TOC entries with full metadata.

    Example:
        >>> toc = [
        ...     {"title": "Peirce: CP 1.628", "level": 1},
        ...     {"title": "628. It is the instincts...", "level": 2}
        ... ]
        >>> flat = flatten_toc_with_paths(toc, {})
        >>> flat[1]["full_path"]
        'Peirce: CP 1.628 > 628. It is the instincts...'
        >>> flat[1]["canonical_ref"]
        'CP 1.628'
    """
    flat_toc: List[FlatTOCEntryEnriched] = []

    # Check if TOC is hierarchical (has children) or flat (level-based)
    is_hierarchical = any("children" in entry for entry in toc if entry)

    if is_hierarchical:
        # Original recursive approach for hierarchical TOCs
        def traverse(
            entries: List[Dict[str, Any]],
            parent_titles: List[str],
            current_chapter: str,
            current_canonical: Optional[str],
        ) -> None:
            """Recursively traverse TOC entries and build flat list."""
            for entry in entries:
                title = entry.get("title", "")
                level = entry.get("level", 0)
                children = entry.get("children", [])

                # Build full path from parents + current title
                full_path_parts = parent_titles + [title]
                full_path = " > ".join(full_path_parts)

                # Extract canonical reference if present in title
                canonical_ref = current_canonical
                cp_match = re.search(r'CP\s+(\d+\.\d+)', title)
                stephanus_match = re.search(r'(\w+\s+\d+[a-z])', title)

                if cp_match:
                    canonical_ref = f"CP {cp_match.group(1)}"
                elif stephanus_match:
                    canonical_ref = stephanus_match.group(1)

                # Update chapter title when entering level 1
                chapter_title = current_chapter
                if level == 1:
                    chapter_title = title

                # Create enriched entry
                enriched_entry: FlatTOCEntryEnriched = {
                    "title": title,
                    "level": level,
                    "full_path": full_path,
                    "chapter_title": chapter_title,
                    "canonical_ref": canonical_ref,
                    "parent_titles": parent_titles.copy(),
                    "index_in_flat_list": len(flat_toc),
                }
                flat_toc.append(enriched_entry)

                # Recursively process children
                if children:
                    traverse(
                        children,
                        parent_titles + [title],
                        chapter_title,
                        canonical_ref,
                    )

        traverse(toc, [], "", None)
    else:
        # New iterative approach for flat TOCs (infer hierarchy from levels)
        parent_stack: List[Dict[str, Any]] = []  # Stack of (level, title, canonical_ref)
        current_chapter = ""
        current_canonical: Optional[str] = None

        for entry in toc:
            title = entry.get("title", "")
            level = entry.get("level", 1)

            # Pop parents that are at same or deeper level
            while parent_stack and parent_stack[-1]["level"] >= level:
                parent_stack.pop()

            # Build parent titles list
            parent_titles = [p["title"] for p in parent_stack]

            # Build full path
            full_path_parts = parent_titles + [title]
            full_path = " > ".join(full_path_parts)

            # Extract canonical reference if present in title
            cp_match = re.search(r'CP\s+(\d+\.\d+)', title)
            stephanus_match = re.search(r'(\w+\s+\d+[a-z])', title)

            if cp_match:
                current_canonical = f"CP {cp_match.group(1)}"
            elif stephanus_match:
                current_canonical = stephanus_match.group(1)
            elif level == 1:
                # Reset canonical ref at level 1 if none found
                current_canonical = None

            # Inherit canonical ref from parent if not found
            if not current_canonical and parent_stack:
                current_canonical = parent_stack[-1].get("canonical_ref")

            # Update chapter title when at level 1
            if level == 1:
                current_chapter = title

            # Create enriched entry
            enriched_entry: FlatTOCEntryEnriched = {
                "title": title,
                "level": level,
                "full_path": full_path,
                "chapter_title": current_chapter,
                "canonical_ref": current_canonical,
                "parent_titles": parent_titles.copy(),
                "index_in_flat_list": len(flat_toc),
            }
            flat_toc.append(enriched_entry)

            # Add current entry to parent stack for next iteration
            parent_stack.append({
                "level": level,
                "title": title,
                "canonical_ref": current_canonical,
            })

    return flat_toc


def extract_paragraph_number(section_text: str) -> Optional[str]:
    """Extract paragraph number from section text.

    Handles various academic paragraph numbering formats:
    - "628. Text..." → "628"
    - "§42 Text..." → "42"
    - "80a. Text..." → "80a" (Stephanus pagination)
    - "CP 5.628. Text..." → "628"

    Args:
        section_text: Section title or path text

    Returns:
        Extracted paragraph number or None if not found.

    Example:
        >>> extract_paragraph_number("628. It is the instincts...")
        '628'
        >>> extract_paragraph_number("§42 On the nature of...")
        '42'
        >>> extract_paragraph_number("80a. SOCRATE: Sais-tu...")
        '80a'
    """
    if not section_text:
        return None

    # Pattern 1: Standard paragraph number at start "628. Text"
    match = re.match(r'^(\d+[a-z]?)\.\s', section_text)
    if match:
        return match.group(1)

    # Pattern 2: Section symbol "§42 Text"
    match = re.match(r'^§\s*(\d+[a-z]?)\s', section_text)
    if match:
        return match.group(1)

    # Pattern 3: CP reference "CP 5.628. Text" → extract paragraph only
    match = re.match(r'^CP\s+\d+\.(\d+)\.\s', section_text)
    if match:
        return match.group(1)

    return None


def find_matching_toc_entry(
    chunk: Dict[str, Any],
    flat_toc: List[FlatTOCEntryEnriched],
) -> Optional[FlatTOCEntryEnriched]:
    """Find matching TOC entry for a chunk using multi-strategy matching.

    Matching strategies (in priority order):
    1. **Exact text match**: chunk.section == toc.title
    2. **Paragraph number match**: Extract paragraph number from both and compare
    3. **Proximity match**: Use order_index to find nearest TOC entry

    Args:
        chunk: Chunk dict with 'section', 'sectionPath', 'order_index' fields
        flat_toc: Flattened TOC with enriched metadata

    Returns:
        Best matching TOC entry or None if no match found.

    Example:
        >>> chunk = {"section": "628. It is the instincts...", "order_index": 42}
        >>> toc_entry = find_matching_toc_entry(chunk, flat_toc)
        >>> toc_entry["canonical_ref"]
        'CP 1.628'
    """
    if not flat_toc:
        return None

    # Fall back to 'sectionPath' when 'section' is missing or empty
    chunk_section = chunk.get("section") or chunk.get("sectionPath", "")
    if not chunk_section:
        return None

    # Strategy 1: Exact title match
    for entry in flat_toc:
        if entry["title"] == chunk_section:
            return entry

    # Strategy 2: Paragraph number match
    chunk_para = extract_paragraph_number(chunk_section)
    if chunk_para:
        # Look for matching paragraph in level 2 entries (actual content)
        for entry in flat_toc:
            if entry["level"] == 2:
                entry_para = extract_paragraph_number(entry["title"])
                if entry_para == chunk_para:
                    # Additional text similarity check to disambiguate
                    # Get first significant word from chunk section
                    chunk_words = [w for w in chunk_section.split() if len(w) > 3]
                    entry_words = [w for w in entry["title"].split() if len(w) > 3]

                    if chunk_words and entry_words:
                        # Check if first significant words match
                        if chunk_words[0].lower() in entry["title"].lower():
                            return entry
                    else:
                        # No text to compare, return paragraph match
                        return entry

    # Strategy 3: Proximity match using order_index
    chunk_order = chunk.get("order_index")
    if chunk_order is not None and flat_toc:
        # Find TOC entry with closest index_in_flat_list to chunk order
        # This is a fallback heuristic assuming TOC and chunks follow similar order
        closest_entry = min(
            flat_toc,
            key=lambda e: abs(e["index_in_flat_list"] - chunk_order),
        )
        return closest_entry

    return None


def enrich_chunks_with_toc(
    chunks: List[Dict[str, Any]],
    toc: List[Dict[str, Any]],
    hierarchy: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Enrich chunks with hierarchical metadata from TOC.

    Main orchestration function that:
    1. Checks if TOC is available (guard clause)
    2. Flattens TOC once for efficiency
    3. Matches each chunk to its TOC entry
    4. Updates chunk metadata: sectionPath, chapterTitle, canonicalReference

    Args:
        chunks: List of chunk dicts from pdf_pipeline
        toc: Hierarchical TOC structure (may be empty)
        hierarchy: Document hierarchy dict (may be empty)

    Returns:
        List of chunks with enriched metadata (same objects, modified in place).
        If TOC is empty, returns chunks unchanged (no regression).

    Example:
        >>> chunks = [{"text": "...", "section": "628. It is..."}]
        >>> toc = [
        ...     {"title": "Peirce: CP 1.628", "level": 1, "children": [
        ...         {"title": "628. It is...", "level": 2, "children": []}
        ...     ]}
        ... ]
        >>> enriched = enrich_chunks_with_toc(chunks, toc, {})
        >>> enriched[0]["sectionPath"]
        'Peirce: CP 1.628 > 628. It is...'
        >>> enriched[0]["chapterTitle"]
        'Peirce: CP 1.628'
        >>> enriched[0]["canonicalReference"]
        'CP 1.628'
    """
    # Guard: If no TOC, return chunks unchanged (graceful fallback)
    if not toc:
        logger.info("No TOC available, skipping chunk enrichment")
        return chunks

    logger.info(f"Enriching {len(chunks)} chunks with TOC metadata...")

    # Flatten TOC once for efficient matching
    try:
        flat_toc = flatten_toc_with_paths(toc, hierarchy)
        logger.info(f"Flattened TOC: {len(flat_toc)} entries")
    except Exception as e:
        logger.error(f"Failed to flatten TOC: {e}")
        return chunks  # Fallback on error

    # Match each chunk to TOC entry and enrich
    enriched_count = 0
    for chunk in chunks:
        matching_entry = find_matching_toc_entry(chunk, flat_toc)

        if matching_entry:
            # Update sectionPath with full hierarchical path
            chunk["sectionPath"] = matching_entry["full_path"]

            # Update chapterTitle
            chunk["chapterTitle"] = matching_entry["chapter_title"]

            # Add canonicalReference if available
            if matching_entry["canonical_ref"]:
                chunk["canonicalReference"] = matching_entry["canonical_ref"]

            enriched_count += 1

    if chunks:
        logger.info(
            f"Enriched {enriched_count}/{len(chunks)} chunks "
            f"({100 * enriched_count / len(chunks):.1f}%)"
        )
    else:
        logger.info("No chunks to enrich")

    return chunks
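The flat-TOC branch of `flatten_toc_with_paths` infers parents from the `level` field alone, using a stack of ancestors. The following is a stripped-down sketch of that stack technique, not the module's implementation: the name `build_paths` is hypothetical, and canonical-reference and chapter tracking are left out on purpose.

```python
from typing import Any, Dict, List


def build_paths(toc: List[Dict[str, Any]]) -> List[str]:
    """Return one 'A > B > C' path per entry of a flat, level-annotated TOC."""
    stack: List[Dict[str, Any]] = []  # ancestors of the current entry
    paths: List[str] = []
    for entry in toc:
        level = entry.get("level", 1)
        # Drop entries at the same or a deeper level: they cannot be parents.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        titles = [p["title"] for p in stack] + [entry["title"]]
        paths.append(" > ".join(titles))
        # The current entry becomes a candidate parent for what follows.
        stack.append({"level": level, "title": entry["title"]})
    return paths


toc = [
    {"title": "Part I", "level": 1},
    {"title": "Chapter 1", "level": 2},
    {"title": "Section 1.1", "level": 3},
    {"title": "Chapter 2", "level": 2},
    {"title": "Part II", "level": 1},
]
paths = build_paths(toc)
```

Each entry is pushed and popped at most once, so the pass stays linear in the number of TOC entries.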
260
generations/library_rag/utils/toc_extractor.py
Normal file
@@ -0,0 +1,260 @@
"""Table of Contents (TOC) extraction using Mistral OCR with annotations.

This module is the **primary entry point** for TOC extraction in the Library RAG
pipeline. It provides intelligent routing between two extraction strategies:

1. **Visual (bbox) Analysis** (default, recommended): Uses bounding box coordinates
   to detect indentation and hierarchy based on horizontal positioning.
2. **Semantic (annotation) Analysis**: Uses Mistral's document_annotation_format
   for structured metadata and TOC extraction.

The visual approach is more reliable for philosophical texts with complex
hierarchies (parts, chapters, sections, subsections).

Extraction Strategies:
    ┌─────────────────────────────────────────────────────────────┐
    │  extract_toc_from_annotations(use_visual_bbox=True)         │
    │                         ↓ (default)                         │
    │  toc_extractor_visual.py → X-coordinate based hierarchy     │
    │                                                             │
    │  extract_toc_from_annotations(use_visual_bbox=False)        │
    │                         ↓                                   │
    │  DocumentMetadata Pydantic schema → Structured extraction   │
    └─────────────────────────────────────────────────────────────┘

Cost Considerations:
    - Annotated OCR: ~0.003€/page (3x standard OCR cost)
    - Only first N pages are processed (default: 8)
    - Total cost: max_toc_pages × 0.003€

Output Structure:
    {
        "success": bool,
        "metadata": {...},       # Document metadata
        "toc": [...],            # Hierarchical TOC (nested children)
        "toc_flat": [...],       # Flat list with levels
        "cost_ocr_annotated": float
    }

Example:
    >>> from pathlib import Path
    >>> from utils.toc_extractor import extract_toc_from_annotations
    >>>
    >>> # Extract TOC using visual analysis (recommended)
    >>> result = extract_toc_from_annotations(
    ...     pdf_path=Path("input/philosophy_book.pdf"),
    ...     max_toc_pages=8,
    ...     use_visual_bbox=True  # default
    ... )
    >>> if result["success"]:
    ...     for entry in result["toc"]:
    ...         print(f"{entry['title']} (p.{entry['page']})")

Functions:
    - extract_toc_from_annotations(): Main entry point with strategy routing
    - build_hierarchical_toc(): Converts flat TOC entries to nested structure
    - map_toc_to_content(): Associates TOC entries with document content

See Also:
    - utils.toc_extractor_visual: Visual/bbox-based extraction (default)
    - utils.toc_extractor_markdown: Markdown indentation-based extraction
    - utils.llm_toc: LLM-based TOC extraction (alternative approach)
"""

import json
import logging
from typing import Any, Dict, List, Optional, Union, cast
from pathlib import Path

from .ocr_schemas import DocumentMetadata, TocEntry
from .ocr_processor import run_ocr_with_annotations
from .mistral_client import create_client

logger: logging.Logger = logging.getLogger(__name__)


# Dict-based alias for hierarchical TOC nodes (a plain Dict subclass, not a TypedDict)
class TOCNode(Dict[str, Any]):
    """Type alias for TOC node structure with title, page, level, type, children."""
    pass


def extract_toc_from_annotations(
    pdf_path: Path,
    api_key: Optional[str] = None,
    max_toc_pages: int = 8,
    use_visual_bbox: bool = True,  # NEW: use visual analysis by default
) -> Dict[str, Any]:
    """Extract the structured TOC via OCR with annotations.

    Cost: 3€/1000 pages for annotated pages (vs 1€/1000 for basic OCR).

    Args:
        pdf_path: Path to the PDF file
        api_key: Mistral API key (optional, otherwise loaded from .env)
        max_toc_pages: Max number of pages to annotate (default 8, API limit for document_annotation)
        use_visual_bbox: If True, use visual analysis of bounding boxes (more reliable)

    Returns:
        Dict with:
        - success: bool
        - metadata: dict with enriched metadata
        - toc: hierarchical list [{title, page, level, type, children}]
        - toc_flat: flat list [{title, page_number, level, entry_type, parent_title}]
        - cost_ocr_annotated: float (cost in €)
        - error: str (on failure)
    """
    # If requested, use the visual (bbox) approach
    if use_visual_bbox:
        logger.info("Using visual (bbox) analysis for TOC extraction")
        from .toc_extractor_visual import extract_toc_with_visual_analysis
        return cast(Dict[str, Any], extract_toc_with_visual_analysis(pdf_path, api_key, max_toc_pages))

    # Otherwise, continue with the semantic approach (document_annotation_format)
    try:
        client = create_client(api_key)
        pdf_bytes = pdf_path.read_bytes()
    except Exception as e:
        logger.error(f"Error initializing client/reading PDF: {e}")
        return {"success": False, "error": f"Initialization failed: {str(e)}"}

    # Phase 1: annotate the first pages to extract TOC + metadata
    logger.info(f"Extracting TOC with annotations on the first {max_toc_pages} pages")

    try:
        annotated_response = run_ocr_with_annotations(
            client=client,
            file_bytes=pdf_bytes,
            filename=pdf_path.name,
            include_images=False,  # Images are not needed for the TOC
            document_annotation_format=DocumentMetadata,
            pages=list(range(max_toc_pages)),  # Pages 0 to max_toc_pages-1
        )
    except Exception as e:
        logger.error(f"Error calling OCR with annotations: {e}")
        return {"success": False, "error": f"OCR call failed: {str(e)}"}

    # Extract the document annotations
    doc_annotation = getattr(annotated_response, "document_annotation", None)

    if not doc_annotation:
        return {"success": False, "error": "No annotation returned by the API"}

    # Convert to a dictionary
    try:
        if isinstance(doc_annotation, str):
            metadata_dict = json.loads(doc_annotation)
        else:
            metadata_dict = doc_annotation
    except Exception as e:
        logger.error(f"Error parsing annotations: {e}")
        return {"success": False, "error": f"Annotation parsing failed: {str(e)}"}

    # Validate with Pydantic
    try:
        metadata = DocumentMetadata(**metadata_dict)
        toc_entries = metadata.toc.entries

        logger.info(f"TOC extracted: {len(toc_entries)} entries")

        # Build the hierarchical TOC
        hierarchical_toc = build_hierarchical_toc(toc_entries)

        return {
            "success": True,
            "metadata": metadata.model_dump(),
            "toc": hierarchical_toc,
            "toc_flat": [entry.model_dump() for entry in toc_entries],
            "cost_ocr_annotated": max_toc_pages * 0.003,  # 3€/1000 pages
        }
    except Exception as e:
        logger.error(f"Error validating annotations: {e}")
        return {"success": False, "error": f"Pydantic validation failed: {str(e)}"}


def build_hierarchical_toc(entries: List[TocEntry]) -> List[Dict[str, Any]]:
    """Build a hierarchical TOC from flat entries with levels.

    Uses a stack to manage the hierarchy based on levels.

    Args:
        entries: List of TocEntry items with level (1=root, 2=child of 1, etc.)

    Returns:
        Hierarchical TOC with structure [{title, page, level, type, children: [...]}]
    """
    if not entries:
        return []

    toc: List[Dict[str, Any]] = []
    stack: List[Dict[str, Any]] = []  # Stack tracking the current hierarchy

    for entry in entries:
        node: Dict[str, Any] = {
            "title": entry.title,
            "page": entry.page_number,
            "level": entry.level,
            "type": entry.entry_type.value,
            "children": [],
        }

        # Walk back up the stack to the appropriate parent
        # An element of level N must be a child of the last element of level < N
        while stack and stack[-1]["level"] >= entry.level:
            stack.pop()

        if stack:
            # Add as a child of the last element on the stack
            children: List[Dict[str, Any]] = stack[-1]["children"]
            children.append(node)
        else:
            # Add at the root of the TOC
            toc.append(node)

        # Push this node for the next iterations
        stack.append(node)

    return toc


def map_toc_to_content(
    toc_entries: List[TocEntry],
    all_pages_markdown: str,
) -> Dict[str, str]:
    """Associate TOC entries with the actual content of the document.

    Uses the real page numbers to split the content by section.

    Args:
        toc_entries: TOC entries with real page numbers
        all_pages_markdown: Full markdown of the document with <!-- Page N --> markers

    Returns:
        Mapping {section_title: content_text}
    """
    # Split the markdown on page comments
    pages: List[str] = all_pages_markdown.split("<!-- Page ")

    content_map: Dict[str, str] = {}

    for i, entry in enumerate(toc_entries):
        start_page: int = entry.page_number

        # Find the end page (page number of the next entry, or end of the document)
        end_page: int
        if i < len(toc_entries) - 1:
            end_page = toc_entries[i + 1].page_number
        else:
            end_page = len(pages)  # Up to the end

        # Extract the content between start_page and end_page
        section_content: List[str] = []
        for page_idx in range(start_page, end_page):
            if page_idx < len(pages):
                # Strip the page comment and extract the content
                page_text: str = pages[page_idx].split("-->", 1)[-1].strip()
                section_content.append(page_text)

        content_map[entry.title] = "\n\n".join(section_content)

    return content_map
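`map_toc_to_content` indexes directly into the result of `split("<!-- Page ")`, so it helps to see what that split produces. The sketch below uses a made-up three-page document and assumes the markers are numbered consecutively from 1, which the source does not confirm; under that assumption, index 0 holds whatever precedes the first marker and index `n` holds page `n`. The helper name `page_text` is hypothetical.

```python
markdown = (
    "<!-- Page 1 -->\nTitle page\n"
    "<!-- Page 2 -->\nPreface text\n"
    "<!-- Page 3 -->\nChapter one text\n"
)

# Index 0 is the text before the first marker (empty here), so with
# markers numbered consecutively from 1, pages[n] is the chunk for page n.
pages = markdown.split("<!-- Page ")


def page_text(chunk: str) -> str:
    """Drop the remainder of the marker ('N -->') and surrounding whitespace."""
    return chunk.split("-->", 1)[-1].strip()


# Content for a section spanning pages [2, 4), mirroring range(start, end).
section = "\n\n".join(
    page_text(pages[i]) for i in range(2, 4) if i < len(pages)
)
```

Note that the end page is exclusive, so two TOC entries that start on the same page would give the first one an empty range under this scheme.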
303
generations/library_rag/utils/toc_extractor_markdown.py
Normal file
@@ -0,0 +1,303 @@
"""TOC extraction via Markdown indentation analysis.

This module provides a **cost-free** TOC extraction strategy that works on
already-generated Markdown text. Unlike the OCR annotation approach, this
method doesn't require additional API calls.

Strategy:
    1. Search for "Table des matières" heading in the first N lines
    2. Parse lines matching pattern: "Title.....Page" or "Title Page"
    3. Detect hierarchy from leading whitespace (indentation)
    4. Build nested TOC structure using stack-based algorithm

When to Use:
    - When OCR has already been performed (markdown available)
    - When cost optimization is critical (no additional API calls)
    - For documents with clear indentation in the TOC

Limitations:
    - Requires French "Table des matières" header (can be extended)
    - Indentation detection may be less accurate than visual/bbox analysis
    - Only works if OCR preserved whitespace accurately

Indentation Levels:
    - 0-2 spaces: Level 1 (main chapters/parts)
    - 3-6 spaces: Level 2 (sections)
    - 7+ spaces: Level 3 (subsections)

Output Structure:
    {
        "success": bool,
        "toc": [...],              # Hierarchical TOC
        "toc_flat": [...],         # Flat entries with levels
        "cost_ocr_annotated": 0.0, # No additional cost
        "method": "markdown_indentation"
    }

Example:
    >>> from utils.toc_extractor_markdown import extract_toc_from_markdown
    >>>
    >>> markdown = '''
    ... # Table des matières
    ... Introduction.............................5
    ... Première partie..........................10
    ...     Chapitre 1............................15
    ...     Chapitre 2............................25
    ... Deuxième partie..........................50
    ... '''
    >>> result = extract_toc_from_markdown(markdown)
    >>> if result["success"]:
    ...     print(f"Found {len(result['toc_flat'])} entries")
    Found 5 entries

Functions:
    - extract_toc_from_markdown(): Main extraction from markdown text
    - build_hierarchy(): Converts flat entries to nested structure

See Also:
    - utils.toc_extractor: Main entry point (routes to visual by default)
    - utils.toc_extractor_visual: More accurate X-position based extraction
"""

from __future__ import annotations

import logging
import re
from typing import Any, Dict, List, Optional, TypedDict, Union
from pathlib import Path

logger = logging.getLogger(__name__)


# Type definitions for internal data structures
class MarkdownTOCEntryRaw(TypedDict):
    """Raw TOC entry extracted from markdown with indentation info."""
    title: str
    page_number: int
    level: int
    leading_spaces: int


class MarkdownTOCNode(TypedDict):
    """Hierarchical TOC node with children."""
    title: str
    page: int
    level: int
    type: str
    children: List[MarkdownTOCNode]


class MarkdownTOCFlatEntry(TypedDict):
    """Flat TOC entry with parent information."""
    title: str
    page_number: int
    level: int
    entry_type: str
    parent_title: Optional[str]


class MarkdownTOCResultSuccess(TypedDict):
    """Successful TOC extraction result."""
    success: bool  # Always True
    metadata: Dict[str, Any]
    toc: List[MarkdownTOCNode]
    toc_flat: List[MarkdownTOCFlatEntry]
    cost_ocr_annotated: float
    method: str


class MarkdownTOCResultError(TypedDict):
    """Failed TOC extraction result."""
    success: bool  # Always False
    error: str


# Union type for function return
MarkdownTOCResult = Union[MarkdownTOCResultSuccess, MarkdownTOCResultError]


|
||||
def extract_toc_from_markdown(
|
||||
markdown_text: str,
|
||||
max_lines: int = 200,
|
||||
) -> MarkdownTOCResult:
|
||||
"""Extract table of contents by analyzing raw markdown text.
|
||||
|
||||
Detects hierarchy by counting leading spaces (indentation) at the
|
||||
beginning of each line. This is a cost-free alternative to OCR
|
||||
annotation-based extraction.
|
||||
|
||||
Args:
|
||||
markdown_text: Complete markdown text of the document.
|
||||
max_lines: Maximum number of lines to analyze (searches TOC at start).
|
||||
|
||||
Returns:
|
||||
Dictionary with hierarchical TOC structure. On success, includes:
|
||||
- success: True
|
||||
- metadata: Empty dict (for consistency with other extractors)
|
||||
- toc: Hierarchical nested TOC structure
|
||||
- toc_flat: Flat list of entries with levels
|
||||
- cost_ocr_annotated: 0.0 (no additional cost)
|
||||
- method: "markdown_indentation"
|
||||
On failure, includes:
|
||||
- success: False
|
||||
- error: Error message string
|
||||
|
||||
Example:
|
||||
>>> markdown = '''
|
||||
... # Table des matières
|
||||
... Introduction.....5
|
||||
... Part One........10
|
||||
... Chapter 1.....15
|
||||
... '''
|
||||
>>> result = extract_toc_from_markdown(markdown)
|
||||
>>> if result["success"]:
|
||||
... print(len(result["toc_flat"]))
|
||||
3
|
||||
"""
|
||||
logger.info("Extracting TOC from markdown (indentation analysis)")
|
||||
|
||||
lines: List[str] = markdown_text.split('\n')[:max_lines]
|
||||
|
||||
# Find "Table des matières" section
|
||||
toc_start: Optional[int] = None
|
||||
for i, line in enumerate(lines):
|
||||
if re.search(r'table\s+des\s+mati[èe]res', line, re.IGNORECASE):
|
||||
toc_start = i + 1
|
||||
logger.info(f"TOC found at line {i}")
|
||||
break
|
||||
|
||||
if toc_start is None:
|
||||
logger.warning("No table of contents found in the markdown")
|
||||
return MarkdownTOCResultError(
|
||||
success=False,
|
||||
error="Table des matières introuvable"
|
||||
)
|
||||
|
||||
# Extract TOC entries
|
||||
entries: List[MarkdownTOCEntryRaw] = []
|
||||
|
||||
|
||||
for line in lines[toc_start:toc_start + 100]: # Max 100 lines of TOC
|
||||
line_stripped: str = line.strip()
|
||||
if not line_stripped or line_stripped.startswith('#') or line_stripped.startswith('---'):
|
||||
continue
|
||||
|
||||
# Search for pattern "Title.....Page"
|
||||
# Must analyze line BEFORE strip() to count leading spaces
|
||||
original_line: str = line  # `line` is still unstripped here, so no lookup in `lines` is needed
|
||||
leading_spaces: int = len(original_line) - len(original_line.lstrip())
|
||||
|
||||
# Primary pattern: title, dotted leaders (two or more dots), then page number
|
||||
match: Optional[re.Match[str]] = re.match(r'^(.+?)\s*\.{2,}\s*(\d+)\s*$', line_stripped)
|
||||
if not match:
|
||||
# Try without dotted leaders
|
||||
match = re.match(r'^(.+?)\s+(\d+)\s*$', line_stripped)
|
||||
|
||||
if match:
|
||||
title: str = match.group(1).strip()
|
||||
page: int = int(match.group(2))
|
||||
|
||||
# Ignore lines too short or that don't look like titles
|
||||
if len(title) < 3 or title.isdigit():
|
||||
continue
|
||||
|
||||
# Determine level based on indentation
|
||||
# 0-2 spaces = level 1
|
||||
# 3-6 spaces = level 2
|
||||
# 7+ spaces = level 3
|
||||
level: int
|
||||
if leading_spaces <= 2:
|
||||
level = 1
|
||||
elif leading_spaces <= 6:
|
||||
level = 2
|
||||
else:
|
||||
level = 3
|
||||
|
||||
entries.append(MarkdownTOCEntryRaw(
|
||||
title=title,
|
||||
page_number=page,
|
||||
level=level,
|
||||
leading_spaces=leading_spaces,
|
||||
))
|
||||
|
||||
logger.debug(f" '{title}' → {leading_spaces} spaces → level {level} (page {page})")
|
||||
|
||||
if not entries:
|
||||
logger.warning("No TOC entries extracted")
|
||||
return MarkdownTOCResultError(
|
||||
success=False,
|
||||
error="Aucune entrée TOC trouvée"
|
||||
)
|
||||
|
||||
logger.info(f"✅ {len(entries)} entries extracted from markdown")
|
||||
|
||||
# Build hierarchy
|
||||
toc: List[MarkdownTOCNode] = build_hierarchy(entries)
|
||||
|
||||
return MarkdownTOCResultSuccess(
|
||||
success=True,
|
||||
metadata={},
|
||||
toc=toc,
|
||||
toc_flat=[
|
||||
MarkdownTOCFlatEntry(
|
||||
title=e["title"],
|
||||
page_number=e["page_number"],
|
||||
level=e["level"],
|
||||
entry_type="section",
|
||||
parent_title=None,
|
||||
)
|
||||
for e in entries
|
||||
],
|
||||
cost_ocr_annotated=0.0, # No additional cost, uses existing OCR
|
||||
method="markdown_indentation",
|
||||
)
|
||||
|
||||
|
||||
def build_hierarchy(entries: List[MarkdownTOCEntryRaw]) -> List[MarkdownTOCNode]:
|
||||
"""Build hierarchical structure from flat entries based on levels.
|
||||
|
||||
Uses a stack-based algorithm to construct nested TOC structure where
|
||||
entries with higher indentation become children of the previous
|
||||
less-indented entry.
|
||||
|
||||
Args:
|
||||
entries: List of raw TOC entries with title, page, and level.
|
||||
|
||||
Returns:
|
||||
Nested list of TOC nodes where each node contains children.
|
||||
|
||||
Example:
|
||||
>>> entries = [
|
||||
... {"title": "Part 1", "page_number": 1, "level": 1, "leading_spaces": 0},
|
||||
... {"title": "Chapter 1", "page_number": 5, "level": 2, "leading_spaces": 4},
|
||||
... ]
|
||||
>>> hierarchy = build_hierarchy(entries)
|
||||
>>> len(hierarchy[0]["children"])
|
||||
1
|
||||
"""
|
||||
toc: List[MarkdownTOCNode] = []
|
||||
stack: List[MarkdownTOCNode] = []
|
||||
|
||||
for entry in entries:
|
||||
node: MarkdownTOCNode = MarkdownTOCNode(
|
||||
title=entry["title"],
|
||||
page=entry["page_number"],
|
||||
level=entry["level"],
|
||||
type="section",
|
||||
children=[],
|
||||
)
|
||||
|
||||
# Pop from stack until we find a parent at lower level
|
||||
while stack and stack[-1]["level"] >= node["level"]:
|
||||
stack.pop()
|
||||
|
||||
if stack:
|
||||
# Add as child to top of stack
|
||||
stack[-1]["children"].append(node)
|
||||
else:
|
||||
# Add as root-level entry
|
||||
toc.append(node)
|
||||
|
||||
stack.append(node)
|
||||
|
||||
return toc
|
||||
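The per-line logic of `extract_toc_from_markdown` can be sketched in isolation as follows. This is a minimal illustration, not the module's code: the helper names `parse_toc_line` and `level_from_indent` are hypothetical, while the two regular expressions and the indentation thresholds (0-2, 3-6, 7+ spaces) are the ones used in the function above.

```python
import re
from typing import Optional, Tuple

# The two patterns the extractor tries, in the same order:
# first "Title.....12" (dotted leaders), then "Title 12" (whitespace only).
DOTTED = re.compile(r'^(.+?)\s*\.{2,}\s*(\d+)\s*$')
PLAIN = re.compile(r'^(.+?)\s+(\d+)\s*$')


def level_from_indent(leading_spaces: int) -> int:
    """Map indentation width to a TOC level (0-2 -> 1, 3-6 -> 2, 7+ -> 3)."""
    if leading_spaces <= 2:
        return 1
    if leading_spaces <= 6:
        return 2
    return 3


def parse_toc_line(line: str) -> Optional[Tuple[str, int, int]]:
    """Return (title, page, level) for a TOC line, or None if it is not one.

    Hypothetical helper: it mirrors the loop body of the extractor above.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith('#') or stripped.startswith('---'):
        return None
    # Indentation must be measured on the raw line, before stripping.
    leading_spaces = len(line) - len(line.lstrip())
    match = DOTTED.match(stripped) or PLAIN.match(stripped)
    if match is None:
        return None
    title = match.group(1).strip()
    # Reject fragments that are too short or purely numeric.
    if len(title) < 3 or title.isdigit():
        return None
    return title, int(match.group(2)), level_from_indent(leading_spaces)


if __name__ == "__main__":
    for raw in ["Introduction.....5", "    Part One........10", "        Chapter 1 15"]:
        print(parse_toc_line(raw))
```

Because the level comes only from the count of leading whitespace characters, this approach depends on the OCR output preserving indentation. That is the limitation the visual (X-coordinate) extractor is meant to address.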
512
generations/library_rag/utils/toc_extractor_visual.py
Normal file
@@ -0,0 +1,512 @@
|
||||
"""Visual TOC extraction using bounding box X-coordinate analysis.
|
||||
|
||||
This module provides the **most accurate** TOC extraction strategy for
|
||||
philosophical texts by analyzing the horizontal position (X-coordinate)
|
||||
of each TOC entry. This approach is more reliable than text indentation
|
||||
analysis because it directly measures visual layout.
|
||||
|
||||
How It Works:
|
||||
1. OCR with annotations extracts text + bounding box positions
|
||||
2. Pydantic schema (TocEntryBbox) captures title, page, and x_position
|
||||
3. X-coordinates are clustered to identify distinct indentation levels
|
||||
4. Hierarchy is built based on relative X-positions
|
||||
|
||||
X-Position Interpretation:
|
||||
The x_position is normalized between 0.0 (left edge) and 1.0 (right edge):
|
||||
|
||||
- x ≈ 0.05-0.12: Level 1 (no indentation, main parts/chapters)
|
||||
- x ≈ 0.13-0.22: Level 2 (small indentation, sections)
|
||||
- x ≈ 0.23-0.35: Level 3 (double indentation, subsections)
|
||||
|
||||
Sorted positions that lie less than 0.03 from the previous position are grouped into the same level.
|
||||
|
||||
Advantages over Markdown Analysis:
|
||||
- Works regardless of OCR whitespace accuracy
|
||||
- More reliable for complex hierarchies
|
||||
- Handles both printed and handwritten indentation
|
||||
|
||||
Cost:
|
||||
- Uses OCR with annotations: ~0.003€/page
|
||||
- Only processes first N pages (default: 8)
|
||||
|
||||
Pydantic Schemas:
|
||||
- TocEntryBbox: Single TOC entry with text, page_number, x_position
|
||||
- DocumentTocBbox: Container for list of entries
|
||||
|
||||
Output Structure:
|
||||
{
|
||||
"success": bool,
|
||||
"metadata": {...},
|
||||
"toc": [...], # Hierarchical TOC
|
||||
"toc_flat": [...], # Flat entries with levels
|
||||
"cost_ocr_annotated": float,
|
||||
"method": "visual_x_position"
|
||||
}
|
||||
|
||||
Example:
|
||||
>>> from pathlib import Path
|
||||
>>> from utils.toc_extractor_visual import extract_toc_with_visual_analysis
|
||||
>>>
|
||||
>>> result = extract_toc_with_visual_analysis(
|
||||
... pdf_path=Path("input/philosophy_book.pdf"),
|
||||
... max_toc_pages=8
|
||||
... )
|
||||
>>> if result["success"]:
|
||||
... for entry in result["toc"]:
|
||||
... indent = " " * (entry["level"] - 1)
|
||||
... print(f"{indent}{entry['title']} (p.{entry['page']})")
|
||||
|
||||
Algorithm Details:
|
||||
1. Collect all x_position values from OCR response
|
||||
2. Sort and cluster positions (tolerance: 0.03)
|
||||
3. Compute cluster centroids as the reference position of each level
|
||||
4. Assign level to each entry based on nearest centroid
|
||||
5. Build hierarchy using stack-based approach
|
||||
|
||||
Functions:
|
||||
- extract_toc_with_visual_analysis(): Main extraction function
|
||||
- build_hierarchy_from_bbox(): Converts entries with X-positions to hierarchy
|
||||
- flatten_toc(): Flattens hierarchical TOC for storage
|
||||
|
||||
See Also:
|
||||
- utils.toc_extractor: Main entry point (routes here by default)
|
||||
- utils.toc_extractor_markdown: Alternative cost-free extraction
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from typing import Any, Dict, List, Optional, TypedDict, Union
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
|
||||
from .mistral_client import create_client
|
||||
from .ocr_processor import run_ocr_with_annotations
|
||||
|
||||
logger: logging.Logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class TocEntryBbox(BaseModel):
|
||||
"""TOC entry with bounding box for visual detection.
|
||||
|
||||
Attributes:
|
||||
text: Complete entry text as it appears in the table of contents.
|
||||
Example: 'Presentation' or 'What is virtue?' or 'Meno or on virtue'.
|
||||
DO NOT include leader dots or page number in this field.
|
||||
page_number: Actual page number as printed in the book (the visible number
|
||||
on the right in the TOC). Example: if the line says 'Presentation.....3',
|
||||
extract the number 3. This is the BOOK page number, not the PDF index.
|
||||
x_position: Horizontal position (X coordinate) of the text start, normalized
|
||||
between 0 and 1. This is the CRUCIAL COORDINATE for detecting indentation:
|
||||
- x ≈ 0.05-0.12 = left-aligned title, NOT indented (hierarchical level 1)
|
||||
- x ≈ 0.13-0.22 = title with SMALL indentation (hierarchical level 2)
|
||||
- x ≈ 0.23-0.35 = title with DOUBLE indentation (hierarchical level 3)
|
||||
Measure precisely where the first character of the title begins.
|
||||
"""
|
||||
text: str = Field(..., description="""Texte COMPLET de l'entrée tel qu'il apparaît dans la table des matières.
|
||||
Exemple: 'Présentation' ou 'Qu'est-ce que la vertu ?' ou 'Ménon ou de la vertu'.
|
||||
NE PAS inclure les points de suite ni le numéro de page dans ce champ.""")
|
||||
page_number: int = Field(..., description="""Numéro de page réel tel qu'imprimé dans le livre (le numéro visible à droite dans la TOC).
|
||||
Exemple: si la ligne dit 'Présentation.....3', extraire le nombre 3.
|
||||
C'est le numéro de page du LIVRE, pas l'index PDF.""")
|
||||
x_position: float = Field(..., description="""Position horizontale (coordonnée X) du début du texte, normalisée entre 0 et 1.
|
||||
C'est LA COORDONNÉE CRUCIALE pour détecter l'indentation:
|
||||
- x ≈ 0.05-0.12 = titre aligné à gauche, NON indenté (niveau hiérarchique 1)
|
||||
- x ≈ 0.13-0.22 = titre avec PETITE indentation (niveau hiérarchique 2)
|
||||
- x ≈ 0.23-0.35 = titre avec DOUBLE indentation (niveau hiérarchique 3)
|
||||
Mesurer précisément où commence le premier caractère du titre.""")
|
||||
|
||||
|
||||
class DocumentTocBbox(BaseModel):
|
||||
"""Schema for extracting all TOC entries with their positions.
|
||||
|
||||
Attributes:
|
||||
entries: Complete list of ALL entries found in the table of contents.
|
||||
For EACH line in the TOC, extract:
|
||||
1. The title text (without leader dots)
|
||||
2. The page number (the number on the right)
|
||||
3. The exact horizontal X position of the title start (to detect indentation)
|
||||
|
||||
Include ALL entries, even those that appear to be at the same visual level.
|
||||
"""
|
||||
|
||||
entries: List[TocEntryBbox] = Field(
|
||||
...,
|
||||
description="""Complete list of ALL entries found in the table of contents.
|
||||
For EACH line in the TOC, extract:
|
||||
1. The title text (without leader dots)
|
||||
2. The page number (the number on the right)
|
||||
3. The exact horizontal X position of the title start (to detect indentation)
|
||||
|
||||
Include ALL entries, even those that appear to be at the same visual level.""",
|
||||
)
|
||||
|
||||
|
||||
# TypedDict classes for structured return types
|
||||
class VisualTOCMetadata(TypedDict):
|
||||
"""Metadata extracted from the document.
|
||||
|
||||
Attributes:
|
||||
title: Document title.
|
||||
author: Document author.
|
||||
languages: List of languages present in the document.
|
||||
summary: Brief document summary.
|
||||
"""
|
||||
|
||||
title: str
|
||||
author: str
|
||||
languages: List[str]
|
||||
summary: str
|
||||
|
||||
|
||||
class VisualTOCNode(TypedDict):
|
||||
"""Hierarchical TOC node.
|
||||
|
||||
Attributes:
|
||||
title: Entry title text.
|
||||
page: Page number in the book.
|
||||
level: Hierarchical level (1 = top level, 2 = subsection, etc.).
|
||||
type: Entry type (e.g., "section", "chapter").
|
||||
children: List of child nodes.
|
||||
"""
|
||||
|
||||
title: str
|
||||
page: int
|
||||
level: int
|
||||
type: str
|
||||
children: List[VisualTOCNode]
|
||||
|
||||
|
||||
class VisualTOCFlatEntry(TypedDict):
|
||||
"""Flattened TOC entry for storage.
|
||||
|
||||
Attributes:
|
||||
title: Entry title text.
|
||||
page_number: Page number in the book.
|
||||
level: Hierarchical level.
|
||||
entry_type: Entry type (e.g., "section", "chapter").
|
||||
parent_title: Title of the parent entry, if any.
|
||||
"""
|
||||
|
||||
title: str
|
||||
page_number: int
|
||||
level: int
|
||||
entry_type: str
|
||||
parent_title: Optional[str]
|
||||
|
||||
|
||||
class VisualTOCResultSuccess(TypedDict):
|
||||
"""Successful TOC extraction result.
|
||||
|
||||
Attributes:
|
||||
success: Always True for success case.
|
||||
metadata: Document metadata.
|
||||
toc: Hierarchical TOC structure.
|
||||
toc_flat: Flattened TOC entries.
|
||||
cost_ocr_annotated: OCR processing cost in euros.
|
||||
method: Extraction method identifier.
|
||||
"""
|
||||
|
||||
success: bool
|
||||
metadata: VisualTOCMetadata
|
||||
toc: List[VisualTOCNode]
|
||||
toc_flat: List[VisualTOCFlatEntry]
|
||||
cost_ocr_annotated: float
|
||||
method: str
|
||||
|
||||
|
||||
class VisualTOCResultError(TypedDict):
|
||||
"""Failed TOC extraction result.
|
||||
|
||||
Attributes:
|
||||
success: Always False for error case.
|
||||
error: Error message describing the failure.
|
||||
"""
|
||||
|
||||
success: bool
|
||||
error: str
|
||||
|
||||
|
||||
# Union type for the function return
|
||||
VisualTOCResult = Union[VisualTOCResultSuccess, VisualTOCResultError]
|
||||
|
||||
|
||||
class VisualTOCEntryInternal(TypedDict):
|
||||
"""Internal representation of TOC entry during processing.
|
||||
|
||||
Attributes:
|
||||
text: Entry title text.
|
||||
page_number: Page number in the book.
|
||||
x_position: Normalized X position (0.0 to 1.0).
|
||||
x_start: Same as x_position (for processing).
|
||||
page: Same as page_number (for processing).
|
||||
level: Computed hierarchical level.
|
||||
"""
|
||||
|
||||
text: str
|
||||
page_number: int
|
||||
x_position: float
|
||||
x_start: float
|
||||
page: int
|
||||
level: int
|
||||
|
||||
|
||||
def extract_toc_with_visual_analysis(
|
||||
pdf_path: Path,
|
||||
api_key: Optional[str] = None,
|
||||
max_toc_pages: int = 8,
|
||||
) -> VisualTOCResult:
|
||||
"""Extract TOC by visually analyzing bounding boxes.
|
||||
|
||||
Detects hierarchy from horizontal alignment (X coordinate). This method
|
||||
uses OCR with annotations to extract the precise X-coordinate of each
|
||||
TOC entry, then clusters these positions to identify indentation levels.
|
||||
|
||||
Args:
|
||||
pdf_path: Path to the PDF file.
|
||||
api_key: Mistral API key (optional, uses environment variable if not provided).
|
||||
max_toc_pages: Number of pages to analyze (default: 8).
|
||||
|
||||
Returns:
|
||||
Dictionary containing either:
|
||||
- Success: metadata, hierarchical TOC, flat TOC, cost, method
|
||||
- Error: success=False and error message
|
||||
|
||||
Raises:
|
||||
Does not raise exceptions; errors are returned in the result dictionary.
|
||||
|
||||
Example:
|
||||
>>> from pathlib import Path
|
||||
>>> result = extract_toc_with_visual_analysis(Path("book.pdf"))
|
||||
>>> if result["success"]:
|
||||
... print(f"Extracted {len(result['toc'])} top-level entries")
|
||||
... else:
|
||||
... print(f"Error: {result['error']}")
|
||||
"""
|
||||
try:
|
||||
client = create_client(api_key)
|
||||
pdf_bytes: bytes = pdf_path.read_bytes()
|
||||
except Exception as e:
|
||||
logger.error(f"Initialization error: {e}")
|
||||
return {"success": False, "error": str(e)}
|
||||
|
||||
logger.info(f"Visual TOC extraction on {max_toc_pages} pages")
|
||||
|
||||
# Call OCR with document_annotation_format for global structure
|
||||
try:
|
||||
response = run_ocr_with_annotations(
|
||||
client=client,
|
||||
file_bytes=pdf_bytes,
|
||||
filename=pdf_path.name,
|
||||
include_images=False,
|
||||
document_annotation_format=DocumentTocBbox,
|
||||
pages=list(range(max_toc_pages)),
|
||||
)
|
||||
except Exception as e:
|
||||
logger.error(f"OCR with annotations error: {e}")
|
||||
return {"success": False, "error": f"OCR failed: {str(e)}"}
|
||||
|
||||
# Extract annotations
|
||||
doc_annotation: Any = getattr(response, "document_annotation", None)
|
||||
|
||||
if not doc_annotation:
|
||||
return {"success": False, "error": "No annotation returned"}
|
||||
|
||||
# Parse entries
|
||||
try:
|
||||
if isinstance(doc_annotation, str):
|
||||
toc_data: Any = json.loads(doc_annotation)
|
||||
else:
|
||||
toc_data = doc_annotation
|
||||
|
||||
entries_data: List[Dict[str, Any]] = (
|
||||
toc_data.get("entries", []) if isinstance(toc_data, dict) else toc_data
|
||||
)
|
||||
|
||||
# Build hierarchy from X coordinates
|
||||
toc_entries: List[VisualTOCNode] = build_hierarchy_from_bbox(entries_data)
|
||||
|
||||
logger.info(f"TOC extracted visually: {len(toc_entries)} entries")
|
||||
|
||||
# Basic metadata (no enriched metadata in visual mode)
|
||||
metadata: VisualTOCMetadata = {
|
||||
"title": pdf_path.stem,
|
||||
"author": "Unknown author",
|
||||
"languages": [],
|
||||
"summary": "",
|
||||
}
|
||||
|
||||
result: VisualTOCResultSuccess = {
|
||||
"success": True,
|
||||
"metadata": metadata,
|
||||
"toc": toc_entries,
|
||||
"toc_flat": flatten_toc(toc_entries),
|
||||
"cost_ocr_annotated": max_toc_pages * 0.003,
|
||||
"method": "visual_x_position",
|
||||
}
|
||||
return result
|
||||
except Exception as e:
|
||||
logger.error(f"Bbox parsing error: {e}")
|
||||
return {"success": False, "error": f"Parsing failed: {str(e)}"}
|
||||
|
||||
|
||||
def build_hierarchy_from_bbox(entries: List[Dict[str, Any]]) -> List[VisualTOCNode]:
|
||||
"""Build TOC hierarchy from X positions (indentation).
|
||||
|
||||
Detects the hierarchical level by analyzing the horizontal X coordinate.
|
||||
Clusters nearby X positions to identify distinct indentation levels, then
|
||||
builds a tree structure using a stack-based approach.
|
||||
|
||||
Args:
|
||||
entries: List of entries with x_position field. Each entry should have:
|
||||
- text: Entry title
|
||||
- page_number: Page number
|
||||
- x_position: Normalized X coordinate (0.0 to 1.0)
|
||||
|
||||
Returns:
|
||||
Hierarchical TOC structure as a list of nodes. Each node contains:
|
||||
- title: Entry title
|
||||
- page: Page number
|
||||
- level: Hierarchical level (1, 2, 3, ...)
|
||||
- type: Entry type (always "section")
|
||||
- children: List of child nodes
|
||||
|
||||
Example:
|
||||
>>> entries = [
|
||||
... {"text": "Chapter 1", "page_number": 1, "x_position": 0.1},
|
||||
... {"text": "Section 1.1", "page_number": 2, "x_position": 0.2},
|
||||
... ]
|
||||
>>> hierarchy = build_hierarchy_from_bbox(entries)
|
||||
>>> hierarchy[0]["children"][0]["title"]
|
||||
'Section 1.1'
|
||||
"""
|
||||
if not entries:
|
||||
return []
|
||||
|
||||
# Extract X positions and normalize entry data
|
||||
entry_list: List[VisualTOCEntryInternal] = []
|
||||
for entry in entries:
|
||||
x_start: float = entry.get("x_position", 0.1)
|
||||
page_num: int = entry.get("page_number", 0)
|
||||
entry["x_start"] = x_start
|
||||
entry["page"] = page_num
|
||||
entry_list.append(entry) # type: ignore[arg-type]
|
||||
|
||||
# Find unique indentation thresholds
|
||||
x_positions: List[float] = sorted(set(e["x_start"] for e in entry_list))
|
||||
|
||||
if not x_positions:
|
||||
logger.warning("No X position detected")
|
||||
return []
|
||||
|
||||
# Group nearby positions (tolerance 0.03 to normalize small variations)
|
||||
x_levels: List[float] = []
|
||||
current_group: List[float] = [x_positions[0]]
|
||||
|
||||
for x in x_positions[1:]:
|
||||
if x - current_group[-1] < 0.03:
|
||||
current_group.append(x)
|
||||
else:
|
||||
x_levels.append(sum(current_group) / len(current_group))
|
||||
current_group = [x]
|
||||
|
||||
if current_group:
|
||||
x_levels.append(sum(current_group) / len(current_group))
|
||||
|
||||
logger.info(
|
||||
f"Indentation levels detected (X positions): {[f'{x:.3f}' for x in x_levels]}"
|
||||
)
|
||||
|
||||
# Assign levels based on X position
|
||||
for entry_item in entry_list:
|
||||
x_val: float = entry_item["x_start"]
|
||||
# Find the closest level
|
||||
level: int = min(range(len(x_levels)), key=lambda i: abs(x_levels[i] - x_val)) + 1
|
||||
entry_item["level"] = level
|
||||
logger.debug(f" '{entry_item.get('text', '')}' -> X={x_val:.3f} -> level {level}")
|
||||
|
||||
# Build hierarchy
|
||||
toc: List[VisualTOCNode] = []
|
||||
stack: List[VisualTOCNode] = []
|
||||
|
||||
for entry_item in entry_list:
|
||||
node: VisualTOCNode = {
|
||||
"title": entry_item.get("text", "").strip(),
|
||||
"page": entry_item["page"],
|
||||
"level": entry_item["level"],
|
||||
"type": "section",
|
||||
"children": [],
|
||||
}
|
||||
|
||||
# Pop from stack while current level is less than or equal to stack top
|
||||
while stack and stack[-1]["level"] >= node["level"]:
|
||||
stack.pop()
|
||||
|
||||
if stack:
|
||||
stack[-1]["children"].append(node)
|
||||
else:
|
||||
toc.append(node)
|
||||
|
||||
stack.append(node)
|
||||
|
||||
return toc
|
||||
|
||||
|
||||
def flatten_toc(toc: List[VisualTOCNode]) -> List[VisualTOCFlatEntry]:
|
||||
"""Flatten a hierarchical TOC.
|
||||
|
||||
Converts a nested TOC structure into a flat list of entries, preserving
|
||||
parent-child relationships through the parent_title field.
|
||||
|
||||
Args:
|
||||
toc: Hierarchical TOC structure (list of VisualTOCNode).
|
||||
|
||||
Returns:
|
||||
Flat list of TOC entries with parent references.
|
||||
|
||||
Example:
|
||||
>>> toc = [{
|
||||
... "title": "Chapter 1",
|
||||
... "page": 1,
|
||||
... "level": 1,
|
||||
... "type": "section",
|
||||
... "children": [{
|
||||
... "title": "Section 1.1",
|
||||
... "page": 2,
|
||||
... "level": 2,
|
||||
... "type": "section",
|
||||
... "children": []
|
||||
... }]
|
||||
... }]
|
||||
>>> flat = flatten_toc(toc)
|
||||
>>> len(flat)
|
||||
2
|
||||
>>> flat[1]["parent_title"]
|
||||
'Chapter 1'
|
||||
"""
|
||||
flat: List[VisualTOCFlatEntry] = []
|
||||
|
||||
def recurse(items: List[VisualTOCNode], parent_title: Optional[str] = None) -> None:
|
||||
"""Recursively flatten TOC nodes.
|
||||
|
||||
Args:
|
||||
items: List of TOC nodes to process.
|
||||
parent_title: Title of the parent node (None for top level).
|
||||
"""
|
||||
for item in items:
|
||||
flat_entry: VisualTOCFlatEntry = {
|
||||
"title": item["title"],
|
||||
"page_number": item["page"],
|
||||
"level": item["level"],
|
||||
"entry_type": item["type"],
|
||||
"parent_title": parent_title,
|
||||
}
|
||||
flat.append(flat_entry)
|
||||
if item.get("children"):
|
||||
recurse(item["children"], item["title"])
|
||||
|
||||
recurse(toc)
|
||||
return flat
|
||||
|
||||
1218
generations/library_rag/utils/types.py
Normal file
File diff suppressed because it is too large
815
generations/library_rag/utils/weaviate_ingest.py
Normal file
@@ -0,0 +1,815 @@
|
||||
"""Weaviate document ingestion module for the Library RAG pipeline.
|
||||
|
||||
This module handles the ingestion of processed documents (chunks, metadata,
|
||||
summaries) into the Weaviate vector database. It supports the V3.0 schema
|
||||
with nested objects for efficient semantic search.
|
||||
|
||||
Architecture:
|
||||
The module uses four Weaviate collections:
|
||||
|
||||
- **Work**: Represents a literary/philosophical work (title, author, year)
|
||||
- **Document**: A specific edition/version of a work (sourceId, pages, TOC)
|
||||
- **Chunk**: Text chunks with vectorized content for semantic search
|
||||
- **Summary**: Section summaries with vectorized concepts
|
||||
|
||||
Chunks and Summaries use nested objects to reference their parent
|
||||
Work and Document, avoiding data duplication while enabling
|
||||
efficient filtering.
|
||||
|
||||
Batch Operations:
|
||||
The module uses Weaviate insert_many() for efficient batch insertion.
|
||||
Chunks are prepared as a list and inserted in a single operation,
|
||||
which is significantly faster than individual insertions.
|
||||
|
||||
Nested Objects:
|
||||
Each Chunk contains nested work and document objects::
|
||||
|
||||
{
|
||||
"text": "La justice est une vertu...",
|
||||
"work": {"title": "La Republique", "author": "Platon"},
|
||||
"document": {"sourceId": "platon_republique", "edition": "GF"}
|
||||
}
|
||||
|
||||
This enables filtering like: document.sourceId == "platon_republique"
|
||||
|
||||
Typical Usage:
|
||||
>>> from utils.weaviate_ingest import ingest_document, delete_document_chunks
|
||||
>>>
|
||||
>>> # Ingest a processed document
|
||||
>>> result = ingest_document(
|
||||
... doc_name="platon_republique",
|
||||
... chunks=[{"text": "La justice est...", "section": "Livre I"}],
|
||||
... metadata={"title": "La Republique", "author": "Platon"},
|
||||
... language="fr",
|
||||
... )
|
||||
>>> print(f"Ingested {result['count']} chunks")
|
||||
|
||||
Connection:
|
||||
The module connects to a local Weaviate instance using:
|
||||
|
||||
- HTTP port: 8080
|
||||
- gRPC port: 50051
|
||||
|
||||
Ensure Weaviate is running via: docker-compose up -d
|
||||
|
||||
See Also:
|
||||
- schema.py: Weaviate schema definitions
|
||||
- pdf_pipeline.py: Document processing pipeline
|
||||
- flask_app.py: Web interface for search
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
from contextlib import contextmanager
|
||||
from datetime import datetime, timezone
|
||||
from typing import Any, Dict, Generator, List, Optional, TypedDict
|
||||
|
||||
import weaviate
|
||||
from weaviate import WeaviateClient
|
||||
from weaviate.collections import Collection
|
||||
import weaviate.classes.query as wvq
|
||||
|
||||
# Import type definitions from central types module
|
||||
from utils.types import WeaviateIngestResult as IngestResult
|
||||
|
||||
# Import TOC enrichment functions
|
||||
from .toc_enricher import enrich_chunks_with_toc
|
||||
|
||||
|
||||
# =============================================================================
|
||||
# Type Definitions (module-specific, not exported to utils.types)
|
||||
# =============================================================================
|
||||
|
||||
|
||||
class SummaryObject(TypedDict):
|
||||
"""Weaviate Summary object structure for section summaries.
|
||||
|
||||
This TypedDict defines the structure of Summary objects stored in Weaviate.
|
||||
Summaries are vectorized and can be searched semantically.
|
||||
|
||||
Attributes:
|
||||
sectionPath: Full hierarchical path (e.g., "Livre I > Chapitre 2").
|
||||
title: Section title.
|
||||
level: Hierarchy level (1 = top level, 2 = subsection, etc.).
|
||||
text: Summary text content (vectorized for search).
|
||||
concepts: List of key concepts extracted from the section.
|
||||
chunksCount: Number of chunks in this section.
|
||||
document: Nested object with document reference (sourceId).
|
||||
"""
|
||||
|
||||
sectionPath: str
|
||||
title: str
|
||||
level: int
|
||||
text: str
|
||||
concepts: List[str]
|
||||
chunksCount: int
|
||||
document: Dict[str, str]
|
||||
|
||||
|
||||
class ChunkObject(TypedDict, total=False):
|
||||
"""Weaviate Chunk object structure for text chunks.
|
||||
|
||||
This TypedDict defines the structure of Chunk objects stored in Weaviate.
|
||||
The text and keywords fields are vectorized for semantic search.
|
||||
|
||||
Attributes:
|
||||
text: Chunk text content (vectorized for search).
|
||||
sectionPath: Full hierarchical path (e.g., "Livre I > Chapitre 2").
|
||||
sectionLevel: Hierarchy level (1 = top level).
|
||||
chapterTitle: Title of the containing chapter.
|
||||
canonicalReference: Canonical academic reference (e.g., "CP 1.628", "Ménon 80a").
|
||||
unitType: Type of argumentative unit (main_content, exposition, etc.).
|
||||
keywords: List of keywords/concepts (vectorized for search).
|
||||
language: Language code (e.g., "fr", "en").
|
||||
orderIndex: Position in document for ordering.
|
||||
work: Nested object with work metadata (title, author).
|
||||
document: Nested object with document reference (sourceId, edition).
|
||||
|
||||
Note:
|
||||
Uses total=False because some fields are optional during creation.
|
||||
"""
|
||||
|
||||
text: str
|
||||
sectionPath: str
|
||||
sectionLevel: int
|
||||
chapterTitle: str
|
||||
canonicalReference: str
|
||||
unitType: str
|
||||
keywords: List[str]
|
||||
language: str
|
||||
orderIndex: int
|
||||
work: Dict[str, str]
|
||||
document: Dict[str, str]
|
||||
|
||||
|
||||
class InsertedChunkSummary(TypedDict):
|
||||
"""Summary of an inserted chunk for display purposes.
|
||||
|
||||
This TypedDict provides a preview of inserted chunks, useful for
|
||||
displaying ingestion results to users.
|
||||
|
||||
Attributes:
|
||||
chunk_id: Generated chunk identifier.
|
||||
sectionPath: Hierarchical path of the chunk.
|
||||
work: Title of the work.
|
||||
author: Author name.
|
||||
text_preview: First 150 characters of chunk text.
|
||||
unitType: Type of argumentative unit.
|
||||
"""
|
||||
|
||||
chunk_id: str
|
||||
sectionPath: str
|
||||
work: str
|
||||
author: str
|
||||
text_preview: str
|
||||
unitType: str
|
||||
|
||||
|
||||
# Note: IngestResult is imported from utils.types as WeaviateIngestResult
|
||||
|
||||
|
||||
class DeleteResult(TypedDict, total=False):
|
||||
"""Result from document deletion operation.
|
||||
|
||||
This TypedDict contains the result of a deletion operation,
|
||||
including counts of deleted objects from each collection.
|
||||
|
||||
Attributes:
|
||||
success: Whether deletion succeeded.
|
||||
error: Error message if deletion failed.
|
||||
deleted_chunks: Number of chunks deleted from Chunk collection.
|
||||
deleted_summaries: Number of summaries deleted from Summary collection.
|
||||
deleted_document: Whether the Document object was deleted.
|
||||
|
||||
Example:
|
||||
>>> result = delete_document_chunks("platon_republique")
|
||||
>>> print(f"Deleted {result['deleted_chunks']} chunks")
|
||||
"""
|
||||
|
||||
success: bool
|
||||
error: str
|
||||
deleted_chunks: int
|
||||
deleted_summaries: int
|
||||
deleted_document: bool
|
||||
|
||||
|
||||
class DocumentStats(TypedDict, total=False):
    """Document statistics from Weaviate.

    This TypedDict contains statistics about a document stored in Weaviate,
    retrieved by querying the Chunk collection.

    Attributes:
        success: Whether stats retrieval succeeded.
        error: Error message if retrieval failed.
        sourceId: Document identifier.
        chunks_count: Total number of chunks for this document.
        work: Title of the work (from first chunk).
        author: Author name (from first chunk).

    Example:
        >>> stats = get_document_stats("platon_republique")
        >>> print(f"Document has {stats['chunks_count']} chunks")
    """

    success: bool
    error: str
    sourceId: str
    chunks_count: int
    work: Optional[str]
    author: Optional[str]


# Logger
logger: logging.Logger = logging.getLogger(__name__)


@contextmanager
def get_weaviate_client() -> Generator[Optional[WeaviateClient], None, None]:
    """Context manager for Weaviate connection with automatic cleanup.

    Creates a connection to the local Weaviate instance and ensures
    proper cleanup when the context exits. Handles connection errors
    gracefully by yielding None instead of raising.

    Yields:
        Connected WeaviateClient instance, or None if connection failed.

    Example:
        >>> with get_weaviate_client() as client:
        ...     if client is not None:
        ...         chunks = client.collections.get("Chunk")
        ...         # Perform operations...
        ...     else:
        ...         print("Connection failed")

    Note:
        Connects to localhost:8080 (HTTP) and localhost:50051 (gRPC).
        Ensure Weaviate is running via docker-compose up -d.
    """
    client: Optional[WeaviateClient] = None
    try:
        # Increased timeout for long text vectorization (e.g., Peirce CP 3.403, CP 8.388, Menon chunk 10)
        # Default is 60s, increased to 600s (10 minutes) for exceptionally large texts
        from weaviate.classes.init import AdditionalConfig, Timeout

        client = weaviate.connect_to_local(
            host="localhost",
            port=8080,
            grpc_port=50051,
            additional_config=AdditionalConfig(
                timeout=Timeout(init=30, query=600, insert=600)  # 10 min for insert/query
            )
        )
    except Exception as e:
        logger.error(f"Erreur connexion Weaviate: {e}")
        client = None

    # Yield exactly once, outside the connection try/except: an exception raised
    # inside the caller's with-block then propagates instead of reaching a second yield.
    try:
        yield client
    finally:
        if client:
            client.close()


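A generator-based context manager must yield exactly once. If the `yield` sits inside a `try` whose `except` clause yields again, an exception raised in the caller's `with` body is thrown into the generator, caught, and followed by a second `yield`, which `contextlib` reports as a `RuntimeError`. The sketch below contrasts the two shapes; `FakeClient`, `fake_connect`, `safe_client`, and `double_yield_client` are hypothetical names for illustration, not part of this module or of the Weaviate client.

```python
from contextlib import contextmanager
from typing import Generator, Optional


class FakeClient:
    """Hypothetical stand-in for a database client."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


def fake_connect(fail: bool) -> FakeClient:
    """Hypothetical connector that can simulate a connection failure."""
    if fail:
        raise ConnectionError("simulated connection failure")
    return FakeClient()


@contextmanager
def safe_client(fail: bool = False) -> Generator[Optional[FakeClient], None, None]:
    client: Optional[FakeClient] = None
    try:
        client = fake_connect(fail)
    except ConnectionError:
        client = None  # connection errors become a None client
    try:
        yield client  # the only yield: errors in the with-body reach the caller
    finally:
        if client is not None:
            client.close()


@contextmanager
def double_yield_client() -> Generator[Optional[FakeClient], None, None]:
    client = FakeClient()
    try:
        yield client
    except Exception:
        yield None  # second yield after a body error: contextlib rejects this
    finally:
        client.close()


with safe_client(fail=True) as c:
    print(c)  # None, because the simulated connection failed
```

Separating "connect" from "yield" keeps the yield-None-on-connection-failure contract while letting errors from the caller's own code surface unchanged.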
def ingest_document_metadata(
    client: WeaviateClient,
    doc_name: str,
    metadata: Dict[str, Any],
    toc: List[Dict[str, Any]],
    hierarchy: Dict[str, Any],
    chunks_count: int,
    pages: int,
) -> Optional[str]:
    """Insert document metadata into the Document collection.

    Creates a Document object containing metadata about a processed document,
    including its table of contents, hierarchy structure, and statistics.

    Args:
        client: Active Weaviate client connection.
        doc_name: Unique document identifier (sourceId).
        metadata: Extracted metadata dict with keys: title, author, language.
        toc: Table of contents as a hierarchical list of dicts.
        hierarchy: Complete document hierarchy structure.
        chunks_count: Total number of chunks in the document.
        pages: Number of pages in the source PDF.

    Returns:
        UUID string of the created Document object, or None if insertion failed.

    Example:
        >>> with get_weaviate_client() as client:
        ...     uuid = ingest_document_metadata(
        ...         client,
        ...         doc_name="platon_republique",
        ...         metadata={"title": "La Republique", "author": "Platon"},
        ...         toc=[{"title": "Livre I", "level": 1}],
        ...         hierarchy={},
        ...         chunks_count=150,
        ...         pages=300,
        ...     )

    Note:
        The TOC and hierarchy are serialized to JSON strings for storage.
        The createdAt field is set to the current timestamp.
    """
    try:
        doc_collection: Collection[Any, Any] = client.collections.get("Document")
    except Exception as e:
        logger.warning(f"Collection Document non trouvée: {e}")
        return None

    try:
        doc_obj: Dict[str, Any] = {
            "sourceId": doc_name,
            "title": metadata.get("title") or doc_name,
            "author": metadata.get("author") or "Inconnu",
            "toc": json.dumps(toc, ensure_ascii=False) if toc else "[]",
            "hierarchy": json.dumps(hierarchy, ensure_ascii=False) if hierarchy else "{}",
            "pages": pages,
            "chunksCount": chunks_count,
            "language": metadata.get("language", "fr"),
            "createdAt": datetime.now(timezone.utc).isoformat(),
        }

        result = doc_collection.data.insert(doc_obj)
        logger.info(f"Document metadata ingéré: {doc_name}")
        return str(result)

    except Exception as e:
        logger.warning(f"Erreur ingestion document metadata: {e}")
        return None


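The metadata payload relies on two small idioms: `value or default` (which also replaces empty strings and `None`) and `json.dumps(..., ensure_ascii=False)` (which keeps accented characters readable in the stored string). The sketch below isolates those fallbacks in a hypothetical helper, `build_document_payload`, that is not part of this module and omits the `pages`, `chunksCount`, and `createdAt` fields.

```python
import json
from typing import Any, Dict, List


def build_document_payload(
    doc_name: str,
    metadata: Dict[str, Any],
    toc: List[Dict[str, Any]],
    hierarchy: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the fallback rules for Document fields."""
    return {
        "sourceId": doc_name,
        # `or` replaces missing, None, and empty-string values alike
        "title": metadata.get("title") or doc_name,
        "author": metadata.get("author") or "Inconnu",
        # Empty structures are stored as literal empty JSON containers
        "toc": json.dumps(toc, ensure_ascii=False) if toc else "[]",
        "hierarchy": json.dumps(hierarchy, ensure_ascii=False) if hierarchy else "{}",
        # dict.get's default only applies when the key is absent
        "language": metadata.get("language", "fr"),
    }


payload = build_document_payload(
    "platon_republique",
    {"title": "", "author": None},
    [{"title": "Livre I : La justice définie", "level": 1}],
    {},
)
print(payload["title"])  # falls back to the document name
print(payload["toc"])    # accented characters are kept as-is
```

Note the asymmetry: an explicit `"language": None` in the metadata would be stored as `None`, because `dict.get` only substitutes the default for a missing key.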
def ingest_summaries(
    client: WeaviateClient,
    doc_name: str,
    toc: List[Dict[str, Any]],
    summaries_content: Dict[str, str],
) -> int:
    """Insert section summaries into the Summary collection.

    Creates Summary objects for each entry in the table of contents,
    with optional summary text content. Summaries are vectorized and
    can be searched semantically.

    Args:
        client: Active Weaviate client connection.
        doc_name: Document identifier for linking summaries.
        toc: Hierarchical table of contents list.
        summaries_content: Mapping of section titles to summary text.
            If a title is not in this dict, the title itself is used as text.

    Returns:
        Number of summaries successfully inserted.

    Example:
        >>> with get_weaviate_client() as client:
        ...     count = ingest_summaries(
        ...         client,
        ...         doc_name="platon_republique",
        ...         toc=[{"title": "Livre I", "level": 1}],
        ...         summaries_content={"Livre I": "Discussion sur la justice..."},
        ...     )
        ...     print(f"Inserted {count} summaries")

    Note:
        Uses batch insertion via insert_many() for efficiency.
        Recursively processes nested TOC entries (children).
    """
    try:
        summary_collection: Collection[Any, Any] = client.collections.get("Summary")
    except Exception as e:
        logger.warning(f"Collection Summary non trouvée: {e}")
        return 0

    summaries_to_insert: List[SummaryObject] = []

    def process_toc(items: List[Dict[str, Any]], parent_path: str = "") -> None:
        for item in items:
            title: str = item.get("title", "")
            level: int = item.get("level", 1)
            path: str = f"{parent_path} > {title}" if parent_path else title

            summary_obj: SummaryObject = {
                "sectionPath": path,
                "title": title,
                "level": level,
                "text": summaries_content.get(title, title),
                "concepts": item.get("concepts", []),
                "chunksCount": 0,
                "document": {
                    "sourceId": doc_name,
                },
            }
            summaries_to_insert.append(summary_obj)

            if "children" in item:
                process_toc(item["children"], path)

    process_toc(toc)

    if not summaries_to_insert:
        return 0

    # Insert in small batches to avoid timeouts
    BATCH_SIZE = 50
    total_inserted = 0

    try:
        logger.info(f"Ingesting {len(summaries_to_insert)} summaries in batches of {BATCH_SIZE}...")

        for batch_start in range(0, len(summaries_to_insert), BATCH_SIZE):
            batch_end = min(batch_start + BATCH_SIZE, len(summaries_to_insert))
            batch = summaries_to_insert[batch_start:batch_end]

            try:
                summary_collection.data.insert_many(batch)
                total_inserted += len(batch)
                logger.info(f" Batch {batch_start//BATCH_SIZE + 1}: Inserted {len(batch)} summaries ({total_inserted}/{len(summaries_to_insert)})")
            except Exception as batch_error:
                logger.warning(f" Batch {batch_start//BATCH_SIZE + 1} failed: {batch_error}")
                continue

        logger.info(f"{total_inserted} résumés ingérés pour {doc_name}")
        return total_inserted
    except Exception as e:
        logger.warning(f"Erreur ingestion résumés: {e}")
        return 0


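The recursive walk over the table of contents builds each `sectionPath` by joining ancestor titles with `" > "`. A minimal standalone sketch of that traversal is shown here; `flatten_toc` is a hypothetical helper that returns `(sectionPath, level)` pairs rather than full Summary objects.

```python
from typing import Any, Dict, List, Tuple


def flatten_toc(
    items: List[Dict[str, Any]], parent_path: str = ""
) -> List[Tuple[str, int]]:
    """Return (sectionPath, level) pairs in depth-first order."""
    flat: List[Tuple[str, int]] = []
    for item in items:
        title = item.get("title", "")
        # Children inherit the parent's path as a prefix
        path = f"{parent_path} > {title}" if parent_path else title
        flat.append((path, item.get("level", 1)))
        flat.extend(flatten_toc(item.get("children", []), path))
    return flat


toc = [
    {"title": "Livre I", "level": 1, "children": [
        {"title": "Prologue", "level": 2},
    ]},
    {"title": "Livre II", "level": 1},
]
for path, level in flatten_toc(toc):
    print(level, path)
```

Since summary text is looked up by bare title rather than by full path, two sections that share a title under different parents would receive the same summary text; keying by path would be one way to avoid that, at the cost of changing the expected input format.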
def ingest_document(
    doc_name: str,
    chunks: List[Dict[str, Any]],
    metadata: Dict[str, Any],
    language: str = "fr",
    toc: Optional[List[Dict[str, Any]]] = None,
    hierarchy: Optional[Dict[str, Any]] = None,
    pages: int = 0,
    ingest_document_collection: bool = True,
    ingest_summary_collection: bool = False,
) -> IngestResult:
    """Ingest document chunks into Weaviate with nested objects.

    Main ingestion function that inserts chunks into the Chunk collection
    with nested Work and Document references. Optionally also creates
    entries in the Document and Summary collections.

    This function uses batch insertion for optimal performance and
    constructs proper nested objects for filtering capabilities.

    Args:
        doc_name: Unique document identifier (used as sourceId).
        chunks: List of chunk dicts, each containing at minimum:
            - text: The chunk text content
            - section (optional): Section path string
            - hierarchy (optional): Dict with part/chapter/section
            - type (optional): Argumentative unit type
            - concepts/keywords (optional): List of keywords
        metadata: Document metadata dict with keys:
            - title: Work title
            - author: Author name
            - edition (optional): Edition identifier
        language: ISO language code. Defaults to "fr".
        toc: Optional table of contents for Document/Summary collections.
        hierarchy: Optional complete document hierarchy structure.
        pages: Number of pages in source document. Defaults to 0.
        ingest_document_collection: If True, also insert into Document
            collection. Defaults to True.
        ingest_summary_collection: If True, also insert into Summary
            collection (requires toc). Defaults to False.

    Returns:
        IngestResult dict containing:
            - success: True if ingestion succeeded
            - count: Number of chunks inserted
            - inserted: Preview of first 10 inserted chunks
            - work: Work title
            - author: Author name
            - document_uuid: UUID of Document object (if created)
            - all_objects: Complete list of inserted ChunkObjects
            - error: Error message (if failed)

    Raises:
        No exceptions are raised; errors are returned in the result dict.

    Example:
        >>> result = ingest_document(
        ...     doc_name="platon_republique",
        ...     chunks=[{"text": "La justice est...", "section": "Livre I"}],
        ...     metadata={"title": "La Republique", "author": "Platon"},
        ...     language="fr",
        ...     pages=450,
        ... )
        >>> if result["success"]:
        ...     print(f"Ingested {result['count']} chunks")

    Note:
        Empty chunks (no text or whitespace-only) are automatically skipped.
        The function logs progress and errors using the module logger.
    """
    try:
        with get_weaviate_client() as client:
            if client is None:
                return IngestResult(
                    success=False,
                    error="Connexion Weaviate impossible",
                    inserted=[],
                )

            # Get the Chunk collection
            try:
                chunk_collection: Collection[Any, Any] = client.collections.get("Chunk")
            except Exception as e:
                return IngestResult(
                    success=False,
                    error=f"Collection Chunk non trouvée: {e}",
                    inserted=[],
                )

            # Insert the document metadata (optional)
            doc_uuid: Optional[str] = None
            if ingest_document_collection:
                doc_uuid = ingest_document_metadata(
                    client, doc_name, metadata, toc or [], hierarchy or {},
                    len(chunks), pages
                )

            # Insert the summaries (optional)
            if ingest_summary_collection and toc:
                ingest_summaries(client, doc_name, toc, {})

            # NEW: enrich chunks with TOC metadata when available
            if toc and hierarchy:
                logger.info(f"Enriching {len(chunks)} chunks with TOC metadata...")
                chunks = enrich_chunks_with_toc(chunks, toc, hierarchy)
            else:
                logger.info("No TOC/hierarchy available, using basic metadata")

            # Prepare the Chunk objects to insert, with nested objects
            objects_to_insert: List[ChunkObject] = []

            title: str = metadata.get("title") or metadata.get("work") or doc_name
            author: str = metadata.get("author") or "Inconnu"
            edition: str = metadata.get("edition", "")

            for idx, chunk in enumerate(chunks):
                # Extract the chunk text
                text: str = chunk.get("text", "")
                if not text or not text.strip():
                    continue

                # Use the enriched sectionPath if available, otherwise fall back to the existing logic
                section_path: str = chunk.get("sectionPath", "")
                if not section_path:
                    section_path = chunk.get("section", "")
                    if not section_path:
                        chunk_hierarchy: Dict[str, Any] = chunk.get("hierarchy", {})
                        section_parts: List[str] = []
                        if chunk_hierarchy.get("part"):
                            section_parts.append(chunk_hierarchy["part"])
                        if chunk_hierarchy.get("chapter"):
                            section_parts.append(chunk_hierarchy["chapter"])
                        if chunk_hierarchy.get("section"):
                            section_parts.append(chunk_hierarchy["section"])
                        section_path = " > ".join(section_parts) if section_parts else chunk.get("title", f"Section {idx}")

                # Use the enriched chapterTitle if available
                chapter_title: str = chunk.get("chapterTitle", chunk.get("chapter_title", ""))

                # Use the enriched canonicalReference if available
                canonical_ref: str = chunk.get("canonicalReference", "")

                # Build the Chunk object with nested objects
                chunk_obj: ChunkObject = {
                    "text": text,
                    "sectionPath": section_path,
                    "sectionLevel": chunk.get("section_level", chunk.get("level", 1)),
                    "chapterTitle": chapter_title,
                    "canonicalReference": canonical_ref,
                    "unitType": chunk.get("type", "main_content"),
                    "keywords": chunk.get("concepts", chunk.get("keywords", [])),
                    "language": language,
                    "orderIndex": idx,
                    "work": {
                        "title": title,
                        "author": author,
                    },
                    "document": {
                        "sourceId": doc_name,
                        "edition": edition,
                    },
                }

                objects_to_insert.append(chunk_obj)

            if not objects_to_insert:
                return IngestResult(
                    success=True,
                    message="Aucun chunk à insérer",
                    inserted=[],
                    count=0,
                )

            # Insert the objects in small batches to avoid timeouts
            BATCH_SIZE = 50  # Process 50 chunks at a time
            total_inserted = 0

            logger.info(f"Ingesting {len(objects_to_insert)} chunks in batches of {BATCH_SIZE}...")

            for batch_start in range(0, len(objects_to_insert), BATCH_SIZE):
                batch_end = min(batch_start + BATCH_SIZE, len(objects_to_insert))
                batch = objects_to_insert[batch_start:batch_end]

                try:
                    _response = chunk_collection.data.insert_many(objects=batch)
                    total_inserted += len(batch)
                    logger.info(f" Batch {batch_start//BATCH_SIZE + 1}: Inserted {len(batch)} chunks ({total_inserted}/{len(objects_to_insert)})")
                except Exception as batch_error:
                    logger.error(f" Batch {batch_start//BATCH_SIZE + 1} failed: {batch_error}")
                    # Continue with next batch instead of failing completely
                    continue

            # Build the preview of inserted objects (first 10)
            inserted_summary: List[InsertedChunkSummary] = []
            for i, obj in enumerate(objects_to_insert[:10]):
                text_content: str = obj.get("text", "")
                work_obj: Dict[str, str] = obj.get("work", {})
                inserted_summary.append(InsertedChunkSummary(
                    chunk_id=f"chunk_{i:05d}",
                    sectionPath=obj.get("sectionPath", ""),
                    work=work_obj.get("title", ""),
                    author=work_obj.get("author", ""),
                    text_preview=text_content[:150] + "..." if len(text_content) > 150 else text_content,
                    unitType=obj.get("unitType", ""),
                ))

            logger.info(f"Ingestion réussie: {total_inserted} chunks insérés pour {doc_name}")

            return IngestResult(
                success=True,
                count=total_inserted,
                inserted=inserted_summary,
                work=title,
                author=author,
                document_uuid=doc_uuid,
                all_objects=objects_to_insert,
            )

    except Exception as e:
        logger.error(f"Erreur ingestion: {e}")
        return IngestResult(
            success=False,
            error=str(e),
            inserted=[],
        )


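Ingestion proceeds in fixed-size batches, and a failing batch is logged and skipped rather than aborting the run. The sketch below reproduces that control flow without any database; `iter_batches`, `insert_in_batches`, and `flaky_insert` are hypothetical names, and `insert_fn` merely stands in for a bulk-insert call.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(
    items: Sequence[T], size: int, insert_fn: Callable[[Sequence[T]], None]
) -> int:
    """Call insert_fn once per batch; skip failed batches and return the inserted count."""
    inserted = 0
    for batch in iter_batches(items, size):
        try:
            insert_fn(batch)  # hypothetical stand-in for a bulk-insert call
        except Exception:
            continue  # a failed batch does not abort the remaining ones
        inserted += len(batch)
    return inserted


calls: List[int] = []


def flaky_insert(batch: Sequence[int]) -> None:
    """Hypothetical insert function that fails on its second call."""
    calls.append(len(batch))
    if len(calls) == 2:
        raise TimeoutError("simulated timeout")


print(insert_in_batches(list(range(120)), 50, flaky_insert))  # prints 70
```

Because skipped batches still end in a `success=True` result, a caller that needs completeness would compare the returned count against the number of objects it expected to insert.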
def delete_document_chunks(doc_name: str) -> DeleteResult:
    """Delete all data for a document from Weaviate collections.

    Removes chunks, summaries, and the document metadata from their
    respective collections. Uses nested object filtering to find
    related objects.

    This function is useful for re-processing a document after changes
    to the processing pipeline or to clean up test data.

    Args:
        doc_name: Document identifier (sourceId) to delete.

    Returns:
        DeleteResult dict containing:
            - success: True if deletion succeeded (even if no objects found)
            - deleted_chunks: Number of Chunk objects deleted
            - deleted_summaries: Number of Summary objects deleted
            - deleted_document: True if Document object was deleted
            - error: Error message (if failed)

    Example:
        >>> result = delete_document_chunks("platon_republique")
        >>> if result["success"]:
        ...     print(f"Deleted {result['deleted_chunks']} chunks")
        ...     # Now safe to re-ingest
        ...     ingest_document("platon_republique", new_chunks, metadata)

    Note:
        Uses delete_many() with filters on nested object properties.
        Continues even if some collections fail (logs warnings).
    """
    try:
        with get_weaviate_client() as client:
            if client is None:
                return DeleteResult(success=False, error="Connexion Weaviate impossible")

            deleted_chunks: int = 0
            deleted_summaries: int = 0
            deleted_document: bool = False

            # Delete the chunks (filter on nested document.sourceId)
            try:
                chunk_collection: Collection[Any, Any] = client.collections.get("Chunk")
                result = chunk_collection.data.delete_many(
                    where=wvq.Filter.by_property("document.sourceId").equal(doc_name)
                )
                deleted_chunks = result.successful
            except Exception as e:
                logger.warning(f"Erreur suppression chunks: {e}")

            # Delete the summaries (filter on nested document.sourceId)
            try:
                summary_collection: Collection[Any, Any] = client.collections.get("Summary")
                result = summary_collection.data.delete_many(
                    where=wvq.Filter.by_property("document.sourceId").equal(doc_name)
                )
                deleted_summaries = result.successful
            except Exception as e:
                logger.warning(f"Erreur suppression summaries: {e}")

            # Delete the document
            try:
                doc_collection: Collection[Any, Any] = client.collections.get("Document")
                result = doc_collection.data.delete_many(
                    where=wvq.Filter.by_property("sourceId").equal(doc_name)
                )
                deleted_document = result.successful > 0
            except Exception as e:
                logger.warning(f"Erreur suppression document: {e}")

            logger.info(f"Suppression: {deleted_chunks} chunks, {deleted_summaries} summaries pour {doc_name}")

            return DeleteResult(
                success=True,
                deleted_chunks=deleted_chunks,
                deleted_summaries=deleted_summaries,
                deleted_document=deleted_document,
            )

    except Exception as e:
        logger.error(f"Erreur suppression: {e}")
        return DeleteResult(success=False, error=str(e))


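Deletion is best-effort: each collection is handled in its own `try` block, so a failure in one step is logged and the remaining steps still run. A minimal sketch of that pattern, with no database involved, could look like this; `run_best_effort` and `failing_step` are hypothetical names for illustration only.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)


def run_best_effort(steps: Dict[str, Callable[[], int]]) -> Dict[str, int]:
    """Run every step in order; a failing step is logged and counted as 0."""
    counts: Dict[str, int] = {}
    for name, step in steps.items():
        try:
            counts[name] = step()
        except Exception as exc:
            logger.warning("Step %s failed: %s", name, exc)
            counts[name] = 0
    return counts


def failing_step() -> int:
    """Hypothetical step that simulates an unavailable collection."""
    raise RuntimeError("collection unavailable")


counts = run_best_effort({
    "chunks": lambda: 42,
    "summaries": failing_step,
    "document": lambda: 1,
})
print(counts)
```

One consequence of this design is that a count of 0 can mean either "nothing matched" or "the step failed"; a caller that must tell the two apart would need the log output or an additional per-step status field.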
def get_document_stats(doc_name: str) -> DocumentStats:
    """Retrieve statistics for a document from Weaviate.

    Queries the Chunk collection to count chunks and extract work
    metadata for a given document identifier.

    Args:
        doc_name: Document identifier (sourceId) to query.

    Returns:
        DocumentStats dict containing:
            - success: True if query succeeded
            - sourceId: The queried document identifier
            - chunks_count: Number of chunks found
            - work: Work title (from first chunk, if any)
            - author: Author name (from first chunk, if any)
            - error: Error message (if failed)

    Example:
        >>> stats = get_document_stats("platon_republique")
        >>> if stats["success"]:
        ...     print(f"Document: {stats['work']} by {stats['author']}")
        ...     print(f"Chunks: {stats['chunks_count']}")

    Note:
        Limited to 1000 chunks for counting. For documents with more
        chunks, consider using Weaviate's aggregate queries.
    """
    try:
        with get_weaviate_client() as client:
            if client is None:
                return DocumentStats(success=False, error="Connexion Weaviate impossible")

            # Count the chunks (filter on nested document.sourceId)
            chunk_collection: Collection[Any, Any] = client.collections.get("Chunk")
            chunks = chunk_collection.query.fetch_objects(
                filters=wvq.Filter.by_property("document.sourceId").equal(doc_name),
                limit=1000,
            )

            chunks_count: int = len(chunks.objects)

            # Read work metadata from the first chunk
            work: Optional[str] = None
            author: Optional[str] = None
            if chunks.objects:
                first: Dict[str, Any] = chunks.objects[0].properties
                work_obj: Any = first.get("work", {})
                work = work_obj.get("title") if isinstance(work_obj, dict) else None
                author = work_obj.get("author") if isinstance(work_obj, dict) else None

            return DocumentStats(
                success=True,
                sourceId=doc_name,
                chunks_count=chunks_count,
                work=work,
                author=author,
            )

    except Exception as e:
        logger.error(f"Erreur stats document: {e}")
        return DocumentStats(success=False, error=str(e))