Remove obsolete documentation and backup files
- Remove REMOTE_WEAVIATE_ARCHITECTURE.md (moved to library_rag)
- Remove navette.txt (obsolete notes)
- Remove backup and obsolete app spec files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# Architecture for a remote Weaviate (Synology/VPS)

## Your use case

**Situation**: LLM application (local or cloud) → Weaviate (on a Synology NAS or remote VPS)

**Requirements**:

- ✅ Maximum reliability
- ✅ Security (private data)
- ✅ Acceptable performance
- ✅ Simple maintenance

---

## 🏆 Recommended option: REST API + secure tunnel

### Overall architecture
```
┌──────────────────────────────────────────────────────────────┐
│                       LLM application                        │
│           (Claude API, OpenAI, local Ollama, etc.)           │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│              Custom REST API (Flask/FastAPI)                 │
│  - JWT / API-key authentication                              │
│  - Rate limiting                                             │
│  - Logging                                                   │
│  - HTTPS (Let's Encrypt)                                     │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼  (private network or VPN)
┌──────────────────────────────────────────────────────────────┐
│                      Synology NAS / VPS                      │
│  ┌────────────────────────────────────────────────────┐      │
│  │                  Docker Compose                    │      │
│  │  ┌──────────────────┐  ┌──────────────────────┐    │      │
│  │  │  Weaviate :8080  │  │ text2vec-transformers│    │      │
│  │  └──────────────────┘  └──────────────────────┘    │      │
│  └────────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────┘
```
### Why this option?

✅ **Maximum reliability** (5/5)
- HTTP/REST is a standard, battle-tested protocol
- Automatic retries are easy to add
- Clear error handling

✅ **Security** (5/5)
- HTTPS enforced
- API-key authentication
- Optional IP allow-listing
- Audit logs

✅ **Performance** (4/5)
- Network latency is unavoidable
- gzip compression is possible
- Optional Redis cache

✅ **Maintenance** (5/5)
- Simple code (Flask/FastAPI)
- Easy monitoring
- Standard deployment

---

## Comparing the 4 options

### Option 1: Custom REST API (⭐ RECOMMENDED)

**Architecture**: App → REST API → Weaviate

**Example code**:
```python
# api_server.py (deployed on the VPS/Synology)
import os
from pathlib import Path

from fastapi import FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader
import weaviate

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")

# Connect to Weaviate (local, on the same machine)
client = weaviate.connect_to_local()


def verify_api_key(api_key: str = Security(api_key_header)) -> str:
    if api_key != os.getenv("API_KEY"):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key


@app.post("/search")
async def search_chunks(
    query: str,
    limit: int = 10,
    api_key: str = Security(verify_api_key),
):
    collection = client.collections.get("Chunk")
    result = collection.query.near_text(
        query=query,
        limit=limit,
    )
    return {"results": [obj.properties for obj in result.objects]}


@app.post("/insert_pdf")
async def insert_pdf(
    pdf_path: str,
    api_key: str = Security(verify_api_key),
):
    # Call the library_rag pipeline
    from utils.pdf_pipeline import process_pdf

    result = process_pdf(Path(pdf_path))
    return result
```
**Deployment**:

```bash
# On the VPS/Synology
docker-compose up -d weaviate text2vec
uvicorn api_server:app --host 0.0.0.0 --port 8000 --ssl-keyfile key.pem --ssl-certfile cert.pem
```
**Advantages**:
- ✅ Full control over the API
- ✅ Easy to secure (HTTPS + API key)
- ✅ Can wrap the entire library_rag pipeline
- ✅ Easy monitoring and logging

**Drawbacks**:
- ⚠️ Custom code to maintain
- ⚠️ Requires a web server (nginx/uvicorn)
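On the caller side, the "automatic retry" point above can be made concrete with a small client. A sketch using only the standard library — the `/search` path and `X-API-Key` header follow the example server, while `API_URL`, the retry count, and the backoff schedule are illustrative choices:

```python
# search_client.py -- hypothetical caller of the /search endpoint above.
import json
import os
import time
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://api.example.com/search"  # placeholder host


def search_with_retry(query, limit=10, retries=3, opener=urllib.request.urlopen):
    """POST to /search, retrying transient network errors with backoff."""
    qs = urllib.parse.urlencode({"query": query, "limit": limit})
    req = urllib.request.Request(
        f"{API_URL}?{qs}",
        method="POST",
        headers={"X-API-Key": os.getenv("API_KEY", "")},
    )
    for attempt in range(retries):
        try:
            with opener(req, timeout=10) as resp:
                return json.loads(resp.read())
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s, ...
```

Taking the transport as the `opener` parameter keeps the function testable without a live server; production code simply calls `search_with_retry("justice")` with the default.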
---
### Option 2: Direct Weaviate access over VPN

**Architecture**: App → VPN → Weaviate:8080

**Configuration**:

```bash
# On the Synology: enable VPN Server (OpenVPN/WireGuard)
# On the client: connect to the VPN
# Then access Weaviate directly at http://192.168.x.x:8080 (the Synology's private IP)
```

**Client code**:
```python
# In your LLM app
import weaviate

# Over the VPN, using the Synology's private IP
client = weaviate.connect_to_custom(
    http_host="192.168.1.100",
    http_port=8080,
    http_secure=False,  # inside the VPN, HTTPS is not required
    grpc_host="192.168.1.100",
    grpc_port=50051,
    grpc_secure=False,
)

# Direct usage
collection = client.collections.get("Chunk")
result = collection.query.near_text(query="justice")
```
**Advantages**:
- ✅ Very simple (no custom code)
- ✅ Secured by the VPN
- ✅ Uses the Weaviate Python client directly

**Drawbacks**:
- ⚠️ The VPN must stay up at all times
- ⚠️ VPN latency
- ⚠️ No abstraction layer (the app must know about Weaviate)

---
### Option 3: MCP server over HTTP on the VPS

**Architecture**: App → MCP over HTTP → Weaviate

**Problem**: FastMCP's SSE transport does not work well in production (as we saw earlier)

**Solution**: a custom MCP-over-HTTP wrapper
```python
# mcp_http_wrapper.py (on the VPS)
from fastapi import FastAPI
# SearchChunksInput is assumed to live in mcp_tools alongside the handlers
from mcp_tools import SearchChunksInput, parse_pdf_handler, search_chunks_handler
from pydantic import BaseModel

app = FastAPI()


class SearchRequest(BaseModel):
    query: str
    limit: int = 10


@app.post("/mcp/search_chunks")
async def mcp_search(req: SearchRequest):
    # Call the MCP handler directly
    input_data = SearchChunksInput(query=req.query, limit=req.limit)
    result = await search_chunks_handler(input_data)
    return result.model_dump()
```
**Advantages**:
- ✅ Reuses the existing MCP code
- ✅ Standard HTTP

**Drawbacks**:
- ⚠️ MCP over stdio cannot be used
- ⚠️ Still requires a custom HTTP wrapper anyway
- ⚠️ Equivalent to Option 1, but more complex

**Verdict**: Option 1 (a plain REST API) is better

---
### Option 4: SSH tunnel + port forwarding

**Architecture**: App → SSH tunnel → localhost:8080 (remote Weaviate)

**Configuration**:

```bash
# On your local machine
ssh -L 8080:localhost:8080 user@synology-ip

# The remote Weaviate is now reachable on localhost:8080
```

**Code**:
```python
# In your app (which thinks Weaviate is local)
client = weaviate.connect_to_local()  # hits localhost:8080 = the SSH tunnel
```
**Advantages**:
- ✅ SSH security
- ✅ Simple to set up
- ✅ No custom code

**Drawbacks**:
- ⚠️ The tunnel must stay open
- ⚠️ Not suitable for a cloud app
- ⚠️ SSH latency

---
## 🎯 Recommendations for your scenario

### Case 1: Local application (your PC) → Weaviate on Synology/VPS

**Recommendation**: **VPN + direct Weaviate access** (Option 2)

**Why**:
- Easy to set up on a Synology (built-in VPN Server)
- No custom code
- Secured by the VPN
- Acceptable performance over a local network/VPN

**Setup**:

1. Synology: enable VPN Server (OpenVPN)
2. Client: connect to the VPN
3. Python: `weaviate.connect_to_custom(http_host="192.168.x.x", ...)`
---

### Case 2: Cloud application (remote server) → Weaviate on Synology/VPS

**Recommendation**: **Custom REST API** (Option 1)

**Why**:
- No VPN required
- Public HTTPS with Let's Encrypt
- Access control via API key
- Rate limiting
- Monitoring

**Setup**:

1. VPS/Synology: Docker Compose (Weaviate + REST API)
2. Domain: api.monrag.com → VPS IP
3. Let's Encrypt: automatic HTTPS
4. Cloud app: calls `https://api.monrag.com/search` with its `X-API-Key` header
---

### Case 3: Temporary local development → remote Weaviate

**Recommendation**: **SSH tunnel** (Option 4)

**Why**:
- One-line setup
- No permanent configuration
- Perfect for dev/debugging

**Setup**:

```bash
ssh -L 8080:localhost:8080 user@vps
# The remote Weaviate is reachable on localhost:8080
```
---

## 🔧 Recommended VPS deployment

### Full stack
```yaml
# docker-compose.yml (on the VPS)
version: '3.8'

services:
  # Weaviate + embeddings
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.34.4
    ports:
      - "127.0.0.1:8080:8080"  # bound to localhost only (security)
    environment:
      AUTHENTICATION_APIKEY_ENABLED: "true"
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: "my-secret-key"
      # ... other settings
    volumes:
      - weaviate_data:/var/lib/weaviate

  text2vec-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:baai-bge-m3-onnx-latest
    # ... config

  # Custom REST API
  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      WEAVIATE_URL: http://weaviate:8080
      API_KEY: ${API_KEY}
      MISTRAL_API_KEY: ${MISTRAL_API_KEY}
    depends_on:
      - weaviate
    restart: always

  # NGINX reverse proxy + HTTPS
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - /etc/letsencrypt:/etc/letsencrypt
    depends_on:
      - api

volumes:
  weaviate_data:
```
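Once the stack is up, a quick readiness probe confirms Weaviate is answering before pointing the API at it. A sketch — `GET /v1/.well-known/ready` is Weaviate's standard readiness endpoint; the host and the `my-secret-key` value mirror the compose file above and are placeholders:

```python
# check_ready.py -- probe Weaviate's readiness endpoint from the VPS itself
# (the port is bound to 127.0.0.1 in the compose file above).
import urllib.request


def weaviate_ready(base_url="http://127.0.0.1:8080",
                   api_key="my-secret-key",
                   opener=urllib.request.urlopen):
    """Return True if GET /v1/.well-known/ready answers with HTTP 200."""
    req = urllib.request.Request(
        f"{base_url}/v1/.well-known/ready",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with opener(req, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, refused connections, timeouts
        return False
```

Injecting the transport via `opener` makes the probe testable offline; in a deploy script you would just loop on `weaviate_ready()` until it returns True.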
### NGINX config

```nginx
# nginx.conf
# Note: the api_limit zone must be declared in the http context, e.g.:
#   limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
server {
    listen 443 ssl;
    server_name api.monrag.com;

    ssl_certificate /etc/letsencrypt/live/api.monrag.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.monrag.com/privkey.pem;

    location / {
        proxy_pass http://api:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Rate limiting
        limit_req zone=api_limit burst=10 nodelay;
    }
}
```
---

## 📊 Final comparison

| Criterion | VPN + direct | REST API | SSH tunnel | MCP HTTP |
|-----------|--------------|----------|------------|----------|
| **Reliability** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| **Security** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Simplicity** | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| **Performance** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Maintenance** | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| **Production** | ✅ Yes | ✅ Yes | ❌ No | ⚠️ Possible |
---

## 💡 My final recommendation

### For a Synology (personal/team use)
**VPN + direct Weaviate access** (Option 2)
- Synology ships an excellent built-in VPN Server
- Maximum security
- Simple to maintain

### For a VPS (production/public use)
**Custom REST API** (Option 1)
- Full control
- Public HTTPS
- Scalable
- Complete monitoring

---

## 🚀 Recommended next step

Would you like me to create:

1. **The REST API code** (Flask/FastAPI) with auth + rate limiting?
2. **The complete VPS docker-compose** with nginx + Let's Encrypt?
3. **The Synology VPN installation guide** + client config?

Tell me your exact use case and I'll put together the complete solution! 🎯
<project_specification>
<project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>

<overview>
Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
improve code maintainability, enable static type checking with mypy, and provide clear documentation
for all functions, classes, and modules.

The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
semantic chunking, and ingestion into the Weaviate vector database. It includes a Flask web interface
for document upload, processing, and semantic search.
</overview>

<technology_stack>
  <backend>
    <runtime>Python 3.10+</runtime>
    <web_framework>Flask 3.0</web_framework>
    <vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
    <ocr>Mistral OCR API</ocr>
    <llm>Ollama (local) or Mistral API</llm>
    <type_checking>mypy with strict configuration</type_checking>
  </backend>
  <infrastructure>
    <containerization>Docker Compose (Weaviate + transformers)</containerization>
    <dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
  </infrastructure>
</technology_stack>
<current_state>
  <project_structure>
    - flask_app.py: Main Flask application (640 lines)
    - schema.py: Weaviate schema definition (383 lines)
    - utils/: 16+ modules for PDF processing pipeline
      - pdf_pipeline.py: Main orchestration (879 lines)
      - mistral_client.py: OCR API client
      - ocr_processor.py: OCR processing
      - markdown_builder.py: Markdown generation
      - llm_metadata.py: Metadata extraction via LLM
      - llm_toc.py: Table of contents extraction
      - llm_classifier.py: Section classification
      - llm_chunker.py: Semantic chunking
      - llm_cleaner.py: Chunk cleaning
      - llm_validator.py: Document validation
      - weaviate_ingest.py: Database ingestion
      - hierarchy_parser.py: Document hierarchy parsing
      - image_extractor.py: Image extraction from PDFs
      - toc_extractor*.py: Various TOC extraction methods
    - templates/: Jinja2 templates for Flask UI
    - tests/utils2/: Minimal test coverage (3 test files)
  </project_structure>

  <issues>
    - Inconsistent type annotations across modules (some have partial types, many have none)
    - Missing or incomplete docstrings (no Google-style format)
    - No mypy configuration for strict type checking
    - Type hints missing on function parameters and return values
    - Dict[str, Any] used extensively without proper typing
    - No type stubs for complex nested structures
  </issues>
</current_state>
<core_features>
<type_annotations>
  <strict_typing>
    - Add complete type annotations to ALL functions and methods
    - Use proper generic types (List, Dict, Optional, Union) from typing module
    - Add TypedDict for complex dictionary structures
    - Add Protocol types for duck-typed interfaces
    - Use Literal types for string constants
    - Add ParamSpec and TypeVar where appropriate
    - Type all class attributes and instance variables
    - Add type annotations to lambda functions where possible
  </strict_typing>

  <mypy_configuration>
    - Create mypy.ini with strict configuration
    - Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
    - Enable: disallow_untyped_calls, disallow_untyped_decorators
    - Enable: warn_return_any, warn_redundant_casts
    - Enable: strict_equality, strict_optional
    - Set python_version to 3.10
    - Configure per-module overrides if needed for gradual migration
  </mypy_configuration>
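The settings listed above might translate into a `mypy.ini` along these lines — a sketch: the flag names are real mypy options, and the per-module override illustrates the gradual-migration escape hatch mentioned in the last task:

```ini
[mypy]
python_version = 3.10
check_untyped_defs = True
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_untyped_calls = True
disallow_untyped_decorators = True
warn_return_any = True
warn_redundant_casts = True
strict_equality = True
strict_optional = True

; example per-module override: relax one module during migration
[mypy-utils.toc_extractor_visual]
disallow_untyped_defs = False
```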
  <type_stubs>
    - Create TypedDict definitions for common data structures:
      - OCR response structures
      - Metadata dictionaries
      - TOC entries
      - Chunk objects
      - Weaviate objects
      - Pipeline results
    - Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
    - Create Protocol types for callback functions
  </type_stubs>
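As a sketch of what `utils/types.py` could contain — the field names below are illustrative guesses based on the pipeline described here, not the project's actual structures:

```python
# utils/types.py -- foundational type definitions (field names are illustrative).
from typing import NewType, Protocol, TypedDict

# Semantic aliases: a DocumentName is not interchangeable with a plain str
# in type-checked code, though it is an ordinary str at runtime.
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


class TOCEntry(TypedDict):
    title: str
    level: int
    page: int


class ChunkData(TypedDict):
    chunk_id: ChunkId
    text: str
    section_path: str


class ProgressCallback(Protocol):
    """Shape of the progress callback threaded through the pipeline steps."""

    def __call__(self, step: str, percent: float) -> None: ...
```

At runtime a TypedDict is a plain dict and a NewType is the underlying type, so these aliases cost nothing; they only constrain what mypy accepts.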
  <specific_improvements>
    - pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
    - flask_app.py: Type all route handlers, request/response types
    - schema.py: Type Weaviate configuration objects
    - llm_*.py: Type LLM request/response structures
    - mistral_client.py: Type API client methods and responses
    - weaviate_ingest.py: Type ingestion functions and batch operations
  </specific_improvements>
</type_annotations>
<documentation>
  <google_style_docstrings>
    - Add comprehensive Google-style docstrings to ALL:
      - Module-level docstrings explaining purpose and usage
      - Class docstrings with Attributes section
      - Function/method docstrings with Args, Returns, Raises sections
      - Complex algorithm explanations with Examples section
    - Include code examples for public APIs
    - Document all exceptions that can be raised
    - Add Notes section for important implementation details
    - Add See Also section for related functions
  </google_style_docstrings>

  <module_documentation>
    <utils_modules>
      - pdf_pipeline.py: Document the 10-step pipeline, each step's purpose
      - mistral_client.py: Document OCR API usage, cost calculation
      - llm_metadata.py: Document metadata extraction logic
      - llm_toc.py: Document TOC extraction strategies
      - llm_classifier.py: Document section classification types
      - llm_chunker.py: Document semantic vs basic chunking
      - llm_cleaner.py: Document cleaning rules and validation
      - llm_validator.py: Document validation criteria
      - weaviate_ingest.py: Document ingestion process, nested objects
      - hierarchy_parser.py: Document hierarchy building algorithm
    </utils_modules>

    <flask_app>
      - Document all routes with request/response examples
      - Document SSE (Server-Sent Events) implementation
      - Document Weaviate query patterns
      - Document upload processing workflow
      - Document background job management
    </flask_app>

    <schema>
      - Document Weaviate schema design decisions
      - Document each collection's purpose and relationships
      - Document nested object structure
      - Document vectorization strategy
    </schema>
  </module_documentation>

  <inline_comments>
    - Add inline comments for complex logic only (don't over-comment)
    - Explain WHY not WHAT (code should be self-documenting)
    - Document performance considerations
    - Document cost implications (OCR, LLM API calls)
    - Document error handling strategies
  </inline_comments>
</documentation>
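For reference on the format, a function documented in the Google style described above might read as follows (the function itself is a toy example, not taken from the codebase):

```python
def estimate_pages(char_count: int, chars_per_page: int = 1800) -> int:
    """Estimate the page count of a document from its character count.

    Args:
        char_count: Total number of characters in the document.
        chars_per_page: Average characters per page; 1800 is a common
            estimate for dense body text.

    Returns:
        The estimated number of pages; at least 1 for a non-empty
        document, and 0 for an empty one.

    Raises:
        ValueError: If ``char_count`` is negative.

    Example:
        >>> estimate_pages(3600)
        2
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    if char_count == 0:
        return 0
    return max(1, -(-char_count // chars_per_page))  # ceiling division
```

The docstring explains intent (why 1800, why at least 1) rather than restating the code, matching the "WHY not WHAT" rule above.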
<validation>
  <type_checking>
    - All modules must pass mypy --strict
    - No # type: ignore comments without justification
    - CI/CD should run mypy checks
    - Type coverage should be 100%
  </type_checking>

  <documentation_quality>
    - All public functions must have docstrings
    - All docstrings must follow Google style
    - Examples should be executable and tested
    - Documentation should be clear and concise
  </documentation_quality>
</validation>
</core_features>

<implementation_priority>
  <critical_modules>
    Priority 1 (Most used, most complex):
    1. utils/pdf_pipeline.py - Main orchestration
    2. flask_app.py - Web application entry point
    3. utils/weaviate_ingest.py - Database operations
    4. schema.py - Schema definition

    Priority 2 (Core LLM modules):
    5. utils/llm_metadata.py
    6. utils/llm_toc.py
    7. utils/llm_classifier.py
    8. utils/llm_chunker.py
    9. utils/llm_cleaner.py
    10. utils/llm_validator.py

    Priority 3 (OCR and parsing):
    11. utils/mistral_client.py
    12. utils/ocr_processor.py
    13. utils/markdown_builder.py
    14. utils/hierarchy_parser.py
    15. utils/image_extractor.py

    Priority 4 (Supporting modules):
    16. utils/toc_extractor.py
    17. utils/toc_extractor_markdown.py
    18. utils/toc_extractor_visual.py
    19. utils/llm_structurer.py (legacy)
  </critical_modules>
</implementation_priority>
<implementation_steps>
<feature_1>
  <title>Setup Type Checking Infrastructure</title>
  <description>
    Configure mypy with strict settings and create foundational type definitions
  </description>
  <tasks>
    - Create mypy.ini configuration file with strict settings
    - Add mypy to requirements.txt or dev dependencies
    - Create utils/types.py module for common TypedDict definitions
    - Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
    - Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
    - Create Protocol types for callbacks (ProgressCallback, etc.)
    - Document type definitions in utils/types.py module docstring
    - Test mypy configuration on a single module to verify settings
  </tasks>
  <acceptance_criteria>
    - mypy.ini exists with strict configuration
    - utils/types.py contains all foundational types with docstrings
    - mypy runs without errors on utils/types.py
    - Type definitions are comprehensive and reusable
  </acceptance_criteria>
</feature_1>

<feature_2>
  <title>Add Types to PDF Pipeline Orchestration</title>
  <description>
    Add complete type annotations to pdf_pipeline.py (879 lines, most complex module)
  </description>
  <tasks>
    - Add type annotations to all function signatures in pdf_pipeline.py
    - Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Validate, Weaviate
    - Type progress_callback parameter with Protocol or Callable
    - Add TypedDict for pipeline options dictionary
    - Add TypedDict for pipeline result dictionary structure
    - Type all helper functions (extract_document_metadata_legacy, etc.)
    - Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
    - Fix any mypy errors that arise
    - Verify mypy --strict passes on pdf_pipeline.py
  </tasks>
  <acceptance_criteria>
    - All functions in pdf_pipeline.py have complete type annotations
    - progress_callback is properly typed with Protocol
    - All Dict[str, Any] replaced with TypedDict where appropriate
    - mypy --strict pdf_pipeline.py passes with zero errors
    - No # type: ignore comments (or justified if absolutely necessary)
  </acceptance_criteria>
</feature_2>

<feature_3>
  <title>Add Types to Flask Application</title>
  <description>
    Add complete type annotations to flask_app.py and type all routes
  </description>
  <tasks>
    - Add type annotations to all Flask route handlers
    - Type request.args, request.form, request.files usage
    - Type jsonify() return values
    - Type get_weaviate_client context manager
    - Type get_collection_stats, get_all_chunks, search_chunks functions
    - Add TypedDict for Weaviate query results
    - Type background job processing functions (run_processing_job)
    - Type SSE generator function (upload_progress)
    - Add type hints for template rendering
    - Verify mypy --strict passes on flask_app.py
  </tasks>
  <acceptance_criteria>
    - All Flask routes have complete type annotations
    - Request/response types are clear and documented
    - Weaviate query functions are properly typed
    - SSE generator is correctly typed
    - mypy --strict flask_app.py passes with zero errors
  </acceptance_criteria>
</feature_3>
<feature_4>
  <title>Add Types to Core LLM Modules</title>
  <description>
    Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
  </description>
  <tasks>
    - llm_metadata.py: Type extract_metadata function, return structure
    - llm_toc.py: Type extract_toc function, TOC hierarchy structure
    - llm_classifier.py: Type classify_sections, section types (Literal), validation functions
    - llm_chunker.py: Type chunk_section_with_llm, chunk objects
    - llm_cleaner.py: Type clean_chunk, is_chunk_valid functions
    - llm_validator.py: Type validate_document, validation result structure
    - Add TypedDict for LLM request/response structures
    - Type provider selection ("ollama" | "mistral" as Literal)
    - Type model names with Literal or constants
    - Verify mypy --strict passes on all llm_*.py modules
  </tasks>
  <acceptance_criteria>
    - All LLM modules have complete type annotations
    - Section types use Literal for type safety
    - Provider and model parameters are strongly typed
    - LLM request/response structures use TypedDict
    - mypy --strict passes on all llm_*.py modules with zero errors
  </acceptance_criteria>
</feature_4>

<feature_5>
  <title>Add Types to Weaviate and Database Modules</title>
  <description>
    Add complete type annotations to schema.py and weaviate_ingest.py
  </description>
  <tasks>
    - schema.py: Type Weaviate configuration objects
    - schema.py: Type collection property definitions
    - weaviate_ingest.py: Type ingest_document function signature
    - weaviate_ingest.py: Type delete_document_chunks function
    - weaviate_ingest.py: Add TypedDict for Weaviate object structure
    - Type batch insertion operations
    - Type nested object references (work, document)
    - Add proper error types for Weaviate exceptions
    - Verify mypy --strict passes on both modules
  </tasks>
  <acceptance_criteria>
    - schema.py has complete type annotations for Weaviate config
    - weaviate_ingest.py functions are fully typed
    - Nested object structures use TypedDict
    - Weaviate client operations are properly typed
    - mypy --strict passes on both modules with zero errors
  </acceptance_criteria>
</feature_5>

<feature_6>
  <title>Add Types to OCR and Parsing Modules</title>
  <description>
    Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py
  </description>
  <tasks>
    - mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
    - mistral_client.py: Add TypedDict for Mistral API response structures
    - ocr_processor.py: Type serialize_ocr_response, OCR object structures
    - markdown_builder.py: Type build_markdown, image_writer parameter
    - hierarchy_parser.py: Type build_hierarchy, flatten_hierarchy functions
    - hierarchy_parser.py: Add TypedDict for hierarchy node structure
    - image_extractor.py: Type create_image_writer, image handling
    - Verify mypy --strict passes on all modules
  </tasks>
  <acceptance_criteria>
    - All OCR/parsing modules have complete type annotations
    - Mistral API structures use TypedDict
    - Hierarchy nodes are properly typed
    - Image handling functions are typed
    - mypy --strict passes on all modules with zero errors
  </acceptance_criteria>
</feature_6>
<feature_7>
|
||||
<title>Add Google-Style Docstrings to Core Modules</title>
|
||||
<description>
|
||||
Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and weaviate modules
|
||||
</description>
|
||||
<tasks>
|
||||
- pdf_pipeline.py: Add module docstring explaining the V2 pipeline
|
||||
- pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
|
||||
- pdf_pipeline.py: Document each of the 10 pipeline steps in comments
|
||||
- pdf_pipeline.py: Add Examples section showing typical usage
|
||||
- flask_app.py: Add module docstring explaining Flask application
|
||||
- flask_app.py: Document all routes with request/response examples
|
||||
- flask_app.py: Document Weaviate connection management
|
||||
- schema.py: Add module docstring explaining schema design
|
||||
- schema.py: Document each collection's purpose and relationships
|
||||
- weaviate_ingest.py: Document ingestion process with examples
|
||||
- All docstrings must follow Google style format exactly
|
||||
</tasks>
|
||||
<acceptance_criteria>
|
||||
- All core modules have comprehensive module-level docstrings
|
||||
- All public functions have Google-style docstrings
|
||||
- Args, Returns, Raises sections are complete and accurate
|
||||
- Examples are provided for complex functions
|
||||
- Docstrings explain WHY, not just WHAT
|
||||
</acceptance_criteria>
|
||||
</feature_7>
|
||||
|
||||
<feature_8>
<title>Add Google-Style Docstrings to LLM Modules</title>
<description>
Add comprehensive Google-style docstrings to all LLM processing modules
</description>
<tasks>
- llm_metadata.py: Document metadata extraction logic with examples
- llm_toc.py: Document TOC extraction strategies and fallbacks
- llm_classifier.py: Document section types and classification criteria
- llm_chunker.py: Document semantic vs basic chunking approaches
- llm_cleaner.py: Document cleaning rules and validation logic
- llm_validator.py: Document validation criteria and corrections
- Add Examples sections showing input/output for each function
- Document LLM provider differences (Ollama vs Mistral)
- Document cost implications in Notes sections
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All LLM modules have comprehensive docstrings
- Each function has Args, Returns, Raises sections
- Examples show realistic input/output
- Provider differences are documented
- Cost implications are noted where relevant
</acceptance_criteria>
</feature_8>

<feature_9>
<title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
<description>
Add comprehensive Google-style docstrings to OCR, markdown, hierarchy, and extraction modules
</description>
<tasks>
- mistral_client.py: Document OCR API usage, cost calculation
- ocr_processor.py: Document OCR response processing
- markdown_builder.py: Document markdown generation strategy
- hierarchy_parser.py: Document hierarchy building algorithm
- image_extractor.py: Document image extraction process
- toc_extractor*.py: Document various TOC extraction methods
- Add Examples sections for complex algorithms
- Document edge cases and error handling
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have comprehensive docstrings
- Complex algorithms are well explained
- Edge cases are documented
- Error handling is documented
- Examples demonstrate typical usage
</acceptance_criteria>
</feature_9>

<feature_10>
<title>Final Validation and CI Integration</title>
<description>
Verify all type annotations and docstrings, integrate mypy into CI/CD
</description>
<tasks>
- Run mypy --strict on entire codebase, verify 100% pass rate
- Verify all public functions have docstrings
- Check docstring formatting with pydocstyle or similar tool
- Create GitHub Actions workflow to run mypy on every commit
- Update README.md with type checking instructions
- Update CLAUDE.md with documentation standards
- Create CONTRIBUTING.md with type annotation and docstring guidelines
- Generate API documentation with Sphinx or pdoc
- Fix any remaining mypy errors or missing docstrings
</tasks>
<acceptance_criteria>
- mypy --strict passes on entire codebase with zero errors
- All public functions have Google-style docstrings
- CI/CD runs mypy checks automatically
- Documentation is generated and accessible
- Contributing guidelines document type/docstring requirements
</acceptance_criteria>
</feature_10>
</implementation_steps>

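The "Create GitHub Actions workflow" task above could be satisfied by a minimal workflow along these lines. This is a sketch, not part of the spec: the file path, action versions, and Python version are all assumptions.

```yaml
# Hypothetical .github/workflows/typecheck.yml
name: typecheck
on: [push, pull_request]
jobs:
  mypy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install mypy
      # Fails the job (and thus the commit check) on any type error
      - run: mypy --strict .
```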
<success_criteria>
<type_safety>
- 100% type coverage across all modules
- mypy --strict passes with zero errors
- No # type: ignore comments without justification
- All Dict[str, Any] replaced with TypedDict where appropriate
- Proper use of generics, protocols, and type variables
- NewType used for semantic type safety
</type_safety>

<documentation_quality>
- All modules have comprehensive module-level docstrings
- All public functions/classes have Google-style docstrings
- All docstrings include Args, Returns, Raises sections
- Complex functions include Examples sections
- Cost implications documented in Notes sections
- Error handling clearly documented
- Provider differences (Ollama vs Mistral) documented
</documentation_quality>

<code_quality>
- Code is self-documenting with clear variable names
- Inline comments explain WHY, not WHAT
- Complex algorithms are well explained
- Performance considerations documented
- Security considerations documented
</code_quality>

<developer_experience>
- IDE autocomplete works perfectly with type hints
- Type errors caught at development time, not runtime
- Documentation is easily accessible in IDE
- API examples are executable and tested
- Contributing guidelines are clear and comprehensive
</developer_experience>

<maintainability>
- Refactoring is safer with type checking
- Function signatures are self-documenting
- API contracts are explicit and enforced
- Breaking changes are caught by type checker
- New developers can understand code quickly
</maintainability>
</success_criteria>

<constraints>
<compatibility>
- Must maintain backward compatibility with existing code
- Cannot break existing Flask routes or API contracts
- Weaviate schema must remain unchanged
- Existing tests must continue to pass
</compatibility>

<gradual_migration>
- Can use per-module mypy configuration for gradual migration
- Can temporarily disable strict checks on legacy modules
- Priority modules must be completed first
- Low-priority modules can be deferred
</gradual_migration>

<standards>
- Prefer Python 3.10+ annotation syntax (built-in generics, X | Y unions)
- Docstrings must follow Google style exactly (not NumPy or reStructuredText)
- Fall back to the typing module (List, Dict, Optional) only while Python 3.9 support must be kept
- Use from __future__ import annotations if needed for forward references
</standards>
</constraints>

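To make the annotation standard above concrete, here is a minimal sketch in the Python 3.10+ style; the function name and signature are hypothetical, not taken from the codebase.

```python
from __future__ import annotations  # allows forward references without quotes

# Python 3.10+ style: built-in generics (list, dict) and `X | None` unions
# instead of typing.List / typing.Dict / typing.Optional.
def merge_chunks(chunks: list[dict[str, str]], sep: str | None = None) -> str:
    """Join chunk texts with a separator (illustrative only)."""
    return (sep or "\n\n").join(chunk["text"] for chunk in chunks)
```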
<testing_strategy>
<type_checking>
- Run mypy --strict on each module after adding types
- Use mypy daemon (dmypy) for faster incremental checking
- Add mypy to pre-commit hooks
- CI/CD must run mypy and fail on type errors
</type_checking>

<documentation_validation>
- Use pydocstyle to validate Google-style format
- Use sphinx-build to generate docs and catch errors
- Manual review of docstring examples
- Verify examples are executable and correct
</documentation_validation>

<integration_testing>
- Verify existing tests still pass after type additions
- Add new tests for complex typed structures
- Test mypy configuration on sample code
- Verify IDE autocomplete works correctly
</integration_testing>
</testing_strategy>

<documentation_examples>
<module_docstring>
```python
"""
PDF Pipeline V2 - Intelligent document processing with LLM enhancement.

This module orchestrates a 10-step pipeline for processing PDF documents:

1. OCR via Mistral API
2. Markdown construction with images
3. Metadata extraction via LLM
4. Table of contents (TOC) extraction
5. Section classification
6. Semantic chunking
7. Chunk cleaning and validation
8. Enrichment with concepts
9. Validation and corrections
10. Ingestion into Weaviate vector database

The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
various processing modes (skip OCR, semantic chunking, OCR annotations).

Typical usage:

    >>> from pathlib import Path
    >>> from utils.pdf_pipeline import process_pdf
    >>>
    >>> result = process_pdf(
    ...     Path("document.pdf"),
    ...     use_llm=True,
    ...     llm_provider="ollama",
    ...     ingest_to_weaviate=True,
    ... )
    >>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")

See Also:
    mistral_client: OCR API client
    llm_metadata: Metadata extraction
    weaviate_ingest: Database ingestion
"""
```
</module_docstring>

<function_docstring>
```python
def process_pdf_v2(
    pdf_path: Path,
    output_dir: Path = Path("output"),
    *,
    use_llm: bool = True,
    llm_provider: Literal["ollama", "mistral"] = "ollama",
    llm_model: Optional[str] = None,
    skip_ocr: bool = False,
    ingest_to_weaviate: bool = True,
    progress_callback: Optional[ProgressCallback] = None,
) -> PipelineResult:
    """
    Process a PDF through the complete V2 pipeline with LLM enhancement.

    This function orchestrates all 10 steps of the intelligent document processing
    pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
    cloud (Mistral API) LLM providers, with optional caching via skip_ocr.

    Args:
        pdf_path: Absolute path to the PDF file to process.
        output_dir: Base directory for output files. Defaults to "./output".
        use_llm: Enable LLM-based processing (metadata, TOC, chunking).
            If False, uses basic heuristic processing.
        llm_provider: LLM provider to use. "ollama" for local (free but slow),
            "mistral" for API (fast but paid).
        llm_model: Specific model name. If None, auto-detects based on provider
            (qwen2.5:7b for ollama, mistral-small-latest for mistral).
        skip_ocr: If True, reuses existing markdown file to avoid OCR cost.
            Requires output_dir/<doc_name>/<doc_name>.md to exist.
        ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
        progress_callback: Optional callback for real-time progress updates.
            Called with (step_id, status, detail) for each pipeline step.

    Returns:
        Dictionary containing processing results with the following keys:

        - success (bool): True if processing completed without errors
        - document_name (str): Name of the processed document
        - pages (int): Number of pages in the PDF
        - chunks_count (int): Number of chunks generated
        - cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
        - cost_llm (float): LLM API cost in euros (0 if provider=ollama)
        - cost_total (float): Total cost (ocr + llm)
        - metadata (dict): Extracted metadata (title, author, etc.)
        - toc (list): Hierarchical table of contents
        - files (dict): Paths to generated files (markdown, chunks, etc.)

    Raises:
        FileNotFoundError: If pdf_path does not exist.
        ValueError: If skip_ocr=True but markdown file not found.
        RuntimeError: If Weaviate connection fails during ingestion.

    Examples:
        Basic usage with Ollama (free):

        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="ollama"
        ... )
        >>> print(f"Cost: {result['cost_total']:.4f}€")
        Cost: 0.0270€  # OCR only

        With Mistral API (faster):

        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="mistral",
        ...     llm_model="mistral-small-latest"
        ... )

        Skip OCR to avoid cost:

        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     skip_ocr=True,  # Reuses existing markdown
        ...     ingest_to_weaviate=False
        ... )

    Notes:
        - OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
        - LLM cost: Free with Ollama, variable with Mistral API
        - Processing time: ~30s/page with Ollama, ~5s/page with Mistral
        - Weaviate must be running (docker-compose up -d) before ingestion
    """
```
</function_docstring>
</documentation_examples>
</project_specification>

@@ -1,490 +0,0 @@

<project_specification>
<project_name>Library RAG - Native Markdown Support</project_name>

<overview>
Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
chunking, and Weaviate vectorization.

This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
for users who have philosophical texts in Markdown format.
</overview>

<technology_stack>
<backend>
<framework>Flask 3.0</framework>
<pipeline>utils/pdf_pipeline.py (to be extended)</pipeline>
<validation>Werkzeug secure_filename</validation>
<llm>Ollama (local) or Mistral API</llm>
<vectorization>Weaviate with BAAI/bge-m3</vectorization>
</backend>
<type_safety>
<type_checker>mypy strict mode</type_checker>
<docstrings>Google-style docstrings required</docstrings>
</type_safety>
</technology_stack>

<core_features>
<feature_1>
<title>Update Flask File Validation</title>
<description>
Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
configuration and file validation logic to support .md files while maintaining backward compatibility
with existing PDF workflows.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
</files_to_modify>
<implementation_details>
- Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
- Update allowed_file() function to accept both extensions
- Update upload.html template to accept .md files in file input
- Update error messages to reflect both formats
</implementation_details>
<test_steps>
1. Start Flask app
2. Navigate to /upload
3. Attempt to upload a .md file
4. Verify file is accepted (no "Format non supporté" error)
5. Verify PDF upload still works
</test_steps>
</feature_1>

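The feature_1 change can be sketched as follows. The names ALLOWED_EXTENSIONS and allowed_file come from the spec; the function body is an assumed implementation, not the project's actual code.

```python
# Hypothetical sketch of the flask_app.py validation change.
ALLOWED_EXTENSIONS = {"pdf", "md"}  # was {"pdf"}

def allowed_file(filename: str) -> bool:
    """Return True if the filename carries an accepted extension."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
```

Lowercasing the extension keeps uploads like `essay.MD` or `scan.PDF` valid without widening the allow-list.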
<feature_2>
<title>Add Markdown Detection in Pipeline</title>
<description>
Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
Add logic to automatically skip OCR processing for .md files and copy the Markdown content
directly to the output directory.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
</files_to_modify>
<implementation_details>
- Add file extension detection: `file_ext = pdf_path.suffix.lower()`
- If file_ext == ".md":
  - Skip OCR step entirely (no Mistral API call)
  - Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
  - Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
  - Set nb_pages = md_content.count('\n# ') or 1 (estimate from H1 headers)
  - Set cost_ocr = 0.0
  - Emit progress: "markdown_load" instead of "ocr"
- If file_ext == ".pdf":
  - Continue with existing OCR workflow
- Both paths converge at LLM processing (metadata, TOC, chunking)
</implementation_details>
<test_steps>
1. Create test Markdown file with philosophical content
2. Call process_pdf(Path("test.md"), use_llm=True)
3. Verify OCR is skipped (cost_ocr = 0.0)
4. Verify output/test/test.md is created
5. Verify no _ocr.json file is created
6. Verify LLM processing runs normally
</test_steps>
</feature_2>

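The Markdown branch above can be sketched as a small helper. The function names are hypothetical; the page heuristic and the zero OCR cost are the ones the spec prescribes.

```python
from pathlib import Path

def estimate_pages(md_content: str) -> int:
    """Rough page estimate from H1 headers, as described in the spec."""
    return md_content.count("\n# ") or 1

def load_markdown(source_path: Path, output_dir: Path) -> tuple[str, int, float]:
    """Read a .md file directly and copy it to the output dir, bypassing OCR.

    Returns (content, estimated page count, cost_ocr). cost_ocr is always
    0.0 for Markdown since no Mistral API call is made.
    """
    md_content = source_path.read_text(encoding="utf-8")
    doc_dir = output_dir / source_path.stem
    doc_dir.mkdir(parents=True, exist_ok=True)
    (doc_dir / source_path.name).write_text(md_content, encoding="utf-8")
    return md_content, estimate_pages(md_content), 0.0
```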
<feature_3>
<title>Markdown-Specific Progress Callback</title>
<description>
Update the progress callback system to emit appropriate events for Markdown file processing.
Instead of "OCR Mistral en cours...", display "Chargement Markdown..." to provide accurate
user feedback during Server-Sent Events streaming.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (emit_progress calls)
- flask_app.py (process_file_background function)
</files_to_modify>
<implementation_details>
- Add conditional progress messages based on file type
- For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
- For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
- Update frontend to handle "markdown_load" event type
- Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
</implementation_details>
<test_steps>
1. Upload Markdown file via Flask interface
2. Monitor SSE progress stream at /upload/progress/<job_id>
3. Verify first step shows "Chargement du fichier Markdown..."
4. Verify no OCR-related messages appear
5. Verify subsequent steps (metadata, TOC, etc.) work normally
</test_steps>
</feature_3>

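The conditional first-step event could be factored out as below; the helper name and callback type alias are assumptions, while the event tuples are the ones given in the spec.

```python
from typing import Callable

# (step_id, status, detail), matching the spec's emit_progress signature.
ProgressCallback = Callable[[str, str, str], None]

def emit_load_step(file_ext: str, emit_progress: ProgressCallback) -> None:
    """Emit the format-appropriate first pipeline event."""
    if file_ext == ".md":
        emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
    else:
        emit_progress("ocr", "running", "OCR Mistral en cours...")
```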
<feature_4>
<title>Update process_pdf_bytes for Markdown</title>
<description>
Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
This function currently creates a temporary PDF file, but for Markdown uploads,
it should create a temporary .md file instead.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
</files_to_modify>
<implementation_details>
- Detect file type from filename parameter
- If filename ends with .md:
  - Create temp file with suffix=".md"
  - Write file_bytes as UTF-8 text
- If filename ends with .pdf:
  - Existing behavior (suffix=".pdf", binary write)
- Pass temp file path to process_pdf() which now handles both types
</implementation_details>
<test_steps>
1. Create Flask test client
2. POST multipart form with .md file to /upload
3. Verify process_pdf_bytes creates .md temp file
4. Verify temp file contains correct Markdown content
5. Verify cleanup deletes temp file after processing
</test_steps>
</feature_4>

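The temp-file branch could look like the following sketch; the helper name is hypothetical and the UTF-8 check is one possible way to honor the "write as UTF-8 text" requirement.

```python
import tempfile
from pathlib import Path

def write_upload_to_temp(file_bytes: bytes, filename: str) -> Path:
    """Persist an upload to a temp file whose suffix matches the format."""
    suffix = ".md" if filename.lower().endswith(".md") else ".pdf"
    if suffix == ".md":
        file_bytes.decode("utf-8")  # fail fast on non-UTF-8 Markdown uploads
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_bytes)  # Markdown and PDF bytes are both written verbatim
    return Path(tmp.name)
```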
<feature_5>
<title>Add Markdown File Validation</title>
<description>
Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
and basic Markdown structure. Reject files that are too large, contain binary data,
or have no meaningful content.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_create>
- utils/markdown_validator.py
</files_to_create>
<implementation_details>
- Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
- Checks:
  - File size < 10 MB
  - Valid UTF-8 encoding
  - Contains at least one header (#, ##, etc.)
  - Not empty (at least 100 characters)
  - No null bytes or excessive binary content
- Return dict with success, error, and warnings keys
- Call from process_pdf_v2 before processing
- Type annotations and Google-style docstrings required
</implementation_details>
<test_steps>
1. Test with valid Markdown file → passes validation
2. Test with empty file → fails with "File too short"
3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
4. Test with very large file (>10MB) → fails with "File too large"
5. Test with plain text no headers → warning but continues
</test_steps>
</feature_5>

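The checks above can be sketched as follows. The spec's function takes a Path; this sketch validates raw bytes instead to keep the example self-contained, and the error strings mirror the test steps.

```python
from typing import Any

MAX_BYTES = 10 * 1024 * 1024  # 10 MB upload ceiling
MIN_CHARS = 100               # reject near-empty files

def validate_markdown_bytes(raw: bytes) -> dict[str, Any]:
    """Validate raw Markdown bytes against the feature_5 checks."""
    result: dict[str, Any] = {"success": True, "error": None, "warnings": []}
    if len(raw) > MAX_BYTES:
        return {**result, "success": False, "error": "File too large"}
    if b"\x00" in raw:  # null bytes are a strong binary-content signal
        return {**result, "success": False, "error": "Binary content detected"}
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return {**result, "success": False, "error": "Invalid UTF-8"}
    if len(text) < MIN_CHARS:
        return {**result, "success": False, "error": "File too short"}
    if not any(line.lstrip().startswith("#") for line in text.splitlines()):
        # Missing headers is a warning, not a rejection (test step 5)
        result["warnings"].append("No Markdown headers found")
    return result
```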
<feature_6>
<title>Update Documentation</title>
<description>
Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
</description>
<priority>3</priority>
<category>documentation</category>
<files_to_modify>
- README.md (add section under "Pipeline de Traitement")
- .claude/CLAUDE.md (update development guidelines)
- templates/upload.html (add help text)
</files_to_modify>
<implementation_details>
- README.md:
  - Add "Support Markdown Natif" section
  - Document accepted formats: PDF, MD
  - Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
  - Add example: process_pdf(Path("document.md"))
- CLAUDE.md:
  - Update "Pipeline de Traitement" section
  - Note conditional OCR step
  - Document markdown_validator.py module
- upload.html:
  - Update file input accept attribute: accept=".pdf,.md"
  - Add help text: "Formats acceptés : PDF, Markdown (.md)"
</implementation_details>
<test_steps>
1. Read README.md markdown support section
2. Verify examples are clear and accurate
3. Check CLAUDE.md developer notes
4. Open /upload in browser
5. Verify help text displays correctly
</test_steps>
</feature_6>

<feature_7>
<title>Add Unit Tests for Markdown Processing</title>
<description>
Create comprehensive unit tests for Markdown file handling to ensure reliability
and prevent regressions. Cover file validation, pipeline processing, and edge cases.
</description>
<priority>2</priority>
<category>testing</category>
<files_to_create>
- tests/utils/test_markdown_validator.py
- tests/utils/test_pdf_pipeline_markdown.py
- tests/fixtures/sample.md
</files_to_create>
<implementation_details>
- test_markdown_validator.py:
  - Test valid Markdown acceptance
  - Test invalid encoding rejection
  - Test file size limits
  - Test empty file rejection
  - Test binary data detection
- test_pdf_pipeline_markdown.py:
  - Test Markdown file processing end-to-end
  - Test OCR skip for .md files
  - Test cost_ocr = 0.0
  - Test LLM processing (metadata, TOC, chunking)
  - Mock Weaviate ingestion
  - Verify output files created correctly
- fixtures/sample.md:
  - Create realistic philosophical text in Markdown
  - Include headers, paragraphs, formatting
  - ~1000 words for realistic testing
</implementation_details>
<test_steps>
1. Run: pytest tests/utils/test_markdown_validator.py -v
2. Verify all validation tests pass
3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
4. Verify end-to-end Markdown processing works
5. Check test coverage: pytest --cov=utils --cov-report=html
</test_steps>
</feature_7>

<feature_8>
<title>Type Safety and Documentation</title>
<description>
Ensure all new code follows strict type safety requirements and includes comprehensive
Google-style docstrings. Run mypy checks and update type definitions as needed.
</description>
<priority>2</priority>
<category>type_safety</category>
<files_to_modify>
- utils/types.py (add Markdown-specific types if needed)
- All modified modules (type annotations)
</files_to_modify>
<implementation_details>
- Add type annotations to all new functions
- Update existing functions that handle both PDF and MD
- Consider adding:
  - FileFormat = Literal["pdf", "md"]
  - MarkdownValidationResult = TypedDict(...)
- Run mypy --strict on all modified files
- Add Google-style docstrings with:
  - Args section documenting all parameters
  - Returns section with structure details
  - Raises section for exceptions
  - Examples section for complex functions
</implementation_details>
<test_steps>
1. Run: mypy utils/pdf_pipeline.py --strict
2. Run: mypy utils/markdown_validator.py --strict
3. Verify no type errors
4. Run: pydocstyle utils/markdown_validator.py --convention=google
5. Verify all docstrings follow Google style
</test_steps>
</feature_8>

<feature_9>
<title>Handle Markdown-Specific Edge Cases</title>
<description>
Address edge cases specific to Markdown processing: front matter (YAML/TOML),
embedded code blocks, special characters, and non-standard Markdown extensions.
</description>
<priority>3</priority>
<category>backend</category>
<files_to_modify>
- utils/markdown_validator.py
- utils/llm_metadata.py (handle front matter)
</files_to_modify>
<implementation_details>
- Front matter handling:
  - Detect YAML/TOML front matter (--- or +++)
  - Extract metadata if present (title, author, date)
  - Pass to LLM or use directly if valid
  - Strip front matter before content processing
- Code block handling:
  - Don't treat code blocks as actual content
  - Preserve them for chunking but don't analyze
- Special characters:
  - Handle Unicode properly (Greek, Latin, French accents)
  - Preserve LaTeX equations in $ or $$
- GitHub Flavored Markdown:
  - Support tables, task lists, strikethrough
  - Convert to standard format if needed
</implementation_details>
<test_steps>
1. Upload Markdown with YAML front matter
2. Verify metadata extracted correctly
3. Upload Markdown with code blocks
4. Verify code not treated as philosophical content
5. Upload Markdown with Greek/Latin text
6. Verify Unicode handled correctly
</test_steps>
</feature_9>

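Front matter stripping can be sketched with a regex; this handles only the YAML `---` delimiter (the TOML `+++` case the spec also mentions would need a second pattern), and the function name is hypothetical.

```python
import re

# Front matter: the file starts with ---, then anything, then a line of ---.
_FRONT_MATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*\n", re.DOTALL)

def split_front_matter(md_text: str) -> tuple[str, str]:
    """Split YAML front matter from the Markdown body.

    Returns (front_matter, body); front_matter is "" when absent.
    """
    match = _FRONT_MATTER.match(md_text)
    if match:
        return match.group(1), md_text[match.end():]
    return "", md_text
```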
<feature_10>
<title>Update UI/UX for Markdown Upload</title>
<description>
Enhance the upload interface to clearly communicate Markdown support and provide
visual feedback about the file type being processed. Show format-specific information
(e.g., "No OCR cost for Markdown files").
</description>
<priority>3</priority>
<category>frontend</category>
<files_to_modify>
- templates/upload.html
- templates/upload_progress.html
</files_to_modify>
<implementation_details>
- upload.html:
  - Add file type indicator icon (📄 PDF vs 📝 MD)
  - Show format-specific help text on hover
  - Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
  - Add example Markdown file download link
- upload_progress.html:
  - Show different icon for Markdown processing
  - Adjust progress bar (9 steps vs 10 steps)
  - Display "No OCR cost" badge for Markdown
  - Update step descriptions based on file type
</implementation_details>
<test_steps>
1. Open /upload page
2. Verify help text mentions both PDF and MD
3. Select a .md file
4. Verify file type indicator shows 📝
5. Submit upload
6. Verify progress shows "Chargement Markdown..."
7. Verify "No OCR cost" badge displays
</test_steps>
</feature_10>
</core_features>

<implementation_steps>
<step number="1">
<title>Setup and Configuration</title>
<tasks>
- Update ALLOWED_EXTENSIONS in flask_app.py
- Modify allowed_file() validation function
- Update upload.html file input accept attribute
- Add Markdown MIME type handling
</tasks>
</step>

<step number="2">
<title>Core Pipeline Extension</title>
<tasks>
- Add file extension detection in process_pdf_v2()
- Implement Markdown file reading logic
- Skip OCR for .md files
- Add conditional progress callbacks
- Update process_pdf_bytes() for Markdown
</tasks>
</step>

<step number="3">
<title>Validation and Error Handling</title>
<tasks>
- Create markdown_validator.py module
- Implement UTF-8 encoding validation
- Add file size limits
- Handle front matter extraction
- Add comprehensive error messages
</tasks>
</step>

<step number="4">
<title>Testing Infrastructure</title>
<tasks>
- Create test fixtures (sample.md)
- Write validation tests
- Write pipeline integration tests
- Add edge case tests
- Verify mypy strict compliance
</tasks>
</step>

<step number="5">
<title>Documentation and Polish</title>
<tasks>
- Update README.md with Markdown support
- Update .claude/CLAUDE.md developer docs
- Add Google-style docstrings
- Update UI templates with new messaging
- Create usage examples
</tasks>
</step>
</implementation_steps>

<success_criteria>
<functionality>
- Markdown files upload successfully via Flask
- OCR is skipped for .md files (cost_ocr = 0.0)
- LLM processing works identically for PDF and MD
- Chunks are created and vectorized correctly
- Both file types can be searched in Weaviate
- Existing PDF workflow remains unchanged
</functionality>

<type_safety>
- All code passes mypy --strict
- All functions have type annotations
- Google-style docstrings on all modules
- No Any types without justification
- TypedDict definitions for new data structures
</type_safety>

<testing>
- Unit tests cover Markdown validation
- Integration tests verify end-to-end processing
- Edge cases handled (front matter, Unicode, large files)
- Test coverage >80% for new code
- All tests pass in CI/CD pipeline
</testing>

<user_experience>
- Upload interface clearly shows both formats supported
- Progress feedback accurate for both PDF and MD
- Cost savings clearly communicated ("0€ for Markdown")
- Error messages helpful and specific
- Documentation clear with examples
</user_experience>

<performance>
- Markdown processing faster than PDF (no OCR)
- No regression in PDF processing speed
- Memory usage reasonable for large MD files
- Validation completes in <100ms
- Overall pipeline <30s for typical Markdown document
</performance>
</success_criteria>

<technical_notes>
<cost_comparison>
- PDF processing: OCR ~0.003€/page + LLM variable
- Markdown processing: 0€ OCR + LLM variable
- Estimated savings: 50-70% for documents with Markdown source
</cost_comparison>

<compatibility>
- Maintains backward compatibility with existing PDFs
- No breaking changes to API or database schema
- Existing chunks and documents unaffected
- Can process both formats in same session
</compatibility>

<future_enhancements>
- Support for .txt plain text files
- Support for .docx Word documents (via pandoc)
- Support for .epub ebooks
- Batch upload of multiple Markdown files
- Markdown to PDF export for archival
</future_enhancements>
</technical_notes>
</project_specification>
@@ -1,498 +0,0 @@

<project_specification>
<project_name>ikario - Tavily MCP Integration for Internet Access</project_name>

<overview>
This specification adds Tavily search capabilities via MCP (Model Context Protocol) to give Ikario
internet access for real-time web searches. Tavily provides high-quality search results optimized
for AI agents, making it ideal for research, fact-checking, and accessing current information.

This integration adds a new MCP server connection to the existing architecture (alongside the
ikario-memory MCP server) and exposes Tavily search tools to Ikario during conversations.

All changes are additive and backward-compatible. Existing functionality remains unchanged.
</overview>

<architecture_design>
<mcp_integration>
Tavily MCP Server Connection:
- Uses the @modelcontextprotocol/sdk Client to connect to the Tavily MCP server
- Connection can be stdio-based (local MCP server) or HTTP-based (remote)
- The Tavily MCP server provides search tools that are exposed to Claude via the Tool Use API
- Backend routes handle tool execution and return results to Claude
</mcp_integration>

<benefits>
- Real-time internet access for Ikario
- High-quality search results optimized for LLMs
- Fact-checking and verification capabilities
- Access to current events and news
- Research assistance with cited sources
- Seamless integration with existing memory tools
</benefits>
</architecture_design>

<technology_stack>
<mcp_server>
<name>Tavily MCP Server</name>
<protocol>Model Context Protocol (MCP)</protocol>
<connection>stdio or HTTP transport</connection>
<sdk>@modelcontextprotocol/sdk</sdk>
<api_key>Tavily API key (from https://tavily.com)</api_key>
</mcp_server>
<backend>
<runtime>Node.js with Express (existing)</runtime>
<mcp_client>MCP Client for the Tavily server connection</mcp_client>
<tool_executor>Existing toolExecutor service extended with Tavily tools</tool_executor>
</backend>
<api_endpoints>
<tavily_routes>GET/POST /api/tavily/* for Tavily-specific operations</tavily_routes>
<existing_routes>Existing /api/claude/chat routes support Tavily tools automatically</existing_routes>
</api_endpoints>
</technology_stack>

<prerequisites>
<environment_setup>
- Tavily API key obtained from https://tavily.com (free tier available)
- API key stored in the TAVILY_API_KEY environment variable or a configuration file
- MCP SDK already installed (@modelcontextprotocol/sdk exists for ikario-memory)
- Tavily MCP server installed (npm package or Python package)
</environment_setup>
<configuration>
- Add the Tavily MCP server config to server/.claude_settings.json or similar
- Configure connection parameters (stdio vs HTTP)
- Set the API key securely
</configuration>
</prerequisites>

<core_features>
<feature_1>
<title>Tavily MCP Client Setup</title>
<description>
Create the MCP client connection to the Tavily search server. This is similar to the existing
ikario-memory MCP client but connects to Tavily instead.

Implementation:
- Create server/services/tavilyMcpClient.js
- Initialize the MCP client with the Tavily server connection
- Handle the connection lifecycle (connect, disconnect, reconnect)
- Implement health checks and connection status
- Export the client instance and helper functions

Configuration:
- Read the Tavily API key from the environment or a config file
- Configure the transport (stdio or HTTP)
- Set the connection timeout and retry logic
- Log connection status for debugging

Error Handling:
- Graceful degradation if Tavily is unavailable
- Connection retry with exponential backoff
- Clear error messages for configuration issues
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Verify the MCP client can connect to the Tavily server on startup
2. Test that the connection health check endpoint returns the correct status
3. Verify graceful handling when the Tavily API key is missing
4. Test the reconnection logic when the connection drops
5. Verify the connection status is logged correctly
6. Test that the server starts even if Tavily is unavailable
</test_steps>
</feature_1>
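The retry-with-exponential-backoff behaviour described above can be sketched as follows (Python for brevity, though the spec targets a Node.js client; `connect` is a stand-in for the real MCP connection call):

```python
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Try to connect, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Example: a flaky connection that succeeds on the third attempt
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("Tavily MCP server unreachable")
    return "connected"

result = connect_with_backoff(flaky, sleep=lambda s: None)
```

Capping the delay and jittering it are common refinements once the basic loop works.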

<feature_2>
<title>Tavily Tool Configuration</title>
<description>
Configure Tavily search tools to be available to Claude during conversations.
This integrates with the existing tool system (like the memory tools).

Implementation:
- Create server/config/tavilyTools.js
- Define tool schemas for Tavily search capabilities
- Integrate with the existing toolExecutor service
- Add Tavily tools to the system prompt alongside the memory tools

Tavily Tools to Expose:
- tavily_search: General web search with AI-optimized results
  - Parameters: query (string), max_results (number), search_depth (basic/advanced)
  - Returns: Array of search results with title, url, content, score
- tavily_search_news: News-specific search for current events
  - Parameters: query (string), max_results (number), days (number)
  - Returns: Recent news articles with metadata

Tool Schema:
- Follow the Claude Tool Use API format
- Clear descriptions for each tool
- Well-defined input schemas with validation
- Proper error handling in tool execution
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Verify the Tavily tools are listed among the available tools
2. Test tool schema validation with valid inputs
3. Test that tool schema validation rejects invalid inputs
4. Verify the tools appear in Claude's system prompt
5. Test that the tool descriptions are clear and accurate
6. Verify the tools can be called without errors
</test_steps>
</feature_2>
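A tool definition following the Claude Tool Use format could look like this (shown as a Python dict for readability; the parameter names mirror the spec above, and the default values are assumptions):

```python
TAVILY_SEARCH_TOOL = {
    "name": "tavily_search",
    "description": (
        "General web search with AI-optimized results. "
        "Returns results with title, url, content and relevance score."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
            "max_results": {"type": "integer", "default": 5},
            "search_depth": {"type": "string", "enum": ["basic", "advanced"]},
        },
        "required": ["query"],  # only the query is mandatory
    },
}
```

The same shape, with `days` instead of `search_depth`, would cover tavily_search_news.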

<feature_3>
<title>Tavily Tool Executor Integration</title>
<description>
Integrate Tavily tools into the existing toolExecutor service so Claude can
use them during conversations.

Implementation:
- Extend server/services/toolExecutor.js to handle Tavily tools
- Add tool detection for tavily_search and tavily_search_news
- Implement the tool execution logic using the Tavily MCP client
- Format Tavily results for Claude consumption
- Handle errors and timeouts gracefully

Tool Execution Flow:
1. Claude requests tool use (e.g., tavily_search)
2. toolExecutor detects the Tavily tool request
3. Call the Tavily MCP client with the tool parameters
4. Receive and format the search results
5. Return the formatted results to Claude
6. Claude incorporates the results into its response

Result Formatting:
- Convert Tavily results to a Claude-friendly format
- Include source URLs for citation
- Add relevance scores
- Truncate content if too long
- Handle empty results gracefully
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Test tavily_search tool execution with a valid query
2. Verify the results are properly formatted
3. Test tavily_search_news tool execution
4. Verify error handling when the Tavily API fails
5. Test timeout handling for slow searches
6. Verify results include proper citations and URLs
7. Test with empty search results
8. Test with very long search queries
</test_steps>
</feature_3>
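The result-formatting step can be sketched like this (Python for brevity; the field names follow the result shape described above, and the truncation limit is an assumption):

```python
def format_results(results, max_chars=500):
    """Format Tavily results for Claude, truncating long content."""
    if not results:
        return "No results found."  # empty results handled gracefully
    lines = []
    for i, r in enumerate(results, 1):
        content = r["content"]
        if len(content) > max_chars:
            content = content[:max_chars] + "…"
        # Keep the URL and score so Claude can cite and weigh sources
        lines.append(f"{i}. {r['title']} ({r['url']}, score {r['score']:.2f})\n{content}")
    return "\n\n".join(lines)

sample = [{"title": "Example", "url": "https://example.com",
           "content": "Some text", "score": 0.91}]
formatted = format_results(sample)
```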

<feature_4>
<title>System Prompt Enhancement for Internet Access</title>
<description>
Update the system prompt to inform Ikario about its internet access capabilities.
This should be added alongside the existing memory tools instructions.

Implementation:
- Update MEMORY_SYSTEM_PROMPT in server/routes/messages.js and claude.js
- Add Tavily tools documentation
- Provide usage guidelines for when to search the internet
- Include examples of good search queries

Prompt Addition:
"## Internet Access via Tavily

You have real-time internet access through two search tools:

1. tavily_search: General web search optimized for AI
   - Use it to: look up current information, verify facts, find reliable sources
   - Parameters: query (your question), max_results (number of results, default: 5),
     search_depth ('basic' or 'advanced')
   - Returns: Results with title, URL, content and relevance score

2. tavily_search_news: Search for recent news
   - Use it for: current events, news stories
   - Parameters: query, max_results, days (number of days back, default: 7)

When to use internet search:
- When the user asks for recent or current information
- To verify facts or data you are not sure about
- When your knowledge base is too old (after January 2025)
- To find specific sources and citations
- For queries that require real-time data

Do NOT use search for:
- Questions about your own identity or capabilities
- General concepts you already know well
- Purely creative or opinion-based questions

Use these tools autonomously as the conversation requires.
Always cite your sources when you use information from Tavily."
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Verify the system prompt includes the Tavily instructions
2. Test that Claude understands when to use Tavily search
3. Verify Claude cites sources from Tavily results
4. Test that Claude uses appropriate search queries
5. Verify Claude chooses between tavily_search and tavily_search_news correctly
6. Test that Claude doesn't overuse search for simple questions
</test_steps>
</feature_4>

<feature_5>
<title>Tavily Status API Endpoint</title>
<description>
Create an API endpoint to check the Tavily MCP connection status and search capabilities,
similar to the /api/memory/status endpoint.

Implementation:
- Create a GET /api/tavily/status endpoint
- Return connection status, available tools, and configuration
- Create a GET /api/tavily/health endpoint for health checks
- Add the Tavily status to the existing /api/memory/stats (rename to /api/tools/stats)

Response Format:
{
  "success": true,
  "data": {
    "connected": true,
    "message": "Tavily MCP server is connected",
    "tools": ["tavily_search", "tavily_search_news"],
    "apiKeyConfigured": true,
    "transport": "stdio"
  }
}
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Test that GET /api/tavily/status returns the correct status
2. Verify the status shows "connected" when Tavily is available
3. Verify the status shows "disconnected" when Tavily is unavailable
4. Test that the health endpoint returns the proper status code
5. Verify the tools list is accurate
6. Test that a missing API key produces a proper error
</test_steps>
</feature_5>

<feature_6>
<title>Frontend UI Indicator for Internet Access</title>
<description>
Add a visual indicator in the UI to show when Ikario has internet access via Tavily.
This can be displayed alongside the existing memory status indicator.

Implementation:
- Add a Tavily status indicator in the header or sidebar
- Show online/offline status for the Tavily connection
- Optional: Show when Tavily is being used during a conversation
- Optional: Add a tooltip explaining the internet access capabilities

Visual Design:
- Globe or wifi icon to represent internet access
- Green when connected, gray when disconnected
- Subtle animation while a search is in progress
- Tooltip: "Internet access via Tavily" or similar

Integration:
- Use the existing useMemory hook pattern or create a useTavily hook
- Poll /api/tavily/status periodically (every 60s)
- Update the status in real time during searches
</description>
<priority>3</priority>
<category>frontend</category>
<test_steps>
1. Verify the internet access indicator appears in the UI
2. Test that the status updates when Tavily connects/disconnects
3. Verify the tooltip shows the correct information
4. Test that the indicator shows activity during searches
5. Verify that status polling doesn't impact performance
6. Test that the indicator shows offline status when Tavily is disabled
</test_steps>
</feature_6>

<feature_7>
<title>Manual Search UI (Optional Enhancement)</title>
<description>
Optional: Add a manual search interface to let users trigger Tavily searches directly,
similar to the memory search panel.

Implementation:
- Add an "Internet Search" panel in the sidebar (alongside the Memory panel)
- Search input for manual Tavily queries
- Display search results with title, snippet, URL
- Click to insert results into the conversation
- Filter by search type (general vs news)

This is OPTIONAL and lower priority. The primary use case is autonomous search by Claude.
</description>
<priority>4</priority>
<category>frontend</category>
<test_steps>
1. Verify the search panel appears in the sidebar
2. Test that a manual search returns results
3. Verify the results display properly with links
4. Test inserting results into the conversation
5. Test that the news search filter works correctly
6. Verify that search history is saved (optional)
</test_steps>
</feature_7>

<feature_8>
<title>Configuration and Settings</title>
<description>
Add Tavily configuration options to the settings and environment.

Implementation:
- Add TAVILY_API_KEY to the environment variables
- Add Tavily settings to .claude_settings.json or a similar config file
- Create server/config/tavilyConfig.js for configuration management
- Document the configuration options in the README

Configuration Options:
- API key
- Max results per search (default: 5)
- Search depth (basic/advanced)
- Timeout duration
- Enable/disable Tavily globally
- Rate limiting settings

Security:
- The API key must NOT be exposed to the frontend
- Use an environment variable or a secure config file
- Validate the API key on startup
- Log warnings if the API key is missing
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Verify the API key is read from the environment variable
2. Test the fallback to the config file if the env var is not set
3. Verify API key validation on startup
4. Test that configuration options are applied correctly
5. Verify the API key is never exposed in API responses
6. Test enabling/disabling Tavily via config
</test_steps>
</feature_8>

<feature_9>
<title>Error Handling and Rate Limiting</title>
<description>
Implement robust error handling and rate limiting for Tavily API calls.

Implementation:
- Detect and handle Tavily API errors (rate limits, invalid API key, etc.)
- Implement client-side rate limiting to avoid hitting Tavily limits
- Cache search results for duplicate queries (optional)
- Provide clear error messages to Claude when searches fail

Error Types:
- 401: Invalid API key
- 429: Rate limit exceeded
- 500: Tavily server error
- Timeout: Search took too long
- Network: Connection failed

Rate Limiting:
- Track searches per minute/hour
- Queue requests if the limit is reached
- Return cached results for duplicate queries within 5 minutes
- Log rate limit warnings
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Test error handling for an invalid API key
2. Verify rate limit detection and handling
3. Test timeout handling for slow searches
4. Verify the error messages are clear to Claude
5. Test that rate limiting prevents API abuse
6. Verify caching works for duplicate queries
</test_steps>
</feature_9>
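The rate-limiting and caching rules above can be sketched together (Python for brevity; the 5-minute cache TTL comes from the spec, while the per-minute limit is an assumption):

```python
import time

class SearchThrottle:
    """Client-side rate limiting plus a short-lived result cache."""

    def __init__(self, max_per_minute=10, cache_ttl=300, clock=time.time):
        self.max_per_minute = max_per_minute
        self.cache_ttl = cache_ttl      # 5 minutes, per the spec above
        self.clock = clock
        self.calls = []                 # timestamps of recent searches
        self.cache = {}                 # query -> (timestamp, results)

    def search(self, query, do_search):
        now = self.clock()
        cached = self.cache.get(query)
        if cached and now - cached[0] < self.cache_ttl:
            return cached[1]            # duplicate query within TTL: serve cache
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_per_minute:
            raise RuntimeError("Rate limit reached; queue or retry later")
        self.calls.append(now)
        results = do_search(query)
        self.cache[query] = (now, results)
        return results

throttle = SearchThrottle()
hits = []
first = throttle.search("weaviate", lambda q: hits.append(q) or ["r1"])
second = throttle.search("weaviate", lambda q: hits.append(q) or ["r2"])  # cache hit
```

Only one real search runs for the duplicate query; a production version would also key the cache on the search parameters.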

<feature_10>
<title>Documentation and README Updates</title>
<description>
Update the project documentation to explain the Tavily integration.

Implementation:
- Update the main README.md with Tavily setup instructions
- Add TAVILY_SETUP.md with a detailed configuration guide
- Document the API endpoints in the README
- Add examples of using Tavily with Ikario
- Document troubleshooting steps

Documentation Sections:
- Prerequisites (Tavily API key)
- Installation steps
- Configuration options
- Testing the Tavily connection
- Example conversations using internet search
- Troubleshooting common issues
- API reference for the Tavily endpoints
</description>
<priority>3</priority>
<category>documentation</category>
<test_steps>
1. Verify the README has a Tavily setup section
2. Test that the setup instructions are clear and complete
3. Verify all configuration options are documented
4. Test that the examples work as described
5. Verify the troubleshooting section covers common issues
</test_steps>
</feature_10>
</core_features>

<implementation_notes>
<order>
Recommended implementation order:
1. Feature 1 (MCP Client Setup) - Foundation
2. Feature 2 (Tool Configuration) - Core functionality
3. Feature 3 (Tool Executor Integration) - Core functionality
4. Feature 8 (Configuration) - Required for testing
5. Feature 4 (System Prompt) - Makes the tools accessible to Claude
6. Feature 9 (Error Handling) - Production readiness
7. Feature 5 (Status API) - Monitoring
8. Feature 10 (Documentation) - User onboarding
9. Feature 6 (UI Indicator) - Nice to have
10. Feature 7 (Manual Search UI) - Optional enhancement
</order>

<testing>
After implementing features 1-5, you should be able to:
- Ask Ikario: "What is in the news today?"
- Ask Ikario: "Look up information on [current topic]"
- Ask Ikario: "Verify this information: [claim]"

Ikario should autonomously use Tavily search and cite its sources.
</testing>

<compatibility>
- This specification is fully compatible with the existing ikario-memory MCP integration
- Ikario will have both memory tools AND internet search tools
- The tools can be used together in the same conversation
- No conflicts are expected between the tool systems
</compatibility>
</implementation_notes>

<safety_requirements>
<critical>
- DO NOT expose the Tavily API key to the frontend or in API responses
- DO NOT modify the existing MCP memory integration
- DO NOT break existing conversation functionality
- Tavily should degrade gracefully if unavailable (don't crash the app)
- Implement proper rate limiting to avoid API abuse
- Validate all user inputs before passing them to Tavily
- Sanitize search results before displaying them (XSS prevention)
- Log all Tavily API calls for monitoring and debugging
</critical>
</safety_requirements>

<success_metrics>
- Ikario can successfully perform internet searches when asked
- Search results are relevant and well formatted
- Sources are properly cited
- The Tavily integration doesn't slow down conversations
- Error handling is robust and user-friendly
- Configuration is straightforward
- Documentation is clear and complete
</success_metrics>
</project_specification>
@@ -1,679 +0,0 @@

<project_specification>
<project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>

<overview>
Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
improve code maintainability, enable static type checking with mypy, and provide clear documentation
for all functions, classes, and modules.

The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
semantic chunking, and ingestion into the Weaviate vector database. It includes a Flask web interface
for document upload, processing, and semantic search.
</overview>

<technology_stack>
<backend>
<runtime>Python 3.10+</runtime>
<web_framework>Flask 3.0</web_framework>
<vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
<ocr>Mistral OCR API</ocr>
<llm>Ollama (local) or Mistral API</llm>
<type_checking>mypy with strict configuration</type_checking>
</backend>
<infrastructure>
<containerization>Docker Compose (Weaviate + transformers)</containerization>
<dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
</infrastructure>
</technology_stack>

<current_state>
<project_structure>
- flask_app.py: Main Flask application (640 lines)
- schema.py: Weaviate schema definition (383 lines)
- utils/: 16+ modules for the PDF processing pipeline
  - pdf_pipeline.py: Main orchestration (879 lines)
  - mistral_client.py: OCR API client
  - ocr_processor.py: OCR processing
  - markdown_builder.py: Markdown generation
  - llm_metadata.py: Metadata extraction via LLM
  - llm_toc.py: Table of contents extraction
  - llm_classifier.py: Section classification
  - llm_chunker.py: Semantic chunking
  - llm_cleaner.py: Chunk cleaning
  - llm_validator.py: Document validation
  - weaviate_ingest.py: Database ingestion
  - hierarchy_parser.py: Document hierarchy parsing
  - image_extractor.py: Image extraction from PDFs
  - toc_extractor*.py: Various TOC extraction methods
- templates/: Jinja2 templates for the Flask UI
- tests/utils2/: Minimal test coverage (3 test files)
</project_structure>

<issues>
- Inconsistent type annotations across modules (some have partial types, many have none)
- Missing or incomplete docstrings (no Google-style format)
- No mypy configuration for strict type checking
- Type hints missing on function parameters and return values
- Dict[str, Any] used extensively without proper typing
- No type stubs for complex nested structures
</issues>
</current_state>

<core_features>
<type_annotations>
<strict_typing>
- Add complete type annotations to ALL functions and methods
- Use proper generic types (List, Dict, Optional, Union) from the typing module
- Add TypedDict for complex dictionary structures
- Add Protocol types for duck-typed interfaces
- Use Literal types for string constants
- Add ParamSpec and TypeVar where appropriate
- Type all class attributes and instance variables
- Add type annotations to lambda functions where possible
</strict_typing>

<mypy_configuration>
- Create mypy.ini with a strict configuration
- Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
- Enable: disallow_untyped_calls, disallow_untyped_decorators
- Enable: warn_return_any, warn_redundant_casts
- Enable: strict_equality, strict_optional
- Set python_version to 3.10
- Configure per-module overrides if needed for gradual migration
</mypy_configuration>
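A mypy.ini matching the flags listed above might look like this (a sketch; per-module overrides can be added in separate `[mypy-module.*]` sections as migration progresses):

```ini
[mypy]
python_version = 3.10
check_untyped_defs = True
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_untyped_calls = True
disallow_untyped_decorators = True
warn_return_any = True
warn_redundant_casts = True
strict_equality = True
strict_optional = True
```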

<type_stubs>
- Create TypedDict definitions for common data structures:
  - OCR response structures
  - Metadata dictionaries
  - TOC entries
  - Chunk objects
  - Weaviate objects
  - Pipeline results
- Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
- Create Protocol types for callback functions
</type_stubs>

<specific_improvements>
- pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
- flask_app.py: Type all route handlers, request/response types
- schema.py: Type Weaviate configuration objects
- llm_*.py: Type LLM request/response structures
- mistral_client.py: Type API client methods and responses
- weaviate_ingest.py: Type ingestion functions and batch operations
</specific_improvements>
</type_annotations>
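The TypedDict and NewType definitions described above could start like this (the field names are illustrative assumptions, not the project's actual schema):

```python
from typing import NewType, TypedDict

# NewType gives semantic meaning to plain strings at type-check time
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)

class TOCEntry(TypedDict):
    title: str
    level: int
    page: int

class ChunkData(TypedDict):
    chunk_id: ChunkId
    document: DocumentName
    text: str
    section_path: str

chunk: ChunkData = {
    "chunk_id": ChunkId("c-001"),
    "document": DocumentName("meditations.pdf"),
    "text": "…",
    "section_path": "Book I > Section 2",
}
```

At runtime these are ordinary dicts and strings; mypy enforces the structure statically, which is what replaces the pervasive Dict[str, Any].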

<documentation>
<google_style_docstrings>
- Add comprehensive Google-style docstrings to ALL of:
  - Module-level docstrings explaining purpose and usage
  - Class docstrings with an Attributes section
  - Function/method docstrings with Args, Returns, Raises sections
  - Complex algorithm explanations with an Examples section
- Include code examples for public APIs
- Document all exceptions that can be raised
- Add a Notes section for important implementation details
- Add a See Also section for related functions
</google_style_docstrings>

<module_documentation>
<utils_modules>
- pdf_pipeline.py: Document the 10-step pipeline and each step's purpose
- mistral_client.py: Document OCR API usage and cost calculation
- llm_metadata.py: Document the metadata extraction logic
- llm_toc.py: Document the TOC extraction strategies
- llm_classifier.py: Document the section classification types
- llm_chunker.py: Document semantic vs basic chunking
- llm_cleaner.py: Document the cleaning rules and validation
- llm_validator.py: Document the validation criteria
- weaviate_ingest.py: Document the ingestion process and nested objects
- hierarchy_parser.py: Document the hierarchy-building algorithm
</utils_modules>

<flask_app>
- Document all routes with request/response examples
- Document the SSE (Server-Sent Events) implementation
- Document the Weaviate query patterns
- Document the upload processing workflow
- Document background job management
</flask_app>

<schema>
- Document Weaviate schema design decisions
- Document each collection's purpose and relationships
- Document the nested object structure
- Document the vectorization strategy
</schema>
</module_documentation>

<inline_comments>
- Add inline comments for complex logic only (don't over-comment)
- Explain WHY, not WHAT (code should be self-documenting)
- Document performance considerations
- Document cost implications (OCR, LLM API calls)
- Document error handling strategies
</inline_comments>
</documentation>
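A function documented in the Google style described above might look like this (the function itself is a simplified stand-in, not the project's real clean_chunk):

```python
def clean_chunk(text: str, min_length: int = 20) -> str:
    """Normalize whitespace in a chunk and reject fragments that are too short.

    Args:
        text: Raw chunk text from the semantic chunker.
        min_length: Minimum number of characters for a valid chunk.

    Returns:
        The cleaned chunk text with collapsed whitespace.

    Raises:
        ValueError: If the cleaned text is shorter than ``min_length``.

    Example:
        >>> clean_chunk("  Being   and   time  " * 2)
        'Being and time Being and time'
    """
    cleaned = " ".join(text.split())
    if len(cleaned) < min_length:
        raise ValueError(f"Chunk too short: {len(cleaned)} < {min_length}")
    return cleaned
```

Keeping the Example section executable (doctest-style) is what makes the "Examples should be executable and tested" requirement below enforceable.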

<validation>
<type_checking>
- All modules must pass mypy --strict
- No # type: ignore comments without justification
- CI/CD should run mypy checks
- Type coverage should be 100%
</type_checking>

<documentation_quality>
- All public functions must have docstrings
- All docstrings must follow the Google style
- Examples should be executable and tested
- Documentation should be clear and concise
</documentation_quality>
</validation>
</core_features>

<implementation_priority>
<critical_modules>
Priority 1 (most used, most complex):
1. utils/pdf_pipeline.py - Main orchestration
2. flask_app.py - Web application entry point
3. utils/weaviate_ingest.py - Database operations
4. schema.py - Schema definition

Priority 2 (core LLM modules):
5. utils/llm_metadata.py
6. utils/llm_toc.py
7. utils/llm_classifier.py
8. utils/llm_chunker.py
9. utils/llm_cleaner.py
10. utils/llm_validator.py

Priority 3 (OCR and parsing):
11. utils/mistral_client.py
12. utils/ocr_processor.py
13. utils/markdown_builder.py
14. utils/hierarchy_parser.py
15. utils/image_extractor.py

Priority 4 (supporting modules):
16. utils/toc_extractor.py
17. utils/toc_extractor_markdown.py
18. utils/toc_extractor_visual.py
19. utils/llm_structurer.py (legacy)
</critical_modules>
</implementation_priority>

<implementation_steps>
<feature_1>
<title>Setup Type Checking Infrastructure</title>
<description>
Configure mypy with strict settings and create foundational type definitions.
</description>
<tasks>
- Create a mypy.ini configuration file with strict settings
- Add mypy to requirements.txt or the dev dependencies
- Create a utils/types.py module for common TypedDict definitions
- Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
- Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
- Create Protocol types for callbacks (ProgressCallback, etc.)
- Document the type definitions in the utils/types.py module docstring
- Test the mypy configuration on a single module to verify the settings
</tasks>
<acceptance_criteria>
- mypy.ini exists with a strict configuration
- utils/types.py contains all foundational types with docstrings
- mypy runs without errors on utils/types.py
- The type definitions are comprehensive and reusable
</acceptance_criteria>
</feature_1>
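The ProgressCallback protocol mentioned above could be defined like this (the callback signature is an assumption based on how the pipeline reports steps):

```python
from typing import Protocol

class ProgressCallback(Protocol):
    """Callback invoked by the pipeline to report step progress."""
    def __call__(self, step: str, percent: int) -> None: ...

def run_step(name: str, callback: ProgressCallback) -> None:
    """Report a step's start and completion through the callback."""
    callback(name, 0)
    # ... do the actual work here ...
    callback(name, 100)

events: list[tuple[str, int]] = []
run_step("ocr", lambda step, percent: events.append((step, percent)))
```

Because Protocol uses structural typing, any callable with a matching signature satisfies it; no inheritance is required at the call sites.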

<feature_2>
<title>Add Types to PDF Pipeline Orchestration</title>
<description>
Add complete type annotations to pdf_pipeline.py (879 lines, the most complex module).
</description>
<tasks>
- Add type annotations to all function signatures in pdf_pipeline.py
- Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Validate, Weaviate
- Type the progress_callback parameter with a Protocol or Callable
- Add a TypedDict for the pipeline options dictionary
- Add a TypedDict for the pipeline result dictionary structure
- Type all helper functions (extract_document_metadata_legacy, etc.)
- Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
- Fix any mypy errors that arise
- Verify mypy --strict passes on pdf_pipeline.py
</tasks>
<acceptance_criteria>
- All functions in pdf_pipeline.py have complete type annotations
- progress_callback is properly typed with a Protocol
- All Dict[str, Any] replaced with TypedDict where appropriate
- mypy --strict pdf_pipeline.py passes with zero errors
- No # type: ignore comments (or justified if absolutely necessary)
</acceptance_criteria>
</feature_2>

<feature_3>
<title>Add Types to Flask Application</title>
<description>
Add complete type annotations to flask_app.py and type all routes.
</description>
<tasks>
- Add type annotations to all Flask route handlers
- Type request.args, request.form, request.files usage
- Type jsonify() return values
- Type the get_weaviate_client context manager
- Type the get_collection_stats, get_all_chunks, search_chunks functions
- Add a TypedDict for Weaviate query results
- Type the background job processing functions (run_processing_job)
- Type the SSE generator function (upload_progress)
- Add type hints for template rendering
- Verify mypy --strict passes on flask_app.py
</tasks>
<acceptance_criteria>
- All Flask routes have complete type annotations
- Request/response types are clear and documented
- Weaviate query functions are properly typed
- The SSE generator is correctly typed
- mypy --strict flask_app.py passes with zero errors
</acceptance_criteria>
</feature_3>
|
||||
|
||||
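Two of the patterns above can be sketched without importing Flask itself: a JSON route returning a `(payload, status)` tuple (a shape Flask accepts from view functions), and a typed SSE generator. The route name and payload fields are hypothetical, not taken from flask_app.py.

```python
from collections.abc import Generator
from typing import TypedDict


class StatsResponse(TypedDict):
    """JSON payload for a hypothetical /api/stats route."""

    total_chunks: int
    total_documents: int


def collection_stats() -> tuple[StatsResponse, int]:
    """Typed route body: Flask view functions may return (payload, status)."""
    payload: StatsResponse = {"total_chunks": 180, "total_documents": 3}
    return payload, 200


def upload_progress_stream() -> Generator[str, None, None]:
    """SSE generators yield pre-formatted 'data: ...' event strings."""
    for step in ("ocr", "chunking", "ingest"):
        yield f"data: {step}\n\n"
```

Typing the generator as `Generator[str, None, None]` lets mypy catch a route that accidentally yields bytes or dicts.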
<feature_4>
<title>Add Types to Core LLM Modules</title>
<description>
Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
</description>
<tasks>
- llm_metadata.py: Type extract_metadata function, return structure
- llm_toc.py: Type extract_toc function, TOC hierarchy structure
- llm_classifier.py: Type classify_sections, section types (Literal), validation functions
- llm_chunker.py: Type chunk_section_with_llm, chunk objects
- llm_cleaner.py: Type clean_chunk, is_chunk_valid functions
- llm_validator.py: Type validate_document, validation result structure
- Add TypedDict for LLM request/response structures
- Type provider selection ("ollama" | "mistral" as Literal)
- Type model names with Literal or constants
- Verify mypy --strict passes on all llm_*.py modules
</tasks>
<acceptance_criteria>
- All LLM modules have complete type annotations
- Section types use Literal for type safety
- Provider and model parameters are strongly typed
- LLM request/response structures use TypedDict
- mypy --strict passes on all llm_*.py modules with zero errors
</acceptance_criteria>
</feature_4>

<feature_5>
<title>Add Types to Weaviate and Database Modules</title>
<description>
Add complete type annotations to schema.py and weaviate_ingest.py
</description>
<tasks>
- schema.py: Type Weaviate configuration objects
- schema.py: Type collection property definitions
- weaviate_ingest.py: Type ingest_document function signature
- weaviate_ingest.py: Type delete_document_chunks function
- weaviate_ingest.py: Add TypedDict for Weaviate object structure
- Type batch insertion operations
- Type nested object references (work, document)
- Add proper error types for Weaviate exceptions
- Verify mypy --strict passes on both modules
</tasks>
<acceptance_criteria>
- schema.py has complete type annotations for Weaviate config
- weaviate_ingest.py functions are fully typed
- Nested object structures use TypedDict
- Weaviate client operations are properly typed
- mypy --strict passes on both modules with zero errors
</acceptance_criteria>
</feature_5>

<feature_6>
<title>Add Types to OCR and Parsing Modules</title>
<description>
Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py
</description>
<tasks>
- mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
- mistral_client.py: Add TypedDict for Mistral API response structures
- ocr_processor.py: Type serialize_ocr_response, OCR object structures
- markdown_builder.py: Type build_markdown, image_writer parameter
- hierarchy_parser.py: Type build_hierarchy, flatten_hierarchy functions
- hierarchy_parser.py: Add TypedDict for hierarchy node structure
- image_extractor.py: Type create_image_writer, image handling
- Verify mypy --strict passes on all modules
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have complete type annotations
- Mistral API structures use TypedDict
- Hierarchy nodes are properly typed
- Image handling functions are typed
- mypy --strict passes on all modules with zero errors
</acceptance_criteria>
</feature_6>

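A recursive hierarchy-node TypedDict can be expressed with a quoted forward reference. The field names here are assumptions; the real structure lives in hierarchy_parser.py, and the `flatten` helper is a stand-in for `flatten_hierarchy`.

```python
from typing import TypedDict


class HierarchyNode(TypedDict):
    """One node of the document hierarchy; children recurse on the same type."""

    title: str
    level: int
    children: list["HierarchyNode"]  # quoted: forward reference to this class


def flatten(node: HierarchyNode) -> list[str]:
    """Depth-first list of section titles."""
    titles = [node["title"]]
    for child in node["children"]:
        titles.extend(flatten(child))
    return titles


tree: HierarchyNode = {
    "title": "Menon",
    "level": 1,
    "children": [{"title": "Introduction", "level": 2, "children": []}],
}
```

The forward reference lets mypy type-check arbitrarily deep trees without a separate node class per level.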
<feature_7>
<title>Add Google-Style Docstrings to Core Modules</title>
<description>
Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and weaviate modules
</description>
<tasks>
- pdf_pipeline.py: Add module docstring explaining the V2 pipeline
- pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
- pdf_pipeline.py: Document each of the 10 pipeline steps in comments
- pdf_pipeline.py: Add Examples section showing typical usage
- flask_app.py: Add module docstring explaining the Flask application
- flask_app.py: Document all routes with request/response examples
- flask_app.py: Document Weaviate connection management
- schema.py: Add module docstring explaining schema design
- schema.py: Document each collection's purpose and relationships
- weaviate_ingest.py: Document ingestion process with examples
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All core modules have comprehensive module-level docstrings
- All public functions have Google-style docstrings
- Args, Returns, Raises sections are complete and accurate
- Examples are provided for complex functions
- Docstrings explain WHY, not just WHAT
</acceptance_criteria>
</feature_7>

<feature_8>
<title>Add Google-Style Docstrings to LLM Modules</title>
<description>
Add comprehensive Google-style docstrings to all LLM processing modules
</description>
<tasks>
- llm_metadata.py: Document metadata extraction logic with examples
- llm_toc.py: Document TOC extraction strategies and fallbacks
- llm_classifier.py: Document section types and classification criteria
- llm_chunker.py: Document semantic vs basic chunking approaches
- llm_cleaner.py: Document cleaning rules and validation logic
- llm_validator.py: Document validation criteria and corrections
- Add Examples sections showing input/output for each function
- Document LLM provider differences (Ollama vs Mistral)
- Document cost implications in Notes sections
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All LLM modules have comprehensive docstrings
- Each function has Args, Returns, Raises sections
- Examples show realistic input/output
- Provider differences are documented
- Cost implications are noted where relevant
</acceptance_criteria>
</feature_8>

<feature_9>
<title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
<description>
Add comprehensive Google-style docstrings to OCR, markdown, hierarchy, and extraction modules
</description>
<tasks>
- mistral_client.py: Document OCR API usage, cost calculation
- ocr_processor.py: Document OCR response processing
- markdown_builder.py: Document markdown generation strategy
- hierarchy_parser.py: Document hierarchy building algorithm
- image_extractor.py: Document image extraction process
- toc_extractor*.py: Document various TOC extraction methods
- Add Examples sections for complex algorithms
- Document edge cases and error handling
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have comprehensive docstrings
- Complex algorithms are well explained
- Edge cases are documented
- Error handling is documented
- Examples demonstrate typical usage
</acceptance_criteria>
</feature_9>

<feature_10>
<title>Final Validation and CI Integration</title>
<description>
Verify all type annotations and docstrings, integrate mypy into CI/CD
</description>
<tasks>
- Run mypy --strict on entire codebase, verify 100% pass rate
- Verify all public functions have docstrings
- Check docstring formatting with pydocstyle or similar tool
- Create GitHub Actions workflow to run mypy on every commit
- Update README.md with type checking instructions
- Update CLAUDE.md with documentation standards
- Create CONTRIBUTING.md with type annotation and docstring guidelines
- Generate API documentation with Sphinx or pdoc
- Fix any remaining mypy errors or missing docstrings
</tasks>
<acceptance_criteria>
- mypy --strict passes on entire codebase with zero errors
- All public functions have Google-style docstrings
- CI/CD runs mypy checks automatically
- Documentation is generated and accessible
- Contributing guidelines document type/docstring requirements
</acceptance_criteria>
</feature_10>
</implementation_steps>

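The GitHub Actions workflow from feature 10 could be sketched as below. This is a minimal hedged example; the workflow filename, Python version, and dependency installation are assumptions to adapt to the repository.

```yaml
# .github/workflows/type-check.yml (hypothetical path)
name: type-check
on: [push, pull_request]

jobs:
  mypy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install mypy
      # Fail the build on any strict-mode type error.
      - run: mypy --config-file=mypy.ini .
```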
<success_criteria>
<type_safety>
- 100% type coverage across all modules
- mypy --strict passes with zero errors
- No # type: ignore comments without justification
- All Dict[str, Any] replaced with TypedDict where appropriate
- Proper use of generics, protocols, and type variables
- NewType used for semantic type safety
</type_safety>

<documentation_quality>
- All modules have comprehensive module-level docstrings
- All public functions/classes have Google-style docstrings
- All docstrings include Args, Returns, Raises sections
- Complex functions include Examples sections
- Cost implications documented in Notes sections
- Error handling clearly documented
- Provider differences (Ollama vs Mistral) documented
</documentation_quality>

<code_quality>
- Code is self-documenting with clear variable names
- Inline comments explain WHY, not WHAT
- Complex algorithms are well explained
- Performance considerations documented
- Security considerations documented
</code_quality>

<developer_experience>
- IDE autocomplete works perfectly with type hints
- Type errors caught at development time, not runtime
- Documentation is easily accessible in IDE
- API examples are executable and tested
- Contributing guidelines are clear and comprehensive
</developer_experience>

<maintainability>
- Refactoring is safer with type checking
- Function signatures are self-documenting
- API contracts are explicit and enforced
- Breaking changes are caught by type checker
- New developers can understand code quickly
</maintainability>
</success_criteria>

<constraints>
<compatibility>
- Must maintain backward compatibility with existing code
- Cannot break existing Flask routes or API contracts
- Weaviate schema must remain unchanged
- Existing tests must continue to pass
</compatibility>

<gradual_migration>
- Can use per-module mypy configuration for gradual migration
- Can temporarily disable strict checks on legacy modules
- Priority modules must be completed first
- Low-priority modules can be deferred
</gradual_migration>

<standards>
- Prefer Python 3.10+ annotation syntax (X | None, list[str]) where supported
- Fall back to the typing module (List, Dict, Optional) while Python 3.9 support remains
- Docstrings must follow Google style exactly (not NumPy or reStructuredText)
- Use from __future__ import annotations if needed for forward references
</standards>
</constraints>

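The per-module gradual-migration configuration mentioned above can be sketched in mypy.ini. The legacy module name below is hypothetical; the pattern is what matters: strict globally, relaxed per-section overrides for modules not yet migrated.

```ini
[mypy]
python_version = 3.10
strict = True

# Hypothetical legacy module: relax strictness until it is migrated,
# then delete this section.
[mypy-utils.toc_extractor_legacy]
disallow_untyped_defs = False
check_untyped_defs = False
```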
<testing_strategy>
<type_checking>
- Run mypy --strict on each module after adding types
- Use mypy daemon (dmypy) for faster incremental checking
- Add mypy to pre-commit hooks
- CI/CD must run mypy and fail on type errors
</type_checking>

<documentation_validation>
- Use pydocstyle to validate Google-style format
- Use sphinx-build to generate docs and catch errors
- Manual review of docstring examples
- Verify examples are executable and correct
</documentation_validation>

<integration_testing>
- Verify existing tests still pass after type additions
- Add new tests for complex typed structures
- Test mypy configuration on sample code
- Verify IDE autocomplete works correctly
</integration_testing>
</testing_strategy>

<documentation_examples>
<module_docstring>
```python
"""
PDF Pipeline V2 - Intelligent document processing with LLM enhancement.

This module orchestrates a 10-step pipeline for processing PDF documents:
1. OCR via Mistral API
2. Markdown construction with images
3. Metadata extraction via LLM
4. Table of contents (TOC) extraction
5. Section classification
6. Semantic chunking
7. Chunk cleaning and validation
8. Enrichment with concepts
9. Validation and corrections
10. Ingestion into Weaviate vector database

The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
various processing modes (skip OCR, semantic chunking, OCR annotations).

Typical usage:
    >>> from pathlib import Path
    >>> from utils.pdf_pipeline import process_pdf
    >>>
    >>> result = process_pdf(
    ...     Path("document.pdf"),
    ...     use_llm=True,
    ...     llm_provider="ollama",
    ...     ingest_to_weaviate=True,
    ... )
    >>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")

See Also:
    mistral_client: OCR API client
    llm_metadata: Metadata extraction
    weaviate_ingest: Database ingestion
"""
```
</module_docstring>

<function_docstring>
```python
def process_pdf_v2(
    pdf_path: Path,
    output_dir: Path = Path("output"),
    *,
    use_llm: bool = True,
    llm_provider: Literal["ollama", "mistral"] = "ollama",
    llm_model: Optional[str] = None,
    skip_ocr: bool = False,
    ingest_to_weaviate: bool = True,
    progress_callback: Optional[ProgressCallback] = None,
) -> PipelineResult:
    """
    Process a PDF through the complete V2 pipeline with LLM enhancement.

    This function orchestrates all 10 steps of the intelligent document processing
    pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
    cloud (Mistral API) LLM providers, with optional caching via skip_ocr.

    Args:
        pdf_path: Absolute path to the PDF file to process.
        output_dir: Base directory for output files. Defaults to "./output".
        use_llm: Enable LLM-based processing (metadata, TOC, chunking).
            If False, uses basic heuristic processing.
        llm_provider: LLM provider to use. "ollama" for local (free but slow),
            "mistral" for API (fast but paid).
        llm_model: Specific model name. If None, auto-detects based on provider
            (qwen2.5:7b for ollama, mistral-small-latest for mistral).
        skip_ocr: If True, reuses existing markdown file to avoid OCR cost.
            Requires output_dir/<doc_name>/<doc_name>.md to exist.
        ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
        progress_callback: Optional callback for real-time progress updates.
            Called with (step_id, status, detail) for each pipeline step.

    Returns:
        Dictionary containing processing results with the following keys:
        - success (bool): True if processing completed without errors
        - document_name (str): Name of the processed document
        - pages (int): Number of pages in the PDF
        - chunks_count (int): Number of chunks generated
        - cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
        - cost_llm (float): LLM API cost in euros (0 if provider=ollama)
        - cost_total (float): Total cost (ocr + llm)
        - metadata (dict): Extracted metadata (title, author, etc.)
        - toc (list): Hierarchical table of contents
        - files (dict): Paths to generated files (markdown, chunks, etc.)

    Raises:
        FileNotFoundError: If pdf_path does not exist.
        ValueError: If skip_ocr=True but markdown file not found.
        RuntimeError: If Weaviate connection fails during ingestion.

    Examples:
        Basic usage with Ollama (free):
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="ollama"
        ... )
        >>> print(f"Cost: {result['cost_total']:.4f}€")
        Cost: 0.0270€  # OCR only

        With Mistral API (faster):
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="mistral",
        ...     llm_model="mistral-small-latest"
        ... )

        Skip OCR to avoid cost:
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     skip_ocr=True,  # Reuses existing markdown
        ...     ingest_to_weaviate=False
        ... )

    Notes:
        - OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
        - LLM cost: Free with Ollama, variable with Mistral API
        - Processing time: ~30s/page with Ollama, ~5s/page with Mistral
        - Weaviate must be running (docker-compose up -d) before ingestion
    """
```
</function_docstring>
</documentation_examples>
</project_specification>
@@ -1,290 +0,0 @@
## YOUR ROLE - CODING AGENT (Library RAG - Type Safety & Documentation)

You are working on adding strict type annotations and Google-style docstrings to a Python library project.
This is a FRESH context window - you have no memory of previous sessions.

You have access to Linear for project management via MCP tools. Linear is your single source of truth.

### STEP 1: GET YOUR BEARINGS (MANDATORY)

Start by orienting yourself:

```bash
# 1. See your working directory
pwd

# 2. List files to understand project structure
ls -la

# 3. Read the project specification
cat app_spec.txt

# 4. Read the Linear project state
cat .linear_project.json

# 5. Check recent git history
git log --oneline -20
```
### STEP 2: CHECK LINEAR STATUS

Query Linear to understand current project state using the project_id from `.linear_project.json`.

1. **Get all issues and count progress:**
   ```
   mcp__linear__list_issues with project_id
   ```
   Count:
   - Issues "Done" = completed
   - Issues "Todo" = remaining
   - Issues "In Progress" = currently being worked on

2. **Find META issue** (if exists) for session context

3. **Check for in-progress work** - complete it first if found

### STEP 3: SELECT NEXT ISSUE

Get Todo issues sorted by priority:
```
mcp__linear__list_issues with project_id, status="Todo", limit=5
```

Select ONE highest-priority issue to work on.
### STEP 4: CLAIM THE ISSUE

Use `mcp__linear__update_issue` to set status to "In Progress".

### STEP 5: IMPLEMENT THE ISSUE

Based on issue category:

**For Type Annotation Issues (e.g., "Types - Add type annotations to X.py"):**

1. Read the target Python file
2. Identify all functions, methods, and variables
3. Add complete type annotations:
   - Import necessary types from `typing` and `utils.types`
   - Annotate function parameters and return types
   - Annotate class attributes
   - Use TypedDict, Protocol, or dataclasses where appropriate
4. Save the file
5. Run mypy to verify (MANDATORY):
   ```bash
   cd generations/library_rag
   mypy --config-file=mypy.ini <file_path>
   ```
6. Fix any mypy errors
7. Commit the changes

**For Documentation Issues (e.g., "Docs - Add docstrings to X.py"):**

1. Read the target Python file
2. Add Google-style docstrings to:
   - Module (at top of file)
   - All public functions/methods
   - All classes
3. Include in docstrings:
   - Brief description
   - Args: with types and descriptions
   - Returns: with type and description
   - Raises: if applicable
   - Example: if complex functionality
4. Save the file
5. Optionally run pydocstyle to verify (if installed)
6. Commit the changes

**For Setup/Infrastructure Issues:**

Follow the specific instructions in the issue description.
### STEP 6: VERIFICATION

**Type Annotation Issues:**
- Run mypy on the modified file(s)
- Ensure zero type errors
- If errors exist, fix them before proceeding

**Documentation Issues:**
- Review docstrings for completeness
- Ensure Args/Returns sections match function signatures
- Check that examples are accurate

**Functional Changes (rare):**
- If the issue changes behavior, test manually
- Start Flask server if needed: `python flask_app.py`
- Test the affected functionality

### STEP 7: GIT COMMIT

Make a descriptive commit:
```bash
git add <files>
git commit -m "<Issue ID>: <Short description>

- <List of changes>
- Verified with mypy (for type issues)
- Linear issue: <issue identifier>
"
```
### STEP 8: UPDATE LINEAR ISSUE

1. **Add implementation comment:**
   ```markdown
   ## Implementation Complete

   ### Changes Made
   - [List of files modified]
   - [Key changes]

   ### Verification
   - mypy passes with zero errors (for type issues)
   - All test steps from issue description verified

   ### Git Commit
   [commit hash and message]
   ```

2. **Update status to "Done"** using `mcp__linear__update_issue`

### STEP 9: DECIDE NEXT ACTION

After completing an issue, ask yourself:

1. Have I been working for a while? (Use judgment based on complexity of work done)
2. Is the code in a stable state?
3. Would this be a good handoff point?

**If YES to all three:**
- Proceed to STEP 10 (Session Summary)
- End cleanly

**If NO:**
- Continue to another issue (go back to STEP 3)
- But commit first!

**Pacing Guidelines:**
- Early phase (< 20% done): Can complete multiple simple issues
- Mid/late phase (> 20% done): 1-2 issues per session for quality
### STEP 10: SESSION SUMMARY (When Ending)

If a META issue exists, add a comment:

```markdown
## Session Complete

### Completed This Session
- [Issue ID]: [Title] - [Brief summary]

### Current Progress
- X issues Done
- Y issues In Progress
- Z issues Todo

### Notes for Next Session
- [Important context]
- [Recommendations]
- [Any concerns]
```

Ensure:
- All code committed
- No uncommitted changes
- App in working state

---

## LINEAR WORKFLOW RULES

**Status Transitions:**
- Todo → In Progress (when starting)
- In Progress → Done (when verified)

**NEVER:**
- Delete or modify issue descriptions
- Mark Done without verification
- Leave issues In Progress when switching

---
## TYPE ANNOTATION GUIDELINES

**Imports needed:**
```python
from typing import Optional, Dict, List, Any, Tuple, Callable
from pathlib import Path
from utils.types import <ProjectSpecificTypes>
```

**Common patterns:**
```python
# Functions (avoid shadowing the built-in `input`)
def process_data(data: str, options: Optional[Dict[str, Any]] = None) -> List[str]:
    """Process input data."""
    ...

# Methods with self
def save(self, path: Path) -> None:
    """Save to file."""
    ...

# Async functions
async def fetch_data(url: str) -> Dict[str, Any]:
    """Fetch from API."""
    ...
```

**Use project types from `utils/types.py`:**
- Metadata, OCRResponse, TOCEntry, ChunkData, PipelineResult, etc.

---
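The success criteria also call for NewType for semantic type safety. A minimal sketch, with hypothetical alias names (the project's actual aliases would live in `utils/types.py`):

```python
from typing import NewType

# Both are plain str at runtime, but mypy treats them as distinct types.
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


def delete_chunk(doc: DocumentName, chunk: ChunkId) -> str:
    """Illustrative helper: the parameter order cannot be silently swapped."""
    return f"deleted {chunk} from {doc}"


doc = DocumentName("platon_menon")
chunk = ChunkId("chunk-0001")
# delete_chunk(chunk, doc) would be a mypy error; this call type-checks:
message = delete_chunk(doc, chunk)
```

NewType costs nothing at runtime while making accidental argument swaps a static error.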
## DOCSTRING TEMPLATE (Google Style)

```python
def function_name(param1: str, param2: int = 0) -> List[str]:
    """
    Brief one-line description.

    More detailed description if needed. Explain what the function does,
    any important behavior, side effects, etc.

    Args:
        param1: Description of param1.
        param2: Description of param2. Defaults to 0.

    Returns:
        Description of return value.

    Raises:
        ValueError: When param1 is empty.
        IOError: When file cannot be read.

    Example:
        >>> result = function_name("test", 5)
        >>> print(result)
        ['test', 'test', 'test', 'test', 'test']
    """
```

---
## IMPORTANT REMINDERS

**Your Goal:** Add strict type annotations and comprehensive documentation to all Python modules

**This Session's Goal:** Complete 1-2 issues with quality work and clean handoff

**Quality Bar:**
- mypy --strict passes with zero errors
- All public functions have complete Google-style docstrings
- Code is clean and well-documented

**Context is finite.** End sessions early with good handoff notes. The next agent will continue.

---

Begin by running STEP 1 (Get Your Bearings).