Remove obsolete documentation and backup files
- Remove REMOTE_WEAVIATE_ARCHITECTURE.md (moved to library_rag)
- Remove navette.txt (obsolete notes)
- Remove backup and obsolete app spec files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# Architecture for a remote Weaviate (Synology/VPS)

## Your use case

**Situation**: LLM application (local or cloud) → Weaviate (Synology NAS or remote VPS)

**Requirements**:

- ✅ Maximum reliability
- ✅ Security (private data)
- ✅ Acceptable performance
- ✅ Simple maintenance

---
## 🏆 Recommended option: REST API + secure tunnel

### Overall architecture

```
┌──────────────────────────────────────────────────────────────┐
│                     LLM application                          │
│        (Claude API, OpenAI, local Ollama, etc.)              │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────────────────┐
│              Custom REST API (Flask/FastAPI)                 │
│  - JWT / API-key authentication                              │
│  - Rate limiting                                             │
│  - Logging                                                   │
│  - HTTPS (Let's Encrypt)                                     │
└────────────────────┬─────────────────────────────────────────┘
                     │
                     ▼  (private network or VPN)
┌──────────────────────────────────────────────────────────────┐
│                    Synology NAS / VPS                        │
│  ┌────────────────────────────────────────────────────┐      │
│  │  Docker Compose                                    │      │
│  │  ┌──────────────────┐  ┌──────────────────────┐    │      │
│  │  │  Weaviate :8080  │  │ text2vec-transformers│    │      │
│  │  └──────────────────┘  └──────────────────────┘    │      │
│  └────────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────┘
```
### Why this option?

✅ **Maximum reliability** (5/5)
- HTTP/REST is a standard, battle-tested protocol
- Automatic retries are easy to add
- Clear error handling

✅ **Security** (5/5)
- HTTPS enforced
- API-key authentication
- IP whitelisting possible
- Audit logs

✅ **Performance** (4/5)
- Network latency is unavoidable
- gzip compression possible
- Optional Redis cache

✅ **Maintenance** (5/5)
- Simple code (Flask/FastAPI)
- Easy monitoring
- Standard deployment

---

## Comparing the 4 options
### Option 1: Custom REST API (⭐ RECOMMENDED)

**Architecture**: App → REST API → Weaviate

**Example code**:
```python
# api_server.py (deployed on the VPS/Synology)
import os
from pathlib import Path

from fastapi import FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader
import weaviate

app = FastAPI()
api_key_header = APIKeyHeader(name="X-API-Key")

# Connect to Weaviate (running locally on the same machine)
client = weaviate.connect_to_local()


def verify_api_key(api_key: str = Security(api_key_header)) -> str:
    if api_key != os.getenv("API_KEY"):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key


@app.post("/search")
async def search_chunks(
    query: str,
    limit: int = 10,
    api_key: str = Security(verify_api_key)
):
    collection = client.collections.get("Chunk")
    result = collection.query.near_text(
        query=query,
        limit=limit
    )
    return {"results": [obj.properties for obj in result.objects]}


@app.post("/insert_pdf")
async def insert_pdf(
    pdf_path: str,
    api_key: str = Security(verify_api_key)
):
    # Call the library_rag pipeline
    from utils.pdf_pipeline import process_pdf
    result = process_pdf(Path(pdf_path))
    return result
```
**Deployment**:

```bash
# On the VPS/Synology
docker-compose up -d weaviate text2vec
uvicorn api_server:app --host 0.0.0.0 --port 8000 --ssl-keyfile key.pem --ssl-certfile cert.pem
```
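From the application side, calling this API needs nothing beyond a plain HTTP client. A minimal sketch using only the standard library (the base URL and API key are placeholders; note that because the example server declares `query` and `limit` as plain function parameters, FastAPI expects them in the query string):

```python
import json
import os
import urllib.parse
import urllib.request

# Placeholders: point these at your actual deployment.
API_URL = os.getenv("RAG_API_URL", "https://api.example.com")
API_KEY = os.getenv("RAG_API_KEY", "my-secret-key")


def build_search_request(query: str, limit: int = 10) -> urllib.request.Request:
    """Build the authenticated POST request for the /search endpoint."""
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    return urllib.request.Request(
        f"{API_URL}/search?{params}",
        method="POST",
        headers={"X-API-Key": API_KEY},
    )


def search(query: str, limit: int = 10) -> dict:
    """Run a semantic search against the remote API and decode the JSON body."""
    with urllib.request.urlopen(build_search_request(query, limit)) as resp:
        return json.loads(resp.read())
```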
**Pros**:
- ✅ Full control over the API
- ✅ Easy to secure (HTTPS + API key)
- ✅ Can wrap the whole library_rag pipeline
- ✅ Easy monitoring and logging

**Cons**:
- ⚠️ Custom code to maintain
- ⚠️ Needs a web server (nginx/uvicorn)

---
### Option 2: Direct Weaviate access over a VPN

**Architecture**: App → VPN → Weaviate:8080

**Configuration**:

```bash
# On the Synology: enable VPN Server (OpenVPN/WireGuard)
# On the client: connect to the VPN
# Direct access to http://192.168.x.x:8080 (Synology private IP)
```

**Client code**:

```python
# In your LLM app
import weaviate

# Over the VPN, using the Synology private IP
client = weaviate.connect_to_custom(
    http_host="192.168.1.100",
    http_port=8080,
    http_secure=False,  # inside the VPN, HTTPS is not required
    grpc_host="192.168.1.100",
    grpc_port=50051,
    grpc_secure=False,
)

# Direct usage
collection = client.collections.get("Chunk")
result = collection.query.near_text(query="justice")
```
**Pros**:
- ✅ Very simple (no custom code)
- ✅ Secured by the VPN
- ✅ Uses the Weaviate Python client directly

**Cons**:
- ⚠️ The VPN must stay up at all times
- ⚠️ VPN latency
- ⚠️ No abstraction layer (the app must know about Weaviate)

---

### Option 3: MCP server over HTTP on a VPS

**Architecture**: App → MCP over HTTP → Weaviate

**Problem**: FastMCP SSE does not work well in production (as we saw)

**Solution**: a custom MCP-over-HTTP wrapper
```python
# mcp_http_wrapper.py (on the VPS)
from fastapi import FastAPI
from mcp_tools import SearchChunksInput, parse_pdf_handler, search_chunks_handler
from pydantic import BaseModel

app = FastAPI()


class SearchRequest(BaseModel):
    query: str
    limit: int = 10


@app.post("/mcp/search_chunks")
async def mcp_search(req: SearchRequest):
    # Call the MCP handler directly
    input_data = SearchChunksInput(query=req.query, limit=req.limit)
    result = await search_chunks_handler(input_data)
    return result.model_dump()
```
**Pros**:
- ✅ Reuses the existing MCP code
- ✅ Standard HTTP

**Cons**:
- ⚠️ MCP over stdio cannot be used
- ⚠️ Still needs a custom HTTP wrapper anyway
- ⚠️ Equivalent to Option 1, but more complex

**Verdict**: Option 1 (a plain REST API) is the better choice

---

### Option 4: SSH tunnel + port forwarding

**Architecture**: App → SSH tunnel → localhost:8080 (remote Weaviate)

**Configuration**:
```bash
# On your local machine
ssh -L 8080:localhost:8080 user@synology-ip

# The remote Weaviate is now reachable on localhost:8080
```

**Code**:

```python
# In your app (which thinks Weaviate is local)
client = weaviate.connect_to_local()  # hits localhost:8080 = the SSH tunnel
```
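To make the tunnel scriptable rather than a manual terminal step, the same port forwarding can be driven from Python via `subprocess`. A minimal sketch under the assumption that `ssh` is on the PATH and key-based auth is configured (host name and ports are illustrative):

```python
import subprocess


def build_tunnel_cmd(remote_host: str, local_port: int = 8080,
                     remote_port: int = 8080) -> list[str]:
    """Assemble the ssh argv for a background port-forwarding tunnel.

    -N: run no remote command; -f: fork to the background after auth.
    """
    return [
        "ssh", "-f", "-N",
        "-L", f"{local_port}:localhost:{remote_port}",
        remote_host,
    ]


def open_tunnel(remote_host: str) -> None:
    # Blocks until authentication completes, then ssh forks to the background.
    subprocess.run(build_tunnel_cmd(remote_host), check=True)
```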
**Pros**:
- ✅ SSH-grade security
- ✅ Simple to set up
- ✅ No custom code

**Cons**:
- ⚠️ The tunnel has to stay open
- ⚠️ Not suitable for a cloud app
- ⚠️ SSH latency

---
## 🎯 Recommendations by scenario

### Case 1: Local application (your PC) → Weaviate on Synology/VPS

**Recommendation**: **VPN + direct Weaviate access** (Option 2)

**Why**:
- Simple to set up on Synology (built-in VPN Server)
- No custom code
- Secured by the VPN
- Acceptable performance on a local network/VPN

**Setup**:

1. Synology: enable VPN Server (OpenVPN)
2. Client: connect to the VPN
3. Python: `weaviate.connect_to_custom(http_host="192.168.x.x", ...)`

---

### Case 2: Cloud application (remote server) → Weaviate on Synology/VPS

**Recommendation**: **Custom REST API** (Option 1)

**Why**:
- No VPN required
- Public HTTPS with Let's Encrypt
- Access control via API key
- Rate limiting
- Monitoring

**Setup**:

1. VPS/Synology: Docker Compose (Weaviate + REST API)
2. Domain: api.monrag.com → VPS IP
3. Let's Encrypt: automatic HTTPS
4. Cloud app: calls `https://api.monrag.com/search` with the API key in the `X-API-Key` header

---
### Case 3: Temporary local development → remote Weaviate

**Recommendation**: **SSH tunnel** (Option 4)

**Why**:
- One-line setup
- No permanent configuration
- Perfect for dev/debugging

**Setup**:

```bash
ssh -L 8080:localhost:8080 user@vps
# The remote Weaviate is reachable on localhost:8080
```

---
## 🔧 Recommended VPS deployment

### Full stack

```yaml
# docker-compose.yml (on the VPS)
version: '3.8'

services:
  # Weaviate + embeddings
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.34.4
    ports:
      - "127.0.0.1:8080:8080"  # localhost only (security)
    environment:
      AUTHENTICATION_APIKEY_ENABLED: "true"
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: "my-secret-key"
      # ... other settings
    volumes:
      - weaviate_data:/var/lib/weaviate

  text2vec-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:baai-bge-m3-onnx-latest
    # ... config

  # Custom REST API
  api:
    build: ./api
    ports:
      - "8000:8000"
    environment:
      WEAVIATE_URL: http://weaviate:8080
      API_KEY: ${API_KEY}
      MISTRAL_API_KEY: ${MISTRAL_API_KEY}
    depends_on:
      - weaviate
    restart: always

  # NGINX reverse proxy + HTTPS
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - /etc/letsencrypt:/etc/letsencrypt
    depends_on:
      - api

volumes:
  weaviate_data:
```
### NGINX config

```nginx
# nginx.conf
# The shared-memory zone referenced by limit_req below must be
# declared at http{} level:
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name api.monrag.com;

    ssl_certificate /etc/letsencrypt/live/api.monrag.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.monrag.com/privkey.pem;

    location / {
        proxy_pass http://api:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Rate limiting
        limit_req zone=api_limit burst=10 nodelay;
    }
}
```

---
## 📊 Final comparison

| Criterion | VPN + direct | REST API | SSH tunnel | MCP HTTP |
|-----------|--------------|----------|------------|----------|
| **Reliability** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| **Security** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Simplicity** | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| **Performance** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Maintenance** | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ |
| **Production** | ✅ Yes | ✅ Yes | ❌ No | ⚠️ Possible |

---
## 💡 My final recommendation

### For a Synology (personal/team use)
**VPN + direct Weaviate access** (Option 2)
- Synology ships an excellent built-in VPN Server
- Maximum security
- Simple to maintain

### For a VPS (production/public use)
**Custom REST API** (Option 1)
- Full control
- Public HTTPS
- Scalable
- Complete monitoring

---

## 🚀 Recommended next step

Would you like me to create:

1. **The REST API code** (Flask/FastAPI) with auth + rate limiting?
2. **The complete VPS docker-compose** with nginx + Let's Encrypt?
3. **The Synology VPN installation guide** + client config?

Tell me your exact use case and I will prepare the full solution! 🎯
<project_specification>

<project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>

<overview>
Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
improve code maintainability, enable static type checking with mypy, and provide clear documentation
for all functions, classes, and modules.

The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
semantic chunking, and ingestion into Weaviate vector database. It includes a Flask web interface
for document upload, processing, and semantic search.
</overview>

<technology_stack>
<backend>
<runtime>Python 3.10+</runtime>
<web_framework>Flask 3.0</web_framework>
<vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
<ocr>Mistral OCR API</ocr>
<llm>Ollama (local) or Mistral API</llm>
<type_checking>mypy with strict configuration</type_checking>
</backend>
<infrastructure>
<containerization>Docker Compose (Weaviate + transformers)</containerization>
<dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
</infrastructure>
</technology_stack>
<current_state>
<project_structure>
- flask_app.py: Main Flask application (640 lines)
- schema.py: Weaviate schema definition (383 lines)
- utils/: 16+ modules for PDF processing pipeline
  - pdf_pipeline.py: Main orchestration (879 lines)
  - mistral_client.py: OCR API client
  - ocr_processor.py: OCR processing
  - markdown_builder.py: Markdown generation
  - llm_metadata.py: Metadata extraction via LLM
  - llm_toc.py: Table of contents extraction
  - llm_classifier.py: Section classification
  - llm_chunker.py: Semantic chunking
  - llm_cleaner.py: Chunk cleaning
  - llm_validator.py: Document validation
  - weaviate_ingest.py: Database ingestion
  - hierarchy_parser.py: Document hierarchy parsing
  - image_extractor.py: Image extraction from PDFs
  - toc_extractor*.py: Various TOC extraction methods
- templates/: Jinja2 templates for Flask UI
- tests/utils2/: Minimal test coverage (3 test files)
</project_structure>

<issues>
- Inconsistent type annotations across modules (some have partial types, many have none)
- Missing or incomplete docstrings (no Google-style format)
- No mypy configuration for strict type checking
- Type hints missing on function parameters and return values
- Dict[str, Any] used extensively without proper typing
- No type stubs for complex nested structures
</issues>
</current_state>
<core_features>
<type_annotations>
<strict_typing>
- Add complete type annotations to ALL functions and methods
- Use proper generic types (List, Dict, Optional, Union) from typing module
- Add TypedDict for complex dictionary structures
- Add Protocol types for duck-typed interfaces
- Use Literal types for string constants
- Add ParamSpec and TypeVar where appropriate
- Type all class attributes and instance variables
- Add type annotations to lambda functions where possible
</strict_typing>

<mypy_configuration>
- Create mypy.ini with strict configuration
- Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
- Enable: disallow_untyped_calls, disallow_untyped_decorators
- Enable: warn_return_any, warn_redundant_casts
- Enable: strict_equality, strict_optional
- Set python_version to 3.10
- Configure per-module overrides if needed for gradual migration
</mypy_configuration>
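The configuration bullets above can be sketched as a mypy.ini; the option names are real mypy flags, while the per-module override target is illustrative:

```ini
[mypy]
python_version = 3.10
check_untyped_defs = True
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_untyped_calls = True
disallow_untyped_decorators = True
warn_return_any = True
warn_redundant_casts = True
strict_equality = True
strict_optional = True

; Per-module override for gradual migration (module name illustrative)
[mypy-utils.llm_structurer]
disallow_untyped_defs = False
```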
<type_stubs>
- Create TypedDict definitions for common data structures:
  - OCR response structures
  - Metadata dictionaries
  - TOC entries
  - Chunk objects
  - Weaviate objects
  - Pipeline results
- Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
- Create Protocol types for callback functions
</type_stubs>
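A minimal sketch of what utils/types.py could contain; the TypedDict field names are illustrative assumptions, not the project's actual schema:

```python
from typing import NewType, TypedDict

# Semantic aliases: plain strings at runtime, distinct types for mypy.
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


class TOCEntry(TypedDict):
    """One table-of-contents entry (fields are illustrative)."""
    title: str
    level: int
    page: int


class ChunkData(TypedDict):
    """A chunk ready for Weaviate ingestion (fields are illustrative)."""
    chunk_id: ChunkId
    document: DocumentName
    text: str
    section_path: str
```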
<specific_improvements>
- pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
- flask_app.py: Type all route handlers, request/response types
- schema.py: Type Weaviate configuration objects
- llm_*.py: Type LLM request/response structures
- mistral_client.py: Type API client methods and responses
- weaviate_ingest.py: Type ingestion functions and batch operations
</specific_improvements>
</type_annotations>
<documentation>
<google_style_docstrings>
- Add comprehensive Google-style docstrings to ALL:
  - Module-level docstrings explaining purpose and usage
  - Class docstrings with Attributes section
  - Function/method docstrings with Args, Returns, Raises sections
  - Complex algorithm explanations with Examples section
- Include code examples for public APIs
- Document all exceptions that can be raised
- Add Notes section for important implementation details
- Add See Also section for related functions
</google_style_docstrings>

<module_documentation>
<utils_modules>
- pdf_pipeline.py: Document the 10-step pipeline, each step's purpose
- mistral_client.py: Document OCR API usage, cost calculation
- llm_metadata.py: Document metadata extraction logic
- llm_toc.py: Document TOC extraction strategies
- llm_classifier.py: Document section classification types
- llm_chunker.py: Document semantic vs basic chunking
- llm_cleaner.py: Document cleaning rules and validation
- llm_validator.py: Document validation criteria
- weaviate_ingest.py: Document ingestion process, nested objects
- hierarchy_parser.py: Document hierarchy building algorithm
</utils_modules>

<flask_app>
- Document all routes with request/response examples
- Document SSE (Server-Sent Events) implementation
- Document Weaviate query patterns
- Document upload processing workflow
- Document background job management
</flask_app>

<schema>
- Document Weaviate schema design decisions
- Document each collection's purpose and relationships
- Document nested object structure
- Document vectorization strategy
</schema>
</module_documentation>

<inline_comments>
- Add inline comments for complex logic only (don't over-comment)
- Explain WHY not WHAT (code should be self-documenting)
- Document performance considerations
- Document cost implications (OCR, LLM API calls)
- Document error handling strategies
</inline_comments>
</documentation>
<validation>
<type_checking>
- All modules must pass mypy --strict
- No # type: ignore comments without justification
- CI/CD should run mypy checks
- Type coverage should be 100%
</type_checking>

<documentation_quality>
- All public functions must have docstrings
- All docstrings must follow Google style
- Examples should be executable and tested
- Documentation should be clear and concise
</documentation_quality>
</validation>
</core_features>

<implementation_priority>
<critical_modules>
Priority 1 (Most used, most complex):
1. utils/pdf_pipeline.py - Main orchestration
2. flask_app.py - Web application entry point
3. utils/weaviate_ingest.py - Database operations
4. schema.py - Schema definition

Priority 2 (Core LLM modules):
5. utils/llm_metadata.py
6. utils/llm_toc.py
7. utils/llm_classifier.py
8. utils/llm_chunker.py
9. utils/llm_cleaner.py
10. utils/llm_validator.py

Priority 3 (OCR and parsing):
11. utils/mistral_client.py
12. utils/ocr_processor.py
13. utils/markdown_builder.py
14. utils/hierarchy_parser.py
15. utils/image_extractor.py

Priority 4 (Supporting modules):
16. utils/toc_extractor.py
17. utils/toc_extractor_markdown.py
18. utils/toc_extractor_visual.py
19. utils/llm_structurer.py (legacy)
</critical_modules>
</implementation_priority>
<implementation_steps>
<feature_1>
<title>Setup Type Checking Infrastructure</title>
<description>
Configure mypy with strict settings and create foundational type definitions
</description>
<tasks>
- Create mypy.ini configuration file with strict settings
- Add mypy to requirements.txt or dev dependencies
- Create utils/types.py module for common TypedDict definitions
- Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
- Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
- Create Protocol types for callbacks (ProgressCallback, etc.)
- Document type definitions in utils/types.py module docstring
- Test mypy configuration on a single module to verify settings
</tasks>
<acceptance_criteria>
- mypy.ini exists with strict configuration
- utils/types.py contains all foundational types with docstrings
- mypy runs without errors on utils/types.py
- Type definitions are comprehensive and reusable
</acceptance_criteria>
</feature_1>

<feature_2>
<title>Add Types to PDF Pipeline Orchestration</title>
<description>
Add complete type annotations to pdf_pipeline.py (879 lines, most complex module)
</description>
<tasks>
- Add type annotations to all function signatures in pdf_pipeline.py
- Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Validate, Weaviate
- Type progress_callback parameter with Protocol or Callable
- Add TypedDict for pipeline options dictionary
- Add TypedDict for pipeline result dictionary structure
- Type all helper functions (extract_document_metadata_legacy, etc.)
- Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
- Fix any mypy errors that arise
- Verify mypy --strict passes on pdf_pipeline.py
</tasks>
<acceptance_criteria>
- All functions in pdf_pipeline.py have complete type annotations
- progress_callback is properly typed with Protocol
- All Dict[str, Any] replaced with TypedDict where appropriate
- mypy --strict pdf_pipeline.py passes with zero errors
- No # type: ignore comments (or justified if absolutely necessary)
</acceptance_criteria>
</feature_2>
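Typing progress_callback with a Protocol could look like the following sketch; the callback signature is an illustrative assumption, not the project's actual one:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ProgressCallback(Protocol):
    """Shape of the pipeline progress callback (signature is illustrative)."""
    def __call__(self, step: str, percent: float) -> None: ...


def run_step(callback: ProgressCallback) -> None:
    # A typed caller: mypy verifies that `callback` matches the protocol.
    callback("ocr", 10.0)


messages: list[tuple[str, float]] = []


def log_progress(step: str, percent: float) -> None:
    """A concrete callback that satisfies ProgressCallback structurally."""
    messages.append((step, percent))


run_step(log_progress)
```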
<feature_3>
<title>Add Types to Flask Application</title>
<description>
Add complete type annotations to flask_app.py and type all routes
</description>
<tasks>
- Add type annotations to all Flask route handlers
- Type request.args, request.form, request.files usage
- Type jsonify() return values
- Type get_weaviate_client context manager
- Type get_collection_stats, get_all_chunks, search_chunks functions
- Add TypedDict for Weaviate query results
- Type background job processing functions (run_processing_job)
- Type SSE generator function (upload_progress)
- Add type hints for template rendering
- Verify mypy --strict passes on flask_app.py
</tasks>
<acceptance_criteria>
- All Flask routes have complete type annotations
- Request/response types are clear and documented
- Weaviate query functions are properly typed
- SSE generator is correctly typed
- mypy --strict flask_app.py passes with zero errors
</acceptance_criteria>
</feature_3>

<feature_4>
<title>Add Types to Core LLM Modules</title>
<description>
Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
</description>
<tasks>
- llm_metadata.py: Type extract_metadata function, return structure
- llm_toc.py: Type extract_toc function, TOC hierarchy structure
- llm_classifier.py: Type classify_sections, section types (Literal), validation functions
- llm_chunker.py: Type chunk_section_with_llm, chunk objects
- llm_cleaner.py: Type clean_chunk, is_chunk_valid functions
- llm_validator.py: Type validate_document, validation result structure
- Add TypedDict for LLM request/response structures
- Type provider selection ("ollama" | "mistral" as Literal)
- Type model names with Literal or constants
- Verify mypy --strict passes on all llm_*.py modules
</tasks>
<acceptance_criteria>
- All LLM modules have complete type annotations
- Section types use Literal for type safety
- Provider and model parameters are strongly typed
- LLM request/response structures use TypedDict
- mypy --strict passes on all llm_*.py modules with zero errors
</acceptance_criteria>
</feature_4>
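The Literal-based provider typing from the tasks above could be sketched as follows; the two provider names come from the spec, while the default model names are illustrative placeholders:

```python
from typing import Literal, get_args

# The two providers named in the spec.
Provider = Literal["ollama", "mistral"]


def pick_default_model(provider: Provider) -> str:
    """Return a default model per provider (mapping is illustrative)."""
    defaults: dict[Provider, str] = {
        "ollama": "llama3",
        "mistral": "mistral-small-latest",
    }
    # mypy rejects calls like pick_default_model("openai") at type-check time.
    return defaults[provider]
```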
|
|
||||||
|
|
||||||
<feature_5>
|
|
||||||
<title>Add Types to Weaviate and Database Modules</title>
|
|
||||||
<description>
|
|
||||||
Add complete type annotations to schema.py and weaviate_ingest.py
|
|
||||||
</description>
|
|
||||||
<tasks>
|
|
||||||
- schema.py: Type Weaviate configuration objects
|
|
||||||
- schema.py: Type collection property definitions
|
|
||||||
- weaviate_ingest.py: Type ingest_document function signature
|
|
||||||
- weaviate_ingest.py: Type delete_document_chunks function
|
|
||||||
- weaviate_ingest.py: Add TypedDict for Weaviate object structure
|
|
||||||
- Type batch insertion operations
|
|
||||||
- Type nested object references (work, document)
|
|
||||||
- Add proper error types for Weaviate exceptions
|
|
||||||
- Verify mypy --strict passes on both modules
|
|
||||||
</tasks>
|
|
||||||
<acceptance_criteria>
|
|
||||||
- schema.py has complete type annotations for Weaviate config
|
|
||||||
- weaviate_ingest.py functions are fully typed
|
|
||||||
- Nested object structures use TypedDict
|
|
||||||
- Weaviate client operations are properly typed
|
|
||||||
- mypy --strict passes on both modules with zero errors
|
|
||||||
</acceptance_criteria>
|
|
||||||
</feature_5>
|
|
||||||
|
|
||||||
<feature_6>
<title>Add Types to OCR and Parsing Modules</title>
<description>
Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py
</description>
<tasks>
- mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
- mistral_client.py: Add TypedDict for Mistral API response structures
- ocr_processor.py: Type serialize_ocr_response, OCR object structures
- markdown_builder.py: Type build_markdown, image_writer parameter
- hierarchy_parser.py: Type build_hierarchy, flatten_hierarchy functions
- hierarchy_parser.py: Add TypedDict for hierarchy node structure
- image_extractor.py: Type create_image_writer, image handling
- Verify mypy --strict passes on all modules
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have complete type annotations
- Mistral API structures use TypedDict
- Hierarchy nodes are properly typed
- Image handling functions are typed
- mypy --strict passes on all modules with zero errors
</acceptance_criteria>
</feature_6>

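The "TypedDict for hierarchy node structure" task implies a recursive type, since hierarchy nodes contain child nodes. A minimal sketch under assumed field names (`title`, `level`, `children`) — the real `hierarchy_parser.py` fields may differ:

```python
from __future__ import annotations  # allows the self-referential annotation

from typing import TypedDict


class HierarchyNode(TypedDict):
    """One node of a document hierarchy (hypothetical field names)."""
    title: str
    level: int
    children: list[HierarchyNode]


def flatten(node: HierarchyNode) -> list[str]:
    """Depth-first list of section titles, mirroring flatten_hierarchy."""
    titles = [node["title"]]
    for child in node["children"]:
        titles.extend(flatten(child))
    return titles


toc: HierarchyNode = {
    "title": "Ménon",
    "level": 0,
    "children": [
        {"title": "Introduction", "level": 1, "children": []},
        {"title": "La vertu", "level": 1, "children": []},
    ],
}
```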
<feature_7>
<title>Add Google-Style Docstrings to Core Modules</title>
<description>
Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and weaviate modules
</description>
<tasks>
- pdf_pipeline.py: Add module docstring explaining the V2 pipeline
- pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
- pdf_pipeline.py: Document each of the 10 pipeline steps in comments
- pdf_pipeline.py: Add Examples section showing typical usage
- flask_app.py: Add module docstring explaining Flask application
- flask_app.py: Document all routes with request/response examples
- flask_app.py: Document Weaviate connection management
- schema.py: Add module docstring explaining schema design
- schema.py: Document each collection's purpose and relationships
- weaviate_ingest.py: Document ingestion process with examples
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All core modules have comprehensive module-level docstrings
- All public functions have Google-style docstrings
- Args, Returns, Raises sections are complete and accurate
- Examples are provided for complex functions
- Docstrings explain WHY, not just WHAT
</acceptance_criteria>
</feature_7>

<feature_8>
<title>Add Google-Style Docstrings to LLM Modules</title>
<description>
Add comprehensive Google-style docstrings to all LLM processing modules
</description>
<tasks>
- llm_metadata.py: Document metadata extraction logic with examples
- llm_toc.py: Document TOC extraction strategies and fallbacks
- llm_classifier.py: Document section types and classification criteria
- llm_chunker.py: Document semantic vs basic chunking approaches
- llm_cleaner.py: Document cleaning rules and validation logic
- llm_validator.py: Document validation criteria and corrections
- Add Examples sections showing input/output for each function
- Document LLM provider differences (Ollama vs Mistral)
- Document cost implications in Notes sections
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All LLM modules have comprehensive docstrings
- Each function has Args, Returns, Raises sections
- Examples show realistic input/output
- Provider differences are documented
- Cost implications are noted where relevant
</acceptance_criteria>
</feature_8>

<feature_9>
<title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
<description>
Add comprehensive Google-style docstrings to OCR, markdown, hierarchy, and extraction modules
</description>
<tasks>
- mistral_client.py: Document OCR API usage, cost calculation
- ocr_processor.py: Document OCR response processing
- markdown_builder.py: Document markdown generation strategy
- hierarchy_parser.py: Document hierarchy building algorithm
- image_extractor.py: Document image extraction process
- toc_extractor*.py: Document various TOC extraction methods
- Add Examples sections for complex algorithms
- Document edge cases and error handling
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have comprehensive docstrings
- Complex algorithms are well explained
- Edge cases are documented
- Error handling is documented
- Examples demonstrate typical usage
</acceptance_criteria>
</feature_9>

<feature_10>
<title>Final Validation and CI Integration</title>
<description>
Verify all type annotations and docstrings, integrate mypy into CI/CD
</description>
<tasks>
- Run mypy --strict on entire codebase, verify 100% pass rate
- Verify all public functions have docstrings
- Check docstring formatting with pydocstyle or similar tool
- Create GitHub Actions workflow to run mypy on every commit
- Update README.md with type checking instructions
- Update CLAUDE.md with documentation standards
- Create CONTRIBUTING.md with type annotation and docstring guidelines
- Generate API documentation with Sphinx or pdoc
- Fix any remaining mypy errors or missing docstrings
</tasks>
<acceptance_criteria>
- mypy --strict passes on entire codebase with zero errors
- All public functions have Google-style docstrings
- CI/CD runs mypy checks automatically
- Documentation is generated and accessible
- Contributing guidelines document type/docstring requirements
</acceptance_criteria>
</feature_10>
</implementation_steps>

<success_criteria>
<type_safety>
- 100% type coverage across all modules
- mypy --strict passes with zero errors
- No # type: ignore comments without justification
- All Dict[str, Any] replaced with TypedDict where appropriate
- Proper use of generics, protocols, and type variables
- NewType used for semantic type safety
</type_safety>

<documentation_quality>
- All modules have comprehensive module-level docstrings
- All public functions/classes have Google-style docstrings
- All docstrings include Args, Returns, Raises sections
- Complex functions include Examples sections
- Cost implications documented in Notes sections
- Error handling clearly documented
- Provider differences (Ollama vs Mistral) documented
</documentation_quality>

<code_quality>
- Code is self-documenting with clear variable names
- Inline comments explain WHY, not WHAT
- Complex algorithms are well explained
- Performance considerations documented
- Security considerations documented
</code_quality>

<developer_experience>
- IDE autocomplete works perfectly with type hints
- Type errors caught at development time, not runtime
- Documentation is easily accessible in IDE
- API examples are executable and tested
- Contributing guidelines are clear and comprehensive
</developer_experience>

<maintainability>
- Refactoring is safer with type checking
- Function signatures are self-documenting
- API contracts are explicit and enforced
- Breaking changes are caught by type checker
- New developers can understand code quickly
</maintainability>
</success_criteria>

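The "NewType used for semantic type safety" criterion can be illustrated with a short sketch. `DocumentId` and `ChunkId` are hypothetical names, not identifiers from the codebase:

```python
from typing import NewType

# Distinct semantic types over plain str: mypy treats them as incompatible,
# so a DocumentId cannot be passed where a ChunkId is expected.
DocumentId = NewType("DocumentId", str)
ChunkId = NewType("ChunkId", str)


def delete_chunk(chunk_id: ChunkId) -> str:
    return f"deleted {chunk_id}"


doc_id = DocumentId("platon_menon")
chunk_id = ChunkId(f"{doc_id}-0001")
# delete_chunk(doc_id) would be a mypy error; at runtime both remain plain str.
```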
<constraints>
<compatibility>
- Must maintain backward compatibility with existing code
- Cannot break existing Flask routes or API contracts
- Weaviate schema must remain unchanged
- Existing tests must continue to pass
</compatibility>

<gradual_migration>
- Can use per-module mypy configuration for gradual migration
- Can temporarily disable strict checks on legacy modules
- Priority modules must be completed first
- Low-priority modules can be deferred
</gradual_migration>

<standards>
- All type annotations must use Python 3.10+ syntax
- Docstrings must follow Google style exactly (not NumPy or reStructuredText)
- Use typing module generics (List, Dict, Optional) only where Python 3.9 compatibility must still be maintained
- Use from __future__ import annotations if needed for forward references
</standards>
</constraints>

<testing_strategy>
<type_checking>
- Run mypy --strict on each module after adding types
- Use mypy daemon (dmypy) for faster incremental checking
- Add mypy to pre-commit hooks
- CI/CD must run mypy and fail on type errors
</type_checking>

<documentation_validation>
- Use pydocstyle to validate Google-style format
- Use sphinx-build to generate docs and catch errors
- Manual review of docstring examples
- Verify examples are executable and correct
</documentation_validation>

<integration_testing>
- Verify existing tests still pass after type additions
- Add new tests for complex typed structures
- Test mypy configuration on sample code
- Verify IDE autocomplete works correctly
</integration_testing>
</testing_strategy>

<documentation_examples>
<module_docstring>
```python
"""
PDF Pipeline V2 - Intelligent document processing with LLM enhancement.

This module orchestrates a 10-step pipeline for processing PDF documents:
1. OCR via Mistral API
2. Markdown construction with images
3. Metadata extraction via LLM
4. Table of contents (TOC) extraction
5. Section classification
6. Semantic chunking
7. Chunk cleaning and validation
8. Enrichment with concepts
9. Validation and corrections
10. Ingestion into Weaviate vector database

The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
various processing modes (skip OCR, semantic chunking, OCR annotations).

Typical usage:
    >>> from pathlib import Path
    >>> from utils.pdf_pipeline import process_pdf
    >>>
    >>> result = process_pdf(
    ...     Path("document.pdf"),
    ...     use_llm=True,
    ...     llm_provider="ollama",
    ...     ingest_to_weaviate=True,
    ... )
    >>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")

See Also:
    mistral_client: OCR API client
    llm_metadata: Metadata extraction
    weaviate_ingest: Database ingestion
"""
```
</module_docstring>

<function_docstring>
```python
def process_pdf_v2(
    pdf_path: Path,
    output_dir: Path = Path("output"),
    *,
    use_llm: bool = True,
    llm_provider: Literal["ollama", "mistral"] = "ollama",
    llm_model: Optional[str] = None,
    skip_ocr: bool = False,
    ingest_to_weaviate: bool = True,
    progress_callback: Optional[ProgressCallback] = None,
) -> PipelineResult:
    """
    Process a PDF through the complete V2 pipeline with LLM enhancement.

    This function orchestrates all 10 steps of the intelligent document processing
    pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
    cloud (Mistral API) LLM providers, with optional caching via skip_ocr.

    Args:
        pdf_path: Absolute path to the PDF file to process.
        output_dir: Base directory for output files. Defaults to "./output".
        use_llm: Enable LLM-based processing (metadata, TOC, chunking).
            If False, uses basic heuristic processing.
        llm_provider: LLM provider to use. "ollama" for local (free but slow),
            "mistral" for API (fast but paid).
        llm_model: Specific model name. If None, auto-detects based on provider
            (qwen2.5:7b for ollama, mistral-small-latest for mistral).
        skip_ocr: If True, reuses existing markdown file to avoid OCR cost.
            Requires output_dir/<doc_name>/<doc_name>.md to exist.
        ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
        progress_callback: Optional callback for real-time progress updates.
            Called with (step_id, status, detail) for each pipeline step.

    Returns:
        Dictionary containing processing results with the following keys:
        - success (bool): True if processing completed without errors
        - document_name (str): Name of the processed document
        - pages (int): Number of pages in the PDF
        - chunks_count (int): Number of chunks generated
        - cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
        - cost_llm (float): LLM API cost in euros (0 if provider=ollama)
        - cost_total (float): Total cost (ocr + llm)
        - metadata (dict): Extracted metadata (title, author, etc.)
        - toc (list): Hierarchical table of contents
        - files (dict): Paths to generated files (markdown, chunks, etc.)

    Raises:
        FileNotFoundError: If pdf_path does not exist.
        ValueError: If skip_ocr=True but markdown file not found.
        RuntimeError: If Weaviate connection fails during ingestion.

    Examples:
        Basic usage with Ollama (free):
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="ollama"
        ... )
        >>> print(f"Cost: {result['cost_total']:.4f}€")
        Cost: 0.0270€  # OCR only

        With Mistral API (faster):
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="mistral",
        ...     llm_model="mistral-small-latest"
        ... )

        Skip OCR to avoid cost:
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     skip_ocr=True,  # Reuses existing markdown
        ...     ingest_to_weaviate=False
        ... )

    Notes:
        - OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
        - LLM cost: Free with Ollama, variable with Mistral API
        - Processing time: ~30s/page with Ollama, ~5s/page with Mistral
        - Weaviate must be running (docker-compose up -d) before ingestion
    """
```
</function_docstring>
</documentation_examples>
</project_specification>

@@ -1,490 +0,0 @@
<project_specification>
<project_name>Library RAG - Native Markdown Support</project_name>

<overview>
Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
chunking, and Weaviate vectorization.

This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
for users who have philosophical texts in Markdown format.
</overview>

<technology_stack>
<backend>
<framework>Flask 3.0</framework>
<pipeline>utils/pdf_pipeline.py (to be extended)</pipeline>
<validation>Werkzeug secure_filename</validation>
<llm>Ollama (local) or Mistral API</llm>
<vectorization>Weaviate with BAAI/bge-m3</vectorization>
</backend>
<type_safety>
<type_checker>mypy strict mode</type_checker>
<docstrings>Google-style docstrings required</docstrings>
</type_safety>
</technology_stack>

<core_features>
<feature_1>
<title>Update Flask File Validation</title>
<description>
Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
configuration and file validation logic to support .md files while maintaining backward compatibility
with existing PDF workflows.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
</files_to_modify>
<implementation_details>
- Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
- Update allowed_file() function to accept both extensions
- Update upload.html template to accept .md files in file input
- Update error messages to reflect both formats
</implementation_details>
<test_steps>
1. Start Flask app
2. Navigate to /upload
3. Attempt to upload a .md file
4. Verify file is accepted (no "Format non supporté" error)
5. Verify PDF upload still works
</test_steps>
</feature_1>

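The implementation details above amount to a two-line change; a sketch of the resulting validation helper, following the common Flask-tutorial shape (the exact body in flask_app.py may differ):

```python
ALLOWED_EXTENSIONS = {"pdf", "md"}  # was {"pdf"}


def allowed_file(filename: str) -> bool:
    """Accept a filename whose final extension is in ALLOWED_EXTENSIONS."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
```

The `lower()` call keeps uppercase extensions (`menon.PDF`) working, and filenames without a dot are rejected outright.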
<feature_2>
<title>Add Markdown Detection in Pipeline</title>
<description>
Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
Add logic to automatically skip OCR processing for .md files and copy the Markdown content
directly to the output directory.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
</files_to_modify>
<implementation_details>
- Add file extension detection: `file_ext = pdf_path.suffix.lower()`
- If file_ext == ".md":
  - Skip OCR step entirely (no Mistral API call)
  - Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
  - Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
  - Set nb_pages = md_content.count('\n# ') or 1 (estimate from H1 headers)
  - Set cost_ocr = 0.0
  - Emit progress: "markdown_load" instead of "ocr"
- If file_ext == ".pdf":
  - Continue with existing OCR workflow
- Both paths converge at LLM processing (metadata, TOC, chunking)
</implementation_details>
<test_steps>
1. Create test Markdown file with philosophical content
2. Call process_pdf(Path("test.md"), use_llm=True)
3. Verify OCR is skipped (cost_ocr = 0.0)
4. Verify output/test/test.md is created
5. Verify no _ocr.json file is created
6. Verify LLM processing runs normally
</test_steps>
</feature_2>

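The .md branch described above can be sketched as standalone helpers. This is an illustration of the detection and copy logic under assumed names (`estimate_pages`, `load_markdown`); in the real code this logic would live inline in process_pdf_v2:

```python
from pathlib import Path


def estimate_pages(md_content: str) -> int:
    """Estimate page count from H1 headers, defaulting to 1, as the spec describes."""
    return md_content.count("\n# ") or 1


def load_markdown(md_file: Path, out_dir: Path) -> tuple[str, int, float]:
    """Read a .md source directly and copy it to the output directory.

    Returns (content, nb_pages, cost_ocr); cost_ocr is 0.0 because OCR is skipped.
    """
    md_content = md_file.read_text(encoding="utf-8")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / md_file.name).write_text(md_content, encoding="utf-8")
    return md_content, estimate_pages(md_content), 0.0
```

Note that `count('\n# ')` misses an H1 on the very first line of the file, which is one reason the spec calls it an estimate.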
<feature_3>
<title>Markdown-Specific Progress Callback</title>
<description>
Update the progress callback system to emit appropriate events for Markdown file processing.
Instead of "OCR Mistral en cours...", display "Chargement Markdown..." to provide accurate
user feedback during Server-Sent Events streaming.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (emit_progress calls)
- flask_app.py (process_file_background function)
</files_to_modify>
<implementation_details>
- Add conditional progress messages based on file type
- For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
- For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
- Update frontend to handle "markdown_load" event type
- Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
</implementation_details>
<test_steps>
1. Upload Markdown file via Flask interface
2. Monitor SSE progress stream at /upload/progress/<job_id>
3. Verify first step shows "Chargement du fichier Markdown..."
4. Verify no OCR-related messages appear
5. Verify subsequent steps (metadata, TOC, etc.) work normally
</test_steps>
</feature_3>

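The conditional message selection above reduces to a small pure function, which keeps the emit_progress call sites identical for both formats. A sketch with a hypothetical helper name:

```python
def first_step_event(file_ext: str) -> tuple[str, str]:
    """Pick the SSE step id and French label for the first pipeline step."""
    if file_ext == ".md":
        return "markdown_load", "Chargement du fichier Markdown..."
    return "ocr", "OCR Mistral en cours..."


# At the call site (sketch): emit_progress(step_id, "running", label)
step_id, label = first_step_event(".md")
```

Isolating the choice this way also makes the 9-step vs 10-step numbering easy to unit-test without running the pipeline.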
<feature_4>
<title>Update process_pdf_bytes for Markdown</title>
<description>
Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
This function currently creates a temporary PDF file, but for Markdown uploads,
it should create a temporary .md file instead.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
</files_to_modify>
<implementation_details>
- Detect file type from filename parameter
- If filename ends with .md:
  - Create temp file with suffix=".md"
  - Write file_bytes as UTF-8 text
- If filename ends with .pdf:
  - Existing behavior (suffix=".pdf", binary write)
- Pass temp file path to process_pdf() which now handles both types
</implementation_details>
<test_steps>
1. Create Flask test client
2. POST multipart form with .md file to /upload
3. Verify process_pdf_bytes creates .md temp file
4. Verify temp file contains correct Markdown content
5. Verify cleanup deletes temp file after processing
</test_steps>
</feature_4>

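The temp-file handling above can be sketched as follows; `write_temp_upload` is a hypothetical helper name, and the caller remains responsible for deleting the file after processing (the cleanup checked in step 5):

```python
import tempfile
from pathlib import Path


def write_temp_upload(file_bytes: bytes, filename: str) -> Path:
    """Persist an upload to a temp file whose suffix matches the source type."""
    suffix = ".md" if filename.lower().endswith(".md") else ".pdf"
    # delete=False: the path must outlive this function so the pipeline can read it
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_bytes)  # .md uploads arrive as UTF-8 bytes, written as-is
        return Path(tmp.name)
```

Matching the suffix matters because the pipeline's `pdf_path.suffix.lower()` check (feature_2) is what routes the file past OCR.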
<feature_5>
<title>Add Markdown File Validation</title>
<description>
Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
and basic Markdown structure. Reject files that are too large, contain binary data,
or have no meaningful content.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_create>
- utils/markdown_validator.py
</files_to_create>
<implementation_details>
- Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
- Checks:
  - File size < 10 MB
  - Valid UTF-8 encoding
  - Contains at least one header (#, ##, etc.)
  - Not empty (at least 100 characters)
  - No null bytes or excessive binary content
- Return dict with success, error, and warnings keys
- Call from process_pdf_v2 before processing
- Type annotations and Google-style docstrings required
</implementation_details>
<test_steps>
1. Test with valid Markdown file → passes validation
2. Test with empty file → fails with "File too short"
3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
4. Test with very large file (>10MB) → fails with "File too large"
5. Test with plain text no headers → warning but continues
</test_steps>
</feature_5>

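A minimal sketch of the validator described above, wiring the listed checks to the error strings used in the test steps; the exact messages and ordering in utils/markdown_validator.py may differ:

```python
from pathlib import Path
from typing import Any

MAX_SIZE = 10 * 1024 * 1024  # 10 MB


def validate_markdown_file(file_path: Path) -> dict[str, Any]:
    """Run the size, encoding, length, and structure checks from the spec."""
    result: dict[str, Any] = {"success": True, "error": None, "warnings": []}
    if file_path.stat().st_size > MAX_SIZE:
        return {"success": False, "error": "File too large", "warnings": []}
    raw = file_path.read_bytes()
    if b"\x00" in raw:  # null bytes are a strong binary-content signal
        return {"success": False, "error": "Binary content detected", "warnings": []}
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return {"success": False, "error": "Invalid UTF-8", "warnings": []}
    if len(text) < 100:
        return {"success": False, "error": "File too short", "warnings": []}
    if not any(line.lstrip().startswith("#") for line in text.splitlines()):
        result["warnings"].append("No Markdown headers found")  # warn, don't fail
    return result
```

Note the header check is a warning rather than an error, matching test step 5 ("warning but continues").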
<feature_6>
<title>Update Documentation</title>
<description>
Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
</description>
<priority>3</priority>
<category>documentation</category>
<files_to_modify>
- README.md (add section under "Pipeline de Traitement")
- .claude/CLAUDE.md (update development guidelines)
- templates/upload.html (add help text)
</files_to_modify>
<implementation_details>
- README.md:
  - Add "Support Markdown Natif" section
  - Document accepted formats: PDF, MD
  - Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
  - Add example: process_pdf(Path("document.md"))
- CLAUDE.md:
  - Update "Pipeline de Traitement" section
  - Note conditional OCR step
  - Document markdown_validator.py module
- upload.html:
  - Update file input accept attribute: accept=".pdf,.md"
  - Add help text: "Formats acceptés : PDF, Markdown (.md)"
</implementation_details>
<test_steps>
1. Read README.md markdown support section
2. Verify examples are clear and accurate
3. Check CLAUDE.md developer notes
4. Open /upload in browser
5. Verify help text displays correctly
</test_steps>
</feature_6>

<feature_7>
<title>Add Unit Tests for Markdown Processing</title>
<description>
Create comprehensive unit tests for Markdown file handling to ensure reliability
and prevent regressions. Cover file validation, pipeline processing, and edge cases.
</description>
<priority>2</priority>
<category>testing</category>
<files_to_create>
- tests/utils/test_markdown_validator.py
- tests/utils/test_pdf_pipeline_markdown.py
- tests/fixtures/sample.md
</files_to_create>
<implementation_details>
- test_markdown_validator.py:
  - Test valid Markdown acceptance
  - Test invalid encoding rejection
  - Test file size limits
  - Test empty file rejection
  - Test binary data detection
- test_pdf_pipeline_markdown.py:
  - Test Markdown file processing end-to-end
  - Test OCR skip for .md files
  - Test cost_ocr = 0.0
  - Test LLM processing (metadata, TOC, chunking)
  - Mock Weaviate ingestion
  - Verify output files created correctly
- fixtures/sample.md:
  - Create realistic philosophical text in Markdown
  - Include headers, paragraphs, formatting
  - ~1000 words for realistic testing
</implementation_details>
<test_steps>
1. Run: pytest tests/utils/test_markdown_validator.py -v
2. Verify all validation tests pass
3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
4. Verify end-to-end Markdown processing works
5. Check test coverage: pytest --cov=utils --cov-report=html
</test_steps>
</feature_7>

<feature_8>
<title>Type Safety and Documentation</title>
<description>
Ensure all new code follows strict type safety requirements and includes comprehensive
Google-style docstrings. Run mypy checks and update type definitions as needed.
</description>
<priority>2</priority>
<category>type_safety</category>
<files_to_modify>
- utils/types.py (add Markdown-specific types if needed)
- All modified modules (type annotations)
</files_to_modify>
<implementation_details>
- Add type annotations to all new functions
- Update existing functions that handle both PDF and MD
- Consider adding:
  - FileFormat = Literal["pdf", "md"]
  - MarkdownValidationResult = TypedDict(...)
- Run mypy --strict on all modified files
- Add Google-style docstrings with:
  - Args section documenting all parameters
  - Returns section with structure details
  - Raises section for exceptions
  - Examples section for complex functions
</implementation_details>
<test_steps>
1. Run: mypy utils/pdf_pipeline.py --strict
2. Run: mypy utils/markdown_validator.py --strict
3. Verify no type errors
4. Run: pydocstyle utils/markdown_validator.py --convention=google
5. Verify all docstrings follow Google style
</test_steps>
</feature_8>

<feature_9>
<title>Handle Markdown-Specific Edge Cases</title>
<description>
Address edge cases specific to Markdown processing: front matter (YAML/TOML),
embedded code blocks, special characters, and non-standard Markdown extensions.
</description>
<priority>3</priority>
<category>backend</category>
<files_to_modify>
- utils/markdown_validator.py
- utils/llm_metadata.py (handle front matter)
</files_to_modify>
<implementation_details>
- Front matter handling:
  - Detect YAML/TOML front matter (--- or +++)
  - Extract metadata if present (title, author, date)
  - Pass to LLM or use directly if valid
  - Strip front matter before content processing
- Code block handling:
  - Don't treat code blocks as actual content
  - Preserve them for chunking but don't analyze
- Special characters:
  - Handle Unicode properly (Greek, Latin, French accents)
  - Preserve LaTeX equations in $ or $$
- GitHub Flavored Markdown:
  - Support tables, task lists, strikethrough
  - Convert to standard format if needed
</implementation_details>
<test_steps>
1. Upload Markdown with YAML front matter
2. Verify metadata extracted correctly
3. Upload Markdown with code blocks
4. Verify code not treated as philosophical content
5. Upload Markdown with Greek/Latin text
6. Verify Unicode handled correctly
</test_steps>
</feature_9>

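The front matter steps above can be sketched as a single split function. This minimal version handles only `---`-fenced YAML with flat `key: value` lines (no TOML `+++`, no nested YAML), so it is an assumption-laden illustration rather than a full parser:

```python
import re


def split_front_matter(md_content: str) -> tuple[dict[str, str], str]:
    """Split optional ----fenced front matter from the Markdown body.

    Returns (metadata, body); metadata is empty when no front matter is present.
    """
    match = re.match(r"\A---\n(.*?)\n---\n", md_content, re.DOTALL)
    if not match:
        return {}, md_content
    meta: dict[str, str] = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, md_content[match.end():]
```

The extracted dict can then be compared against or merged with the LLM-extracted metadata, while only the stripped body continues into chunking.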
<feature_10>
<title>Update UI/UX for Markdown Upload</title>
<description>
Enhance the upload interface to clearly communicate Markdown support and provide
visual feedback about the file type being processed. Show format-specific information
(e.g., "No OCR cost for Markdown files").
</description>
<priority>3</priority>
<category>frontend</category>
<files_to_modify>
- templates/upload.html
- templates/upload_progress.html
</files_to_modify>
<implementation_details>
- upload.html:
  - Add file type indicator icon (📄 PDF vs 📝 MD)
  - Show format-specific help text on hover
  - Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
  - Add example Markdown file download link
- upload_progress.html:
  - Show different icon for Markdown processing
  - Adjust progress bar (9 steps vs 10 steps)
  - Display "No OCR cost" badge for Markdown
  - Update step descriptions based on file type
</implementation_details>
<test_steps>
1. Open /upload page
2. Verify help text mentions both PDF and MD
3. Select a .md file
4. Verify file type indicator shows 📝
5. Submit upload
6. Verify progress shows "Chargement Markdown..."
7. Verify "No OCR cost" badge displays
</test_steps>
</feature_10>
</core_features>

<implementation_steps>
|
|
||||||
<step number="1">
|
|
||||||
<title>Setup and Configuration</title>
|
|
||||||
<tasks>
|
|
||||||
- Update ALLOWED_EXTENSIONS in flask_app.py
|
|
||||||
- Modify allowed_file() validation function
|
|
||||||
- Update upload.html file input accept attribute
|
|
||||||
- Add Markdown MIME type handling
|
|
||||||
</tasks>
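The first two tasks amount to a small change; a sketch under the assumption that `ALLOWED_EXTENSIONS` and `allowed_file()` keep the shapes the spec names in flask_app.py:

```python
# Extensions the upload endpoint accepts; "md" is the new addition.
ALLOWED_EXTENSIONS = {"pdf", "md"}

def allowed_file(filename: str) -> bool:
    """Accept only files whose extension is in ALLOWED_EXTENSIONS."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
```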
</step>

<step number="2">
<title>Core Pipeline Extension</title>
<tasks>
- Add file extension detection in process_pdf_v2()
- Implement Markdown file reading logic
- Skip OCR for .md files
- Add conditional progress callbacks
- Update process_pdf_bytes() for Markdown
</tasks>
</step>

<step number="3">
<title>Validation and Error Handling</title>
<tasks>
- Create markdown_validator.py module
- Implement UTF-8 encoding validation
- Add file size limits
- Handle front matter extraction
- Add comprehensive error messages
</tasks>
</step>

<step number="4">
<title>Testing Infrastructure</title>
<tasks>
- Create test fixtures (sample.md)
- Write validation tests
- Write pipeline integration tests
- Add edge case tests
- Verify mypy strict compliance
</tasks>
</step>

<step number="5">
<title>Documentation and Polish</title>
<tasks>
- Update README.md with Markdown support
- Update .claude/CLAUDE.md developer docs
- Add Google-style docstrings
- Update UI templates with new messaging
- Create usage examples
</tasks>
</step>
</implementation_steps>

<success_criteria>
<functionality>
- Markdown files upload successfully via Flask
- OCR is skipped for .md files (cost_ocr = 0.0)
- LLM processing works identically for PDF and MD
- Chunks are created and vectorized correctly
- Both file types can be searched in Weaviate
- Existing PDF workflow remains unchanged
</functionality>

<type_safety>
- All code passes mypy --strict
- All functions have type annotations
- Google-style docstrings on all modules
- No Any types without justification
- TypedDict definitions for new data structures
</type_safety>

<testing>
- Unit tests cover Markdown validation
- Integration tests verify end-to-end processing
- Edge cases handled (front matter, Unicode, large files)
- Test coverage >80% for new code
- All tests pass in CI/CD pipeline
</testing>

<user_experience>
- Upload interface clearly shows both formats supported
- Progress feedback accurate for both PDF and MD
- Cost savings clearly communicated ("0€ for Markdown")
- Error messages helpful and specific
- Documentation clear with examples
</user_experience>

<performance>
- Markdown processing faster than PDF (no OCR)
- No regression in PDF processing speed
- Memory usage reasonable for large MD files
- Validation completes in <100ms
- Overall pipeline <30s for typical Markdown document
</performance>
</success_criteria>

<technical_notes>
<cost_comparison>
- PDF processing: OCR ~0.003€/page + LLM variable
- Markdown processing: 0€ OCR + LLM variable
- Estimated savings: 50-70% for documents with Markdown source
</cost_comparison>
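To make the figures concrete (0.003€/page is the spec's own estimate; the 300-page document is an arbitrary example):

```python
OCR_COST_PER_PAGE = 0.003  # EUR, per the estimate above

def ocr_cost(pages: int, is_markdown: bool) -> float:
    """OCR cost for a document; Markdown sources always cost 0€."""
    return 0.0 if is_markdown else pages * OCR_COST_PER_PAGE

# A 300-page PDF costs roughly 0.90€ in OCR alone; its Markdown source costs nothing.
```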

<compatibility>
- Maintains backward compatibility with existing PDFs
- No breaking changes to API or database schema
- Existing chunks and documents unaffected
- Can process both formats in same session
</compatibility>

<future_enhancements>
- Support for .txt plain text files
- Support for .docx Word documents (via pandoc)
- Support for .epub ebooks
- Batch upload of multiple Markdown files
- Markdown to PDF export for archival
</future_enhancements>
</technical_notes>
</project_specification>
@@ -1,498 +0,0 @@
<project_specification>
<project_name>ikario - Tavily MCP Integration for Internet Access</project_name>

<overview>
This specification adds Tavily search capabilities via MCP (Model Context Protocol) to give Ikario
internet access for real-time web searches. Tavily provides high-quality search results optimized
for AI agents, making it ideal for research, fact-checking, and accessing current information.

This integration adds a new MCP server connection to the existing architecture (alongside the
ikario-memory MCP server) and exposes Tavily search tools to Ikario during conversations.

All changes are additive and backward-compatible. Existing functionality remains unchanged.
</overview>

<architecture_design>
<mcp_integration>
Tavily MCP Server Connection:
- Uses @modelcontextprotocol/sdk Client to connect to Tavily MCP server
- Connection can be stdio-based (local MCP server) or HTTP-based (remote)
- Tavily MCP server provides search tools that are exposed to Claude via Tool Use API
- Backend routes handle tool execution and return results to Claude
</mcp_integration>

<benefits>
- Real-time internet access for Ikario
- High-quality search results optimized for LLMs
- Fact-checking and verification capabilities
- Access to current events and news
- Research assistance with cited sources
- Seamless integration with existing memory tools
</benefits>
</architecture_design>

<technology_stack>
<mcp_server>
<name>Tavily MCP Server</name>
<protocol>Model Context Protocol (MCP)</protocol>
<connection>stdio or HTTP transport</connection>
<sdk>@modelcontextprotocol/sdk</sdk>
<api_key>Tavily API key (from https://tavily.com)</api_key>
</mcp_server>
<backend>
<runtime>Node.js with Express (existing)</runtime>
<mcp_client>MCP Client for Tavily server connection</mcp_client>
<tool_executor>Existing toolExecutor service extended with Tavily tools</tool_executor>
</backend>
<api_endpoints>
<tavily_routes>GET/POST /api/tavily/* for Tavily-specific operations</tavily_routes>
<existing_routes>Existing /api/claude/chat routes support Tavily tools automatically</existing_routes>
</api_endpoints>
</technology_stack>

<prerequisites>
<environment_setup>
- Tavily API key obtained from https://tavily.com (free tier available)
- API key stored in environment variable TAVILY_API_KEY or configuration file
- MCP SDK already installed (@modelcontextprotocol/sdk exists for ikario-memory)
- Tavily MCP server installed (npm package or Python package)
</environment_setup>
<configuration>
- Add Tavily MCP server config to server/.claude_settings.json or similar
- Configure connection parameters (stdio vs HTTP)
- Set API key securely
</configuration>
</prerequisites>

<core_features>
<feature_1>
<title>Tavily MCP Client Setup</title>
<description>
Create MCP client connection to Tavily search server. This is similar to the existing
ikario-memory MCP client but connects to Tavily instead.

Implementation:
- Create server/services/tavilyMcpClient.js
- Initialize MCP client with Tavily server connection
- Handle connection lifecycle (connect, disconnect, reconnect)
- Implement health checks and connection status
- Export client instance and helper functions

Configuration:
- Read Tavily API key from environment or config file
- Configure transport (stdio or HTTP)
- Set connection timeout and retry logic
- Log connection status for debugging

Error Handling:
- Graceful degradation if Tavily is unavailable
- Connection retry with exponential backoff
- Clear error messages for configuration issues
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Verify MCP client can connect to Tavily server on startup
2. Test connection health check endpoint returns correct status
3. Verify graceful handling when Tavily API key is missing
4. Test reconnection logic when connection drops
5. Verify connection status is logged correctly
6. Test that server starts even if Tavily is unavailable
</test_steps>
</feature_1>

<feature_2>
<title>Tavily Tool Configuration</title>
<description>
Configure Tavily search tools to be available to Claude during conversations.
This integrates with the existing tool system (like memory tools).

Implementation:
- Create server/config/tavilyTools.js
- Define tool schemas for Tavily search capabilities
- Integrate with existing toolExecutor service
- Add Tavily tools to system prompt alongside memory tools

Tavily Tools to Expose:
- tavily_search: General web search with AI-optimized results
  - Parameters: query (string), max_results (number), search_depth (basic/advanced)
  - Returns: Array of search results with title, url, content, score

- tavily_search_news: News-specific search for current events
  - Parameters: query (string), max_results (number), days (number)
  - Returns: Recent news articles with metadata

Tool Schema:
- Follow Claude Tool Use API format
- Clear descriptions for each tool
- Well-defined input schemas with validation
- Proper error handling in tool execution
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Verify Tavily tools are listed in available tools
2. Test tool schema validation with valid inputs
3. Test tool schema validation rejects invalid inputs
4. Verify tools appear in Claude's system prompt
5. Test that tool descriptions are clear and accurate
6. Verify tools can be called without errors
</test_steps>
</feature_2>

<feature_3>
<title>Tavily Tool Executor Integration</title>
<description>
Integrate Tavily tools into the existing toolExecutor service so Claude can
use them during conversations.

Implementation:
- Extend server/services/toolExecutor.js to handle Tavily tools
- Add tool detection for tavily_search and tavily_search_news
- Implement tool execution logic using Tavily MCP client
- Format Tavily results for Claude consumption
- Handle errors and timeouts gracefully

Tool Execution Flow:
1. Claude requests tool use (e.g., tavily_search)
2. toolExecutor detects Tavily tool request
3. Call Tavily MCP client with tool parameters
4. Receive and format search results
5. Return formatted results to Claude
6. Claude incorporates results into response

Result Formatting:
- Convert Tavily results to Claude-friendly format
- Include source URLs for citation
- Add relevance scores
- Truncate content if too long
- Handle empty results gracefully
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Test tavily_search tool execution with valid query
2. Verify results are properly formatted
3. Test tavily_search_news tool execution
4. Verify error handling when Tavily API fails
5. Test timeout handling for slow searches
6. Verify results include proper citations and URLs
7. Test with empty search results
8. Test with very long search queries
</test_steps>
</feature_3>

<feature_4>
<title>System Prompt Enhancement for Internet Access</title>
<description>
Update the system prompt to inform Ikario about internet access capabilities.
This should be added alongside existing memory tools instructions.

Implementation:
- Update MEMORY_SYSTEM_PROMPT in server/routes/messages.js and claude.js
- Add Tavily tools documentation
- Provide usage guidelines for when to search the internet
- Include examples of good search queries

Prompt Addition:
"## Internet Access via Tavily

Tu as accès à internet en temps réel via deux outils de recherche :

1. tavily_search : Recherche web générale optimisée pour l'IA
   - Utilise pour : rechercher des informations actuelles, vérifier des faits,
     trouver des sources fiables
   - Paramètres : query (ta question), max_results (nombre de résultats, défaut: 5),
     search_depth ('basic' ou 'advanced')
   - Retourne : Résultats avec titre, URL, contenu et score de pertinence

2. tavily_search_news : Recherche d'actualités récentes
   - Utilise pour : événements actuels, nouvelles, actualités
   - Paramètres : query, max_results, days (nombre de jours en arrière, défaut: 7)

Quand utiliser la recherche internet :
- Quand l'utilisateur demande des informations récentes ou actuelles
- Pour vérifier des faits ou données que tu n'es pas sûr de connaître
- Quand ta base de connaissances est trop ancienne (après janvier 2025)
- Pour trouver des sources et citations spécifiques
- Pour des requêtes nécessitant des données en temps réel

N'utilise PAS la recherche pour :
- Des questions sur ta propre identité ou capacités
- Des concepts généraux que tu connais déjà bien
- Des questions purement créatives ou d'opinion

Utilise ces outils de façon autonome selon les besoins de la conversation.
Cite toujours tes sources quand tu utilises des informations de Tavily."
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Verify system prompt includes Tavily instructions
2. Test that Claude understands when to use Tavily search
3. Verify Claude cites sources from Tavily results
4. Test that Claude uses appropriate search queries
5. Verify Claude chooses between tavily_search and tavily_search_news correctly
6. Test that Claude doesn't overuse search for simple questions
</test_steps>
</feature_4>

<feature_5>
<title>Tavily Status API Endpoint</title>
<description>
Create API endpoint to check Tavily MCP connection status and search capabilities.
Similar to the /api/memory/status endpoint.

Implementation:
- Create GET /api/tavily/status endpoint
- Return connection status, available tools, and configuration
- Create GET /api/tavily/health endpoint for health checks
- Add Tavily status to existing /api/memory/stats (rename to /api/tools/stats)

Response Format:
{
  "success": true,
  "data": {
    "connected": true,
    "message": "Tavily MCP server is connected",
    "tools": ["tavily_search", "tavily_search_news"],
    "apiKeyConfigured": true,
    "transport": "stdio"
  }
}
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Test GET /api/tavily/status returns correct status
2. Verify status shows "connected" when Tavily is available
3. Verify status shows "disconnected" when Tavily is unavailable
4. Test health endpoint returns proper status code
5. Verify tools list is accurate
6. Test with missing API key shows proper error
</test_steps>
</feature_5>

<feature_6>
<title>Frontend UI Indicator for Internet Access</title>
<description>
Add visual indicator in the UI to show when Ikario has internet access via Tavily.
This can be displayed alongside the existing memory status indicator.

Implementation:
- Add Tavily status indicator in header or sidebar
- Show online/offline status for Tavily connection
- Optional: Show when Tavily is being used during a conversation
- Optional: Add tooltip explaining internet access capabilities

Visual Design:
- Globe or wifi icon to represent internet access
- Green when connected, gray when disconnected
- Subtle animation when search is in progress
- Tooltip: "Internet access via Tavily" or similar

Integration:
- Use existing useMemory hook pattern or create useTavily hook
- Poll /api/tavily/status periodically (every 60s)
- Update status in real-time during searches
</description>
<priority>3</priority>
<category>frontend</category>
<test_steps>
1. Verify internet access indicator appears in UI
2. Test status updates when Tavily connects/disconnects
3. Verify tooltip shows correct information
4. Test that indicator shows activity during searches
5. Verify status polling doesn't impact performance
6. Test with Tavily disabled shows offline status
</test_steps>
</feature_6>

<feature_7>
<title>Manual Search UI (Optional Enhancement)</title>
<description>
Optional: Add manual search interface to allow users to trigger Tavily searches directly,
similar to the memory search panel.

Implementation:
- Add "Internet Search" panel in sidebar (alongside Memory panel)
- Search input for manual Tavily queries
- Display search results with title, snippet, URL
- Click to insert results into conversation
- Filter by search type (general vs news)

This is OPTIONAL and lower priority. The primary use case is autonomous search by Claude.
</description>
<priority>4</priority>
<category>frontend</category>
<test_steps>
1. Verify search panel appears in sidebar
2. Test manual search returns results
3. Verify results display properly with links
4. Test inserting results into conversation
5. Test news search filter works correctly
6. Verify search history is saved (optional)
</test_steps>
</feature_7>

<feature_8>
<title>Configuration and Settings</title>
<description>
Add Tavily configuration options to settings and environment.

Implementation:
- Add TAVILY_API_KEY to environment variables
- Add Tavily settings to .claude_settings.json or similar config file
- Create server/config/tavilyConfig.js for configuration management
- Document configuration options in README

Configuration Options:
- API key
- Max results per search (default: 5)
- Search depth (basic/advanced)
- Timeout duration
- Enable/disable Tavily globally
- Rate limiting settings

Security:
- API key should NOT be exposed to frontend
- Use environment variable or secure config file
- Validate API key on startup
- Log warnings if API key is missing
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Verify API key is read from environment variable
2. Test fallback to config file if env var not set
3. Verify API key validation on startup
4. Test configuration options are applied correctly
5. Verify API key is never exposed in API responses
6. Test enabling/disabling Tavily via config
</test_steps>
</feature_8>

<feature_9>
<title>Error Handling and Rate Limiting</title>
<description>
Implement robust error handling and rate limiting for Tavily API calls.

Implementation:
- Detect and handle Tavily API errors (rate limits, invalid API key, etc.)
- Implement client-side rate limiting to avoid hitting Tavily limits
- Cache search results for duplicate queries (optional)
- Provide clear error messages to Claude when searches fail

Error Types:
- 401: Invalid API key
- 429: Rate limit exceeded
- 500: Tavily server error
- Timeout: Search took too long
- Network: Connection failed

Rate Limiting:
- Track searches per minute/hour
- Queue requests if limit reached
- Return cached results for duplicate queries within 5 minutes
- Log rate limit warnings
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Test error handling for invalid API key
2. Verify rate limit detection and handling
3. Test timeout handling for slow searches
4. Verify error messages are clear to Claude
5. Test rate limiting prevents API abuse
6. Verify caching works for duplicate queries
</test_steps>
</feature_9>

<feature_10>
<title>Documentation and README Updates</title>
<description>
Update project documentation to explain Tavily integration.

Implementation:
- Update main README.md with Tavily setup instructions
- Add TAVILY_SETUP.md with detailed configuration guide
- Document API endpoints in README
- Add examples of using Tavily with Ikario
- Document troubleshooting steps

Documentation Sections:
- Prerequisites (Tavily API key)
- Installation steps
- Configuration options
- Testing Tavily connection
- Example conversations using internet search
- Troubleshooting common issues
- API reference for Tavily endpoints
</description>
<priority>3</priority>
<category>documentation</category>
<test_steps>
1. Verify README has Tavily setup section
2. Test that setup instructions are clear and complete
3. Verify all configuration options are documented
4. Test examples work as described
5. Verify troubleshooting section covers common issues
</test_steps>
</feature_10>
</core_features>

<implementation_notes>
<order>
Recommended implementation order:
1. Feature 1 (MCP Client Setup) - Foundation
2. Feature 2 (Tool Configuration) - Core functionality
3. Feature 3 (Tool Executor Integration) - Core functionality
4. Feature 8 (Configuration) - Required for testing
5. Feature 4 (System Prompt) - Makes tools accessible to Claude
6. Feature 9 (Error Handling) - Production readiness
7. Feature 5 (Status API) - Monitoring
8. Feature 10 (Documentation) - User onboarding
9. Feature 6 (UI Indicator) - Nice to have
10. Feature 7 (Manual Search UI) - Optional enhancement
</order>

<testing>
After implementing features 1-5, you should be able to:
- Ask Ikario: "Quelle est l'actualité aujourd'hui ?"
- Ask Ikario: "Recherche des informations sur [topic actuel]"
- Ask Ikario: "Vérifie cette information : [claim]"

Ikario should autonomously use Tavily search and cite sources.
</testing>

<compatibility>
- This specification is fully compatible with existing ikario-memory MCP integration
- Ikario will have both memory tools AND internet search tools
- Tools can be used together in the same conversation
- No conflicts expected between tool systems
</compatibility>
</implementation_notes>

<safety_requirements>
<critical>
- DO NOT expose Tavily API key to frontend or in API responses
- DO NOT modify existing MCP memory integration
- DO NOT break existing conversation functionality
- Tavily should gracefully degrade if unavailable (don't crash the app)
- Implement proper rate limiting to avoid API abuse
- Validate all user inputs before passing to Tavily
- Sanitize search results before displaying (XSS prevention)
- Log all Tavily API calls for monitoring and debugging
</critical>
</safety_requirements>

<success_metrics>
- Ikario can successfully perform internet searches when asked
- Search results are relevant and well-formatted
- Sources are properly cited
- Tavily integration doesn't slow down conversations
- Error handling is robust and user-friendly
- Configuration is straightforward
- Documentation is clear and complete
</success_metrics>
</project_specification>
@@ -1,679 +0,0 @@
<project_specification>
|
|
||||||
<project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>
|
|
||||||
|
|
||||||
<overview>
|
|
||||||
Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
|
|
||||||
strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
|
|
||||||
improve code maintainability, enable static type checking with mypy, and provide clear documentation
|
|
||||||
for all functions, classes, and modules.
|
|
||||||
|
|
||||||
The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
|
|
||||||
semantic chunking, and ingestion into Weaviate vector database. It includes a Flask web interface
|
|
||||||
for document upload, processing, and semantic search.
|
|
||||||
</overview>
|
|
||||||
|
|
||||||
<technology_stack>
|
|
||||||
<backend>
|
|
||||||
<runtime>Python 3.10+</runtime>
|
|
||||||
<web_framework>Flask 3.0</web_framework>
|
|
||||||
<vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
|
|
||||||
<ocr>Mistral OCR API</ocr>
|
|
||||||
<llm>Ollama (local) or Mistral API</llm>
|
|
||||||
<type_checking>mypy with strict configuration</type_checking>
|
|
||||||
</backend>
|
|
||||||
<infrastructure>
|
|
||||||
<containerization>Docker Compose (Weaviate + transformers)</containerization>
|
|
||||||
<dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
|
|
||||||
</infrastructure>
|
|
||||||
</technology_stack>
|
|
||||||
|
|
||||||
<current_state>
<project_structure>
- flask_app.py: Main Flask application (640 lines)
- schema.py: Weaviate schema definition (383 lines)
- utils/: 16+ modules for PDF processing pipeline
  - pdf_pipeline.py: Main orchestration (879 lines)
  - mistral_client.py: OCR API client
  - ocr_processor.py: OCR processing
  - markdown_builder.py: Markdown generation
  - llm_metadata.py: Metadata extraction via LLM
  - llm_toc.py: Table of contents extraction
  - llm_classifier.py: Section classification
  - llm_chunker.py: Semantic chunking
  - llm_cleaner.py: Chunk cleaning
  - llm_validator.py: Document validation
  - weaviate_ingest.py: Database ingestion
  - hierarchy_parser.py: Document hierarchy parsing
  - image_extractor.py: Image extraction from PDFs
  - toc_extractor*.py: Various TOC extraction methods
- templates/: Jinja2 templates for Flask UI
- tests/utils2/: Minimal test coverage (3 test files)
</project_structure>

<issues>
- Inconsistent type annotations across modules (some have partial types, many have none)
- Missing or incomplete docstrings (no Google-style format)
- No mypy configuration for strict type checking
- Type hints missing on function parameters and return values
- Dict[str, Any] used extensively without proper typing
- No type stubs for complex nested structures
</issues>
</current_state>

<core_features>
<type_annotations>
<strict_typing>
- Add complete type annotations to ALL functions and methods
- Use proper generic types (List, Dict, Optional, Union) from the typing module
- Add TypedDict for complex dictionary structures
- Add Protocol types for duck-typed interfaces
- Use Literal types for string constants
- Add ParamSpec and TypeVar where appropriate
- Type all class attributes and instance variables
- Add type annotations to lambda functions where possible
</strict_typing>

<mypy_configuration>
- Create mypy.ini with strict configuration
- Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
- Enable: disallow_untyped_calls, disallow_untyped_decorators
- Enable: warn_return_any, warn_redundant_casts
- Enable: strict_equality, strict_optional
- Set python_version to 3.10
- Configure per-module overrides if needed for gradual migration
</mypy_configuration>
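Pulled together, the flags above would correspond to roughly this mypy.ini (a sketch; the per-module override section name is illustrative):

```ini
[mypy]
python_version = 3.10
check_untyped_defs = True
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_untyped_calls = True
disallow_untyped_decorators = True
warn_return_any = True
warn_redundant_casts = True
strict_equality = True
strict_optional = True

; Example per-module override for gradual migration (module name is hypothetical)
[mypy-utils.llm_structurer]
disallow_untyped_defs = False
```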

<type_stubs>
- Create TypedDict definitions for common data structures:
  - OCR response structures
  - Metadata dictionaries
  - TOC entries
  - Chunk objects
  - Weaviate objects
  - Pipeline results
- Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
- Create Protocol types for callback functions
</type_stubs>
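A minimal sketch of what such a types module could look like; all field names and the `TOCEntry`/`ChunkData` shapes here are assumptions for illustration, not the project's actual definitions:

```python
"""Illustrative foundational types (hypothetical shapes, not the real utils/types.py)."""
from typing import NewType, Optional, Protocol, TypedDict

# NewType gives semantic type safety: a DocumentName is not interchangeable
# with an arbitrary str as far as the type checker is concerned.
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


class TOCEntry(TypedDict):
    """One entry of the extracted table of contents."""
    title: str
    level: int
    page: Optional[int]


class ChunkData(TypedDict):
    """A cleaned chunk ready for ingestion."""
    chunk_id: ChunkId
    text: str
    section_path: str


class ProgressCallback(Protocol):
    """Structural type: any callable with this signature is accepted."""
    def __call__(self, step_id: str, status: str, detail: str) -> None: ...


entry: TOCEntry = {"title": "Introduction", "level": 1, "page": 3}
print(entry["title"])
```

With TypedDict, mypy flags a misspelled or missing key at the call site, which plain `Dict[str, Any]` never does.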

<specific_improvements>
- pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
- flask_app.py: Type all route handlers, request/response types
- schema.py: Type Weaviate configuration objects
- llm_*.py: Type LLM request/response structures
- mistral_client.py: Type API client methods and responses
- weaviate_ingest.py: Type ingestion functions and batch operations
</specific_improvements>
</type_annotations>

<documentation>
<google_style_docstrings>
- Add comprehensive Google-style docstrings to ALL:
  - Module-level docstrings explaining purpose and usage
  - Class docstrings with Attributes section
  - Function/method docstrings with Args, Returns, Raises sections
  - Complex algorithm explanations with Examples section
- Include code examples for public APIs
- Document all exceptions that can be raised
- Add Notes section for important implementation details
- Add See Also section for related functions
</google_style_docstrings>

<module_documentation>
<utils_modules>
- pdf_pipeline.py: Document the 10-step pipeline and each step's purpose
- mistral_client.py: Document OCR API usage and cost calculation
- llm_metadata.py: Document metadata extraction logic
- llm_toc.py: Document TOC extraction strategies
- llm_classifier.py: Document section classification types
- llm_chunker.py: Document semantic vs. basic chunking
- llm_cleaner.py: Document cleaning rules and validation
- llm_validator.py: Document validation criteria
- weaviate_ingest.py: Document the ingestion process and nested objects
- hierarchy_parser.py: Document the hierarchy-building algorithm
</utils_modules>

<flask_app>
- Document all routes with request/response examples
- Document the SSE (Server-Sent Events) implementation
- Document Weaviate query patterns
- Document the upload processing workflow
- Document background job management
</flask_app>

<schema>
- Document Weaviate schema design decisions
- Document each collection's purpose and relationships
- Document the nested object structure
- Document the vectorization strategy
</schema>
</module_documentation>

<inline_comments>
- Add inline comments for complex logic only (don't over-comment)
- Explain WHY, not WHAT (code should be self-documenting)
- Document performance considerations
- Document cost implications (OCR, LLM API calls)
- Document error-handling strategies
</inline_comments>
</documentation>

<validation>
<type_checking>
- All modules must pass mypy --strict
- No # type: ignore comments without justification
- CI/CD should run mypy checks
- Type coverage should be 100%
</type_checking>

<documentation_quality>
- All public functions must have docstrings
- All docstrings must follow Google style
- Examples should be executable and tested
- Documentation should be clear and concise
</documentation_quality>
</validation>
</core_features>

<implementation_priority>
<critical_modules>
Priority 1 (Most used, most complex):
1. utils/pdf_pipeline.py - Main orchestration
2. flask_app.py - Web application entry point
3. utils/weaviate_ingest.py - Database operations
4. schema.py - Schema definition

Priority 2 (Core LLM modules):
5. utils/llm_metadata.py
6. utils/llm_toc.py
7. utils/llm_classifier.py
8. utils/llm_chunker.py
9. utils/llm_cleaner.py
10. utils/llm_validator.py

Priority 3 (OCR and parsing):
11. utils/mistral_client.py
12. utils/ocr_processor.py
13. utils/markdown_builder.py
14. utils/hierarchy_parser.py
15. utils/image_extractor.py

Priority 4 (Supporting modules):
16. utils/toc_extractor.py
17. utils/toc_extractor_markdown.py
18. utils/toc_extractor_visual.py
19. utils/llm_structurer.py (legacy)
</critical_modules>
</implementation_priority>

<implementation_steps>
<feature_1>
<title>Setup Type Checking Infrastructure</title>
<description>
Configure mypy with strict settings and create foundational type definitions
</description>
<tasks>
- Create mypy.ini configuration file with strict settings
- Add mypy to requirements.txt or dev dependencies
- Create utils/types.py module for common TypedDict definitions
- Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
- Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
- Create Protocol types for callbacks (ProgressCallback, etc.)
- Document type definitions in the utils/types.py module docstring
- Test the mypy configuration on a single module to verify settings
</tasks>
<acceptance_criteria>
- mypy.ini exists with strict configuration
- utils/types.py contains all foundational types with docstrings
- mypy runs without errors on utils/types.py
- Type definitions are comprehensive and reusable
</acceptance_criteria>
</feature_1>

<feature_2>
<title>Add Types to PDF Pipeline Orchestration</title>
<description>
Add complete type annotations to pdf_pipeline.py (879 lines, the most complex module)
</description>
<tasks>
- Add type annotations to all function signatures in pdf_pipeline.py
- Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Enrich, Validate, Weaviate
- Type the progress_callback parameter with a Protocol or Callable
- Add a TypedDict for the pipeline options dictionary
- Add a TypedDict for the pipeline result dictionary structure
- Type all helper functions (extract_document_metadata_legacy, etc.)
- Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
- Fix any mypy errors that arise
- Verify mypy --strict passes on pdf_pipeline.py
</tasks>
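The callback typing called for above might be sketched like this; `run_step` and the three-argument signature are illustrative stand-ins for the project's actual API:

```python
from typing import Optional, Protocol


class ProgressCallback(Protocol):
    """Structural type for progress reporters: any callable with this
    signature satisfies the Protocol, no inheritance required."""
    def __call__(self, step_id: str, status: str, detail: str) -> None: ...


def run_step(step_id: str, progress_callback: Optional[ProgressCallback] = None) -> None:
    """Hypothetical pipeline step that reports progress when a callback is supplied."""
    if progress_callback is not None:
        progress_callback(step_id, "running", "started")


events: list[tuple[str, str, str]] = []
run_step("ocr", lambda step_id, status, detail: events.append((step_id, status, detail)))
print(events)
```

Because a Protocol checks structure rather than ancestry, plain lambdas and existing functions pass the check, which is exactly what a loosely coupled `progress_callback` parameter needs.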
<acceptance_criteria>
- All functions in pdf_pipeline.py have complete type annotations
- progress_callback is properly typed with a Protocol
- All Dict[str, Any] replaced with TypedDict where appropriate
- mypy --strict pdf_pipeline.py passes with zero errors
- No # type: ignore comments (or justified if absolutely necessary)
</acceptance_criteria>
</feature_2>

<feature_3>
<title>Add Types to Flask Application</title>
<description>
Add complete type annotations to flask_app.py and type all routes
</description>
<tasks>
- Add type annotations to all Flask route handlers
- Type request.args, request.form, request.files usage
- Type jsonify() return values
- Type the get_weaviate_client context manager
- Type the get_collection_stats, get_all_chunks, search_chunks functions
- Add a TypedDict for Weaviate query results
- Type background job processing functions (run_processing_job)
- Type the SSE generator function (upload_progress)
- Add type hints for template rendering
- Verify mypy --strict passes on flask_app.py
</tasks>
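Typing the SSE generator might look like the sketch below; `format_sse` is a hypothetical helper, and the Flask `Response` wrapping is deliberately omitted to keep the example self-contained:

```python
import json
from collections.abc import Iterator


def format_sse(data: dict[str, object]) -> str:
    """Render one Server-Sent Events frame as 'data: <json>\\n\\n'."""
    return f"data: {json.dumps(data)}\n\n"


def upload_progress(steps: list[str]) -> Iterator[str]:
    """Yield SSE frames; annotating the return as Iterator[str] lets mypy
    verify that every yielded value is a string."""
    for step in steps:
        yield format_sse({"step": step, "status": "done"})
    yield format_sse({"event": "complete"})


frames = list(upload_progress(["ocr", "chunk"]))
print(frames[0])
```

In a real route this generator would typically be passed to `Response(..., mimetype="text/event-stream")`; only the generator's own typing is shown here.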
<acceptance_criteria>
- All Flask routes have complete type annotations
- Request/response types are clear and documented
- Weaviate query functions are properly typed
- The SSE generator is correctly typed
- mypy --strict flask_app.py passes with zero errors
</acceptance_criteria>
</feature_3>

<feature_4>
<title>Add Types to Core LLM Modules</title>
<description>
Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
</description>
<tasks>
- llm_metadata.py: Type the extract_metadata function and its return structure
- llm_toc.py: Type the extract_toc function and the TOC hierarchy structure
- llm_classifier.py: Type classify_sections, section types (Literal), validation functions
- llm_chunker.py: Type chunk_section_with_llm and chunk objects
- llm_cleaner.py: Type the clean_chunk and is_chunk_valid functions
- llm_validator.py: Type validate_document and the validation result structure
- Add TypedDict for LLM request/response structures
- Type provider selection ("ollama" | "mistral" as Literal)
- Type model names with Literal or constants
- Verify mypy --strict passes on all llm_*.py modules
</tasks>
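The Literal-based provider typing could be sketched as follows; `validate_provider` is a hypothetical helper, and the default model names are taken from the docstring example later in this spec:

```python
from typing import Literal, cast, get_args

Provider = Literal["ollama", "mistral"]

DEFAULT_MODELS: dict[Provider, str] = {
    "ollama": "qwen2.5:7b",
    "mistral": "mistral-small-latest",
}


def validate_provider(name: str) -> Provider:
    """Narrow an untrusted string to Provider at runtime; callers that pass
    a literal directly get the same constraint enforced statically by mypy."""
    if name not in get_args(Provider):
        raise ValueError(f"unknown provider: {name!r}")
    return cast(Provider, name)


print(DEFAULT_MODELS[validate_provider("ollama")])
```

The `cast` is safe because the membership check above it guarantees the value, so no `# type: ignore` is needed.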
<acceptance_criteria>
- All LLM modules have complete type annotations
- Section types use Literal for type safety
- Provider and model parameters are strongly typed
- LLM request/response structures use TypedDict
- mypy --strict passes on all llm_*.py modules with zero errors
</acceptance_criteria>
</feature_4>

<feature_5>
<title>Add Types to Weaviate and Database Modules</title>
<description>
Add complete type annotations to schema.py and weaviate_ingest.py
</description>
<tasks>
- schema.py: Type Weaviate configuration objects
- schema.py: Type collection property definitions
- weaviate_ingest.py: Type the ingest_document function signature
- weaviate_ingest.py: Type the delete_document_chunks function
- weaviate_ingest.py: Add a TypedDict for the Weaviate object structure
- Type batch insertion operations
- Type nested object references (work, document)
- Add proper error types for Weaviate exceptions
- Verify mypy --strict passes on both modules
</tasks>
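A TypedDict for the ingested object shape, including a nested reference, might be sketched like this; all field names are illustrative assumptions, not the project's actual Weaviate schema:

```python
from typing import TypedDict


class WorkRef(TypedDict):
    """Nested reference to the parent work (hypothetical fields)."""
    title: str
    author: str


class ChunkObject(TypedDict):
    """Shape of one object handed to the batch insert (hypothetical fields)."""
    text: str
    section_path: str
    work: WorkRef


def to_weaviate_object(text: str, section_path: str, work: WorkRef) -> ChunkObject:
    # Returning a TypedDict means mypy catches missing or misspelled keys
    # in every construction site, unlike a plain Dict[str, Any].
    return {"text": text, "section_path": section_path, "work": work}


obj = to_weaviate_object("Virtue is knowledge.", "1.2", {"title": "Meno", "author": "Plato"})
print(obj["work"]["author"])
```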
<acceptance_criteria>
- schema.py has complete type annotations for the Weaviate config
- weaviate_ingest.py functions are fully typed
- Nested object structures use TypedDict
- Weaviate client operations are properly typed
- mypy --strict passes on both modules with zero errors
</acceptance_criteria>
</feature_5>

<feature_6>
<title>Add Types to OCR and Parsing Modules</title>
<description>
Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py, and image_extractor.py
</description>
<tasks>
- mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
- mistral_client.py: Add TypedDict for Mistral API response structures
- ocr_processor.py: Type serialize_ocr_response and OCR object structures
- markdown_builder.py: Type build_markdown and the image_writer parameter
- hierarchy_parser.py: Type the build_hierarchy and flatten_hierarchy functions
- hierarchy_parser.py: Add a TypedDict for the hierarchy node structure
- image_extractor.py: Type create_image_writer and image handling
- Verify mypy --strict passes on all modules
</tasks>
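A hierarchy node is naturally recursive, which TypedDict supports via deferred annotations; the sketch below uses hypothetical field names and a toy `flatten` in the spirit of `flatten_hierarchy`:

```python
from __future__ import annotations

from typing import TypedDict


class HierarchyNode(TypedDict):
    """Recursive document-hierarchy node; `from __future__ import annotations`
    defers evaluation so the class can reference itself."""
    title: str
    level: int
    children: list[HierarchyNode]


def flatten(node: HierarchyNode) -> list[str]:
    """Depth-first list of section titles."""
    titles = [node["title"]]
    for child in node["children"]:
        titles.extend(flatten(child))
    return titles


tree: HierarchyNode = {
    "title": "Meno",
    "level": 0,
    "children": [{"title": "What is virtue?", "level": 1, "children": []}],
}
print(flatten(tree))
```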
<acceptance_criteria>
- All OCR/parsing modules have complete type annotations
- Mistral API structures use TypedDict
- Hierarchy nodes are properly typed
- Image handling functions are typed
- mypy --strict passes on all modules with zero errors
</acceptance_criteria>
</feature_6>

<feature_7>
<title>Add Google-Style Docstrings to Core Modules</title>
<description>
Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and the Weaviate modules
</description>
<tasks>
- pdf_pipeline.py: Add a module docstring explaining the V2 pipeline
- pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
- pdf_pipeline.py: Document each of the 10 pipeline steps in comments
- pdf_pipeline.py: Add an Examples section showing typical usage
- flask_app.py: Add a module docstring explaining the Flask application
- flask_app.py: Document all routes with request/response examples
- flask_app.py: Document Weaviate connection management
- schema.py: Add a module docstring explaining the schema design
- schema.py: Document each collection's purpose and relationships
- weaviate_ingest.py: Document the ingestion process with examples
- All docstrings must follow the Google style format exactly
</tasks>
<acceptance_criteria>
- All core modules have comprehensive module-level docstrings
- All public functions have Google-style docstrings
- Args, Returns, Raises sections are complete and accurate
- Examples are provided for complex functions
- Docstrings explain WHY, not just WHAT
</acceptance_criteria>
</feature_7>

<feature_8>
<title>Add Google-Style Docstrings to LLM Modules</title>
<description>
Add comprehensive Google-style docstrings to all LLM processing modules
</description>
<tasks>
- llm_metadata.py: Document the metadata extraction logic with examples
- llm_toc.py: Document TOC extraction strategies and fallbacks
- llm_classifier.py: Document section types and classification criteria
- llm_chunker.py: Document semantic vs. basic chunking approaches
- llm_cleaner.py: Document cleaning rules and validation logic
- llm_validator.py: Document validation criteria and corrections
- Add Examples sections showing input/output for each function
- Document LLM provider differences (Ollama vs. Mistral)
- Document cost implications in Notes sections
- All docstrings must follow the Google style format exactly
</tasks>
<acceptance_criteria>
- All LLM modules have comprehensive docstrings
- Each function has Args, Returns, Raises sections
- Examples show realistic input/output
- Provider differences are documented
- Cost implications are noted where relevant
</acceptance_criteria>
</feature_8>

<feature_9>
<title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
<description>
Add comprehensive Google-style docstrings to the OCR, markdown, hierarchy, and extraction modules
</description>
<tasks>
- mistral_client.py: Document OCR API usage and cost calculation
- ocr_processor.py: Document OCR response processing
- markdown_builder.py: Document the markdown generation strategy
- hierarchy_parser.py: Document the hierarchy-building algorithm
- image_extractor.py: Document the image extraction process
- toc_extractor*.py: Document the various TOC extraction methods
- Add Examples sections for complex algorithms
- Document edge cases and error handling
- All docstrings must follow the Google style format exactly
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have comprehensive docstrings
- Complex algorithms are well explained
- Edge cases are documented
- Error handling is documented
- Examples demonstrate typical usage
</acceptance_criteria>
</feature_9>

<feature_10>
<title>Final Validation and CI Integration</title>
<description>
Verify all type annotations and docstrings, and integrate mypy into CI/CD
</description>
<tasks>
- Run mypy --strict on the entire codebase and verify a 100% pass rate
- Verify all public functions have docstrings
- Check docstring formatting with pydocstyle or a similar tool
- Create a GitHub Actions workflow to run mypy on every commit
- Update README.md with type checking instructions
- Update CLAUDE.md with documentation standards
- Create CONTRIBUTING.md with type annotation and docstring guidelines
- Generate API documentation with Sphinx or pdoc
- Fix any remaining mypy errors or missing docstrings
</tasks>
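The CI task above might be sketched as a workflow along these lines (the file path, Python version, and install step are assumptions; adapt to the project's dependency setup):

```yaml
# .github/workflows/typecheck.yml (hypothetical path)
name: typecheck
on: [push, pull_request]

jobs:
  mypy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install mypy
      # Fails the build on any type error, making the strict gate mandatory
      - run: mypy --config-file=mypy.ini .
```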
<acceptance_criteria>
- mypy --strict passes on the entire codebase with zero errors
- All public functions have Google-style docstrings
- CI/CD runs mypy checks automatically
- Documentation is generated and accessible
- Contributing guidelines document the type/docstring requirements
</acceptance_criteria>
</feature_10>
</implementation_steps>

<success_criteria>
<type_safety>
- 100% type coverage across all modules
- mypy --strict passes with zero errors
- No # type: ignore comments without justification
- All Dict[str, Any] replaced with TypedDict where appropriate
- Proper use of generics, protocols, and type variables
- NewType used for semantic type safety
</type_safety>

<documentation_quality>
- All modules have comprehensive module-level docstrings
- All public functions/classes have Google-style docstrings
- All docstrings include Args, Returns, Raises sections
- Complex functions include Examples sections
- Cost implications documented in Notes sections
- Error handling clearly documented
- Provider differences (Ollama vs. Mistral) documented
</documentation_quality>

<code_quality>
- Code is self-documenting with clear variable names
- Inline comments explain WHY, not WHAT
- Complex algorithms are well explained
- Performance considerations documented
- Security considerations documented
</code_quality>

<developer_experience>
- IDE autocomplete works reliably with type hints
- Type errors caught at development time, not runtime
- Documentation is easily accessible in the IDE
- API examples are executable and tested
- Contributing guidelines are clear and comprehensive
</developer_experience>

<maintainability>
- Refactoring is safer with type checking
- Function signatures are self-documenting
- API contracts are explicit and enforced
- Breaking changes are caught by the type checker
- New developers can understand the code quickly
</maintainability>
</success_criteria>

<constraints>
<compatibility>
- Must maintain backward compatibility with existing code
- Cannot break existing Flask routes or API contracts
- The Weaviate schema must remain unchanged
- Existing tests must continue to pass
</compatibility>

<gradual_migration>
- Can use per-module mypy configuration for gradual migration
- Can temporarily disable strict checks on legacy modules
- Priority modules must be completed first
- Low-priority modules can be deferred
</gradual_migration>

<standards>
- All type annotations must target Python 3.10+
- Docstrings must follow Google style exactly (not NumPy or reStructuredText)
- Keep typing module constructs (List, Dict, Optional) only where Python 3.9 support has not yet been dropped
- Use from __future__ import annotations if needed for forward references
</standards>
</constraints>

<testing_strategy>
<type_checking>
- Run mypy --strict on each module after adding types
- Use the mypy daemon (dmypy) for faster incremental checking
- Add mypy to pre-commit hooks
- CI/CD must run mypy and fail on type errors
</type_checking>

<documentation_validation>
- Use pydocstyle to validate the Google-style format
- Use sphinx-build to generate docs and catch errors
- Manually review docstring examples
- Verify examples are executable and correct
</documentation_validation>

<integration_testing>
- Verify existing tests still pass after type additions
- Add new tests for complex typed structures
- Test the mypy configuration on sample code
- Verify IDE autocomplete works correctly
</integration_testing>
</testing_strategy>

<documentation_examples>
<module_docstring>
```python
"""
PDF Pipeline V2 - Intelligent document processing with LLM enhancement.

This module orchestrates a 10-step pipeline for processing PDF documents:
1. OCR via Mistral API
2. Markdown construction with images
3. Metadata extraction via LLM
4. Table of contents (TOC) extraction
5. Section classification
6. Semantic chunking
7. Chunk cleaning and validation
8. Enrichment with concepts
9. Validation and corrections
10. Ingestion into Weaviate vector database

The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
various processing modes (skip OCR, semantic chunking, OCR annotations).

Typical usage:
    >>> from pathlib import Path
    >>> from utils.pdf_pipeline import process_pdf
    >>>
    >>> result = process_pdf(
    ...     Path("document.pdf"),
    ...     use_llm=True,
    ...     llm_provider="ollama",
    ...     ingest_to_weaviate=True,
    ... )
    >>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")

See Also:
    mistral_client: OCR API client
    llm_metadata: Metadata extraction
    weaviate_ingest: Database ingestion
"""
```
</module_docstring>

<function_docstring>
```python
def process_pdf_v2(
    pdf_path: Path,
    output_dir: Path = Path("output"),
    *,
    use_llm: bool = True,
    llm_provider: Literal["ollama", "mistral"] = "ollama",
    llm_model: Optional[str] = None,
    skip_ocr: bool = False,
    ingest_to_weaviate: bool = True,
    progress_callback: Optional[ProgressCallback] = None,
) -> PipelineResult:
    """
    Process a PDF through the complete V2 pipeline with LLM enhancement.

    This function orchestrates all 10 steps of the intelligent document processing
    pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
    cloud (Mistral API) LLM providers, with optional caching via skip_ocr.

    Args:
        pdf_path: Absolute path to the PDF file to process.
        output_dir: Base directory for output files. Defaults to "./output".
        use_llm: Enable LLM-based processing (metadata, TOC, chunking).
            If False, uses basic heuristic processing.
        llm_provider: LLM provider to use. "ollama" for local (free but slow),
            "mistral" for API (fast but paid).
        llm_model: Specific model name. If None, auto-detects based on provider
            (qwen2.5:7b for ollama, mistral-small-latest for mistral).
        skip_ocr: If True, reuses an existing markdown file to avoid OCR cost.
            Requires output_dir/<doc_name>/<doc_name>.md to exist.
        ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
        progress_callback: Optional callback for real-time progress updates.
            Called with (step_id, status, detail) for each pipeline step.

    Returns:
        Dictionary containing processing results with the following keys:
            - success (bool): True if processing completed without errors
            - document_name (str): Name of the processed document
            - pages (int): Number of pages in the PDF
            - chunks_count (int): Number of chunks generated
            - cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
            - cost_llm (float): LLM API cost in euros (0 if provider=ollama)
            - cost_total (float): Total cost (ocr + llm)
            - metadata (dict): Extracted metadata (title, author, etc.)
            - toc (list): Hierarchical table of contents
            - files (dict): Paths to generated files (markdown, chunks, etc.)

    Raises:
        FileNotFoundError: If pdf_path does not exist.
        ValueError: If skip_ocr=True but the markdown file is not found.
        RuntimeError: If the Weaviate connection fails during ingestion.

    Examples:
        Basic usage with Ollama (free):
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="ollama"
        ... )
        >>> print(f"Cost: {result['cost_total']:.4f}€")
        Cost: 0.0270€  # OCR only

        With Mistral API (faster):
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     llm_provider="mistral",
        ...     llm_model="mistral-small-latest"
        ... )

        Skip OCR to avoid cost:
        >>> result = process_pdf_v2(
        ...     Path("platon_menon.pdf"),
        ...     skip_ocr=True,  # Reuses existing markdown
        ...     ingest_to_weaviate=False
        ... )

    Notes:
        - OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
        - LLM cost: Free with Ollama, variable with Mistral API
        - Processing time: ~30s/page with Ollama, ~5s/page with Mistral
        - Weaviate must be running (docker-compose up -d) before ingestion
    """
```
</function_docstring>
</documentation_examples>
</project_specification>
## YOUR ROLE - CODING AGENT (Library RAG - Type Safety & Documentation)

You are working on adding strict type annotations and Google-style docstrings to a Python library project.
This is a FRESH context window - you have no memory of previous sessions.

You have access to Linear for project management via MCP tools. Linear is your single source of truth.

### STEP 1: GET YOUR BEARINGS (MANDATORY)

Start by orienting yourself:

```bash
# 1. See your working directory
pwd

# 2. List files to understand project structure
ls -la

# 3. Read the project specification
cat app_spec.txt

# 4. Read the Linear project state
cat .linear_project.json

# 5. Check recent git history
git log --oneline -20
```

|
|
||||||
### STEP 2: CHECK LINEAR STATUS

Query Linear to understand the current project state using the project_id from `.linear_project.json`.

1. **Get all issues and count progress:**
   ```
   mcp__linear__list_issues with project_id
   ```
   Count:
   - Issues "Done" = completed
   - Issues "Todo" = remaining
   - Issues "In Progress" = currently being worked on

2. **Find the META issue** (if it exists) for session context

3. **Check for in-progress work** - complete it first if found

### STEP 3: SELECT NEXT ISSUE

Get Todo issues sorted by priority:
```
mcp__linear__list_issues with project_id, status="Todo", limit=5
```

Select ONE highest-priority issue to work on.

### STEP 4: CLAIM THE ISSUE

Use `mcp__linear__update_issue` to set the status to "In Progress"

### STEP 5: IMPLEMENT THE ISSUE

Based on issue category:

**For Type Annotation Issues (e.g., "Types - Add type annotations to X.py"):**

1. Read the target Python file
2. Identify all functions, methods, and variables
3. Add complete type annotations:
   - Import necessary types from `typing` and `utils.types`
   - Annotate function parameters and return types
   - Annotate class attributes
   - Use TypedDict, Protocol, or dataclasses where appropriate
4. Save the file
5. Run mypy to verify (MANDATORY):
   ```bash
   cd generations/library_rag
   mypy --config-file=mypy.ini <file_path>
   ```
6. Fix any mypy errors
7. Commit the changes
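As a minimal illustration of step 3, here is a hypothetical function before and after annotation. The function name and behavior are invented for this example, not taken from the project:

```python
from typing import Any, Dict, List, Optional

# Before: no annotations
# def chunk_text(text, size=500, metadata=None):
#     ...

# After: fully annotated parameters and return type
def chunk_text(
    text: str,
    size: int = 500,
    metadata: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Split text into fixed-size chunks."""
    return [text[i : i + size] for i in range(0, len(text), size)]
```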
**For Documentation Issues (e.g., "Docs - Add docstrings to X.py"):**

1. Read the target Python file
2. Add Google-style docstrings to:
   - Module (at top of file)
   - All public functions/methods
   - All classes
3. Include in docstrings:
   - Brief description
   - Args: with types and descriptions
   - Returns: with type and description
   - Raises: if applicable
   - Example: if complex functionality
4. Save the file
5. Optionally run pydocstyle to verify (if installed)
6. Commit the changes

**For Setup/Infrastructure Issues:**

Follow the specific instructions in the issue description.
### STEP 6: VERIFICATION

**Type Annotation Issues:**
- Run mypy on the modified file(s)
- Ensure zero type errors
- If errors exist, fix them before proceeding

**Documentation Issues:**
- Review docstrings for completeness
- Ensure Args/Returns sections match function signatures
- Check that examples are accurate
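The "Args sections match signatures" check can be partially automated. This is a rough sketch (not a project tool) that flags parameters whose names never appear followed by a colon in a Google-style docstring:

```python
import inspect
from typing import Callable, List

def missing_doc_params(func: Callable) -> List[str]:
    """Return parameter names absent from the function's docstring."""
    doc = inspect.getdoc(func) or ""
    params = [p for p in inspect.signature(func).parameters if p != "self"]
    # A Google-style Args entry looks like "name: description"
    return [p for p in params if f"{p}:" not in doc]

def example(alpha: int, beta: str) -> None:
    """Do something.

    Args:
        alpha: A number.
    """

# missing_doc_params(example) flags "beta" as undocumented
```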
**Functional Changes (rare):**
- If the issue changes behavior, test manually
- Start the Flask server if needed: `python flask_app.py`
- Test the affected functionality
### STEP 7: GIT COMMIT

Make a descriptive commit:
```bash
git add <files>
git commit -m "<Issue ID>: <Short description>

- <List of changes>
- Verified with mypy (for type issues)
- Linear issue: <issue identifier>
"
```
### STEP 8: UPDATE LINEAR ISSUE

1. **Add implementation comment:**
   ```markdown
   ## Implementation Complete

   ### Changes Made
   - [List of files modified]
   - [Key changes]

   ### Verification
   - mypy passes with zero errors (for type issues)
   - All test steps from issue description verified

   ### Git Commit
   [commit hash and message]
   ```

2. **Update status to "Done"** using `mcp__linear__update_issue`
### STEP 9: DECIDE NEXT ACTION

After completing an issue, ask yourself:

1. Have I been working for a while? (Use judgment based on the complexity of work done)
2. Is the code in a stable state?
3. Would this be a good handoff point?

**If YES to all three:**
- Proceed to STEP 10 (Session Summary)
- End cleanly

**If NO:**
- Continue to another issue (go back to STEP 3)
- But commit first!

**Pacing Guidelines:**
- Early phase (< 20% done): Can complete multiple simple issues
- Mid/late phase (> 20% done): 1-2 issues per session for quality
### STEP 10: SESSION SUMMARY (When Ending)

If a META issue exists, add a comment:

```markdown
## Session Complete

### Completed This Session
- [Issue ID]: [Title] - [Brief summary]

### Current Progress
- X issues Done
- Y issues In Progress
- Z issues Todo

### Notes for Next Session
- [Important context]
- [Recommendations]
- [Any concerns]
```

Ensure:
- All code committed
- No uncommitted changes
- App in working state

---
## LINEAR WORKFLOW RULES

**Status Transitions:**
- Todo → In Progress (when starting)
- In Progress → Done (when verified)

**NEVER:**
- Delete or modify issue descriptions
- Mark an issue Done without verification
- Leave issues In Progress when switching away

---
## TYPE ANNOTATION GUIDELINES

**Imports needed:**
```python
from typing import Optional, Dict, List, Any, Tuple, Callable
from pathlib import Path
from utils.types import <ProjectSpecificTypes>
```

**Common patterns:**
```python
# Functions
def process_data(input: str, options: Optional[Dict[str, Any]] = None) -> List[str]:
    """Process input data."""
    ...

# Methods with self
def save(self, path: Path) -> None:
    """Save to file."""
    ...

# Async functions
async def fetch_data(url: str) -> Dict[str, Any]:
    """Fetch from API."""
    ...
```

**Use project types from `utils/types.py`:**
- Metadata, OCRResponse, TOCEntry, ChunkData, PipelineResult, etc.
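Since the workflow calls for TypedDict, Protocol, or dataclasses where appropriate, here is a minimal sketch of both. `ChunkData` and `Embedder` below are hypothetical shapes invented for illustration, not the actual definitions in `utils/types.py`:

```python
from typing import List, Protocol, TypedDict

class ChunkData(TypedDict):
    """Hypothetical chunk record; the real shape lives in utils/types.py."""
    text: str
    page: int

class Embedder(Protocol):
    """Structural type: any object with a matching embed() method satisfies it."""
    def embed(self, text: str) -> List[float]: ...

class FakeEmbedder:
    """Trivial stand-in implementation; no inheritance from Embedder needed."""
    def embed(self, text: str) -> List[float]:
        return [float(len(text))]

def embed_chunks(chunks: List[ChunkData], embedder: Embedder) -> List[List[float]]:
    return [embedder.embed(c["text"]) for c in chunks]
```

Protocols keep callers decoupled from concrete classes, which is why mypy accepts `FakeEmbedder` here without any subclassing.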
---

## DOCSTRING TEMPLATE (Google Style)

```python
def function_name(param1: str, param2: int = 0) -> List[str]:
    """
    Brief one-line description.

    More detailed description if needed. Explain what the function does,
    any important behavior, side effects, etc.

    Args:
        param1: Description of param1.
        param2: Description of param2. Defaults to 0.

    Returns:
        Description of return value.

    Raises:
        ValueError: When param1 is empty.
        IOError: When file cannot be read.

    Example:
        >>> result = function_name("test", 5)
        >>> print(result)
        ['test', 'test', 'test', 'test', 'test']
    """
```

---
## IMPORTANT REMINDERS

**Your Goal:** Add strict type annotations and comprehensive documentation to all Python modules.

**This Session's Goal:** Complete 1-2 issues with quality work and a clean handoff.

**Quality Bar:**
- mypy --strict passes with zero errors
- All public functions have complete Google-style docstrings
- Code is clean and well-documented

**Context is finite.** End sessions early with good handoff notes. The next agent will continue.

---

Begin by running STEP 1 (Get Your Bearings).