Remove obsolete documentation and backup files

- Remove REMOTE_WEAVIATE_ARCHITECTURE.md (moved to library_rag)
- Remove navette.txt (obsolete notes)
- Remove backup and obsolete app spec files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-30 11:57:21 +01:00
parent d2f7165120
commit ef8cd32711
7 changed files with 0 additions and 5577 deletions


@@ -1,679 +0,0 @@
<project_specification>
<project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>
<overview>
Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
improve code maintainability, enable static type checking with mypy, and provide clear documentation
for all functions, classes, and modules.
The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
semantic chunking, and ingestion into Weaviate vector database. It includes a Flask web interface
for document upload, processing, and semantic search.
</overview>
<technology_stack>
<backend>
<runtime>Python 3.10+</runtime>
<web_framework>Flask 3.0</web_framework>
<vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
<ocr>Mistral OCR API</ocr>
<llm>Ollama (local) or Mistral API</llm>
<type_checking>mypy with strict configuration</type_checking>
</backend>
<infrastructure>
<containerization>Docker Compose (Weaviate + transformers)</containerization>
<dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
</infrastructure>
</technology_stack>
<current_state>
<project_structure>
- flask_app.py: Main Flask application (640 lines)
- schema.py: Weaviate schema definition (383 lines)
- utils/: 16+ modules for PDF processing pipeline
- pdf_pipeline.py: Main orchestration (879 lines)
- mistral_client.py: OCR API client
- ocr_processor.py: OCR processing
- markdown_builder.py: Markdown generation
- llm_metadata.py: Metadata extraction via LLM
- llm_toc.py: Table of contents extraction
- llm_classifier.py: Section classification
- llm_chunker.py: Semantic chunking
- llm_cleaner.py: Chunk cleaning
- llm_validator.py: Document validation
- weaviate_ingest.py: Database ingestion
- hierarchy_parser.py: Document hierarchy parsing
- image_extractor.py: Image extraction from PDFs
- toc_extractor*.py: Various TOC extraction methods
- templates/: Jinja2 templates for Flask UI
- tests/utils2/: Minimal test coverage (3 test files)
</project_structure>
<issues>
- Inconsistent type annotations across modules (some have partial types, many have none)
- Missing or incomplete docstrings (no Google-style format)
- No mypy configuration for strict type checking
- Type hints missing on function parameters and return values
- Dict[str, Any] used extensively without proper typing
- No type stubs for complex nested structures
</issues>
</current_state>
<core_features>
<type_annotations>
<strict_typing>
- Add complete type annotations to ALL functions and methods
- Use proper generic types (List, Dict, Optional, Union) from typing module
- Add TypedDict for complex dictionary structures
- Add Protocol types for duck-typed interfaces
- Use Literal types for string constants
- Add ParamSpec and TypeVar where appropriate
- Type all class attributes and instance variables
- Add type annotations to lambda functions where possible
</strict_typing>
<mypy_configuration>
- Create mypy.ini with strict configuration
- Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
- Enable: disallow_untyped_calls, disallow_untyped_decorators
- Enable: warn_return_any, warn_redundant_casts
- Enable: strict_equality, strict_optional
- Set python_version to 3.10
- Configure per-module overrides if needed for gradual migration
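A minimal mypy.ini matching the settings above might look like this (the per-module override is illustrative; adjust it to the actual legacy modules):

```ini
[mypy]
python_version = 3.10
check_untyped_defs = True
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_untyped_calls = True
disallow_untyped_decorators = True
warn_return_any = True
warn_redundant_casts = True
strict_equality = True
strict_optional = True

; Example per-module override for gradual migration of a legacy module
[mypy-utils.llm_structurer]
disallow_untyped_defs = False
```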
</mypy_configuration>
<type_stubs>
- Create TypedDict definitions for common data structures:
- OCR response structures
- Metadata dictionaries
- TOC entries
- Chunk objects
- Weaviate objects
- Pipeline results
- Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
- Create Protocol types for callback functions
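The foundational definitions could be sketched as follows for utils/types.py (field names are illustrative assumptions; the real TypedDicts must mirror the pipeline's actual dictionary shapes):

```python
from typing import NewType, Optional, Protocol, TypedDict

# Semantic aliases: distinct to mypy, plain str at runtime
DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


class TOCEntry(TypedDict):
    """One table-of-contents entry (illustrative fields)."""
    title: str
    level: int
    page: Optional[int]


class ChunkData(TypedDict):
    """One semantic chunk ready for ingestion (illustrative fields)."""
    chunk_id: ChunkId
    text: str
    section_path: str


class ProgressCallback(Protocol):
    """Structural type for pipeline progress callbacks."""
    def __call__(self, step_id: str, status: str, detail: str) -> None: ...
```

Any function accepting a `(step_id, status, detail)` callable then satisfies ProgressCallback without inheriting from it.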
</type_stubs>
<specific_improvements>
- pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
- flask_app.py: Type all route handlers, request/response types
- schema.py: Type Weaviate configuration objects
- llm_*.py: Type LLM request/response structures
- mistral_client.py: Type API client methods and responses
- weaviate_ingest.py: Type ingestion functions and batch operations
</specific_improvements>
</type_annotations>
<documentation>
<google_style_docstrings>
- Add comprehensive Google-style docstrings to ALL:
- Module-level docstrings explaining purpose and usage
- Class docstrings with Attributes section
- Function/method docstrings with Args, Returns, Raises sections
- Complex algorithm explanations with Examples section
- Include code examples for public APIs
- Document all exceptions that can be raised
- Add Notes section for important implementation details
- Add See Also section for related functions
</google_style_docstrings>
<module_documentation>
<utils_modules>
- pdf_pipeline.py: Document the 10-step pipeline, each step's purpose
- mistral_client.py: Document OCR API usage, cost calculation
- llm_metadata.py: Document metadata extraction logic
- llm_toc.py: Document TOC extraction strategies
- llm_classifier.py: Document section classification types
- llm_chunker.py: Document semantic vs basic chunking
- llm_cleaner.py: Document cleaning rules and validation
- llm_validator.py: Document validation criteria
- weaviate_ingest.py: Document ingestion process, nested objects
- hierarchy_parser.py: Document hierarchy building algorithm
</utils_modules>
<flask_app>
- Document all routes with request/response examples
- Document SSE (Server-Sent Events) implementation
- Document Weaviate query patterns
- Document upload processing workflow
- Document background job management
</flask_app>
<schema>
- Document Weaviate schema design decisions
- Document each collection's purpose and relationships
- Document nested object structure
- Document vectorization strategy
</schema>
</module_documentation>
<inline_comments>
- Add inline comments for complex logic only (don't over-comment)
- Explain WHY not WHAT (code should be self-documenting)
- Document performance considerations
- Document cost implications (OCR, LLM API calls)
- Document error handling strategies
</inline_comments>
</documentation>
<validation>
<type_checking>
- All modules must pass mypy --strict
- No # type: ignore comments without justification
- CI/CD should run mypy checks
- Type coverage should be 100%
</type_checking>
<documentation_quality>
- All public functions must have docstrings
- All docstrings must follow Google style
- Examples should be executable and tested
- Documentation should be clear and concise
</documentation_quality>
</validation>
</core_features>
<implementation_priority>
<critical_modules>
Priority 1 (Most used, most complex):
1. utils/pdf_pipeline.py - Main orchestration
2. flask_app.py - Web application entry point
3. utils/weaviate_ingest.py - Database operations
4. schema.py - Schema definition
Priority 2 (Core LLM modules):
5. utils/llm_metadata.py
6. utils/llm_toc.py
7. utils/llm_classifier.py
8. utils/llm_chunker.py
9. utils/llm_cleaner.py
10. utils/llm_validator.py
Priority 3 (OCR and parsing):
11. utils/mistral_client.py
12. utils/ocr_processor.py
13. utils/markdown_builder.py
14. utils/hierarchy_parser.py
15. utils/image_extractor.py
Priority 4 (Supporting modules):
16. utils/toc_extractor.py
17. utils/toc_extractor_markdown.py
18. utils/toc_extractor_visual.py
19. utils/llm_structurer.py (legacy)
</critical_modules>
</implementation_priority>
<implementation_steps>
<feature_1>
<title>Setup Type Checking Infrastructure</title>
<description>
Configure mypy with strict settings and create foundational type definitions
</description>
<tasks>
- Create mypy.ini configuration file with strict settings
- Add mypy to requirements.txt or dev dependencies
- Create utils/types.py module for common TypedDict definitions
- Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
- Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
- Create Protocol types for callbacks (ProgressCallback, etc.)
- Document type definitions in utils/types.py module docstring
- Test mypy configuration on a single module to verify settings
</tasks>
<acceptance_criteria>
- mypy.ini exists with strict configuration
- utils/types.py contains all foundational types with docstrings
- mypy runs without errors on utils/types.py
- Type definitions are comprehensive and reusable
</acceptance_criteria>
</feature_1>
<feature_2>
<title>Add Types to PDF Pipeline Orchestration</title>
<description>
Add complete type annotations to pdf_pipeline.py (879 lines, most complex module)
</description>
<tasks>
- Add type annotations to all function signatures in pdf_pipeline.py
- Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Enrich, Validate, Weaviate
- Type progress_callback parameter with Protocol or Callable
- Add TypedDict for pipeline options dictionary
- Add TypedDict for pipeline result dictionary structure
- Type all helper functions (extract_document_metadata_legacy, etc.)
- Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
- Fix any mypy errors that arise
- Verify mypy --strict passes on pdf_pipeline.py
</tasks>
<acceptance_criteria>
- All functions in pdf_pipeline.py have complete type annotations
- progress_callback is properly typed with Protocol
- All Dict[str, Any] replaced with TypedDict where appropriate
- mypy --strict pdf_pipeline.py passes with zero errors
- No # type: ignore comments (or justified if absolutely necessary)
</acceptance_criteria>
</feature_2>
<feature_3>
<title>Add Types to Flask Application</title>
<description>
Add complete type annotations to flask_app.py and type all routes
</description>
<tasks>
- Add type annotations to all Flask route handlers
- Type request.args, request.form, request.files usage
- Type jsonify() return values
- Type get_weaviate_client context manager
- Type get_collection_stats, get_all_chunks, search_chunks functions
- Add TypedDict for Weaviate query results
- Type background job processing functions (run_processing_job)
- Type SSE generator function (upload_progress)
- Add type hints for template rendering
- Verify mypy --strict passes on flask_app.py
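As one concrete case, the SSE generator can be typed as an Iterator[str] that yields pre-formatted event frames (a sketch; the real upload_progress reads from the job store rather than a list):

```python
import json
from typing import Iterator


def sse_events(updates: list[dict[str, str]]) -> Iterator[str]:
    """Yield Server-Sent Events frames, one 'data: <json>\\n\\n' per update."""
    for update in updates:
        yield f"data: {json.dumps(update)}\n\n"
```

In Flask this generator would typically be wrapped as `Response(sse_events(...), mimetype="text/event-stream")`.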
</tasks>
<acceptance_criteria>
- All Flask routes have complete type annotations
- Request/response types are clear and documented
- Weaviate query functions are properly typed
- SSE generator is correctly typed
- mypy --strict flask_app.py passes with zero errors
</acceptance_criteria>
</feature_3>
<feature_4>
<title>Add Types to Core LLM Modules</title>
<description>
Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
</description>
<tasks>
- llm_metadata.py: Type extract_metadata function, return structure
- llm_toc.py: Type extract_toc function, TOC hierarchy structure
- llm_classifier.py: Type classify_sections, section types (Literal), validation functions
- llm_chunker.py: Type chunk_section_with_llm, chunk objects
- llm_cleaner.py: Type clean_chunk, is_chunk_valid functions
- llm_validator.py: Type validate_document, validation result structure
- Add TypedDict for LLM request/response structures
- Type provider selection ("ollama" | "mistral" as Literal)
- Type model names with Literal or constants
- Verify mypy --strict passes on all llm_*.py modules
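Provider selection could be narrowed with Literal so that a typo such as "olama" becomes a type error rather than a runtime failure (a sketch; default model names are taken from the docstring example later in this spec):

```python
from typing import Literal, Optional

Provider = Literal["ollama", "mistral"]

# Default models per provider, as described elsewhere in this spec
_DEFAULT_MODELS: dict[Provider, str] = {
    "ollama": "qwen2.5:7b",
    "mistral": "mistral-small-latest",
}


def resolve_model(provider: Provider, model: Optional[str] = None) -> str:
    """Return the explicit model if given, else the provider default."""
    return model if model is not None else _DEFAULT_MODELS[provider]
```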
</tasks>
<acceptance_criteria>
- All LLM modules have complete type annotations
- Section types use Literal for type safety
- Provider and model parameters are strongly typed
- LLM request/response structures use TypedDict
- mypy --strict passes on all llm_*.py modules with zero errors
</acceptance_criteria>
</feature_4>
<feature_5>
<title>Add Types to Weaviate and Database Modules</title>
<description>
Add complete type annotations to schema.py and weaviate_ingest.py
</description>
<tasks>
- schema.py: Type Weaviate configuration objects
- schema.py: Type collection property definitions
- weaviate_ingest.py: Type ingest_document function signature
- weaviate_ingest.py: Type delete_document_chunks function
- weaviate_ingest.py: Add TypedDict for Weaviate object structure
- Type batch insertion operations
- Type nested object references (work, document)
- Add proper error types for Weaviate exceptions
- Verify mypy --strict passes on both modules
</tasks>
<acceptance_criteria>
- schema.py has complete type annotations for Weaviate config
- weaviate_ingest.py functions are fully typed
- Nested object structures use TypedDict
- Weaviate client operations are properly typed
- mypy --strict passes on both modules with zero errors
</acceptance_criteria>
</feature_5>
<feature_6>
<title>Add Types to OCR and Parsing Modules</title>
<description>
Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py
</description>
<tasks>
- mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
- mistral_client.py: Add TypedDict for Mistral API response structures
- ocr_processor.py: Type serialize_ocr_response, OCR object structures
- markdown_builder.py: Type build_markdown, image_writer parameter
- hierarchy_parser.py: Type build_hierarchy, flatten_hierarchy functions
- hierarchy_parser.py: Add TypedDict for hierarchy node structure
- image_extractor.py: Type create_image_writer, image handling
- Verify mypy --strict passes on all modules
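The hierarchy node could be a recursive TypedDict with a depth-first flatten, sketched as follows (field names are assumptions, not verified against the real hierarchy_parser.py):

```python
from typing import TypedDict


class HierarchyNode(TypedDict):
    """One node of the document hierarchy (illustrative fields)."""
    title: str
    level: int
    children: list["HierarchyNode"]


def flatten_hierarchy(nodes: list[HierarchyNode]) -> list[HierarchyNode]:
    """Return all nodes in depth-first order (sketch)."""
    flat: list[HierarchyNode] = []
    for node in nodes:
        flat.append(node)
        flat.extend(flatten_hierarchy(node["children"]))
    return flat
```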
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have complete type annotations
- Mistral API structures use TypedDict
- Hierarchy nodes are properly typed
- Image handling functions are typed
- mypy --strict passes on all modules with zero errors
</acceptance_criteria>
</feature_6>
<feature_7>
<title>Add Google-Style Docstrings to Core Modules</title>
<description>
Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and weaviate modules
</description>
<tasks>
- pdf_pipeline.py: Add module docstring explaining the V2 pipeline
- pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
- pdf_pipeline.py: Document each of the 10 pipeline steps in comments
- pdf_pipeline.py: Add Examples section showing typical usage
- flask_app.py: Add module docstring explaining Flask application
- flask_app.py: Document all routes with request/response examples
- flask_app.py: Document Weaviate connection management
- schema.py: Add module docstring explaining schema design
- schema.py: Document each collection's purpose and relationships
- weaviate_ingest.py: Document ingestion process with examples
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All core modules have comprehensive module-level docstrings
- All public functions have Google-style docstrings
- Args, Returns, Raises sections are complete and accurate
- Examples are provided for complex functions
- Docstrings explain WHY, not just WHAT
</acceptance_criteria>
</feature_7>
<feature_8>
<title>Add Google-Style Docstrings to LLM Modules</title>
<description>
Add comprehensive Google-style docstrings to all LLM processing modules
</description>
<tasks>
- llm_metadata.py: Document metadata extraction logic with examples
- llm_toc.py: Document TOC extraction strategies and fallbacks
- llm_classifier.py: Document section types and classification criteria
- llm_chunker.py: Document semantic vs basic chunking approaches
- llm_cleaner.py: Document cleaning rules and validation logic
- llm_validator.py: Document validation criteria and corrections
- Add Examples sections showing input/output for each function
- Document LLM provider differences (Ollama vs Mistral)
- Document cost implications in Notes sections
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All LLM modules have comprehensive docstrings
- Each function has Args, Returns, Raises sections
- Examples show realistic input/output
- Provider differences are documented
- Cost implications are noted where relevant
</acceptance_criteria>
</feature_8>
<feature_9>
<title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
<description>
Add comprehensive Google-style docstrings to OCR, markdown, hierarchy, and extraction modules
</description>
<tasks>
- mistral_client.py: Document OCR API usage, cost calculation
- ocr_processor.py: Document OCR response processing
- markdown_builder.py: Document markdown generation strategy
- hierarchy_parser.py: Document hierarchy building algorithm
- image_extractor.py: Document image extraction process
- toc_extractor*.py: Document various TOC extraction methods
- Add Examples sections for complex algorithms
- Document edge cases and error handling
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have comprehensive docstrings
- Complex algorithms are well explained
- Edge cases are documented
- Error handling is documented
- Examples demonstrate typical usage
</acceptance_criteria>
</feature_9>
<feature_10>
<title>Final Validation and CI Integration</title>
<description>
Verify all type annotations and docstrings, integrate mypy into CI/CD
</description>
<tasks>
- Run mypy --strict on entire codebase, verify 100% pass rate
- Verify all public functions have docstrings
- Check docstring formatting with pydocstyle or similar tool
- Create GitHub Actions workflow to run mypy on every commit
- Update README.md with type checking instructions
- Update CLAUDE.md with documentation standards
- Create CONTRIBUTING.md with type annotation and docstring guidelines
- Generate API documentation with Sphinx or pdoc
- Fix any remaining mypy errors or missing docstrings
</tasks>
<acceptance_criteria>
- mypy --strict passes on entire codebase with zero errors
- All public functions have Google-style docstrings
- CI/CD runs mypy checks automatically
- Documentation is generated and accessible
- Contributing guidelines document type/docstring requirements
</acceptance_criteria>
</feature_10>
</implementation_steps>
<success_criteria>
<type_safety>
- 100% type coverage across all modules
- mypy --strict passes with zero errors
- No # type: ignore comments without justification
- All Dict[str, Any] replaced with TypedDict where appropriate
- Proper use of generics, protocols, and type variables
- NewType used for semantic type safety
</type_safety>
<documentation_quality>
- All modules have comprehensive module-level docstrings
- All public functions/classes have Google-style docstrings
- All docstrings include Args, Returns, Raises sections
- Complex functions include Examples sections
- Cost implications documented in Notes sections
- Error handling clearly documented
- Provider differences (Ollama vs Mistral) documented
</documentation_quality>
<code_quality>
- Code is self-documenting with clear variable names
- Inline comments explain WHY, not WHAT
- Complex algorithms are well explained
- Performance considerations documented
- Security considerations documented
</code_quality>
<developer_experience>
- IDE autocomplete works perfectly with type hints
- Type errors caught at development time, not runtime
- Documentation is easily accessible in IDE
- API examples are executable and tested
- Contributing guidelines are clear and comprehensive
</developer_experience>
<maintainability>
- Refactoring is safer with type checking
- Function signatures are self-documenting
- API contracts are explicit and enforced
- Breaking changes are caught by type checker
- New developers can understand code quickly
</maintainability>
</success_criteria>
<constraints>
<compatibility>
- Must maintain backward compatibility with existing code
- Cannot break existing Flask routes or API contracts
- Weaviate schema must remain unchanged
- Existing tests must continue to pass
</compatibility>
<gradual_migration>
- Can use per-module mypy configuration for gradual migration
- Can temporarily disable strict checks on legacy modules
- Priority modules must be completed first
- Low-priority modules can be deferred
</gradual_migration>
<standards>
- All type annotations must use Python 3.10+ syntax
- Docstrings must follow Google style exactly (not NumPy or reStructuredText)
- Built-in generics (list, dict, X | None) are preferred since Python 3.10+ is required; typing-module aliases (List, Dict, Optional) remain acceptable for consistency with existing code
- Use from __future__ import annotations if needed for forward references
</standards>
</constraints>
<testing_strategy>
<type_checking>
- Run mypy --strict on each module after adding types
- Use mypy daemon (dmypy) for faster incremental checking
- Add mypy to pre-commit hooks
- CI/CD must run mypy and fail on type errors
</type_checking>
<documentation_validation>
- Use pydocstyle to validate Google-style format
- Use sphinx-build to generate docs and catch errors
- Manual review of docstring examples
- Verify examples are executable and correct
</documentation_validation>
<integration_testing>
- Verify existing tests still pass after type additions
- Add new tests for complex typed structures
- Test mypy configuration on sample code
- Verify IDE autocomplete works correctly
</integration_testing>
</testing_strategy>
<documentation_examples>
<module_docstring>
```python
"""
PDF Pipeline V2 - Intelligent document processing with LLM enhancement.
This module orchestrates a 10-step pipeline for processing PDF documents:
1. OCR via Mistral API
2. Markdown construction with images
3. Metadata extraction via LLM
4. Table of contents (TOC) extraction
5. Section classification
6. Semantic chunking
7. Chunk cleaning and validation
8. Enrichment with concepts
9. Validation and corrections
10. Ingestion into Weaviate vector database
The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
various processing modes (skip OCR, semantic chunking, OCR annotations).
Typical usage:
>>> from pathlib import Path
>>> from utils.pdf_pipeline import process_pdf
>>>
>>> result = process_pdf(
... Path("document.pdf"),
... use_llm=True,
... llm_provider="ollama",
... ingest_to_weaviate=True,
... )
>>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")
See Also:
mistral_client: OCR API client
llm_metadata: Metadata extraction
weaviate_ingest: Database ingestion
"""
```
</module_docstring>
<function_docstring>
```python
def process_pdf_v2(
pdf_path: Path,
output_dir: Path = Path("output"),
*,
use_llm: bool = True,
llm_provider: Literal["ollama", "mistral"] = "ollama",
llm_model: Optional[str] = None,
skip_ocr: bool = False,
ingest_to_weaviate: bool = True,
progress_callback: Optional[ProgressCallback] = None,
) -> PipelineResult:
"""
Process a PDF through the complete V2 pipeline with LLM enhancement.
This function orchestrates all 10 steps of the intelligent document processing
pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
cloud (Mistral API) LLM providers, with optional caching via skip_ocr.
Args:
pdf_path: Absolute path to the PDF file to process.
output_dir: Base directory for output files. Defaults to "./output".
use_llm: Enable LLM-based processing (metadata, TOC, chunking).
If False, uses basic heuristic processing.
llm_provider: LLM provider to use. "ollama" for local (free but slow),
"mistral" for API (fast but paid).
llm_model: Specific model name. If None, auto-detects based on provider
(qwen2.5:7b for ollama, mistral-small-latest for mistral).
skip_ocr: If True, reuses existing markdown file to avoid OCR cost.
Requires output_dir/<doc_name>/<doc_name>.md to exist.
ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
progress_callback: Optional callback for real-time progress updates.
Called with (step_id, status, detail) for each pipeline step.
Returns:
Dictionary containing processing results with the following keys:
- success (bool): True if processing completed without errors
- document_name (str): Name of the processed document
- pages (int): Number of pages in the PDF
- chunks_count (int): Number of chunks generated
- cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
- cost_llm (float): LLM API cost in euros (0 if provider=ollama)
- cost_total (float): Total cost (ocr + llm)
- metadata (dict): Extracted metadata (title, author, etc.)
- toc (list): Hierarchical table of contents
- files (dict): Paths to generated files (markdown, chunks, etc.)
Raises:
FileNotFoundError: If pdf_path does not exist.
ValueError: If skip_ocr=True but markdown file not found.
RuntimeError: If Weaviate connection fails during ingestion.
Examples:
Basic usage with Ollama (free):
>>> result = process_pdf_v2(
... Path("platon_menon.pdf"),
... llm_provider="ollama"
... )
>>> print(f"Cost: {result['cost_total']:.4f}€")
Cost: 0.0270€ # OCR only
With Mistral API (faster):
>>> result = process_pdf_v2(
... Path("platon_menon.pdf"),
... llm_provider="mistral",
... llm_model="mistral-small-latest"
... )
Skip OCR to avoid cost:
>>> result = process_pdf_v2(
... Path("platon_menon.pdf"),
... skip_ocr=True, # Reuses existing markdown
... ingest_to_weaviate=False
... )
Notes:
- OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
- LLM cost: Free with Ollama, variable with Mistral API
- Processing time: ~30s/page with Ollama, ~5s/page with Mistral
- Weaviate must be running (docker-compose up -d) before ingestion
"""
```
</function_docstring>
</documentation_examples>
</project_specification>


@@ -1,490 +0,0 @@
<project_specification>
<project_name>Library RAG - Native Markdown Support</project_name>
<overview>
Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
chunking, and Weaviate vectorization.
This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
for users who have philosophical texts in Markdown format.
</overview>
<technology_stack>
<backend>
<framework>Flask 3.0</framework>
<pipeline>utils/pdf_pipeline.py (to be extended)</pipeline>
<validation>Werkzeug secure_filename</validation>
<llm>Ollama (local) or Mistral API</llm>
<vectorization>Weaviate with BAAI/bge-m3</vectorization>
</backend>
<type_safety>
<type_checker>mypy strict mode</type_checker>
<docstrings>Google-style docstrings required</docstrings>
</type_safety>
</technology_stack>
<core_features>
<feature_1>
<title>Update Flask File Validation</title>
<description>
Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
configuration and file validation logic to support .md files while maintaining backward compatibility
with existing PDF workflows.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
</files_to_modify>
<implementation_details>
- Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
- Update allowed_file() function to accept both extensions
- Update upload.html template to accept .md files in file input
- Update error messages to reflect both formats
</implementation_details>
<test_steps>
1. Start Flask app
2. Navigate to /upload
3. Attempt to upload a .md file
4. Verify file is accepted (no "Format non supporté" error)
5. Verify PDF upload still works
</test_steps>
</feature_1>
<feature_2>
<title>Add Markdown Detection in Pipeline</title>
<description>
Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
Add logic to automatically skip OCR processing for .md files and copy the Markdown content
directly to the output directory.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
</files_to_modify>
<implementation_details>
- Add file extension detection: `file_ext = pdf_path.suffix.lower()`
- If file_ext == ".md":
- Skip OCR step entirely (no Mistral API call)
- Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
- Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
- Set nb_pages = md_content.count('\n# ') or 1 (estimate from H1 headers)
- Set cost_ocr = 0.0
- Emit progress: "markdown_load" instead of "ocr"
- If file_ext == ".pdf":
- Continue with existing OCR workflow
- Both paths converge at LLM processing (metadata, TOC, chunking)
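The Markdown branch described above might be sketched as follows (hedged: the function name and return shape are illustrative, not taken from the real pdf_pipeline.py; the page estimate also counts an H1 on the very first line, which a bare `count('\n# ')` would miss):

```python
from pathlib import Path


def load_source(src_path: Path, out_dir: Path) -> tuple[str, int, float]:
    """Return (markdown, estimated_pages, ocr_cost) for a .md input; sketch only."""
    if src_path.suffix.lower() == ".md":
        md_content = src_path.read_text(encoding="utf-8")
        # Copy the Markdown into the output directory, skipping OCR entirely
        (out_dir / f"{src_path.stem}.md").write_text(md_content, encoding="utf-8")
        # Estimate pages from H1 headers, including one on the first line
        nb_pages = md_content.count("\n# ") + (1 if md_content.startswith("# ") else 0)
        return md_content, nb_pages or 1, 0.0
    # .pdf inputs continue through the existing OCR workflow
    raise NotImplementedError("PDF branch: existing OCR workflow")
```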
</implementation_details>
<test_steps>
1. Create test Markdown file with philosophical content
2. Call process_pdf(Path("test.md"), use_llm=True)
3. Verify OCR is skipped (cost_ocr = 0.0)
4. Verify output/test/test.md is created
5. Verify no _ocr.json file is created
6. Verify LLM processing runs normally
</test_steps>
</feature_2>
<feature_3>
<title>Markdown-Specific Progress Callback</title>
<description>
Update the progress callback system to emit appropriate events for Markdown file processing.
Instead of "OCR Mistral en cours...", display "Chargement Markdown..." to provide accurate
user feedback during Server-Sent Events streaming.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (emit_progress calls)
- flask_app.py (process_file_background function)
</files_to_modify>
<implementation_details>
- Add conditional progress messages based on file type
- For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
- For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
- Update frontend to handle "markdown_load" event type
- Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
</implementation_details>
<test_steps>
1. Upload Markdown file via Flask interface
2. Monitor SSE progress stream at /upload/progress/&lt;job_id&gt;
3. Verify first step shows "Chargement du fichier Markdown..."
4. Verify no OCR-related messages appear
5. Verify subsequent steps (metadata, TOC, etc.) work normally
</test_steps>
</feature_3>
<feature_4>
<title>Update process_pdf_bytes for Markdown</title>
<description>
Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
This function currently creates a temporary PDF file, but for Markdown uploads,
it should create a temporary .md file instead.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
</files_to_modify>
<implementation_details>
- Detect file type from filename parameter
- If filename ends with .md:
- Create temp file with suffix=".md"
- Write file_bytes as UTF-8 text
- If filename ends with .pdf:
- Existing behavior (suffix=".pdf", binary write)
- Pass temp file path to process_pdf() which now handles both types
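The suffix-aware temp-file handling could look like this (a sketch with an assumed helper name, not the real process_pdf_bytes; the decode/encode round-trip rejects non-UTF-8 uploads early):

```python
import tempfile
from pathlib import Path


def write_temp_upload(file_bytes: bytes, filename: str) -> Path:
    """Persist an upload to a temp file whose suffix matches its type (sketch)."""
    suffix = ".md" if filename.lower().endswith(".md") else ".pdf"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        if suffix == ".md":
            # Round-trip through str to fail fast on invalid UTF-8
            tmp.write(file_bytes.decode("utf-8").encode("utf-8"))
        else:
            tmp.write(file_bytes)
        return Path(tmp.name)
```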
</implementation_details>
<test_steps>
1. Create Flask test client
2. POST multipart form with .md file to /upload
3. Verify process_pdf_bytes creates .md temp file
4. Verify temp file contains correct Markdown content
5. Verify cleanup deletes temp file after processing
</test_steps>
</feature_4>
<feature_5>
<title>Add Markdown File Validation</title>
<description>
Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
and basic Markdown structure. Reject files that are too large, contain binary data,
or have no meaningful content.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_create>
- utils/markdown_validator.py
</files_to_create>
<implementation_details>
- Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
- Checks:
- File size &lt; 10 MB
- Valid UTF-8 encoding
- Contains at least one header (#, ##, etc.)
- Not empty (at least 100 characters)
- No null bytes or excessive binary content
- Return dict with success, error, and warnings keys
- Call from process_pdf_v2 before processing
- Type annotations and Google-style docstrings required
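
The checks above could be sketched as follows (order of checks and exact error strings are illustrative):

```python
import re
from pathlib import Path
from typing import Any

MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB limit from the spec
MIN_CHARS = 100


def validate_markdown_file(file_path: Path) -> dict[str, Any]:
    """Validate an uploaded Markdown file; returns success/error/warnings keys."""
    result: dict[str, Any] = {"success": True, "error": None, "warnings": []}
    if file_path.stat().st_size > MAX_SIZE_BYTES:
        return {**result, "success": False, "error": "File too large"}
    raw = file_path.read_bytes()
    if b"\x00" in raw:
        return {**result, "success": False, "error": "Binary content detected"}
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return {**result, "success": False, "error": "Invalid UTF-8"}
    if MIN_CHARS > len(text.strip()):  # fewer than 100 meaningful characters
        return {**result, "success": False, "error": "File too short"}
    if not re.search(r"^#{1,6}\s", text, flags=re.MULTILINE):
        result["warnings"].append("No Markdown headers found")  # warn, continue
    return result
```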
</implementation_details>
<test_steps>
1. Test with valid Markdown file → passes validation
2. Test with empty file → fails with "File too short"
3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
4. Test with very large file (&gt;10MB) → fails with "File too large"
5. Test with plain text no headers → warning but continues
</test_steps>
</feature_5>
<feature_6>
<title>Update Documentation</title>
<description>
Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
</description>
<priority>3</priority>
<category>documentation</category>
<files_to_modify>
- README.md (add section under "Pipeline de Traitement")
- .claude/CLAUDE.md (update development guidelines)
- templates/upload.html (add help text)
</files_to_modify>
<implementation_details>
- README.md:
- Add "Support Markdown Natif" section
- Document accepted formats: PDF, MD
- Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
- Add example: process_pdf(Path("document.md"))
- CLAUDE.md:
- Update "Pipeline de Traitement" section
- Note conditional OCR step
- Document markdown_validator.py module
- upload.html:
- Update file input accept attribute: accept=".pdf,.md"
- Add help text: "Formats acceptés : PDF, Markdown (.md)"
</implementation_details>
<test_steps>
1. Read README.md markdown support section
2. Verify examples are clear and accurate
3. Check CLAUDE.md developer notes
4. Open /upload in browser
5. Verify help text displays correctly
</test_steps>
</feature_6>
<feature_7>
<title>Add Unit Tests for Markdown Processing</title>
<description>
Create comprehensive unit tests for Markdown file handling to ensure reliability
and prevent regressions. Cover file validation, pipeline processing, and edge cases.
</description>
<priority>2</priority>
<category>testing</category>
<files_to_create>
- tests/utils/test_markdown_validator.py
- tests/utils/test_pdf_pipeline_markdown.py
- tests/fixtures/sample.md
</files_to_create>
<implementation_details>
- test_markdown_validator.py:
- Test valid Markdown acceptance
- Test invalid encoding rejection
- Test file size limits
- Test empty file rejection
- Test binary data detection
- test_pdf_pipeline_markdown.py:
- Test Markdown file processing end-to-end
- Test OCR skip for .md files
- Test cost_ocr = 0.0
- Test LLM processing (metadata, TOC, chunking)
- Mock Weaviate ingestion
- Verify output files created correctly
- fixtures/sample.md:
- Create realistic philosophical text in Markdown
- Include headers, paragraphs, formatting
- ~1000 words for realistic testing
</implementation_details>
<test_steps>
1. Run: pytest tests/utils/test_markdown_validator.py -v
2. Verify all validation tests pass
3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
4. Verify end-to-end Markdown processing works
5. Check test coverage: pytest --cov=utils --cov-report=html
</test_steps>
</feature_7>
<feature_8>
<title>Type Safety and Documentation</title>
<description>
Ensure all new code follows strict type safety requirements and includes comprehensive
Google-style docstrings. Run mypy checks and update type definitions as needed.
</description>
<priority>2</priority>
<category>type_safety</category>
<files_to_modify>
- utils/types.py (add Markdown-specific types if needed)
- All modified modules (type annotations)
</files_to_modify>
<implementation_details>
- Add type annotations to all new functions
- Update existing functions that handle both PDF and MD
- Consider adding:
- FileFormat = Literal["pdf", "md"]
- MarkdownValidationResult = TypedDict(...)
- Run mypy --strict on all modified files
- Add Google-style docstrings with:
- Args section documenting all parameters
- Returns section with structure details
- Raises section for exceptions
- Examples section for complex functions
</implementation_details>
<test_steps>
1. Run: mypy utils/pdf_pipeline.py --strict
2. Run: mypy utils/markdown_validator.py --strict
3. Verify no type errors
4. Run: pydocstyle utils/markdown_validator.py --convention=google
5. Verify all docstrings follow Google style
</test_steps>
</feature_8>
<feature_9>
<title>Handle Markdown-Specific Edge Cases</title>
<description>
Address edge cases specific to Markdown processing: front matter (YAML/TOML),
embedded code blocks, special characters, and non-standard Markdown extensions.
</description>
<priority>3</priority>
<category>backend</category>
<files_to_modify>
- utils/markdown_validator.py
- utils/llm_metadata.py (handle front matter)
</files_to_modify>
<implementation_details>
- Front matter handling:
- Detect YAML/TOML front matter (--- or +++)
- Extract metadata if present (title, author, date)
- Pass to LLM or use directly if valid
- Strip front matter before content processing
- Code block handling:
- Don't treat code blocks as actual content
- Preserve them for chunking but don't analyze
- Special characters:
- Handle Unicode properly (Greek, Latin, French accents)
- Preserve LaTeX equations in $ or $$
- GitHub Flavored Markdown:
- Support tables, task lists, strikethrough
- Convert to standard format if needed
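
Front matter stripping can be sketched without a YAML dependency (this handles only flat `key: value` pairs and the `---` delimiter; TOML `+++` and nested YAML would need a real parser):

```python
import re

FRONT_MATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n", re.DOTALL)


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body) for Markdown with optional YAML front matter."""
    match = FRONT_MATTER_RE.match(text)
    if not match:
        return {}, text
    meta: dict[str, str] = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    # Strip the front matter block before content processing
    return meta, text[match.end():]
```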
</implementation_details>
<test_steps>
1. Upload Markdown with YAML front matter
2. Verify metadata extracted correctly
3. Upload Markdown with code blocks
4. Verify code not treated as philosophical content
5. Upload Markdown with Greek/Latin text
6. Verify Unicode handled correctly
</test_steps>
</feature_9>
<feature_10>
<title>Update UI/UX for Markdown Upload</title>
<description>
Enhance the upload interface to clearly communicate Markdown support and provide
visual feedback about the file type being processed. Show format-specific information
(e.g., "No OCR cost for Markdown files").
</description>
<priority>3</priority>
<category>frontend</category>
<files_to_modify>
- templates/upload.html
- templates/upload_progress.html
</files_to_modify>
<implementation_details>
- upload.html:
- Add file type indicator icon (📄 PDF vs 📝 MD)
- Show format-specific help text on hover
- Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
- Add example Markdown file download link
- upload_progress.html:
- Show different icon for Markdown processing
- Adjust progress bar (9 steps vs 10 steps)
- Display "No OCR cost" badge for Markdown
- Update step descriptions based on file type
</implementation_details>
<test_steps>
1. Open /upload page
2. Verify help text mentions both PDF and MD
3. Select a .md file
4. Verify file type indicator shows 📝
5. Submit upload
6. Verify progress shows "Chargement Markdown..."
7. Verify "No OCR cost" badge displays
</test_steps>
</feature_10>
</core_features>
<implementation_steps>
<step number="1">
<title>Setup and Configuration</title>
<tasks>
- Update ALLOWED_EXTENSIONS in flask_app.py
- Modify allowed_file() validation function
- Update upload.html file input accept attribute
- Add Markdown MIME type handling
</tasks>
</step>
<step number="2">
<title>Core Pipeline Extension</title>
<tasks>
- Add file extension detection in process_pdf_v2()
- Implement Markdown file reading logic
- Skip OCR for .md files
- Add conditional progress callbacks
- Update process_pdf_bytes() for Markdown
</tasks>
</step>
<step number="3">
<title>Validation and Error Handling</title>
<tasks>
- Create markdown_validator.py module
- Implement UTF-8 encoding validation
- Add file size limits
- Handle front matter extraction
- Add comprehensive error messages
</tasks>
</step>
<step number="4">
<title>Testing Infrastructure</title>
<tasks>
- Create test fixtures (sample.md)
- Write validation tests
- Write pipeline integration tests
- Add edge case tests
- Verify mypy strict compliance
</tasks>
</step>
<step number="5">
<title>Documentation and Polish</title>
<tasks>
- Update README.md with Markdown support
- Update .claude/CLAUDE.md developer docs
- Add Google-style docstrings
- Update UI templates with new messaging
- Create usage examples
</tasks>
</step>
</implementation_steps>
<success_criteria>
<functionality>
- Markdown files upload successfully via Flask
- OCR is skipped for .md files (cost_ocr = 0.0)
- LLM processing works identically for PDF and MD
- Chunks are created and vectorized correctly
- Both file types can be searched in Weaviate
- Existing PDF workflow remains unchanged
</functionality>
<type_safety>
- All code passes mypy --strict
- All functions have type annotations
- Google-style docstrings on all modules
- No Any types without justification
- TypedDict definitions for new data structures
</type_safety>
<testing>
- Unit tests cover Markdown validation
- Integration tests verify end-to-end processing
- Edge cases handled (front matter, Unicode, large files)
- Test coverage &gt;80% for new code
- All tests pass in CI/CD pipeline
</testing>
<user_experience>
- Upload interface clearly shows both formats supported
- Progress feedback accurate for both PDF and MD
- Cost savings clearly communicated ("0€ for Markdown")
- Error messages helpful and specific
- Documentation clear with examples
</user_experience>
<performance>
- Markdown processing faster than PDF (no OCR)
- No regression in PDF processing speed
- Memory usage reasonable for large MD files
- Validation completes in &lt;100ms
- Overall pipeline &lt;30s for typical Markdown document
</performance>
</success_criteria>
<technical_notes>
<cost_comparison>
- PDF processing: OCR ~0.003€/page + LLM variable
- Markdown processing: 0€ OCR + LLM variable
- Estimated savings: 50-70% for documents with Markdown source
</cost_comparison>
<compatibility>
- Maintains backward compatibility with existing PDFs
- No breaking changes to API or database schema
- Existing chunks and documents unaffected
- Can process both formats in same session
</compatibility>
<future_enhancements>
- Support for .txt plain text files
- Support for .docx Word documents (via pandoc)
- Support for .epub ebooks
- Batch upload of multiple Markdown files
- Markdown to PDF export for archival
</future_enhancements>
</technical_notes>
</project_specification>


@@ -1,498 +0,0 @@
<project_specification>
<project_name>ikario - Tavily MCP Integration for Internet Access</project_name>
<overview>
This specification adds Tavily search capabilities via MCP (Model Context Protocol) to give Ikario
internet access for real-time web searches. Tavily provides high-quality search results optimized
for AI agents, making it ideal for research, fact-checking, and accessing current information.
This integration adds a new MCP server connection to the existing architecture (alongside the
ikario-memory MCP server) and exposes Tavily search tools to Ikario during conversations.
All changes are additive and backward-compatible. Existing functionality remains unchanged.
</overview>
<architecture_design>
<mcp_integration>
Tavily MCP Server Connection:
- Uses @modelcontextprotocol/sdk Client to connect to Tavily MCP server
- Connection can be stdio-based (local MCP server) or HTTP-based (remote)
- Tavily MCP server provides search tools that are exposed to Claude via Tool Use API
- Backend routes handle tool execution and return results to Claude
</mcp_integration>
<benefits>
- Real-time internet access for Ikario
- High-quality search results optimized for LLMs
- Fact-checking and verification capabilities
- Access to current events and news
- Research assistance with cited sources
- Seamless integration with existing memory tools
</benefits>
</architecture_design>
<technology_stack>
<mcp_server>
<name>Tavily MCP Server</name>
<protocol>Model Context Protocol (MCP)</protocol>
<connection>stdio or HTTP transport</connection>
<sdk>@modelcontextprotocol/sdk</sdk>
<api_key>Tavily API key (from https://tavily.com)</api_key>
</mcp_server>
<backend>
<runtime>Node.js with Express (existing)</runtime>
<mcp_client>MCP Client for Tavily server connection</mcp_client>
<tool_executor>Existing toolExecutor service extended with Tavily tools</tool_executor>
</backend>
<api_endpoints>
<tavily_routes>GET/POST /api/tavily/* for Tavily-specific operations</tavily_routes>
<existing_routes>Existing /api/claude/chat routes support Tavily tools automatically</existing_routes>
</api_endpoints>
</technology_stack>
<prerequisites>
<environment_setup>
- Tavily API key obtained from https://tavily.com (free tier available)
- API key stored in environment variable TAVILY_API_KEY or configuration file
- MCP SDK already installed (@modelcontextprotocol/sdk exists for ikario-memory)
- Tavily MCP server installed (npm package or Python package)
</environment_setup>
<configuration>
- Add Tavily MCP server config to server/.claude_settings.json or similar
- Configure connection parameters (stdio vs HTTP)
- Set API key securely
</configuration>
</prerequisites>
<core_features>
<feature_1>
<title>Tavily MCP Client Setup</title>
<description>
Create MCP client connection to Tavily search server. This is similar to the existing
ikario-memory MCP client but connects to Tavily instead.
Implementation:
- Create server/services/tavilyMcpClient.js
- Initialize MCP client with Tavily server connection
- Handle connection lifecycle (connect, disconnect, reconnect)
- Implement health checks and connection status
- Export client instance and helper functions
Configuration:
- Read Tavily API key from environment or config file
- Configure transport (stdio or HTTP)
- Set connection timeout and retry logic
- Log connection status for debugging
Error Handling:
- Graceful degradation if Tavily is unavailable
- Connection retry with exponential backoff
- Clear error messages for configuration issues
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Verify MCP client can connect to Tavily server on startup
2. Test connection health check endpoint returns correct status
3. Verify graceful handling when Tavily API key is missing
4. Test reconnection logic when connection drops
5. Verify connection status is logged correctly
6. Test that server starts even if Tavily is unavailable
</test_steps>
</feature_1>
<feature_2>
<title>Tavily Tool Configuration</title>
<description>
Configure Tavily search tools to be available to Claude during conversations.
This integrates with the existing tool system (like memory tools).
Implementation:
- Create server/config/tavilyTools.js
- Define tool schemas for Tavily search capabilities
- Integrate with existing toolExecutor service
- Add Tavily tools to system prompt alongside memory tools
Tavily Tools to Expose:
- tavily_search: General web search with AI-optimized results
- Parameters: query (string), max_results (number), search_depth (basic/advanced)
- Returns: Array of search results with title, url, content, score
- tavily_search_news: News-specific search for current events
- Parameters: query (string), max_results (number), days (number)
- Returns: Recent news articles with metadata
Tool Schema:
- Follow Claude Tool Use API format
- Clear descriptions for each tool
- Well-defined input schemas with validation
- Proper error handling in tool execution
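
A tool definition in the Claude Tool Use format might look like this (descriptions and defaults are illustrative):

```json
{
  "name": "tavily_search",
  "description": "General web search with AI-optimized results. Returns title, url, content and relevance score for each hit.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Search query" },
      "max_results": { "type": "number", "description": "Number of results (default 5)" },
      "search_depth": { "type": "string", "enum": ["basic", "advanced"] }
    },
    "required": ["query"]
  }
}
```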
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Verify Tavily tools are listed in available tools
2. Test tool schema validation with valid inputs
3. Test tool schema validation rejects invalid inputs
4. Verify tools appear in Claude's system prompt
5. Test that tool descriptions are clear and accurate
6. Verify tools can be called without errors
</test_steps>
</feature_2>
<feature_3>
<title>Tavily Tool Executor Integration</title>
<description>
Integrate Tavily tools into the existing toolExecutor service so Claude can
use them during conversations.
Implementation:
- Extend server/services/toolExecutor.js to handle Tavily tools
- Add tool detection for tavily_search and tavily_search_news
- Implement tool execution logic using Tavily MCP client
- Format Tavily results for Claude consumption
- Handle errors and timeouts gracefully
Tool Execution Flow:
1. Claude requests tool use (e.g., tavily_search)
2. toolExecutor detects Tavily tool request
3. Call Tavily MCP client with tool parameters
4. Receive and format search results
5. Return formatted results to Claude
6. Claude incorporates results into response
Result Formatting:
- Convert Tavily results to Claude-friendly format
- Include source URLs for citation
- Add relevance scores
- Truncate content if too long
- Handle empty results gracefully
</description>
<priority>1</priority>
<category>backend</category>
<test_steps>
1. Test tavily_search tool execution with valid query
2. Verify results are properly formatted
3. Test tavily_search_news tool execution
4. Verify error handling when Tavily API fails
5. Test timeout handling for slow searches
6. Verify results include proper citations and URLs
7. Test with empty search results
8. Test with very long search queries
</test_steps>
</feature_3>
<feature_4>
<title>System Prompt Enhancement for Internet Access</title>
<description>
Update the system prompt to inform Ikario about internet access capabilities.
This should be added alongside existing memory tools instructions.
Implementation:
- Update MEMORY_SYSTEM_PROMPT in server/routes/messages.js and claude.js
- Add Tavily tools documentation
- Provide usage guidelines for when to search the internet
- Include examples of good search queries
Prompt Addition:
"## Internet Access via Tavily
Tu as accès à internet en temps réel via deux outils de recherche :
1. tavily_search : Recherche web générale optimisée pour l'IA
- Utilise pour : rechercher des informations actuelles, vérifier des faits,
trouver des sources fiables
- Paramètres : query (ta question), max_results (nombre de résultats, défaut: 5),
search_depth ('basic' ou 'advanced')
- Retourne : Résultats avec titre, URL, contenu et score de pertinence
2. tavily_search_news : Recherche d'actualités récentes
- Utilise pour : événements actuels, nouvelles, actualités
- Paramètres : query, max_results, days (nombre de jours en arrière, défaut: 7)
Quand utiliser la recherche internet :
- Quand l'utilisateur demande des informations récentes ou actuelles
- Pour vérifier des faits ou données que tu n'es pas sûr de connaître
- Quand ta base de connaissances est trop ancienne (après janvier 2025)
- Pour trouver des sources et citations spécifiques
- Pour des requêtes nécessitant des données en temps réel
N'utilise PAS la recherche pour :
- Des questions sur ta propre identité ou capacités
- Des concepts généraux que tu connais déjà bien
- Des questions purement créatives ou d'opinion
Utilise ces outils de façon autonome selon les besoins de la conversation.
Cite toujours tes sources quand tu utilises des informations de Tavily."
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Verify system prompt includes Tavily instructions
2. Test that Claude understands when to use Tavily search
3. Verify Claude cites sources from Tavily results
4. Test that Claude uses appropriate search queries
5. Verify Claude chooses between tavily_search and tavily_search_news correctly
6. Test that Claude doesn't over-use search for simple questions
</test_steps>
</feature_4>
<feature_5>
<title>Tavily Status API Endpoint</title>
<description>
Create API endpoint to check Tavily MCP connection status and search capabilities.
Similar to /api/memory/status endpoint.
Implementation:
- Create GET /api/tavily/status endpoint
- Return connection status, available tools, and configuration
- Create GET /api/tavily/health endpoint for health checks
- Add Tavily status to existing /api/memory/stats (rename to /api/tools/stats)
Response Format:
{
"success": true,
"data": {
"connected": true,
"message": "Tavily MCP server is connected",
"tools": ["tavily_search", "tavily_search_news"],
"apiKeyConfigured": true,
"transport": "stdio"
}
}
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Test GET /api/tavily/status returns correct status
2. Verify status shows "connected" when Tavily is available
3. Verify status shows "disconnected" when Tavily is unavailable
4. Test health endpoint returns proper status code
5. Verify tools list is accurate
6. Test with missing API key shows proper error
</test_steps>
</feature_5>
<feature_6>
<title>Frontend UI Indicator for Internet Access</title>
<description>
Add visual indicator in the UI to show when Ikario has internet access via Tavily.
This can be displayed alongside the existing memory status indicator.
Implementation:
- Add Tavily status indicator in header or sidebar
- Show online/offline status for Tavily connection
- Optional: Show when Tavily is being used during a conversation
- Optional: Add tooltip explaining internet access capabilities
Visual Design:
- Globe or wifi icon to represent internet access
- Green when connected, gray when disconnected
- Subtle animation when search is in progress
- Tooltip: "Internet access via Tavily" or similar
Integration:
- Use existing useMemory hook pattern or create useTavily hook
- Poll /api/tavily/status periodically (every 60s)
- Update status in real-time during searches
</description>
<priority>3</priority>
<category>frontend</category>
<test_steps>
1. Verify internet access indicator appears in UI
2. Test status updates when Tavily connects/disconnects
3. Verify tooltip shows correct information
4. Test that indicator shows activity during searches
5. Verify status polling doesn't impact performance
6. Test with Tavily disabled shows offline status
</test_steps>
</feature_6>
<feature_7>
<title>Manual Search UI (Optional Enhancement)</title>
<description>
Optional: Add manual search interface to allow users to trigger Tavily searches directly,
similar to the memory search panel.
Implementation:
- Add "Internet Search" panel in sidebar (alongside Memory panel)
- Search input for manual Tavily queries
- Display search results with title, snippet, URL
- Click to insert results into conversation
- Filter by search type (general vs news)
This is OPTIONAL and lower priority. The primary use case is autonomous search by Claude.
</description>
<priority>4</priority>
<category>frontend</category>
<test_steps>
1. Verify search panel appears in sidebar
2. Test manual search returns results
3. Verify results display properly with links
4. Test inserting results into conversation
5. Test news search filter works correctly
6. Verify search history is saved (optional)
</test_steps>
</feature_7>
<feature_8>
<title>Configuration and Settings</title>
<description>
Add Tavily configuration options to settings and environment.
Implementation:
- Add TAVILY_API_KEY to environment variables
- Add Tavily settings to .claude_settings.json or similar config file
- Create server/config/tavilyConfig.js for configuration management
- Document configuration options in README
Configuration Options:
- API key
- Max results per search (default: 5)
- Search depth (basic/advanced)
- Timeout duration
- Enable/disable Tavily globally
- Rate limiting settings
Security:
- API key should NOT be exposed to frontend
- Use environment variable or secure config file
- Validate API key on startup
- Log warnings if API key is missing
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Verify API key is read from environment variable
2. Test fallback to config file if env var not set
3. Verify API key validation on startup
4. Test configuration options are applied correctly
5. Verify API key is never exposed in API responses
6. Test enabling/disabling Tavily via config
</test_steps>
</feature_8>
<feature_9>
<title>Error Handling and Rate Limiting</title>
<description>
Implement robust error handling and rate limiting for Tavily API calls.
Implementation:
- Detect and handle Tavily API errors (rate limits, invalid API key, etc.)
- Implement client-side rate limiting to avoid hitting Tavily limits
- Cache search results for duplicate queries (optional)
- Provide clear error messages to Claude when searches fail
Error Types:
- 401: Invalid API key
- 429: Rate limit exceeded
- 500: Tavily server error
- Timeout: Search took too long
- Network: Connection failed
Rate Limiting:
- Track searches per minute/hour
- Queue requests if limit reached
- Return cached results for duplicate queries within 5 minutes
- Log rate limit warnings
</description>
<priority>2</priority>
<category>backend</category>
<test_steps>
1. Test error handling for invalid API key
2. Verify rate limit detection and handling
3. Test timeout handling for slow searches
4. Verify error messages are clear to Claude
5. Test rate limiting prevents API abuse
6. Verify caching works for duplicate queries
</test_steps>
</feature_9>
<feature_10>
<title>Documentation and README Updates</title>
<description>
Update project documentation to explain Tavily integration.
Implementation:
- Update main README.md with Tavily setup instructions
- Add TAVILY_SETUP.md with detailed configuration guide
- Document API endpoints in README
- Add examples of using Tavily with Ikario
- Document troubleshooting steps
Documentation Sections:
- Prerequisites (Tavily API key)
- Installation steps
- Configuration options
- Testing Tavily connection
- Example conversations using internet search
- Troubleshooting common issues
- API reference for Tavily endpoints
</description>
<priority>3</priority>
<category>documentation</category>
<test_steps>
1. Verify README has Tavily setup section
2. Test that setup instructions are clear and complete
3. Verify all configuration options are documented
4. Test examples work as described
5. Verify troubleshooting section covers common issues
</test_steps>
</feature_10>
</core_features>
<implementation_notes>
<order>
Recommended implementation order:
1. Feature 1 (MCP Client Setup) - Foundation
2. Feature 2 (Tool Configuration) - Core functionality
3. Feature 3 (Tool Executor Integration) - Core functionality
4. Feature 8 (Configuration) - Required for testing
5. Feature 4 (System Prompt) - Makes tools accessible to Claude
6. Feature 9 (Error Handling) - Production readiness
7. Feature 5 (Status API) - Monitoring
8. Feature 10 (Documentation) - User onboarding
9. Feature 6 (UI Indicator) - Nice to have
10. Feature 7 (Manual Search UI) - Optional enhancement
</order>
<testing>
After implementing features 1-5, you should be able to:
- Ask Ikario: "Quelle est l'actualité aujourd'hui ?"
- Ask Ikario: "Recherche des informations sur [topic actuel]"
- Ask Ikario: "Vérifie cette information : [claim]"
Ikario should autonomously use Tavily search and cite sources.
</testing>
<compatibility>
- This specification is fully compatible with existing ikario-memory MCP integration
- Ikario will have both memory tools AND internet search tools
- Tools can be used together in the same conversation
- No conflicts expected between tool systems
</compatibility>
</implementation_notes>
<safety_requirements>
<critical>
- DO NOT expose Tavily API key to frontend or in API responses
- DO NOT modify existing MCP memory integration
- DO NOT break existing conversation functionality
- Tavily should gracefully degrade if unavailable (don't crash the app)
- Implement proper rate limiting to avoid API abuse
- Validate all user inputs before passing to Tavily
- Sanitize search results before displaying (XSS prevention)
- Log all Tavily API calls for monitoring and debugging
</critical>
</safety_requirements>
<success_metrics>
- Ikario can successfully perform internet searches when asked
- Search results are relevant and well-formatted
- Sources are properly cited
- Tavily integration doesn't slow down conversations
- Error handling is robust and user-friendly
- Configuration is straightforward
- Documentation is clear and complete
</success_metrics>
</project_specification>


@@ -1,679 +0,0 @@
<project_specification>
<project_name>Library RAG - Type Safety & Documentation Enhancement</project_name>
<overview>
Enhance the Library RAG application (philosophical texts indexing and semantic search) by adding
strict type annotations and comprehensive Google-style docstrings to all Python modules. This will
improve code maintainability, enable static type checking with mypy, and provide clear documentation
for all functions, classes, and modules.
The application is a RAG pipeline that processes PDF documents through OCR, LLM-based extraction,
semantic chunking, and ingestion into Weaviate vector database. It includes a Flask web interface
for document upload, processing, and semantic search.
</overview>
<technology_stack>
<backend>
<runtime>Python 3.10+</runtime>
<web_framework>Flask 3.0</web_framework>
<vector_database>Weaviate 1.34.4 with text2vec-transformers</vector_database>
<ocr>Mistral OCR API</ocr>
<llm>Ollama (local) or Mistral API</llm>
<type_checking>mypy with strict configuration</type_checking>
</backend>
<infrastructure>
<containerization>Docker Compose (Weaviate + transformers)</containerization>
<dependencies>weaviate-client, flask, mistralai, python-dotenv</dependencies>
</infrastructure>
</technology_stack>
<current_state>
<project_structure>
- flask_app.py: Main Flask application (640 lines)
- schema.py: Weaviate schema definition (383 lines)
- utils/: 16+ modules for PDF processing pipeline
- pdf_pipeline.py: Main orchestration (879 lines)
- mistral_client.py: OCR API client
- ocr_processor.py: OCR processing
- markdown_builder.py: Markdown generation
- llm_metadata.py: Metadata extraction via LLM
- llm_toc.py: Table of contents extraction
- llm_classifier.py: Section classification
- llm_chunker.py: Semantic chunking
- llm_cleaner.py: Chunk cleaning
- llm_validator.py: Document validation
- weaviate_ingest.py: Database ingestion
- hierarchy_parser.py: Document hierarchy parsing
- image_extractor.py: Image extraction from PDFs
- toc_extractor*.py: Various TOC extraction methods
- templates/: Jinja2 templates for Flask UI
- tests/utils2/: Minimal test coverage (3 test files)
</project_structure>
<issues>
- Inconsistent type annotations across modules (some have partial types, many have none)
- Missing or incomplete docstrings (no Google-style format)
- No mypy configuration for strict type checking
- Type hints missing on function parameters and return values
- Dict[str, Any] used extensively without proper typing
- No type stubs for complex nested structures
</issues>
</current_state>
<core_features>
<type_annotations>
<strict_typing>
- Add complete type annotations to ALL functions and methods
- Use proper generic types (List, Dict, Optional, Union) from typing module
- Add TypedDict for complex dictionary structures
- Add Protocol types for duck-typed interfaces
- Use Literal types for string constants
- Add ParamSpec and TypeVar where appropriate
- Type all class attributes and instance variables
- Add type annotations to lambda functions where possible
</strict_typing>
<mypy_configuration>
- Create mypy.ini with strict configuration
- Enable: check_untyped_defs, disallow_untyped_defs, disallow_incomplete_defs
- Enable: disallow_untyped_calls, disallow_untyped_decorators
- Enable: warn_return_any, warn_redundant_casts
- Enable: strict_equality, strict_optional
- Set python_version to 3.10
- Configure per-module overrides if needed for gradual migration
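
A mypy.ini matching these requirements could look like this (the per-module override name is illustrative):

```ini
[mypy]
python_version = 3.10
check_untyped_defs = True
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_untyped_calls = True
disallow_untyped_decorators = True
warn_return_any = True
warn_redundant_casts = True
strict_equality = True
strict_optional = True

; per-module override example for gradual migration
[mypy-utils.toc_extractor_legacy]
disallow_untyped_defs = False
```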
</mypy_configuration>
<type_stubs>
- Create TypedDict definitions for common data structures:
- OCR response structures
- Metadata dictionaries
- TOC entries
- Chunk objects
- Weaviate objects
- Pipeline results
- Add NewType for semantic type safety (DocumentName, ChunkId, etc.)
- Create Protocol types for callback functions
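
For example (field names and the callback signature are assumptions to be checked against the actual pipeline code):

```python
from typing import NewType, Protocol, TypedDict

DocumentName = NewType("DocumentName", str)
ChunkId = NewType("ChunkId", str)


class TocEntry(TypedDict):
    """One table-of-contents entry (fields illustrative)."""

    title: str
    level: int
    page: int


class ProgressCallback(Protocol):
    """Shape of the pipeline progress callbacks."""

    def __call__(self, step: str, status: str, message: str) -> None: ...
```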
</type_stubs>
<specific_improvements>
- pdf_pipeline.py: Type all 10 pipeline steps, callbacks, result dictionaries
- flask_app.py: Type all route handlers, request/response types
- schema.py: Type Weaviate configuration objects
- llm_*.py: Type LLM request/response structures
- mistral_client.py: Type API client methods and responses
- weaviate_ingest.py: Type ingestion functions and batch operations
</specific_improvements>
</type_annotations>
<documentation>
<google_style_docstrings>
- Add comprehensive Google-style docstrings to ALL:
- Module-level docstrings explaining purpose and usage
- Class docstrings with Attributes section
- Function/method docstrings with Args, Returns, Raises sections
- Complex algorithm explanations with Examples section
- Include code examples for public APIs
- Document all exceptions that can be raised
- Add Notes section for important implementation details
- Add See Also section for related functions
</google_style_docstrings>
<module_documentation>
<utils_modules>
- pdf_pipeline.py: Document the 10-step pipeline, each step's purpose
- mistral_client.py: Document OCR API usage, cost calculation
- llm_metadata.py: Document metadata extraction logic
- llm_toc.py: Document TOC extraction strategies
- llm_classifier.py: Document section classification types
- llm_chunker.py: Document semantic vs basic chunking
- llm_cleaner.py: Document cleaning rules and validation
- llm_validator.py: Document validation criteria
- weaviate_ingest.py: Document ingestion process, nested objects
- hierarchy_parser.py: Document hierarchy building algorithm
</utils_modules>
<flask_app>
- Document all routes with request/response examples
- Document SSE (Server-Sent Events) implementation
- Document Weaviate query patterns
- Document upload processing workflow
- Document background job management
</flask_app>
<schema>
- Document Weaviate schema design decisions
- Document each collection's purpose and relationships
- Document nested object structure
- Document vectorization strategy
</schema>
</module_documentation>
<inline_comments>
- Add inline comments for complex logic only (don't over-comment)
- Explain WHY not WHAT (code should be self-documenting)
- Document performance considerations
- Document cost implications (OCR, LLM API calls)
- Document error handling strategies
</inline_comments>
</documentation>
<validation>
<type_checking>
- All modules must pass mypy --strict
- No # type: ignore comments without justification
- CI/CD should run mypy checks
- Type coverage should be 100%
</type_checking>
<documentation_quality>
- All public functions must have docstrings
- All docstrings must follow Google style
- Examples should be executable and tested
- Documentation should be clear and concise
</documentation_quality>
</validation>
</core_features>
<implementation_priority>
<critical_modules>
Priority 1 (Most used, most complex):
1. utils/pdf_pipeline.py - Main orchestration
2. flask_app.py - Web application entry point
3. utils/weaviate_ingest.py - Database operations
4. schema.py - Schema definition
Priority 2 (Core LLM modules):
5. utils/llm_metadata.py
6. utils/llm_toc.py
7. utils/llm_classifier.py
8. utils/llm_chunker.py
9. utils/llm_cleaner.py
10. utils/llm_validator.py
Priority 3 (OCR and parsing):
11. utils/mistral_client.py
12. utils/ocr_processor.py
13. utils/markdown_builder.py
14. utils/hierarchy_parser.py
15. utils/image_extractor.py
Priority 4 (Supporting modules):
16. utils/toc_extractor.py
17. utils/toc_extractor_markdown.py
18. utils/toc_extractor_visual.py
19. utils/llm_structurer.py (legacy)
</critical_modules>
</implementation_priority>
<implementation_steps>
<feature_1>
<title>Setup Type Checking Infrastructure</title>
<description>
Configure mypy with strict settings and create foundational type definitions
</description>
<tasks>
- Create mypy.ini configuration file with strict settings
- Add mypy to requirements.txt or dev dependencies
- Create utils/types.py module for common TypedDict definitions
- Define core types: OCRResponse, Metadata, TOCEntry, ChunkData, PipelineResult
- Add NewType definitions for semantic types: DocumentName, ChunkId, SectionPath
- Create Protocol types for callbacks (ProgressCallback, etc.)
- Document type definitions in utils/types.py module docstring
- Test mypy configuration on a single module to verify settings
</tasks>
<acceptance_criteria>
- mypy.ini exists with strict configuration
- utils/types.py contains all foundational types with docstrings
- mypy runs without errors on utils/types.py
- Type definitions are comprehensive and reusable
</acceptance_criteria>
</feature_1>
<feature_2>
<title>Add Types to PDF Pipeline Orchestration</title>
<description>
Add complete type annotations to pdf_pipeline.py (879 lines, most complex module)
</description>
<tasks>
- Add type annotations to all function signatures in pdf_pipeline.py
- Type the 10-step pipeline: OCR, Markdown, Metadata, TOC, Classify, Chunk, Clean, Enrich, Validate, Weaviate
- Type progress_callback parameter with Protocol or Callable
- Add TypedDict for pipeline options dictionary
- Add TypedDict for pipeline result dictionary structure
- Type all helper functions (extract_document_metadata_legacy, etc.)
- Add proper return types for process_pdf_v2, process_pdf, process_pdf_bytes
- Fix any mypy errors that arise
- Verify mypy --strict passes on pdf_pipeline.py
</tasks>
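The Protocol-based callback typing called for above could be sketched as follows; the (step_id, status, detail) signature is an assumption about the callback contract, not the pipeline's actual API:

```python
from typing import Optional, Protocol


class ProgressCallback(Protocol):
    """Structural type for the pipeline's progress callback."""

    def __call__(self, step_id: str, status: str, detail: str) -> None: ...


def notify(
    step_id: str,
    progress_callback: Optional[ProgressCallback] = None,
) -> None:
    """Report the start and end of one pipeline step, if a callback is given."""
    if progress_callback is not None:
        progress_callback(step_id, "running", "step started")
        progress_callback(step_id, "done", "step finished")


# Any callable with a matching signature satisfies the Protocol structurally
events: list[tuple[str, str, str]] = []
notify("ocr", lambda s, st, d: events.append((s, st, d)))
```

Because Protocol matching is structural, callers never need to subclass ProgressCallback; plain functions and lambdas qualify.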
<acceptance_criteria>
- All functions in pdf_pipeline.py have complete type annotations
- progress_callback is properly typed with Protocol
- All Dict[str, Any] replaced with TypedDict where appropriate
- mypy --strict pdf_pipeline.py passes with zero errors
- No # type: ignore comments (or justified if absolutely necessary)
</acceptance_criteria>
</feature_2>
<feature_3>
<title>Add Types to Flask Application</title>
<description>
Add complete type annotations to flask_app.py and type all routes
</description>
<tasks>
- Add type annotations to all Flask route handlers
- Type request.args, request.form, request.files usage
- Type jsonify() return values
- Type get_weaviate_client context manager
- Type get_collection_stats, get_all_chunks, search_chunks functions
- Add TypedDict for Weaviate query results
- Type background job processing functions (run_processing_job)
- Type SSE generator function (upload_progress)
- Add type hints for template rendering
- Verify mypy --strict passes on flask_app.py
</tasks>
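The SSE generator typing might be sketched like this — the step names and payload fields are hypothetical, but the Iterator[str] return type and the SSE wire format are the point:

```python
import json
from typing import Iterator


def sse_events(job_id: str, steps: list[str]) -> Iterator[str]:
    """Yield one Server-Sent Events frame per completed step (names hypothetical)."""
    for step in steps:
        payload = json.dumps({"job": job_id, "step": step, "status": "done"})
        # SSE wire format: a "data:" line terminated by a blank line
        yield f"data: {payload}\n\n"


frames = list(sse_events("job-1", ["ocr", "chunking", "ingestion"]))
```

In Flask, such a generator would typically be wrapped as Response(sse_events(...), mimetype="text/event-stream").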
<acceptance_criteria>
- All Flask routes have complete type annotations
- Request/response types are clear and documented
- Weaviate query functions are properly typed
- SSE generator is correctly typed
- mypy --strict flask_app.py passes with zero errors
</acceptance_criteria>
</feature_3>
<feature_4>
<title>Add Types to Core LLM Modules</title>
<description>
Add complete type annotations to all LLM processing modules (metadata, TOC, classifier, chunker, cleaner, validator)
</description>
<tasks>
- llm_metadata.py: Type extract_metadata function, return structure
- llm_toc.py: Type extract_toc function, TOC hierarchy structure
- llm_classifier.py: Type classify_sections, section types (Literal), validation functions
- llm_chunker.py: Type chunk_section_with_llm, chunk objects
- llm_cleaner.py: Type clean_chunk, is_chunk_valid functions
- llm_validator.py: Type validate_document, validation result structure
- Add TypedDict for LLM request/response structures
- Type provider selection ("ollama" | "mistral" as Literal)
- Type model names with Literal or constants
- Verify mypy --strict passes on all llm_*.py modules
</tasks>
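The Literal-typed provider selection could look like this sketch (the default model names are the ones this spec's examples use; the helper function is hypothetical):

```python
from typing import Literal, Optional

LLMProvider = Literal["ollama", "mistral"]

# Default model per provider, taken from the spec's examples
DEFAULT_MODELS: dict[LLMProvider, str] = {
    "ollama": "qwen2.5:7b",
    "mistral": "mistral-small-latest",
}


def resolve_model(provider: LLMProvider, model: Optional[str] = None) -> str:
    """Return the explicit model name, or the provider's default."""
    return model if model is not None else DEFAULT_MODELS[provider]
```

With Literal, mypy rejects resolve_model("openai") at type-check time instead of failing at runtime.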
<acceptance_criteria>
- All LLM modules have complete type annotations
- Section types use Literal for type safety
- Provider and model parameters are strongly typed
- LLM request/response structures use TypedDict
- mypy --strict passes on all llm_*.py modules with zero errors
</acceptance_criteria>
</feature_4>
<feature_5>
<title>Add Types to Weaviate and Database Modules</title>
<description>
Add complete type annotations to schema.py and weaviate_ingest.py
</description>
<tasks>
- schema.py: Type Weaviate configuration objects
- schema.py: Type collection property definitions
- weaviate_ingest.py: Type ingest_document function signature
- weaviate_ingest.py: Type delete_document_chunks function
- weaviate_ingest.py: Add TypedDict for Weaviate object structure
- Type batch insertion operations
- Type nested object references (work, document)
- Add proper error types for Weaviate exceptions
- Verify mypy --strict passes on both modules
</tasks>
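The nested-object TypedDicts could be shaped roughly like this; the field names are illustrative assumptions, not the actual Weaviate schema:

```python
from typing import TypedDict


class WorkRef(TypedDict):
    """Nested reference to the parent work (fields illustrative)."""

    title: str
    author: str


class ChunkObject(TypedDict):
    """Shape of one chunk object as sent to Weaviate (fields illustrative)."""

    text: str
    section_path: str
    work: WorkRef


def to_weaviate_object(
    text: str, section_path: str, title: str, author: str
) -> ChunkObject:
    """Assemble one typed chunk object with its nested work reference."""
    return {
        "text": text,
        "section_path": section_path,
        "work": {"title": title, "author": author},
    }
```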
<acceptance_criteria>
- schema.py has complete type annotations for Weaviate config
- weaviate_ingest.py functions are fully typed
- Nested object structures use TypedDict
- Weaviate client operations are properly typed
- mypy --strict passes on both modules with zero errors
</acceptance_criteria>
</feature_5>
<feature_6>
<title>Add Types to OCR and Parsing Modules</title>
<description>
Add complete type annotations to mistral_client.py, ocr_processor.py, markdown_builder.py, hierarchy_parser.py
</description>
<tasks>
- mistral_client.py: Type create_client, run_ocr, estimate_ocr_cost
- mistral_client.py: Add TypedDict for Mistral API response structures
- ocr_processor.py: Type serialize_ocr_response, OCR object structures
- markdown_builder.py: Type build_markdown, image_writer parameter
- hierarchy_parser.py: Type build_hierarchy, flatten_hierarchy functions
- hierarchy_parser.py: Add TypedDict for hierarchy node structure
- image_extractor.py: Type create_image_writer, image handling
- Verify mypy --strict passes on all modules
</tasks>
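A typed hierarchy node and a stack-based build/flatten pair might look like this sketch (the real hierarchy_parser.py may organize its nodes differently):

```python
from __future__ import annotations

from typing import TypedDict


class HierarchyNode(TypedDict):
    """One node of the document hierarchy (fields illustrative)."""

    title: str
    level: int
    children: list[HierarchyNode]


def build_hierarchy(entries: list[tuple[int, str]]) -> list[HierarchyNode]:
    """Build a tree from (level, title) pairs using a stack of open nodes."""
    roots: list[HierarchyNode] = []
    stack: list[HierarchyNode] = []
    for level, title in entries:
        node: HierarchyNode = {"title": title, "level": level, "children": []}
        # Close any open nodes at the same or deeper level
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


def flatten_hierarchy(nodes: list[HierarchyNode]) -> list[str]:
    """Return titles in depth-first order."""
    out: list[str] = []
    for n in nodes:
        out.append(n["title"])
        out.extend(flatten_hierarchy(n["children"]))
    return out
```

The recursive children annotation requires the from __future__ import annotations line (or a quoted forward reference).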
<acceptance_criteria>
- All OCR/parsing modules have complete type annotations
- Mistral API structures use TypedDict
- Hierarchy nodes are properly typed
- Image handling functions are typed
- mypy --strict passes on all modules with zero errors
</acceptance_criteria>
</feature_6>
<feature_7>
<title>Add Google-Style Docstrings to Core Modules</title>
<description>
Add comprehensive Google-style docstrings to pdf_pipeline.py, flask_app.py, and weaviate modules
</description>
<tasks>
- pdf_pipeline.py: Add module docstring explaining the V2 pipeline
- pdf_pipeline.py: Add docstrings to process_pdf_v2 with Args, Returns, Raises sections
- pdf_pipeline.py: Document each of the 10 pipeline steps in comments
- pdf_pipeline.py: Add Examples section showing typical usage
- flask_app.py: Add module docstring explaining Flask application
- flask_app.py: Document all routes with request/response examples
- flask_app.py: Document Weaviate connection management
- schema.py: Add module docstring explaining schema design
- schema.py: Document each collection's purpose and relationships
- weaviate_ingest.py: Document ingestion process with examples
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All core modules have comprehensive module-level docstrings
- All public functions have Google-style docstrings
- Args, Returns, Raises sections are complete and accurate
- Examples are provided for complex functions
- Docstrings explain WHY, not just WHAT
</acceptance_criteria>
</feature_7>
<feature_8>
<title>Add Google-Style Docstrings to LLM Modules</title>
<description>
Add comprehensive Google-style docstrings to all LLM processing modules
</description>
<tasks>
- llm_metadata.py: Document metadata extraction logic with examples
- llm_toc.py: Document TOC extraction strategies and fallbacks
- llm_classifier.py: Document section types and classification criteria
- llm_chunker.py: Document semantic vs basic chunking approaches
- llm_cleaner.py: Document cleaning rules and validation logic
- llm_validator.py: Document validation criteria and corrections
- Add Examples sections showing input/output for each function
- Document LLM provider differences (Ollama vs Mistral)
- Document cost implications in Notes sections
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All LLM modules have comprehensive docstrings
- Each function has Args, Returns, Raises sections
- Examples show realistic input/output
- Provider differences are documented
- Cost implications are noted where relevant
</acceptance_criteria>
</feature_8>
<feature_9>
<title>Add Google-Style Docstrings to OCR and Parsing Modules</title>
<description>
Add comprehensive Google-style docstrings to OCR, markdown, hierarchy, and extraction modules
</description>
<tasks>
- mistral_client.py: Document OCR API usage, cost calculation
- ocr_processor.py: Document OCR response processing
- markdown_builder.py: Document markdown generation strategy
- hierarchy_parser.py: Document hierarchy building algorithm
- image_extractor.py: Document image extraction process
- toc_extractor*.py: Document various TOC extraction methods
- Add Examples sections for complex algorithms
- Document edge cases and error handling
- All docstrings must follow Google style format exactly
</tasks>
<acceptance_criteria>
- All OCR/parsing modules have comprehensive docstrings
- Complex algorithms are well explained
- Edge cases are documented
- Error handling is documented
- Examples demonstrate typical usage
</acceptance_criteria>
</feature_9>
<feature_10>
<title>Final Validation and CI Integration</title>
<description>
Verify all type annotations and docstrings, integrate mypy into CI/CD
</description>
<tasks>
- Run mypy --strict on entire codebase, verify 100% pass rate
- Verify all public functions have docstrings
- Check docstring formatting with pydocstyle or similar tool
- Create GitHub Actions workflow to run mypy on every commit
- Update README.md with type checking instructions
- Update CLAUDE.md with documentation standards
- Create CONTRIBUTING.md with type annotation and docstring guidelines
- Generate API documentation with Sphinx or pdoc
- Fix any remaining mypy errors or missing docstrings
</tasks>
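A minimal GitHub Actions workflow for the mypy check might look like this (file path and action versions are assumptions):

```yaml
# .github/workflows/typecheck.yml (path and versions are illustrative)
name: typecheck
on: [push, pull_request]
jobs:
  mypy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install -r requirements.txt mypy
      - run: mypy --config-file=mypy.ini .
```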
<acceptance_criteria>
- mypy --strict passes on entire codebase with zero errors
- All public functions have Google-style docstrings
- CI/CD runs mypy checks automatically
- Documentation is generated and accessible
- Contributing guidelines document type/docstring requirements
</acceptance_criteria>
</feature_10>
</implementation_steps>
<success_criteria>
<type_safety>
- 100% type coverage across all modules
- mypy --strict passes with zero errors
- No # type: ignore comments without justification
- All Dict[str, Any] replaced with TypedDict where appropriate
- Proper use of generics, protocols, and type variables
- NewType used for semantic type safety
</type_safety>
<documentation_quality>
- All modules have comprehensive module-level docstrings
- All public functions/classes have Google-style docstrings
- All docstrings include Args, Returns, Raises sections
- Complex functions include Examples sections
- Cost implications documented in Notes sections
- Error handling clearly documented
- Provider differences (Ollama vs Mistral) documented
</documentation_quality>
<code_quality>
- Code is self-documenting with clear variable names
- Inline comments explain WHY, not WHAT
- Complex algorithms are well explained
- Performance considerations documented
- Security considerations documented
</code_quality>
<developer_experience>
- IDE autocomplete works perfectly with type hints
- Type errors caught at development time, not runtime
- Documentation is easily accessible in IDE
- API examples are executable and tested
- Contributing guidelines are clear and comprehensive
</developer_experience>
<maintainability>
- Refactoring is safer with type checking
- Function signatures are self-documenting
- API contracts are explicit and enforced
- Breaking changes are caught by type checker
- New developers can understand code quickly
</maintainability>
</success_criteria>
<constraints>
<compatibility>
- Must maintain backward compatibility with existing code
- Cannot break existing Flask routes or API contracts
- Weaviate schema must remain unchanged
- Existing tests must continue to pass
</compatibility>
<gradual_migration>
- Can use per-module mypy configuration for gradual migration
- Can temporarily disable strict checks on legacy modules
- Priority modules must be completed first
- Low-priority modules can be deferred
</gradual_migration>
<standards>
- All type annotations must use Python 3.10+ syntax (built-in generics such as list[str], unions written as X | None)
- Docstrings must follow Google style exactly (not NumPy or reStructuredText)
- typing-module aliases (List, Dict, Optional) may remain in legacy code, but new annotations should prefer the built-in equivalents since the project targets Python 3.10+
- Use from __future__ import annotations if needed for forward references
</standards>
</constraints>
<testing_strategy>
<type_checking>
- Run mypy --strict on each module after adding types
- Use mypy daemon (dmypy) for faster incremental checking
- Add mypy to pre-commit hooks
- CI/CD must run mypy and fail on type errors
</type_checking>
<documentation_validation>
- Use pydocstyle to validate Google-style format
- Use sphinx-build to generate docs and catch errors
- Manual review of docstring examples
- Verify examples are executable and correct
</documentation_validation>
<integration_testing>
- Verify existing tests still pass after type additions
- Add new tests for complex typed structures
- Test mypy configuration on sample code
- Verify IDE autocomplete works correctly
</integration_testing>
</testing_strategy>
<documentation_examples>
<module_docstring>
```python
"""
PDF Pipeline V2 - Intelligent document processing with LLM enhancement.
This module orchestrates a 10-step pipeline for processing PDF documents:
1. OCR via Mistral API
2. Markdown construction with images
3. Metadata extraction via LLM
4. Table of contents (TOC) extraction
5. Section classification
6. Semantic chunking
7. Chunk cleaning and validation
8. Enrichment with concepts
9. Validation and corrections
10. Ingestion into Weaviate vector database
The pipeline supports multiple LLM providers (Ollama local, Mistral API) and
various processing modes (skip OCR, semantic chunking, OCR annotations).
Typical usage:
>>> from pathlib import Path
>>> from utils.pdf_pipeline import process_pdf
>>>
>>> result = process_pdf(
... Path("document.pdf"),
... use_llm=True,
... llm_provider="ollama",
... ingest_to_weaviate=True,
... )
>>> print(f"Processed {result['pages']} pages, {result['chunks_count']} chunks")
See Also:
mistral_client: OCR API client
llm_metadata: Metadata extraction
weaviate_ingest: Database ingestion
"""
```
</module_docstring>
<function_docstring>
```python
def process_pdf_v2(
pdf_path: Path,
output_dir: Path = Path("output"),
*,
use_llm: bool = True,
llm_provider: Literal["ollama", "mistral"] = "ollama",
llm_model: Optional[str] = None,
skip_ocr: bool = False,
ingest_to_weaviate: bool = True,
progress_callback: Optional[ProgressCallback] = None,
) -> PipelineResult:
"""
Process a PDF through the complete V2 pipeline with LLM enhancement.
This function orchestrates all 10 steps of the intelligent document processing
pipeline, from OCR to Weaviate ingestion. It supports both local (Ollama) and
cloud (Mistral API) LLM providers, with optional caching via skip_ocr.
Args:
pdf_path: Absolute path to the PDF file to process.
output_dir: Base directory for output files. Defaults to "./output".
use_llm: Enable LLM-based processing (metadata, TOC, chunking).
If False, uses basic heuristic processing.
llm_provider: LLM provider to use. "ollama" for local (free but slow),
"mistral" for API (fast but paid).
llm_model: Specific model name. If None, auto-detects based on provider
(qwen2.5:7b for ollama, mistral-small-latest for mistral).
skip_ocr: If True, reuses existing markdown file to avoid OCR cost.
Requires output_dir/<doc_name>/<doc_name>.md to exist.
ingest_to_weaviate: If True, ingests chunks into Weaviate after processing.
progress_callback: Optional callback for real-time progress updates.
Called with (step_id, status, detail) for each pipeline step.
Returns:
Dictionary containing processing results with the following keys:
- success (bool): True if processing completed without errors
- document_name (str): Name of the processed document
- pages (int): Number of pages in the PDF
- chunks_count (int): Number of chunks generated
- cost_ocr (float): OCR cost in euros (0 if skip_ocr=True)
- cost_llm (float): LLM API cost in euros (0 if provider=ollama)
- cost_total (float): Total cost (ocr + llm)
- metadata (dict): Extracted metadata (title, author, etc.)
- toc (list): Hierarchical table of contents
- files (dict): Paths to generated files (markdown, chunks, etc.)
Raises:
FileNotFoundError: If pdf_path does not exist.
ValueError: If skip_ocr=True but markdown file not found.
RuntimeError: If Weaviate connection fails during ingestion.
Examples:
Basic usage with Ollama (free):
>>> result = process_pdf_v2(
... Path("platon_menon.pdf"),
... llm_provider="ollama"
... )
>>> print(f"Cost: {result['cost_total']:.4f}€")
Cost: 0.0270€ # OCR only
With Mistral API (faster):
>>> result = process_pdf_v2(
... Path("platon_menon.pdf"),
... llm_provider="mistral",
... llm_model="mistral-small-latest"
... )
Skip OCR to avoid cost:
>>> result = process_pdf_v2(
... Path("platon_menon.pdf"),
... skip_ocr=True, # Reuses existing markdown
... ingest_to_weaviate=False
... )
Notes:
- OCR cost: ~0.003€/page (standard), ~0.009€/page (with annotations)
- LLM cost: Free with Ollama, variable with Mistral API
- Processing time: ~30s/page with Ollama, ~5s/page with Mistral
- Weaviate must be running (docker-compose up -d) before ingestion
"""
```
</function_docstring>
</documentation_examples>
</project_specification>


@@ -1,290 +0,0 @@
## YOUR ROLE - CODING AGENT (Library RAG - Type Safety & Documentation)
You are working on adding strict type annotations and Google-style docstrings to a Python library project.
This is a FRESH context window - you have no memory of previous sessions.
You have access to Linear for project management via MCP tools. Linear is your single source of truth.
### STEP 1: GET YOUR BEARINGS (MANDATORY)
Start by orienting yourself:
```bash
# 1. See your working directory
pwd
# 2. List files to understand project structure
ls -la
# 3. Read the project specification
cat app_spec.txt
# 4. Read the Linear project state
cat .linear_project.json
# 5. Check recent git history
git log --oneline -20
```
### STEP 2: CHECK LINEAR STATUS
Query Linear to understand current project state using the project_id from `.linear_project.json`.
1. **Get all issues and count progress:**
```
mcp__linear__list_issues with project_id
```
Count:
- Issues "Done" = completed
- Issues "Todo" = remaining
- Issues "In Progress" = currently being worked on
2. **Find META issue** (if exists) for session context
3. **Check for in-progress work** - complete it first if found
### STEP 3: SELECT NEXT ISSUE
Get Todo issues sorted by priority:
```
mcp__linear__list_issues with project_id, status="Todo", limit=5
```
Select ONE highest-priority issue to work on.
### STEP 4: CLAIM THE ISSUE
Use `mcp__linear__update_issue` to set status to "In Progress"
### STEP 5: IMPLEMENT THE ISSUE
Based on issue category:
**For Type Annotation Issues (e.g., "Types - Add type annotations to X.py"):**
1. Read the target Python file
2. Identify all functions, methods, and variables
3. Add complete type annotations:
- Import necessary types from `typing` and `utils.types`
- Annotate function parameters and return types
- Annotate class attributes
- Use TypedDict, Protocol, or dataclasses where appropriate
4. Save the file
5. Run mypy to verify (MANDATORY):
```bash
cd generations/library_rag
mypy --config-file=mypy.ini <file_path>
```
6. Fix any mypy errors
7. Commit the changes
**For Documentation Issues (e.g., "Docs - Add docstrings to X.py"):**
1. Read the target Python file
2. Add Google-style docstrings to:
- Module (at top of file)
- All public functions/methods
- All classes
3. Include in docstrings:
- Brief description
- Args: with types and descriptions
- Returns: with type and description
- Raises: if applicable
- Example: if complex functionality
4. Save the file
5. Optionally run pydocstyle to verify (if installed)
6. Commit the changes
**For Setup/Infrastructure Issues:**
Follow the specific instructions in the issue description.
### STEP 6: VERIFICATION
**Type Annotation Issues:**
- Run mypy on the modified file(s)
- Ensure zero type errors
- If errors exist, fix them before proceeding
**Documentation Issues:**
- Review docstrings for completeness
- Ensure Args/Returns sections match function signatures
- Check that examples are accurate
**Functional Changes (rare):**
- If the issue changes behavior, test manually
- Start Flask server if needed: `python flask_app.py`
- Test the affected functionality
### STEP 7: GIT COMMIT
Make a descriptive commit:
```bash
git add <files>
git commit -m "<Issue ID>: <Short description>
- <List of changes>
- Verified with mypy (for type issues)
- Linear issue: <issue identifier>
"
```
### STEP 8: UPDATE LINEAR ISSUE
1. **Add implementation comment:**
```markdown
## Implementation Complete
### Changes Made
- [List of files modified]
- [Key changes]
### Verification
- mypy passes with zero errors (for type issues)
- All test steps from issue description verified
### Git Commit
[commit hash and message]
```
2. **Update status to "Done"** using `mcp__linear__update_issue`
### STEP 9: DECIDE NEXT ACTION
After completing an issue, ask yourself:
1. Have I been working for a while? (Use judgment based on complexity of work done)
2. Is the code in a stable state?
3. Would this be a good handoff point?
**If YES to all three:**
- Proceed to STEP 10 (Session Summary)
- End cleanly
**If NO:**
- Continue to another issue (go back to STEP 3)
- But commit first!
**Pacing Guidelines:**
- Early phase (< 20% done): Can complete multiple simple issues
- Mid/late phase (> 20% done): 1-2 issues per session for quality
### STEP 10: SESSION SUMMARY (When Ending)
If META issue exists, add a comment:
```markdown
## Session Complete
### Completed This Session
- [Issue ID]: [Title] - [Brief summary]
### Current Progress
- X issues Done
- Y issues In Progress
- Z issues Todo
### Notes for Next Session
- [Important context]
- [Recommendations]
- [Any concerns]
```
Ensure:
- All code committed
- No uncommitted changes
- App in working state
---
## LINEAR WORKFLOW RULES
**Status Transitions:**
- Todo → In Progress (when starting)
- In Progress → Done (when verified)
**NEVER:**
- Delete or modify issue descriptions
- Mark Done without verification
- Leave issues In Progress when switching
---
## TYPE ANNOTATION GUIDELINES
**Imports needed:**
```python
from typing import Optional, Dict, List, Any, Tuple, Callable
from pathlib import Path
from utils.types import <ProjectSpecificTypes>
```
**Common patterns:**
```python
# Functions
def process_data(data: str, options: Optional[Dict[str, Any]] = None) -> List[str]:
"""Process input data."""
...
# Methods with self
def save(self, path: Path) -> None:
"""Save to file."""
...
# Async functions
async def fetch_data(url: str) -> Dict[str, Any]:
"""Fetch from API."""
...
```
**Use project types from `utils/types.py`:**
- Metadata, OCRResponse, TOCEntry, ChunkData, PipelineResult, etc.
---
## DOCSTRING TEMPLATE (Google Style)
```python
def function_name(param1: str, param2: int = 0) -> List[str]:
"""
Brief one-line description.
More detailed description if needed. Explain what the function does,
any important behavior, side effects, etc.
Args:
param1: Description of param1.
param2: Description of param2. Defaults to 0.
Returns:
Description of return value.
Raises:
ValueError: When param1 is empty.
IOError: When file cannot be read.
Example:
>>> result = function_name("test", 5)
>>> print(result)
['test', 'test', 'test', 'test', 'test']
"""
```
---
## IMPORTANT REMINDERS
**Your Goal:** Add strict type annotations and comprehensive documentation to all Python modules
**This Session's Goal:** Complete 1-2 issues with quality work and clean handoff
**Quality Bar:**
- mypy --strict passes with zero errors
- All public functions have complete Google-style docstrings
- Code is clean and well-documented
**Context is finite.** End sessions early with good handoff notes. The next agent will continue.
---
Begin by running STEP 1 (Get Your Bearings).