Add specification for Markdown support in Library RAG

New feature specification to add native Markdown (.md) file support:
- Skip OCR for .md files (0€ cost vs ~0.003€/page for PDF)
- Process Markdown directly through LLM pipeline
- Maintain full compatibility with existing PDF workflow
- Includes 10 features, 5 implementation steps, comprehensive tests

This will enable users to upload pre-digitized philosophical texts
in Markdown format without incurring OCR costs while still benefiting
from LLM-based metadata extraction, TOC generation, semantic chunking,
and Weaviate vectorization.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-25 12:46:07 +01:00
parent e13e0fa261
commit bf790b63a0


@@ -0,0 +1,490 @@
<project_specification>
<project_name>Library RAG - Native Markdown Support</project_name>
<overview>
Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
chunking, and Weaviate vectorization.
This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
for users who have philosophical texts in Markdown format.
</overview>
<technology_stack>
<backend>
<framework>Flask 3.0</framework>
<pipeline>utils/pdf_pipeline.py (to be extended)</pipeline>
<validation>Werkzeug secure_filename</validation>
<llm>Ollama (local) or Mistral API</llm>
<vectorization>Weaviate with BAAI/bge-m3</vectorization>
</backend>
<type_safety>
<type_checker>mypy strict mode</type_checker>
<docstrings>Google-style docstrings required</docstrings>
</type_safety>
</technology_stack>
<core_features>
<feature_1>
<title>Update Flask File Validation</title>
<description>
Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
configuration and file validation logic to support .md files while maintaining backward compatibility
with existing PDF workflows.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
</files_to_modify>
<implementation_details>
- Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
- Update allowed_file() function to accept both extensions
- Update upload.html template to accept .md files in file input
- Update error messages to reflect both formats
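
Illustrative sketch of the extension check described above. The names ALLOWED_EXTENSIONS and allowed_file come from this spec; the body is an assumption about how the check could be written, not the current code in flask_app.py.

```python
from pathlib import PurePath

# Assumed to mirror the constant named in this spec (flask_app.py).
ALLOWED_EXTENSIONS = {"pdf", "md"}


def allowed_file(filename: str) -> bool:
    """Return True if the filename has an allowed extension (case-insensitive)."""
    suffix = PurePath(filename).suffix.lower()  # ".md", ".pdf", or "" if none
    return suffix.lstrip(".") in ALLOWED_EXTENSIONS
```

Comparing the lowercased suffix means uploads such as "Book.PDF" are accepted, while a file simply named "md" (no extension) is rejected.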
</implementation_details>
<test_steps>
1. Start Flask app
2. Navigate to /upload
3. Attempt to upload a .md file
4. Verify file is accepted (no "Format non supporté" error)
5. Verify PDF upload still works
</test_steps>
</feature_1>
<feature_2>
<title>Add Markdown Detection in Pipeline</title>
<description>
Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
Add logic to automatically skip OCR processing for .md files and copy the Markdown content
directly to the output directory.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
</files_to_modify>
<implementation_details>
- Add file extension detection: `file_ext = pdf_path.suffix.lower()`
- If file_ext == ".md":
  - Skip OCR step entirely (no Mistral API call)
  - Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
  - Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
  - Set nb_pages to the number of lines starting with '# ' (H1 headers), or 1 if there are none; a plain md_content.count('\n# ') would miss an H1 on the first line
  - Set cost_ocr = 0.0
  - Emit progress: "markdown_load" instead of "ocr"
- If file_ext == ".pdf":
  - Continue with existing OCR workflow
- Both paths converge at LLM processing (metadata, TOC, chunking)
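
Illustrative sketch of the Markdown branch, assuming it is factored into helpers. The names estimate_pages and load_markdown and the returned keys are hypothetical; the real process_pdf_v2 may organize this differently.

```python
from pathlib import Path


def estimate_pages(md_content: str) -> int:
    """Estimate a page count from H1 headers, never returning less than 1."""
    h1_count = sum(1 for line in md_content.splitlines() if line.startswith("# "))
    return max(h1_count, 1)


def load_markdown(source: Path, output_dir: Path) -> dict[str, object]:
    """Copy a Markdown file into output_dir and return OCR-free bookkeeping."""
    if source.suffix.lower() != ".md":
        raise ValueError(f"Not a Markdown file: {source.name}")
    md_content = source.read_text(encoding="utf-8")
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / source.name
    md_path.write_text(md_content, encoding="utf-8")
    # No OCR call is made on this path, so the OCR cost is zero.
    return {"md_path": md_path, "nb_pages": estimate_pages(md_content), "cost_ocr": 0.0}
```

Counting lines that start with "# " also catches an H1 on the very first line of the file.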
</implementation_details>
<test_steps>
1. Create test Markdown file with philosophical content
2. Call process_pdf(Path("test.md"), use_llm=True)
3. Verify OCR is skipped (cost_ocr = 0.0)
4. Verify output/test/test.md is created
5. Verify no _ocr.json file is created
6. Verify LLM processing runs normally
</test_steps>
</feature_2>
<feature_3>
<title>Markdown-Specific Progress Callback</title>
<description>
Update the progress callback system to emit appropriate events for Markdown file processing.
Instead of "OCR Mistral en cours...", display "Chargement du fichier Markdown..." to provide accurate
user feedback during Server-Sent Events streaming.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (emit_progress calls)
- flask_app.py (process_file_background function)
</files_to_modify>
<implementation_details>
- Add conditional progress messages based on file type
- For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
- For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
- Update frontend to handle "markdown_load" event type
- Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
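
Illustrative sketch of selecting the first progress event by file type. The step names and messages are the ones quoted in this spec; FIRST_STEP and first_progress_event are hypothetical helpers, and the resulting triple is assumed to match the (step, status, message) arguments of emit_progress.

```python
from pathlib import Path

# Step identifiers and French UI messages quoted from this spec.
FIRST_STEP = {
    ".md": ("markdown_load", "Chargement du fichier Markdown..."),
    ".pdf": ("ocr", "OCR Mistral en cours..."),
}


def first_progress_event(file_path: Path) -> tuple[str, str, str]:
    """Return the (step, status, message) triple for the first pipeline step."""
    step, message = FIRST_STEP[file_path.suffix.lower()]
    return (step, "running", message)
```

An unsupported extension raises KeyError here, which is acceptable only if file validation has already rejected other formats.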
</implementation_details>
<test_steps>
1. Upload Markdown file via Flask interface
2. Monitor SSE progress stream at /upload/progress/&lt;job_id&gt;
3. Verify first step shows "Chargement du fichier Markdown..."
4. Verify no OCR-related messages appear
5. Verify subsequent steps (metadata, TOC, etc.) work normally
</test_steps>
</feature_3>
<feature_4>
<title>Update process_pdf_bytes for Markdown</title>
<description>
Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
This function currently creates a temporary PDF file, but for Markdown uploads,
it should create a temporary .md file instead.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
</files_to_modify>
<implementation_details>
- Detect file type from filename parameter
- If filename ends with .md:
  - Create temp file with suffix=".md"
  - Write file_bytes in binary mode (the bytes are expected to be UTF-8 text; the encoding is checked during validation)
- If filename ends with .pdf:
  - Existing behavior (suffix=".pdf", binary write)
- Pass temp file path to process_pdf() which now handles both types
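
Illustrative sketch of the temp-file step, assuming it is isolated in a helper. The name write_temp_upload is hypothetical; the actual process_pdf_bytes signature is not shown in this spec.

```python
import tempfile
from pathlib import Path


def write_temp_upload(file_bytes: bytes, filename: str) -> Path:
    """Write uploaded bytes to a temp file whose suffix matches the upload."""
    suffix = Path(filename).suffix.lower()
    if suffix not in {".pdf", ".md"}:
        raise ValueError(f"Unsupported format: {suffix or 'none'}")
    # delete=False so the pipeline can reopen the path; the caller removes it.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_bytes)  # a binary write works for PDF and UTF-8 text alike
        return Path(tmp.name)
```

The caller is expected to delete the returned path once processing finishes, for example in a finally block.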
</implementation_details>
<test_steps>
1. Create Flask test client
2. POST multipart form with .md file to /upload
3. Verify process_pdf_bytes creates .md temp file
4. Verify temp file contains correct Markdown content
5. Verify cleanup deletes temp file after processing
</test_steps>
</feature_4>
<feature_5>
<title>Add Markdown File Validation</title>
<description>
Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
and basic Markdown structure. Reject files that are too large, contain binary data,
or have no meaningful content.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_create>
- utils/markdown_validator.py
</files_to_create>
<implementation_details>
- Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
- Checks:
  - File size &lt; 10 MB
  - Valid UTF-8 encoding
  - Contains at least one header (#, ##, etc.); if none, report a warning rather than an error
  - Not empty (at least 100 characters)
  - No null bytes or excessive binary content
- Return dict with success, error, and warnings keys
- Call from process_pdf_v2 before processing
- Type annotations and Google-style docstrings required
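
Illustrative sketch of validate_markdown_file covering the checks listed above. The error strings follow the test steps of this feature; the constants, the header regex, and the check order are assumptions.

```python
import re
from pathlib import Path
from typing import Any

MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB limit from this spec
MIN_CHARS = 100
HEADER_RE = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def validate_markdown_file(file_path: Path) -> dict[str, Any]:
    """Validate a Markdown file; return success, error, and warnings keys."""
    result: dict[str, Any] = {"success": False, "error": None, "warnings": []}
    if file_path.stat().st_size >= MAX_SIZE_BYTES:
        result["error"] = "File too large"
        return result
    raw = file_path.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        result["error"] = "Invalid UTF-8"
        return result
    if "\x00" in text:  # null bytes are valid UTF-8, so check them separately
        result["error"] = "Binary content detected"
        return result
    if MIN_CHARS > len(text.strip()):
        result["error"] = "File too short"
        return result
    if HEADER_RE.search(text) is None:
        result["warnings"].append("No Markdown header found")
    result["success"] = True
    return result
```

Checking the size from file metadata before reading avoids loading an oversized upload into memory.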
</implementation_details>
<test_steps>
1. Test with valid Markdown file → passes validation
2. Test with empty file → fails with "File too short"
3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
4. Test with very large file (&gt;10MB) → fails with "File too large"
5. Test with plain text no headers → warning but continues
</test_steps>
</feature_5>
<feature_6>
<title>Update Documentation</title>
<description>
Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
</description>
<priority>3</priority>
<category>documentation</category>
<files_to_modify>
- README.md (add section under "Pipeline de Traitement")
- .claude/CLAUDE.md (update development guidelines)
- templates/upload.html (add help text)
</files_to_modify>
<implementation_details>
- README.md:
  - Add "Support Markdown Natif" section
  - Document accepted formats: PDF, MD
  - Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
  - Add example: process_pdf(Path("document.md"))
- CLAUDE.md:
  - Update "Pipeline de Traitement" section
  - Note conditional OCR step
  - Document markdown_validator.py module
- upload.html:
  - Update file input accept attribute: accept=".pdf,.md"
  - Add help text: "Formats acceptés : PDF, Markdown (.md)"
</implementation_details>
<test_steps>
1. Read README.md markdown support section
2. Verify examples are clear and accurate
3. Check CLAUDE.md developer notes
4. Open /upload in browser
5. Verify help text displays correctly
</test_steps>
</feature_6>
<feature_7>
<title>Add Unit Tests for Markdown Processing</title>
<description>
Create comprehensive unit tests for Markdown file handling to ensure reliability
and prevent regressions. Cover file validation, pipeline processing, and edge cases.
</description>
<priority>2</priority>
<category>testing</category>
<files_to_create>
- tests/utils/test_markdown_validator.py
- tests/utils/test_pdf_pipeline_markdown.py
- tests/fixtures/sample.md
</files_to_create>
<implementation_details>
- test_markdown_validator.py:
  - Test valid Markdown acceptance
  - Test invalid encoding rejection
  - Test file size limits
  - Test empty file rejection
  - Test binary data detection
- test_pdf_pipeline_markdown.py:
  - Test Markdown file processing end-to-end
  - Test OCR skip for .md files
  - Test cost_ocr = 0.0
  - Test LLM processing (metadata, TOC, chunking)
  - Mock Weaviate ingestion
  - Verify output files created correctly
- fixtures/sample.md:
  - Create realistic philosophical text in Markdown
  - Include headers, paragraphs, formatting
  - ~1000 words for realistic testing
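
Illustrative sketch of the "OCR skip" test pattern using a mock. The run_pipeline function is a small stand-in written for this example only; the real tests would call the pipeline in utils/pdf_pipeline.py and patch its OCR call instead.

```python
from pathlib import Path
from unittest.mock import Mock


def run_pipeline(path: Path, ocr: Mock) -> dict[str, float]:
    """Stand-in pipeline: calls OCR only for non-Markdown input."""
    if path.suffix.lower() == ".md":
        return {"cost_ocr": 0.0}
    return {"cost_ocr": float(ocr(path))}


def test_markdown_skips_ocr() -> None:
    ocr = Mock(return_value=0.003)
    result = run_pipeline(Path("sample.md"), ocr)
    ocr.assert_not_called()  # the OCR backend must never be reached
    assert result["cost_ocr"] == 0.0


def test_pdf_still_calls_ocr() -> None:
    ocr = Mock(return_value=0.003)
    result = run_pipeline(Path("sample.pdf"), ocr)
    ocr.assert_called_once()  # the PDF path is unchanged
    assert result["cost_ocr"] == 0.003
```

Asserting on the mock, not only on cost_ocr, guards against an implementation that calls OCR and then discards the result.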
</implementation_details>
<test_steps>
1. Run: pytest tests/utils/test_markdown_validator.py -v
2. Verify all validation tests pass
3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
4. Verify end-to-end Markdown processing works
5. Check test coverage: pytest --cov=utils --cov-report=html
</test_steps>
</feature_7>
<feature_8>
<title>Type Safety and Documentation</title>
<description>
Ensure all new code follows strict type safety requirements and includes comprehensive
Google-style docstrings. Run mypy checks and update type definitions as needed.
</description>
<priority>2</priority>
<category>type_safety</category>
<files_to_modify>
- utils/types.py (add Markdown-specific types if needed)
- All modified modules (type annotations)
</files_to_modify>
<implementation_details>
- Add type annotations to all new functions
- Update existing functions that handle both PDF and MD
- Consider adding:
  - FileFormat = Literal["pdf", "md"]
  - MarkdownValidationResult = TypedDict(...)
- Run mypy --strict on all modified files
- Add Google-style docstrings with:
  - Args section documenting all parameters
  - Returns section with structure details
  - Raises section for exceptions
  - Examples section for complex functions
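
Illustrative sketch of the suggested type definitions. FileFormat and MarkdownValidationResult are named in this spec; the field list follows the validator's success, error, and warnings keys, and detect_format is a hypothetical helper.

```python
from typing import Literal, Optional, TypedDict

FileFormat = Literal["pdf", "md"]


class MarkdownValidationResult(TypedDict):
    """Result of validating a Markdown file (keys follow this spec)."""

    success: bool
    error: Optional[str]
    warnings: list[str]


def detect_format(filename: str) -> FileFormat:
    """Map a filename to a FileFormat.

    Raises:
        ValueError: If the extension is neither .pdf nor .md.
    """
    lowered = filename.lower()
    if lowered.endswith(".md"):
        return "md"
    if lowered.endswith(".pdf"):
        return "pdf"
    raise ValueError(f"Unsupported file format: {filename}")
```

Returning a Literal type lets mypy flag any branch that handles a format other than "pdf" or "md".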
</implementation_details>
<test_steps>
1. Run: mypy utils/pdf_pipeline.py --strict
2. Run: mypy utils/markdown_validator.py --strict
3. Verify no type errors
4. Run: pydocstyle utils/markdown_validator.py --convention=google
5. Verify all docstrings follow Google style
</test_steps>
</feature_8>
<feature_9>
<title>Handle Markdown-Specific Edge Cases</title>
<description>
Address edge cases specific to Markdown processing: front matter (YAML/TOML),
embedded code blocks, special characters, and non-standard Markdown extensions.
</description>
<priority>3</priority>
<category>backend</category>
<files_to_modify>
- utils/markdown_validator.py
- utils/llm_metadata.py (handle front matter)
</files_to_modify>
<implementation_details>
- Front matter handling:
  - Detect front matter delimiters (--- for YAML, +++ for TOML)
  - Extract metadata if present (title, author, date)
  - Pass to LLM or use directly if valid
  - Strip front matter before content processing
- Code block handling:
  - Do not treat fenced code blocks as philosophical content
  - Keep them in the chunks, but exclude them from LLM analysis
- Special characters:
  - Handle Unicode properly (Greek, Latin, French accents)
  - Preserve LaTeX equations in $ or $$
- GitHub Flavored Markdown:
  - Support tables, task lists, strikethrough
  - Convert to standard format if needed
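
Illustrative sketch of front matter detection and stripping. The function split_front_matter is hypothetical; it only separates the raw block from the body and leaves YAML or TOML parsing to a later step.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Split leading front matter from the body.

    Returns:
        A (front_matter, body) pair; front_matter is "" when none is found.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() not in {"---", "+++"}:
        return "", text
    delimiter = lines[0].strip()
    for index in range(1, len(lines)):
        if lines[index].strip() == delimiter:
            front = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return front, body
    return "", text  # unterminated block: treat everything as body
```

Requiring the closing delimiter to match the opening one keeps a document that merely starts with a horizontal rule (---) from losing its content.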
</implementation_details>
<test_steps>
1. Upload Markdown with YAML front matter
2. Verify metadata extracted correctly
3. Upload Markdown with code blocks
4. Verify code not treated as philosophical content
5. Upload Markdown with Greek/Latin text
6. Verify Unicode handled correctly
</test_steps>
</feature_9>
<feature_10>
<title>Update UI/UX for Markdown Upload</title>
<description>
Enhance the upload interface to clearly communicate Markdown support and provide
visual feedback about the file type being processed. Show format-specific information
(e.g., "No OCR cost for Markdown files").
</description>
<priority>3</priority>
<category>frontend</category>
<files_to_modify>
- templates/upload.html
- templates/upload_progress.html
</files_to_modify>
<implementation_details>
- upload.html:
  - Add file type indicator icon (📄 PDF vs 📝 MD)
  - Show format-specific help text on hover
  - Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
  - Add example Markdown file download link
- upload_progress.html:
  - Show different icon for Markdown processing
  - Adjust progress bar (9 steps vs 10 steps)
  - Display "No OCR cost" badge for Markdown
  - Update step descriptions based on file type
</implementation_details>
<test_steps>
1. Open /upload page
2. Verify help text mentions both PDF and MD
3. Select a .md file
4. Verify file type indicator shows 📝
5. Submit upload
6. Verify progress shows "Chargement du fichier Markdown..."
7. Verify "No OCR cost" badge displays
</test_steps>
</feature_10>
</core_features>
<implementation_steps>
<step number="1">
<title>Setup and Configuration</title>
<tasks>
- Update ALLOWED_EXTENSIONS in flask_app.py
- Modify allowed_file() validation function
- Update upload.html file input accept attribute
- Add Markdown MIME type handling
</tasks>
</step>
<step number="2">
<title>Core Pipeline Extension</title>
<tasks>
- Add file extension detection in process_pdf_v2()
- Implement Markdown file reading logic
- Skip OCR for .md files
- Add conditional progress callbacks
- Update process_pdf_bytes() for Markdown
</tasks>
</step>
<step number="3">
<title>Validation and Error Handling</title>
<tasks>
- Create markdown_validator.py module
- Implement UTF-8 encoding validation
- Add file size limits
- Handle front matter extraction
- Add comprehensive error messages
</tasks>
</step>
<step number="4">
<title>Testing Infrastructure</title>
<tasks>
- Create test fixtures (sample.md)
- Write validation tests
- Write pipeline integration tests
- Add edge case tests
- Verify mypy strict compliance
</tasks>
</step>
<step number="5">
<title>Documentation and Polish</title>
<tasks>
- Update README.md with Markdown support
- Update .claude/CLAUDE.md developer docs
- Add Google-style docstrings
- Update UI templates with new messaging
- Create usage examples
</tasks>
</step>
</implementation_steps>
<success_criteria>
<functionality>
- Markdown files upload successfully via Flask
- OCR is skipped for .md files (cost_ocr = 0.0)
- LLM processing works identically for PDF and MD
- Chunks are created and vectorized correctly
- Both file types can be searched in Weaviate
- Existing PDF workflow remains unchanged
</functionality>
<type_safety>
- All code passes mypy --strict
- All functions have type annotations
- Google-style docstrings on all modules
- No Any types without justification
- TypedDict definitions for new data structures
</type_safety>
<testing>
- Unit tests cover Markdown validation
- Integration tests verify end-to-end processing
- Edge cases handled (front matter, Unicode, large files)
- Test coverage &gt;80% for new code
- All tests pass in CI/CD pipeline
</testing>
<user_experience>
- Upload interface clearly shows both formats supported
- Progress feedback accurate for both PDF and MD
- Cost savings clearly communicated ("0€ for Markdown")
- Error messages helpful and specific
- Documentation clear with examples
</user_experience>
<performance>
- Markdown processing faster than PDF (no OCR)
- No regression in PDF processing speed
- Memory usage reasonable for large MD files
- Validation completes in &lt;100ms
- Overall pipeline &lt;30s for typical Markdown document
</performance>
</success_criteria>
<technical_notes>
<cost_comparison>
- PDF processing: OCR ~0.003€/page + LLM variable
- Markdown processing: 0€ OCR + LLM variable
- Estimated savings: 50-70% of total processing cost for documents with a Markdown source (rough estimate; the actual share depends on LLM costs)
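
Illustrative sketch of the OCR part of this comparison, using the approximate per-page price quoted above. The function ocr_cost is hypothetical, and LLM costs are deliberately left out because they vary by provider.

```python
OCR_COST_PER_PAGE_EUR = 0.003  # approximate figure quoted in this spec


def ocr_cost(nb_pages: int, file_format: str) -> float:
    """Return the estimated OCR cost in euros; Markdown skips OCR entirely."""
    if file_format == "md":
        return 0.0
    return round(nb_pages * OCR_COST_PER_PAGE_EUR, 4)
```

For a 300-page PDF this gives about 0.90€ of OCR cost, against 0€ for the same text supplied as Markdown.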
</cost_comparison>
<compatibility>
- Maintains backward compatibility with existing PDFs
- No breaking changes to API or database schema
- Existing chunks and documents unaffected
- Can process both formats in same session
</compatibility>
<future_enhancements>
- Support for .txt plain text files
- Support for .docx Word documents (via pandoc)
- Support for .epub ebooks
- Batch upload of multiple Markdown files
- Markdown to PDF export for archival
</future_enhancements>
</technical_notes>
</project_specification>