Add specification for Markdown support in Library RAG
New feature specification to add native Markdown (.md) file support:
- Skip OCR for .md files (0€ cost vs ~0.003€/page for PDF)
- Process Markdown directly through LLM pipeline
- Maintain full compatibility with existing PDF workflow
- Includes 10 features, 5 implementation steps, comprehensive tests

This will enable users to upload pre-digitized philosophical texts in Markdown format without incurring OCR costs while still benefiting from LLM-based metadata extraction, TOC generation, semantic chunking, and Weaviate vectorization.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
490
prompts/app_spec_markdown_support.txt
Normal file
@@ -0,0 +1,490 @@
<project_specification>
<project_name>Library RAG - Native Markdown Support</project_name>

<overview>
Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
chunking, and Weaviate vectorization.

This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
for users who have philosophical texts in Markdown format.
</overview>

<technology_stack>
<backend>
<framework>Flask 3.0</framework>
<pipeline>utils/pdf_pipeline.py (to be extended)</pipeline>
<validation>Werkzeug secure_filename</validation>
<llm>Ollama (local) or Mistral API</llm>
<vectorization>Weaviate with BAAI/bge-m3</vectorization>
</backend>
<type_safety>
<type_checker>mypy strict mode</type_checker>
<docstrings>Google-style docstrings required</docstrings>
</type_safety>
</technology_stack>

<core_features>
<feature_1>
<title>Update Flask File Validation</title>
<description>
Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
configuration and file validation logic to support .md files while maintaining backward compatibility
with existing PDF workflows.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
</files_to_modify>
<implementation_details>
- Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
- Update allowed_file() function to accept both extensions
- Update upload.html template to accept .md files in file input
- Update error messages to reflect both formats
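The extension check could look roughly like the sketch below. It assumes allowed_file() takes the uploaded filename and returns a bool; the actual code in flask_app.py may differ.

```python
# Sketch only: the existing flask_app.py may structure this differently.
ALLOWED_EXTENSIONS: set[str] = {"pdf", "md"}


def allowed_file(filename: str) -> bool:
    """Return True when the filename carries an accepted extension."""
    if "." not in filename:
        return False
    # Compare case-insensitively so "Book.PDF" and "essay.MD" are accepted.
    extension = filename.rsplit(".", 1)[1].lower()
    return extension in ALLOWED_EXTENSIONS
```

Lower-casing the extension keeps files such as "Book.PDF" accepted, which preserves the existing PDF behaviour.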
</implementation_details>
<test_steps>
1. Start Flask app
2. Navigate to /upload
3. Attempt to upload a .md file
4. Verify file is accepted (no "Format non supporté" error)
5. Verify PDF upload still works
</test_steps>
</feature_1>

<feature_2>
<title>Add Markdown Detection in Pipeline</title>
<description>
Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
Add logic to automatically skip OCR processing for .md files and copy the Markdown content
directly to the output directory.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
</files_to_modify>
<implementation_details>
- Add file extension detection: `file_ext = pdf_path.suffix.lower()`
- If file_ext == ".md":
  - Skip OCR step entirely (no Mistral API call)
  - Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
  - Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
  - Set nb_pages to the number of lines starting with "# ", or 1 if there are none (estimate from H1 headers; a plain `md_content.count('\n# ')` would miss an H1 on the first line)
  - Set cost_ocr = 0.0
  - Emit progress: "markdown_load" instead of "ocr"
- If file_ext == ".pdf":
  - Continue with existing OCR workflow
- Both paths converge at LLM processing (metadata, TOC, chunking)
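The Markdown branch above can be sketched as a small helper. The names needs_ocr and load_markdown_source are hypothetical and only illustrate the steps; the real logic would live inside process_pdf_v2.

```python
import re
from pathlib import Path


def needs_ocr(source_path: Path) -> bool:
    """Return False for Markdown sources, which skip the OCR step."""
    return source_path.suffix.lower() != ".md"


def load_markdown_source(source_path: Path, output_dir: Path) -> tuple[Path, int, float]:
    """Copy a Markdown source into output_dir.

    Returns:
        A (md_path, nb_pages, cost_ocr) tuple.
    """
    md_content = source_path.read_text(encoding="utf-8")
    output_dir.mkdir(parents=True, exist_ok=True)
    md_path = output_dir / source_path.name
    md_path.write_text(md_content, encoding="utf-8")
    # Estimate "pages" from H1 headings; fall back to 1 when there are none.
    nb_pages = len(re.findall(r"^# ", md_content, flags=re.MULTILINE)) or 1
    cost_ocr = 0.0  # no Mistral OCR call is made for Markdown
    return md_path, nb_pages, cost_ocr
```

Counting with a multiline regex also picks up an H1 on the very first line of the file.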
</implementation_details>
<test_steps>
1. Create test Markdown file with philosophical content
2. Call process_pdf(Path("test.md"), use_llm=True)
3. Verify OCR is skipped (cost_ocr = 0.0)
4. Verify output/test/test.md is created
5. Verify no _ocr.json file is created
6. Verify LLM processing runs normally
</test_steps>
</feature_2>

<feature_3>
<title>Markdown-Specific Progress Callback</title>
<description>
Update the progress callback system to emit appropriate events for Markdown file processing.
Instead of "OCR Mistral en cours...", display "Chargement du fichier Markdown..." to provide accurate
user feedback during Server-Sent Events streaming.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (emit_progress calls)
- flask_app.py (process_file_background function)
</files_to_modify>
<implementation_details>
- Add conditional progress messages based on file type
- For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
- For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
- Update frontend to handle "markdown_load" event type
- Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
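A minimal sketch of the conditional progress message, assuming emit_progress is a callable taking (step, status, message) as in the calls above. The helper name emit_extraction_start is hypothetical.

```python
from typing import Callable

# Assumed callback shape: (step, status, message).
ProgressCallback = Callable[[str, str, str], None]


def emit_extraction_start(file_ext: str, emit_progress: ProgressCallback) -> None:
    """Emit the first pipeline event, which depends on the source format."""
    if file_ext == ".md":
        emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
    else:
        emit_progress("ocr", "running", "OCR Mistral en cours...")
```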
</implementation_details>
<test_steps>
1. Upload Markdown file via Flask interface
2. Monitor SSE progress stream at /upload/progress/<job_id>
3. Verify first step shows "Chargement du fichier Markdown..."
4. Verify no OCR-related messages appear
5. Verify subsequent steps (metadata, TOC, etc.) work normally
</test_steps>
</feature_3>

<feature_4>
<title>Update process_pdf_bytes for Markdown</title>
<description>
Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
This function currently creates a temporary PDF file, but for Markdown uploads,
it should create a temporary .md file instead.
</description>
<priority>1</priority>
<category>backend</category>
<files_to_modify>
- utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
</files_to_modify>
<implementation_details>
- Detect file type from filename parameter
- If filename ends with .md:
  - Create temp file with suffix=".md"
  - Write file_bytes as UTF-8 text
- If filename ends with .pdf:
  - Existing behavior (suffix=".pdf", binary write)
- Pass temp file path to process_pdf() which now handles both types
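The temp-file step could be sketched as follows. The helper name write_upload_to_temp is hypothetical, and the caller is assumed to delete the file after processing, as the cleanup test step expects.

```python
import tempfile
from pathlib import Path


def write_upload_to_temp(file_bytes: bytes, filename: str) -> Path:
    """Write uploaded bytes to a temp file whose suffix matches the upload."""
    suffix = Path(filename).suffix.lower()
    if suffix not in {".pdf", ".md"}:
        raise ValueError(f"Unsupported file type: {filename}")
    if suffix == ".md":
        # Decode first so invalid UTF-8 fails before any file is created.
        text = file_bytes.decode("utf-8")
        with tempfile.NamedTemporaryFile(
            "w", suffix=".md", delete=False, encoding="utf-8", newline=""
        ) as md_tmp:
            md_tmp.write(text)
            return Path(md_tmp.name)
    # Existing behaviour: PDFs are written as raw bytes.
    with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as pdf_tmp:
        pdf_tmp.write(file_bytes)
        return Path(pdf_tmp.name)
```

Because delete=False is used, the file survives the with block and must be removed explicitly once the pipeline has finished.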
</implementation_details>
<test_steps>
1. Create Flask test client
2. POST multipart form with .md file to /upload
3. Verify process_pdf_bytes creates .md temp file
4. Verify temp file contains correct Markdown content
5. Verify cleanup deletes temp file after processing
</test_steps>
</feature_4>

<feature_5>
<title>Add Markdown File Validation</title>
<description>
Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
and basic Markdown structure. Reject files that are too large, contain binary data,
or have no meaningful content.
</description>
<priority>2</priority>
<category>backend</category>
<files_to_create>
- utils/markdown_validator.py
</files_to_create>
<implementation_details>
- Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
- Checks:
  - File size < 10 MB
  - Valid UTF-8 encoding
  - Contains at least one header (#, ##, etc.); a missing header produces a warning, not an error
  - Not empty (at least 100 characters)
  - No null bytes or excessive binary content
- Return dict with success, error, and warnings keys
- Call from process_pdf_v2 before processing
- Type annotations and Google-style docstrings required
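A minimal sketch of the validator under the checks listed above. It returns a TypedDict rather than dict[str, Any], which is an assumption made here to fit the mypy strict requirement; the error strings follow the test steps for this feature, except "Binary content detected", which is a placeholder.

```python
import re
from pathlib import Path
from typing import Optional, TypedDict

MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB limit
MIN_CHARS = 100  # below this the file is treated as empty


class MarkdownValidationResult(TypedDict):
    success: bool
    error: Optional[str]
    warnings: list[str]


def validate_markdown_file(file_path: Path) -> MarkdownValidationResult:
    """Validate size, encoding and basic structure of a Markdown file.

    Args:
        file_path: Path to the uploaded Markdown file.

    Returns:
        A dict with success, error and warnings keys.
    """

    def fail(message: str) -> MarkdownValidationResult:
        return {"success": False, "error": message, "warnings": []}

    if file_path.stat().st_size >= MAX_SIZE_BYTES:
        return fail("File too large")
    raw = file_path.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return fail("Invalid UTF-8")
    if "\x00" in text:
        return fail("Binary content detected")
    if len(text.strip()) < MIN_CHARS:
        return fail("File too short")
    warnings: list[str] = []
    # A missing header is only a warning: plain text is still processed.
    if not re.search(r"^#{1,6} ", text, flags=re.MULTILINE):
        warnings.append("No Markdown header found")
    return {"success": True, "error": None, "warnings": warnings}
```

The UTF-8 check runs before the null-byte check so that a renamed binary file is reported as "Invalid UTF-8".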
</implementation_details>
<test_steps>
1. Test with valid Markdown file → passes validation
2. Test with empty file → fails with "File too short"
3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
4. Test with very large file (>10MB) → fails with "File too large"
5. Test with plain text no headers → warning but continues
</test_steps>
</feature_5>

<feature_6>
<title>Update Documentation</title>
<description>
Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
</description>
<priority>3</priority>
<category>documentation</category>
<files_to_modify>
- README.md (add section under "Pipeline de Traitement")
- .claude/CLAUDE.md (update development guidelines)
- templates/upload.html (add help text)
</files_to_modify>
<implementation_details>
- README.md:
  - Add "Support Markdown Natif" section
  - Document accepted formats: PDF, MD
  - Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
  - Add example: process_pdf(Path("document.md"))
- CLAUDE.md:
  - Update "Pipeline de Traitement" section
  - Note conditional OCR step
  - Document markdown_validator.py module
- upload.html:
  - Update file input accept attribute: accept=".pdf,.md"
  - Add help text: "Formats acceptés : PDF, Markdown (.md)"
</implementation_details>
<test_steps>
1. Read README.md markdown support section
2. Verify examples are clear and accurate
3. Check CLAUDE.md developer notes
4. Open /upload in browser
5. Verify help text displays correctly
</test_steps>
</feature_6>

<feature_7>
<title>Add Unit Tests for Markdown Processing</title>
<description>
Create comprehensive unit tests for Markdown file handling to ensure reliability
and prevent regressions. Cover file validation, pipeline processing, and edge cases.
</description>
<priority>2</priority>
<category>testing</category>
<files_to_create>
- tests/utils/test_markdown_validator.py
- tests/utils/test_pdf_pipeline_markdown.py
- tests/fixtures/sample.md
</files_to_create>
<implementation_details>
- test_markdown_validator.py:
  - Test valid Markdown acceptance
  - Test invalid encoding rejection
  - Test file size limits
  - Test empty file rejection
  - Test binary data detection
- test_pdf_pipeline_markdown.py:
  - Test Markdown file processing end-to-end
  - Test OCR skip for .md files
  - Test cost_ocr = 0.0
  - Test LLM processing (metadata, TOC, chunking)
  - Mock Weaviate ingestion
  - Verify output files created correctly
- fixtures/sample.md:
  - Create realistic philosophical text in Markdown
  - Include headers, paragraphs, formatting
  - ~1000 words for realistic testing
</implementation_details>
<test_steps>
1. Run: pytest tests/utils/test_markdown_validator.py -v
2. Verify all validation tests pass
3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
4. Verify end-to-end Markdown processing works
5. Check test coverage: pytest --cov=utils --cov-report=html
</test_steps>
</feature_7>

<feature_8>
<title>Type Safety and Documentation</title>
<description>
Ensure all new code follows strict type safety requirements and includes comprehensive
Google-style docstrings. Run mypy checks and update type definitions as needed.
</description>
<priority>2</priority>
<category>type_safety</category>
<files_to_modify>
- utils/types.py (add Markdown-specific types if needed)
- All modified modules (type annotations)
</files_to_modify>
<implementation_details>
- Add type annotations to all new functions
- Update existing functions that handle both PDF and MD
- Consider adding:
  - FileFormat = Literal["pdf", "md"]
  - MarkdownValidationResult = TypedDict(...)
- Run mypy --strict on all modified files
- Add Google-style docstrings with:
  - Args section documenting all parameters
  - Returns section with structure details
  - Raises section for exceptions
  - Examples section for complex functions
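As an illustration of the FileFormat type combined with a Google-style docstring, here is a minimal sketch. The helper detect_file_format is hypothetical and not part of the existing code base.

```python
from typing import Literal

FileFormat = Literal["pdf", "md"]


def detect_file_format(filename: str) -> FileFormat:
    """Map an uploaded filename to a supported file format.

    Args:
        filename: Name of the uploaded file, for example "menon.md".

    Returns:
        "md" for Markdown files and "pdf" for PDF files.

    Raises:
        ValueError: If the extension is neither .md nor .pdf.
    """
    lowered = filename.lower()
    if lowered.endswith(".md"):
        return "md"
    if lowered.endswith(".pdf"):
        return "pdf"
    raise ValueError(f"Unsupported file format: {filename}")
```

Returning a Literal lets mypy check that callers only branch on the two supported values.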
</implementation_details>
<test_steps>
1. Run: mypy utils/pdf_pipeline.py --strict
2. Run: mypy utils/markdown_validator.py --strict
3. Verify no type errors
4. Run: pydocstyle utils/markdown_validator.py --convention=google
5. Verify all docstrings follow Google style
</test_steps>
</feature_8>

<feature_9>
<title>Handle Markdown-Specific Edge Cases</title>
<description>
Address edge cases specific to Markdown processing: front matter (YAML/TOML),
embedded code blocks, special characters, and non-standard Markdown extensions.
</description>
<priority>3</priority>
<category>backend</category>
<files_to_modify>
- utils/markdown_validator.py
- utils/llm_metadata.py (handle front matter)
</files_to_modify>
<implementation_details>
- Front matter handling:
  - Detect YAML/TOML front matter (--- or +++)
  - Extract metadata if present (title, author, date)
  - Pass to LLM or use directly if valid
  - Strip front matter before content processing
- Code block handling:
  - Don't treat code blocks as actual content
  - Preserve them for chunking but don't analyze
- Special characters:
  - Handle Unicode properly (Greek, Latin, French accents)
  - Preserve LaTeX equations in $ or $$
- GitHub Flavored Markdown:
  - Support tables, task lists, strikethrough
  - Convert to standard format if needed
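Front matter detection and stripping could be sketched as below. The function name split_front_matter is hypothetical; the sketch only separates the raw block from the body and leaves YAML/TOML parsing to a dedicated library.

```python
from typing import Optional


def split_front_matter(md_content: str) -> tuple[Optional[str], str]:
    """Split leading YAML (---) or TOML (+++) front matter from the body.

    Returns:
        A (front_matter, body) tuple; front_matter is None when the file
        does not start with a closed front matter block.
    """
    lines = md_content.splitlines(keepends=True)
    if not lines:
        return None, md_content
    delimiter = lines[0].strip()
    if delimiter not in {"---", "+++"}:
        return None, md_content
    for index in range(1, len(lines)):
        if lines[index].strip() == delimiter:
            front_matter = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return front_matter, body
    # No closing delimiter: treat the whole file as regular content.
    return None, md_content
```

A file that opens with an unclosed "---" (for example a horizontal rule) is passed through unchanged.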
</implementation_details>
<test_steps>
1. Upload Markdown with YAML front matter
2. Verify metadata extracted correctly
3. Upload Markdown with code blocks
4. Verify code not treated as philosophical content
5. Upload Markdown with Greek/Latin text
6. Verify Unicode handled correctly
</test_steps>
</feature_9>

<feature_10>
<title>Update UI/UX for Markdown Upload</title>
<description>
Enhance the upload interface to clearly communicate Markdown support and provide
visual feedback about the file type being processed. Show format-specific information
(e.g., "No OCR cost for Markdown files").
</description>
<priority>3</priority>
<category>frontend</category>
<files_to_modify>
- templates/upload.html
- templates/upload_progress.html
</files_to_modify>
<implementation_details>
- upload.html:
  - Add file type indicator icon (📄 PDF vs 📝 MD)
  - Show format-specific help text on hover
  - Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
  - Add example Markdown file download link
- upload_progress.html:
  - Show different icon for Markdown processing
  - Adjust progress bar (9 steps vs 10 steps)
  - Display "No OCR cost" badge for Markdown
  - Update step descriptions based on file type
</implementation_details>
<test_steps>
1. Open /upload page
2. Verify help text mentions both PDF and MD
3. Select a .md file
4. Verify file type indicator shows 📝
5. Submit upload
6. Verify progress shows "Chargement du fichier Markdown..."
7. Verify "No OCR cost" badge displays
</test_steps>
</feature_10>
</core_features>

<implementation_steps>
<step number="1">
<title>Setup and Configuration</title>
<tasks>
- Update ALLOWED_EXTENSIONS in flask_app.py
- Modify allowed_file() validation function
- Update upload.html file input accept attribute
- Add Markdown MIME type handling
</tasks>
</step>

<step number="2">
<title>Core Pipeline Extension</title>
<tasks>
- Add file extension detection in process_pdf_v2()
- Implement Markdown file reading logic
- Skip OCR for .md files
- Add conditional progress callbacks
- Update process_pdf_bytes() for Markdown
</tasks>
</step>

<step number="3">
<title>Validation and Error Handling</title>
<tasks>
- Create markdown_validator.py module
- Implement UTF-8 encoding validation
- Add file size limits
- Handle front matter extraction
- Add comprehensive error messages
</tasks>
</step>

<step number="4">
<title>Testing Infrastructure</title>
<tasks>
- Create test fixtures (sample.md)
- Write validation tests
- Write pipeline integration tests
- Add edge case tests
- Verify mypy strict compliance
</tasks>
</step>

<step number="5">
<title>Documentation and Polish</title>
<tasks>
- Update README.md with Markdown support
- Update .claude/CLAUDE.md developer docs
- Add Google-style docstrings
- Update UI templates with new messaging
- Create usage examples
</tasks>
</step>
</implementation_steps>

<success_criteria>
<functionality>
- Markdown files upload successfully via Flask
- OCR is skipped for .md files (cost_ocr = 0.0)
- LLM processing works identically for PDF and MD
- Chunks are created and vectorized correctly
- Both file types can be searched in Weaviate
- Existing PDF workflow remains unchanged
</functionality>

<type_safety>
- All code passes mypy --strict
- All functions have type annotations
- Google-style docstrings on all modules
- No Any types without justification
- TypedDict definitions for new data structures
</type_safety>

<testing>
- Unit tests cover Markdown validation
- Integration tests verify end-to-end processing
- Edge cases handled (front matter, Unicode, large files)
- Test coverage >80% for new code
- All tests pass in CI/CD pipeline
</testing>

<user_experience>
- Upload interface clearly shows both formats supported
- Progress feedback accurate for both PDF and MD
- Cost savings clearly communicated ("0€ for Markdown")
- Error messages helpful and specific
- Documentation clear with examples
</user_experience>

<performance>
- Markdown processing faster than PDF (no OCR)
- No regression in PDF processing speed
- Memory usage reasonable for large MD files
- Validation completes in <100ms
- Overall pipeline <30s for typical Markdown document
</performance>
</success_criteria>

<technical_notes>
<cost_comparison>
- PDF processing: OCR ~0.003€/page + LLM variable
- Markdown processing: 0€ OCR + LLM variable
- Estimated savings: 50-70% of total processing cost for documents with a Markdown source (rough estimate; the actual figure depends on page count and on the LLM backend)
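The OCR part of the comparison is simple arithmetic, sketched below. It assumes the approximate rate of 0.003€/page quoted above; the function name estimate_ocr_cost is hypothetical, and LLM cost is left out because it varies by backend.

```python
OCR_COST_PER_PAGE_EUR = 0.003  # approximate rate quoted in this specification


def estimate_ocr_cost(file_ext: str, nb_pages: int) -> float:
    """Estimate the OCR cost in euros for a document of nb_pages pages."""
    if file_ext.lower() == ".md":
        return 0.0  # Markdown sources skip OCR entirely
    return nb_pages * OCR_COST_PER_PAGE_EUR
```

For a 300-page PDF this gives about 0.90€ of OCR cost, against 0€ for the same text supplied as Markdown.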
</cost_comparison>

<compatibility>
- Maintains backward compatibility with existing PDFs
- No breaking changes to API or database schema
- Existing chunks and documents unaffected
- Can process both formats in same session
</compatibility>

<future_enhancements>
- Support for .txt plain text files
- Support for .docx Word documents (via pandoc)
- Support for .epub ebooks
- Batch upload of multiple Markdown files
- Markdown to PDF export for archival
</future_enhancements>
</technical_notes>
</project_specification>