Add specification for Markdown support in Library RAG

New feature specification to add native Markdown (.md) file support:
- Skip OCR for .md files (0€ cost vs ~0.003€/page for PDF)
- Process Markdown directly through LLM pipeline
- Maintain full compatibility with existing PDF workflow
- Includes 10 features, 5 implementation steps, comprehensive tests

This will enable users to upload pre-digitized philosophical texts in Markdown format without incurring OCR costs while still benefiting from LLM-based metadata extraction, TOC generation, semantic chunking, and Weaviate vectorization.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-25 12:46:07 +01:00
parent e13e0fa261
commit bf790b63a0
1 changed file with 490 additions and 0 deletions
--- a/prompts/app_spec_markdown_support.txt
+++ b/prompts/app_spec_markdown_support.txt
@@ -0,0 +1,490 @@
+<project_specification>
+  <project_name>Library RAG - Native Markdown Support</project_name>
+
+  <overview>
+    Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
+    and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
+    skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
+    chunking, and Weaviate vectorization.
+
+    This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
+    for users who have philosophical texts in Markdown format.
+  </overview>
+
+  <technology_stack>
+    <backend>
+      <framework>Flask 3.0</framework>
+      <pipeline>utils/pdf_pipeline.py (to be extended)</pipeline>
+      <validation>Werkzeug secure_filename</validation>
+      <llm>Ollama (local) or Mistral API</llm>
+      <vectorization>Weaviate with BAAI/bge-m3</vectorization>
+    </backend>
+    <type_safety>
+      <type_checker>mypy strict mode</type_checker>
+      <docstrings>Google-style docstrings required</docstrings>
+    </type_safety>
+  </technology_stack>
+
+  <core_features>
+    <feature_1>
+      <title>Update Flask File Validation</title>
+      <description>
+        Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
+        configuration and file validation logic to support .md files while maintaining backward compatibility
+        with existing PDF workflows.
+      </description>
+      <priority>1</priority>
+      <category>backend</category>
+      <files_to_modify>
+        - flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
+      </files_to_modify>
+      <implementation_details>
+        - Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
+        - Update allowed_file() function to accept both extensions
+        - Update upload.html template to accept .md files in file input
+        - Update error messages to reflect both formats
+      </implementation_details>
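A minimal sketch of the extension check described in feature 1. The constant and function names come from the spec; the body is an assumption, since the current `allowed_file()` in flask_app.py is not shown in this diff.

```python
from pathlib import Path

# Assumed shape of the updated constant; the real one lives in flask_app.py.
ALLOWED_EXTENSIONS: set[str] = {"pdf", "md"}


def allowed_file(filename: str) -> bool:
    """Return True when the filename carries an accepted extension."""
    # Path.suffix is "" for names without an extension (e.g. "README"),
    # so those are rejected before the set lookup.
    suffix = Path(filename).suffix.lower()
    return suffix.startswith(".") and suffix[1:] in ALLOWED_EXTENSIONS
```

The check is case-insensitive, so `KANT.PDF` and `notes.MD` are both accepted. In the upload route it would presumably run on the result of Werkzeug's `secure_filename`, which the spec lists under validation.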
+      <test_steps>
+        1. Start Flask app
+        2. Navigate to /upload
+        3. Attempt to upload a .md file
+        4. Verify file is accepted (no "Format non supporté" error)
+        5. Verify PDF upload still works
+      </test_steps>
+    </feature_1>
+
+    <feature_2>
+      <title>Add Markdown Detection in Pipeline</title>
+      <description>
+        Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
+        Add logic to automatically skip OCR processing for .md files and copy the Markdown content
+        directly to the output directory.
+      </description>
+      <priority>1</priority>
+      <category>backend</category>
+      <files_to_modify>
+        - utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
+      </files_to_modify>
+      <implementation_details>
+        - Add file extension detection: `file_ext = pdf_path.suffix.lower()`
+        - If file_ext == ".md":
+          - Skip OCR step entirely (no Mistral API call)
+          - Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
+          - Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
+          - Set nb_pages = number of lines starting with "# ", or 1 if none (estimate from H1 headers; md_content.count('\n# ') alone would miss an H1 on the first line)
+          - Set cost_ocr = 0.0
+          - Emit progress: "markdown_load" instead of "ocr"
+        - If file_ext == ".pdf":
+          - Continue with existing OCR workflow
+        - Both paths converge at LLM processing (metadata, TOC, chunking)
+      </implementation_details>
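The Markdown branch above could look roughly like this. It is a sketch under assumptions: `load_markdown_source`, `LoadResult`, and the `output_dir` parameter are hypothetical names, and the output layout `output/<stem>/<name>` is inferred from the test step that expects `output/test/test.md`. The PDF branch is left to the existing OCR workflow and is not reproduced here.

```python
import re
from pathlib import Path
from typing import TypedDict


class LoadResult(TypedDict):
    """Hypothetical result shape for the Markdown load step."""

    step: str
    nb_pages: int
    cost_ocr: float
    md_path: Path


# Matches "# " at the start of any line, including the very first one.
H1_PATTERN = re.compile(r"^# ", re.MULTILINE)


def load_markdown_source(source_path: Path, output_dir: Path) -> LoadResult:
    """Copy a .md source into the output directory without calling OCR."""
    if source_path.suffix.lower() != ".md":
        raise ValueError("Only .md files skip OCR; PDFs use the existing workflow")

    md_content = source_path.read_text(encoding="utf-8")

    # Assumed layout: output/<document stem>/<original file name>
    doc_dir = output_dir / source_path.stem
    doc_dir.mkdir(parents=True, exist_ok=True)
    md_path = doc_dir / source_path.name
    md_path.write_text(md_content, encoding="utf-8")

    # Estimate the page count from H1 headers, falling back to 1.
    nb_pages = len(H1_PATTERN.findall(md_content)) or 1

    return {
        "step": "markdown_load",
        "nb_pages": nb_pages,
        "cost_ocr": 0.0,
        "md_path": md_path,
    }
```

Using a multiline regex rather than `count('\n# ')` means an H1 on the first line of the file is included in the estimate.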
+      <test_steps>
+        1. Create test Markdown file with philosophical content
+        2. Call process_pdf(Path("test.md"), use_llm=True)
+        3. Verify OCR is skipped (cost_ocr = 0.0)
+        4. Verify output/test/test.md is created
+        5. Verify no _ocr.json file is created
+        6. Verify LLM processing runs normally
+      </test_steps>
+    </feature_2>
+
+    <feature_3>
+      <title>Markdown-Specific Progress Callback</title>
+      <description>
+        Update the progress callback system to emit appropriate events for Markdown file processing.
+        Instead of "OCR Mistral en cours...", display "Chargement du fichier Markdown..." to provide accurate
+        user feedback during Server-Sent Events streaming.
+      </description>
+      <priority>2</priority>
+      <category>backend</category>
+      <files_to_modify>
+        - utils/pdf_pipeline.py (emit_progress calls)
+        - flask_app.py (process_file_background function)
+      </files_to_modify>
+      <implementation_details>
+        - Add conditional progress messages based on file type
+        - For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
+        - For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
+        - Update frontend to handle "markdown_load" event type
+        - Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
+      </implementation_details>
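One way to keep the two messages in a single place is a small selector function. This is only a sketch: `initial_progress_event` and `ProgressEvent` are hypothetical, and the `(step, status, message)` argument order of `emit_progress` is assumed from the calls quoted in the list above.

```python
from typing import Tuple

# (step, status, message), mirroring the emit_progress calls in the spec.
ProgressEvent = Tuple[str, str, str]


def initial_progress_event(file_ext: str) -> ProgressEvent:
    """Pick the first progress event according to the file extension."""
    if file_ext.lower() == ".md":
        return ("markdown_load", "running", "Chargement du fichier Markdown...")
    return ("ocr", "running", "OCR Mistral en cours...")
```

A caller could then write `emit_progress(*initial_progress_event(pdf_path.suffix))`, provided `emit_progress` really takes those three positional arguments.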
+      <test_steps>
+        1. Upload Markdown file via Flask interface
+        2. Monitor SSE progress stream at /upload/progress/&lt;job_id&gt;
+        3. Verify first step shows "Chargement du fichier Markdown..."
+        4. Verify no OCR-related messages appear
+        5. Verify subsequent steps (metadata, TOC, etc.) work normally
+      </test_steps>
+    </feature_3>
+
+    <feature_4>
+      <title>Update process_pdf_bytes for Markdown</title>
+      <description>
+        Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
+        This function currently creates a temporary PDF file, but for Markdown uploads,
+        it should create a temporary .md file instead.
+      </description>
+      <priority>1</priority>
+      <category>backend</category>
+      <files_to_modify>
+        - utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
+      </files_to_modify>
+      <implementation_details>
+        - Detect file type from filename parameter
+        - If filename ends with .md:
+          - Create temp file with suffix=".md"
+          - Decode file_bytes as UTF-8 and write the result as text
+        - If filename ends with .pdf:
+          - Existing behavior (suffix=".pdf", binary write)
+        - Pass temp file path to process_pdf() which now handles both types
+      </implementation_details>
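A possible shape for the temp-file step, assuming the upload arrives as raw bytes plus the original filename. `write_temp_upload` is a hypothetical helper, not the actual `process_pdf_bytes()` signature, which this diff does not show.

```python
import tempfile
from pathlib import Path


def write_temp_upload(file_bytes: bytes, filename: str) -> Path:
    """Write uploaded bytes to a temp file whose suffix matches the format."""
    suffix = Path(filename).suffix.lower()
    if suffix not in {".pdf", ".md"}:
        raise ValueError(f"Unsupported format: {filename}")

    if suffix == ".md":
        # Fail early on non-UTF-8 uploads; UnicodeDecodeError is a ValueError.
        file_bytes.decode("utf-8")

    # delete=False so the pipeline can reopen the file by path;
    # the caller is responsible for removing it afterwards.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(file_bytes)
    return Path(tmp.name)
```

Writing the validated bytes unchanged avoids any newline translation that a text-mode write could introduce on some platforms. The returned path would be passed to `process_pdf()` and unlinked in a `finally` block.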
+      <test_steps>
+        1. Create Flask test client
+        2. POST multipart form with .md file to /upload
+        3. Verify process_pdf_bytes creates .md temp file
+        4. Verify temp file contains correct Markdown content
+        5. Verify cleanup deletes temp file after processing
+      </test_steps>
+    </feature_4>
+
+    <feature_5>
+      <title>Add Markdown File Validation</title>
+      <description>
+        Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
+        and basic Markdown structure. Reject files that are too large, contain binary data,
+        or have no meaningful content.
+      </description>
+      <priority>2</priority>
+      <category>backend</category>
+      <files_to_create>
+        - utils/markdown_validator.py
+      </files_to_create>
+      <implementation_details>
+        - Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
+        - Checks:
+          - File size &lt; 10 MB
+          - Valid UTF-8 encoding
+          - Contains at least one header (#, ##, etc.); if none, add a warning rather than failing
+          - Not empty (at least 100 characters)
+          - No null bytes or excessive binary content
+        - Return dict with success, error, and warnings keys
+        - Call from process_pdf_v2 before processing
+        - Type annotations and Google-style docstrings required
+      </implementation_details>
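The checks listed above can be sketched as follows. The function name and the result keys come from the spec; the error strings follow the test steps below, while the check order, the constants, and the use of a `TypedDict` instead of `dict[str, Any]` are assumptions.

```python
import re
from pathlib import Path
from typing import List, Optional, TypedDict


class MarkdownValidationResult(TypedDict):
    success: bool
    error: Optional[str]
    warnings: List[str]


MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MB limit from the spec
MIN_CHARS = 100
HEADER_PATTERN = re.compile(r"^#{1,6} ", re.MULTILINE)


def _fail(message: str) -> MarkdownValidationResult:
    return {"success": False, "error": message, "warnings": []}


def validate_markdown_file(file_path: Path) -> MarkdownValidationResult:
    """Run the size, encoding, content, and structure checks in order."""
    if file_path.stat().st_size >= MAX_SIZE_BYTES:
        return _fail("File too large")

    raw = file_path.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return _fail("Invalid UTF-8")

    # A null byte is valid UTF-8 but never expected in a Markdown document.
    if "\x00" in text:
        return _fail("Binary content detected")

    if MIN_CHARS > len(text.strip()):
        return _fail("File too short")

    warnings: List[str] = []
    if HEADER_PATTERN.search(text) is None:
        # Missing headers only produce a warning; processing continues.
        warnings.append("No Markdown header found")

    return {"success": True, "error": None, "warnings": warnings}
```

Decoding before the null-byte check means a renamed executable is reported as "Invalid UTF-8", which matches the expected message in the test steps.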
+      <test_steps>
+        1. Test with valid Markdown file → passes validation
+        2. Test with empty file → fails with "File too short"
+        3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
+        4. Test with very large file (&gt;10MB) → fails with "File too large"
+        5. Test with plain text no headers → warning but continues
+      </test_steps>
+    </feature_5>
+
+    <feature_6>
+      <title>Update Documentation</title>
+      <description>
+        Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
+        Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
+      </description>
+      <priority>3</priority>
+      <category>documentation</category>
+      <files_to_modify>
+        - README.md (add section under "Pipeline de Traitement")
+        - .claude/CLAUDE.md (update development guidelines)
+        - templates/upload.html (add help text)
+      </files_to_modify>
+      <implementation_details>
+        - README.md:
+          - Add "Support Markdown Natif" section
+          - Document accepted formats: PDF, MD
+          - Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
+          - Add example: process_pdf(Path("document.md"))
+        - CLAUDE.md:
+          - Update "Pipeline de Traitement" section
+          - Note conditional OCR step
+          - Document markdown_validator.py module
+        - upload.html:
+          - Update file input accept attribute: accept=".pdf,.md"
+          - Add help text: "Formats acceptés : PDF, Markdown (.md)"
+      </implementation_details>
+      <test_steps>
+        1. Read README.md markdown support section
+        2. Verify examples are clear and accurate
+        3. Check CLAUDE.md developer notes
+        4. Open /upload in browser
+        5. Verify help text displays correctly
+      </test_steps>
+    </feature_6>
+
+    <feature_7>
+      <title>Add Unit Tests for Markdown Processing</title>
+      <description>
+        Create comprehensive unit tests for Markdown file handling to ensure reliability
+        and prevent regressions. Cover file validation, pipeline processing, and edge cases.
+      </description>
+      <priority>2</priority>
+      <category>testing</category>
+      <files_to_create>
+        - tests/utils/test_markdown_validator.py
+        - tests/utils/test_pdf_pipeline_markdown.py
+        - tests/fixtures/sample.md
+      </files_to_create>
+      <implementation_details>
+        - test_markdown_validator.py:
+          - Test valid Markdown acceptance
+          - Test invalid encoding rejection
+          - Test file size limits
+          - Test empty file rejection
+          - Test binary data detection
+        - test_pdf_pipeline_markdown.py:
+          - Test Markdown file processing end-to-end
+          - Test OCR skip for .md files
+          - Test cost_ocr = 0.0
+          - Test LLM processing (metadata, TOC, chunking)
+          - Mock Weaviate ingestion
+          - Verify output files created correctly
+        - fixtures/sample.md:
+          - Create realistic philosophical text in Markdown
+          - Include headers, paragraphs, formatting
+          - ~1000 words for realistic testing
+      </implementation_details>
+      <test_steps>
+        1. Run: pytest tests/utils/test_markdown_validator.py -v
+        2. Verify all validation tests pass
+        3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
+        4. Verify end-to-end Markdown processing works
+        5. Check test coverage: pytest --cov=utils --cov-report=html
+      </test_steps>
+    </feature_7>
+
+    <feature_8>
+      <title>Type Safety and Documentation</title>
+      <description>
+        Ensure all new code follows strict type safety requirements and includes comprehensive
+        Google-style docstrings. Run mypy checks and update type definitions as needed.
+      </description>
+      <priority>2</priority>
+      <category>type_safety</category>
+      <files_to_modify>
+        - utils/types.py (add Markdown-specific types if needed)
+        - All modified modules (type annotations)
+      </files_to_modify>
+      <implementation_details>
+        - Add type annotations to all new functions
+        - Update existing functions that handle both PDF and MD
+        - Consider adding:
+          - FileFormat = Literal["pdf", "md"]
+          - MarkdownValidationResult = TypedDict(...)
+        - Run mypy --strict on all modified files
+        - Add Google-style docstrings with:
+          - Args section documenting all parameters
+          - Returns section with structure details
+          - Raises section for exceptions
+          - Examples section for complex functions
+      </implementation_details>
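As an illustration of the suggested `FileFormat` alias together with a Google-style docstring, here is a small sketch. `detect_file_format` is a hypothetical helper; the spec only proposes the type alias itself.

```python
from pathlib import Path
from typing import Literal

FileFormat = Literal["pdf", "md"]


def detect_file_format(path: Path) -> FileFormat:
    """Map a file path to one of the supported formats.

    Args:
        path: Path of the uploaded or local file.

    Returns:
        "pdf" or "md", based on the lowercased file extension.

    Raises:
        ValueError: If the extension is neither .pdf nor .md.
    """
    ext = path.suffix.lower()
    # Returning string literals lets a type checker verify the Literal type.
    if ext == ".pdf":
        return "pdf"
    if ext == ".md":
        return "md"
    raise ValueError(f"Unsupported file format: {path.name}")
```

Functions that branch on the format can then annotate their parameter as `FileFormat`, so a misspelled literal such as `"mdd"` would be flagged by mypy instead of failing at runtime.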
+      <test_steps>
+        1. Run: mypy utils/pdf_pipeline.py --strict
+        2. Run: mypy utils/markdown_validator.py --strict
+        3. Verify no type errors
+        4. Run: pydocstyle utils/markdown_validator.py --convention=google
+        5. Verify all docstrings follow Google style
+      </test_steps>
+    </feature_8>
+
+    <feature_9>
+      <title>Handle Markdown-Specific Edge Cases</title>
+      <description>
+        Address edge cases specific to Markdown processing: front matter (YAML/TOML),
+        embedded code blocks, special characters, and non-standard Markdown extensions.
+      </description>
+      <priority>3</priority>
+      <category>backend</category>
+      <files_to_modify>
+        - utils/markdown_validator.py
+        - utils/llm_metadata.py (handle front matter)
+      </files_to_modify>
+      <implementation_details>
+        - Front matter handling:
+          - Detect YAML/TOML front matter (--- or +++)
+          - Extract metadata if present (title, author, date)
+          - Pass to LLM or use directly if valid
+          - Strip front matter before content processing
+        - Code block handling:
+          - Do not treat code blocks as philosophical content
+          - Keep them in the chunks but exclude them from LLM analysis
+        - Special characters:
+          - Handle Unicode properly (Greek, Latin, French accents)
+          - Preserve LaTeX equations in $ or $$
+        - GitHub Flavored Markdown:
+          - Support tables, task lists, strikethrough
+          - Convert to standard format if needed
+      </implementation_details>
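Front matter detection and stripping could be sketched like this, using only the standard library. `split_front_matter` is a hypothetical helper; it separates the raw block from the body and leaves actual YAML/TOML parsing to whichever parser the project chooses.

```python
from typing import Optional, Tuple

# Opening fence line mapped to the front matter dialect it introduces.
FRONT_MATTER_FENCES = {"---": "yaml", "+++": "toml"}


def split_front_matter(md_content: str) -> Tuple[Optional[str], str, str]:
    """Return (kind, raw front matter, body) for a Markdown document."""
    lines = md_content.splitlines(keepends=True)
    if not lines:
        return None, "", md_content

    fence = lines[0].strip()
    kind = FRONT_MATTER_FENCES.get(fence)
    if kind is None:
        return None, "", md_content

    # Look for the matching closing fence after the first line.
    for index in range(1, len(lines)):
        if lines[index].strip() == fence:
            raw = "".join(lines[1:index])
            body = "".join(lines[index + 1:])
            return kind, raw, body

    # No closing fence: treat the whole file as ordinary content.
    return None, "", md_content
```

Only a fence on the very first line counts as front matter, so a `---` horizontal rule further down the document is left untouched. A document that opens with a horizontal rule and contains a second one later would still be misread, which is an edge case worth a dedicated test.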
+      <test_steps>
+        1. Upload Markdown with YAML front matter
+        2. Verify metadata extracted correctly
+        3. Upload Markdown with code blocks
+        4. Verify code not treated as philosophical content
+        5. Upload Markdown with Greek/Latin text
+        6. Verify Unicode handled correctly
+      </test_steps>
+    </feature_9>
+
+    <feature_10>
+      <title>Update UI/UX for Markdown Upload</title>
+      <description>
+        Enhance the upload interface to clearly communicate Markdown support and provide
+        visual feedback about the file type being processed. Show format-specific information
+        (e.g., "No OCR cost for Markdown files").
+      </description>
+      <priority>3</priority>
+      <category>frontend</category>
+      <files_to_modify>
+        - templates/upload.html
+        - templates/upload_progress.html
+      </files_to_modify>
+      <implementation_details>
+        - upload.html:
+          - Add file type indicator icon (📄 PDF vs 📝 MD)
+          - Show format-specific help text on hover
+          - Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
+          - Add example Markdown file download link
+        - upload_progress.html:
+          - Show different icon for Markdown processing
+          - Adjust progress bar (9 steps vs 10 steps)
+          - Display "No OCR cost" badge for Markdown
+          - Update step descriptions based on file type
+      </implementation_details>
+      <test_steps>
+        1. Open /upload page
+        2. Verify help text mentions both PDF and MD
+        3. Select a .md file
+        4. Verify file type indicator shows 📝
+        5. Submit upload
+        6. Verify progress shows "Chargement du fichier Markdown..."
+        7. Verify "No OCR cost" badge displays
+      </test_steps>
+    </feature_10>
+  </core_features>
+
+  <implementation_steps>
+    <step number="1">
+      <title>Setup and Configuration</title>
+      <tasks>
+        - Update ALLOWED_EXTENSIONS in flask_app.py
+        - Modify allowed_file() validation function
+        - Update upload.html file input accept attribute
+        - Add Markdown MIME type handling
+      </tasks>
+    </step>
+
+    <step number="2">
+      <title>Core Pipeline Extension</title>
+      <tasks>
+        - Add file extension detection in process_pdf_v2()
+        - Implement Markdown file reading logic
+        - Skip OCR for .md files
+        - Add conditional progress callbacks
+        - Update process_pdf_bytes() for Markdown
+      </tasks>
+    </step>
+
+    <step number="3">
+      <title>Validation and Error Handling</title>
+      <tasks>
+        - Create markdown_validator.py module
+        - Implement UTF-8 encoding validation
+        - Add file size limits
+        - Handle front matter extraction
+        - Add comprehensive error messages
+      </tasks>
+    </step>
+
+    <step number="4">
+      <title>Testing Infrastructure</title>
+      <tasks>
+        - Create test fixtures (sample.md)
+        - Write validation tests
+        - Write pipeline integration tests
+        - Add edge case tests
+        - Verify mypy strict compliance
+      </tasks>
+    </step>
+
+    <step number="5">
+      <title>Documentation and Polish</title>
+      <tasks>
+        - Update README.md with Markdown support
+        - Update .claude/CLAUDE.md developer docs
+        - Add Google-style docstrings
+        - Update UI templates with new messaging
+        - Create usage examples
+      </tasks>
+    </step>
+  </implementation_steps>
+
+  <success_criteria>
+    <functionality>
+      - Markdown files upload successfully via Flask
+      - OCR is skipped for .md files (cost_ocr = 0.0)
+      - LLM processing works identically for PDF and MD
+      - Chunks are created and vectorized correctly
+      - Both file types can be searched in Weaviate
+      - Existing PDF workflow remains unchanged
+    </functionality>
+
+    <type_safety>
+      - All code passes mypy --strict
+      - All functions have type annotations
+      - Google-style docstrings on all modules
+      - No Any types without justification
+      - TypedDict definitions for new data structures
+    </type_safety>
+
+    <testing>
+      - Unit tests cover Markdown validation
+      - Integration tests verify end-to-end processing
+      - Edge cases handled (front matter, Unicode, large files)
+      - Test coverage &gt;80% for new code
+      - All tests pass in CI/CD pipeline
+    </testing>
+
+    <user_experience>
+      - Upload interface clearly shows both formats supported
+      - Progress feedback accurate for both PDF and MD
+      - Cost savings clearly communicated ("0€ for Markdown")
+      - Error messages helpful and specific
+      - Documentation clear with examples
+    </user_experience>
+
+    <performance>
+      - Markdown processing faster than PDF (no OCR)
+      - No regression in PDF processing speed
+      - Memory usage reasonable for large MD files
+      - Validation completes in &lt;100ms
+      - Overall pipeline &lt;30s for typical Markdown document
+    </performance>
+  </success_criteria>
+
+  <technical_notes>
+    <cost_comparison>
+      - PDF processing: OCR ~0.003€/page + LLM variable
+      - Markdown processing: 0€ OCR + LLM variable
+      - Estimated savings: 50-70% of total cost for documents with Markdown source (rough estimate; depends on page count and LLM provider)
+    </cost_comparison>
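The comparison above reduces to simple arithmetic. The sketch below uses the approximate OCR price quoted in the spec; `estimate_cost` and its `llm_cost_eur` parameter are hypothetical, and the LLM share is left as an input because the spec only calls it "variable".

```python
OCR_COST_PER_PAGE_EUR = 0.003  # approximate figure quoted in the spec


def estimate_cost(nb_pages: int, file_format: str, llm_cost_eur: float = 0.0) -> float:
    """Estimate total processing cost in euros for one document."""
    # Only PDFs go through OCR; Markdown sources skip that step entirely.
    ocr_cost = OCR_COST_PER_PAGE_EUR * nb_pages if file_format == "pdf" else 0.0
    return ocr_cost + llm_cost_eur
```

For a 300-page PDF the OCR part alone comes to roughly 0.90€, while the same text supplied as Markdown costs nothing for that step. The relative saving therefore depends on how large the LLM cost is for the chosen provider.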
+
+    <compatibility>
+      - Maintains backward compatibility with existing PDFs
+      - No breaking changes to API or database schema
+      - Existing chunks and documents unaffected
+      - Can process both formats in same session
+    </compatibility>
+
+    <future_enhancements>
+      - Support for .txt plain text files
+      - Support for .docx Word documents (via pandoc)
+      - Support for .epub ebooks
+      - Batch upload of multiple Markdown files
+      - Markdown to PDF export for archival
+    </future_enhancements>
+  </technical_notes>
+</project_specification>