diff --git a/prompts/app_spec_markdown_support.txt b/prompts/app_spec_markdown_support.txt
new file mode 100644
index 0000000..5cae3aa
--- /dev/null
+++ b/prompts/app_spec_markdown_support.txt
@@ -0,0 +1,490 @@
+
+ Library RAG - Native Markdown Support
+
+ Add native support for Markdown (.md) files to the Library RAG application. Currently, the system only accepts PDF files
+ and uses Mistral OCR for text extraction. This feature will allow users to upload pre-existing Markdown files directly,
+ skipping the expensive OCR step while still benefiting from LLM-based metadata extraction, TOC generation, semantic
+ chunking, and Weaviate vectorization.
+
+ This enhancement reduces costs, improves processing speed for already-digitized texts, and makes the system more flexible
+ for users who have philosophical texts in Markdown format.
+
+ Flask 3.0
+ utils/pdf_pipeline.py (to be extended)
+ Werkzeug secure_filename
+ Ollama (local) or Mistral API
+ Weaviate with BAAI/bge-m3
+
+ mypy strict mode
+ Google-style docstrings required
+
+ Update Flask File Validation
+
+ Modify the Flask application to accept both PDF and Markdown files. Update the ALLOWED_EXTENSIONS
+ configuration and file validation logic to support .md files while maintaining backward compatibility
+ with existing PDF workflows.
+
+ 1
+ backend
+
+ - flask_app.py (line 99: ALLOWED_EXTENSIONS, line 427: allowed_file function)
+
+ - Change ALLOWED_EXTENSIONS from {"pdf"} to {"pdf", "md"}
+ - Update allowed_file() function to accept both extensions
+ - Update upload.html template to accept .md files in file input
+ - Update error messages to reflect both formats
+
+ 1. Start Flask app
+ 2. Navigate to /upload
+ 3. Attempt to upload a .md file
+ 4. Verify file is accepted (no "Format non supporté" error)
+ 5. Verify PDF upload still works
+
+ Add Markdown Detection in Pipeline
+
+ Enhance pdf_pipeline.py to detect when a Markdown file is being processed instead of a PDF.
+ Add logic to automatically skip OCR processing for .md files and copy the Markdown content
+ directly to the output directory.
+
+ 1
+ backend
+
+ - utils/pdf_pipeline.py (process_pdf_v2 function, around line 250-450)
+
+ - Add file extension detection: `file_ext = pdf_path.suffix.lower()`
+ - If file_ext == ".md":
+   - Skip OCR step entirely (no Mistral API call)
+   - Read Markdown content directly: `md_content = pdf_path.read_text(encoding='utf-8')`
+   - Copy to output: `md_path.write_text(md_content, encoding='utf-8')`
+   - Set nb_pages = len(re.findall(r'^# ', md_content, flags=re.MULTILINE)) or 1 (estimate from H1 headers, including one on the first line)
+   - Set cost_ocr = 0.0
+   - Emit progress: "markdown_load" instead of "ocr"
+ - If file_ext == ".pdf":
+   - Continue with existing OCR workflow
+ - Both paths converge at LLM processing (metadata, TOC, chunking)
+
+ 1. Create test Markdown file with philosophical content
+ 2. Call process_pdf(Path("test.md"), use_llm=True)
+ 3. Verify OCR is skipped (cost_ocr = 0.0)
+ 4. Verify output/test/test.md is created
+ 5. Verify no _ocr.json file is created
+ 6. Verify LLM processing runs normally
+
+ Markdown-Specific Progress Callback
+
+ Update the progress callback system to emit appropriate events for Markdown file processing.
+ Instead of "OCR Mistral en cours...", display "Chargement Markdown..." to provide accurate
+ user feedback during Server-Sent Events streaming.
+
+ 2
+ backend
+
+ - utils/pdf_pipeline.py (emit_progress calls)
+ - flask_app.py (process_file_background function)
+
+ - Add conditional progress messages based on file type
+ - For .md files: emit_progress("markdown_load", "running", "Chargement du fichier Markdown...")
+ - For .pdf files: emit_progress("ocr", "running", "OCR Mistral en cours...")
+ - Update frontend to handle "markdown_load" event type
+ - Ensure step numbering adjusts (9 steps for MD vs 10 for PDF)
+
+ 1. Upload Markdown file via Flask interface
+ 2. Monitor SSE progress stream at /upload/progress/<job_id>
+ 3. Verify first step shows "Chargement du fichier Markdown..."
+ 4. Verify no OCR-related messages appear
+ 5. Verify subsequent steps (metadata, TOC, etc.) work normally
+
+ Update process_pdf_bytes for Markdown
+
+ Extend process_pdf_bytes() function to handle Markdown content uploaded via Flask.
+ This function currently creates a temporary PDF file, but for Markdown uploads,
+ it should create a temporary .md file instead.
+
+ 1
+ backend
+
+ - utils/pdf_pipeline.py (process_pdf_bytes function, line 1255)
+
+ - Detect file type from filename parameter
+ - If filename ends with .md:
+   - Create temp file with suffix=".md"
+   - Write file_bytes as UTF-8 text
+ - If filename ends with .pdf:
+   - Existing behavior (suffix=".pdf", binary write)
+ - Pass temp file path to process_pdf() which now handles both types
+
+ 1. Create Flask test client
+ 2. POST multipart form with .md file to /upload
+ 3. Verify process_pdf_bytes creates .md temp file
+ 4. Verify temp file contains correct Markdown content
+ 5. Verify cleanup deletes temp file after processing
+
+ Add Markdown File Validation
+
+ Implement validation for uploaded Markdown files to ensure they contain valid UTF-8 text
+ and basic Markdown structure. Reject files that are too large, contain binary data,
+ or have no meaningful content.
+
+ 2
+ backend
+
+ - utils/markdown_validator.py
+
+ - Create validate_markdown_file(file_path: Path) -> dict[str, Any] function
+ - Checks:
+   - File size < 10 MB
+   - Valid UTF-8 encoding
+   - Contains at least one header (#, ##, etc.)
+   - Not empty (at least 100 characters)
+   - No null bytes or excessive binary content
+ - Return dict with success, error, and warnings keys
+ - Call from process_pdf_v2 before processing
+ - Type annotations and Google-style docstrings required
+
+ 1. Test with valid Markdown file → passes validation
+ 2. Test with empty file → fails with "File too short"
+ 3. Test with binary file (.exe renamed to .md) → fails with "Invalid UTF-8"
+ 4. Test with very large file (>10MB) → fails with "File too large"
+ 5. Test with plain text no headers → warning but continues
+
+ Update Documentation
+
+ Update README.md and .claude/CLAUDE.md to document the new Markdown support feature.
+ Include usage examples, cost comparison (PDF vs MD), and troubleshooting tips.
+
+ 3
+ documentation
+
+ - README.md (add section under "Pipeline de Traitement")
+ - .claude/CLAUDE.md (update development guidelines)
+ - templates/upload.html (add help text)
+
+ - README.md:
+   - Add "Support Markdown Natif" section
+   - Document accepted formats: PDF, MD
+   - Show cost comparison table (PDF: ~0.003€/page, MD: 0€)
+   - Add example: process_pdf(Path("document.md"))
+ - CLAUDE.md:
+   - Update "Pipeline de Traitement" section
+   - Note conditional OCR step
+   - Document markdown_validator.py module
+ - upload.html:
+   - Update file input accept attribute: accept=".pdf,.md"
+   - Add help text: "Formats acceptés : PDF, Markdown (.md)"
+
+ 1. Read README.md markdown support section
+ 2. Verify examples are clear and accurate
+ 3. Check CLAUDE.md developer notes
+ 4. Open /upload in browser
+ 5. Verify help text displays correctly
+
+ Add Unit Tests for Markdown Processing
+
+ Create comprehensive unit tests for Markdown file handling to ensure reliability
+ and prevent regressions. Cover file validation, pipeline processing, and edge cases.
+
+ 2
+ testing
+
+ - tests/utils/test_markdown_validator.py
+ - tests/utils/test_pdf_pipeline_markdown.py
+ - tests/fixtures/sample.md
+
+ - test_markdown_validator.py:
+   - Test valid Markdown acceptance
+   - Test invalid encoding rejection
+   - Test file size limits
+   - Test empty file rejection
+   - Test binary data detection
+ - test_pdf_pipeline_markdown.py:
+   - Test Markdown file processing end-to-end
+   - Test OCR skip for .md files
+   - Test cost_ocr = 0.0
+   - Test LLM processing (metadata, TOC, chunking)
+   - Mock Weaviate ingestion
+   - Verify output files created correctly
+ - fixtures/sample.md:
+   - Create realistic philosophical text in Markdown
+   - Include headers, paragraphs, formatting
+   - ~1000 words for realistic testing
+
+ 1. Run: pytest tests/utils/test_markdown_validator.py -v
+ 2. Verify all validation tests pass
+ 3. Run: pytest tests/utils/test_pdf_pipeline_markdown.py -v
+ 4. Verify end-to-end Markdown processing works
+ 5. Check test coverage: pytest --cov=utils --cov-report=html
+
+ Type Safety and Documentation
+
+ Ensure all new code follows strict type safety requirements and includes comprehensive
+ Google-style docstrings. Run mypy checks and update type definitions as needed.
+
+ 2
+ type_safety
+
+ - utils/types.py (add Markdown-specific types if needed)
+ - All modified modules (type annotations)
+
+ - Add type annotations to all new functions
+ - Update existing functions that handle both PDF and MD
+ - Consider adding:
+   - FileFormat = Literal["pdf", "md"]
+   - MarkdownValidationResult = TypedDict(...)
+ - Run mypy --strict on all modified files
+ - Add Google-style docstrings with:
+   - Args section documenting all parameters
+   - Returns section with structure details
+   - Raises section for exceptions
+   - Examples section for complex functions
+
+ 1. Run: mypy utils/pdf_pipeline.py --strict
+ 2. Run: mypy utils/markdown_validator.py --strict
+ 3. Verify no type errors
+ 4. Run: pydocstyle utils/markdown_validator.py --convention=google
+ 5. Verify all docstrings follow Google style
+
+ Handle Markdown-Specific Edge Cases
+
+ Address edge cases specific to Markdown processing: front matter (YAML/TOML),
+ embedded code blocks, special characters, and non-standard Markdown extensions.
+
+ 3
+ backend
+
+ - utils/markdown_validator.py
+ - utils/llm_metadata.py (handle front matter)
+
+ - Front matter handling:
+   - Detect YAML/TOML front matter (--- or +++)
+   - Extract metadata if present (title, author, date)
+   - Pass to LLM or use directly if valid
+   - Strip front matter before content processing
+ - Code block handling:
+   - Don't treat code blocks as actual content
+   - Preserve them for chunking but don't analyze
+ - Special characters:
+   - Handle Unicode properly (Greek, Latin, French accents)
+   - Preserve LaTeX equations in $ or $$
+ - GitHub Flavored Markdown:
+   - Support tables, task lists, strikethrough
+   - Convert to standard format if needed
+
+ 1. Upload Markdown with YAML front matter
+ 2. Verify metadata extracted correctly
+ 3. Upload Markdown with code blocks
+ 4. Verify code not treated as philosophical content
+ 5. Upload Markdown with Greek/Latin text
+ 6. Verify Unicode handled correctly
+
+ Update UI/UX for Markdown Upload
+
+ Enhance the upload interface to clearly communicate Markdown support and provide
+ visual feedback about the file type being processed. Show format-specific information
+ (e.g., "No OCR cost for Markdown files").
+
+ 3
+ frontend
+
+ - templates/upload.html
+ - templates/upload_progress.html
+
+ - upload.html:
+   - Add file type indicator icon (📄 PDF vs 📝 MD)
+   - Show format-specific help text on hover
+   - Display estimated cost: "PDF: ~0.003€/page, Markdown: 0€"
+   - Add example Markdown file download link
+ - upload_progress.html:
+   - Show different icon for Markdown processing
+   - Adjust progress bar (9 steps vs 10 steps)
+   - Display "No OCR cost" badge for Markdown
+   - Update step descriptions based on file type
+
+ 1. Open /upload page
+ 2. Verify help text mentions both PDF and MD
+ 3. Select a .md file
+ 4. Verify file type indicator shows 📝
+ 5. Submit upload
+ 6. Verify progress shows "Chargement Markdown..."
+ 7. Verify "No OCR cost" badge displays
+
+ Setup and Configuration
+
+ - Update ALLOWED_EXTENSIONS in flask_app.py
+ - Modify allowed_file() validation function
+ - Update upload.html file input accept attribute
+ - Add Markdown MIME type handling
+
+ Core Pipeline Extension
+
+ - Add file extension detection in process_pdf_v2()
+ - Implement Markdown file reading logic
+ - Skip OCR for .md files
+ - Add conditional progress callbacks
+ - Update process_pdf_bytes() for Markdown
+
+ Validation and Error Handling
+
+ - Create markdown_validator.py module
+ - Implement UTF-8 encoding validation
+ - Add file size limits
+ - Handle front matter extraction
+ - Add comprehensive error messages
+
+ Testing Infrastructure
+
+ - Create test fixtures (sample.md)
+ - Write validation tests
+ - Write pipeline integration tests
+ - Add edge case tests
+ - Verify mypy strict compliance
+
+ Documentation and Polish
+
+ - Update README.md with Markdown support
+ - Update .claude/CLAUDE.md developer docs
+ - Add Google-style docstrings
+ - Update UI templates with new messaging
+ - Create usage examples
+
+ - Markdown files upload successfully via Flask
+ - OCR is skipped for .md files (cost_ocr = 0.0)
+ - LLM processing works identically for PDF and MD
+ - Chunks are created and vectorized correctly
+ - Both file types can be searched in Weaviate
+ - Existing PDF workflow remains unchanged
+
+ - All code passes mypy --strict
+ - All functions have type annotations
+ - Google-style docstrings on all modules
+ - No Any types without justification
+ - TypedDict definitions for new data structures
+
+ - Unit tests cover Markdown validation
+ - Integration tests verify end-to-end processing
+ - Edge cases handled (front matter, Unicode, large files)
+ - Test coverage >80% for new code
+ - All tests pass in CI/CD pipeline
+
+ - Upload interface clearly shows both formats supported
+ - Progress feedback accurate for both PDF and MD
+ - Cost savings clearly communicated ("0€ for Markdown")
+ - Error messages helpful and specific
+ - Documentation clear with examples
+
+ - Markdown processing faster than PDF (no OCR)
+ - No regression in PDF processing speed
+ - Memory usage reasonable for large MD files
+ - Validation completes in <100ms
+ - Overall pipeline <30s for typical Markdown document
+
+ - PDF processing: OCR ~0.003€/page + LLM variable
+ - Markdown processing: 0€ OCR + LLM variable
+ - Estimated savings: 50-70% for documents with Markdown source
+
+ - Maintains backward compatibility with existing PDFs
+ - No breaking changes to API or database schema
+ - Existing chunks and documents unaffected
+ - Can process both formats in same session
+
+ - Support for .txt plain text files
+ - Support for .docx Word documents (via pandoc)
+ - Support for .epub ebooks
+ - Batch upload of multiple Markdown files
+ - Markdown to PDF export for archival
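
The validation checks that the spec lists for utils/markdown_validator.py could look roughly like the sketch below. It is only an illustration under stated assumptions, not the project's implementation: the function signature and the error strings "File too large", "Invalid UTF-8" and "File too short" are taken from the spec, while the constants MAX_SIZE_BYTES and MIN_CHARS, the header regex, the "Binary content detected" error and the warning text are hypothetical names chosen here.

```python
import re
from pathlib import Path
from typing import Any

MAX_SIZE_BYTES = 10 * 1024 * 1024  # spec: file size < 10 MB (constant name assumed)
MIN_CHARS = 100                    # spec: at least 100 characters (constant name assumed)
# ATX header: 1-6 '#' at the start of a line, then whitespace and some text.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def validate_markdown_file(file_path: Path) -> dict[str, Any]:
    """Check a Markdown file against the rules listed in the spec.

    Args:
        file_path: Path to the uploaded Markdown file.

    Returns:
        Dict with ``success`` (bool), ``error`` (str or None) and
        ``warnings`` (list of str).
    """
    result: dict[str, Any] = {"success": False, "error": None, "warnings": []}

    # Check the size first so oversized files are never read into memory.
    if file_path.stat().st_size >= MAX_SIZE_BYTES:
        result["error"] = "File too large"
        return result

    try:
        text = file_path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        result["error"] = "Invalid UTF-8"
        return result

    # NUL bytes decode as valid UTF-8 but strongly suggest binary data.
    if "\x00" in text:
        result["error"] = "Binary content detected"  # assumed message
        return result

    if len(text.strip()) < MIN_CHARS:
        result["error"] = "File too short"
        return result

    # A missing header only produces a warning, as in test step 5 of the spec.
    if HEADER_RE.search(text) is None:
        result["warnings"].append("No Markdown header found")  # assumed message

    result["success"] = True
    return result
```

In this sketch the UTF-8 decode runs before the NUL-byte check, so a renamed binary that is not valid UTF-8 reports "Invalid UTF-8", as the spec's test steps expect. A caller such as process_pdf_v2 would presumably stop when `success` is False and surface `warnings` to the user otherwise.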