<project_specification>
<project_name>Library RAG MCP Server - PDF Ingestion & Semantic Retrieval</project_name>

<overview>
MCP (Model Context Protocol) server exposing Library RAG's capabilities as tools for LLMs.

**Simplified architecture:**
- 1 ingestion tool: parse_pdf (optimal configuration predefined)
- 7 retrieval tools: semantic search and document management

**Optimal configuration (fixed parameters):**
- LLM: Mistral Medium (mistral-medium-latest)
- OCR: Mistral with annotations (better TOC quality, 3x the cost)
- Chunking: intelligent semantic chunking (LLM-based)
- Ingestion: automatic into Weaviate

The client LLM only has to provide the PDF path (or URL); all other parameters are preset to optimized defaults.
</overview>

<technology_stack>
<backend>
<runtime>Python 3.10+</runtime>
<mcp_framework>mcp Python SDK (official Anthropic implementation)</mcp_framework>
<vector_database>Weaviate 1.34.4 + text2vec-transformers</vector_database>
<ocr>Mistral OCR API with annotations</ocr>
<llm>Mistral API (mistral-medium-latest)</llm>
<type_checking>mypy strict</type_checking>
</backend>
<protocol>
<mcp_version>1.0</mcp_version>
<transport>stdio</transport>
<capabilities>tools</capabilities>
</protocol>
</technology_stack>

<implementation_steps>
<feature_1>
<title>MCP Server Foundation</title>
<description>
Set up the basic MCP server structure with stdio transport.

Tasks:
- Install mcp Python SDK: pip install mcp
- Create mcp_server.py with Server initialization
- Configure stdio transport for LLM communication
- Implement server lifecycle handlers (startup, shutdown)
- Set up basic logging with Python logging module
- Test server startup and graceful shutdown
</description>
<priority>1</priority>
<category>infrastructure</category>
<test_steps>
1. Run: python mcp_server.py
2. Verify server starts without errors
3. Check that stdio transport is initialized
4. Send SIGTERM and verify graceful shutdown
5. Check logs are created and formatted correctly
</test_steps>
</feature_1>

<feature_2>
<title>Configuration Management</title>
<description>
Implement configuration loading and validation.

Tasks:
- Verify mcp_config.py exists (already created)
- Test MCPConfig.from_env() loads .env correctly
- Validate MISTRAL_API_KEY is required
- Test default values for optional settings
- Create .env.example file with all variables
- Document all environment variables
</description>
<priority>1</priority>
<category>infrastructure</category>
<test_steps>
1. Create .env file with MISTRAL_API_KEY
2. Run: python -c "from mcp_config import MCPConfig; MCPConfig.from_env()"
3. Verify config loads successfully
4. Remove MISTRAL_API_KEY from .env
5. Verify ValueError is raised
6. Test all default values are applied
</test_steps>
</feature_2>
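
The validation behaviour described in feature_2 can be sketched as follows. This is a minimal illustration, not the existing mcp_config.py: the field names, the WEAVIATE_URL variable and both default values are assumptions, and reading the .env file itself (for example with python-dotenv) is left out, so the sketch only reads a mapping of variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MCPConfig:
    """Hypothetical shape; the real mcp_config.py may define other fields."""

    mistral_api_key: str
    weaviate_url: str = "http://localhost:8080"  # assumed default
    log_level: str = "INFO"  # assumed default

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "MCPConfig":
        """Build a config from a mapping of variables (os.environ by default)."""
        source = os.environ if env is None else env
        api_key = source.get("MISTRAL_API_KEY", "").strip()
        if not api_key:
            # The spec requires a ValueError when the key is missing.
            raise ValueError("MISTRAL_API_KEY is required")
        return cls(
            mistral_api_key=api_key,
            weaviate_url=source.get("WEAVIATE_URL", cls.weaviate_url),
            log_level=source.get("LOG_LEVEL", cls.log_level).upper(),
        )
```

Passing the mapping explicitly, as in `MCPConfig.from_env({"MISTRAL_API_KEY": "test-key"})`, lets the tests cover the required-key and default-value cases without touching the real environment.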

<feature_3>
<title>Pydantic Schemas for All Tools</title>
<description>
Define all tool input/output schemas using Pydantic models.

Tasks:
- Create mcp_tools/ directory
- Create mcp_tools/__init__.py
- Create mcp_tools/schemas.py
- Define ParsePdfInput model with validation
- Define ParsePdfOutput model
- Define SearchChunksInput/Output models
- Define schemas for all 7 retrieval tools
- Add Field() validation (min/max, regex, enum)
- Add docstrings for JSON schema generation
- Verify mypy --strict passes
</description>
<priority>1</priority>
<category>infrastructure</category>
<test_steps>
1. Run: mypy mcp_tools/schemas.py --strict
2. Verify no type errors
3. Test ParsePdfInput validation with invalid inputs
4. Test ParsePdfInput validation with valid inputs
5. Generate JSON schema from Pydantic models
6. Verify all fields have descriptions
</test_steps>
</feature_3>
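
As an illustration of the Field() validation requested in feature_3, one input schema might look like the sketch below. The `limit` field and all numeric bounds are assumptions; only `query`, `min_similarity`, `author_filter` and `work_filter` are named elsewhere in this spec. The `is_valid` helper is hypothetical and exists only to make the behaviour easy to check.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field, ValidationError


class SearchChunksInput(BaseModel):
    """Input for the search_chunks tool (illustrative field set)."""

    query: str = Field(..., min_length=1, description="Natural-language search query")
    limit: int = Field(10, ge=1, le=100, description="Maximum number of chunks to return")
    min_similarity: float = Field(0.0, ge=0.0, le=1.0, description="Similarity threshold")
    author_filter: Optional[str] = Field(None, description="Restrict results to one author")
    work_filter: Optional[str] = Field(None, description="Restrict results to one work")


def is_valid(payload: Dict[str, Any]) -> bool:
    """Return True when the payload passes schema validation."""
    try:
        SearchChunksInput(**payload)
        return True
    except ValidationError:
        return False
```

Giving every field a description keeps the generated JSON schema self-explanatory for the client LLM, which is what test step 6 checks.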

<feature_4>
<title>Parsing Tool - parse_pdf Implementation</title>
<description>
Implement the parse_pdf tool with optimal parameters pre-configured.

Tasks:
- Create mcp_tools/parsing_tools.py
- Implement parse_pdf tool handler
- Fixed parameters:
  - llm_provider="mistral"
  - llm_model="mistral-medium-latest"
  - use_semantic_chunking=True
  - use_ocr_annotations=True
  - ingest_to_weaviate=True
- Wrapper around pdf_pipeline.process_pdf_bytes()
- Handle file downloads for URL inputs
- Return comprehensive results (metadata, costs, file paths)
- Add error handling and logging
- Register tool with MCP server
</description>
<priority>1</priority>
<category>functional</category>
<test_steps>
1. Mock pdf_pipeline.process_pdf_bytes()
2. Call parse_pdf with local PDF path
3. Verify fixed parameters are used
4. Call parse_pdf with URL
5. Verify file download works
6. Check output contains all required fields
7. Verify costs are tracked and returned
</test_steps>
</feature_4>
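
The fixed-parameter wrapper from feature_4 could be sketched as below. The real signature of pdf_pipeline.process_pdf_bytes() is not given in this spec, so the pipeline function is injected as a callable and the `filename` keyword is an assumption; URL downloads are omitted.

```python
from pathlib import Path
from typing import Any, Callable, Dict

# Fixed parameters listed in the spec; callers cannot override them.
FIXED_PARAMS: Dict[str, Any] = {
    "llm_provider": "mistral",
    "llm_model": "mistral-medium-latest",
    "use_semantic_chunking": True,
    "use_ocr_annotations": True,
    "ingest_to_weaviate": True,
}

# Stand-in type for pdf_pipeline.process_pdf_bytes (signature assumed).
ProcessFn = Callable[..., Dict[str, Any]]


def parse_pdf(pdf_path: str, process_pdf_bytes: ProcessFn) -> Dict[str, Any]:
    """Read a local PDF and run the pipeline with the fixed parameters."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {pdf_path}")
    data = path.read_bytes()  # raises FileNotFoundError if the file is missing
    return process_pdf_bytes(data, filename=path.name, **FIXED_PARAMS)
```

Injecting the pipeline function also matches test step 1: a unit test can pass a fake and assert on the keyword arguments it receives.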

<feature_5>
<title>Retrieval Tool - search_chunks</title>
<description>
Implement semantic search on text chunks.

Tasks:
- Create mcp_tools/retrieval_tools.py
- Implement search_chunks tool handler
- Weaviate near_text query on Chunk collection
- Support filters: author, work, language, min_similarity
- Handle nested object properties (work.author, work.title)
- Return results with similarity scores
- Add error handling for Weaviate connection
- Register tool with MCP server
</description>
<priority>2</priority>
<category>functional</category>
<test_steps>
1. Mock Weaviate client
2. Call search_chunks with query="justice"
3. Verify near_text query is called
4. Test author_filter applies correct nested filter
5. Test work_filter applies correct nested filter
6. Test min_similarity threshold works
7. Verify results include all metadata fields
</test_steps>
</feature_5>
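
One possible way to apply the min_similarity threshold from feature_5 is to post-filter the hits returned by the vector search. The sketch below assumes each hit is a plain dict carrying a cosine `distance` and that similarity is computed as 1 - distance; both the hit shape and that conversion are assumptions, and the Weaviate query itself is not shown.

```python
from typing import Any, Dict, List


def apply_min_similarity(
    hits: List[Dict[str, Any]], min_similarity: float
) -> List[Dict[str, Any]]:
    """Keep hits whose similarity (assumed to be 1 - distance) meets the threshold."""
    kept: List[Dict[str, Any]] = []
    for hit in hits:
        similarity = 1.0 - hit["distance"]
        if similarity >= min_similarity:
            # Copy the hit and attach the score so callers see it in the output.
            kept.append({**hit, "similarity": round(similarity, 4)})
    return kept
```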

<feature_6>
<title>Retrieval Tool - search_summaries</title>
<description>
Implement search in chapter/section summaries.

Tasks:
- Add search_summaries handler to retrieval_tools.py
- Query Summary collection with near_text
- Support level filters (min_level, max_level)
- Return summaries with hierarchical metadata
- Add error handling
- Register tool with MCP server
</description>
<priority>2</priority>
<category>functional</category>
<test_steps>
1. Mock Weaviate Summary collection
2. Call search_summaries with query
3. Verify near_text query on Summary
4. Test min_level filter
5. Test max_level filter
6. Verify results include sectionPath and concepts
</test_steps>
</feature_6>

<feature_7>
<title>Retrieval Tool - get_document</title>
<description>
Retrieve complete document metadata and chunks by sourceId.

Tasks:
- Add get_document handler to retrieval_tools.py
- Query Document collection by sourceId
- Optionally fetch related chunks
- Support chunk_limit parameter
- Return complete document data with TOC and hierarchy
- Add error handling for missing documents
- Register tool with MCP server
</description>
<priority>2</priority>
<category>functional</category>
<test_steps>
1. Mock Weaviate Document collection
2. Call get_document with valid sourceId
3. Verify Document query by sourceId
4. Test include_chunks=True fetches chunks
5. Test include_chunks=False skips chunks
6. Test chunk_limit parameter
7. Verify error on missing document
</test_steps>
</feature_7>

<feature_8>
<title>Retrieval Tool - list_documents</title>
<description>
List all documents with filtering and pagination.

Tasks:
- Add list_documents handler to retrieval_tools.py
- Query Document collection with filters
- Support author, work, language filters
- Implement pagination (limit, offset)
- Return document summaries with counts
- Add error handling
- Register tool with MCP server
</description>
<priority>2</priority>
<category>functional</category>
<test_steps>
1. Mock Weaviate Document collection
2. Call list_documents without filters
3. Verify all documents returned
4. Test author_filter
5. Test work_filter
6. Test pagination (limit=10, offset=5)
7. Verify total count is accurate
</test_steps>
</feature_8>

<feature_9>
<title>Retrieval Tool - get_chunks_by_document</title>
<description>
Retrieve all chunks for a document in sequential order.

Tasks:
- Add get_chunks_by_document handler to retrieval_tools.py
- Query Chunk collection filtered by document.sourceId
- Order results by orderIndex
- Support pagination (limit, offset)
- Support section_filter for specific sections
- Return ordered chunks with document metadata
- Add error handling
- Register tool with MCP server
</description>
<priority>2</priority>
<category>functional</category>
<test_steps>
1. Mock Weaviate Chunk collection
2. Call get_chunks_by_document with document_id
3. Verify filter by document.sourceId
4. Verify ordering by orderIndex
5. Test pagination
6. Test section_filter parameter
7. Verify document metadata is included
</test_steps>
</feature_9>
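
The ordering and pagination rules of feature_9 can be illustrated on plain dicts, independently of the Weaviate query. In this sketch the `sectionPath` key on chunks, the prefix-match semantics of section_filter and the default limit are assumptions; only `orderIndex`, `limit`, `offset` and `section_filter` come from the spec.

```python
from typing import Any, Dict, List, Optional


def page_chunks(
    chunks: List[Dict[str, Any]],
    limit: int = 50,
    offset: int = 0,
    section_filter: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Filter by section prefix, order by orderIndex, then slice one page."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset >= 0")
    selected = [
        c
        for c in chunks
        if section_filter is None
        or str(c.get("sectionPath", "")).startswith(section_filter)
    ]
    # Sorting happens before slicing so every page follows document order.
    ordered = sorted(selected, key=lambda c: c["orderIndex"])
    return ordered[offset : offset + limit]
```

In a real handler the filter and ordering would more likely be pushed down into the database query; the in-memory version only makes the expected result explicit for the test steps.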

<feature_10>
<title>Retrieval Tool - filter_by_author</title>
<description>
Get all works and documents by a specific author.

Tasks:
- Add filter_by_author handler to retrieval_tools.py
- Query Work collection by author
- Fetch related Documents for each work
- Optionally aggregate chunk counts
- Return hierarchical structure (works → documents)
- Add error handling
- Register tool with MCP server
</description>
<priority>2</priority>
<category>functional</category>
<test_steps>
1. Mock Weaviate Work and Document collections
2. Call filter_by_author with author="Platon"
3. Verify Work query by author
4. Verify Documents are fetched for each work
5. Test include_chunks=True aggregates counts
6. Test include_chunks=False skips counts
7. Verify hierarchical structure in output
</test_steps>
</feature_10>
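
The hierarchical output of feature_10 (works, each with its documents) might be assembled as in the sketch below. The dict keys `title`, `author`, `work_title` and `sourceId` and the output shape are assumptions made for the example; the spec does not define the response format.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_documents_by_work(
    author: str,
    works: List[Dict[str, Any]],
    documents: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Nest each document under the work it belongs to, for one author."""
    docs_by_work: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for doc in documents:
        docs_by_work[doc["work_title"]].append(doc)
    return {
        "author": author,
        "works": [
            {"title": w["title"], "documents": docs_by_work.get(w["title"], [])}
            for w in works
            if w["author"] == author
        ],
    }
```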

<feature_11>
<title>Retrieval Tool - delete_document</title>
<description>
Delete a document and all its chunks from Weaviate.

Tasks:
- Add delete_document handler to retrieval_tools.py
- Use weaviate_ingest.delete_document_chunks()
- Require explicit confirmation flag
- Return deletion statistics (chunks, summaries, document)
- Add safety checks
- Add error handling
- Register tool with MCP server
</description>
<priority>3</priority>
<category>functional</category>
<test_steps>
1. Mock weaviate_ingest.delete_document_chunks()
2. Call delete_document without confirm=True
3. Verify error is raised
4. Call delete_document with confirm=True
5. Verify delete_document_chunks is called
6. Verify deletion statistics are returned
7. Test error handling for missing document
</test_steps>
</feature_11>
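
The confirmation guard in feature_11 can be sketched as a thin wrapper that refuses to call the delete function unless confirm=True. The signature and return value of weaviate_ingest.delete_document_chunks() are not specified here, so it is injected as `delete_fn` and assumed to return a dict of counts; the choice of ValueError is also an assumption.

```python
from typing import Any, Callable, Dict

# Stand-in for weaviate_ingest.delete_document_chunks (signature assumed).
DeleteFn = Callable[[str], Dict[str, int]]


def delete_document(source_id: str, confirm: bool, delete_fn: DeleteFn) -> Dict[str, Any]:
    """Delete a document only when the caller confirms explicitly."""
    if not confirm:
        raise ValueError("Deletion requires confirm=True")
    if not source_id.strip():
        raise ValueError("source_id must not be empty")
    stats = delete_fn(source_id)
    return {"source_id": source_id, "deleted": stats}
```

Checking the flag before anything else means an unconfirmed call can never reach the database, which is the property test steps 2 and 3 verify.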

<feature_12>
<title>Error Handling & Structured Logging</title>
<description>
Implement comprehensive error handling and logging across all tools.

Tasks:
- Define custom exception classes:
  - WeaviateConnectionError
  - PDFProcessingError
  - ValidationError
  - MCPToolError
- Add try-except in all tool handlers
- Convert Python exceptions to MCP error responses
- Implement structured JSON logging
- Log all tool invocations (name, inputs, duration, costs)
- Log Weaviate queries and results
- Configure log level from environment
- Never expose sensitive data in logs
</description>
<priority>2</priority>
<category>infrastructure</category>
<test_steps>
1. Test WeaviateConnectionError is raised on connection failure
2. Verify error is converted to MCP error format
3. Test PDFProcessingError on invalid PDF
4. Verify all errors are logged with context
5. Check logs contain tool name, inputs, duration
6. Verify sensitive data (API keys) not in logs
7. Test LOG_LEVEL environment variable works
</test_steps>
</feature_12>
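
A minimal sketch of the structured JSON logging and redaction described in feature_12, using only the standard library. The logger name, the `tool` attribute passed through `extra`, and the list of sensitive keys are assumptions. The handler writes to stderr, which keeps stdout free for the stdio transport.

```python
import json
import logging
import os
from typing import Any, Dict

# Assumed list of input keys that must never reach the logs.
SENSITIVE_KEYS = {"api_key", "mistral_api_key", "authorization"}


def redact(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Mask the values of sensitive keys before logging tool inputs."""
    return {k: ("***" if k.lower() in SENSITIVE_KEYS else v) for k, v in inputs.items()}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload: Dict[str, Any] = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Tool context is attached by callers via extra={"tool": {...}}.
        payload.update(getattr(record, "tool", {}))
        return json.dumps(payload, default=str)


def get_logger(name: str = "library_rag.mcp") -> logging.Logger:
    """Return a logger whose level comes from the LOG_LEVEL variable."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()  # defaults to sys.stderr
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(os.environ.get("LOG_LEVEL", "INFO").upper())
    return logger
```

A tool handler could then log an invocation with `get_logger().info("tool_call", extra={"tool": {"name": "search_chunks", "inputs": redact(inputs), "duration_ms": elapsed}})`, where `inputs` and `elapsed` are the handler's own values.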

<feature_13>
<title>MCP Server Documentation</title>
<description>
Create comprehensive documentation for MCP server usage.

Tasks:
- Create MCP_README.md with:
  - Overview of capabilities
  - Installation instructions
  - Environment variable configuration
  - Tool descriptions with examples
  - Claude Desktop integration guide
  - Troubleshooting section
- Add docstrings to all tool handlers
- Create .env.example file
- Document error codes and messages
- Add usage examples for each tool
</description>
<priority>3</priority>
<category>documentation</category>
<test_steps>
1. Review MCP_README.md for completeness
2. Verify installation instructions work
3. Check all tools documented with examples
4. Test Claude Desktop integration steps
5. Verify .env.example has all variables
6. Check troubleshooting covers common issues
</test_steps>
</feature_13>

<feature_14>
<title>Unit Tests - Parsing Tool</title>
<description>
Implement unit tests for parse_pdf tool.

Tasks:
- Create tests/mcp/ directory
- Create tests/mcp/conftest.py with fixtures
- Create tests/mcp/test_parsing_tools.py
- Test parse_pdf with valid PDF path
- Test parse_pdf with URL (mock download)
- Test parse_pdf error handling
- Mock pdf_pipeline.process_pdf_bytes()
- Verify fixed parameters are used
- Test cost tracking in output
- Use pytest-mock for mocking
- Target >80% coverage
</description>
<priority>3</priority>
<category>testing</category>
<test_steps>
1. Run: pytest tests/mcp/test_parsing_tools.py -v
2. Verify all tests pass
3. Run: pytest tests/mcp/test_parsing_tools.py --cov
4. Verify coverage >80%
5. Check mocks are used (no real API calls)
6. Verify error cases are tested
</test_steps>
</feature_14>

<feature_15>
<title>Unit Tests - Retrieval Tools</title>
<description>
Implement unit tests for all 7 retrieval tools.

Tasks:
- Create tests/mcp/test_retrieval_tools.py
- Test search_chunks with filters
- Test search_summaries with level filters
- Test get_document with/without chunks
- Test list_documents pagination
- Test get_chunks_by_document ordering
- Test filter_by_author aggregation
- Test delete_document confirmation
- Mock all Weaviate queries
- Test error handling for all tools
- Target >80% coverage
</description>
<priority>3</priority>
<category>testing</category>
<test_steps>
1. Run: pytest tests/mcp/test_retrieval_tools.py -v
2. Verify all tests pass
3. Run: pytest tests/mcp/test_retrieval_tools.py --cov
4. Verify coverage >80%
5. Check all 7 tools have tests
6. Verify error cases are tested
7. Check no real Weaviate connections
</test_steps>
</feature_15>
</implementation_steps>

<testing_strategy>
<unit_tests>
<structure>
tests/mcp/
├── conftest.py (fixtures for mocks)
├── test_parsing_tools.py
├── test_retrieval_tools.py
├── test_schemas.py
└── test_config.py
</structure>

<mocking>
- pytest-mock for mocking
- Mock Mistral API responses
- Mock Weaviate client
- Fixtures for test data
- No real API calls in unit tests
</mocking>
</unit_tests>

<no_ui_tests>
<critical>
**NO PUPPETEER TESTS**

This is an MCP server with no web interface.
Focus on:
- Tool functionality (unit tests)
- Schema validation (Pydantic)
- Error handling
</critical>
</no_ui_tests>
</testing_strategy>

<deployment>
<claude_desktop_integration>
Configuration in claude_desktop_config.json (%APPDATA%\Claude\claude_desktop_config.json on Windows, ~/Library/Application Support/Claude/claude_desktop_config.json on macOS):

{
  "mcpServers": {
    "library-rag": {
      "command": "python",
      "args": ["C:/GitHub/linear_coding_library_rag/generations/library_rag/mcp_server.py"],
      "env": {
        "MISTRAL_API_KEY": "your-key-here"
      }
    }
  }
}
</claude_desktop_integration>
</deployment>

<success_criteria>
<functional>
- 8 working tools (1 parsing + 7 retrieval)
- parse_pdf with fixed optimal parameters
- Working Weaviate semantic search
- Complete error handling
</functional>

<quality>
- mypy --strict passes
- Coverage >80%
- Complete docstrings
</quality>

<performance>
- parse_pdf: <2 min per document
- search_chunks: <1 s
</performance>
</success_criteria>

<cost_tracking>
<total_estimate>
300-page document with the optimal configuration:
- OCR: ~€2.70
- LLM: ~€0.15
- Total: ~€2.85

Returned in the parse_pdf output for tracking.
</total_estimate>
</cost_tracking>
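
The estimate above works out to roughly €0.009 per page for OCR (2.70 / 300) and €0.0005 per page for the LLM (0.15 / 300). The sketch below turns that arithmetic into a helper; the linear per-page scaling and the function name are assumptions, and actual costs depend on the provider's pricing and on document content.

```python
from typing import Dict

# Per-page rates derived from the 300-page estimate in this spec (approximate).
OCR_EUR_PER_PAGE = 2.70 / 300
LLM_EUR_PER_PAGE = 0.15 / 300


def estimate_cost(pages: int) -> Dict[str, float]:
    """Estimate processing cost in euros, assuming cost scales linearly with pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    ocr = round(pages * OCR_EUR_PER_PAGE, 2)
    llm = round(pages * LLM_EUR_PER_PAGE, 2)
    return {"ocr_eur": ocr, "llm_eur": llm, "total_eur": round(ocr + llm, 2)}
```

A dict of this shape is one way the cost fields could be carried in the parse_pdf output; the key names here are illustrative only.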
</project_specification>