Add Library RAG project and cleanup root directory
- Add complete Library RAG application (Flask + MCP server)
- PDF processing pipeline with OCR and LLM extraction
- Weaviate vector database integration (BGE-M3 embeddings)
- Flask web interface with search and document management
- MCP server for Claude Desktop integration
- Comprehensive test suite (134 tests)
- Clean up root directory
- Remove obsolete documentation files
- Remove backup and temporary files
- Update autonomous agent configuration
- Update prompts
- Enhance initializer bis prompt with better instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
generations/library_rag/schema.py · 535 lines · new file
"""Weaviate schema definition for Library RAG - Philosophical Texts Database.

This module defines and manages the Weaviate vector database schema for the
Library RAG application. It provides functions to create, verify, and display
the schema configuration for indexing and searching philosophical texts.

Schema Architecture:
    The schema follows a normalized design with denormalized nested objects
    for efficient querying. The hierarchy is::

        Work (metadata only)
        └── Document (edition/translation instance)
            ├── Chunk (vectorized text fragments)
            └── Summary (vectorized chapter summaries)

Collections:
    **Work** (no vectorization):
        Represents a philosophical or scholarly work (e.g., Plato's Meno).
        Stores canonical metadata: title, author, year, language, genre.
        Not vectorized - used only for metadata and relationships.

    **Document** (no vectorization):
        Represents a specific edition or translation of a Work.
        Contains: sourceId, edition, language, pages, TOC, hierarchy.
        Includes nested Work reference for denormalized access.

    **Chunk** (vectorized with text2vec-transformers):
        Text fragments optimized for semantic search (200-800 chars).
        Vectorized fields: text, keywords.
        Non-vectorized fields: sectionPath, chapterTitle, unitType, orderIndex.
        Includes nested Document and Work references.

    **Summary** (vectorized with text2vec-transformers):
        LLM-generated chapter/section summaries for high-level search.
        Vectorized fields: text, concepts.
        Includes nested Document reference.

Vectorization Strategy:
    - Only Chunk.text, Chunk.keywords, Summary.text, and Summary.concepts
      are vectorized
    - Uses text2vec-transformers (BAAI/bge-m3 with 1024-dim via Docker)
    - Metadata fields use skip_vectorization=True for filtering only
    - Work and Document collections have no vectorizer (metadata only)

Migration Note (2024-12):
    Migrated from MiniLM-L6 (384-dim) to BAAI/bge-m3 (1024-dim) for:
    - 2.7x richer semantic representation
    - 8192 token context (vs 512)
    - Superior multilingual support (Greek, Latin, French, English)
    - Better performance on philosophical/academic texts

Nested Objects:
    Instead of using Weaviate cross-references, we use nested objects for
    denormalized data access. This allows single-query retrieval of chunk
    data with its Work/Document metadata without joins::

        Chunk.work = {title, author}
        Chunk.document = {sourceId, edition}
        Document.work = {title, author}
        Summary.document = {sourceId}

Usage:
    From command line::

        $ python schema.py

    Programmatically::

        import weaviate
        from schema import create_schema, verify_schema

        with weaviate.connect_to_local() as client:
            create_schema(client, delete_existing=True)
            verify_schema(client)

    Check existing schema::

        from schema import display_schema

        with weaviate.connect_to_local() as client:
            display_schema(client)

Dependencies:
    - Weaviate Python client v4+
    - Running Weaviate instance with text2vec-transformers module
    - Docker Compose setup from docker-compose.yml

See Also:
    - utils/weaviate_ingest.py : Functions to ingest data into this schema
    - utils/types.py : TypedDict definitions matching schema structure
    - docker-compose.yml : Weaviate + transformers container setup
"""
import sys
from typing import List, Set

import weaviate
import weaviate.classes.config as wvc


# =============================================================================
# Schema Creation Functions
# =============================================================================


def create_work_collection(client: weaviate.WeaviateClient) -> None:
    """Create the Work collection for philosophical works metadata.

    Args:
        client: Connected Weaviate client.

    Note:
        This collection has no vectorization - used only for metadata.
    """
    client.collections.create(
        name="Work",
        description="A philosophical or scholarly work (e.g., Meno, Republic, Apology).",
        vectorizer_config=wvc.Configure.Vectorizer.none(),
        properties=[
            wvc.Property(
                name="title",
                description="Title of the work.",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="author",
                description="Author of the work.",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="originalTitle",
                description="Original title in source language (optional).",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="year",
                description="Year of composition or publication (negative for BCE).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="language",
                description="Original language (e.g., 'gr', 'la', 'fr').",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="genre",
                description="Genre or type (e.g., 'dialogue', 'treatise', 'commentary').",
                data_type=wvc.DataType.TEXT,
            ),
        ],
    )


def create_document_collection(client: weaviate.WeaviateClient) -> None:
    """Create the Document collection for edition/translation instances.

    Args:
        client: Connected Weaviate client.

    Note:
        Contains nested Work reference for denormalized access.
    """
    client.collections.create(
        name="Document",
        description="A specific edition or translation of a work (PDF, ebook, etc.).",
        vectorizer_config=wvc.Configure.Vectorizer.none(),
        properties=[
            wvc.Property(
                name="sourceId",
                description="Unique identifier for this document (filename without extension).",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="edition",
                description="Edition or translator (e.g., 'trad. Cousin', 'Loeb Classical Library').",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="language",
                description="Language of this edition (e.g., 'fr', 'en').",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="pages",
                description="Number of pages in the PDF/document.",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="chunksCount",
                description="Total number of chunks extracted from this document.",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="toc",
                description="Table of contents as JSON string [{title, level, page}, ...].",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="hierarchy",
                description="Full hierarchical structure as JSON string.",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="createdAt",
                description="Timestamp when this document was ingested.",
                data_type=wvc.DataType.DATE,
            ),
            # Nested Work reference
            wvc.Property(
                name="work",
                description="Reference to the Work this document is an instance of.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="title", data_type=wvc.DataType.TEXT),
                    wvc.Property(name="author", data_type=wvc.DataType.TEXT),
                ],
            ),
        ],
    )


def create_chunk_collection(client: weaviate.WeaviateClient) -> None:
    """Create the Chunk collection for vectorized text fragments.

    Args:
        client: Connected Weaviate client.

    Note:
        Uses text2vec-transformers for vectorizing 'text' and 'keywords' fields.
        Other fields have skip_vectorization=True for filtering only.
    """
    client.collections.create(
        name="Chunk",
        description="A text chunk (paragraph, argument, etc.) vectorized for semantic search.",
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        properties=[
            # Main content (vectorized)
            wvc.Property(
                name="text",
                description="The text content to be vectorized (200-800 chars optimal).",
                data_type=wvc.DataType.TEXT,
            ),
            # Hierarchical context (not vectorized, for filtering)
            wvc.Property(
                name="sectionPath",
                description="Full hierarchical path (e.g., 'Présentation > Qu'est-ce que la vertu?').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="sectionLevel",
                description="Depth in hierarchy (1=top-level, 2=subsection, etc.).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="chapterTitle",
                description="Title of the top-level chapter/section.",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="canonicalReference",
                description="Canonical academic reference (e.g., 'CP 1.628', 'Ménon 80a').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            # Classification (not vectorized, for filtering)
            wvc.Property(
                name="unitType",
                description="Type of logical unit (main_content, argument, exposition, transition, définition).",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="keywords",
                description="Key concepts extracted from this chunk (vectorized for semantic search).",
                data_type=wvc.DataType.TEXT_ARRAY,
            ),
            # Technical metadata (not vectorized)
            wvc.Property(
                name="orderIndex",
                description="Sequential position in the document (0-based).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="language",
                description="Language of this chunk (e.g., 'fr', 'en', 'gr').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            # Cross references (nested objects)
            wvc.Property(
                name="document",
                description="Reference to parent Document with essential metadata.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="sourceId", data_type=wvc.DataType.TEXT),
                    wvc.Property(name="edition", data_type=wvc.DataType.TEXT),
                ],
            ),
            wvc.Property(
                name="work",
                description="Reference to the Work with essential metadata.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="title", data_type=wvc.DataType.TEXT),
                    wvc.Property(name="author", data_type=wvc.DataType.TEXT),
                ],
            ),
        ],
    )


def create_summary_collection(client: weaviate.WeaviateClient) -> None:
    """Create the Summary collection for chapter/section summaries.

    Args:
        client: Connected Weaviate client.

    Note:
        Uses text2vec-transformers for vectorizing summary text.
    """
    client.collections.create(
        name="Summary",
        description="Chapter or section summary, vectorized for high-level semantic search.",
        vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(
            vectorize_collection_name=False,
        ),
        properties=[
            wvc.Property(
                name="sectionPath",
                description="Hierarchical path (e.g., 'Chapter 1 > Section 2').",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="title",
                description="Title of the section.",
                data_type=wvc.DataType.TEXT,
                skip_vectorization=True,
            ),
            wvc.Property(
                name="level",
                description="Hierarchy depth (1=chapter, 2=section, 3=subsection).",
                data_type=wvc.DataType.INT,
            ),
            wvc.Property(
                name="text",
                description="LLM-generated summary of the section content (VECTORIZED).",
                data_type=wvc.DataType.TEXT,
            ),
            wvc.Property(
                name="concepts",
                description="Key philosophical concepts in this section.",
                data_type=wvc.DataType.TEXT_ARRAY,
            ),
            wvc.Property(
                name="chunksCount",
                description="Number of chunks in this section.",
                data_type=wvc.DataType.INT,
            ),
            # Reference to Document
            wvc.Property(
                name="document",
                description="Reference to parent Document.",
                data_type=wvc.DataType.OBJECT,
                nested_properties=[
                    wvc.Property(name="sourceId", data_type=wvc.DataType.TEXT),
                ],
            ),
        ],
    )


def create_schema(client: weaviate.WeaviateClient, delete_existing: bool = True) -> None:
    """Create the complete Weaviate schema for Library RAG.

    Creates all four collections: Work, Document, Chunk, Summary.

    Args:
        client: Connected Weaviate client.
        delete_existing: If True, delete all existing collections first.

    Raises:
        Exception: If collection creation fails.
    """
    if delete_existing:
        print("\n[1/4] Deleting existing collections...")
        client.collections.delete_all()
        print("  ✓ Collections deleted")

    print("\n[2/4] Creating collections...")

    print("  → Work (work metadata)...")
    create_work_collection(client)

    print("  → Document (edition metadata)...")
    create_document_collection(client)

    print("  → Chunk (vectorized fragments)...")
    create_chunk_collection(client)

    print("  → Summary (chapter summaries)...")
    create_summary_collection(client)

    print("  ✓ 4 collections created")


def verify_schema(client: weaviate.WeaviateClient) -> bool:
    """Verify that all expected collections exist.

    Args:
        client: Connected Weaviate client.

    Returns:
        True if all expected collections exist, False otherwise.
    """
    print("\n[3/4] Verifying collections...")
    collections = client.collections.list_all()

    expected: Set[str] = {"Work", "Document", "Chunk", "Summary"}
    actual: Set[str] = set(collections.keys())

    if expected == actual:
        print(f"  ✓ All collections created: {sorted(actual)}")
        return True
    else:
        missing: Set[str] = expected - actual
        extra: Set[str] = actual - expected
        if missing:
            print(f"  ✗ Missing collections: {missing}")
        if extra:
            print(f"  ⚠ Unexpected collections: {extra}")
        return False


def display_schema(client: weaviate.WeaviateClient) -> None:
    """Display detailed information about schema collections.

    Args:
        client: Connected Weaviate client.
    """
    print("\n[4/4] Details of the created collections:")
    print("=" * 80)

    collections = client.collections.list_all()

    for name in ["Work", "Document", "Chunk", "Summary"]:
        if name not in collections:
            continue

        config = collections[name]
        print(f"\n📦 {name}")
        print("─" * 80)
        print(f"Description: {config.description}")

        # Vectorizer
        vectorizer_str: str = str(config.vectorizer)
        if "text2vec" in vectorizer_str.lower():
            print("Vectorizer: text2vec-transformers ✓")
        else:
            print("Vectorizer: none")

        # Properties
        print("\nProperties:")
        for prop in config.properties:
            # Data type
            dtype: str = str(prop.data_type).split('.')[-1]

            # Skip vectorization flag
            skip: str = ""
            if hasattr(prop, 'skip_vectorization') and prop.skip_vectorization:
                skip = " [skip_vec]"

            # Nested properties
            nested: str = ""
            if hasattr(prop, 'nested_properties') and prop.nested_properties:
                nested_names: List[str] = [p.name for p in prop.nested_properties]
                nested = f" → {{{', '.join(nested_names)}}}"

            print(f"  • {prop.name:<20} {dtype:<15} {skip}{nested}")


def print_summary() -> None:
    """Print a summary of the schema architecture."""
    print("\n" + "=" * 80)
    print("SCHEMA CREATED SUCCESSFULLY!")
    print("=" * 80)
    print("\n✓ Architecture:")
    print("  - Work: Single source of truth for author/title")
    print("  - Document: Edition metadata with a reference to Work")
    print("  - Chunk: Vectorized fragments (text + keywords)")
    print("  - Summary: Vectorized chapter summaries (text)")
    print("\n✓ Vectorization:")
    print("  - Work: NONE")
    print("  - Document: NONE")
    print("  - Chunk: text2vec (text + keywords)")
    print("  - Summary: text2vec (text)")
    print("=" * 80)


# =============================================================================
# Main Script Execution
# =============================================================================


def main() -> None:
    """Main entry point for schema creation script."""
    # Fix encoding for Windows console
    if sys.platform == "win32" and hasattr(sys.stdout, 'reconfigure'):
        sys.stdout.reconfigure(encoding='utf-8')

    print("=" * 80)
    print("WEAVIATE SCHEMA CREATION - PHILOSOPHICAL TEXTS DATABASE")
    print("=" * 80)

    # Connect to local Weaviate
    client: weaviate.WeaviateClient = weaviate.connect_to_local(
        host="localhost",
        port=8080,
        grpc_port=50051,
    )

    try:
        create_schema(client, delete_existing=True)
        verify_schema(client)
        display_schema(client)
        print_summary()
    finally:
        client.close()
        print("\n✓ Connection closed\n")


if __name__ == "__main__":
    main()
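The nested-object denormalization that the module docstring describes can be sketched with plain dictionaries, independent of a running Weaviate instance. This is a minimal sketch: the field names come from the collection definitions above, while the sample values (a Meno chunk, a Cousin translation) are invented for illustration.

```python
# Sketch of the denormalized payload shape the Chunk collection stores.
# Field names follow the schema above; the values are illustrative only.
chunk = {
    "text": "Virtue, then, would be a kind of knowledge.",
    "sectionPath": "Presentation > What is virtue?",
    "sectionLevel": 2,
    "chapterTitle": "Presentation",
    "canonicalReference": "Menon 87c",
    "unitType": "argument",
    "keywords": ["virtue", "knowledge"],
    "orderIndex": 42,
    "language": "en",
    # Nested objects (instead of cross-references) let a single query
    # return a search hit together with its provenance, no joins needed.
    "document": {"sourceId": "menon_cousin", "edition": "trad. Cousin"},
    "work": {"title": "Menon", "author": "Plato"},
}

# A hit can be rendered with a full citation in one pass over the object:
citation = (
    f"{chunk['work']['author']}, {chunk['work']['title']} "
    f"({chunk['document']['edition']}), {chunk['canonicalReference']}"
)
print(citation)  # Plato, Menon (trad. Cousin), Menon 87c
```

The trade-off, as the docstring notes, is duplication: `work.title` and `work.author` are copied into every chunk, so renaming a work means re-ingesting its chunks.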