Fix: Word pipeline + simplified upload UI

Fixes in word_pipeline.py:
- Robust handling of LLM errors (fallback to Word metadata)
- Fix: s["section_type"] -> s.get("type") for classification
- Fix: "section_type" -> "type" in the fallback (use_llm=False)
- Added try/except around extract_metadata with automatic fallback
- Word metadata is used if the LLM fails or returns None

Redesign of upload.html (simplified interface):
- Clear UI with 2 main options (LLM + Weaviate)
- PDF options hidden automatically for Word/Markdown
- Green "Fichier Word détecté" (Word file detected) notice appears automatically
- Orange "Fichier Markdown détecté" (Markdown file detected) notice added
- Collapsible advanced options (<details>)
- Pipeline adapts to the file type
- .md support added (missed in the previous version)

Problem solved:
❌ BEFORE: Too many options everywhere, confusing for the user
✅ AFTER: Simple interface, 2 checkboxes, everything else preconfigured

Recommended usage:
1. Select a file (.pdf, .docx, .md)
2. The options adapt automatically
3. Click "🚀 Analyser le document" (Analyze the document)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-30 22:34:28 +01:00
parent 4823fd1b10
commit 19713f22d6
2 changed files with 243 additions and 146 deletions
--- a/generations/library_rag/utils/word_pipeline.py
+++ b/generations/library_rag/utils/word_pipeline.py
@@ -249,18 +249,46 @@ def process_word(

            callback("Metadata Extraction", "running", "Extracting metadata with LLM...")

-            metadata = extract_metadata(
-                markdown_text,
-                provider=llm_provider,
-            )
+            try:
+                metadata_llm = extract_metadata(
+                    markdown_text,
+                    provider=llm_provider,
+                )

-            # Note: extract_metadata doesn't return cost directly
-
-            callback(
-                "Metadata Extraction",
-                "completed",
-                f"Title: {metadata['title'][:50]}..., Author: {metadata['author']}",
-            )
+                # Fallback to Word properties if LLM returns None
+                if metadata_llm is None:
+                    callback(
+                        "Metadata Extraction",
+                        "completed",
+                        "LLM extraction failed, using Word properties",
+                    )
+                    raw_meta = content["metadata_raw"]
+                    metadata = Metadata(
+                        title=raw_meta.get("title", doc_name),
+                        author=raw_meta.get("author", "Unknown"),
+                        year=raw_meta.get("created").year if raw_meta.get("created") else None,
+                        language=raw_meta.get("language", "unknown"),
+                    )
+                else:
+                    metadata = metadata_llm
+                    callback(
+                        "Metadata Extraction",
+                        "completed",
+                        f"Title: {metadata.get('title', '')[:50]}..., Author: {metadata.get('author', '')}",
+                    )
+            except Exception as e:
+                callback(
+                    "Metadata Extraction",
+                    "completed",
+                    f"LLM error ({str(e)}), using Word properties",
+                )
+                raw_meta = content["metadata_raw"]
+                metadata = Metadata(
+                    title=raw_meta.get("title", doc_name),
+                    author=raw_meta.get("author", "Unknown"),
+                    year=raw_meta.get("created").year if raw_meta.get("created") else None,
+                    language=raw_meta.get("language", "unknown"),
+                )
        else:
            # Use metadata from Word properties
            raw_meta = content["metadata_raw"]
@@ -303,7 +331,7 @@ def process_word(

            main_sections = [
                s for s in classified_sections
-                if s["section_type"] == "main_content"
+                if s.get("type") == "main_content"
            ]

            callback(
@@ -316,8 +344,9 @@ def process_word(
            classified_sections = [
                {
                    "section_path": entry["sectionPath"],
-                    "section_type": "main_content",
-                    "reason": "No LLM classification",
+                    "type": "main_content",
+                    "should_index": True,
+                    "classification_reason": "No LLM classification",
                }
                for entry in toc_flat
            ]
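
Review note: the first hunk builds the Word-properties fallback twice, once when the LLM returns None and once in the except branch. The sketch below shows one way the two branches could share a helper. It is not the code in this commit: `metadata_from_word_properties` and `extract_metadata_with_fallback` are hypothetical names, a plain dict stands in for the `Metadata` type (whose definition is not shown in the diff), and the extractor is passed in as a callable so the example runs without an LLM. It also uses `or` for defaults, so an empty title falls back to `doc_name`; the committed code uses `.get(key, default)`, which only falls back when the key is missing.

```python
from datetime import datetime
from typing import Any, Callable, Dict, Optional

# Stand-in for the project's Metadata type (assumption: it is dict-like).
Metadata = Dict[str, Any]
Callback = Callable[[str, str, str], None]


def metadata_from_word_properties(raw_meta: Dict[str, Any], doc_name: str) -> Metadata:
    """Build metadata from Word document properties, with safe defaults."""
    created: Optional[datetime] = raw_meta.get("created")
    return {
        "title": raw_meta.get("title") or doc_name,
        "author": raw_meta.get("author") or "Unknown",
        "year": created.year if created else None,
        "language": raw_meta.get("language") or "unknown",
    }


def extract_metadata_with_fallback(
    extract: Callable[[str], Optional[Metadata]],
    markdown_text: str,
    raw_meta: Dict[str, Any],
    doc_name: str,
    callback: Callback,
) -> Metadata:
    """Try the LLM extractor; fall back to Word properties on None or error."""
    reason = "LLM extraction failed"
    try:
        metadata = extract(markdown_text)
    except Exception as exc:  # any LLM failure triggers the same fallback
        metadata = None
        reason = f"LLM error ({exc})"

    if metadata is None:
        callback("Metadata Extraction", "completed", f"{reason}, using Word properties")
        return metadata_from_word_properties(raw_meta, doc_name)

    # Guard against a None title or author so the status message never raises.
    title = str(metadata.get("title") or "")[:50]
    author = metadata.get("author") or ""
    callback("Metadata Extraction", "completed", f"Title: {title}..., Author: {author}")
    return metadata
```

With a helper of this shape, the `use_llm=False` branch could call `metadata_from_word_properties` directly, keeping the defaults in one place.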