Fix: Word pipeline + simplified upload UI

word_pipeline.py fixes:
- Robust handling of LLM errors (fallback to Word metadata)
- Fix: s["section_type"] -> s.get("type") for classification
- Fix: "section_type" -> "type" in the fallback path (use_llm=False)
- Added try/except around extract_metadata with automatic fallback
- Word properties are used if the LLM fails or returns None
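The fallback logic above can be sketched as a small standalone helper. This is a minimal illustration, not the pipeline's actual code: `extract_metadata_safe`, its parameters, and the trimmed-down `Metadata` dataclass are hypothetical stand-ins mirroring the names in the diff below.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Metadata:
    # Trimmed-down stand-in for the pipeline's Metadata type
    title: str
    author: str
    year: Optional[int] = None
    language: str = "unknown"

def extract_metadata_safe(
    markdown_text: str,
    raw_meta: dict,
    doc_name: str,
    extract_metadata: Callable[[str], Optional[Metadata]],
) -> Metadata:
    """Try LLM extraction; fall back to Word document properties."""
    try:
        metadata = extract_metadata(markdown_text)
    except Exception:
        metadata = None  # treat an LLM error the same as a None result
    if metadata is None:
        created = raw_meta.get("created")
        metadata = Metadata(
            title=raw_meta.get("title") or doc_name,
            author=raw_meta.get("author") or "Unknown",
            year=created.year if created else None,
            language=raw_meta.get("language", "unknown"),
        )
    return metadata
```

The key design point from the commit: both failure modes (an exception and a `None` return) converge on the same Word-properties fallback, so the pipeline never aborts on a flaky LLM call.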
upload.html rework (simplified interface):
- Clear UI with 2 main options (LLM + Weaviate)
- PDF options hidden automatically for Word/Markdown files
- Green "Word file detected" banner shown automatically
- Orange "Markdown file detected" banner added
- Advanced options collapsible (<details>)
- Pipeline adapts to the file type
- .md support added (missing from the previous version)
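The "pipeline adapts to the file type" behavior boils down to routing on the upload's extension. A minimal server-side sketch, with assumed names (`PIPELINES`, `pick_pipeline` are illustrative; the real routing lives in the upload view):

```python
from pathlib import Path

# Hypothetical mapping from file extension to pipeline name.
PIPELINES = {".pdf": "pdf", ".docx": "word", ".md": "markdown"}

def pick_pipeline(filename: str) -> str:
    """Return the pipeline name for an uploaded file, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    try:
        return PIPELINES[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")
```

Keeping the mapping in one table is what made the earlier bug possible to miss: forgetting one entry (here, `.md`) silently rejects a whole file type.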
Problem solved:
❌ BEFORE: too many options everywhere, confusing for the user
✅ AFTER: simple interface, 2 checkboxes, everything else pre-configured

Recommended usage:
1. Select a file (.pdf, .docx, .md)
2. The options adapt automatically
3. Click "🚀 Analyser le document"
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -249,18 +249,46 @@ def process_word(
         callback("Metadata Extraction", "running", "Extracting metadata with LLM...")
-        metadata = extract_metadata(
-            markdown_text,
-            provider=llm_provider,
-        )
-
-        # Note: extract_metadata doesn't return cost directly
-
-        callback(
-            "Metadata Extraction",
-            "completed",
-            f"Title: {metadata['title'][:50]}..., Author: {metadata['author']}",
-        )
+        try:
+            metadata_llm = extract_metadata(
+                markdown_text,
+                provider=llm_provider,
+            )
+
+            # Note: extract_metadata doesn't return cost directly
+
+            # Fallback to Word properties if LLM returns None
+            if metadata_llm is None:
+                callback(
+                    "Metadata Extraction",
+                    "completed",
+                    "LLM extraction failed, using Word properties",
+                )
+                raw_meta = content["metadata_raw"]
+                metadata = Metadata(
+                    title=raw_meta.get("title", doc_name),
+                    author=raw_meta.get("author", "Unknown"),
+                    year=raw_meta.get("created").year if raw_meta.get("created") else None,
+                    language=raw_meta.get("language", "unknown"),
+                )
+            else:
+                metadata = metadata_llm
+                callback(
+                    "Metadata Extraction",
+                    "completed",
+                    f"Title: {metadata.get('title', '')[:50]}..., Author: {metadata.get('author', '')}",
+                )
+        except Exception as e:
+            callback(
+                "Metadata Extraction",
+                "completed",
+                f"LLM error ({str(e)}), using Word properties",
+            )
+            raw_meta = content["metadata_raw"]
+            metadata = Metadata(
+                title=raw_meta.get("title", doc_name),
+                author=raw_meta.get("author", "Unknown"),
+                year=raw_meta.get("created").year if raw_meta.get("created") else None,
+                language=raw_meta.get("language", "unknown"),
+            )
     else:
         # Use metadata from Word properties
         raw_meta = content["metadata_raw"]
@@ -303,7 +331,7 @@ def process_word(
         main_sections = [
             s for s in classified_sections
-            if s["section_type"] == "main_content"
+            if s.get("type") == "main_content"
         ]

         callback(
@@ -316,8 +344,9 @@ def process_word(
         classified_sections = [
             {
                 "section_path": entry["sectionPath"],
-                "section_type": "main_content",
-                "reason": "No LLM classification",
+                "type": "main_content",
+                "should_index": True,
+                "classification_reason": "No LLM classification",
             }
             for entry in toc_flat
         ]