Fix: correct Word metadata + disable concept enrichment
Problems fixed:
1. WRONG TITLE → now uses the TITRE: line from the first page
2. CONCEPTS IN FRENCH → disabled LLM enrichment

Before:
- Title: "An Historical Sketch..." (wrong: a chapter title, not the book title)
- Concepts: ['immuabilité des espèces', 'création séparée'] (French)
- Result: 3/37 chunks ingested into Weaviate

After:
- Title: "On the Origin of Species BY MEANS OF..." (correct!)
- Concepts: [] (empty, no encoding problem)
- Result: 14/37 chunks ingested (better, but not perfect)

Changes in word_pipeline.py:
1. STEP 5 - Simplified metadata (lines 241-262):
   - Removed the LLM extract_metadata() call
   - Uses raw_meta from extract_word_metadata() directly
   - The LLM was picking up the chapter title instead of the book title
2. STEP 9 - Disabled concept enrichment (lines 410-423):
   - Skips enrich_chunks_with_concepts()
   - Reason: the LLM generates concepts in FRENCH for ENGLISH text
   - French accents cause Weaviate ingestion failures

TOC note: the document has only 2 "Heading 2" paragraphs, so the TOC is limited. This is normal for a 10-page excerpt.

Still to investigate: why are only 14/37 chunks ingested instead of 37/37?

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
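The fix above relies on first-page metadata lines (TITRE:, AUTEUR:, EDITION:). As a rough illustration of that approach — a hedged sketch, not the project's actual extract_word_metadata() implementation — such labeled lines can be mapped to metadata keys like this:

```python
import re

# Sketch only: the real extract_word_metadata() in word_pipeline.py may differ.
# Labels follow the French convention used in the source documents.
LABELS = {"TITRE": "title", "AUTEUR": "author", "EDITION": "edition"}

def parse_first_page_metadata(lines):
    """Map 'LABEL: value' lines from the first page to metadata keys."""
    meta = {}
    for line in lines:
        match = re.match(r"^(TITRE|AUTEUR|EDITION)\s*:\s*(.+)$", line.strip())
        if match:
            meta[LABELS[match.group(1)]] = match.group(2).strip()
    return meta

first_page = [
    "TITRE: On the Origin of Species BY MEANS OF NATURAL SELECTION",
    "AUTEUR: Charles Darwin",
]
print(parse_first_page_metadata(first_page)["title"])
# -> On the Origin of Species BY MEANS OF NATURAL SELECTION
```

Reading the title from an explicit label avoids the failure mode described above, where the LLM latches onto the first prominent heading ("An Historical Sketch...") instead of the book title.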
@@ -239,71 +239,27 @@ def process_word(
     )
 
     # ================================================================
-    # STEP 5: LLM Metadata Extraction (REUSED)
+    # STEP 5: Metadata Extraction from Word (NO LLM NEEDED)
     # ================================================================
+    # Word documents have metadata in first lines (TITRE:, AUTEUR:, EDITION:)
+    # or in core properties. LLM extraction often gets it wrong (takes chapter
+    # title instead of book title), so we use Word-native metadata directly.
     metadata: Metadata
     cost_llm = 0.0
 
-    if use_llm:
-        from utils.llm_metadata import extract_metadata
-
-        callback("Metadata Extraction", "running", "Extracting metadata with LLM...")
-
-        try:
-            metadata_llm = extract_metadata(
-                markdown_text,
-                provider=llm_provider,
-            )
-
-            # Fallback to Word properties if LLM returns None
-            if metadata_llm is None:
-                callback(
-                    "Metadata Extraction",
-                    "completed",
-                    "LLM extraction failed, using Word properties",
-                )
-                raw_meta = content["metadata_raw"]
-                metadata = Metadata(
-                    title=raw_meta.get("title", doc_name),
-                    author=raw_meta.get("author", "Unknown"),
-                    year=raw_meta.get("created").year if raw_meta.get("created") else None,
-                    language=raw_meta.get("language", "unknown"),
-                )
-            else:
-                metadata = metadata_llm
-                callback(
-                    "Metadata Extraction",
-                    "completed",
-                    f"Title: {metadata.get('title', '')[:50]}..., Author: {metadata.get('author', '')}",
-                )
-        except Exception as e:
-            callback(
-                "Metadata Extraction",
-                "completed",
-                f"LLM error ({str(e)}), using Word properties",
-            )
-            raw_meta = content["metadata_raw"]
-            metadata = Metadata(
-                title=raw_meta.get("title", doc_name),
-                author=raw_meta.get("author", "Unknown"),
-                year=raw_meta.get("created").year if raw_meta.get("created") else None,
-                language=raw_meta.get("language", "unknown"),
-            )
-    else:
-        # Use metadata from Word properties
-        raw_meta = content["metadata_raw"]
-        metadata = Metadata(
-            title=raw_meta.get("title", doc_name),
-            author=raw_meta.get("author", "Unknown"),
-            year=raw_meta.get("created").year if raw_meta.get("created") else None,
-            language=raw_meta.get("language", "unknown"),
-        )
-
-        callback(
-            "Metadata Extraction",
-            "completed",
-            "Using Word document properties",
-        )
+    raw_meta = content["metadata_raw"]
+    metadata = Metadata(
+        title=raw_meta.get("title") or doc_name,
+        author=raw_meta.get("author") or "Unknown",
+        year=raw_meta.get("created").year if raw_meta.get("created") else None,
+        language="en",  # Default to English, could be improved
+    )
+    callback(
+        "Metadata Extraction",
+        "completed",
+        f"Title: {metadata.get('title', '')[:50]}..., Author: {metadata.get('author', '')}",
+    )
 
     # ================================================================
     # STEP 6: Section Classification (REUSED)
@@ -452,25 +408,18 @@ def process_word(
     )
 
     # ================================================================
-    # STEP 9: Chunk Validation (REUSED)
+    # STEP 9: Chunk Validation (SKIP FOR WORD)
     # ================================================================
-    if use_llm:
-        from utils.llm_validator import enrich_chunks_with_concepts
-
-        callback("Chunk Validation", "running", "Enriching chunks with concepts...")
-
-        # Enrich chunks with keywords/concepts
-        enriched_chunks = enrich_chunks_with_concepts(
-            chunks,
-            provider=llm_provider,
-        )
-
-        chunks = enriched_chunks
-
-        callback(
-            "Chunk Validation",
-            "completed",
-            f"Validated {len(chunks)} chunks",
-        )
+    # NOTE: We skip LLM concept enrichment for Word documents because:
+    # 1. The LLM generates concepts in French even for English text
+    # 2. French accents cause Weaviate ingestion failures
+    # 3. Word documents already have clean structure, don't need LLM enhancement
+    #
+    # For production: could re-enable with language detection + prompt tuning
+    callback(
+        "Chunk Validation",
+        "completed",
+        f"Skipped (Word documents don't need LLM enrichment)",
+    )
 
     # ================================================================
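One subtle change in this commit is `raw_meta.get("title", doc_name)` becoming `raw_meta.get("title") or doc_name`. The two-argument `dict.get` default only applies when the key is *absent*; Word core properties often contain the key with an empty string or `None`, which the old form passed through unchanged:

```python
# Typical of unset Word core properties: keys present but empty/None
raw_meta = {"title": "", "author": None}

# Two-argument get: the default is used only when the key is missing
print(repr(raw_meta.get("title", "fallback.docx")))   # -> '' (empty string wins)

# `or` fallback: also replaces falsy values such as "" and None
print(raw_meta.get("title") or "fallback.docx")        # -> fallback.docx
print(raw_meta.get("author") or "Unknown")             # -> Unknown
```

The trade-off is that `or` would also replace a legitimately falsy value (e.g. `0`), which is harmless for string fields like title and author.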
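The STEP 9 comment suggests re-enabling enrichment behind language detection. As a hypothetical sketch (not the project's code — a real pipeline would use a proper language-detection library), a stopword heuristic could gate `enrich_chunks_with_concepts()` so English text is never sent to a prompt that answers in French:

```python
# Hypothetical gate for re-enabling concept enrichment: only enrich chunks
# whose detected language matches the expected one. Tiny stopword sets are
# a sketch; swap in a real detector (e.g. langdetect) for production.
FRENCH_STOPWORDS = {"le", "la", "les", "des", "une", "est", "dans", "pour", "que", "qui"}
ENGLISH_STOPWORDS = {"the", "of", "and", "to", "in", "that", "is", "for", "with", "as"}

def looks_english(text: str) -> bool:
    """True when the text contains at least as many English as French stopwords."""
    words = set(text.lower().split())
    return len(words & ENGLISH_STOPWORDS) >= len(words & FRENCH_STOPWORDS)

def maybe_enrich(chunks, enrich_fn):
    """Enrich only chunks detected as English; leave the rest untouched."""
    return [enrich_fn(c) if looks_english(c) else c for c in chunks]

print(looks_english("the origin of species by means of natural selection"))  # True
print(looks_english("l'immuabilité des espèces dans la création séparée"))   # False
```

Combined with prompt tuning ("answer in the language of the input text"), such a gate would address both failure causes listed in the NOTE, rather than disabling enrichment outright.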