Fix: clean_chunk attend str, pas dict

Problème:
- Erreur: "expected string or bytes-like object, got 'dict'"
- À l'étape "Chunk Cleaning", on passait chunk (dict) au lieu de chunk["text"] (str)

Correction word_pipeline.py (ligne 434):
AVANT:
```python
cleaned = clean_chunk(chunk)  # chunk est un dict!
```

APRÈS:
```python
text: str = chunk.get("text", "")
cleaned_text = clean_chunk(text, use_llm=False)
if is_chunk_valid(cleaned_text, min_chars=30, min_words=8):
    chunk["text"] = cleaned_text
    cleaned_chunks.append(chunk)
```

Pattern copié depuis pdf_pipeline.py:765-771 où la même logique
extrait le texte, le nettoie, puis met à jour le dict.

Test réussi:
 48 paragraphes extraits
 37 chunks créés
 Nettoyage OK
 Validation OK
 Pipeline complet fonctionnel avec Mistral API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
2025-12-30 22:39:41 +01:00
parent 19713f22d6
commit 0800f74bd7

View File

@@ -424,16 +424,24 @@ def process_word(
# STEP 8: Chunk Cleaning (REUSED) # STEP 8: Chunk Cleaning (REUSED)
# ================================================================ # ================================================================
if use_llm: if use_llm:
from utils.llm_cleaner import clean_chunk from utils.llm_cleaner import clean_chunk, is_chunk_valid
callback("Chunk Cleaning", "running", "Cleaning chunks...") callback("Chunk Cleaning", "running", "Cleaning chunks...")
# Clean each chunk # Clean each chunk
cleaned_chunks = [] cleaned_chunks = []
for chunk in chunks: for chunk in chunks:
cleaned = clean_chunk(chunk) # Extract text from chunk dict
if cleaned: # Only keep valid chunks text: str = chunk.get("text", "")
cleaned_chunks.append(cleaned)
# Clean the text
cleaned_text = clean_chunk(text, use_llm=False)
# Validate chunk
if is_chunk_valid(cleaned_text, min_chars=30, min_words=8):
# Update chunk with cleaned text
chunk["text"] = cleaned_text
cleaned_chunks.append(chunk)
chunks = cleaned_chunks chunks = cleaned_chunks