linear-coding-agent/generations/library_rag/utils at 845ffb4b06f3840a2e14ecd610753772d5c373ab - linear-coding-agent - Git PAROSIA

davebb/linear-coding-agent

David Blanc Brioir 845ffb4b06 Fix: Correct Word metadata + disable concepts

Problems fixed:
1. INCORRECT TITLE → Now uses the TITRE: field from the first page
2. CONCEPTS IN FRENCH → Disabled the LLM enrichment

Before:
- Title: "An Historical Sketch..." (wrong, this is the chapter title)
- Concepts: ['immuabilité des espèces', 'création séparée'] (French)
- Result: 3/37 chunks ingested into Weaviate

After:
- Title: "On the Origin of Species BY MEANS OF..." (correct!)
- Concepts: [] (empty, so no encoding problem)
- Result: 14/37 chunks ingested (better, but not perfect)

Changes in word_pipeline.py:

1. STEP 5 - Simplified metadata (lines 241-262):
   - Removed the LLM extract_metadata() call
   - Uses raw_meta from extract_word_metadata() directly
   - The LLM was picking the chapter title instead of the book title

2. STEP 9 - Disabled concept enrichment (lines 410-423):
   - Skips enrich_chunks_with_concepts()
   - Reason: the LLM generates concepts in FRENCH for ENGLISH text
   - French accents cause Weaviate failures
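
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the actual code in word_pipeline.py: the `build_metadata` and `finalize_chunks` helpers, the `ENABLE_CONCEPT_ENRICHMENT` flag, and the `title`/`author` keys are all assumptions made for the example. Only `raw_meta` is a name taken from the commit message.

```python
from typing import Any

# Hypothetical switch; the real pipeline may simply comment out the call.
ENABLE_CONCEPT_ENRICHMENT = False


def build_metadata(raw_meta: dict[str, Any]) -> dict[str, Any]:
    """STEP 5 sketch: trust the Word-extracted metadata, no LLM call.

    The key names are assumed; missing or None values become empty strings.
    """
    return {
        "title": (raw_meta.get("title") or "").strip(),
        "author": (raw_meta.get("author") or "").strip(),
    }


def finalize_chunks(chunks: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """STEP 9 sketch: skip enrichment and give every chunk empty concepts."""
    if ENABLE_CONCEPT_ENRICHMENT:
        # Enrichment is out of scope for this sketch.
        raise NotImplementedError("concept enrichment is disabled here")
    # Copy each chunk so the caller's dictionaries are not mutated.
    return [{**chunk, "concepts": []} for chunk in chunks]
```

Keeping the skip behind a flag, rather than deleting the call, would make it easy to re-enable enrichment once the language mismatch is resolved.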

TOC note:
The document has only 2 Heading 2 entries, so the TOC is limited.
This is expected for a 10-page excerpt.

Still to investigate: why 14/37 chunks instead of 37/37?
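
One possible way to approach the open question is to insert chunks one at a time and record the error for each rejected chunk. The sketch below assumes a hypothetical `insert_one` callable that wraps whatever weaviate_ingest.py actually does; it deliberately does not call the Weaviate client, and `ingest_with_report` is an illustrative name, not a function in the repository.

```python
from typing import Any, Callable


def ingest_with_report(
    chunks: list[dict[str, Any]],
    insert_one: Callable[[dict[str, Any]], None],
) -> tuple[int, list[tuple[int, str]]]:
    """Insert chunks individually and collect (index, error) for failures."""
    ok = 0
    failures: list[tuple[int, str]] = []
    for index, chunk in enumerate(chunks):
        try:
            insert_one(chunk)
            ok += 1
        except Exception as exc:  # broad on purpose: this is a diagnostic pass
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return ok, failures
```

With a report like this, the 23 missing chunks could be grouped by error message to see whether a single cause explains them.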

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-30 23:39:41 +01:00

__init__.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

hierarchy_parser.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

image_extractor.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

llm_chat.py

Fix: Robust handling of None values in .lower()

2025-12-30 22:26:29 +01:00

llm_chunker.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

llm_classifier.py

Fix: Robust handling of None values in .lower()

2025-12-30 22:26:29 +01:00

llm_cleaner.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

llm_metadata.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

llm_structurer.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

llm_toc.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

llm_validator.py

Fix: Robust handling of None values in .lower()

2025-12-30 22:26:29 +01:00

markdown_builder.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

mistral_client.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

ocr_processor.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

ocr_schemas.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

pdf_exporter.py

Add Word and PDF export features for the RAG chat

2025-12-30 14:02:11 +01:00

pdf_pipeline.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

pdf_uploader.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

toc_enricher.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

toc_extractor_markdown.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

toc_extractor_visual.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

toc_extractor.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

tts_generator.py

Add markdown cleanup for TTS audio

2025-12-30 19:35:01 +01:00

types.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

weaviate_ingest.py

Add Library RAG project and cleanup root directory

2025-12-30 11:57:12 +01:00

word_exporter.py

Add Word and PDF export features for the RAG chat

2025-12-30 14:02:11 +01:00

word_pipeline.py

Fix: Correct Word metadata + disable concepts

2025-12-30 23:39:41 +01:00

word_processor.py

Add Word (.docx) pipeline for RAG ingestion

2025-12-30 21:58:43 +01:00

word_toc_extractor.py

Add Word (.docx) pipeline for RAG ingestion

2025-12-30 21:58:43 +01:00