feat: Add data quality verification & cleanup scripts

## Data Quality & Cleanup (Priorities 1-6)

Added a comprehensive data quality verification and cleanup system:

**Scripts added**:
- verify_data_quality.py: Full quality analysis, work by work
- clean_duplicate_documents.py: Removes duplicate Documents
- populate_work_collection.py/clean.py: Populates the Work collection
- fix_chunks_count.py: Fixes inconsistent chunksCount values
- manage_orphan_chunks.py: Handles orphan chunks (3 options)
- clean_orphan_works.py: Deletes Works that have no chunks
- add_missing_work.py: Creates a missing Work
- generate_schema_stats.py: Automatic stats generation
- migrate_add_work_collection.py: Safe migration for the Work collection

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: Complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: Quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: Cleanup report for this session
- ANALYSE_QUALITE_DONNEES.md: Initial data quality analysis
- rapport_qualite_donnees.txt: Raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated, then cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: Corrected (231 → 5,230; declared now equals actual)
- Full consistency: 9 Works = 9 Documents = 9 source works
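
The chunksCount correction and orphan removal listed above amount to reconciling declared counts against the chunks that actually exist. A minimal in-memory sketch of that kind of check follows. The record shapes, the `sourceId` key, and the function name `audit_chunk_counts` are assumptions for illustration only; they do not reflect the actual Weaviate schema or the scripts in this commit.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def audit_chunk_counts(
    documents: List[Dict[str, Any]],
    chunks: List[Dict[str, Any]],
) -> Tuple[Dict[str, Tuple[int, int]], List[Dict[str, Any]]]:
    """Compare declared chunk counts with actual ones (hypothetical helper).

    Returns a pair:
      - mismatches: sourceId -> (declared, actual) where the two differ
      - orphans: chunks whose sourceId matches no known document
    """
    known_ids = {doc["sourceId"] for doc in documents}

    # Count only chunks that belong to a known document.
    actual = Counter(c["sourceId"] for c in chunks if c["sourceId"] in known_ids)
    orphans = [c for c in chunks if c["sourceId"] not in known_ids]

    mismatches: Dict[str, Tuple[int, int]] = {}
    for doc in documents:
        declared = int(doc["chunksCount"])
        real = actual[doc["sourceId"]]  # Counter returns 0 for missing keys
        if declared != real:
            mismatches[doc["sourceId"]] = (declared, real)
    return mismatches, orphans


# Made-up sample records, not data from the repository.
documents = [
    {"sourceId": "doc_a", "chunksCount": 2},
    {"sourceId": "doc_b", "chunksCount": 0},
]
chunks = [
    {"sourceId": "doc_a", "text": "first"},
    {"sourceId": "doc_a", "text": "second"},
    {"sourceId": "doc_b", "text": "third"},
    {"sourceId": "doc_removed", "text": "orphan"},
]

mismatches, orphans = audit_chunk_counts(documents, chunks)
print("mismatches:", mismatches)
print("orphan chunks:", len(orphans))
```

In this sample, `doc_b` declares 0 chunks but has 1, and the chunk pointing at `doc_removed` is reported as an orphan.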

**Code changes**:
- schema.py: Added the Work collection with vectorization
- utils/weaviate_ingest.py: Support for Work ingestion
- utils/word_pipeline.py: Disabled concepts (.lower() issue)
- utils/word_toc_extractor.py: Correct Word metadata
- .gitignore: Excludes temporary files (*.wav, output/*, NUL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-01 11:57:26 +01:00
parent 845ffb4b06
commit 04ee3f9e39
26 changed files with 6945 additions and 16 deletions


@@ -227,3 +227,118 @@ def print_toc_tree(
        print(f"{indent}{entry['sectionPath']}: {entry['title']}")
        if entry["children"]:
            print_toc_tree(entry["children"], indent + " ")


def _roman_to_int(roman: str) -> int:
    """Convert Roman numeral to integer.

    Args:
        roman: Roman numeral string (I, II, III, IV, V, VI, VII, etc.).

    Returns:
        Integer value.

    Example:
        >>> _roman_to_int("I")
        1
        >>> _roman_to_int("IV")
        4
        >>> _roman_to_int("VII")
        7
    """
    roman_values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    result = 0
    prev_value = 0

    for char in reversed(roman.upper()):
        value = roman_values.get(char, 0)
        if value < prev_value:
            result -= value
        else:
            result += value
        prev_value = value

    return result


def extract_toc_from_chapter_summaries(paragraphs: List[Dict[str, Any]]) -> List[TOCEntry]:
    """Extract TOC from chapter summary paragraphs (CHAPTER I, CHAPTER II, etc.).

    Many Word documents have a "RESUME DES CHAPITRES" or "TABLE OF CONTENTS" section
    with paragraphs like:

        CHAPTER I.
        VARIATION UNDER DOMESTICATION.
        Description...

    This function extracts those into a proper TOC structure.

    Args:
        paragraphs: List of paragraph dicts from word_processor.extract_word_content().
            Each dict must have:
            - text (str): Paragraph text
            - is_heading (bool): Whether it's a heading
            - index (int): Paragraph index

    Returns:
        List of TOCEntry dicts, one level-1 entry per chapter (children left empty).

    Example:
        >>> paragraphs = [...]
        >>> toc = extract_toc_from_chapter_summaries(paragraphs)
        >>> print(toc[0]["title"])
        VARIATION UNDER DOMESTICATION
        >>> print(toc[0]["sectionPath"])
        1
    """
    import re

    toc: List[TOCEntry] = []
    toc_started = False

    for para in paragraphs:
        text = para.get("text", "").strip()

        # Detect TOC start (multiple possible markers)
        if any(marker in text.upper() for marker in [
            'RESUME DES CHAPITRES',
            'TABLE OF CONTENTS',
            'CONTENTS',
            'CHAPITRES',
        ]):
            toc_started = True
            continue

        # Extract chapters (only once a TOC marker has been seen)
        if toc_started and text.startswith('CHAPTER'):
            # Split by newlines to get chapter number and title
            lines = [line.strip() for line in text.split('\n') if line.strip()]

            if len(lines) >= 2:
                chapter_line = lines[0]
                title_line = lines[1]

                # Extract chapter number (roman or arabic)
                match = re.match(r'CHAPTER\s+([IVXLCDM]+|\d+)', chapter_line, re.IGNORECASE)
                if match:
                    chapter_num_str = match.group(1)

                    # Convert to integer
                    if chapter_num_str.isdigit():
                        chapter_num = int(chapter_num_str)
                    else:
                        chapter_num = _roman_to_int(chapter_num_str)

                    # Remove trailing dots
                    title_clean = title_line.rstrip('.')

                    entry: TOCEntry = {
                        "title": title_clean,
                        "level": 1,  # All chapters are top-level
                        "sectionPath": str(chapter_num),
                        "pageRange": "",
                        "children": [],
                    }
                    toc.append(entry)

    return toc
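
To see how the extractor above behaves on concrete input, the sketch below runs a few hand-written paragraphs through a condensed copy of the same logic. The sample paragraphs and the names `sketch_extract_toc` and `roman_to_int` are illustrative assumptions, not part of the commit, and `TOCEntry` is approximated with a plain dict so the block runs on its own.

```python
import re
from typing import Any, Dict, List

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
TOC_MARKERS = ("RESUME DES CHAPITRES", "TABLE OF CONTENTS", "CONTENTS", "CHAPITRES")


def roman_to_int(roman: str) -> int:
    """Right-to-left conversion, same approach as _roman_to_int in the diff."""
    result, prev_value = 0, 0
    for char in reversed(roman.upper()):
        value = ROMAN_VALUES.get(char, 0)
        # A smaller value to the left of a larger one is subtracted (IV = 4).
        result += -value if value < prev_value else value
        prev_value = value
    return result


def sketch_extract_toc(paragraphs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical condensed copy of extract_toc_from_chapter_summaries."""
    toc: List[Dict[str, Any]] = []
    toc_started = False
    for para in paragraphs:
        text = para.get("text", "").strip()
        if any(marker in text.upper() for marker in TOC_MARKERS):
            toc_started = True
            continue
        if not (toc_started and text.startswith("CHAPTER")):
            continue
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        if len(lines) < 2:
            continue  # a chapter line without a title line is skipped
        match = re.match(r"CHAPTER\s+([IVXLCDM]+|\d+)", lines[0], re.IGNORECASE)
        if not match:
            continue
        num = match.group(1)
        chapter_num = int(num) if num.isdigit() else roman_to_int(num)
        toc.append({
            "title": lines[1].rstrip("."),
            "level": 1,
            "sectionPath": str(chapter_num),
            "pageRange": "",
            "children": [],
        })
    return toc


sample_paragraphs = [
    # Appears before any TOC marker, so it is ignored.
    {"text": "CHAPTER I.\nEarly mention.", "is_heading": False, "index": 0},
    {"text": "TABLE OF CONTENTS", "is_heading": True, "index": 1},
    {"text": "CHAPTER I.\nVARIATION UNDER DOMESTICATION.\nDescription...",
     "is_heading": False, "index": 2},
    {"text": "CHAPTER IV.\nNATURAL SELECTION.", "is_heading": False, "index": 3},
    # Single line only, so there is no title to extract.
    {"text": "CHAPTER 14", "is_heading": False, "index": 4},
]

toc = sketch_extract_toc(sample_paragraphs)
for entry in toc:
    print(entry["sectionPath"], entry["title"])
```

Two behaviors stand out in this sample: chapter paragraphs that precede a TOC marker are dropped, and a paragraph needs at least two non-empty lines (number and title) to produce an entry.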