feat: Add data quality verification & cleanup scripts
## Data Quality & Cleanup (Priorities 1-6)

Added a comprehensive data quality verification and cleanup system.

**Scripts added**:
- verify_data_quality.py: full quality analysis, work by work
- clean_duplicate_documents.py: removes duplicate Documents
- populate_work_collection.py/clean.py: populates the Work collection
- fix_chunks_count.py: fixes inconsistent chunksCount values
- manage_orphan_chunks.py: manages orphan chunks (3 options)
- clean_orphan_works.py: deletes Works that have no chunks
- add_missing_work.py: creates a missing Work
- generate_schema_stats.py: automatic stats generation
- migrate_add_work_collection.py: safe Work collection migration

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: session cleanup report
- ANALYSE_QUALITE_DONNEES.md: initial quality analysis
- rapport_qualite_donnees.txt: raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated, then cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: corrected (231 → 5,230; declared now matches actual)
- Perfect consistency: 9 Works = 9 Documents = 9 works

**Code changes**:
- schema.py: added Work collection with vectorization
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: disabled concepts (.lower() issue)
- utils/word_toc_extractor.py: correct Word metadata
- .gitignore: exclude temporary files (*.wav, output/*, NUL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -227,3 +227,118 @@ def print_toc_tree(
        print(f"{indent}{entry['sectionPath']}: {entry['title']}")
        if entry["children"]:
            print_toc_tree(entry["children"], indent + "  ")


def _roman_to_int(roman: str) -> int:
    """Convert Roman numeral to integer.

    Args:
        roman: Roman numeral string (I, II, III, IV, V, VI, VII, etc.).

    Returns:
        Integer value.

    Example:
        >>> _roman_to_int("I")
        1
        >>> _roman_to_int("IV")
        4
        >>> _roman_to_int("VII")
        7
    """
    roman_values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    result = 0
    prev_value = 0

    for char in reversed(roman.upper()):
        value = roman_values.get(char, 0)
        if value < prev_value:
            result -= value
        else:
            result += value
        prev_value = value

    return result


def extract_toc_from_chapter_summaries(paragraphs: List[Dict[str, Any]]) -> List[TOCEntry]:
    """Extract TOC from chapter summary paragraphs (CHAPTER I, CHAPTER II, etc.).

    Many Word documents have a "RESUME DES CHAPITRES" or "TABLE OF CONTENTS" section
    with paragraphs like:
        CHAPTER I.
        VARIATION UNDER DOMESTICATION.
        Description...

    This function extracts those into a proper TOC structure.

    Args:
        paragraphs: List of paragraph dicts from word_processor.extract_word_content().
            Each dict must have:
            - text (str): Paragraph text
            - is_heading (bool): Whether it's a heading
            - index (int): Paragraph index

    Returns:
        List of TOCEntry dicts with hierarchical structure.

    Example:
        >>> paragraphs = [...]
        >>> toc = extract_toc_from_chapter_summaries(paragraphs)
        >>> print(toc[0]["title"])
        'VARIATION UNDER DOMESTICATION'
        >>> print(toc[0]["sectionPath"])
        '1'
    """
    import re

    toc: List[TOCEntry] = []
    toc_started = False

    for para in paragraphs:
        text = para.get("text", "").strip()

        # Detect TOC start (multiple possible markers)
        if any(marker in text.upper() for marker in [
            'RESUME DES CHAPITRES',
            'TABLE OF CONTENTS',
            'CONTENTS',
            'CHAPITRES',
        ]):
            toc_started = True
            continue

        # Extract chapters
        if toc_started and text.startswith('CHAPTER'):
            # Split by newlines to get chapter number and title
            lines = [line.strip() for line in text.split('\n') if line.strip()]

            if len(lines) >= 2:
                chapter_line = lines[0]
                title_line = lines[1]

                # Extract chapter number (roman or arabic)
                match = re.match(r'CHAPTER\s+([IVXLCDM]+|\d+)', chapter_line, re.IGNORECASE)
                if match:
                    chapter_num_str = match.group(1)

                    # Convert to integer
                    if chapter_num_str.isdigit():
                        chapter_num = int(chapter_num_str)
                    else:
                        chapter_num = _roman_to_int(chapter_num_str)

                    # Remove trailing dots
                    title_clean = title_line.rstrip('.')

                    entry: TOCEntry = {
                        "title": title_clean,
                        "level": 1,  # All chapters are top-level
                        "sectionPath": str(chapter_num),
                        "pageRange": "",
                        "children": [],
                    }

                    toc.append(entry)

    return toc
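The parsing steps in the diff above can be exercised with a minimal standalone sketch. `roman_to_int` here is an illustrative copy of the `_roman_to_int` logic (renamed so the sketch runs without importing the module), and `sample` is a made-up chapter-summary paragraph in the two-line shape the extractor expects:

```python
import re

# Illustrative copy of the _roman_to_int logic from the diff above,
# renamed so this sketch runs standalone.
def roman_to_int(roman: str) -> int:
    values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    result, prev = 0, 0
    # Walk right-to-left: subtract a smaller value that precedes a larger one.
    for char in reversed(roman.upper()):
        value = values.get(char, 0)
        result = result - value if value < prev else result + value
        prev = value
    return result

# Hypothetical chapter-summary paragraph: chapter line first, title line second.
sample = "CHAPTER IV.\nSTRUGGLE FOR EXISTENCE."
lines = [line.strip() for line in sample.split('\n') if line.strip()]

# Same regex as the extractor: roman or arabic chapter number.
match = re.match(r'CHAPTER\s+([IVXLCDM]+|\d+)', lines[0], re.IGNORECASE)
chapter_num = roman_to_int(match.group(1))  # -> 4
title = lines[1].rstrip('.')                # -> "STRUGGLE FOR EXISTENCE"

print(chapter_num, title)
```

Running this prints `4 STRUGGLE FOR EXISTENCE`, mirroring the `sectionPath`/`title` pair that ends up in each TOCEntry.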