feat: Add data quality verification & cleanup scripts
## Data Quality & Cleanup (Priorities 1-6)

Added a comprehensive data quality verification and cleanup system.

**Scripts added**:
- verify_data_quality.py: full quality analysis, work by work
- clean_duplicate_documents.py: removes duplicate Documents
- populate_work_collection.py/clean.py: populates the Work collection
- fix_chunks_count.py: fixes inconsistent chunksCount values
- manage_orphan_chunks.py: manages orphan chunks (3 options)
- clean_orphan_works.py: deletes Works that have no chunks
- add_missing_work.py: creates a missing Work
- generate_schema_stats.py: automatic stats generation
- migrate_add_work_collection.py: safe Work collection migration

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: session cleanup report
- ANALYSE_QUALITE_DONNEES.md: initial quality analysis
- rapport_qualite_donnees.txt: raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated, then cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: corrected (231 → 5,230; declared now matches actual)
- Perfect consistency: 9 Works = 9 Documents = 9 works

**Code changes**:
- schema.py: added Work collection with vectorization
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: disabled concepts (.lower() issue)
- utils/word_toc_extractor.py: correct Word metadata
- .gitignore: exclude temporary files (*.wav, output/*, NUL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@@ -227,3 +227,118 @@ def print_toc_tree(
        print(f"{indent}{entry['sectionPath']}: {entry['title']}")
        if entry["children"]:
            print_toc_tree(entry["children"], indent + "  ")


def _roman_to_int(roman: str) -> int:
    """Convert Roman numeral to integer.

    Args:
        roman: Roman numeral string (I, II, III, IV, V, VI, VII, etc.).

    Returns:
        Integer value.

    Example:
        >>> _roman_to_int("I")
        1
        >>> _roman_to_int("IV")
        4
        >>> _roman_to_int("VII")
        7
    """
    roman_values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    result = 0
    prev_value = 0

    for char in reversed(roman.upper()):
        value = roman_values.get(char, 0)
        if value < prev_value:
            result -= value
        else:
            result += value
        prev_value = value

    return result


def extract_toc_from_chapter_summaries(paragraphs: List[Dict[str, Any]]) -> List[TOCEntry]:
    """Extract TOC from chapter summary paragraphs (CHAPTER I, CHAPTER II, etc.).

    Many Word documents have a "RESUME DES CHAPITRES" or "TABLE OF CONTENTS" section
    with paragraphs like:
        CHAPTER I.
        VARIATION UNDER DOMESTICATION.
        Description...

    This function extracts those into a proper TOC structure.

    Args:
        paragraphs: List of paragraph dicts from word_processor.extract_word_content().
            Each dict must have:
            - text (str): Paragraph text
            - is_heading (bool): Whether it's a heading
            - index (int): Paragraph index

    Returns:
        List of TOCEntry dicts with hierarchical structure.

    Example:
        >>> paragraphs = [...]
        >>> toc = extract_toc_from_chapter_summaries(paragraphs)
        >>> print(toc[0]["title"])
        'VARIATION UNDER DOMESTICATION'
        >>> print(toc[0]["sectionPath"])
        '1'
    """
    import re

    toc: List[TOCEntry] = []
    toc_started = False

    for para in paragraphs:
        text = para.get("text", "").strip()

        # Detect TOC start (multiple possible markers)
        if any(marker in text.upper() for marker in [
            'RESUME DES CHAPITRES',
            'TABLE OF CONTENTS',
            'CONTENTS',
            'CHAPITRES',
        ]):
            toc_started = True
            continue

        # Extract chapters
        if toc_started and text.startswith('CHAPTER'):
            # Split by newlines to get chapter number and title
            lines = [line.strip() for line in text.split('\n') if line.strip()]

            if len(lines) >= 2:
                chapter_line = lines[0]
                title_line = lines[1]

                # Extract chapter number (roman or arabic)
                match = re.match(r'CHAPTER\s+([IVXLCDM]+|\d+)', chapter_line, re.IGNORECASE)
                if match:
                    chapter_num_str = match.group(1)

                    # Convert to integer
                    if chapter_num_str.isdigit():
                        chapter_num = int(chapter_num_str)
                    else:
                        chapter_num = _roman_to_int(chapter_num_str)

                    # Remove trailing dots
                    title_clean = title_line.rstrip('.')

                    entry: TOCEntry = {
                        "title": title_clean,
                        "level": 1,  # All chapters are top-level
                        "sectionPath": str(chapter_num),
                        "pageRange": "",
                        "children": [],
                    }

                    toc.append(entry)

    return toc
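The parsing steps in the diff above can be exercised with a minimal standalone sketch. `roman_to_int` here is an illustrative copy of the `_roman_to_int` logic (renamed so the sketch runs without importing the module), and `sample` is a made-up chapter-summary paragraph in the two-line shape the extractor expects:

```python
import re

# Illustrative copy of the _roman_to_int logic from the diff above,
# renamed so this sketch runs standalone.
def roman_to_int(roman: str) -> int:
    values = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}
    result, prev = 0, 0
    # Walk right-to-left: subtract a smaller value that precedes a larger one.
    for char in reversed(roman.upper()):
        value = values.get(char, 0)
        result = result - value if value < prev else result + value
        prev = value
    return result

# Hypothetical chapter-summary paragraph: chapter line first, title line second.
sample = "CHAPTER IV.\nSTRUGGLE FOR EXISTENCE."
lines = [line.strip() for line in sample.split('\n') if line.strip()]

# Same regex as the extractor: roman or arabic chapter number.
match = re.match(r'CHAPTER\s+([IVXLCDM]+|\d+)', lines[0], re.IGNORECASE)
chapter_num = roman_to_int(match.group(1))  # -> 4
title = lines[1].rstrip('.')                # -> "STRUGGLE FOR EXISTENCE"

print(chapter_num, title)
```

Running this prints `4 STRUGGLE FOR EXISTENCE`, mirroring the `sectionPath`/`title` pair that ends up in each TOCEntry.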