## Data Quality & Cleanup (Priorities 1-6)

Added a comprehensive data quality verification and cleanup system:

**Scripts created**:
- verify_data_quality.py: Full quality analysis, work by work
- clean_duplicate_documents.py: Removes duplicate Documents
- populate_work_collection.py/clean.py: Populates the Work collection
- fix_chunks_count.py: Fixes inconsistent chunksCount values
- manage_orphan_chunks.py: Handles orphan chunks (3 options)
- clean_orphan_works.py: Deletes Works that have no chunks
- add_missing_work.py: Creates a missing Work
- generate_schema_stats.py: Generates stats automatically
- migrate_add_work_collection.py: Safe migration for the Work collection

**Documentation**:
- WEAVIATE_GUIDE_COMPLET.md: Complete consolidated guide (600+ lines)
- WEAVIATE_SCHEMA.md: Quick schema reference
- NETTOYAGE_COMPLETE_RAPPORT.md: Cleanup session report
- ANALYSE_QUALITE_DONNEES.md: Initial quality analysis
- rapport_qualite_donnees.txt: Raw verification output

**Cleanup results**:
- Documents: 16 → 9 (7 duplicates removed)
- Works: 0 → 9 (populated + cleaned)
- Chunks: 5,404 → 5,230 (174 orphans removed)
- chunksCount: Corrected (231 → 5,230; declared = actual)
- Full consistency: 9 Works = 9 Documents = 9 actual works

**Code changes**:
- schema.py: Added Work collection with vectorization
- utils/weaviate_ingest.py: Work ingestion support
- utils/word_pipeline.py: Disabled concepts (.lower() issue)
- utils/word_toc_extractor.py: Correct Word metadata
- .gitignore: Excludes temporary files (*.wav, output/*, NUL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
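The three cleanup steps named in the commit message (duplicate Documents, orphan chunks, chunksCount correction) can be illustrated with a minimal in-memory sketch. This is not the code of the scripts listed above and it makes no Weaviate calls. It assumes records are plain dicts, and the field names `source_id` and `document_source_id` are hypothetical. Only `chunksCount` comes from the commit message.

```python
from collections import Counter


def split_duplicates(documents):
    """Keep the first document per source_id; return (kept, duplicates)."""
    seen, kept, duplicates = set(), [], []
    for doc in documents:
        key = doc["source_id"]  # hypothetical field name
        (duplicates if key in seen else kept).append(doc)
        seen.add(key)
    return kept, duplicates


def find_orphan_chunks(documents, chunks):
    """Return chunks whose document reference matches no known document."""
    known = {doc["source_id"] for doc in documents}
    return [c for c in chunks if c["document_source_id"] not in known]


def count_mismatches(documents, chunks):
    """Return {source_id: (declared, actual)} where chunksCount is wrong."""
    actual = Counter(c["document_source_id"] for c in chunks)
    return {
        doc["source_id"]: (doc.get("chunksCount", 0), actual[doc["source_id"]])
        for doc in documents
        if doc.get("chunksCount", 0) != actual[doc["source_id"]]
    }


# Illustrative data only, not taken from the real collections.
documents = [
    {"source_id": "a", "chunksCount": 2},
    {"source_id": "b", "chunksCount": 5},
    {"source_id": "a", "chunksCount": 2},  # duplicate of the first entry
]
chunks = [
    {"document_source_id": "a"},
    {"document_source_id": "a"},
    {"document_source_id": "b"},
    {"document_source_id": "ghost"},  # points at no document
]

kept, duplicates = split_duplicates(documents)
orphans = find_orphan_chunks(kept, chunks)
mismatches = count_mismatches(kept, chunks)
```

Running the duplicate pass first matters in this sketch: counting chunks against a deduplicated document list avoids reporting the same mismatch twice.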
99 lines
1.0 KiB
Plaintext
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
ENV/
env/
.venv/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# Environment variables
.env
.env.local

# Logs
*.log
logs/

# OS
.DS_Store
Thumbs.db

# Output files (large generated files)
output/*/images/
output/*/*.json
output/*/*.md
output/*.wav
output/*.docx
output/*.pdf
output/test_audio/
output/voices/

# Keep output folder structure
!output/.gitkeep

# Temporary files
*.tmp
*.bak
*.backup
temp_*.py
cleanup_*.py
*.wav
NUL
brinderb_temp.wav

# Input temporary files
input/

# Type checking outputs
mypy_errors.txt
*_errors.txt

# Test PDFs (keep input/ folder but ignore PDFs)
input/*.pdf

# Node artifacts (not a Node.js project)
package-lock.json

# Linear backup files
.linear_project.json.backup

# PRPs directory (project request proposals - temporary)
PRPs/

# Markdown working directory (conversion scripts + large source files)
md/
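Some of the rules in this file overlap: `*.wav` already covers `output/*.wav` and `brinderb_temp.wav`, and `input/` already covers `input/*.pdf`. A rough way to spot such overlaps is sketched below. It is a heuristic under stated assumptions, not an implementation of Git's matching rules: it only recognises plain `name/` directory rules and plain `*.ext` rules, and it ignores negation (`!`) lines entirely. The function name `redundant_patterns` is hypothetical.

```python
def redundant_patterns(lines):
    """Flag patterns already covered by a broader `dir/` or `*.ext` rule.

    Heuristic only: negations and most glob forms are not analysed.
    """
    patterns = [
        ln.strip() for ln in lines
        if ln.strip() and not ln.lstrip().startswith(("#", "!"))
    ]
    # Plain top-level directory rules such as "input/".
    dir_rules = {
        p for p in patterns
        if p.endswith("/") and "/" not in p[:-1] and "*" not in p
    }
    # Plain extension rules such as "*.wav", stored as ".wav".
    ext_rules = sorted(
        p[1:] for p in patterns
        if p.startswith("*.") and "/" not in p
        and not any(ch in p[2:] for ch in "*?[")
    )
    flagged = []
    for p in patterns:
        first = p.split("/", 1)[0] + "/"
        if "/" in p and p != first and first in dir_rules:
            flagged.append((p, first))
            continue
        for ext in ext_rules:
            if p.endswith(ext) and p != "*" + ext:
                flagged.append((p, "*" + ext))
                break
    return flagged


# A small excerpt of the rules above, used as illustrative input.
sample = """
# Temporary files
*.backup
*.wav
brinderb_temp.wav
input/
input/*.pdf
output/*.wav
.linear_project.json.backup
!output/.gitkeep
""".splitlines()

overlaps = redundant_patterns(sample)
```

Each flagged pair is `(pattern, covering_rule)`. A flagged pattern is only a candidate for removal. For an authoritative answer on a real path, `git check-ignore -v <path>` reports which rule actually matched.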