Projeto Patrística Digital

Warning

The Python pipelines use LLMs both in OCR (gating/judge combining Tesseract + LLM) and in summaries/keywords. Results may contain hallucinations, omissions, or biases; always validate against the original text before academic or editorial use.

Warning

The corpus is massive (~1B tokens). Avoid running jobs over the whole of teste/ without filters; use --first/--last (OCR) and --pattern/--limit (keywords) to avoid excessive cost or hangs.
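To illustrate the kind of bounding that pattern and limit filters provide, here is a minimal sketch in Python. It is not the repository's implementation: the function name `select_pages` and its parameters are hypothetical, and the only detail taken from this README is the `teste/` layout with pages under `P*/text/*`.

```python
from pathlib import Path


def select_pages(root, pattern="P*/text/*", limit=10):
    """Return at most `limit` files under `root` that match `pattern`.

    Hypothetical helper: it mimics the idea of a pattern/limit filter and
    is not taken from the project's scripts.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # Sort so that repeated runs pick the same subset of pages.
    matches = sorted(p for p in Path(root).glob(pattern) if p.is_file())
    # Truncate before any expensive (LLM or OCR) work is scheduled.
    return matches[:limit]
```

Calling `select_pages("teste", limit=5)` would return the first five matching page files in sorted order, so a trial run touches a handful of pages rather than the whole corpus. Sorting enumerates every match first, which costs only a directory listing and not model calls.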

This repository maintains the OCR corpus of the Patrology (Graeca, Latina, Orientalis) and processing scripts. Choose your language:

Full version in Portuguese (PT-BR): README.pt-BR.md
Full version in English: README.en.md
Detailed technical pipeline: docs/PIPELINE.md

Quick notes

Most prompts, summaries, and keywords were originally written in PT-BR and have no English version of their own at the moment; the English version is a reference translation.
The OCR texts are already versioned in teste/ (pages P*/text/*).
OCR cleanup and improvement is still in progress; the original PDFs/images are not versioned.

Folders and files

.github/workflows
docs
download
huggingface_readme
logs/resumo_parallel
service
teste
tools
web
.gitignore
CONTRIBUTING.md
LICENSE
LICENSE-CC0
LICENSE-NOTICE.md
README.en.md
README.md
README.pt-BR.md
agent_patristica_pipeline.py
agent_patristica_smol.py
apply_normalizations_to_resumos.py
assistente_patristico.py
concatall.py
delete_ungrouped_to_resumos.py
easy_ocr_playground.py
embedding_playground.py
entropy_lib.py
evaluation_db.py
find_gaps.py
fix_gaps.py
fix_time_breaks.py
hdbscan_embedding.py
infer.py
join_po_volumes.py
keyword_integrity.py
keywords_parallel.sh
keywords_serial.py
kraken_all_in_one.py
kraken_pipeline.py
kraken_playground.py
layoutparser_playground.py
lex_lookup.py
llm_ocr_playground.py
lmlayout.py
main.py
main2.py
ocr_batch.py
ocr_gui.py
ocr_ollama.py
ocr_versions_db.py
ollama_keywords.py
ollama_prompt_test.py
patch_scripture.py
patristicagrega.sh
patristicalatina.sh
patristicaoriental.sh
requirements.txt
resumo_parallel.sh
resumo_serial.py
review.py
sample.py
scripture_ref_normalizer.py
test_ahocorasick.py
test_char_filter.py
test_char_filter2.py
test_gaps_files.py
test_judge_parser.py
test_keyword_integrity.py
test_keywords_clean.py
test_limpeza_ocr.py
test_limpeza_ocr_xml.py
test_process_index.py
test_regex.py
test_regex_aho.py
test_regex_fil.py
test_regex_regex.py
test_resumo_parse.py
test_scripture_ref_variants.py
test_stream_pg002_630.py
testcv.py
unique_keywords_preview.py
upload_po.py
xml_from_tsv.py
