This project converts PDF documents to LaTeX, preserving the original document's structure and formatting with high fidelity. The pipeline combines OCR, structural analysis, and a persistence layer (PostgreSQL and Neo4j) to manage extracted data and drive accurate LaTeX generation.
- Robust PDF Processing: Handles both text-based and scanned PDFs.
- OCR Integration: Uses Optical Character Recognition (OCR) to extract text from scanned images.
- Structure Analysis: Analyzes the document's logical structure (headings, paragraphs, lists, tables, figures, equations).
- Persistence Layer: Stores extracted data in PostgreSQL (relational data) and Neo4j (graph relationships) for efficient querying and LaTeX generation.
- Faithful LaTeX Generation: Produces LaTeX code that accurately reflects the original document's content and formatting.
- Modular Design: Allows for customization and extension of individual pipeline components.
- Scalability: Designed to handle large PDF documents.
```mermaid
graph TD
A[PDF Input] --> B{Rasterization & OCR};
B --> C[Text Representation];
C --> D{Structure Analysis & Parsing};
D --> E[Persistence Layer];
E -- Graph(s) --> F;
E -- AST(s) --> F;
E -- Relational DB --> F;
E -- Misc Data --> F;
F[Integrated Data] --> G{LaTeX Generation};
G --> H[LaTeX Output];
```
```mermaid
graph LR
subgraph PDF Parser
A[PDFParser Class] --> B(parse_pdf function);
B --> C{"Rasterization (PyMuPDF)"};
B --> D{"OCR (Pytesseract)"};
D --> E(extract_text_from_image function);
end
subgraph Structure Analyzer
F[StructureAnalyzer Class] --> G(analyze_text function);
G --> H(Regular Expressions);
G --> I(NLP Techniques - Optional);
G --> J(Layout Analysis - Optional);
end
subgraph Persistence Layer
K[PersistenceLayer Class] --> L(create_document function);
K --> M(create_page function);
K --> N(create_block function);
K --> O(create_follows_relationship function);
K --> P(_pg_execute helper function);
subgraph PostgreSQL
Q[Documents Table]
R[Pages Table]
S[Blocks Table]
T[TextBlocks Table]
U[ImageBlocks Table]
V[Tables Table]
W[Equations Table]
L --> Q
M --> R
N --> S
N --> T
N --> U
N --> V
N --> W
end
subgraph Neo4j
X[Document Node]
Y[Page Node]
Z[Block Node]
L --> X
M --> Y
N --> Z
O --> Z
end
end
subgraph LaTeX Generator
AA[LaTeXGenerator Class] --> BB(generate_latex function);
BB --> CC(Jinja2 Templates);
BB --> DD(_generate_section helper function)
end
A --> F
F --> K
K --> AA
```
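To make the parser subgraph above concrete, here is a minimal sketch of the rasterization and OCR step, assuming a recent PyMuPDF and pytesseract as named in the diagram. The function names mirror the diagram; everything else is illustrative, not the project's actual implementation:

```python
import io

import fitz                # PyMuPDF, used for rasterization
import pytesseract         # Tesseract OCR wrapper
from PIL import Image


def extract_text_from_image(pixmap):
    """Run OCR over a rasterized page image."""
    image = Image.open(io.BytesIO(pixmap.tobytes("png")))
    return pytesseract.image_to_string(image)


def parse_pdf(path, dpi=300):
    """Yield per-page text: embedded text when available, OCR otherwise."""
    document = fitz.open(path)
    for page in document:
        text = page.get_text()
        if not text.strip():                     # likely a scanned page
            pixmap = page.get_pixmap(dpi=dpi)    # rasterize, then OCR
            text = extract_text_from_image(pixmap)
        yield text
```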
The pipeline comprises three main stages:
- PDF to Text Representation: The input PDF is processed using OCR (if necessary) to extract text and identify basic layout elements.
- Text Representation to Persistence Layer: The extracted text and layout information are analyzed to determine the document's structure. This data is then stored in:
  - PostgreSQL: Stores relational data such as document metadata, page information, text content, table data, equation representations, and metadata about images and other unclassified elements.
  - Neo4j: Stores the relationships between document elements as a graph, capturing the document's logical flow and structure.
- Persistence Layer to LaTeX: The structured data in the persistence layer is used to generate LaTeX code. This stage leverages the graph representation in Neo4j and queries PostgreSQL to assemble the final LaTeX output.
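As a minimal sketch of how the second stage might write to both stores, the snippet below uses the `psycopg2` and `py2neo` drivers mentioned in the installation section. The class and method names follow the component diagram; the table and column names are illustrative assumptions:

```python
import psycopg2
from py2neo import Graph, Node, Relationship


class PersistenceLayer:
    def __init__(self, pg_dsn, neo4j_uri, neo4j_auth):
        self.pg = psycopg2.connect(pg_dsn)
        self.graph = Graph(neo4j_uri, auth=neo4j_auth)

    def create_document(self, title, source_path):
        # Relational record in PostgreSQL (illustrative column names)
        with self.pg, self.pg.cursor() as cur:
            cur.execute(
                "INSERT INTO documents (title, source_path) VALUES (%s, %s) RETURNING id",
                (title, source_path),
            )
            doc_id = cur.fetchone()[0]
        # Mirror node in Neo4j for structural queries
        node = Node("Document", pg_id=doc_id, title=title)
        self.graph.create(node)
        return doc_id, node

    def create_follows_relationship(self, block_a, block_b):
        # Reading-order edge between two Block nodes
        self.graph.create(Relationship(block_a, "FOLLOWS", block_b))
```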
- Clone the Repository:

  ```bash
  git clone https://github.com/your-username/pdf-to-latex.git  # Replace with your repo URL
  ```
- Set up a Virtual Environment (Recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate    # Activate the environment (Linux/macOS)
  venv\Scripts\activate       # Activate the environment (Windows)
  ```
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Create a `requirements.txt` file listing the project dependencies (e.g., `psycopg2` for PostgreSQL, `py2neo` for Neo4j, an OCR library, and any NLP libraries).
- Database Setup:
  - Install and configure PostgreSQL and Neo4j.
  - Create the necessary database schemas (see the database schema design in the documentation).
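  As a starting point, the schemas could be created with a small script like the sketch below; the table and column definitions are illustrative assumptions, and the actual scripts live in `database_schemas/`:

  ```python
  import psycopg2

  DDL = """
  CREATE TABLE IF NOT EXISTS documents (
      id          SERIAL PRIMARY KEY,
      title       TEXT,
      source_path TEXT NOT NULL
  );
  CREATE TABLE IF NOT EXISTS pages (
      id          SERIAL PRIMARY KEY,
      document_id INTEGER REFERENCES documents(id),
      page_number INTEGER NOT NULL
  );
  CREATE TABLE IF NOT EXISTS blocks (
      id         SERIAL PRIMARY KEY,
      page_id    INTEGER REFERENCES pages(id),
      block_type TEXT,      -- text, image, table, or equation
      sequence   INTEGER    -- reading order within the page
  );
  """

  if __name__ == "__main__":
      # Connection string is a placeholder; adjust to your local setup.
      with psycopg2.connect("dbname=pdf2latex") as conn, conn.cursor() as cur:
          cur.execute(DDL)
  ```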
- Data Preparation: No specific data preparation is required, although pre-processing PDFs (e.g., cleaning, enhancing image quality) might improve results.
- Run the Pipeline:

  ```bash
  python run_pipeline.py --input your_pdf_file.pdf --output output.tex
  ```

  The script will process the PDF and generate the LaTeX file.
- Configuration: Configure pipeline parameters (e.g., OCR engine, database connection details) in a configuration file or as command-line arguments.
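As a sketch of the command-line route, `run_pipeline.py` could expose these parameters via `argparse`. The `--input` and `--output` flags match the usage shown above; the remaining flags and their defaults are hypothetical:

```python
import argparse


def parse_args():
    # --input and --output match the documented usage; the other flags are
    # illustrative examples of configurable parameters.
    parser = argparse.ArgumentParser(description="Convert a PDF to LaTeX.")
    parser.add_argument("--input", required=True, help="Path to the input PDF")
    parser.add_argument("--output", required=True, help="Path for the generated .tex file")
    parser.add_argument("--ocr-engine", default="tesseract", help="OCR engine to use")
    parser.add_argument("--pg-dsn", default="dbname=pdf2latex", help="PostgreSQL connection string")
    parser.add_argument("--neo4j-uri", default="bolt://localhost:7687", help="Neo4j Bolt URI")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args)
```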
- `run_pipeline.py`: The main script to execute the pipeline.
- `pdf_parser.py`: Module for PDF processing and OCR.
- `structure_analyzer.py`: Module for structural analysis and data extraction.
- `persistence_layer.py`: Module for interacting with the persistence layer.
- `latex_generator.py`: Module for LaTeX generation.
- `database_schemas/`: SQL scripts for database schema creation.
- `README.md`: This file.
- `requirements.txt`: Lists project dependencies.
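Of these modules, `latex_generator.py` produces the final output; per the component diagram it renders content through Jinja2 templates. A minimal sketch, with the template and the block fields being illustrative assumptions rather than the project's actual template set:

```python
from jinja2 import Template

# Illustrative inline template; the real project would load its templates from files.
LATEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
{% for block in blocks -%}
{% if block.type == "heading" -%}
\section{ {{- block.text -}} }
{% else -%}
{{ block.text }}
{% endif -%}
{% endfor -%}
\end{document}
""")


def generate_latex(blocks):
    """Render an ordered list of blocks (dicts with 'type' and 'text') into LaTeX."""
    return LATEX_TEMPLATE.render(blocks=blocks)


if __name__ == "__main__":
    print(generate_latex([
        {"type": "heading", "text": "Introduction"},
        {"type": "paragraph", "text": "Hello, world."},
    ]))
```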
- Accuracy: The accuracy of the generated LaTeX depends on the quality of the input PDF and the performance of the OCR engine.
- Complex Layouts: Highly complex layouts might pose challenges for structural analysis.
- Performance: Processing very large PDFs can be time-consuming. Optimization strategies might be necessary.
- Improved handling of complex tables and figures.
- Integration with a vector database for semantic similarity search.
- Enhanced error handling and reporting.
- Automated testing framework.
- Support for more advanced LaTeX features.
Contributions are welcome! Please open an issue or submit a pull request.
MIT License