Implement Text Extraction in PyMuPDF Fitz Layout Mode #86

Thank you for this excellent MuPDF wrapper!

One feature that MuPDF does not implement natively is layout-preserving plain text extraction.
- Xpdf / Poppler's pdftotext offers a `-layout` mode as standard:
https://www.mankier.com/1/pdftotext
- Other wrappers such as PyMuPDF add their own implementation. Its fitz module extracts text in `layout` mode by default:
`python -m fitz gettext input.pdf`
https://pymupdf.readthedocs.io/en/latest/module.html#text-extraction

This is how the PyMuPDF fitz module does it:
https://github.com/pymupdf/PyMuPDF/blob/main/fitz/__main__.py#L577
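
To make the request more concrete, here is a rough sketch of the general idea behind layout-preserving extraction: map each word's position onto a character grid and pad with spaces. This is not PyMuPDF's code and not an existing go-fitz API. `Word`, `LayoutText`, `charWidth` and `lineHeight` are hypothetical names, and the fixed glyph metrics are a simplifying assumption (a real implementation would derive them from the page's fonts and also handle blank rows).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf8"
)

// Word is a hypothetical record for one extracted word: its text and the
// top-left corner of its bounding box in page units.
type Word struct {
	X, Y float64
	Text string
}

// LayoutText places words on a character grid. charWidth and lineHeight are
// assumed average glyph metrics that map page units to columns and rows.
func LayoutText(words []Word, charWidth, lineHeight float64) string {
	// Group words into rows by quantizing their vertical position.
	rows := map[int][]Word{}
	for _, w := range words {
		r := int(w.Y / lineHeight)
		rows[r] = append(rows[r], w)
	}
	keys := make([]int, 0, len(rows))
	for r := range rows {
		keys = append(keys, r)
	}
	sort.Ints(keys)

	lines := make([]string, 0, len(keys))
	for _, r := range keys {
		row := rows[r]
		// Within a row, emit words left to right.
		sort.Slice(row, func(i, j int) bool { return row[i].X < row[j].X })
		var sb strings.Builder
		cur := 0 // current column, counted in runes
		for _, w := range row {
			pad := int(w.X/charWidth) - cur
			if pad < 1 && cur > 0 {
				pad = 1 // keep at least one space between adjacent words
			}
			if pad > 0 {
				sb.WriteString(strings.Repeat(" ", pad))
				cur += pad
			}
			sb.WriteString(w.Text)
			cur += utf8.RuneCountInString(w.Text)
		}
		lines = append(lines, sb.String())
	}
	return strings.Join(lines, "\n")
}

func main() {
	words := []Word{
		{X: 0, Y: 0, Text: "Name"}, {X: 60, Y: 0, Text: "Qty"},
		{X: 0, Y: 12, Text: "Bolt"}, {X: 60, Y: 12, Text: "4"},
	}
	// With 6-unit-wide characters, the second column starts at column 10.
	fmt.Println(LayoutText(words, 6, 12))
}
```

The words themselves would have to come from MuPDF's structured text output; how go-fitz would expose those positions is left open here.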

When layout preservation is a must, the only options at present are invoking pdftotext from the Go app or, even nastier, calling the fitz Python module from Go.
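
For reference, the pdftotext workaround looks roughly like the sketch below. It assumes the `pdftotext` binary is installed and on `PATH`; `layoutArgs` and `ExtractLayoutText` are hypothetical helper names, not part of go-fitz.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// layoutArgs builds the pdftotext argument list: -layout keeps the physical
// layout, and "-" as the output file sends the text to stdout.
func layoutArgs(pdfPath string) []string {
	return []string{"-layout", pdfPath, "-"}
}

// ExtractLayoutText runs pdftotext (assumed to be on PATH) and returns the
// layout-preserving text it prints.
func ExtractLayoutText(pdfPath string) (string, error) {
	cmd := exec.Command("pdftotext", layoutArgs(pdfPath)...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("pdftotext failed: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: extract <input.pdf>")
		return
	}
	text, err := ExtractLayoutText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```

This works, but it adds an external runtime dependency and a process spawn per document, which is why a native option would be preferable.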

How hard would it be to add this to go-fitz as well?

