feat: add do_reading_order option to PdfPipelineOptions by Frank-Schruefer · Pull Request #3233 · docling-project/docling

Frank-Schruefer · 2026-04-05T21:37:22Z

Summary

This draft PR adds a do_reading_order flag to PdfPipelineOptions, analogous to the existing do_table_structure and do_ocr options.

Motivation

The ReadingOrderPredictor works well for most PDFs, but produces incorrect results for certain document types — in particular, scanned PDFs that also carry a native text layer. On such pages, each line of text typically becomes its own small orphan cluster, and the graph-based predictor struggles with 50–100 small clusters and reorders them incorrectly.

With do_reading_order=False, elements retain the order produced by the layout postprocessor. Combined with spatial cell sorting (sort_cells_spatial in BaseLayoutOptions), this gives correct reading order for the affected pages.
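As a rough illustration of what "keep the layout order, but sort cells spatially" can mean, here is a minimal sketch. It is not docling's `sort_cells_spatial` implementation: `Box`, `sort_spatial`, and the `line_tolerance` value are hypothetical names and assumptions made for this example, and it assumes a single-column page with the coordinate origin at the top left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a text cell; origin assumed at top-left."""
    text: str
    left: float
    top: float


def sort_spatial(boxes, line_tolerance=5.0):
    """Order boxes top-to-bottom, then left-to-right within a line.

    A box joins the current line when its `top` is within `line_tolerance`
    of the first box on that line; otherwise it starts a new line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b.top):
        if lines and abs(box.top - lines[-1][0].top) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b.left)]


cells = [
    Box("world", left=60, top=11),
    Box("second line", left=10, top=30),
    Box("Hello", left=10, top=10),
]
ordered = [b.text for b in sort_spatial(cells)]
print(ordered)
```

A purely geometric sort like this has no notion of columns, which is presumably why it only helps on the simple scanned pages described above.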

Changes

PdfPipelineOptions.do_reading_order: bool = True — new flag, defaults to True (no change in existing behaviour)
ReadingOrderOptions.skip_prediction: bool = False — internal flag on the model
StandardPdfPipeline wires do_reading_order → skip_prediction
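The wiring described in the list above could look roughly like the following. This is a self-contained sketch using stand-in dataclasses, not the actual docling classes: every name ending in `Sketch`, the `build_model` helper, and the placeholder `predict_order` function are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PipelineOptionsSketch:
    # Public flag; the default keeps the existing behaviour.
    do_reading_order: bool = True


@dataclass
class ReadingOrderOptionsSketch:
    # Internal flag on the model, as proposed in this first iteration.
    skip_prediction: bool = False


def predict_order(elements):
    """Placeholder predictor: reverses the list so its effect is visible."""
    return list(reversed(elements))


class ReadingOrderModelSketch:
    def __init__(self, options):
        self.options = options

    def __call__(self, elements):
        if self.options.skip_prediction:
            # Keep the order handed over by the layout postprocessor.
            return list(elements)
        return predict_order(elements)


def build_model(pipeline_options):
    """Hypothetical pipeline wiring: do_reading_order -> skip_prediction."""
    model_options = ReadingOrderOptionsSketch(
        skip_prediction=not pipeline_options.do_reading_order
    )
    return ReadingOrderModelSketch(model_options)


layout_order = ["title", "paragraph", "footnote"]
default_result = build_model(PipelineOptionsSketch())(layout_order)
skipped_result = build_model(PipelineOptionsSketch(do_reading_order=False))(layout_order)
```

Note the inversion between the two flags (`do_reading_order=True` maps to `skip_prediction=False`), which is easy to get wrong when two booleans have opposite polarity.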

Open question

Is this the right approach, or would you prefer a different mechanism to handle cases where the reading-order predictor produces poor results? Happy to adjust based on feedback.

Add do_reading_order (default True) to PdfPipelineOptions, analogous to the existing do_table_structure and do_ocr flags. When set to False, ReadingOrderModel skips the ReadingOrderPredictor and keeps the element order produced by the layout postprocessor as-is. This is useful for PDFs where the reading-order predictor produces incorrect results, for example scanned pages that also carry a native text layer, where the many resulting small orphan clusters confuse the graph-based predictor.

Signed-off-by: stone <frank.schruefer@t-online.de>

github-actions · 2026-04-05T21:37:34Z

❌ DCO Check Failed

Hi @Frank-Schruefer, your pull request has failed the Developer Certificate of Origin (DCO) check.

This repository supports remediation commits, so you can fix this without rewriting history — but you must follow the required message format.

🛠 Quick Fix: Add a remediation commit

Run this command:

# Unable to auto-generate remediation message. Please check the DCO check details.
git push

🔧 Advanced: Sign off each commit directly

For the latest commit:

git commit --amend --signoff
git push --force-with-lease

For multiple commits:

git rebase --signoff origin/main
git push --force-with-lease

More info: DCO check report

mergify · 2026-04-05T21:37:58Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
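To check a title locally before pushing, the rule's pattern can be tried with Python's `re` module. This is a sketch that assumes the `~=` condition behaves like a regular-expression search on the PR title; the second and third titles are made-up examples.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


titles = [
    "feat: add do_reading_order option to PdfPipelineOptions",  # this PR's title
    "fix(pdf)!: drop legacy flag",   # hypothetical: scope plus breaking marker
    "Add do_reading_order option",   # hypothetical: no type prefix
]
results = {title: is_conventional(title) for title in titles}
```

The pattern is case-sensitive, so a title starting with `Feat:` would not match.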

dosubot · 2026-04-06T07:17:23Z

Related Documentation

2 document(s) may need updating based on files changed in this PR:

Docling

How does Docling reconstruct reading order without using a large language model (LLM) call in an unsupervised manner?


@@ -32,6 +32,12 @@
 ## Why It Works Without LLMs
 Spatial positioning encodes reading order for most documents (top-to-bottom, left-to-right conventions), and the ordering problem reduces to a **topological sort over a directed graph**—something DFS solves efficiently and deterministically.
 
+## Configuration
+
+Reading order prediction can be controlled via the `do_reading_order` flag in `PdfPipelineOptions` (defaults to `True`). Setting `do_reading_order=False` disables the `ReadingOrderPredictor` entirely, and elements retain the order produced by the layout postprocessor.
+
+Disabling reading order prediction is useful for certain document types—particularly scanned PDFs with native text layers—where the graph-based predictor may produce incorrect results due to many small orphan text clusters.
+
 ## Known Limitations
 - The 15% dilation threshold is fixed and **not user-configurable**, which can cause issues with complex multi-column layouts.
 - Reading order quality depends heavily on the accuracy of upstream layout detection.


What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?


@@ -4,6 +4,7 @@
     - `from_formats`: Supported input formats include `docx`, `pptx`, `html`, `image`, `pdf`, `asciidoc`, `md` (including `txt`, `text`, `qmd`, `rmd`), `csv`, `xlsx`, `xml_uspto`, `xml_jats`, `xml_xbrl`, `mets_gbs`, `json_docling`, `audio`, `vtt`, `latex`
     - `to_formats`: Supported output formats include `md`, `json`, `yaml`, `html`, `html_split_page`, `text`, `doctags`, `vtt`
     - `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
+      - `do_reading_order` (default True): Enable reading-order prediction to reorder document elements into logical reading sequence; when disabled, elements retain the order produced by the layout postprocessor; disabling can improve results for scanned PDFs with native text layers and many small orphan clusters
     - `do_ocr` (default True): Use OCR
     - `force_ocr`: Replace existing text with OCR-generated text
     - `ocr_engine`, `ocr_lang`: OCR engine and language options


codecov · 2026-04-07T08:53:22Z

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../models/stages/reading_order/readingorder_model.py	80.00%	1 Missing ⚠️


PeterStaar-IBM

lgtm!

cau-git

@Frank-Schruefer thanks for this PR, I can see the point of the suggested change. However, I would like to ask you to align with our conventions by:

removing skip_prediction from ReadingOrderOptions again;
instead, creating an enabled argument on ReadingOrderModel.__init__ similar to this, with a default value of True;
having do_reading_order control the enabled arg.

Thanks.

Replace skip_prediction option in ReadingOrderOptions with an enabled parameter in ReadingOrderModel.__init__, consistent with AutoOcrModel and other pipeline models. The do_reading_order flag in PdfPipelineOptions now controls this enabled argument directly.

Signed-off-by: stone <frank.schruefer@t-online.de>

PeterStaar-IBM · 2026-04-15T07:56:23Z

@Frank-Schruefer Please re-run a few times: uv run pre-commit run --all-files

Signed-off-by: stone <frank.schruefer@t-online.de>

cau-git · 2026-04-17T05:10:00Z

@Frank-Schruefer sorry for walking back on this. I recognize that this does not follow actual enabled semantics now. It still does the full caption, footnote and merge mapping. If that is your intention, then I agree this should have been more of a "skip reordering" option instead of an enabled flag for the full model.

Seeing this, I would suggest reinstating the previous option, but as a "positive" flag in the reading order model options (such as reorder_elements: bool = True). Then let the do_reading_order pipeline flag control that option field. The skip_prediction field you originally proposed misled me here.
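The distinction drawn here, a flag that gates only the reordering step while the caption/footnote mapping still runs, can be sketched as follows. This is an illustration under stated assumptions, not the docling implementation: `ReadingOrderOptionsSketch`, `reorder`, `map_captions`, `run_reading_order`, and the `rank` key are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadingOrderOptionsSketch:
    # "Positive" flag: True means the predictor reorders the elements.
    reorder_elements: bool = True


def reorder(elements):
    """Placeholder for the predictor: sorts by a hypothetical `rank` key."""
    return sorted(elements, key=lambda e: e["rank"])


def map_captions(elements):
    """Placeholder mapping step: link each caption to the picture before it."""
    mapping = {}
    last_picture = None
    for el in elements:
        if el["kind"] == "picture":
            last_picture = el["id"]
        elif el["kind"] == "caption" and last_picture is not None:
            mapping[last_picture] = el["id"]
            last_picture = None
    return mapping


def run_reading_order(elements, options):
    # Only the reordering is optional; the mapping always runs.
    ordered = reorder(elements) if options.reorder_elements else list(elements)
    return ordered, map_captions(ordered)


def model_options_from_pipeline(do_reading_order: bool):
    """Hypothetical wiring: the pipeline flag sets the option field directly."""
    return ReadingOrderOptionsSketch(reorder_elements=do_reading_order)


layout = [
    {"id": "pic1", "kind": "picture", "rank": 2},
    {"id": "cap1", "kind": "caption", "rank": 3},
    {"id": "txt1", "kind": "text", "rank": 1},
]
kept, kept_captions = run_reading_order(layout, model_options_from_pipeline(False))
moved, moved_captions = run_reading_order(layout, model_options_from_pipeline(True))
```

Because both flags are positive, `do_reading_order` can be passed through unchanged, and the name `reorder_elements` does not suggest that the whole model is disabled.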

Replace constructor-level `enabled` flag with `reorder_elements: bool = True` field in ReadingOrderOptions, consistent with the options-based pattern used throughout the codebase.

Signed-off-by: Frank-Schruefer <frank.schruefer@sfs-software.de>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dolfim-ibm marked this pull request as ready for review April 6, 2026 07:11

PeterStaar-IBM previously approved these changes Apr 12, 2026


cau-git requested changes Apr 13, 2026


Frank-Schruefer dismissed PeterStaar-IBM’s stale review via 89cf156 April 13, 2026 18:05

style: apply ruff formatting

8fba188

Signed-off-by: stone <frank.schruefer@t-online.de>

cau-git reviewed Apr 17, 2026


Frank-Schruefer wants to merge 4 commits into docling-project:main from Frank-Schruefer:feat/do-reading-order-option
