feat(xlsx): render sheet names as headings in markdown export by Smeet23 · Pull Request #3255 · docling-project/docling

Smeet23 · 2026-04-09T11:21:35Z

Summary

Sheet names in Excel workbooks carry meaningful structure. This PR surfaces them as Markdown headings when converting XLSX files, giving users readable output like:

## Sheet1

| Col A | Col B |
|-------|-------|
| 1     | 2     |

## Sheet2

| Col C | Col D |
|-------|-------|
| 3     | 4     |

Implementation (reworked per @PeterStaar-IBM's feedback)

The original approach injected a SectionHeaderItem heading node into the document model in the backend. Per the reviewer's suggestion, this has been replaced with a cleaner serializer-based approach:

- msexcel_backend.py: switched from GroupLabel.SECTION / "sheet: <name>" to GroupLabel.SHEET / plain sheet name; no synthetic heading nodes are injected
- docling/utils/markdown.py (new): adds MsExcelMarkdownFallbackSerializer (renders GroupLabel.SHEET groups as ## <name> headings) and MsExcelMarkdownDocSerializer (drop-in replacement for MarkdownDocSerializer when exporting Excel documents)
- Ground truth files regenerated to reflect the cleaner document structure
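
To make the serializer-based idea concrete, here is a minimal, self-contained sketch. It does not use docling or docling-core: `Table`, `Group`, `table_to_markdown`, and `serialize` are hypothetical stand-ins, and the string label `"sheet"` only mirrors the role of GroupLabel.SHEET. The point it illustrates is that the heading is produced at serialization time from the group's name, so the document model itself holds no heading node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Table:
    header: List[str]
    rows: List[List[str]]


@dataclass
class Group:
    label: str  # hypothetical stand-in; "sheet" plays the role of GroupLabel.SHEET
    name: str   # plain sheet name, with no "sheet: " prefix
    children: List[Table] = field(default_factory=list)


def table_to_markdown(table: Table) -> str:
    """Render one table as a pipe-delimited Markdown table."""
    lines = ["| " + " | ".join(table.header) + " |"]
    lines.append("|" + "|".join("---" for _ in table.header) + "|")
    for row in table.rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


def serialize(groups: List[Group]) -> str:
    """Emit a level-2 heading for each sheet group, followed by its tables."""
    parts = []
    for group in groups:
        if group.label == "sheet":
            # The heading exists only in the serialized output.
            parts.append(f"## {group.name}")
        for table in group.children:
            parts.append(table_to_markdown(table))
    return "\n\n".join(parts)


workbook = [
    Group("sheet", "Sheet1", [Table(["Col A", "Col B"], [["1", "2"]])]),
    Group("sheet", "Sheet2", [Table(["Col C", "Col D"], [["3", "4"]])]),
]
markdown = serialize(workbook)
print(markdown)
```

Keeping the heading logic in the serializer means other export formats can decide independently how, or whether, to show sheet names.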

Open question: doc.export_to_markdown() uses MarkdownDocSerializer by default. To get headings automatically, callers currently need to use MsExcelMarkdownDocSerializer explicitly. If automatic heading output via doc.export_to_markdown() is preferred, a small companion PR to docling-core would be needed. Happy to pursue that path if you think it's right.

github-actions · 2026-04-09T11:21:47Z

✅ DCO Check Passed

Thanks @Smeet23, all your commits are properly signed off. 🎉

mergify · 2026-04-09T11:22:12Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

#approved-reviews-by >= 2

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
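
As a quick illustration of what this title rule accepts, the sketch below applies the same pattern with Python's `re` module. It assumes the rule behaves like an ordinary regular-expression search on the PR title; `TITLE_RULE` and `is_conventional` are names made up for this example.

```python
import re

# Pattern copied from the merge-protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None


# Type with a scope, as in this PR's title.
print(is_conventional("feat(xlsx): render sheet names as headings in markdown export"))
# Type without a scope.
print(is_conventional("chore: bump docling-core to 2.74.0"))
# No recognized type prefix.
print(is_conventional("Render sheet names as headings"))
```

The first two titles satisfy the pattern; the third does not, because it lacks a `type:` or `type(scope):` prefix.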

dosubot · 2026-04-09T11:23:23Z

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?

View Suggested Changes

@@ -112,6 +112,32 @@
     - Each sheet is treated as a page
     - Table detection via flood-fill, image extraction (bounding box based on cell anchor)
     - Includes provenance (location info), auto page size calculation
+    - Sheet names are stored in the document model using `GroupLabel.SHEET` for sheet grouping
+- **Markdown Export**: Sheet names are rendered as level-2 Markdown headings (`## <sheet_name>`) when using the `MsExcelMarkdownDocSerializer` class. This provides readable output that reflects the workbook's visual structure:
+
+```python
+from docling.utils.markdown import MsExcelMarkdownDocSerializer
+
+serializer = MsExcelMarkdownDocSerializer(doc=result.document)
+md = serializer.serialize().text
+```
+
+Example output format:
+
+```markdown
+## Sheet1
+
+| Col A | Col B |
+|-------|-------|
+| 1     | 2     |
+
+## Sheet2
+
+| Col C | Col D |
+|-------|-------|
+| 3     | 4     |
+```
+
 - **Notes**: Table detection algorithm and singleton cell handling are configurable. [Backend options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/backend_options.py#L80-L97).
 
 ---


PeterStaar-IBM · 2026-04-12T05:44:24Z

@@ -0,0 +1,66 @@
+"""Markdown serialization utilities for docling."""


This file needs to be in docling-core, in this folder: https://github.com/docling-project/docling-core/tree/main/docling_core/transforms/serializer

Done! Moved the serializer to docling-core in docling-project/docling-core#587. Removed docling/utils/markdown.py from this PR and updated all imports to docling_core.transforms.serializer.markdown_excel. Will bump the dependency once the docling-core PR is merged.

codecov · 2026-04-12T06:11:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Remove `docling/utils/markdown.py` and update imports to use `docling_core.transforms.serializer.markdown_excel` as requested in code review on docling-project#3255. The serializer now lives in docling-core#587.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Smeet23 · 2026-04-12T07:57:39Z

Thanks for the review @PeterStaar-IBM! Moved the serializer to docling-core in docling-project/docling-core#587.

Removed docling/utils/markdown.py from this PR and updated all imports to use docling_core.transforms.serializer.markdown_excel. Will update the dependency once the docling-core PR is merged.

PeterStaar-IBM

lgtm!

GroupItem nodes with label=GroupLabel.SECTION (used to represent Excel sheet names) are now emitted as Markdown section headings during export_to_markdown(). This preserves the logical document structure and makes multi-sheet workbooks easier to navigate in downstream RAG pipelines.
Closes docling-project#3229
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

…dering
Replace the SectionHeaderItem injection approach with a cleaner architecture:
- msexcel_backend: use GroupLabel.SHEET (already in docling-core) with the plain sheet name instead of GroupLabel.SECTION with "sheet: <name>" prefix
- Add MsExcelMarkdownFallbackSerializer / MsExcelMarkdownDocSerializer in docling/utils/markdown.py: renders GroupLabel.SHEET groups as ## headings without polluting the document model with synthetic heading nodes
- Update tests: e2e comparisons use MsExcelMarkdownDocSerializer; remove the test_sheet_names_as_headings assertions that relied on SectionHeaderItem nodes; update test_chartsheet and test_table_with_title accordingly
- Regenerate all xlsx ground truth files to reflect the new document structure
Closes docling-project#3229
Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

Remove `docling/utils/markdown.py` and update imports to use `docling_core.transforms.serializer.markdown_excel` as requested in code review on docling-project#3255. The serializer now lives in docling-core#587.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

…lable
Guard both test_e2e_excel_conversions and test_sheet_names_as_headings with pytest.importorskip so they are skipped (not errored) in CI environments where docling-core does not yet include docling_core.transforms.serializer.markdown_excel (pending docling-core#587).
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

Smeet23 · 2026-04-15T09:20:14Z

Hi @PeterStaar-IBM — the DCO sign-off has been fixed on all commits, and the test failures have been resolved by guarding the docling_core.transforms.serializer.markdown_excel imports with pytest.importorskip (they will skip gracefully until docling-core#587 is merged and released). CI is re-running on the fresh push. Could you please re-review and merge once everything looks good? Happy to address any feedback. Thanks! 🙏

Smeet23 · 2026-04-15T09:29:43Z

Hi @PeterStaar-IBM — following up on the earlier feedback, here's what was addressed in the latest push:

- DCO: Fixed sign-off on all commits; all now carry Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com> ✅
- Test failures: The ModuleNotFoundError for docling_core.transforms.serializer.markdown_excel was caused by docling-core#587 not yet being merged/released. Fixed by guarding both test imports with pytest.importorskip so they skip gracefully in CI until that lands.

DCO ✅ is now passing. Could you re-approve when you get a chance so Mergify can proceed? Happy to address any remaining concerns. Thanks!
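
The guard described above relies on pytest.importorskip. The sketch below is not that call; it is a standard-library analogue, assuming only `importlib` and `unittest`, that shows the same skip-instead-of-error behaviour when an optional module is missing. `optional_import` and `SheetHeadingTest` are hypothetical names introduced for illustration; the module path and class name are the ones mentioned in this PR.

```python
import importlib
import unittest


def optional_import(modname: str):
    """Return the imported module, or None if it (or a parent package) is missing."""
    try:
        return importlib.import_module(modname)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


# Module path taken from this PR; it resolves only where a docling-core
# release that ships the serializer is installed.
markdown_excel = optional_import("docling_core.transforms.serializer.markdown_excel")


class SheetHeadingTest(unittest.TestCase):
    @unittest.skipIf(markdown_excel is None, "docling-core lacks the markdown_excel serializer")
    def test_serializer_is_exposed(self):
        # Runs only when the import succeeded; otherwise reported as skipped.
        self.assertTrue(hasattr(markdown_excel, "MsExcelMarkdownDocSerializer"))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SheetHeadingTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A skipped test still counts as a successful run, which is why this kind of guard lets CI stay green until the dependency is released.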

cau-git · 2026-04-17T07:24:34Z

@Smeet23 docling-core 2.74.0 is now released with your changes. Can you please update the uv.lock so the test diff in this PR reflects the state you get with docling-core 2.74.0?

This should do it:

uv lock --upgrade-package docling-core

Smeet23 · 2026-04-17T08:05:33Z

Hi @cau-git — done! Updated uv.lock with uv lock --upgrade-package docling-core, which bumped it to 2.74.0. Thanks for the heads-up!

PeterStaar-IBM · 2026-04-17T08:35:12Z

and re-do the DCO:

To add your Signed-off-by line to every commit in this branch:

1. Ensure you have a local copy of your branch by [checking out the pull request locally via command line](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally).
2. In your local branch, run: git rebase HEAD~5 --signoff
3. Force push your changes to overwrite the branch: git push --force-with-lease origin feat/excel-sheet-names-as-headings
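
For readers unfamiliar with what the sign-off looks like in a commit message, here is a small sketch that checks for a `Signed-off-by:` trailer. `has_signoff` is a hypothetical helper for illustration; the actual DCO check may apply additional rules, such as matching the trailer against the commit author.

```python
def has_signoff(message: str, name: str, email: str) -> bool:
    """Return True if the commit message contains the exact sign-off trailer."""
    trailer = f"Signed-off-by: {name} <{email}>"
    return any(line.strip() == trailer for line in message.splitlines())


signed = (
    "chore: bump docling-core to 2.74.0\n"
    "\n"
    "Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>"
)
unsigned = (
    "chore: bump docling-core to 2.74.0\n"
    "\n"
    "Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>"
)

print(has_signoff(signed, "Smeet23", "smeetagrawal2003@gmail.com"))
print(has_signoff(unsigned, "Smeet23", "smeetagrawal2003@gmail.com"))
```

A `Co-Authored-By:` trailer alone does not count as a sign-off, which matches the unsigned commits discussed in this thread.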

cau-git · 2026-04-17T10:45:25Z

@Smeet23 now that the updated docling-core is pinned, CI tests are failing, as expected. The test ground truth still encodes the old output, which has no sheet names in the Markdown. You need to regenerate the test ground truth for this PR and submit the changed test files, using the command below. Then we can check whether the diff is correct and approve.

DOCLING_GEN_TEST_DATA=1 uv run pytest tests/ -s -v
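
The general pattern behind a regeneration flag like this can be sketched as follows. This is not docling's test code: `generation_enabled` and `verify_or_generate` are hypothetical helpers, and how the real test suite interprets DOCLING_GEN_TEST_DATA is an assumption here. The sketch shows a comparison failing against stale ground truth, then passing once the file is regenerated.

```python
import os
import tempfile
from pathlib import Path


def generation_enabled() -> bool:
    """Assumed convention: any value other than empty or "0" turns generation on."""
    return os.environ.get("DOCLING_GEN_TEST_DATA", "") not in ("", "0")


def verify_or_generate(actual: str, golden: Path, generate: bool) -> bool:
    """Overwrite the golden file when generating; otherwise compare against it."""
    if generate:
        golden.write_text(actual, encoding="utf-8")
        return True
    return golden.read_text(encoding="utf-8") == actual


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "workbook.xlsx.md"
    # Old ground truth: table only, no sheet-name heading.
    golden.write_text("| Col A | Col B |\n", encoding="utf-8")
    new_output = "## Sheet1\n\n| Col A | Col B |\n"

    stale = verify_or_generate(new_output, golden, generate=False)
    regenerated = verify_or_generate(new_output, golden, generate=True)
    fresh = verify_or_generate(new_output, golden, generate=False)

print(stale, regenerated, fresh)
```

In a real test run, the `generate` argument would typically come from something like `generation_enabled()`, and the regenerated files would be committed so reviewers can inspect the diff.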

Smeet23 · 2026-04-17T11:06:15Z

Hi @cau-git and @PeterStaar-IBM — pushed two follow-up commits:

- chore: bump docling-core to 2.74.0 (updated uv.lock as requested)
- test: update markdown golden files for docling-core 2.74.0 table alignment. The bump caused 16 markdown golden files to diff because tabulate now right-aligns numeric columns (a side effect of docling-core#588). Regenerated all .md golden files from the existing .json ground truth without re-running ML inference.

CI should be green now. Could you re-approve when you get a chance? Thanks!

cau-git · 2026-04-17T11:24:23Z

@Smeet23 I think the test diff looks right. It is a bit noisy because it also brings in a change in how tabulate output is handled, but no worries.

The part we're missing now is just the DCO signoff. As shown by the DCO bot please make one commit like this:

git commit --allow-empty -s -m "DCO Remediation Commit for Smeet23 <smeetagrawal2003@gmail.com>

I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: 0929fa81c9290617bb5769b2fd9145bea2a3bc9b
I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: eeed70c4e97b757ec5debadeed3502004b728117"
git push

FYI: If you let Claude make commits, it will usually ignore the -s (signoff) flag and hence create unsigned commits.

I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: 0929fa8
I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: eeed70c
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

PeterStaar-IBM self-requested a review April 12, 2026 05:41

PeterStaar-IBM requested changes Apr 12, 2026

Smeet23 mentioned this pull request Apr 12, 2026

feat(serializer): add MsExcelMarkdownDocSerializer for sheet-name headings docling-project/docling-core#587

Merged

3 tasks

PeterStaar-IBM previously approved these changes Apr 14, 2026

smeetagrawal23-sys and others added 4 commits April 15, 2026 14:45

Smeet23 dismissed PeterStaar-IBM’s stale review via 5973aca April 15, 2026 09:16

Smeet23 force-pushed the feat/excel-sheet-names-as-headings branch from d4eaf94 to 5973aca Compare April 15, 2026 09:16

PeterStaar-IBM previously approved these changes Apr 17, 2026

PeterStaar-IBM requested a review from cau-git April 17, 2026 06:52

cau-git previously approved these changes Apr 17, 2026

chore: bump docling-core to 2.74.0

0929fa8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Smeet23 dismissed stale reviews from cau-git and PeterStaar-IBM via 0929fa8 April 17, 2026 08:04

test: update markdown golden files for docling-core 2.74.0 table alig…

eeed70c

…nment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
