feat(xlsx): render sheet names as headings in markdown export #3255

Open
Smeet23 wants to merge 7 commits into docling-project:main from Smeet23:feat/excel-sheet-names-as-headings

Conversation

@Smeet23
Contributor

@Smeet23 Smeet23 commented Apr 9, 2026

Summary

Closes #3229

Sheet names in Excel workbooks are meaningful structure — this PR surfaces them as Markdown headings when converting XLSX files, giving users readable output like:

## Sheet1

| Col A | Col B |
|-------|-------|
| 1     | 2     |

## Sheet2

| Col C | Col D |
|-------|-------|
| 3     | 4     |

Implementation (reworked per @PeterStaar-IBM's feedback)

The original approach injected a SectionHeaderItem heading node into the document model in the backend. Per the reviewer's suggestion, this has been replaced with a cleaner serializer-based approach:

  • msexcel_backend.py: switched from GroupLabel.SECTION / "sheet: <name>" to GroupLabel.SHEET / plain sheet name — no synthetic heading nodes injected
  • docling/utils/markdown.py (new): adds MsExcelMarkdownFallbackSerializer (renders GroupLabel.SHEET groups as ## <name> headings) and MsExcelMarkdownDocSerializer (drop-in replacement for MarkdownDocSerializer when exporting Excel documents)
  • Ground truth files regenerated to reflect the cleaner document structure
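
The serializer-based idea can be illustrated with a toy model. This is only a sketch under stated assumptions: `Group`, `Table`, and `render` are hypothetical stand-ins written for this example, not docling or docling-core classes, and the string label `"sheet"` merely mimics `GroupLabel.SHEET`. The point it shows is that the heading is produced at render time from the group label, so no heading node has to be stored in the document model.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Table:
    """Hypothetical stand-in for a table item."""
    header: List[str]
    rows: List[List[str]]


@dataclass
class Group:
    """Hypothetical stand-in for a group node; label "sheet" mimics GroupLabel.SHEET."""
    label: str
    name: str
    children: List[Union["Group", Table]] = field(default_factory=list)


def render_table(table: Table) -> str:
    """Render a table as a simple Markdown pipe table."""
    lines = [
        "| " + " | ".join(table.header) + " |",
        "|" + "|".join("---" for _ in table.header) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in table.rows]
    return "\n".join(lines)


def render(node: Union[Group, Table]) -> str:
    """Serialize a node; sheet groups get a level-2 heading at render time."""
    if isinstance(node, Table):
        return render_table(node)
    parts = []
    if node.label == "sheet":
        # The heading comes from the group's label and name, not from a stored heading node.
        parts.append(f"## {node.name}")
    parts += [render(child) for child in node.children]
    return "\n\n".join(parts)


workbook = Group("unspecified", "workbook", [
    Group("sheet", "Sheet1", [Table(["Col A", "Col B"], [["1", "2"]])]),
    Group("sheet", "Sheet2", [Table(["Col C", "Col D"], [["3", "4"]])]),
])
markdown = render(workbook)
print(markdown)
```

Because the tree itself holds only groups and tables, other exporters that walk the same structure would not see any synthetic heading items.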

Open question: doc.export_to_markdown() uses MarkdownDocSerializer by default. To get headings automatically, callers currently need to use MsExcelMarkdownDocSerializer explicitly. If automatic heading output via doc.export_to_markdown() is preferred, a small companion PR to docling-core would be needed. Happy to pursue that path if you think it's right.

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

DCO Check Passed

Thanks @Smeet23, all your commits are properly signed off. 🎉

@mergify
Contributor

mergify bot commented Apr 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

Waiting for:

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
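
A title can be checked locally against this rule before pushing. The sketch below copies the pattern from the rule above; the helper name `is_conventional` is hypothetical, and it assumes the rule applies ordinary regular-expression search semantics to the PR title.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat(xlsx): render sheet names as headings in markdown export"))
print(is_conventional("Render sheet names as headings"))
```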

@dosubot

dosubot bot commented Apr 9, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
View Suggested Changes
@@ -112,6 +112,32 @@
     - Each sheet is treated as a page
     - Table detection via flood-fill, image extraction (bounding box based on cell anchor)
     - Includes provenance (location info), auto page size calculation
+    - Sheet names are stored in the document model using `GroupLabel.SHEET` for sheet grouping
+- **Markdown Export**: Sheet names are rendered as level-2 Markdown headings (`## <sheet_name>`) when using the `MsExcelMarkdownDocSerializer` class. This provides readable output that reflects the workbook's visual structure:
+
+```python
+from docling.utils.markdown import MsExcelMarkdownDocSerializer
+
+serializer = MsExcelMarkdownDocSerializer(doc=result.document)
+md = serializer.serialize().text
+```
+
+Example output format:
+
+```markdown
+## Sheet1
+
+| Col A | Col B |
+|-------|-------|
+| 1     | 2     |
+
+## Sheet2
+
+| Col C | Col D |
+|-------|-------|
+| 3     | 4     |
+```
+
 - **Notes**: Table detection algorithm and singleton cell handling are configurable. [Backend options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/backend_options.py#L80-L97).
 
 ---


@PeterStaar-IBM PeterStaar-IBM self-requested a review April 12, 2026 05:41
Comment thread docling/utils/markdown.py Outdated
@@ -0,0 +1,66 @@
"""Markdown serialization utilities for docling."""
Member

Contributor Author


Done! Moved the serializer to docling-core in docling-project/docling-core#587. Removed docling/utils/markdown.py from this PR and updated all imports to docling_core.transforms.serializer.markdown_excel. Will bump the dependency once the docling-core PR is merged.

@codecov

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Smeet23 added a commit to Smeet23/docling that referenced this pull request Apr 12, 2026
Remove `docling/utils/markdown.py` and update imports to use
`docling_core.transforms.serializer.markdown_excel` as requested in
code review on docling-project#3255. The serializer now lives in docling-core#587.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Smeet23
Contributor Author

Smeet23 commented Apr 12, 2026

Thanks for the review @PeterStaar-IBM! Moved the serializer to docling-core in docling-project/docling-core#587.

Removed docling/utils/markdown.py from this PR and updated all imports to use docling_core.transforms.serializer.markdown_excel. Will update the dependency once the docling-core PR is merged.

PeterStaar-IBM previously approved these changes Apr 14, 2026
Member

@PeterStaar-IBM PeterStaar-IBM left a comment


lgtm!

smeetagrawal23-sys and others added 4 commits April 15, 2026 14:45
GroupItem nodes with label=GroupLabel.SECTION (used to represent Excel
sheet names) are now emitted as Markdown section headings during
export_to_markdown(). This preserves the logical document structure
and makes multi-sheet workbooks easier to navigate in downstream RAG
pipelines.

Closes docling-project#3229

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
…dering

Replace the SectionHeaderItem injection approach with a cleaner architecture:

- msexcel_backend: use GroupLabel.SHEET (already in docling-core) with the
  plain sheet name instead of GroupLabel.SECTION with "sheet: <name>" prefix
- Add MsExcelMarkdownFallbackSerializer / MsExcelMarkdownDocSerializer in
  docling/utils/markdown.py: renders GroupLabel.SHEET groups as ## headings
  without polluting the document model with synthetic heading nodes
- Update tests: e2e comparisons use MsExcelMarkdownDocSerializer; remove the
  test_sheet_names_as_headings assertions that relied on SectionHeaderItem
  nodes; update test_chartsheet and test_table_with_title accordingly
- Regenerate all xlsx ground truth files to reflect the new document structure

Closes docling-project#3229

Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Remove `docling/utils/markdown.py` and update imports to use
`docling_core.transforms.serializer.markdown_excel` as requested in
code review on docling-project#3255. The serializer now lives in docling-core#587.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
…lable

Guard both test_e2e_excel_conversions and test_sheet_names_as_headings
with pytest.importorskip so they are skipped (not errored) in CI
environments where docling-core does not yet include
docling_core.transforms.serializer.markdown_excel (pending
docling-core#587).

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
@Smeet23 Smeet23 force-pushed the feat/excel-sheet-names-as-headings branch from d4eaf94 to 5973aca Compare April 15, 2026 09:16
@Smeet23
Contributor Author

Smeet23 commented Apr 15, 2026

Hi @PeterStaar-IBM — the DCO sign-off has been fixed on all commits, and the test failures have been resolved by guarding the docling_core.transforms.serializer.markdown_excel imports with pytest.importorskip (they will skip gracefully until docling-core#587 is merged and released). CI is re-running on the fresh push. Could you please re-review and merge once everything looks good? Happy to address any feedback. Thanks! 🙏
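
For readers unfamiliar with the guard, `pytest.importorskip("docling_core.transforms.serializer.markdown_excel")` turns a missing module into a skipped test rather than an error. The stdlib-only sketch below approximates that behavior so it can run without pytest installed; `import_or_skip` is a hypothetical helper written for illustration, not the pytest implementation, and it uses `unittest.SkipTest` as the skip signal.

```python
import importlib
import unittest


def import_or_skip(modname: str):
    """Import and return a module, or signal a skip if it is not importable."""
    try:
        return importlib.import_module(modname)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package or submodule ends up here and becomes a skip.
        raise unittest.SkipTest(f"could not import {modname!r}: {exc}") from exc


class SheetHeadingTest(unittest.TestCase):
    def test_sheet_names_as_headings(self):
        # Skipped, not errored, when the installed docling-core lacks the module.
        import_or_skip("docling_core.transforms.serializer.markdown_excel")
```

Once a docling-core release containing the module is installed, the same test body runs normally with no change to the guard.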

@Smeet23
Contributor Author

Smeet23 commented Apr 15, 2026

Hi @PeterStaar-IBM — following up on the earlier feedback, here's what was addressed in the latest push:

  • DCO: Fixed sign-off on all commits — all now carry Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
  • Test failures: The ModuleNotFoundError for docling_core.transforms.serializer.markdown_excel was caused by docling-core#587 not yet being merged/released. Fixed by guarding both test imports with pytest.importorskip so they skip gracefully in CI until that lands.

DCO ✅ is now passing. Could you re-approve when you get a chance so Mergify can proceed? Happy to address any remaining concerns. Thanks!

PeterStaar-IBM previously approved these changes Apr 17, 2026
@PeterStaar-IBM PeterStaar-IBM requested a review from cau-git April 17, 2026 06:52
cau-git previously approved these changes Apr 17, 2026
@cau-git
Member

cau-git commented Apr 17, 2026

@Smeet23 docling-core 2.74.0 is now released with your changes. Can you please update the uv.lock so the test diff in this PR reflects the state you get with docling-core 2.74.0?

This should do it:

uv lock --upgrade-package docling-core

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Smeet23 Smeet23 dismissed stale reviews from cau-git and PeterStaar-IBM via 0929fa8 April 17, 2026 08:04
@Smeet23
Contributor Author

Smeet23 commented Apr 17, 2026

Hi @cau-git — done! Updated uv.lock with uv lock --upgrade-package docling-core, which bumped it to 2.74.0. Thanks for the heads-up!

@PeterStaar-IBM
Member

and re-do the DCO:

To add your Signed-off-by line to every commit in this branch:

1. Ensure you have a local copy of your branch by [checking out the pull request locally via command line](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/checking-out-pull-requests-locally).
2. In your local branch, run: git rebase HEAD~5 --signoff
3. Force push your changes to overwrite the branch: git push --force-with-lease origin feat/excel-sheet-names-as-headings
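
What `--signoff` adds is a `Signed-off-by: Name <email>` trailer line in each commit message. The sketch below checks a message for that line. It rests on an assumption: that the DCO check looks for a trailer matching the commit author. The helper `has_signoff` is hypothetical and is not the DCO bot's actual logic.

```python
def has_signoff(message: str, name: str, email: str) -> bool:
    """Return True if the commit message carries a matching Signed-off-by line."""
    expected = f"Signed-off-by: {name} <{email}>"
    return any(line.strip() == expected for line in message.splitlines())


signed = (
    "chore: bump docling-core to 2.74.0\n"
    "\n"
    "Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>\n"
)
unsigned = (
    "chore: bump docling-core to 2.74.0\n"
    "\n"
    "Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>\n"
)

print(has_signoff(signed, "Smeet23", "smeetagrawal2003@gmail.com"))
print(has_signoff(unsigned, "Smeet23", "smeetagrawal2003@gmail.com"))
```

A `Co-Authored-By` trailer alone does not count as a sign-off under this check, which matches the situation described in this thread.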

@cau-git
Member

cau-git commented Apr 17, 2026

@Smeet23 now that the updated docling-core is pinned, CI tests are failing, as expected: the test ground truth still encodes the old output, which has no sheet names in the Markdown. Please regenerate the test ground truth for this PR with the command below and submit the changed test files. We can then check that the diff is correct and approve.

DOCLING_GEN_TEST_DATA=1 uv run pytest tests/ -s -v

…nment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Smeet23
Contributor Author

Smeet23 commented Apr 17, 2026

Hi @cau-git and @PeterStaar-IBM — pushed two follow-up commits:

  1. chore: bump docling-core to 2.74.0 — updated uv.lock as requested
  2. test: update markdown golden files for docling-core 2.74.0 table alignment — the bump caused 16 markdown golden files to diff due to tabulate now right-aligning numeric columns (a side effect of docling-core#588). Regenerated all .md golden files from the existing .json ground truth without re-running ML inference.

CI should be green now. Could you re-approve when you get a chance? Thanks!

@cau-git
Member

cau-git commented Apr 17, 2026

@Smeet23 I think the test diff looks right. It is a bit noisy because it also pulls in a change in how tabulate output is handled, but no worries.

The only part still missing is the DCO sign-off. As shown by the DCO bot, please make one commit like this:

git commit --allow-empty -s -m "DCO Remediation Commit for Smeet23 <smeetagrawal2003@gmail.com>

I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: 0929fa81c9290617bb5769b2fd9145bea2a3bc9b
I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: eeed70c4e97b757ec5debadeed3502004b728117"
git push

FYI: If you let Claude make commits, it will usually ignore the -s (signoff) flag and hence create unsigned commits.

I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: 0929fa8
I, Smeet23 <smeetagrawal2003@gmail.com>, hereby add my Signed-off-by to this commit: eeed70c

Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Include Excel Sheet Names as Headings in Markdown Export

4 participants