feat(xlsx): render sheet names as headings in markdown export#3255
feat(xlsx): render sheet names as headings in markdown export#3255Smeet23 wants to merge 7 commits intodocling-project:mainfrom
Conversation
|
✅ DCO Check Passed Thanks @Smeet23, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesWaiting for:
This rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: Docling What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?View Suggested Changes@@ -112,6 +112,32 @@
- Each sheet is treated as a page
- Table detection via flood-fill, image extraction (bounding box based on cell anchor)
- Includes provenance (location info), auto page size calculation
+ - Sheet names are stored in the document model using `GroupLabel.SHEET` for sheet grouping
+- **Markdown Export**: Sheet names are rendered as level-2 Markdown headings (`## <sheet_name>`) when using the `MsExcelMarkdownDocSerializer` class. This provides readable output that reflects the workbook's visual structure:
+
+```python
+from docling.utils.markdown import MsExcelMarkdownDocSerializer
+
+serializer = MsExcelMarkdownDocSerializer(doc=result.document)
+md = serializer.serialize().text
+```
+
+Example output format:
+
+```markdown
+## Sheet1
+
+| Col A | Col B |
+|-------|-------|
+| 1 | 2 |
+
+## Sheet2
+
+| Col C | Col D |
+|-------|-------|
+| 3 | 4 |
+```
+
- **Notes**: Table detection algorithm and singleton cell handling are configurable. [Backend options code](https://github.com/docling-project/docling/blob/ae4fdbbb09fd377bb271e9b2efe541873eeb2990/docling/datamodel/backend_options.py#L80-L97).
---Note: You must be authenticated to accept/decline updates. |
| @@ -0,0 +1,66 @@ | |||
| """Markdown serialization utilities for docling.""" | |||
There was a problem hiding this comment.
This file needs to be in docling-core in this folder https://github.com/docling-project/docling-core/tree/main/docling_core/transforms/serializer
There was a problem hiding this comment.
Done! Moved the serializer to docling-core in docling-project/docling-core#587. Removed docling/utils/markdown.py from this PR and updated all imports to docling_core.transforms.serializer.markdown_excel. Will bump the dependency once the docling-core PR is merged.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Remove `docling/utils/markdown.py` and update imports to use `docling_core.transforms.serializer.markdown_excel` as requested in code review on docling-project#3255. The serializer now lives in docling-core#587. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the review @PeterStaar-IBM! Moved the serializer to docling-core in docling-project/docling-core#587. Removed |
GroupItem nodes with label=GroupLabel.SECTION (used to represent Excel sheet names) are now emitted as Markdown section headings during export_to_markdown(). This preserves the logical document structure and makes multi-sheet workbooks easier to navigate in downstream RAG pipelines. Closes docling-project#3229 Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
…dering Replace the SectionHeaderItem injection approach with a cleaner architecture: - msexcel_backend: use GroupLabel.SHEET (already in docling-core) with the plain sheet name instead of GroupLabel.SECTION with "sheet: <name>" prefix - Add MsExcelMarkdownFallbackSerializer / MsExcelMarkdownDocSerializer in docling/utils/markdown.py: renders GroupLabel.SHEET groups as ## headings without polluting the document model with synthetic heading nodes - Update tests: e2e comparisons use MsExcelMarkdownDocSerializer; remove the test_sheet_names_as_headings assertions that relied on SectionHeaderItem nodes; update test_chartsheet and test_table_with_title accordingly - Regenerate all xlsx ground truth files to reflect the new document structure Closes docling-project#3229 Signed-off-by: Smeet Agrawal <smeetagrawal23@gmail.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
Remove `docling/utils/markdown.py` and update imports to use `docling_core.transforms.serializer.markdown_excel` as requested in code review on docling-project#3255. The serializer now lives in docling-core#587. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
…lable Guard both test_e2e_excel_conversions and test_sheet_names_as_headings with pytest.importorskip so they are skipped (not errored) in CI environments where docling-core does not yet include docling_core.transforms.serializer.markdown_excel (pending docling-core#587). Signed-off-by: Smeet23 <smeetagrawal2003@gmail.com>
d4eaf94 to
5973aca
Compare
|
Hi @PeterStaar-IBM — the DCO sign-off has been fixed on all commits, and the test failures have been resolved by guarding the |
|
Hi @PeterStaar-IBM — following up on the earlier feedback, here's what was addressed in the latest push:
DCO ✅ is now passing. Could you re-approve when you get a chance so Mergify can proceed? Happy to address any remaining concerns. Thanks! |
|
@Smeet23 docling-core 2.74.0 is now released with your changes. Can you please update the uv.lock so the test diff in this PR reflects the state you get with This should do it: |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hi @cau-git — done! Updated |
|
and re-do the DCO: |
|
@Smeet23 now that the updated docling-core is pinned, we are seeing CI tests fail - as expected. This is because the test ground truth encodes the old truth that has no sheet names in the markdown. You need to re-generate the test-ground truth for this PR and submit the changed test files as below. Then we see if the diff is correct and can approve. |
…nment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hi @cau-git and @PeterStaar-IBM — pushed two follow-up commits:
CI should be green now. Could you re-approve when you get a chance? Thanks! |
|
@Smeet23 I think the test diff looks right. It is a bit noisy because it also brings an update from handling tabulate output in, but no worry. The part we're missing now is just the DCO signoff. As shown by the DCO bot please make one commit like this: FYI: If you let claude make commits, it will usually ignore the |
Summary
Closes #3229
Sheet names in Excel workbooks are meaningful structure — this PR surfaces them as Markdown headings when converting XLSX files, giving users readable output like:
Implementation (reworked per @PeterStaar-IBM's feedback)
The original approach injected a
SectionHeaderItemheading node into the document model in the backend. Per the reviewer's suggestion, this has been replaced with a cleaner serializer-based approach:msexcel_backend.py: switched fromGroupLabel.SECTION/"sheet: <name>"toGroupLabel.SHEET/ plain sheet name — no synthetic heading nodes injecteddocling/utils/markdown.py(new): addsMsExcelMarkdownFallbackSerializer(rendersGroupLabel.SHEETgroups as## <name>headings) andMsExcelMarkdownDocSerializer(drop-in replacement forMarkdownDocSerializerwhen exporting Excel documents)Open question:
doc.export_to_markdown()usesMarkdownDocSerializerby default. To get headings automatically, callers currently need to useMsExcelMarkdownDocSerializerexplicitly. If automatic heading output viadoc.export_to_markdown()is preferred, a small companion PR to docling-core would be needed. Happy to pursue that path if you think it's right.