feat(docx): extract VML images with v:imagedata elements by ceberam · Pull Request #3343 · docling-project/docling

ceberam · 2026-04-21T15:07:21Z

Summary

This PR adds support for extracting VML images with v:imagedata elements from Word documents. While VML shapes (vector graphics like rectangles, ovals, lines) were already supported for positioning, embedded images referenced via VML v:imagedata elements were not being extracted.

Problem

When converting Word documents containing VML images (commonly used for embedded Visio drawings, EMF/WMF files, and OLE objects), these images were missing from the output. The backend only detected DrawingML images (using a:blip elements) but missed VML images (using v:imagedata elements).
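To make the distinction concrete, here is a minimal sketch of how the two image reference styles differ in the document XML. It is not the code from this PR: it uses the standard library's `xml.etree.ElementTree` rather than the backend's own XML tooling, the `SAMPLE` fragment is a simplified, hypothetical body (real documents nest `a:blip` more deeply), and `image_relationship_ids` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by WordprocessingML, DrawingML, VML and relationships.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "v": "urn:schemas-microsoft-com:vml",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Hypothetical, heavily simplified document body: one DrawingML image
# (a:blip with r:embed) and one VML image (v:imagedata with r:id).
SAMPLE = f"""
<w:document xmlns:w="{NS['w']}" xmlns:a="{NS['a']}"
            xmlns:v="{NS['v']}" xmlns:r="{NS['r']}">
  <w:body>
    <w:p><w:r><w:drawing><a:blip r:embed="rId4"/></w:drawing></w:r></w:p>
    <w:p><w:r><w:pict>
      <v:shape id="shape1"><v:imagedata r:id="rId5"/></v:shape>
    </w:pict></w:r></w:p>
  </w:body>
</w:document>
"""


def image_relationship_ids(xml_text):
    """Return (DrawingML ids, VML ids) of the images referenced in the XML."""
    root = ET.fromstring(xml_text)
    # Namespaced attributes are addressed in Clark notation: {uri}localname.
    r_embed = f"{{{NS['r']}}}embed"
    r_id = f"{{{NS['r']}}}id"
    drawingml = [blip.get(r_embed) for blip in root.findall(".//a:blip", NS)]
    vml = [img.get(r_id) for img in root.findall(".//v:imagedata", NS)]
    return drawingml, vml


drawingml_ids, vml_ids = image_relationship_ids(SAMPLE)
print("DrawingML:", drawingml_ids, "VML:", vml_ids)
```

A backend that only runs the `a:blip` query would see `rId4` and never reach `rId5`, which matches the missing-image symptom described above.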

Solution

- VML Image Detection: added an XPath expression to detect v:imagedata elements
- VML Image Handler: implemented a _handle_vml_pictures() method that:
  - Extracts images from VML v:imagedata relationship IDs
  - Attempts direct PIL loading for common formats (PNG, JPEG, etc.)
  - Falls back to LibreOffice conversion for EMF/WMF formats
- Code Consolidation: refactored duplicate code across all image handlers:
  - _get_image_from_relationship() - Unified image retrieval from relationship IDs
  - _add_picture_to_doc() - Unified picture element addition
  - _convert_elements_via_docx() - Unified DOCX→PDF→PNG conversion (supports both single and multiple elements)
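The "try direct loading, then fall back to conversion" flow from the handler description can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: `load_image_with_fallback`, `fake_direct` and `fake_convert` are hypothetical names, and the two stub loaders merely stand in for PIL loading and the LibreOffice-based conversion so that the sketch runs without external dependencies.

```python
from typing import Callable, Optional

# A loader takes raw image bytes and returns some image object, or raises.
Loader = Callable[[bytes], object]


def load_image_with_fallback(
    data: bytes, direct: Loader, convert: Loader
) -> Optional[object]:
    """Try the cheap direct loader first; fall back to the converter.

    Returns None when both strategies fail, so the caller can skip the image.
    """
    try:
        return direct(data)
    except Exception:
        pass  # e.g. a format the direct loader cannot decode (EMF/WMF)
    try:
        return convert(data)
    except Exception:
        return None


# Stub standing in for direct loading: accepts only PNG-like bytes.
def fake_direct(data: bytes) -> object:
    if not data.startswith(b"\x89PNG"):
        raise ValueError("unsupported format")
    return ("direct", len(data))


# Stub standing in for the slower conversion path.
def fake_convert(data: bytes) -> object:
    return ("converted", len(data))


png_like = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
emf_like = b"\x01\x00\x00\x00" + b"\x00" * 40

print(load_image_with_fallback(png_like, fake_direct, fake_convert))
print(load_image_with_fallback(emf_like, fake_direct, fake_convert))
```

Passing the loaders in as callables keeps the expensive conversion path out of the common case and makes both branches easy to exercise in tests.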

Key Distinction

Before: VML namespace detection was only used for positioning vector shapes
After: VML v:imagedata elements are now detected and their embedded images are extracted

Testing

✅ All 25 existing tests pass
✅ New test document docx_vml_images.docx with 2 VML images and ground truth
✅ Test document programmatically created with proper VML structure and v:imagedata elements
✅ Both VML images successfully extracted with embedded data

Files Changed

docling/backend/msword_backend.py - VML image extraction + code consolidation (~150 lines added, ~100 lines removed via consolidation)
tests/data/docx/docx_vml_images.docx - Test document with VML images
tests/data/groundtruth/docling_v2/docx_vml_images.docx.json - Ground truth

Resolves #3298

Checklist:

Documentation has been updated, if necessary.
Examples have been added, if necessary.
Tests have been added, if necessary.

Add VML image support with EMF/WMF conversion and consolidate image handler code. Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>

mergify · 2026-04-21T15:07:56Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

#approved-reviews-by >= 2

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
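For contributors who want to check a title locally before opening a PR, the rule's pattern can be tried with Python's `re` module. This is a minimal sketch: `is_conventional` is a hypothetical helper, and it assumes the pattern behaves the same under Python's regex engine as it does in the merge-protection check.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


print(is_conventional("feat(docx): extract VML images with v:imagedata elements"))
print(is_conventional("Extract VML images from Word documents"))
```

The scope in parentheses and the `!` breaking-change marker are both optional, so `fix: ...` and `feat(docx)!: ...` are accepted as well.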

github-actions · 2026-04-21T15:14:44Z

✅ DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

codecov · 2026-04-21T15:26:00Z

Codecov Report

❌ Patch coverage is 70.40816% with 29 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
docling/backend/msword_backend.py	70.40%	29 Missing ⚠️
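As a sanity check on the reported figure, the percentage and the missing-line count together imply the size of the patch. The total of 98 coverable lines below is inferred from those two numbers, not stated in the report, and `patch_coverage` is an illustrative helper.

```python
def patch_coverage(total_lines: int, missing_lines: int) -> float:
    """Percentage of coverable patch lines that were executed by tests."""
    return 100 * (total_lines - missing_lines) / total_lines


# Inferred: 29 missing lines at 70.40816% implies 98 coverable lines,
# of which 69 are covered (69 / 98 = 0.7040816...).
TOTAL_LINES = 98
MISSING_LINES = 29

print(f"{patch_coverage(TOTAL_LINES, MISSING_LINES):.5f}%")
```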

📢 Thoughts on this report? Let us know!

feat(docx): Extract VML images with v:imagedata elements

25954a6


ceberam self-assigned this Apr 21, 2026

ceberam added the docx label (issue related to docx backend) Apr 21, 2026

ceberam assigned vagenas, PeterStaar-IBM, dolfim-ibm and maxmnemonic Apr 21, 2026

ceberam unassigned vagenas Apr 21, 2026

ceberam requested review from PeterStaar-IBM, dolfim-ibm and maxmnemonic April 21, 2026 15:42

ceberam unassigned maxmnemonic, dolfim-ibm and PeterStaar-IBM Apr 21, 2026

ceberam wants to merge 1 commit into main from fix/extract-emf-pictures
