feat(docx): extract VML images with v:imagedata elements #3343

Open
ceberam wants to merge 1 commit into main from fix/extract-emf-pictures

@ceberam (Member) commented Apr 21, 2026

Summary

This PR adds support for extracting VML images with v:imagedata elements from Word documents. While VML shapes (vector graphics like rectangles, ovals, lines) were already supported for positioning, embedded images referenced via VML v:imagedata elements were not being extracted.

Problem

When converting Word documents containing VML images (commonly used for embedded Visio drawings, EMF/WMF files, and OLE objects), these images were missing from the output. The backend only detected DrawingML images (using a:blip elements) but missed VML images (using v:imagedata elements).
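To make the distinction concrete, here is a minimal sketch (not the backend's actual code) that collects relationship IDs from both kinds of picture element using only the standard library. The sample XML is heavily simplified: in a real document `a:blip` sits several levels deeper inside `w:drawing`. The function name `find_image_rel_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard OOXML / VML namespace URIs.
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "v": "urn:schemas-microsoft-com:vml",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}

# Simplified body: one DrawingML picture and one VML picture.
SAMPLE = (
    '<w:body xmlns:w="%(w)s" xmlns:a="%(a)s" xmlns:v="%(v)s" xmlns:r="%(r)s">'
    '<w:p><w:r><w:drawing><a:blip r:embed="rId4"/></w:drawing></w:r></w:p>'
    '<w:p><w:r><w:pict><v:shape><v:imagedata r:id="rId5"/></v:shape></w:pict></w:r></w:p>'
    "</w:body>"
) % NS


def find_image_rel_ids(root):
    """Return (DrawingML ids, VML ids) found anywhere under root (hypothetical helper)."""
    r_ns = NS["r"]
    # DrawingML pictures reference their image part through r:embed on a:blip.
    drawingml = [b.get("{%s}embed" % r_ns) for b in root.findall(".//a:blip", NS)]
    # VML pictures reference their image part through r:id on v:imagedata.
    vml = [i.get("{%s}id" % r_ns) for i in root.findall(".//v:imagedata", NS)]
    return drawingml, vml


drawingml_ids, vml_ids = find_image_rel_ids(ET.fromstring(SAMPLE))
```

A detector that only queries `a:blip` would return `rId4` here and never see `rId5`, which is the gap described above.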

Solution

  1. VML Image Detection - Added XPath expression to detect v:imagedata elements
  2. VML Image Handler - Implemented _handle_vml_pictures() method that:
    • Extracts images from VML v:imagedata relationship IDs
    • Attempts direct PIL loading for common formats (PNG, JPEG, etc.)
    • Falls back to LibreOffice conversion for EMF/WMF formats
  3. Code Consolidation - Refactored duplicate code across all image handlers:
    • _get_image_from_relationship() - Unified image retrieval from relationship IDs
    • _add_picture_to_doc() - Unified picture element addition
    • _convert_elements_via_docx() - Unified DOCX→PDF→PNG conversion (supports both single and multiple elements)
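The "load directly, else convert" decision in step 2 can be illustrated with a small sketch. Note the swap: the PR tries PIL first and falls back on failure, whereas this example inspects the leading bytes of the image part instead, so that it runs without third-party packages. The names `sniff_image_format` and `plan_image_handling` are hypothetical and do not appear in the PR.

```python
def sniff_image_format(blob: bytes) -> str:
    """Guess the image format from well-known header bytes (hypothetical helper)."""
    if blob.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if blob.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if blob.startswith(b"\xd7\xcd\xc6\x9a"):
        # Placeable WMF files start with this 4-byte key.
        return "wmf"
    if len(blob) >= 44 and blob[:4] == b"\x01\x00\x00\x00" and blob[40:44] == b" EMF":
        # EMF: first record is EMR_HEADER (type 1), signature " EMF" at offset 40.
        return "emf"
    return "unknown"


def plan_image_handling(blob: bytes) -> str:
    """Pick a handling route for an image part (assumed routing, not the PR's logic)."""
    fmt = sniff_image_format(blob)
    if fmt in {"png", "jpeg"}:
        return "load_directly"
    if fmt in {"emf", "wmf"}:
        return "convert_via_libreoffice"
    return "try_load_then_convert"


# Synthetic headers, just long enough to exercise each branch.
png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
emf_header = b"\x01\x00\x00\x00" + b"\x00" * 36 + b" EMF"
```

Trying the loader first, as the PR does, has the advantage of covering any format the installed imaging library supports without maintaining a signature table.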

Key Distinction

  • Before: VML namespace detection was only used for positioning vector shapes
  • After: VML v:imagedata elements are now detected and their embedded images are extracted

Testing

  • ✅ All 25 existing tests pass
  • ✅ New test document docx_vml_images.docx with 2 VML images and ground truth
  • ✅ Test document programmatically created with proper VML structure and v:imagedata elements
  • ✅ Both VML images successfully extracted with embedded data

Files Changed

  • docling/backend/msword_backend.py - VML image extraction + code consolidation (~150 lines added, ~100 lines removed via consolidation)
  • tests/data/docx/docx_vml_images.docx - Test document with VML images
  • tests/data/groundtruth/docling_v2/docx_vml_images.docx.json - Ground truth

Resolves #3298

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Add VML image support with EMF/WMF conversion and consolidate image handler code.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
ceberam self-assigned this Apr 21, 2026
ceberam added the "docx" label (issue related to docx backend) Apr 21, 2026
mergify Bot (Contributor) commented Apr 21, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for:

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
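As a quick illustration of what this rule accepts, the pattern can be checked locally. This is a sketch only: `is_conventional` is a hypothetical helper, and because the pattern is anchored with `^`, searching and matching from the start give the same result.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


pr_title_ok = is_conventional("feat(docx): extract VML images with v:imagedata elements")
plain_title_ok = is_conventional("Add VML image support")
```

The scope in parentheses and the breaking-change `!` are both optional, so `fix: typo` and `feat(docx)!: drop legacy path` would pass as well.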

@github-actions (Contributor)

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 70.40816% with 29 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling/backend/msword_backend.py | 70.40% | 29 Missing ⚠️

Development

Successfully merging this pull request may close these issues.

Docling cannot correctly extract emf picture in a.docx file.
