
fix: strengthen input validation for METS‑GBS processing#3336

Open
ceberam wants to merge 1 commit into main from fix/gbs-backend-decompression
ceberam (Member) commented Apr 21, 2026

Summary

This PR hardens METS-GBS document processing against XML external entity (XXE) attacks and decompression bombs by restricting the XML parser and validating archives before extraction.

XML Input Handling

  • Location: docling/backend/mets_gbs_backend.py
  • Issue: The XML parser was configured with default settings that allowed external entities and DTD loading.
  • Resolution: Updated the parser to disable entity resolution, prevent DTDs, and block network access.
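
For readers unfamiliar with this class of issue, here is a minimal sketch of the idea using only the Python standard library. It is not the code in this PR: the backend's actual parser and its options may differ, and the names `parse_untrusted_xml` and `UnsafeXMLError` are hypothetical. The sketch scans the input with expat first and refuses any document that declares a DTD or entities, so nothing external is ever resolved.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class UnsafeXMLError(ValueError):
    """Hypothetical error type: the document declares a DTD or entities."""


def _forbid(*_args):
    # expat calls this handler as soon as it meets a forbidden construct.
    raise UnsafeXMLError("DTD and entity declarations are not allowed")


def parse_untrusted_xml(data: bytes) -> ET.Element:
    """Parse XML only if it contains no DOCTYPE or entity declarations."""
    # First pass: scan with expat and abort on any DTD-related construct.
    scanner = expat.ParserCreate()
    scanner.StartDoctypeDeclHandler = _forbid
    scanner.EntityDeclHandler = _forbid
    scanner.ExternalEntityRefHandler = _forbid
    scanner.Parse(data, True)
    # Second pass: build the element tree only after the scan succeeded.
    return ET.fromstring(data)
```

Rejecting the whole document is stricter than silently ignoring entities, but it makes a hostile input fail loudly instead of producing a partially parsed tree.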

Archive Extraction Safeguards

  • Locations:
    • docling/backend/mets_gbs_backend.py (file extraction)
    • docling/datamodel/document.py (format detection)
  • Issue: Archives were extracted without any limits on size or member count.
  • Resolution: Introduced multi‑layered checks that enforce maximum archive size, cap the number of members, and verify each file’s size before extraction.
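
The layered checks can be illustrated with a short sketch, assuming the input is a tar archive held in memory. The limit values, the function name `read_members_safely`, and the error type `ArchiveLimitError` are all hypothetical and are not taken from this PR.

```python
import io
import tarfile

# Hypothetical limits, chosen only for illustration.
MAX_ARCHIVE_BYTES = 100 * 1024 * 1024
MAX_MEMBERS = 10_000
MAX_MEMBER_BYTES = 50 * 1024 * 1024


class ArchiveLimitError(ValueError):
    """Hypothetical error type: the archive exceeds a configured limit."""


def read_members_safely(
    blob: bytes,
    max_archive_bytes: int = MAX_ARCHIVE_BYTES,
    max_members: int = MAX_MEMBERS,
    max_member_bytes: int = MAX_MEMBER_BYTES,
) -> dict[str, bytes]:
    """Return the regular files of a tar archive, enforcing three limits."""
    # Layer 1: reject oversized archives before opening them.
    if len(blob) > max_archive_bytes:
        raise ArchiveLimitError("archive exceeds the maximum size")
    files: dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        for count, member in enumerate(tar, start=1):
            # Layer 2: cap the number of members (directories count too).
            if count > max_members:
                raise ArchiveLimitError("archive has too many members")
            if not member.isfile():
                continue
            # Layer 3: check the declared size before reading any data.
            if member.size > max_member_bytes:
                raise ArchiveLimitError(f"{member.name} is too large")
            handle = tar.extractfile(member)
            # Bounded read as a second guard on the per-file limit.
            data = handle.read(max_member_bytes + 1)
            if len(data) > max_member_bytes:
                raise ArchiveLimitError(f"{member.name} is too large")
            files[member.name] = data
    return files
```

Reading members into memory rather than calling `extractall` also avoids writing attacker-chosen paths to disk, although that is a separate concern from the size limits described above.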

Checklist

  • Documentation updated as needed.
  • Example usage added where relevant.
  • Tests covering the new safeguards added.

Signed-off-by: Cesar Berrospi Ramis <ceb@zurich.ibm.com>
ceberam self-assigned this Apr 21, 2026
ceberam added the bug label Apr 21, 2026
mergify Bot (Contributor) commented Apr 21, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@github-actions (Contributor)

DCO Check Passed

Thanks @ceberam, all your commits are properly signed off. 🎉

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 34.28571% with 23 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling/datamodel/document.py | 4.54% | 21 Missing ⚠️
docling/backend/mets_gbs_backend.py | 84.61% | 2 Missing ⚠️


ceberam changed the title from "fix: prevent XXE and decompression bomb in METS-GBS processing" to "fix: strengthen input validation for METS‑GBS processing" Apr 21, 2026