Skip to content

Python: raise ValueError for malformed data URIs#6916

Open
VectorPeak wants to merge 2 commits into
microsoft:mainfrom
VectorPeak:codex-core-malformed-data-uri-valueerror
Open

Python: raise ValueError for malformed data URIs#6916
VectorPeak wants to merge 2 commits into
microsoft:mainfrom
VectorPeak:codex-core-malformed-data-uri-valueerror

Conversation

@VectorPeak

@VectorPeak VectorPeak commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

### Motivation & Context

What Problem This Solves

detect_media_type_from_base64(data_uri=...) is a core helper that accepts a full data URI and then inspects the decoded bytes to infer common media types. For the data_uri path, the helper expected the incoming value to contain base64 data, but it indexed the split result before checking that the data URI had the expected structure and encoding marker:

data_str = data_uri.split(";base64,", 1)[1]

That made malformed or non-base64 data URI inputs fail with an implementation-level IndexError instead of the documented ValueError contract for invalid caller input. There are two related but different cases here:

"data:text/plain,hello"     # well-formed data URI, but not base64-encoded
"data:image/png;base64"     # malformed data URI, missing comma/data separator
"not-a-data-uri"            # malformed data URI, missing data: scheme

This PR makes those cases explicit: structurally malformed data URIs get a format error, while well-formed non-base64 data URIs get a base64-specific error consistent with the neighboring data URI helper.

Changes

This PR adds a narrow validation step before extracting the base64 payload from data_uri:

if not data_uri.startswith("data:") or "," not in data_uri:
    raise ValueError("Invalid data URI format.")
prefix, data_str = data_uri.split(",", 1)
if not prefix.endswith(";base64"):
    raise ValueError("Data URI must use base64 encoding.")

The valid path is intentionally unchanged: values that include a valid data:*;base64,<payload> shape still flow into the existing base64 decode and magic-byte detection logic. The test change adds test_detect_media_type_from_base64_rejects_malformed_data_uri beside the existing media-detection tests, using the file's existing @mark.parametrize style and per-case error expectations.

Evidence

Before this change, malformed or non-base64 data_uri values could reach the split/index expression and raise IndexError: list index out of range before media detection or base64 decoding had a chance to run.

After this change:

"data:text/plain,hello" -> ValueError("Data URI must use base64 encoding.")
"data:image/png;base64" -> ValueError("Invalid data URI format.")
"not-a-data-uri" -> ValueError("Invalid data URI format.")

Well-formed base64 data URIs still behave as before; for example, a valid but unknown byte payload can decode successfully and return None when the magic bytes do not match a known media type.

Local validation run for this PR:

git diff --check
uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q -k "detect_media_type_from_base64"
uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q
uv run --project python/packages/core ruff check python/packages/core/agent_framework/_types.py python/packages/core/tests/core/test_types.py

The focused regression tests and the full test_types.py file passed. Ruff reported All checks passed!. The pytest output included one existing ExperimentalWarning from the core harness file-access module, unrelated to this change.

Possible call chain / impact

caller provides data_uri
  -> detect_media_type_from_base64(data_uri=...)
  -> validate data URI shape
  -> validate base64 marker on data URI metadata
  -> split out payload after comma
  -> base64 decode
  -> magic-byte media detection

The practical impact is limited to invalid data_uri inputs on this helper. Valid base64 data URI handling, raw data_str handling, and data_bytes handling are unchanged. Sibling data-URI surfaces already have their own delimiter checks or validation paths, so this PR keeps the fix at the core helper boundary rather than changing provider-specific conversions.

Description & Review Guide

  • What are the major changes?

    • Add a guard in python/packages/core/agent_framework/_types.py so data_uri inputs without ;base64, raise ValueError("Invalid data URI format.").
    • Add focused unit coverage in python/packages/core/tests/core/test_types.py for malformed/non-base64 data URI inputs.
  • What is the impact of these changes?

    • Invalid data_uri inputs now fail with clearer ValueError messages: malformed URI shape reports Invalid data URI format., while well-formed non-base64 data URIs report Data URI must use base64 encoding..
    • Valid base64 data URI inputs, raw base64 strings, and byte inputs are unchanged.
    • This is not a breaking change.
  • What do you want reviewers to focus on?

    • Whether the split between malformed data URI shape and well-formed non-base64 data URI errors is the right boundary here.
    • Whether the malformed URI coverage should include additional RFC-style data URI variants beyond the current focused cases.

Related Issue

Fixes #6917

Duplicate/open PR risk: this overlaps with open PR #5114, which also adds a detect_media_type_from_base64 guard for missing ;base64,. This PR is only worth reviewing separately if maintainers prefer this broader malformed-input test coverage and the Invalid data URI format. error wording; otherwise this change should likely be folded into or superseded by #5114.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • All unit tests pass, and I have added new tests where possible
  • The PR follows the Contribution Guidelines
  • This PR is linked to an issue and there is no other open PR for this issue (see Related Issue above).
  • This is not a breaking change. If it is a breaking change, add the breaking change label (or add "[BREAKING]" to the title prefix, before or after any language prefix) - a workflow keeps the label and title prefix in sync automatically.

Validation run locally:

  • git diff --check
  • uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q -k "detect_media_type_from_base64"
  • uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q
  • uv run --project python/packages/core ruff check python/packages/core/agent_framework/_types.py python/packages/core/tests/core/test_types.py

Copilot AI review requested due to automatic review settings July 5, 2026 02:13
@giles17 giles17 added the python Usage: [Issues, PRs], Target: Python label Jul 5, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Python core’s detect_media_type_from_base64 behavior by making malformed / non-base64 data_uri inputs fail with an explicit ValueError instead of an indexing/split failure, and adds unit coverage for those cases.

Changes:

  • Add a guard in detect_media_type_from_base64 to detect missing ;base64, in data_uri and raise ValueError.
  • Add a parametrized unit test covering malformed/non-base64 data_uri inputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
python/packages/core/agent_framework/_types.py Adds data_uri format guarding before splitting on ;base64,.
python/packages/core/tests/core/test_types.py Adds targeted regression coverage for malformed/non-base64 data_uri inputs.

Comment on lines 125 to 127
if ";base64," not in data_uri:
raise ValueError("Invalid data URI format.")
data_str = data_uri.split(";base64,", 1)[1]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, addressed in 630243a. The helper now distinguishes malformed data URI shape from well-formed non-base64 data URIs: missing data:/comma raises ValueError("Invalid data URI format."), while a data URI without a ;base64 marker raises ValueError("Data URI must use base64 encoding."). The regression test now asserts the per-case messages.

detect_media_type_from_base64(data_str="data", data_uri="data:application/octet-stream;base64,AAA")


@pytest.mark.parametrize(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. I switched the new parametrized test to the existing file style with @mark.parametrize and raises(...).

@VectorPeak

Copy link
Copy Markdown
Contributor Author

Updated after review:

Validation rerun locally:

  • git diff --check
  • uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q -k detect_media_type_from_base64
  • uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q
  • uv run --project python/packages/core ruff check python/packages/core/agent_framework/_types.py python/packages/core/tests/core/test_types.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python Usage: [Issues, PRs], Target: Python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: malformed data_uri should not leak IndexError

3 participants