Python: raise ValueError for malformed data URIs#6916
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves Python core’s detect_media_type_from_base64 behavior by making malformed / non-base64 data_uri inputs fail with an explicit ValueError instead of an indexing/split failure, and adds unit coverage for those cases.
Changes:
- Add a guard in
detect_media_type_from_base64to detect missing;base64,indata_uriand raiseValueError. - Add a parametrized unit test covering malformed/non-base64
data_uriinputs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/packages/core/agent_framework/_types.py | Adds data_uri format guarding before splitting on ;base64,. |
| python/packages/core/tests/core/test_types.py | Adds targeted regression coverage for malformed/non-base64 data_uri inputs. |
| if ";base64," not in data_uri: | ||
| raise ValueError("Invalid data URI format.") | ||
| data_str = data_uri.split(";base64,", 1)[1] |
There was a problem hiding this comment.
Good point, addressed in 630243a. The helper now distinguishes malformed data URI shape from well-formed non-base64 data URIs: missing data:/comma raises ValueError("Invalid data URI format."), while a data URI without a ;base64 marker raises ValueError("Data URI must use base64 encoding."). The regression test now asserts the per-case messages.
| detect_media_type_from_base64(data_str="data", data_uri="data:application/octet-stream;base64,AAA") | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( |
There was a problem hiding this comment.
Addressed. I switched the new parametrized test to the existing file style with @mark.parametrize and raises(...).
|
Updated after review:
Validation rerun locally:
|
### Motivation & Context
What Problem This Solves
detect_media_type_from_base64(data_uri=...)is a core helper that accepts a full data URI and then inspects the decoded bytes to infer common media types. For thedata_uripath, the helper expected the incoming value to contain base64 data, but it indexed the split result before checking that the data URI had the expected structure and encoding marker:That made malformed or non-base64 data URI inputs fail with an implementation-level
IndexErrorinstead of the documentedValueErrorcontract for invalid caller input. There are two related but different cases here:This PR makes those cases explicit: structurally malformed data URIs get a format error, while well-formed non-base64 data URIs get a base64-specific error consistent with the neighboring data URI helper.
Changes
This PR adds a narrow validation step before extracting the base64 payload from
data_uri:The valid path is intentionally unchanged: values that include a valid
data:*;base64,<payload>shape still flow into the existing base64 decode and magic-byte detection logic. The test change addstest_detect_media_type_from_base64_rejects_malformed_data_uribeside the existing media-detection tests, using the file's existing@mark.parametrizestyle and per-case error expectations.Evidence
Before this change, malformed or non-base64
data_urivalues could reach the split/index expression and raiseIndexError: list index out of rangebefore media detection or base64 decoding had a chance to run.After this change:
Well-formed base64 data URIs still behave as before; for example, a valid but unknown byte payload can decode successfully and return
Nonewhen the magic bytes do not match a known media type.Local validation run for this PR:
The focused regression tests and the full
test_types.pyfile passed. Ruff reportedAll checks passed!. The pytest output included one existingExperimentalWarningfrom the core harness file-access module, unrelated to this change.Possible call chain / impact
The practical impact is limited to invalid
data_uriinputs on this helper. Valid base64 data URI handling, rawdata_strhandling, anddata_byteshandling are unchanged. Sibling data-URI surfaces already have their own delimiter checks or validation paths, so this PR keeps the fix at the core helper boundary rather than changing provider-specific conversions.Description & Review Guide
What are the major changes?
python/packages/core/agent_framework/_types.pysodata_uriinputs without;base64,raiseValueError("Invalid data URI format.").python/packages/core/tests/core/test_types.pyfor malformed/non-base64 data URI inputs.What is the impact of these changes?
data_uriinputs now fail with clearerValueErrormessages: malformed URI shape reportsInvalid data URI format., while well-formed non-base64 data URIs reportData URI must use base64 encoding..What do you want reviewers to focus on?
Related Issue
Fixes #6917
Duplicate/open PR risk: this overlaps with open PR #5114, which also adds a
detect_media_type_from_base64guard for missing;base64,. This PR is only worth reviewing separately if maintainers prefer this broader malformed-input test coverage and theInvalid data URI format.error wording; otherwise this change should likely be folded into or superseded by #5114.Contribution Checklist
breaking changelabel (or add "[BREAKING]" to the title prefix, before or after any language prefix) - a workflow keeps the label and title prefix in sync automatically.Validation run locally:
git diff --checkuv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -q -k "detect_media_type_from_base64"uv run --project python/packages/core pytest python/packages/core/tests/core/test_types.py -quv run --project python/packages/core ruff check python/packages/core/agent_framework/_types.py python/packages/core/tests/core/test_types.py