[v0.3.0] security — audit and document PyYAML safe-loader discipline#56
Conversation
📝 WalkthroughWalkthroughA new test ChangesYAML Safe-Loader Trust Boundary Test
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
📝 WalkthroughWalkthroughThis PR adds a regression test to enforce the YAML safety discipline for the repository. The test scans all Python sources to verify that no unsafe YAML loaders ( ChangesYAML safety regression test
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_synonyms.py (1)
97-109: NarrowLoader=detection to YAML-call context to avoid brittle false positives.
re.compile(r"\bLoader\s*=")will match unrelated code and can fail this test for non-YAML reasons. Consider coupling this check toyaml.load(usage (or dropping it, sinceyaml.loadis already blocked).As per coding guidelines, “`tests/**`: Focus review on meaningful assertions... and deterministic behavior...”.Suggested patch
- loader_override = re.compile(r"\bLoader\s*=") + yaml_load_with_loader = re.compile(r"\byaml[.]load\s*[(].*\bLoader\s*=", re.DOTALL) @@ - if loader_override.search(line) and "SafeLoader" not in line: - violations.append(f"{relative_path}:{line_number}: custom YAML Loader") + if yaml_load_with_loader.search(line) and "SafeLoader" not in line: + violations.append(f"{relative_path}:{line_number}: custom YAML Loader")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_synonyms.py` around lines 97 - 109, The Loader= regex (loader_override) is too broad and triggers false positives; narrow its scope by only flagging assignments that occur immediately in a yaml.load call context (i.e., when a line matches unsafe_load_call or when loader appears inside a yaml.load(...) argument), or remove the loader_override check entirely since unsafe_load_call/unsafe_loader_name already block yaml.load; update the logic in the loop that builds violations to check loader_override only when unsafe_load_call.search(line) or when safe_load_call.search(line) shows a YAML call, referencing the loader_override, unsafe_load_call, unsafe_loader_name, and safe_load_call symbols to locate and modify the conditional.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_synonyms.py`:
- Around line 85-87: The scan currently only builds src_root and includes "src/"
in the expected_yaml_input set; update the test to also include the tests
directory by creating a tests_root (or reuse repo_root) and adding "tests/"
entries to expected_yaml_input and the other scan set used around lines
referencing expected_yaml_input (also update the analogous set at the other
occurrence around lines 103-106); locate the variables repo_root, src_root and
expected_yaml_input and append the corresponding tests path (e.g., repo_root /
"tests" or str(tests_root) with the same glob pattern used for "src") so the
audit includes tests/**/*.py.
---
Nitpick comments:
In `@tests/test_synonyms.py`:
- Around line 97-109: The Loader= regex (loader_override) is too broad and
triggers false positives; narrow its scope by only flagging assignments that
occur immediately in a yaml.load call context (i.e., when a line matches
unsafe_load_call or when loader appears inside a yaml.load(...) argument), or
remove the loader_override check entirely since
unsafe_load_call/unsafe_loader_name already block yaml.load; update the logic in
the loop that builds violations to check loader_override only when
unsafe_load_call.search(line) or when safe_load_call.search(line) shows a YAML
call, referencing the loader_override, unsafe_load_call, unsafe_loader_name, and
safe_load_call symbols to locate and modify the conditional.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a13dc24f-ac07-4298-b241-cf7a761f49b7
⛔ Files ignored due to path filters (1)
docs/architecture/YAML-TRUST-BOUNDARY.mdis excluded by none and included by none
📒 Files selected for processing (1)
tests/test_synonyms.py
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/test_synonyms.py (3)
105-105: ⚡ Quick winPin
read_text(encoding="utf-8")for deterministic, environment-independent reads.
read_text()falls back to the locale default encoding, which can differ across CI/dev environments (e.g. cp1252 on Windows) and produce inconsistent results when scanning source files.♻️ Proposed fix
- for line_number, line in enumerate(source_path.read_text().splitlines(), 1): + for line_number, line in enumerate( + source_path.read_text(encoding="utf-8").splitlines(), 1 + ):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_synonyms.py` at line 105, The file reads source files with source_path.read_text() which uses the system locale; change the call to source_path.read_text(encoding="utf-8") to force UTF-8 decoding for deterministic, environment-independent test runs (locate the loop using source_path.read_text() in tests/test_synonyms.py and update that invocation).
86-121: ⚡ Quick winScan misses
tests/, which the trust boundary also covers.The PR acceptance criteria is "no occurrences of
yaml.load(insrc/ortests/", but the guard only walkssrc_root. An unsafe loader added undertests/would slip past this regression test.♻️ Extend the scan roots to include tests/
repo_root = Path(__file__).resolve().parents[1] - src_root = repo_root / "src" + scan_roots = (repo_root / "src", repo_root / "tests") @@ - for source_path in sorted(src_root.rglob("*.py")): + py_files = sorted( + p for root in scan_roots for p in root.rglob("*.py") + ) + for source_path in py_files: relative_path = source_path.relative_to(repo_root).as_posix()Note: keep the packaged-YAML enumeration (
yaml_inputs) scoped tosrc_rootso thesynonyms.yamlassertion is unaffected.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_synonyms.py` around lines 86 - 121, The test currently only scans src_root for unsafe yaml usages (variables: src_root, violations, safe_load_sites) so occurrences in tests/ are missed; modify the scanning to include the tests directory in the walk (e.g., create a tests_root = repo_root / "tests" and iterate over both roots or build a combined iterable of roots) when collecting violations and safe_load_sites, while keeping the packaged-YAML enumeration (yaml_inputs) unchanged and still only scanning src_root so the synonyms.yaml assertion remains scoped to source files.
106-111: ⚖️ Poor tradeoffLine-based scan can false-positive on comments/strings.
Because matching runs over raw lines, a benign mention of
yaml.load(orLoader =in a comment, docstring, or string literal insidesrc/would fail this lock-in test spuriously. Acceptable for now given the audited codebase, but an AST-based check (ast.walkforCall/Attributenodes) would be more robust if false alarms appear later.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_synonyms.py` around lines 106 - 111, The current line-based regex scan (using unsafe_load_call, unsafe_loader_name, loader_override, safe_load_call) can false-positive on comments/strings; replace the per-line checks with an AST-based scan: parse the file content with ast.parse and ast.walk to find ast.Call nodes whose function is either an ast.Attribute with value named "yaml" and attr "load" or a Name "load" imported from yaml, then inspect call.keywords/args for a "Loader" argument (and mark unsafe if Loader is not SafeLoader), collect safe sites when Loader is SafeLoader or safe_load is used, and detect loader override via ast.Assign where target id/name matches loader variables (e.g., assignments to "Loader" or custom loader classes). Update the code that currently appends to violations and safe_load_sites to use these AST checks instead of raw line regex matches.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/test_synonyms.py`:
- Line 105: The file reads source files with source_path.read_text() which uses
the system locale; change the call to source_path.read_text(encoding="utf-8") to
force UTF-8 decoding for deterministic, environment-independent test runs
(locate the loop using source_path.read_text() in tests/test_synonyms.py and
update that invocation).
- Around line 86-121: The test currently only scans src_root for unsafe yaml
usages (variables: src_root, violations, safe_load_sites) so occurrences in
tests/ are missed; modify the scanning to include the tests directory in the
walk (e.g., create a tests_root = repo_root / "tests" and iterate over both
roots or build a combined iterable of roots) when collecting violations and
safe_load_sites, while keeping the packaged-YAML enumeration (yaml_inputs)
unchanged and still only scanning src_root so the synonyms.yaml assertion
remains scoped to source files.
- Around line 106-111: The current line-based regex scan (using
unsafe_load_call, unsafe_loader_name, loader_override, safe_load_call) can
false-positive on comments/strings; replace the per-line checks with an
AST-based scan: parse the file content with ast.parse and ast.walk to find
ast.Call nodes whose function is either an ast.Attribute with value named "yaml"
and attr "load" or a Name "load" imported from yaml, then inspect
call.keywords/args for a "Loader" argument (and mark unsafe if Loader is not
SafeLoader), collect safe sites when Loader is SafeLoader or safe_load is used,
and detect loader override via ast.Assign where target id/name matches loader
variables (e.g., assignments to "Loader" or custom loader classes). Update the
code that currently appends to violations and safe_load_sites to use these AST
checks instead of raw line regex matches.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 1c5df3aa-4539-4ac0-aeca-79847735fd39
⛔ Files ignored due to path filters (1)
docs/architecture/YAML-TRUST-BOUNDARY.mdis excluded by none and included by none
📒 Files selected for processing (1)
tests/test_synonyms.py
|
Heimdall verification passed for PR #56 at Independent checkout: Protocol/context read:
Scope verification:
Canonical gate run locally: Issue-specific checks: GitHub checks at verification time:
CodeRabbit triage:
Label action: replacing |
Closes #48
Acceptance criteria
grep -rn 'yaml.load(' src/ tests/returns zero hits — command produced no output.grep -rn 'yaml.safe_load(' src/returns at least the two expected call sites — foundsrc/mcp_server_python_docs/server.py:57andsrc/mcp_server_python_docs/ingestion/sphinx_json.py:597.grep -rln '\.ya\?ml' src/mcp_server_python_docs/showsdata/synonyms.yamlis the only YAML data input loaded at runtime/ingestion — source-reference grep excluding generated__pycache__listsserver.py,ingestion/sphinx_json.py,retrieval/query.py, and__main__.py; only the first two parse YAML, both parse packageddata/synonyms.yamlwithsafe_load, whileretrieval/query.pyand__main__.pyonly mention/use already-loaded synonym data.find src/mcp_server_python_docs -type f \( -name '*.yaml' -o -name '*.yml' \) -printreturns onlysrc/mcp_server_python_docs/data/synonyms.yaml.tests/test_synonyms.py::test_yaml_loaded_only_via_safe_load, scanningsrc/andtests/for unsafe YAML loaders, confirming both source safe-load sites, and assertingsynonyms.yamlis the only package YAML file.docs/architecture/YAML-TRUST-BOUNDARY.md.SECURITY.mdwas not edited.Validation gate output
Issue-specific checks:
CodeRabbit review
CodeRabbit raised a valid initial finding that the regression test should scan
tests/as well assrc/; addressed inef7736e. A valid deterministic-read nit was also addressed inbffc37fwith explicit UTF-8 reads. Final CodeRabbit check is passing.Why this approach
The issue prescribed a read-only audit, regression test, and allowed architecture note. The codebase was already clean; this PR locks the discipline into tests and documents the trust boundary without changing ingestion behavior or
SECURITY.md.Why this triggered supervisor review
None.