Add pluggy-based YAML/JSON rule loader with file-based DQ rules support by saiboddu759 · Pull Request #300 · Nike-Inc/spark-expectations

saiboddu759 · 2026-03-03T20:58:48Z

Description

Introduce a pluggy-based rule loader plugin system that allows users to define DQ rules in YAML or JSON files instead of managing them in database tables with SQL INSERT statements. The loader supports the rules-list format with optional dq_env for multi-environment support.
All rules are normalised into the standard 17-column rules DataFrame schema.
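A minimal sketch of the normalisation idea, assuming a file layout with a shared `defaults` block and a `rules` list. The layout and every field name other than `rule_type` are illustrative assumptions; docs/user_guide/file_based_rules.md remains the reference for the real 17-column schema.

```python
import json
from typing import Any, Dict, List

# Hypothetical rule file content, used only for illustration.
SAMPLE = """
{
  "defaults": {"action_if_failed": "ignore", "is_active": true},
  "rules": [
    {"rule": "quantity_positive", "rule_type": "row_dq",
     "expectation": "quantity > 0"},
    {"rule": "sales_not_null", "rule_type": "row_dq",
     "expectation": "sales is not null", "action_if_failed": "drop"}
  ]
}
"""


def flatten(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Merge shared defaults into every rule; per-rule values win."""
    defaults = doc.get("defaults", {})
    rows = []
    for rule in doc.get("rules", []):
        row = {**defaults, **rule}
        # Reject rules that still have no rule_type after merging defaults.
        if not row.get("rule_type"):
            raise ValueError(f"rule {row.get('rule')!r} has no rule_type")
        rows.append(row)
    return rows


rows = flatten(json.loads(SAMPLE))
for row in rows:
    print(row["rule"], row["action_if_failed"])
```

Each resulting dict corresponds to one row of the rules DataFrame; the real loaders would additionally fill in the remaining columns and build the DataFrame with a SparkSession.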

What's included:

spark_expectations/rules/ -- pluggy plugin manager, hook specs, YAML loader, JSON loader, and shared flatten/normalisation logic
examples/resources/sample_rules.yaml and sample_rules.json -- sample rule files using dq_env format
examples/scripts/sample_dq_yaml_json.py -- example script demonstrating file-based rule loading
docs/user_guide/file_based_rules.md -- user guide covering the dq_env format, defaults, schema reference, and a full end-to-end example
pyproject.toml -- moved pyyaml>=6.0 from dev to runtime dependency, added pluggy>=1
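For readers unfamiliar with pluggy, the plugin manager, hook spec, and loader pieces listed above might fit together roughly as follows. The project name `demo_rules`, the hook name `load_rules`, and the `JsonLoader` class are hypothetical and are not taken from spark_expectations/rules/.

```python
import json
import os
import tempfile

import pluggy

PROJECT = "demo_rules"  # hypothetical project name
hookspec = pluggy.HookspecMarker(PROJECT)
hookimpl = pluggy.HookimplMarker(PROJECT)


class RuleLoaderSpec:
    """Hook specification: the contract every loader plugin implements."""

    @hookspec(firstresult=True)
    def load_rules(self, path):
        """Return a list of rule dicts for `path`, or None if unsupported."""


class JsonLoader:
    """Example plugin that only handles .json files."""

    @hookimpl
    def load_rules(self, path):
        if not path.endswith(".json"):
            return None  # let another plugin try
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)["rules"]


pm = pluggy.PluginManager(PROJECT)
pm.add_hookspecs(RuleLoaderSpec)
pm.register(JsonLoader())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rules.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"rules": [{"rule": "quantity_positive", "rule_type": "row_dq"}]}, fh)
    loaded = pm.hook.load_rules(path=path)
    # No registered plugin claims .txt, so firstresult yields None.
    unsupported = pm.hook.load_rules(path=os.path.join(tmp, "rules.txt"))

print(loaded, unsupported)
```

With `firstresult=True`, pluggy stops at the first plugin that returns a non-None value, so a custom loader can be added by registering another class with the same hook.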

Key design decisions:
Single format: Only the rules-list format (with optional dq_env) is supported, keeping the codebase simple and the user experience consistent
Case-insensitive dq_env lookup: Environment keys like DEV, dev, and Dev all match, so users don't need to worry about casing
Uppercase convention: DEV, QA, PROD is the standard convention used in all examples and docs
Pluggy architecture: Makes it easy for third parties to add custom loaders (e.g., from S3, APIs, databases) without modifying core code
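The case-insensitive dq_env lookup could be implemented along these lines. This is a sketch, not the code in this PR: the helper name `resolve_env` and the per-environment values are assumptions.

```python
from typing import Any, Dict


def resolve_env(dq_env: Dict[str, Any], env: str) -> Any:
    """Look up `env` in `dq_env` ignoring case; fail if absent or ambiguous."""
    wanted = env.casefold()
    matches = [key for key in dq_env if key.casefold() == wanted]
    if not matches:
        raise KeyError(f"environment {env!r} not found; available: {sorted(dq_env)}")
    if len(matches) > 1:
        # e.g. both "DEV" and "dev" defined in the same file
        raise ValueError(f"ambiguous environment keys: {matches}")
    return dq_env[matches[0]]


# Hypothetical per-environment overrides, keyed in the uppercase convention.
dq_env = {
    "DEV": {"table_name": "dev.orders"},
    "PROD": {"table_name": "prod.orders"},
}

print(resolve_env(dq_env, "dev"))
print(resolve_env(dq_env, "Prod"))
```

Raising on ambiguous keys is a design choice of this sketch; it avoids silently picking one of two entries that differ only in case.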

Related Issue

#219

Motivation and Context

Currently, setting up DQ rules requires creating a rules table in a database and inserting rules via SQL. This has several drawbacks:
Rules are not easily version-controlled or reviewable in pull requests
SQL INSERT statements are verbose and error-prone for large rule sets
There is no way to share rule definitions across environments (dev/staging/prod) via config files
Repeated boilerplate fields on every rule make the configuration hard to read
File-based rules (YAML/JSON) solve these problems by allowing teams to define rules in human-readable config files that can be checked into Git, reviewed in PRs, and shared across environments.

How Has This Been Tested?

Unit tests covering the rules-list format, edge cases, and error handling:

tests/unit/rules/plugins/test_flatten.py (662 lines -- 30+ test cases)
tests/unit/rules/plugins/test_yaml_loader.py (261 lines)
tests/unit/rules/plugins/test_json_loader.py (261 lines)
tests/unit/rules/test__init__.py (129 lines -- plugin manager and convenience functions)

All tests run with pytest using a local SparkSession

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.

Introduce a pluggy-based rule loader that supports loading DQ rules from YAML and JSON files in four formats: validations (multi-product), rules-list, hierarchical, and flat. Includes sample rule files, an example script, and tests with 98%+ coverage. Made-with: Cursor

Document how to define and load DQ rules from YAML/JSON files, covering all four authoring formats (validations, rules-list, hierarchical, flat), defaults, schema reference, and a full example. Made-with: Cursor

codecov · 2026-03-03T23:09:58Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.56%. Comparing base (eb6675f) to head (2403194).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
+ Coverage   98.47%   98.56%   +0.08%     
==========================================
  Files          27       32       +5     
  Lines        3348     3547     +199     
==========================================
+ Hits         3297     3496     +199     
  Misses         51       51


…e dq_env
- Remove support for validations, hierarchical, and flat rule formats
- Remove _detect_format and flatten_rules dispatch; loaders call flatten_rules_list directly
- Make dq_env lookup case-insensitive (DEV/dev/Dev all match)
- Standardize env keys to uppercase (DEV, QA, PROD) across examples, docs, and tests
- Fix mypy errors: _read_yaml/_read_json now return Dict[str, Any]
- Remove unreachable empty-rows guard for 100% coverage on _flatten.py
Made-with: Cursor

Only the dq_env format is supported; removed the non-dq_env simple format documentation to avoid confusion. Made-with: Cursor

- Add typed _cast_value to _flatten.py returning native bool/int/str instead of stringifying everything, matching the BooleanType/IntegerType Spark schema in the loaders
- Pass SparkSession through the plugin interface so callers can provide an explicit session
- Remove unused _to_str helper and its tests
- Update flatten tests to assert native types instead of stringified values
Made-with: Cursor
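A rough sketch of what typed casting to native bool/int/str can look like. The function signature, the `col_type` labels, and the choice to re-raise with a clearer message are assumptions for illustration; the accepted truthy strings follow the "true", "1", "yes" cases named in the test-coverage commit.

```python
from typing import Any

TRUE_STRINGS = {"true", "1", "yes"}


def cast_value(value: Any, col_type: str, default: Any = None) -> Any:
    """Cast a raw YAML/JSON value to a native type for a schema column."""
    if value is None:
        return default
    if col_type == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            return value.strip().lower() in TRUE_STRINGS
        return bool(value)  # e.g. 0 -> False, 1 -> True
    if col_type == "int":
        try:
            return int(value)
        except (ValueError, TypeError) as exc:
            raise ValueError(f"expected an integer, got {value!r}") from exc
    return str(value)


print(cast_value("yes", "bool"), cast_value("10", "int"), cast_value(5, "str"))
```

Returning native types keeps the rows compatible with a Spark schema that declares BooleanType and IntegerType columns, where stringified values would fail schema verification.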

- Fix _cast_value to catch ValueError/TypeError for non-numeric int columns
- Add top-level type validation in _read_yaml and _read_json for non-dict input
- Extract duplicated _col_type, rules_schema, rows_to_dataframe into _flatten.py
- Move import os to top of file in both loaders
- Update docs to show spark parameter and explain auto-resolve behavior
- Sync docstrings with actual dq_env format (uppercase DEV/PROD convention)
- Add tests for non-numeric error_drop_threshold and non-dict top-level files
Made-with: Cursor
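The top-level type validation can be pictured as below. This sketch parses a string rather than a file, and the name `read_json_text` and the error wording are hypothetical.

```python
import json
from typing import Any, Dict


def read_json_text(text: str) -> Dict[str, Any]:
    """Parse JSON and insist that the top-level value is an object."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError(
            f"top-level JSON value must be an object, got {type(data).__name__}"
        )
    return data


print(read_json_text('{"rules": []}'))
```

Failing early here gives a clear message for files whose top level is a list or scalar, instead of an attribute error later in the flattening step.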

The YAML and JSON example rules referenced columns that don't exist in order.csv (order_quantity, unit_price, total_amount, customer_name, product_name, customer_email, order_status, country_code). Updated rules to use the actual CSV columns (quantity, sales, discount, ship_mode, profit, etc.) so the example scripts run successfully. Made-with: Cursor

Cover previously uncovered branches: None value defaults for boolean, int, and string columns; string-to-boolean coercion ("true", "1", "yes"); and non-bool non-string to boolean coercion. Made-with: Cursor

Fix test_sample_rules_json_loads expected count after sample rules update: the sample_rules.json was trimmed from 19 to 16 rules in b9e6039 but the test assertion was not updated accordingly. Made-with: Cursor

Fix test_sample_rules_yaml_loads expected count after sample rules update. Made-with: Cursor

- Change query_dq_delimiter default from empty string to None in _flatten.py to prevent ValueError from split('') in reader.py for query_dq rules
- Validate resolved rule_type is non-empty after merging defaults, giving a clear error instead of silently accepting empty rule_type
- Fix docs stating query_dq_delimiter default is dollar sign when actual default is at-sign
- Add test for missing rule_type validation
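The crash behind the first item comes from Python itself: `str.split` rejects an empty separator. The sketch below reproduces that and shows one way a None default can fall back to the at-sign. The `split_query` helper and the sample string are hypothetical and do not mirror reader.py.

```python
from typing import List, Optional

DEFAULT_DELIMITER = "@"  # the at-sign default named in the docs fix


def split_query(expectation: str, delimiter: Optional[str] = None) -> List[str]:
    """Split a query_dq expectation, using the default when delimiter is None."""
    sep = DEFAULT_DELIMITER if delimiter is None else delimiter
    return expectation.split(sep)


# An empty-string delimiter reaches str.split('') and raises ValueError.
try:
    split_query("q1@q2", "")
    error = None
except ValueError as exc:
    error = str(exc)

parts = split_query("q1@q2")  # None falls back to "@"
print(error, parts)
```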

Fix query_dq_delimiter crash, enforce rule_type, and correct docs

Remove advertised.listeners override and sleep delays between Zookeeper/Kafka starts, which are not needed in the container environment.

asingamaneni

LGTM, thank you!

saiboddu759 added 2 commits March 3, 2026 12:37

Add user guide for file-based rules (YAML & JSON)

bed4dda

saiboddu759 requested a review from a team as a code owner March 3, 2026 20:58

saiboddu759 added 3 commits March 5, 2026 11:58

Remove simple format section from file-based rules docs

7dd5158

Merge branch 'main' into main

99a8803

smirkovic requested a review from a team March 10, 2026 03:05

saiboddu759 added 9 commits March 10, 2026 10:04

Add test coverage for _cast_value in _flatten.py

ebd0717

Fix test_sample_rules_json_loads expected count after sample rules update

427adb0

Fix test_sample_rules_yaml_loads expected count after sample rules update

66676cc

Merge pull request #7 from saiboddu759/feature/yaml_json_rules2

4d5f1c5

Simplify Kafka startup script by removing unnecessary config and delays

2403194

asingamaneni approved these changes Mar 12, 2026

asingamaneni merged commit 22d5ef8 into Nike-Inc:main Mar 12, 2026
6 checks passed

asingamaneni mentioned this pull request Apr 4, 2026

[FEATURE] Warn on duplicate rule names in YAML/JSON rule files asingamaneni/spark-expectations#7

Open

AmaniG77 mentioned this pull request Apr 16, 2026

[IMPROVEMENT] File-based DQ Rules Support: Pluggy-based YAML/JSON Rule Loader #307

Closed

AmaniG77 linked an issue Apr 16, 2026 that may be closed by this pull request

[IMPROVEMENT] File-based DQ Rules Support: Pluggy-based YAML/JSON Rule Loader #307

Closed

asingamaneni merged 14 commits into Nike-Inc:main from saiboddu759:main


2 participants
