Add pluggy-based YAML/JSON rule loader with file-based DQ rules support #300

Merged: asingamaneni merged 14 commits into Nike-Inc:main from saiboddu759:main on Mar 12, 2026

saiboddu759 (Contributor) commented Mar 3, 2026

Description

Introduce a pluggy-based rule loader plugin system that allows users to define DQ rules in YAML or JSON files instead of managing them in database tables with SQL INSERT statements. The loader supports the rules-list format with optional dq_env for multi-environment support.
All rules are normalised into the standard 17-column rules DataFrame schema.

What's included:

  • spark_expectations/rules/ -- pluggy plugin manager, hook specs, YAML loader, JSON loader, and shared flatten/normalisation logic
  • examples/resources/sample_rules.yaml and sample_rules.json -- sample rule files using dq_env format
  • examples/scripts/sample_dq_yaml_json.py -- example script demonstrating file-based rule loading
  • docs/user_guide/file_based_rules.md -- comprehensive user guide covering both dq_env and simple formats, defaults, schema reference, and a full end-to-end example
  • pyproject.toml -- moved pyyaml>=6.0 from dev to runtime dependency, added pluggy>=1
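To make the plugin-manager and hook-spec pieces listed above more concrete, here is a minimal sketch of how a pluggy-based loader could be wired up. It is not the code from this PR: the project name `rule_loader_demo`, the hook name `load_rules`, and the `JsonRuleLoader` class are hypothetical. Only the pluggy calls themselves (`HookspecMarker`, `HookimplMarker`, `PluginManager`) are real API.

```python
import json
from typing import Any, Dict, List, Optional

import pluggy

# Hypothetical project name; the marker name used in the PR is not shown here.
PROJECT = "rule_loader_demo"
hookspec = pluggy.HookspecMarker(PROJECT)
hookimpl = pluggy.HookimplMarker(PROJECT)


class RuleLoaderSpec:
    """Hook specification that every loader plugin implements."""

    @hookspec(firstresult=True)
    def load_rules(self, path: str, text: str) -> Optional[List[Dict[str, Any]]]:
        """Return parsed rules, or None if this plugin does not handle `path`."""


class JsonRuleLoader:
    """Illustrative plugin that only handles .json files."""

    @hookimpl
    def load_rules(self, path: str, text: str) -> Optional[List[Dict[str, Any]]]:
        if not path.lower().endswith(".json"):
            return None  # decline, so another registered plugin can try
        return json.loads(text)["rules"]


def build_manager() -> pluggy.PluginManager:
    """Create a plugin manager and register the built-in loader."""
    pm = pluggy.PluginManager(PROJECT)
    pm.add_hookspecs(RuleLoaderSpec)
    pm.register(JsonRuleLoader())
    return pm


pm = build_manager()
# Hooks must be called with keyword arguments.
rules = pm.hook.load_rules(path="rules.json", text='{"rules": [{"rule": "id_not_null"}]}')
unhandled = pm.hook.load_rules(path="rules.toml", text="")
```

With `firstresult=True`, the hook call returns the first non-None result, so a third-party loader could be added with one more `pm.register(...)` call and no change to the existing classes.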

Key design decisions:

  • Single format: Only the rules-list format (with optional dq_env) is supported, keeping the codebase simple and the user experience consistent
  • Case-insensitive dq_env lookup: Environment keys like DEV, dev, and Dev all match, so users don't need to worry about casing
  • Uppercase convention: DEV, QA, PROD is the standard convention used in all examples and docs
  • Pluggy architecture: Makes it easy for third parties to add custom loaders (e.g., from S3, APIs, databases) without modifying core code
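The case-insensitive dq_env lookup can be sketched roughly as follows. The file layout (a top-level `dq_env` mapping plus a `rules` list) and the function names `resolve_env` and `flatten` are assumptions made for illustration; the real schema is described in docs/user_guide/file_based_rules.md and may differ.

```python
from typing import Any, Dict, List


def resolve_env(dq_env: Dict[str, Dict[str, Any]], name: str) -> Dict[str, Any]:
    """Look up an environment block, ignoring the case of the key."""
    wanted = name.casefold()
    for key, block in dq_env.items():
        if key.casefold() == wanted:
            return block
    raise KeyError(f"dq_env has no entry for {name!r}; available: {sorted(dq_env)}")


def flatten(doc: Dict[str, Any], env: str) -> List[Dict[str, Any]]:
    """Merge the selected environment's fields into every rule (rule values win)."""
    env_fields = resolve_env(doc["dq_env"], env) if "dq_env" in doc else {}
    return [{**env_fields, **rule} for rule in doc["rules"]]


# Hypothetical rule document using the uppercase DEV/PROD convention.
doc = {
    "dq_env": {
        "DEV": {"product_id": "demo", "table_name": "dev.orders"},
        "PROD": {"product_id": "demo", "table_name": "prod.orders"},
    },
    "rules": [
        {"rule": "quantity_positive", "rule_type": "row_dq", "expectation": "quantity > 0"},
    ],
}

dev_rows = flatten(doc, "dev")    # lowercase still matches the "DEV" key
prod_rows = flatten(doc, "Prod")  # mixed case matches "PROD"
```

Comparing with `casefold()` on both sides keeps the stored keys untouched, so the uppercase convention in the files is preserved while lookups stay forgiving.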

Related Issue

#219

Motivation and Context

Currently, setting up DQ rules requires creating a rules table in a database and inserting rules via SQL. This has several drawbacks:

  • Rules are not easily version-controlled or reviewable in pull requests
  • SQL INSERT statements are verbose and error-prone for large rule sets
  • There is no way to share rule definitions across environments (dev/staging/prod) via config files
  • Repeated boilerplate fields on every rule make the configuration hard to read

File-based rules (YAML/JSON) solve these problems by allowing teams to define rules in human-readable config files that can be checked into Git, reviewed in PRs, and shared across environments.

How Has This Been Tested?

Unit tests covering all four input formats, edge cases, error handling, and format auto-detection:

  • tests/unit/rules/plugins/test_flatten.py (662 lines -- 30+ test cases)
  • tests/unit/rules/plugins/test_yaml_loader.py (261 lines)
  • tests/unit/rules/plugins/test_json_loader.py (261 lines)
  • tests/unit/rules/test__init__.py (129 lines -- plugin manager and convenience functions)

All tests run with pytest using a local SparkSession.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Introduce a pluggy-based rule loader that supports loading DQ rules from
YAML and JSON files in four formats: validations (multi-product),
rules-list, hierarchical, and flat. Includes sample rule files, an
example script, and tests with 98%+ coverage.

Made-with: Cursor

Document how to define and load DQ rules from YAML/JSON files,
covering all four authoring formats (validations, rules-list,
hierarchical, flat), defaults, schema reference, and a full example.

Made-with: Cursor

saiboddu759 requested a review from a team as a code owner on March 3, 2026 20:58

codecov Bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.56%. Comparing base (eb6675f) to head (2403194).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
+ Coverage   98.47%   98.56%   +0.08%     
==========================================
  Files          27       32       +5     
  Lines        3348     3547     +199     
==========================================
+ Hits         3297     3496     +199     
  Misses         51       51              

…e dq_env

- Remove support for validations, hierarchical, and flat rule formats
- Remove _detect_format and flatten_rules dispatch; loaders call flatten_rules_list directly
- Make dq_env lookup case-insensitive (DEV/dev/Dev all match)
- Standardize env keys to uppercase (DEV, QA, PROD) across examples, docs, and tests
- Fix mypy errors: _read_yaml/_read_json now return Dict[str, Any]
- Remove unreachable empty-rows guard for 100% coverage on _flatten.py

Made-with: Cursor

Only the dq_env format is supported; removed the non-dq_env
simple format documentation to avoid confusion.

Made-with: Cursor

smirkovic requested a review from a team on March 10, 2026 03:05

- Add typed _cast_value to _flatten.py returning native bool/int/str
  instead of stringifying everything, matching the BooleanType/IntegerType
  Spark schema in the loaders
- Pass SparkSession through the plugin interface so callers can provide
  an explicit session
- Remove unused _to_str helper and its tests
- Update flatten tests to assert native types instead of stringified values

Made-with: Cursor
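A rough sketch of what typed casting of this kind could look like is shown below. It is not the PR's `_cast_value`: the signature, the `col_type` strings, the `default` parameter, and the choice to re-raise a `ValueError` for non-numeric integers are all assumptions. The truthy strings "true", "1", and "yes" are the ones named elsewhere in this conversation.

```python
from typing import Any

# Strings treated as True for boolean columns (named in this conversation).
TRUTHY = {"true", "1", "yes"}


def cast_value(value: Any, col_type: str, default: Any = None) -> Any:
    """Return `value` as a native bool/int/str instead of stringifying it.

    `col_type` is one of "bool", "int" or "str" (hypothetical labels).
    """
    if value is None:
        return default  # the PR's actual per-column defaults are not shown here
    if col_type == "bool":
        if isinstance(value, bool):
            return value
        if isinstance(value, str):
            return value.strip().lower() in TRUTHY
        return bool(value)  # e.g. 0 -> False, 1 -> True
    if col_type == "int":
        try:
            return int(value)
        except (ValueError, TypeError) as exc:
            # Assumption: surface a clearer message rather than a bare int() error.
            raise ValueError(f"expected an integer, got {value!r}") from exc
    return str(value)
```

Returning native types means the rows can be matched against BooleanType and IntegerType columns in the Spark schema without a second conversion pass.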
- Fix _cast_value to catch ValueError/TypeError for non-numeric int columns
- Add top-level type validation in _read_yaml and _read_json for non-dict input
- Extract duplicated _col_type, rules_schema, rows_to_dataframe into _flatten.py
- Move import os to top of file in both loaders
- Update docs to show spark parameter and explain auto-resolve behavior
- Sync docstrings with actual dq_env format (uppercase DEV/PROD convention)
- Add tests for non-numeric error_drop_threshold and non-dict top-level files

Made-with: Cursor
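The top-level type validation mentioned above might look something like this for the JSON case. The function name `read_json_rules` and the choice of `ValueError` are assumptions; the PR's `_read_json` may raise a different exception type or use different wording.

```python
import json
from typing import Any, Dict


def read_json_rules(text: str) -> Dict[str, Any]:
    """Parse a JSON rule file and reject anything that is not a top-level object."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError(
            "rule file must contain a JSON object at the top level, "
            f"got {type(data).__name__}"
        )
    return data


ok = read_json_rules('{"rules": []}')
```

Checking the type right after parsing gives a clear error for a file that contains, say, a bare list, instead of a confusing failure later when the loader looks up keys.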
The YAML and JSON example rules referenced columns that don't exist
in order.csv (order_quantity, unit_price, total_amount, customer_name,
product_name, customer_email, order_status, country_code). Updated
rules to use the actual CSV columns (quantity, sales, discount,
ship_mode, profit, etc.) so the example scripts run successfully.

Made-with: Cursor

Cover previously uncovered branches: None value defaults for boolean,
int, and string columns; string-to-boolean coercion ("true", "1",
"yes"); and non-bool non-string to boolean coercion.

Made-with: Cursor

…date

The sample_rules.json was trimmed from 19 to 16 rules in b9e6039 but
the test assertion was not updated accordingly.

Made-with: Cursor

- Change query_dq_delimiter default from empty string to None in
  _flatten.py to prevent ValueError from split('') in reader.py
  for query_dq rules
- Validate resolved rule_type is non-empty after merging defaults,
  giving a clear error instead of silently accepting empty rule_type
- Fix docs stating query_dq_delimiter default is dollar sign when
  actual default is at-sign
- Add test for missing rule_type validation
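The crash described in the first bullet comes from Python itself: `str.split` raises `ValueError` when given an empty separator. The sketch below reproduces that and shows one way a reader could guard against it. The helper `split_query_dq` and its fallback to the at-sign are assumptions for illustration, not the code in reader.py.

```python
from typing import List, Optional

DEFAULT_DELIMITER = "@"  # at-sign default, per the commit message above


def split_query_dq(expectation: str, delimiter: Optional[str] = None) -> List[str]:
    """Split a query_dq expectation, falling back to the default delimiter."""
    sep = delimiter or DEFAULT_DELIMITER  # treats None and "" the same way
    return [part.strip() for part in expectation.split(sep)]


# An empty-string delimiter passed straight to split() fails:
try:
    "a@b".split("")
    empty_split_failed = False
except ValueError:
    empty_split_failed = True

parts_default = split_query_dq("select 1 @ select 2")
parts_custom = split_query_dq("select 1 $ select 2", "$")
```

Using None rather than an empty string as the "not set" value lets the consumer tell an unset delimiter apart from a real one before calling `split`.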
Fix query_dq_delimiter crash, enforce rule_type, and correct docs

Remove advertised.listeners override and sleep delays between Zookeeper/Kafka
starts, which are not needed in the container environment.

asingamaneni (Collaborator) left a comment

LGTM, thank you!

Linked issue: [IMPROVEMENT] File-based DQ Rules Support: Pluggy-based YAML/JSON Rule Loader