Add pluggy-based YAML/JSON rule loader with file-based DQ rules support#300
Merged
asingamaneni merged 14 commits into Nike-Inc:main on Mar 12, 2026
Conversation
Introduce a pluggy-based rule loader that supports loading DQ rules from YAML and JSON files in four formats: validations (multi-product), rules-list, hierarchical, and flat. Includes sample rule files, an example script, and tests with 98%+ coverage. Made-with: Cursor
Document how to define and load DQ rules from YAML/JSON files, covering all four authoring formats (validations, rules-list, hierarchical, flat), defaults, schema reference, and a full example. Made-with: Cursor
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #300 +/- ##
==========================================
+ Coverage 98.47% 98.56% +0.08%
==========================================
Files 27 32 +5
Lines 3348 3547 +199
==========================================
+ Hits 3297 3496 +199
Misses 51 51
☔ View full report in Codecov by Sentry.
…e dq_env
- Remove support for validations, hierarchical, and flat rule formats
- Remove _detect_format and flatten_rules dispatch; loaders call flatten_rules_list directly
- Make dq_env lookup case-insensitive (DEV/dev/Dev all match)
- Standardize env keys to uppercase (DEV, QA, PROD) across examples, docs, and tests
- Fix mypy errors: _read_yaml/_read_json now return Dict[str, Any]
- Remove unreachable empty-rows guard for 100% coverage on _flatten.py
Made-with: Cursor
Only the dq_env format is supported; removed the non-dq_env simple format documentation to avoid confusion. Made-with: Cursor
- Add typed _cast_value to _flatten.py returning native bool/int/str instead of stringifying everything, matching the BooleanType/IntegerType Spark schema in the loaders
- Pass SparkSession through the plugin interface so callers can provide an explicit session
- Remove unused _to_str helper and its tests
- Update flatten tests to assert native types instead of stringified values
Made-with: Cursor
- Fix _cast_value to catch ValueError/TypeError for non-numeric int columns
- Add top-level type validation in _read_yaml and _read_json for non-dict input
- Extract duplicated _col_type, rules_schema, rows_to_dataframe into _flatten.py
- Move import os to top of file in both loaders
- Update docs to show the spark parameter and explain auto-resolve behavior
- Sync docstrings with the actual dq_env format (uppercase DEV/PROD convention)
- Add tests for non-numeric error_drop_threshold and non-dict top-level files
Made-with: Cursor
The YAML and JSON example rules referenced columns that don't exist in order.csv (order_quantity, unit_price, total_amount, customer_name, product_name, customer_email, order_status, country_code). Updated rules to use the actual CSV columns (quantity, sales, discount, ship_mode, profit, etc.) so the example scripts run successfully. Made-with: Cursor
Cover previously uncovered branches: None value defaults for boolean,
int, and string columns; string-to-boolean coercion ("true", "1",
"yes"); and non-bool non-string to boolean coercion.
Made-with: Cursor
…date
The sample_rules.json was trimmed from 19 to 16 rules in b9e6039 but the test assertion was not updated accordingly.
Made-with: Cursor
…date Made-with: Cursor
Fix query_dq_delimiter crash, enforce rule_type, and correct docs
- Change query_dq_delimiter default from empty string to None in _flatten.py to prevent ValueError from split('') in reader.py for query_dq rules
- Validate that the resolved rule_type is non-empty after merging defaults, giving a clear error instead of silently accepting an empty rule_type
- Fix docs stating the query_dq_delimiter default is a dollar sign when the actual default is an at-sign
- Add test for missing rule_type validation
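The crash the commit refers to is easy to reproduce: Python's `str.split` raises `ValueError` when given an empty separator, so an empty-string default for `query_dq_delimiter` would fail as soon as the query was split. A hypothetical helper illustrating why a `None` default avoids the problem (the function name is illustrative, not from the repository):

```python
def split_query(expectation: str, delimiter):
    # Hypothetical sketch: only call str.split when a real delimiter is
    # configured. "a@b".split("") raises ValueError: empty separator,
    # which is the crash fixed by defaulting the delimiter to None.
    if delimiter is None:
        return [expectation]
    return expectation.split(delimiter)
```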
Remove advertised.listeners override and sleep delays between Zookeeper/Kafka starts, which are not needed in the container environment.
Description
Introduce a pluggy-based rule loader plugin system that allows users to define DQ rules in YAML or JSON files instead of managing them in database tables with SQL INSERT statements. The loader supports the rules-list format with optional dq_env for multi-environment support.
All rules are normalised into the standard 17-column rules DataFrame schema.
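As an illustration of that normalisation step, merging file-level defaults into each rule and resolving per-environment overrides might look like the sketch below. The function name `flatten_rules_list` appears in the commit messages above, but the keys, signature, and exact merge order here are assumptions for illustration, not the library's API:

```python
from typing import Any, Dict, List


def flatten_rules_list(doc: Dict[str, Any], dq_env: str = "DEV") -> List[Dict[str, Any]]:
    """Hypothetical sketch: merge file-level defaults into each rule and
    apply per-environment overrides with a case-insensitive env match."""
    defaults = doc.get("defaults", {})
    rows = []
    for rule in doc.get("rules", []):
        merged = {**defaults, **rule}          # rule fields win over defaults
        env_overrides = merged.pop("dq_env", {})
        for env_key, overrides in env_overrides.items():
            if env_key.upper() == dq_env.upper():  # DEV/dev/Dev all match
                merged.update(overrides)
        rows.append(merged)
    return rows


doc = {
    "defaults": {"product_id": "orders", "enable_for_source_dq_validation": True},
    "rules": [
        {"rule": "quantity_positive", "expectation": "quantity > 0",
         "dq_env": {"DEV": {"action_if_failed": "ignore"},
                    "PROD": {"action_if_failed": "fail"}}},
    ],
}
rows = flatten_rules_list(doc, dq_env="prod")
```

The flattened dicts can then be turned into the 17-column rules DataFrame with a fixed schema and `spark.createDataFrame`.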
What's included:
Key design decisions:
Single format: Only the rules-list format (with optional dq_env) is supported, keeping the codebase simple and the user experience consistent
Case-insensitive dq_env lookup: Environment keys like DEV, dev, and Dev all match, so users don't need to worry about casing
Uppercase convention: DEV, QA, PROD is the standard convention used in all examples and docs
Pluggy architecture: Makes it easy for third parties to add custom loaders (e.g., from S3, APIs, databases) without modifying core code
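The pluggy design decision can be sketched with pluggy's standard hookspec/hookimpl pattern. Everything below (the project name "dq_rules", the hook name `load_rules`, and the JSON loader) is illustrative of how a third-party loader could plug in, not the plugin interface this PR actually defines:

```python
import json

import pluggy

# Illustrative project name; the real plugin namespace may differ.
hookspec = pluggy.HookspecMarker("dq_rules")
hookimpl = pluggy.HookimplMarker("dq_rules")


class RuleLoaderSpec:
    @hookspec
    def load_rules(self, path: str):
        """Return a list of rule dicts loaded from `path`."""


class JsonRuleLoader:
    @hookimpl
    def load_rules(self, path: str):
        # A third-party loader only needs to implement the hook;
        # core code never has to know about JSON, S3, APIs, etc.
        with open(path) as f:
            return json.load(f).get("rules", [])


pm = pluggy.PluginManager("dq_rules")
pm.add_hookspecs(RuleLoaderSpec)
pm.register(JsonRuleLoader())
```

Calling `pm.hook.load_rules(path=...)` returns one result per registered plugin, so additional loaders (S3, database-backed, etc.) can be registered side by side without modifying core code.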
Related Issue
#219
Motivation and Context
Currently, setting up DQ rules requires creating a rules table in a database and inserting rules via SQL. This has several drawbacks:
Rules are not easily version-controlled or reviewable in pull requests
SQL INSERT statements are verbose and error-prone for large rule sets
There is no way to share rule definitions across environments (dev/staging/prod) via config files
Repeated boilerplate fields on every rule make the configuration hard to read
File-based rules (YAML/JSON) solve these problems by allowing teams to define rules in human-readable config files that can be checked into Git, reviewed in PRs, and shared across environments.
How Has This Been Tested?
Unit tests covering the rules-list input format (with and without dq_env), edge cases, and error handling:
All tests run with pytest using a local SparkSession
Screenshots (if appropriate):
Types of changes
Checklist: