Is your feature request related to a problem? Please describe.
When loading rules from a YAML or JSON file via the pluggy-based rule loader (introduced in PR Nike-Inc#300), duplicate rule names within the same file are silently accepted. Both entries are loaded into the Spark DataFrame, which causes duplicate DQ checks to be executed against the same data, double-counting in statistics (error counts, output counts), and confusing reports where the same rule name appears multiple times. There is no warning or validation to alert users to this misconfiguration.
Describe the solution you'd like
Add a _warn_duplicate_rule_names() helper to spark_expectations/rules/plugins/_flatten.py, called at the end of flatten_rules_list(). It should log a WARNING for every rule name that appears more than once, including the count. The warning should be non-breaking -- all rows are still returned so existing pipelines continue to function while users are alerted to the issue.
Example warning output:
WARNING - Duplicate rule name detected: 'col1_not_null' appears 2 times in the rules file.
This will cause duplicate DQ checks and double-counting in statistics.
Please ensure every rule has a unique name.
Describe alternatives you've considered
- Raising an error on duplicate rule names -- rejected because it would be a breaking change for existing pipelines that may have unintentional duplicates.
- Automatically deduplicating rules -- rejected because silently dropping rules could cause unexpected behavior and data quality gaps.
- Adding validation at the YAML/JSON loader level before flattening -- viable, but less centralized since flatten_rules_list() is the single point all rule formats pass through.
Additional context
A full implementation with 7 unit tests is available in PR #6 on this fork. The tests cover: no duplicates (no warning), single duplicate, multiple duplicates, correct count reporting, empty input, and integration via flatten_rules_list().
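For illustration, the listed test cases could be exercised along these lines. This is a self-contained sketch with a stand-in for the proposed helper and a local log-capturing harness, not the fork's actual test code:

```python
import logging
from collections import Counter

logger = logging.getLogger("dup_rule_demo")


def _warn_duplicate_rule_names(rules):
    # Stand-in for the proposed helper; the real implementation lives in the PR.
    for name, count in Counter(r.get("rule") for r in rules).items():
        if name is not None and count > 1:
            logger.warning("Duplicate rule name detected: '%s' appears %d times", name, count)


class _Capture(logging.Handler):
    """Collects formatted log messages so assertions can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def warnings_for(rules):
    # Run the helper with a temporary handler attached and return the warnings.
    cap = _Capture()
    logger.addHandler(cap)
    logger.setLevel(logging.WARNING)
    try:
        _warn_duplicate_rule_names(rules)
    finally:
        logger.removeHandler(cap)
    return cap.messages


# no duplicates -> no warning
assert warnings_for([{"rule": "a"}, {"rule": "b"}]) == []
# single duplicate -> one warning with the correct count
assert "2 times" in warnings_for([{"rule": "a"}, {"rule": "a"}])[0]
# empty input -> no warning
assert warnings_for([]) == []
```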