[FEATURE] Warn on duplicate rule names in YAML/JSON rule files #7

@asingamaneni

Is your feature request related to a problem? Please describe.
When rules are loaded from a YAML or JSON file via the pluggy-based rule loader (introduced in PR Nike-Inc#300), duplicate rule names within the same file are silently accepted. Both entries are loaded into the Spark DataFrame, which causes duplicate DQ checks to be executed against the same data, double-counting in statistics (error counts, output counts), and confusing reports in which the same rule name appears multiple times. No warning or validation alerts users to this misconfiguration.
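To make the problem concrete, here is a small illustration in plain Python. The JSON layout and the field names `rule` and `expectation` are assumptions made for this example and are not taken from the actual loader schema. It only shows that a parsed file keeps both entries, so anything tallied per rule name counts the same check twice.

```python
import json
from collections import Counter

# Hypothetical rules file content; the field names are illustrative only
# and do not reflect the real loader schema.
RULES_JSON = """
[
  {"rule": "col1_not_null", "expectation": "col1 is not null"},
  {"rule": "col1_not_null", "expectation": "col1 is not null"},
  {"rule": "col2_positive", "expectation": "col2 > 0"}
]
"""

rules = json.loads(RULES_JSON)

# Both "col1_not_null" entries survive parsing, so a per-name tally
# sees the same check twice.
checks_per_name = Counter(entry["rule"] for entry in rules)

print(len(rules))
print(checks_per_name["col1_not_null"])
```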

Describe the solution you'd like
Add a _warn_duplicate_rule_names() helper to spark_expectations/rules/plugins/_flatten.py, called at the end of flatten_rules_list(). It should log a WARNING for every rule name that appears more than once, including the count. The warning should be non-breaking -- all rows are still returned so existing pipelines continue to function while users are alerted to the issue.

Example warning output:

WARNING - Duplicate rule name detected: 'col1_not_null' appears 2 times in the rules file.
This will cause duplicate DQ checks and double-counting in statistics.
Please ensure every rule has a unique name.

Describe alternatives you've considered

  1. Raising an error on duplicate rule names -- rejected because it would be a breaking change for existing pipelines that may have unintentional duplicates.
  2. Automatically deduplicating rules -- rejected because silently dropping rules could cause unexpected behavior and data quality gaps.
  3. Adding validation at the YAML/JSON loader level before flattening -- viable but less centralized since flatten_rules_list() is the single point all rule formats pass through.

Additional context
A full implementation with 7 unit tests is available in PR #6 on this fork. The tests cover: no duplicates (no warning), single duplicate, multiple duplicates, correct count reporting, empty input, and integration via flatten_rules_list().
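For readers who want a feel for these cases, the following is a hedged sketch of three of them using the standard library's `unittest`. It does not reproduce the tests in the PR. The compact helper, its signature, the logger name, and the `rule` key are stand-ins assumed for this example.

```python
import logging
import unittest
from collections import Counter

# Hypothetical logger name used only by this sketch.
logger = logging.getLogger("rules_flatten_sketch")


def _warn_duplicate_rule_names(rows, name_key="rule"):
    """Compact stand-in for the helper under test (assumed signature)."""
    for name, count in Counter(row[name_key] for row in rows).items():
        if count > 1:
            logger.warning(
                "Duplicate rule name detected: '%s' appears %d times", name, count
            )


class DuplicateRuleNameTests(unittest.TestCase):
    def test_single_duplicate_reports_count(self):
        rows = [{"rule": "col1_not_null"}, {"rule": "col1_not_null"}]
        with self.assertLogs(logger, level="WARNING") as captured:
            _warn_duplicate_rule_names(rows)
        self.assertEqual(len(captured.records), 1)
        self.assertIn("'col1_not_null' appears 2 times", captured.output[0])

    def test_no_duplicates_emits_no_warning(self):
        rows = [{"rule": "a"}, {"rule": "b"}]
        # assertLogs fails when nothing is logged, which is what we expect here.
        with self.assertRaises(AssertionError):
            with self.assertLogs(logger, level="WARNING"):
                _warn_duplicate_rule_names(rows)

    def test_empty_input_emits_no_warning(self):
        with self.assertRaises(AssertionError):
            with self.assertLogs(logger, level="WARNING"):
                _warn_duplicate_rule_names([])
```

Saved as a module, the cases can be run with `python -m unittest`.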

Labels: enhancement (New feature or request)