6 changes: 3 additions & 3 deletions README.md
@@ -119,7 +119,7 @@ se_user_conf = {
user_config.se_dq_obs_alert_flag: True,
user_config.se_dq_obs_default_email_template: "",
user_config.se_notifications_email_from: "<sender_email_id>",
-user_config.se_notifications_email_to_other_nike_mail_id: "<receiver_email_id's>",
+user_config.se_notifications_email_to_other_mail_id: "<receiver_email_id's>",
user_config.se_notifications_email_subject: "spark expectations - data quality - notifications",
user_config.se_notifications_enable_slack: True,
user_config.se_notifications_slack_webhook_url: "<slack-webhook-url>",
@@ -131,8 +131,8 @@ se_user_conf = {
user_config.se_notifications_on_error_drop_threshold: 15,
#Optional
#Below two params are optional and need to be enabled to capture the detailed stats in the <stats_table_name>_detailed.
-#user_config.enable_query_dq_detailed_result: True,
-#user_config.enable_agg_dq_detailed_result: True,
+#user_config.se_enable_query_dq_detailed_result: True,
+#user_config.se_enable_agg_dq_detailed_result: True,
#Below two params are optional and need to be enabled to pass the custom email body
#user_config.se_notifications_enable_custom_email_body: True,
#user_config.se_notifications_email_custom_body: "'product_id': {}",
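For context, the two renamed flags above opt into writing detailed statistics to `<stats_table_name>_detailed`. A minimal sketch of how they might sit in a user configuration — the key names are shown as plain strings purely for illustration; real code references the `Constants` class from `spark_expectations.config.user_config` rather than raw strings:

```python
# Illustrative sketch only: in real usage these keys come from
# spark_expectations.config.user_config.Constants, not string literals.
se_user_conf = {
    "se_notifications_enable_email": False,
    # Optional: capture detailed stats in <stats_table_name>_detailed.
    # Note the se_ prefix on both keys after the rename.
    "se_enable_query_dq_detailed_result": True,
    "se_enable_agg_dq_detailed_result": True,
}

# Collect the detailed-stats flags that are switched on.
enabled = sorted(k for k, v in se_user_conf.items()
                 if k.startswith("se_enable_") and v is True)
print(enabled)
```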
2 changes: 1 addition & 1 deletion docs/Observability_examples.md
@@ -63,4 +63,4 @@ Users have two options for generating the report table:
- If no custom template is provided by the user, the system will automatically fall back to the default jinja template ([spark_expectations/config/templates/advanced_email_alert_template.jinja](../spark_expectations/config/templates/advanced_email_alert_template.jinja)) for rendering the report table.

### *Sample for the alert received in the mail*
-![Spark Expectation alert](se_diagrams/alert_sample.png)
+![Spark Expectation alert](https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/alert_sample.png?raw=true)
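The fallback described in this file is a simple either/or choice. A toy sketch of that selection logic — the function name, parameter, and path handling are hypothetical, not the framework's internal code:

```python
# Packaged default template path, per the docs above.
DEFAULT_TEMPLATE = (
    "spark_expectations/config/templates/advanced_email_alert_template.jinja"
)

def pick_template(custom_template=None):
    """Hypothetical sketch: fall back to the packaged default Jinja
    template when the user supplies no custom template."""
    return custom_template if custom_template else DEFAULT_TEMPLATE

print(pick_template(None))        # falls back to the default
print(pick_template("my.jinja"))  # user-supplied template wins
```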
10 changes: 5 additions & 5 deletions docs/bigquery.md
@@ -1,4 +1,4 @@
-### Example - Write to Delta
+### Example - Write to BigQuery

Setup SparkSession for BigQuery to test in your local environment. Configure accordingly for higher environments.
Refer to Examples in [base_setup.py](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/scripts/base_setup.py) and
@@ -22,9 +22,9 @@ spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "<temp_dataset>")
```

-Below is the configuration that can be used to run SparkExpectations and write to Delta Lake
+Below is the configuration that can be used to run SparkExpectations and write to BigQuery

-```python title="iceberg_write"
+```python title="bigquery_write"
import os
from pyspark.sql import DataFrame
from spark_expectations.core.expectations import (
@@ -88,8 +88,8 @@ user_conf = {
# user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
# user_config.se_notifications_on_error_drop_threshold: 15,
# user_config.se_enable_error_table: True,
-# user_config.enable_query_dq_detailed_result: True,
-# user_config.enable_agg_dq_detailed_result: True,
+# user_config.se_enable_query_dq_detailed_result: True,
+# user_config.se_enable_agg_dq_detailed_result: True,
# user_config.se_dq_rules_params: { "env": "local", "table": "product", },
}

Empty file.
10 changes: 0 additions & 10 deletions docs/configurations/databricks_setup_guide.md

This file was deleted.

11 changes: 10 additions & 1 deletion docs/css/custom.css
@@ -35,4 +35,13 @@ a.autorefs-external:hover::after {
.md-typeset table td:first-child {
min-width: 160px;
white-space: nowrap;
-}
+}
+
+/* Mermaid diagram - uniform light fills matching the sequence diagram style */
+:root {
+  --md-mermaid-node-bg-color: #E8EAF6;
+  --md-mermaid-node-fg-color: #1A237E;
+  --md-mermaid-edge-color: #546E7A;
+  --md-mermaid-label-bg-color: #ffffff;
+  --md-mermaid-label-fg-color: #263238;
+}
4 changes: 2 additions & 2 deletions docs/delta.md
@@ -79,8 +79,8 @@ user_conf = {
# user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
# user_config.se_notifications_on_error_drop_threshold: 15,
# user_config.se_enable_error_table: True,
-# user_config.enable_query_dq_detailed_result: True,
-# user_config.enable_agg_dq_detailed_result: True,
+# user_config.se_enable_query_dq_detailed_result: True,
+# user_config.se_enable_agg_dq_detailed_result: True,
# user_config.se_dq_rules_params: { "env": "local", "table": "product", },
}

12 changes: 6 additions & 6 deletions docs/examples.md
@@ -32,8 +32,8 @@ se_user_conf = {
user_config.se_notifications_on_rules_action_if_failed_set_ignore: True, # (18)!
user_config.se_notifications_on_error_drop_threshold: 15, # (19)!
user_config.se_enable_error_table: True, # (20)!
-user_config.enable_query_dq_detailed_result: True, # (21)!
-user_config.enable_agg_dq_detailed_result: True, # (22)!
+user_config.se_enable_query_dq_detailed_result: True, # (21)!
+user_config.se_enable_agg_dq_detailed_result: True, # (22)!
user_config.querydq_output_custom_table_name: "<catalog.schema.table-name>", #23
user_config.se_dq_rules_params: {
"env": "local",
@@ -56,16 +56,16 @@ se_user_conf = {
10. The `user_config.se_notifications_email_subject` parameter captures the subject line of the email
11. The `user_config.se_notifications_email_custom_body` parameter is optional; it captures a custom email body, which must comply with the expected syntax
12. The `user_config.se_notifications_enable_slack` parameter, which controls whether notifications are sent via slack, is set to false by default
-13. The `user_config/se_notifications_slack_webhook_url` parameter accepts the webhook URL of a Slack channel for sending notifications
+13. The `user_config.se_notifications_slack_webhook_url` parameter accepts the webhook URL of a Slack channel for sending notifications
14. When the `user_config.se_notifications_on_start` parameter is set to `True`, a notification is sent when spark-expectations starts; it defaults to `False`
15. When the `user_config.se_notifications_on_completion` parameter is set to `True`, a notification is sent when the spark-expectations framework completes; it defaults to `False`
16. When the `user_config.se_notifications_on_fail` parameter is set to `True`, a notification is sent when the spark-expectations data quality framework fails; it defaults to `True`
17. When the `user_config.se_notifications_on_error_drop_exceeds_threshold_breach` parameter is set to `True`, a notification is sent when the error drop rate exceeds the configured threshold
18. When the `user_config.se_notifications_on_rules_action_if_failed_set_ignore` parameter is set to `True`, a notification is sent when a rule's action_if_failed is set to ignore
19. The `user_config.se_notifications_on_error_drop_threshold` parameter captures the error drop threshold value
20. The `user_config.se_enable_error_table` parameter controls whether error data is loaded into the error table; it is set to true by default. The default error table name is `<target_table>_error`; users can override it by calling `set_error_table_name(...)` on the `SparkExpectations` instance context (for example, `se._context.set_error_table_name("catalog.schema.custom_error_table")`) before executing the decorated function.
-21. When `user_config.enable_query_dq_detailed_result` parameter set to `True`, enables the option to capture the query_dq detailed stats to detailed_stats table. By default set to `False`
-22. When `user_config.enable_agg_dq_detailed_result` parameter set to `True`, enables the option to capture the agg_dq detailed stats to detailed_stats table. By default set to `False`
+21. When the `user_config.se_enable_query_dq_detailed_result` parameter is set to `True`, detailed query_dq stats are captured in the detailed_stats table; it defaults to `False`
+22. When the `user_config.se_enable_agg_dq_detailed_result` parameter is set to `True`, detailed agg_dq stats are captured in the detailed_stats table; it defaults to `False`
23. The `user_config.querydq_output_custom_table_name` parameter specifies the name of the custom query_dq output table, which captures the output of the alias queries passed in the query_dq expectation. The default is `<stats_table>_custom_output`
24. The `user_config.se_dq_rules_params` parameter supplies the values required to dynamically update dq rules
25. The `user_config.se_notifications_enable_templated_basic_email_body` optional parameter enables using a Jinja template for basic email notifications (notifying on job start, completion, failure, etc.)
@@ -152,7 +152,7 @@ from spark_expectations.config.user_config import Constants as user_config

stats_streaming_config_dict: Dict[str, Union[bool, str]] = {
user_config.se_enable_streaming: True, # (1)!
-user_config.secret_type: "databricks", # (2)!
+user_config.secret_type: "cerberus", # (2)!
user_config.cbs_url : "https://<url>.cerberus.com", # (3)!
user_config.cbs_sdb_path: "cerberus_sdb_path", # (4)!
user_config.cbs_kafka_server_url: "se_streaming_server_url_secret_sdb_path", # (5)!
48 changes: 48 additions & 0 deletions docs/examples_notebooks.md
@@ -0,0 +1,48 @@
# Jupyter Notebook Examples

Spark Expectations ships with ready-to-run Jupyter notebooks in the [`examples/notebooks/`](https://github.com/Nike-Inc/spark-expectations/tree/main/examples/notebooks) directory. These notebooks are the fastest way to explore the framework interactively.

## Running Locally with Docker

The repository provides a Docker Compose setup that launches Jupyter Lab alongside Kafka and Mailpit (a local SMTP test server):

```bash
# Generate self-signed certificates for the mail server
make generate-mailserver-certs

# Start all services (Jupyter + Kafka + Mailpit)
make local-se-server-start ARGS="--build"
```

Once running:

- **Jupyter Lab**: [http://localhost:8888](http://localhost:8888)
- **Mailpit UI** (view sent emails): [http://localhost:8025](http://localhost:8025)
- **Kafka**: `localhost:9092`

## Available Notebooks

| Notebook | Description |
|---|---|
| [`spark_expectation_basic_jlab.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_basic_jlab.ipynb) | Basic DQ setup with Delta Lake. Demonstrates row, aggregate, and query DQ rules with a local SparkSession. |
| [`spark_expectation_basic_mail.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_basic_mail.ipynb) | Email notifications. Sends DQ alerts to the local Mailpit SMTP server. |
| [`spark_expectation_basic_mail_templates.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_basic_mail_templates.ipynb) | Jinja-templated email notifications. Demonstrates custom and default HTML email templates. |
| [`spark_expectations_basic_slack_notification.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectations_basic_slack_notification.ipynb) | Slack notifications via incoming webhooks. |
| [`spark_expectation_basic_pagerduty_notification.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_basic_pagerduty_notification.ipynb) | PagerDuty notifications via Events API v2. |
| [`spark_expectation_notifications_integration.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_notifications_integration.ipynb) | Multi-channel notification integration (email + Slack + more). |
| [`spark_expectations_aggregation_rules.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectations_aggregation_rules.ipynb) | Aggregate and query DQ rules with detailed stats tables. |
| [`spark_expectation_basic_dbx.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_basic_dbx.ipynb) | Databricks-specific setup. Designed for import into Databricks Repos. |
| [`spark_expectation_streaming_dbx.ipynb`](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/notebooks/spark_expectation_streaming_dbx.ipynb) | Streaming DQ with `WrappedDataFrameStreamWriter` on Databricks. |

## Sample Scripts

In addition to notebooks, runnable Python scripts are available in [`examples/scripts/`](https://github.com/Nike-Inc/spark-expectations/tree/main/examples/scripts):

| Script | Description |
|---|---|
| `sample_dq_delta.py` | Delta Lake batch DQ with order/product/customer data |
| `sample_dq_bigquery.py` | BigQuery batch DQ |
| `sample_dq_iceberg.py` | Iceberg batch DQ |
| `sample_dq_delta_streaming.py` | Streaming DQ with `WrappedDataFrameStreamWriter` |
| `sample_dq_yaml_json.py` | Loading rules from YAML/JSON files via `load_rules_from_yaml` |
| `base_setup.py` | Shared SparkSession setup for Delta, Iceberg, and BigQuery |
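The `sample_dq_yaml_json.py` script loads rules from files rather than from a rules table. As a rough, self-contained sketch of that idea — the real helper is `load_rules_from_yaml` in spark-expectations, whose exact signature may differ, and the rule fields below are illustrative, not a documented schema:

```python
import json

# Hypothetical rules payload; field names mirror the usual
# spark-expectations rules-table columns but are illustrative here.
rules_json = """
[
  {"product_id": "your_product", "rule_type": "row_dq",
   "rule": "quantity_positive", "expectation": "quantity > 0",
   "action_if_failed": "drop", "is_active": true},
  {"product_id": "your_product", "rule_type": "agg_dq",
   "rule": "row_count_gt_zero", "expectation": "count(*) > 0",
   "action_if_failed": "fail", "is_active": false}
]
"""

rules = json.loads(rules_json)
# Keep only active rules, as a rules loader typically would.
active = [r["rule"] for r in rules if r["is_active"]]
print(active)
```

The same shape works for YAML with a YAML parser in place of `json`.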
112 changes: 0 additions & 112 deletions docs/github_copilot_chat_commands.md

This file was deleted.

28 changes: 28 additions & 0 deletions docs/home/why_spark_expectations.md
@@ -17,6 +17,34 @@ required for this activity

`Spark-Expectations solves all of the above problems by following the below principles`

## How It Works

```mermaid
sequenceDiagram
participant Input as Input DataFrame
participant SE as SparkExpectations
participant Target as Target Table
participant Error as Error Table
participant Stats as Stats Table

Input->>SE: Submit DataFrame
rect rgb(232, 234, 246)
Note over SE: Phase 1 — Source Validation
SE->>SE: Run agg_dq and query_dq on source
end
rect rgb(232, 234, 246)
Note over SE: Phase 2 — Row Validation
SE->>SE: Run row_dq on every row
SE->>Error: Failed rows written to error table
end
rect rgb(232, 234, 246)
Note over SE: Phase 3 — Target Validation
SE->>SE: Run agg_dq and query_dq on cleaned data
end
SE->>Target: Clean rows written to target table
SE->>Stats: Metrics and rule results logged
```

* Spark Expectations provides the ability to run both individual row-based and overall aggregated data quality rules
on both the source and validated data sets. In case a rule fails, the row-level error is recorded in the `_error` table
and a summarized report of all failed aggregated data quality rules is compiled in the `_stats` table.
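The three phases in the sequence diagram are driven by each rule's `rule_type`. A toy sketch of that dispatch, assuming the conventional `row_dq` / `agg_dq` / `query_dq` type names — in the real framework, whether an agg or query rule runs on source, target, or both is configurable per rule, so this mapping is purely illustrative:

```python
# Illustrative dispatch: which validation phase(s) each rule_type
# can participate in, following the sequence diagram above.
PHASES = {
    "row_dq": ["row_validation"],                            # Phase 2 only
    "agg_dq": ["source_validation", "target_validation"],    # Phases 1 & 3
    "query_dq": ["source_validation", "target_validation"],  # Phases 1 & 3
}

def phases_for(rule_type):
    """Return the validation phases a rule of this type runs in."""
    return PHASES.get(rule_type, [])

print(phases_for("row_dq"))
```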
10 changes: 5 additions & 5 deletions docs/iceberg.md
@@ -1,6 +1,6 @@
-### Example - Write to Delta
+### Example - Write to Iceberg

-Setup SparkSession for iceberg to test in your local environment. Configure accordingly for higher environments.
+Setup SparkSession for Iceberg to test in your local environment. Configure accordingly for higher environments.
Refer to Examples in [base_setup.py](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/scripts/base_setup.py) and
[iceberg.py](https://github.com/Nike-Inc/spark-expectations/blob/main/examples/scripts/sample_dq_iceberg.py)

@@ -29,7 +29,7 @@ builder = (
spark = builder.getOrCreate()
```

-Below is the configuration that can be used to run SparkExpectations and write to Delta Lake
+Below is the configuration that can be used to run SparkExpectations and write to Iceberg

```python title="iceberg_write"
import os
@@ -85,8 +85,8 @@ user_conf = {
# user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True,
# user_config.se_notifications_on_error_drop_threshold: 15,
# user_config.se_enable_error_table: True,
-# user_config.enable_query_dq_detailed_result: True,
-# user_config.enable_agg_dq_detailed_result: True,
+# user_config.se_enable_query_dq_detailed_result: True,
+# user_config.se_enable_agg_dq_detailed_result: True,
# user_config.se_dq_rules_params: { "env": "local", "table": "product", },
}
