Data quality testing for SQL-, Spark-, and Pandas-accessible data.
✔ An open-source, CLI tool and Python library for data quality testing
✔ Compatible with the Soda Checks Language (SodaCL)
✔ Enables data quality testing both in and out of your data pipelines and development workflows
✔ Integrates with your data pipelines, so you can run a Soda scan inside a pipeline or run programmatic scans on a time-based schedule
Soda Core is a free, open-source, command-line tool and Python library that enables you to use the Soda Checks Language to turn user-defined input into aggregated SQL queries.
When it runs a scan on a dataset, Soda Core executes the checks to find invalid, missing, or unexpected data. When your Soda Checks fail, they surface the data that you defined as bad-quality.
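To give a rough sense of what "aggregated SQL queries" means, the sketch below computes two metrics in a single pass over a table using Python's built-in sqlite3 module. The table, columns, and rows are hypothetical, and the query is written by hand for illustration; it is not the SQL that Soda Core generates.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)

# One aggregated query returns several metrics at once,
# instead of issuing one query per check.
row_count, missing_email = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)
    FROM customers
    """
).fetchone()
conn.close()

print("row_count:", row_count)
print("missing_count(email):", missing_email)
```

Each metric can then be compared against a threshold, which is the role a check such as `missing_count(email) = 0` plays.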
Consider migrating to Soda Library, an extension of Soda Core that offers more features and functionality, and enables you to connect to a Soda Cloud account to collaborate with your team on data quality.
- Use Group by and Group Evolution configurations to intelligently group check results.
- Leverage Reconciliation checks to compare data between data sources for data migration projects.
- Use Schema Evolution checks to automatically validate schemas.
- Set up Anomaly Detection checks to automatically learn patterns and discover anomalies in your data.
Install Soda Library and get started with a 45-day free trial.
Soda Core currently supports connections to several data sources. See Compatibility for a complete list.
Requirements
- Python 3.8 or greater
- Pip 21.0 or greater
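If you want to confirm these requirements from a script, a minimal sketch using only the Python standard library might look like the following. The helper names `parse_version` and `meets_minimum` are hypothetical and exist only for this example.

```python
import re
import sys


def parse_version(text):
    """Return the (major, minor) numbers at the start of a version string."""
    match = re.match(r"(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    # A missing minor component (for example "24") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(version_text, minimum):
    """True if version_text is at least the (major, minor) minimum."""
    return parse_version(version_text) >= minimum


python_ok = sys.version_info[:2] >= (3, 8)

try:
    # importlib.metadata ships with Python 3.8+. PackageNotFoundError is a
    # subclass of ImportError, so one except clause covers both an old
    # interpreter and a missing pip installation.
    from importlib.metadata import version
    pip_text = version("pip")
except ImportError:
    pip_text = None

pip_ok = pip_text is not None and meets_minimum(pip_text, (21, 0))
print("Python 3.8 or greater:", python_ok)
print("Pip 21.0 or greater:", pip_ok)
```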
Install and run
- To get started, use the install command, replacing soda-core-postgres with the package that matches your data source. See Install Soda Core for a complete list.
  pip install soda-core-postgres
- Prepare a configuration.yml file to connect to your data source. Then, write data quality checks in a checks.yml file. See Configure Soda Core.
- Run a scan to review the checks that passed, failed, or warned. See Run a Soda Core scan.
  soda scan -d your_datasource -c configuration.yml checks.yml
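The same scan can also be started from Python rather than the command line. The sketch below assumes the `Scan` class in the `soda.scan` module and the method names as I understand them from Soda's programmatic scan documentation; check them against the version you have installed. The wrapper function `run_scan` is a hypothetical helper, not part of Soda Core.

```python
def run_scan(data_source, configuration_path, checks_path):
    """Run one Soda scan and return True if no check failed."""
    # Imported inside the function so this module still loads on machines
    # where Soda Core is not installed.
    from soda.scan import Scan

    scan = Scan()
    scan.set_data_source_name(data_source)
    scan.add_configuration_yaml_file(file_path=configuration_path)
    scan.add_sodacl_yaml_file(checks_path)
    scan.execute()

    # Print the scan logs so results are visible in pipeline output.
    print(scan.get_logs_text())
    return not scan.has_check_fails()


# Example call, using the same names as the CLI command:
# run_scan("your_datasource", "configuration.yml", "checks.yml")
```

Returning a boolean lets a scheduler or pipeline step decide whether to stop downstream tasks when a check fails.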
# Checks for basic validations
checks for dim_customer:
  - row_count between 10 and 1000
  - missing_count(birth_date) = 0
  - invalid_percent(phone) < 1 %:
      valid format: phone number
  - invalid_count(number_cars_owned) = 0:
      valid min: 1
      valid max: 6
  - duplicate_count(phone) = 0

# Checks for schema changes
checks for dim_product:
  - schema:
      name: Find forbidden, missing, or wrong type
      warn:
        when required column missing: [dealer_price, list_price]
        when forbidden column present: [credit_card]
        when wrong column type:
          standard_cost: money
      fail:
        when forbidden column present: [pii*]
        when wrong column index:
          model_name: 22

  # Check for freshness
  - freshness(start_date) < 1d

# Check for referential integrity
checks for dim_department_group:
  - values in (department_group_name) must exist in dim_employee (department_name)
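To make the warn and fail conditions of the schema check more concrete, here is a small plain-Python illustration of the required-column and forbidden-column rules. It does not use Soda Core and is not how Soda Core evaluates a schema. The column list is hypothetical, and the sketch assumes that `*` in `pii*` behaves like a shell-style wildcard.

```python
from fnmatch import fnmatchcase

# Hypothetical column names standing in for dim_product.
actual_columns = ["product_key", "list_price", "model_name", "pii_owner_email"]


def missing_required(columns, required):
    """Required column names that are absent from the dataset."""
    return [name for name in required if name not in columns]


def forbidden_present(columns, patterns):
    """Dataset columns that match any forbidden name or wildcard pattern."""
    return [
        name for name in columns
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


warn_missing = missing_required(actual_columns, ["dealer_price", "list_price"])
warn_forbidden = forbidden_present(actual_columns, ["credit_card"])
fail_forbidden = forbidden_present(actual_columns, ["pii*"])

# A fail condition outranks a warn condition.
if fail_forbidden:
    outcome = "fail"
elif warn_missing or warn_forbidden:
    outcome = "warn"
else:
    outcome = "pass"

print("outcome:", outcome)
print("missing required columns:", warn_missing)
print("forbidden columns present:", fail_forbidden + warn_forbidden)
```

With the made-up columns above, the missing `dealer_price` column would only warn, but the column matching `pii*` pushes the overall outcome to fail.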