"Hold on a second! Why is this column accepting that value?" Is this something that you always hear and see on data projects? If you are working in data, you must have come across data quality issues before. Rectifying data quality issues can prove to be a nightmare especially in the curated data layer(s).
Fear not! Introducing you Databricks' new data quality tool: Databricks Quality eXtended (DQX). A tool designed to proactively check data quality and minimise data quality issues in downstream layers. In this blog we will be looking at the features offered by DQX.
Introduction
Databricks Quality eXtended (DQX) is a new tool designed to simplify data quality checking at scale for streaming and standard data frames. This tool is fairly new and new features are added as the tool evolves. At a high-level, DQX seems to offer two main features:
- Data Profiling: Profile the data to generate summary statistics and data expectations (i.e. data validation rules) which can be used downstream as part of the data validation process.
- Data Validation: Validate the data for a given dataset. The validation rules can be defined in several ways, e.g. in Python or using a config YAML file.
Image from GitHub: Databrickslabs DQX
The above image is from the DQX GitHub documentation; it shows a high-level process on how to incorporate DQX in an end-to-end workflow based on a medallion architecture. Key things to observe: we have the ability to quarantine data and bring it back later into the silver layer.
The GitHub documentation provides three demo notebooks that may be suitable based on a specific workload you wish to explore:
- using DQX as a Python library
- using DQX as a tool
- using DQX with DLT (Delta Live Tables)
What are my current alternative options?
You may have implemented data quality solutions with alternatives offered by Databricks, e.g. rescued data or Delta Live Table (DLT) expectations. How does DQX compare? Below are some of the key differentiators we observed:
Feature | Rescued Data | DLT | DQX |
---|---|---|---|
Purpose | Data preservation during ingestion | Declarative ETL framework with built-in data quality features | Comprehensive data quality management |
Integration | Built-in feature of Auto Loader | Native Databricks feature | Separate framework by Databricks Labs |
Data Handling | Preserves mismatched data | Manages data flow between tables | Validates data quality across dimensions |
Output | Stores mismatched data in '_rescued_data' column | Creates and updates Delta tables | Provides detailed quality issue information |
Implementation | Automatic with Auto Loader | Requires pipeline definition | Requires explicit implementation |
Flexibility | Limited to schema mismatches | Flexible in defining data transformation logic and simple quality expectations | Highly customizable with user-defined rules |
Scope | Individual records during ingestion | End-to-end data pipelines | Entire datasets or specific columns |
Benefits of using DQX
DQX is capable of profiling your data and output the profiled results back to you in a config.yml file. This file can subsequently be used in the data validation phase. Within the configuration file, you can essentially see the data quality rules that the profiler has inferred. You have the option to modify or add new data quality rule as you wish. The profiler is definitely something to look at!
Utilising the config.yml file, you are able to validate more data. Validating the data against the data quality rules ensures flagging or quarantining of invalid records. In a nutshell, data validation ensures only the 'right' data get passed in the subsequent layers which in effect ensures the overall data integrity. You can also implement a mechanism to re-introduce the invalid records as, at the time of writing, there is no explicit feature offered by DQX to do so.
Challenges
- Currently, there is little to no information/documentation about how DQX dashboard works. It was observed that this is currently being worked on (GitHub).
- Currently, there is little to no information/documentation utilising the main config.yaml for creating the output- and quarantine- tables.
- Profiling the data is compute intensive process. If profiling is to occur on large datasets / repeated workflows, consideration to cluster size must be given. In addition, if using workflows for data profiling, this process needs to consider the end-to-end data architecture and its impacts on performance and costs.
- The DQX framework, while powerful, may have its own learning curve.
- Different data types (e.g., numerical, categorical, temporal) may require different validation approaches.
- Incorporating DQX checks into existing data pipelines and workflows may require careful planning and potential refactoring of current processes.
Prerequisites
- Databricks CLI - if installing on Databricks cluster
- Unity Catalog (UC): Although this currently seems to be a soft requirement as we could implement DQX without UC
- Python 3.10+
- Databricks workspace and cluster (DBR spark 3.5.0+)
A detailed list of all prerequisites can be found here. For the demo examples below, we will assume all prerequisites have been met.
Data Profiling
At an initial glance of the GitHub documentation, data profiling can be done in a few ways. The best one for data engineering workloads would be either using a Python approach with its relevant methods or using the workflow method. The workflow method looks appealing as it seems to be able to support the full end-to-end process - bronze to gold layer. When installing DQX, a sample workflow is given to you. This allows you to get started quickly.
Before we deep dive into data profiling, there are some key remarks to mention. Profiling seems to be a compute intensive process. Therefore, you may wish to ensure you are using the best compute capacity for the dataset your are profiling. Another remark is that profiling may not be suitable for lower environments, e.g. development if using mocked/simulated data. In this scenario, you could utilise profiling to establish a template for data quality rules. However, keep in mind that these rules might not accurately reflect the quality rules of the real production data.
Now let's get started with data profiling. In our example, we will assume we are profiling the bronze layer in the medallion architecture. Depending on how you configured the DQX, you should be able to see a .dqx folder in your user workspace (in this demo we did the default installation. You can also install the DQX globally - see the GitHub documentation).
Within this folder, you should see a "config.yml" file. Let's edit this file. From the GitHub documentation, to profile data, we should specify the following:
- input_location: Location of the data - note if it is a path no quotes required.
- input_format: format of the input data.
- checks_file: The relative file location for the data quality files to be populated.
- profile_summary_stats: The relative file location for the summary stats of our data to be populated.
Next, when navigating to the Databricks Workflows section you should see a DQX profiler workflow:
When running this job, the config.yml file gets picked up and is run against the DQX python wheel.
After running, you should see "Status = Success" (Note that we used my interactive cluster which was up and running rather than a job cluster hence the fast compute time):
You can navigate into the job and see the output logs. Detailed output logs can be found inside the .dqx folder:
Now in the .dqx folder, we also have the two files we were expecting:
These files were auto-generated when the profiler workflow ran. Looking into the YAML files will give you the expectation of potential data quality rules (check_xxxx.yml). You may need to modify this to update/modify data quality rules in which case you can directly edit the YAML file. The profile_summary_stats_xxx.yml contains statistical information about the data, such as min/max values of a column.
Data Validity
Now when data flows into the silver layer, any records that do not match the data quality rules, can be either quarantined or flagged in the silver table. We will plan to separate the invalid records into a quarantine table - we can verify the invalid records with the business user(s). Remember, you also have the option to simply flag invalid records in the next layers' table with two additional audit columns (_errors and _warnings).
Initially, we tried utilising the config.yaml and the workflow, however, it did not work as expected (see below). In addition, at the time writing, not much documentation is provided on this.
Therefore, as a workaround, we had to utilise the Python library and its relevant methods. We read more data from the bronze and passed it to the "apply_checks_by_metadata_and_split()"
as a dataframe and also passed the data quality YAML definition. Note if you want to simply flag it in the next table as additional columns, you will need to use the "apply_checks()"
method instead. The output in our case returned two data frames - the valid and quarantined dataframes. The valid dataframe contains the records that have passed the data checks and the quarantined dataframe contains the records which did not pass the data quality checks.
When looking into the quarantined dataframe, we can see the reason why the records failed the data quality checks by checking the audit columns _errors and _warnings:
The above gives an introduction on how to quarantine data with DQX. At the time of writing, not much information is given on how to reintroduce quarantined data back into the silver layer. Therefore, you may need to implement this manually depending on your needs.
More on Data Quality rules
Data quality is crucial for effective analytics and decision-making. We will look at standard built-in functions like "is_not_null" or "value_in_range" that are automatically generated by the profiler that are helpful, as well as more advanced checks that may require customised rules and in most cases are a necessary requirement in order to meet expectations.
DQX supports various types of data quality checks. We can use built-in functions for common validations, SQL-based expressions for more complex checks, and even custom Python functions for highly specialised validations. This flexibility allows us to address a wide range of data quality scenarios. Let us explore ways to validate the data using DQX.
Python based:
DQX supports direct Python-based checks, allowing custom logic beyond the capabilities of simple expressions. Below is an example using built-in functions for validating the New York taxi dataset for fields such as trip_distance, pickup_datetime, dropoff_datetime, and passenger_count.
SQL Expressions:
Now, let's validate the same data using SQL-based rules. Here's an example of how to configure the SQL-based validation using built-in SQL expression:
Custom Functions:
For even greater control and reusability, DQX allows the creation of custom Python functions. These functions can implement specific validation logic tailored to your business needs.
To use this custom function in your checks, you'll need to include it in your configuration:
Note the addition of "globals()"
allows the use of custom check functions defined in your global scope. A check function should return a pyspark.sql.Column
. Custom functions offer powerful customisation options, enabling domain-specific validations and comprehensive quality assurance tailored to your specific data needs.
Improving DQ Across Layers
Data quality is a critical concern that spans across various stages of data processing. To ensure consistent and reliable data throughout your pipeline, it's essential to implement robust quality checks across multiple layers, such as bronze, silver, and gold. The Data Quality eXtended (DQX) framework provides a powerful solution for achieving this goal.
Configuring Multi-Layer Checks
To apply data quality checks across different layers, you'll need to modify the config.yml
file to specify distinct settings for each layer. Here's an example configuration:
This configuration allows you to define separate input locations, check files, and profile summary files for each layer, enabling tailored quality control at every stage of your data pipeline
Executing Layer-Specific Workflows
Once you've set up the configuration, you can run the DQX profiler for each layer individually:
- Ensure data is present in the
input_location
for the target layer. - Navigate to the Workflows section and select your DQX profiler.
- Click "Run with different parameters" from the "Run Now" dropdown.
- Enter the layer name (e.g., "bronze" or "silver") in the 'run_config_name' field.
Analysing Results
After running the workflows, you'll find separate checks and profile summary files for each layer:
- checks_bronze.yml and profile_summary_bronze.yml for the bronze layer
- checks_silver.yml and profile_summary_silver.yml for the silver layer
These layer-specific files allow you to analyse and refine your data quality rules at each stage of processing, ensuring that data integrity is maintained throughout your pipeline. By implementing this multi-layered approach, you can catch and address data quality issues early on in the process, prevent the propagation of errors, and maintain high-quality data across your entire data ecosystem.
Conclusion
This tool is fairly new and should be used with caution for production/end-to-end workloads as it does not come with any service level agreements (SLAs) as of now. The various ways how DQX can be integrated in your existing new workloads seems to be promising but lack of documentation makes it a challenge. We are pretty sure more documentation and features are a top priority as seen by the GitHub feature backlog. One important consideration is the performance and cost implications of using data profiling and data validation on larger datasets- again it's on GitHub feature backlog. In summary, we think keeping an eye out how DQX will play in the wider data ecosystem is a must. Have a go with DQX yourself! Let us know your thoughts!
What next?
If you’re looking for assistance in how you can best approach your data quality implementation in your existing or new data architecture, Advancing Analytics can help! Contact hello@advancinganalytics.co.uk to talk to our Engineering team today.
*Disclaimer: The blog post image was generated by AI and does not depict any real person, place, or event.
References
Topics Covered :
Author
Sarg & Aaron