Twitter engineers recently shared a blog post about how they designed and developed a data quality automation platform using Google Cloud Platform (GCP) and open source software.
Data quality describes the health of the data and is judged by criteria such as completeness, accuracy, and consistency.
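As a rough illustration of how such criteria can be turned into numbers, the sketch below computes a completeness ratio and a simple consistency ratio over a few in-memory rows. The row layout, the field names, and the consistency rule (every `country` value must belong to a known set) are assumptions made for this example and do not come from Twitter's post.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def completeness(rows: List[Row], field: str) -> float:
    """Fraction of rows whose `field` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def consistency(rows: List[Row], field: str, allowed: Set[Any]) -> float:
    """Fraction of non-null `field` values that belong to the allowed set."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return 0.0
    return sum(1 for value in values if value in allowed) / len(values)


# Hypothetical sample data, invented for this sketch.
SAMPLE_ROWS: List[Row] = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": None},
    {"user_id": 3, "country": "XX"},
    {"user_id": 4, "country": "JP"},
]

if __name__ == "__main__":
    print("completeness:", completeness(SAMPLE_ROWS, "country"))
    print("consistency:", consistency(SAMPLE_ROWS, "country", {"US", "JP"}))
```

In practice such metrics would more likely be computed by SQL queries against the warehouse. Plain Python is used here only to keep the example self-contained.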
Previously, some product teams tested data manually and independently, running SQL queries through the BigQuery user interface or Jupyter notebooks. There was no single framework for automating data quality checks and performing them consistently.
Automated data quality checks are important for identifying anomalies and for verifying the accuracy and reliability of datasets at scale. They help to achieve:
Confidence: When data quality is better, customers have more confidence in the results they produce, which lowers risk and increases efficiency.
Better performance: Clients can be more productive and focus on their core decisions instead of spending time validating and debugging data.
Avoid lost revenue: In the decision-making process, bad data can lead to lost revenue.
The team built the Data Quality Platform (DQP), a managed, configuration-driven, workflow-based solution within GCP for building and collecting standard and custom quality metrics, alerting on data validation, and monitoring those metrics and statistics.
These features within the platform allow the team to identify and monitor anomalies, latency, accuracy, and reliability of the datasets.
As shown in the diagram, the input to the system is a YAML configuration. It triggers Airflow jobs that test different resources at different cadences and granularities. The results are sent to a Pub/Sub queue. Later, a Dataflow job lands the dataset at the correct destination with the appropriate quality metrics.
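The flow described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not Twitter's implementation: the configuration is written as a Python dict standing in for the YAML input, its keys (`dataset`, `schedule`, `checks`) are invented for this example, and the returned list of JSON strings stands in for messages that would be published to a queue. No Airflow, Pub/Sub, or Dataflow APIs are called.

```python
import json
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical configuration; a Python dict standing in for the YAML input.
# Every key and value below is an assumption made for illustration.
CONFIG: Dict[str, Any] = {
    "dataset": "example_project.example_dataset.events",
    "schedule": "daily",
    "checks": [
        {"name": "row_count_min", "metric": "row_count", "min": 3},
        {"name": "null_user_ids", "metric": "null_count", "field": "user_id", "max": 0},
    ],
}


def row_count(rows: List[Row], check: Dict[str, Any]) -> int:
    """Number of rows in the batch."""
    return len(rows)


def null_count(rows: List[Row], check: Dict[str, Any]) -> int:
    """Number of rows where the configured field is missing or None."""
    return sum(1 for row in rows if row.get(check["field"]) is None)


# Maps the metric name used in the configuration to the function computing it.
METRICS: Dict[str, Callable[[List[Row], Dict[str, Any]], int]] = {
    "row_count": row_count,
    "null_count": null_count,
}


def run_checks(config: Dict[str, Any], rows: List[Row]) -> List[str]:
    """Evaluate each configured check and return one JSON message per result."""
    messages = []
    for check in config["checks"]:
        value = METRICS[check["metric"]](rows, check)
        lower = check.get("min", float("-inf"))
        upper = check.get("max", float("inf"))
        messages.append(json.dumps({
            "dataset": config["dataset"],
            "check": check["name"],
            "value": value,
            "passed": lower <= value <= upper,
        }))
    return messages


if __name__ == "__main__":
    sample_rows = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}]
    for message in run_checks(CONFIG, sample_rows):
        print(message)
```

Keeping the checks in configuration rather than in code is what lets a scheduler run the same logic against many datasets. In a real deployment, the scheduled job would publish each message to the queue instead of returning it.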
As mentioned earlier, automating data quality checks is key to building high-quality data products. Many cloud service providers, such as AWS and GCP, offer data quality automation solutions. There are also many open source tools for data quality automation that are worth further research and exploration.