Twitter runs over 10 million queries a month on nearly an exabyte of data in BigQuery through an automated framework. The framework works on top of existing services such as GCP Dataflow and Apache Airflow, which move data from on-premises Hadoop into BigQuery.
Why data quality
Data quality is a measure of the health of data. Freshness, completeness, accuracy, and consistency are some of the criteria used to assess it.
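To make two of these criteria concrete, here is a minimal Python sketch of how completeness and freshness could be computed over a small in-memory sample. The function names, the column names (user_id, event_time), and the sample rows are all hypothetical illustrations, not part of Twitter's platform, which runs its checks against BigQuery.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Return the fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def freshness(rows, timestamp_column, now):
    """Return the age of the newest record as a timedelta."""
    newest = max(row[timestamp_column] for row in rows)
    return now - newest


# Hypothetical sample data: four events, one with a missing user_id.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"user_id": 1, "event_time": now - timedelta(hours=30)},
    {"user_id": None, "event_time": now - timedelta(hours=3)},
    {"user_id": 3, "event_time": now - timedelta(hours=5)},
    {"user_id": 4, "event_time": now - timedelta(hours=2)},
]

completeness_score = completeness(rows, "user_id")
data_age = freshness(rows, "event_time", now)
```

A check would then compare these values against thresholds chosen per dataset, for example requiring a completeness score above some minimum or a data age below some maximum.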
Some product teams test manually and independently, running SQL commands through the BigQuery user interface or Jupyter notebooks. There is no single framework for automating data quality checks and performing them consistently.
Automated data quality checks are needed to detect anomalies and to verify the accuracy and reliability of datasets at scale. They provide:
- Confidence in the data
- Better productivity
- Less lost revenue
Solution design
The data quality platform relies on a number of technologies in its stack. Using a CI/CD workflow, we upload YAML configurations into GCS. From there, Airflow’s connected worker runs the test for the associated resource, granularity, and cadence. The tests execute and send their results to a PubSub queue. A Dataflow job then lands the results from the queue in a target BigQuery table, which is used in Looker so that users can debug and identify trends in metrics.
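The flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's implementation: the configuration keys, the dataset name, the metric names, and both functions are hypothetical, and the sketch uses only the standard library instead of the Airflow, PubSub, or Dataflow clients. It shows the general shape of the idea: a test definition (as it might look once parsed from YAML) is evaluated against observed metrics, and each result is serialized to JSON bytes, the kind of payload a message queue carries.

```python
import json
from datetime import datetime, timezone

# Hypothetical test definition, shaped like a parsed YAML configuration.
CONFIG = {
    "dataset": "ads.impressions",  # hypothetical dataset name
    "cadence": "daily",
    "tests": [
        {"name": "row_count_min", "metric": "row_count", "min": 1000},
        {"name": "null_rate_max", "metric": "null_rate", "max": 0.01},
    ],
}


def run_tests(config, metrics, run_time):
    """Evaluate each configured test against the observed metrics."""
    results = []
    for test in config["tests"]:
        value = metrics[test["metric"]]
        # A missing bound is treated as unbounded on that side.
        lower = test.get("min", float("-inf"))
        upper = test.get("max", float("inf"))
        results.append({
            "dataset": config["dataset"],
            "cadence": config["cadence"],
            "test": test["name"],
            "value": value,
            "passed": lower <= value <= upper,
            "run_time": run_time.isoformat(),
        })
    return results


def to_messages(results):
    """Serialize each result as UTF-8 encoded JSON bytes."""
    return [json.dumps(r, sort_keys=True).encode("utf-8") for r in results]


# Hypothetical metrics, standing in for values computed by a query.
observed = {"row_count": 1200, "null_rate": 0.05}
results = run_tests(CONFIG, observed, datetime(2024, 1, 2, tzinfo=timezone.utc))
messages = to_messages(results)
```

In a real deployment, the metrics would come from queries against the dataset, the messages would be published to the queue by a client library, and a downstream job would write one row per message into the results table.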
The Data Quality Platform allowed Twitter to use open source libraries, Apache Airflow and Great Expectations, and integrate with GCP services such as GCS, PubSub, Dataflow, BigQuery, and Looker. This provides a complete automated solution to ensure the accuracy and reliability of the thousands of datasets that are ingested daily, increasing the confidence in the data provided to advertisers.