Leveraging dbt to Simulate Data Outages for Revefi

At Revefi, we empower users to better understand their data quality, cost, and performance from end to end. Our data operations cloud automatically alerts on both load and usage-related data anomalies, while providing insights into optimization and spend. To assess our product, we needed a test dataset that showcases Revefi’s capabilities, including deep lineage, compiled incident digests, external integrations, and RCA potential. We knew that the best way to demonstrate these features end-to-end was to build a realistic test dataset – one that effectively mocks real-world use cases.

TLDR: Outcomes & learnings from this project

Every company’s data infrastructure is unique and comes with its challenges. As data needs constantly fluctuate, so do data solutions - a process that we have witnessed firsthand as data engineers. Striving to meet the initial foundational requirements, this test dataset project led to many learning experiences. We explored many features that dbt offers, including jinja, macros, jobs, schedules, tests, sources, and documentation. We were able to step into the shoes of our users and better understand their thought processes from start to finish, creating a test dataset that has proved useful for several months.

However, the largest takeaway is that ultimately, our test dataset is not yet complete! As the product and specifically the ML & AI models we use at Revefi evolve, we continue to improve the dataset to keep pace.

What did we achieve?

A flexible data model and pipeline that can scale to any number of tables using DBT
Deep lineage which replicates customer environments
Arbitrary injection of multiple types of failures at any stage in the data graph
Ability to generate cascading failures in the data graph, where data quality issues in one table manifest as data quality issues in multiple downstream tables, even affecting the end user-facing BI dashboards
Multiple scheduling granularities of DBT models from an hour to days, mimicking the scenarios in real data warehouses

What challenges did we face in creating this test dataset?‍

To replicate a customer environment, our test dataset required complexity and scalability. In particular, we needed enough tables to build a deep lineage graph that captures cascading failures. These failures could include low row count, high row count, no/late data load, and long execution times, all of which could impact downstream tables. Furthermore, these events had to occur consistently but at unpredictable intervals throughout the pipeline to be realistic. Because of this probabilistic factor and our previous experience with these tools, we chose Snowflake and dbt as our respective data warehouse and data orchestration solution. Leveraging these already-integrated platforms helped us create a test dataset that met our requirements.

Where did the data come from?‍

We decided to utilize the TPC-H dataset. TPC-H is a benchmark used to measure the performance of highly complex decision-support databases. This dataset contains eight production-grade intercorrelated tables (CUSTOMER, LINEITEM, ORDERS, PART, etc.). An added benefit is that several scale variations of the TPC-H dataset are readily available on Snowflake.

‍

‍

Building the dataset

For our dataset, we took the original eight TPC-H tables and created closer to 100 tables with complex relationships. To accomplish this, we leveraged geo and time partitioning. Given a configurable number of partitions, we divided each table to represent different “countries”--in this case, the United States, Canada, and Mexico. Furthermore, to mimic real-world scenarios, we split our pipeline into three stages, raw, staging, and production. To simulate the ingestion of new data every day, we shifted the dates in the TPC-H dataset. An example lineage and model file for the ORDERS table are shown below.

‍

BI Systems

The last stage in our data-lineage-building process was adding Tableau and ThoughtSpot visualizations on top of the tables to provide us with custom insights into the trends and patterns in our dataset. Adding this integration allowed us to simulate data engineers’ holistic experience on Revefi’s production, from adding a data source to building dashboards on top. An example lineage with ThoughtSpot is shown below.

‍

Scheduling

Utilizing dbt as our ETL tool allowed us to schedule recurring runs of our data models at customizable intervals. We currently have models that run at hourly, daily, and monthly frequencies, which we specify using tags. While some models fully refresh in each run, others are incrementally materialized and grow consistently over time. We define these details for each dbt model using the config Jinja macro within a model, as shown below. This variation in model configuration leads to a more accurate representation of real-world dbt projects.

Injecting anomalies

Once we established the structure of the test dataset, we needed to inject recurring anomalies to test our incident detection. To accomplish this, we employed the Jinja templating language in our dbt models. Jinja let us use control structures like if statements and for loops in our SQL to introduce random variability. We wanted to enable several different failures, including low row count, high row count, no data load, or late data load, each with a 2-4% probability of occurring daily.

To simulate low or high row count errors, we limited the number of rows returned for the model in each run. In the example below, we have a query that 94% of the time returns 1500 rows. But 2% of the time, the query returns 0 rows (which also happens to coincide with a no data load failure), and 2% of the time, the query returns 6000 rows. We also simulated SQL errors by injecting fundamentally incorrect code into the SQL (in this case ‘ERROR’) with a 2% probability.

Embarking on this project has been very beneficial to our team, and empowered us to build a better system for our customers. As the product and specifically the ML & AI models we use at Revefi evolve, we will continue to improve the dataset to keep pace. Stay tuned for more updates from our team as we continue to use the combination of Revefi & dbt internally for further testing. For more information, check out our quick video on how Revefi works with dbt.