How Meta Is Benefiting from the Power of Automation and How Your Business Can, Too

Data Quality | Article | Sep 25, 2023 | Shashank Gupta

The Meta warehouse is one of the largest on the planet, with multiple exabytes of data, more than 1 million tables and roughly 100,000 data pipeline/transformation jobs running daily. Tens of thousands of people – from data engineers and data scientists to software and machine learning engineers – rely on this data every day. I saw it all from a front-row seat because before co-founding Revefi, I was embedded in Meta’s data infrastructure team, which supports all of Meta. It was a fascinating experience!

Just prior to joining Meta, I got a master’s degree in education. This taught me the craft of interviewing people to get their perspectives and discover the root cause of problems. I’m a strong believer in the process of peeling the layers of the onion to create greater understanding.

During my time at Meta, I used the interviewing skills I had honed and come to love. I shadowed many data stakeholders and conducted deep customer interviews. The story I heard again and again was that data quality and consistency are the Achilles’ heel of every data team.

Data teams spent around 30%-40% of their time investigating, finding root causes for and fixing potential data problems. Many data quality issues at Meta were classified as Severity 1s and were especially problematic for data infrastructure teams because recovering from them required large investments of time and resources – and often full recovery was not even feasible, or not attempted at all!

Data Teams Clearly Need Better Data Quality Tools and Infrastructure

One of the top requests from Meta’s data engineering leadership was better data quality tools and infrastructure. These engineering leaders and their teams wanted solutions that would help them better understand what was going on with the data warehouse tables that mattered to them.

For example, if Meta was using certain data for critical workflows, and that data typically loaded at 4 a.m. but on a given day had not loaded by 10 a.m., the data team would want to know:

  • How do I know if the data did not land in time?
  • Did I set up any alerts for that?
  • What is causing this issue?
  • When will it be fixed?
  • Will it be fixed in time to power the critical workflows today?

Meta’s data engineering leaders and teams also wanted quick answers to questions like the following whenever a dashboard showed that no data had loaded into a table today, yesterday or even days ago:

  • Was this problem created by something that I did?
  • If not, is an upstream table (or tables) the issue?
  • And if the problem came from upstream, how far up the parent chain does it go?

That was one of the big pet peeves of the data engineering leadership. After all, if you are responsible for hundreds or thousands of tables, it is not possible to write checks for all of them to see whether the data has loaded. Working this way is simply far too manual and labor-intensive.

In sharing their frustration about this far-from-optimal situation, they would say things such as:

  • Why can’t the data infrastructure team inform me automatically when no data was loaded in a table?
  • Why do the users of the platform have to detect this by themselves?

Meta data engineers also wanted assurance that if they changed the transformation logic for a specific table, they would not inadvertently impact downstream tables, since multiple people and teams often depend on a particular dataset. At the time, however, Meta had no good way to identify who depended on each dataset, or whether a specific change a data engineer wanted to make would affect those other users.

All of the above left Meta’s data engineers guessing about whether they had a data problem and why. If they wanted answers, they were forced to contact owners of the upstream tables to uncover the root of the problem. This time-intensive process sometimes required the data engineers to contact multiple table owners along the parent chain to find needed answers.

Depending on Humans to Write Every Validation Is Not Scalable

Meta had infrastructure for writing manual SQL-based data quality checks. This infrastructure collected and saved historical values for the metrics being computed and allowed users to run both instantaneous/stateless checks (e.g., a column should have zero nulls) and stateful checks (e.g., today’s row count should be within 20% of the average over the last 10 days).
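To make the two styles concrete, here is a minimal Python sketch of what such checks could look like; it is an illustration only, not Meta’s internal framework. The `run_query` callable and the table and column names are assumptions for the example.

```python
import statistics

# Illustrative only -- not Meta's internal check framework.
# `run_query` stands in for whatever client executes SQL against the warehouse.

def check_zero_nulls(run_query, table, column):
    """Instantaneous/stateless check: the column should have zero nulls
    in the latest partition."""
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL AND ds = CURRENT_DATE"
    return run_query(sql) == 0

def check_row_count_vs_history(today_row_count, last_10_day_counts, tolerance=0.20):
    """Stateful check: today's row count should be within 20% of the
    average over the last 10 days (historical values are saved by the
    check infrastructure)."""
    baseline = statistics.mean(last_10_day_counts)
    return abs(today_row_count - baseline) <= tolerance * baseline
```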

Data engineering teams considered this a highly valuable resource. Some very important tables, like Meta’s user dimension table, had close to 100 tests using this framework. However, beyond a few important tables owned by select, passionate teams, adoption of this infrastructure was low – hovering in the single digits as a percentage of total warehouse tables. That’s because it is typically hard for data engineers to create and manage data quality checks at this scale: there are too many tables and artifacts, and it’s simply not possible to write manual checks for all of them.

Deciding what thresholds to set for alerts was also tricky. A data engineer might be unsure whether to alert when a row count deviates from the last five-day average by more than 10%, or whether to set that threshold at 25%. And static thresholds stop being relevant at some point anyway, because they can’t dynamically adjust to changes in your application or user behavior.
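To illustrate the difference, here is a small sketch (with illustrative numbers and window sizes) contrasting a static percentage threshold with one derived from the data’s own recent variability, which adapts as application and user behavior change.

```python
import statistics

def static_alert(today_rows, history, pct=0.10):
    """Fixed band around the last five-day average. Simple, but choosing
    10% vs. 25% is guesswork, and the choice goes stale over time."""
    baseline = statistics.mean(history[-5:])
    return abs(today_rows - baseline) > pct * baseline

def dynamic_alert(today_rows, history, k=3.0):
    """Band derived from the data itself: alert when today's value is more
    than k standard deviations from the recent mean, so the effective
    threshold widens or tightens with normal variability."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0
    return abs(today_rows - mean) > k * stdev
```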

Once an alert fired, data engineers faced another challenge: determining the root cause of the problem and deciding who would do something about it. Too many alerts, or too many false positives, were a problem as well. When teams started getting too many alerts, they became desensitized and would simply ignore them. Some teams went as far as turning the alerts off entirely because they couldn’t deal with the volume. But they did so at their peril: ignoring or switching off alerts is dangerous because you risk missing real problems.

And then there was the cost issue. Data teams were regularly exceeding their data warehouse compute quotas. Though the total number of these data quality checks was small, costs added up fast. Meta might have a billion events landing in a table every day; a check would have to scan all of that data and then compute unique values. At the same time, data teams were under pressure from Meta’s management to reduce their use of warehouse resources, since everyone wanted a share of them.

Automation Is the Best Path Forward

Clearly, data quality and freshness are massive pain points for data teams. Existing solutions, especially those that entail writing and managing data quality checks manually, could be useful – but they had big gaps and did not deliver everything that data teams needed.

However, we found that we could deploy many intelligent automation techniques to discover and root-cause data quality problems at scale, while still providing a system that allows data engineers to override or augment specific SQL checks. And that’s exactly what we did at Meta.

Now, the data warehouse pipelines that produce data are scheduled, periodic workloads. The metrics for each table update include the timestamp of the update and the number of rows loaded; a set of metrics for each column, such as the number of null values; and, for numerical columns, the minimum, maximum and average across the updated rows. The solution automatically monitors a subset of these metrics for all tables in the warehouse and uses anomaly detection with seasonality awareness, overlaid with a set of rules, to identify issues and alert users about them.
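The sketch below shows, in simplified form, the kind of metadata-driven monitoring described above. It assumes the per-update metrics (row counts and per-column null counts) are already collected; the weekday-based seasonality handling and the rule overlay are illustrative simplifications, not the actual system.

```python
import statistics
from dataclasses import dataclass
from datetime import date

@dataclass
class UpdateMetrics:
    ds: date             # partition/update date
    row_count: int       # rows loaded in this update
    null_counts: dict    # column name -> number of nulls in this update

def seasonal_row_count_anomaly(history, today, weeks=4, k=3.0):
    """Compare today's row count against the same weekday over recent weeks,
    so weekly seasonality (e.g. weekend dips) does not trigger false alerts."""
    same_weekday = [m.row_count for m in history
                    if m.ds.weekday() == today.ds.weekday()][-weeks:]
    if len(same_weekday) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(same_weekday)
    stdev = statistics.stdev(same_weekday) or 1.0
    return abs(today.row_count - mean) > k * stdev

def rule_overlay(today):
    """A few hard rules layered on top of anomaly detection."""
    issues = []
    if today.row_count == 0:
        issues.append("no data loaded in today's update")
    for column, nulls in today.null_counts.items():
        if today.row_count > 0 and nulls == today.row_count:
            issues.append(f"column {column} is entirely null")
    return issues
```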

This Yielded Impressive Results for Meta and Learnings from Which Any Business Can Benefit

By deploying checks in a much more automated fashion, we were able to take coverage from 10% of the warehouse to approximately 90% of the important tables. This distinction matters because not all tables are created equal. Certain tables are critical to the business, but data warehouses also contain a good amount of data and tables that people don’t really care about, including testing tables, staging tables and tables that have been deprecated but are still around.

We could now do automatic root-cause detection in a way that was not possible before. By enabling these checks on all tables in the dependency graph (in which Table A is the parent of Table B, which is the parent of Table C), the system can walk the upstream graph to pinpoint where the failure actually originated. This is a powerful capability because it eliminates the need for data teams to work their way through the human chain across tables, contacting the owners of each upstream table in an attempt to uncover the root of the problem.
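As a concrete illustration of that upstream walk, here is a hedged Python sketch; the dependency graph, the health-check function and the table names are hypothetical inputs, not the actual implementation.

```python
# Illustrative sketch of upstream root-cause detection. Assumes we already
# know each table's parents and whether each table's latest update is healthy.

def find_root_causes(table, parents, is_healthy, seen=None):
    """Walk upstream from a failing table and return the tables that are
    unhealthy but whose own parents are all healthy -- the likely origins."""
    if seen is None:
        seen = set()
    if table in seen:
        return set()
    seen.add(table)

    unhealthy_parents = [p for p in parents.get(table, []) if not is_healthy(p)]
    if not unhealthy_parents:
        # No upstream failure: the problem starts at this table.
        return {table}

    roots = set()
    for parent in unhealthy_parents:
        roots |= find_root_causes(parent, parents, is_healthy, seen)
    return roots

# With the A -> B -> C example: if A failed and the failure cascaded,
# checking C points back to A.
parents = {"C": ["B"], "B": ["A"], "A": []}
healthy = {"A": False, "B": False, "C": False}
print(find_root_causes("C", parents, healthy.get))  # {'A'}
```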

There were a lot of interesting, sometimes counterintuitive learnings as well. For example, we need stronger signals than “this did not happen in the past” to assert that it will not happen in the future. We enabled non-null checks on columns that had no null values in the previous 90 days. However, this was not a strong enough condition: we got feedback that many of these columns were allowed to have nulls, the data simply hadn’t contained any! We needed to augment this signal with other information, such as how the columns were used in queries.

Another learning was that not all false alerts are created equal! We were strongly focused on reducing all alert noise, but data teams were happy to receive alerts on tables important to them even when those alerts were for known issues. In fact, this category of false alerts was appreciated: it gave teams confidence that the system was monitoring things on their behalf and would alert them when a problem really did happen.

And because you can learn about data from the data itself, we used metadata that was already present to enable low-cost checks. In contrast to SQL-based checks, which are expensive, these checks can scale to monitor every table in the warehouse. And Meta is able to keep its data warehouse bandwidth, which is always at a premium, for the “real” work!

I’m a huge believer in automation. Embracing automation is a much more effective strategy than asking data teams to do everything manually and leaving them drowning in a sea of data checks. That’s why we launched Revefi and introduced the Revefi Data Operations Cloud. 

Revefi’s zero-touch approach automatically creates many different types of checks and starts monitoring your data immediately, enabling you to use the right data to make critical business decisions and providing your business with significant cost savings.

Article written by Shashank Gupta, CTO and Co-founder