The Meta warehouse is one of the largest on the planet, with multiple exabytes of data, more than 1 million tables and roughly 100,000 data pipeline/transformation jobs running daily. Tens of thousands of people – from data engineers and data scientists to software and machine learning engineers – rely on this data every day. I saw it all from a front-row seat: before co-founding Revefi, I was embedded in Meta’s data infrastructure team, which supports all of Meta. It was a fascinating experience!
Just prior to joining Meta, I got a master’s degree in education. This taught me the craft of interviewing people to get their perspectives and discover the root cause of problems. I’m a strong believer in the process of peeling the layers of the onion to create greater understanding.
During my time at Meta, I leaned on the interviewing skills I had honed and come to love: I shadowed many data stakeholders and conducted deep customer interviews. The story I heard again and again was that data quality and consistency are the Achilles’ heel of every data team.
Data teams spent around 30% to 40% of their time investigating, root-causing and fixing potential data problems. Many data quality issues at Meta were classified as Severity 1s and were especially problematic for data infrastructure teams because recovering from them required large investments of time and resources – and often a full recovery was not even feasible or attempted!
Data Teams Clearly Need Better Data Quality Tools and Infrastructure
One of the top requests from Meta’s data engineering leadership was better data quality tools and infrastructure. These engineering leaders and their teams wanted solutions to help them better understand what was going on with the warehouse tables that mattered to them.
For example, suppose Meta used certain data for critical workflows, and that data typically loaded at 4 a.m., but one day it still had not loaded by 10 a.m. The data team would want to know (a minimal sketch of such a landing-time check follows this list):
- How do I know if the data did not land in time?
- Did I set up any alerts for that?
- What is causing this issue?
- When will it be fixed?
- Will it be fixed in time to power the critical workflows today?
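To make that concrete, here is a minimal sketch of the kind of landing-time check a team might wire up by hand. The table name, the expected landing hour and the get_latest_partition_timestamp helper are hypothetical stand-ins for Meta’s internal tooling, not the real thing.

```python
from datetime import datetime, time, timezone

def get_latest_partition_timestamp(table: str) -> datetime | None:
    """Hypothetical helper: return when today's partition of `table`
    finished loading, or None if nothing has landed yet.
    In practice this would query the warehouse's partition metadata."""
    ...  # stubbed out; returns None as written

def check_data_landed(table: str, expected_by: time) -> None:
    """Alert if `table` has not landed by its expected time today."""
    now = datetime.now(timezone.utc)
    deadline = datetime.combine(now.date(), expected_by, tzinfo=timezone.utc)
    landed_at = get_latest_partition_timestamp(table)

    if landed_at is None and now > deadline:
        print(f"ALERT: {table} has not landed; expected by {expected_by} UTC")
    elif landed_at is not None and landed_at > deadline:
        print(f"WARNING: {table} landed late, at {landed_at:%H:%M} UTC")
    else:
        print(f"OK: {table} is on time (or not yet due)")

# The critical data normally lands by 4 a.m.
check_data_landed("critical_workflow_input", time(hour=4))
```

Even a sketch this simple answers the first two questions above; the remaining three are where lineage and dependency metadata come in.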
Meta’s data engineering leaders and teams also wanted quick answers to questions like these when a dashboard indicated that no data had loaded into a table today, yesterday or even days ago:
- Was this problem created by something that I did?
- If not, is an upstream table (or tables) the issue?
- And if the problem came from upstream, how far up the parent chain does it go?
That was one of the big pet peeves of the data engineering leadership. After all, if you are responsible for hundreds or thousands of tables, you cannot write checks for every one of them just to see whether the data has loaded. Working that way is far too manual to scale.
In sharing their frustration with this far-from-optimal situation, they would ask things such as:
- Why can’t the data infrastructure team inform me automatically when no data was loaded into a table?
- Why do the users of the platform have to detect this by themselves?
Meta’s data engineers also wanted assurance that if they changed the transformation logic for a specific table, they would not inadvertently impact downstream tables, since multiple people and teams often depend on a particular dataset. At the time, however, Meta had no good way to identify who depended on each dataset, or whether a specific change a data engineer wanted to make would affect those other users.
All of the above left Meta’s data engineers guessing about whether they had a data problem and why. If they wanted answers, they had to contact the owners of upstream tables to uncover the root of the problem. This time-intensive process sometimes meant working through multiple table owners along the parent chain before the answers emerged.
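Much of that manual back-and-forth is really just walking the table lineage. The sketch below shows the general idea, assuming a hypothetical UPSTREAM_OF lineage mapping and a LANDED_TODAY status lookup; at Meta, internal lineage and partition metadata would play those roles.

```python
from collections import deque

# Hypothetical lineage metadata: each table maps to its direct parents.
UPSTREAM_OF = {
    "daily_revenue": ["orders_cleaned", "fx_rates"],
    "orders_cleaned": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}

# Hypothetical load status for today's partitions.
LANDED_TODAY = {
    "daily_revenue": False,
    "orders_cleaned": False,
    "orders_raw": False,
    "fx_rates": True,
}

def find_upstream_root_causes(table: str) -> list[str]:
    """Walk up the parent chain and return the highest ancestors that
    failed to land today, i.e. the likely root causes."""
    roots: list[str] = []
    seen = {table}
    queue = deque([table])
    while queue:
        current = queue.popleft()
        failed_parents = [p for p in UPSTREAM_OF.get(current, [])
                          if not LANDED_TODAY.get(p, True)]
        # A failed table whose parents are all fine is a root cause.
        if current != table and not failed_parents and not LANDED_TODAY.get(current, True):
            roots.append(current)
        for parent in failed_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots

print(find_upstream_root_causes("daily_revenue"))  # -> ['orders_raw']
```

With lineage like this in hand, "how far up the parent chain does it go?" becomes a query rather than a chain of pings to table owners.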
Depending on Humans to Write Every Validation Is Not Scalable
Meta had infrastructure for writing manual SQL-based data quality checks. It collected and saved historical values for the metrics being computed, and it supported both instantaneous, stateless checks (e.g., a column should contain zero nulls) and stateful checks (e.g., today’s row count should be within 20% of the average of the last 10 days).
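As a rough illustration of those two styles of check (this is my own sketch, not Meta’s framework), a stateless null check and a stateful row-count check might look something like the following, assuming a hypothetical run_query helper that executes SQL against the warehouse and returns a single number.

```python
def run_query(sql: str) -> float:
    """Hypothetical helper: run `sql` against the warehouse and return a
    single numeric result. Stubbed out here."""
    raise NotImplementedError("wire this up to your warehouse client")

def stateless_null_check(table: str, column: str) -> bool:
    """Instantaneous check: the column should contain zero nulls."""
    nulls = run_query(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL")
    return nulls == 0

def stateful_row_count_check(table: str, history: list[float],
                             tolerance: float = 0.20) -> bool:
    """Stateful check: today's row count should be within `tolerance`
    (20% by default) of the average of the last 10 saved counts."""
    today = run_query(f"SELECT COUNT(*) FROM {table}")
    recent = history[-10:]
    baseline = sum(recent) / len(recent)
    return abs(today - baseline) <= tolerance * baseline
```

The stateful variant only works because the framework kept the historical metric values around; without that saved history there is no baseline to compare against.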
Data engineering teams considered this a highly valuable resource. Some very important tables, like Meta’s user dimension table, had close to 100 tests using this framework. However, beyond a few important tables owned by select, passionate teams, adoption of this infrastructure was low – hovering in the single digits as a percentage of all warehouse tables. The reason: it is hard for data engineers to create and manage data quality checks at that scale. There are simply too many tables and artifacts to cover with hand-written checks.
Deciding what thresholds to set for alerts was also tricky. A data engineer might be unsure whether to alert when a row count deviates by more than 10% from the last five-day average, or whether 25% is the right threshold. And static thresholds eventually stop being relevant anyway, because they can’t adjust to changes in your application or in user behavior.
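One common way around hard-coded percentages (my own illustration, not necessarily what Meta did) is to let the threshold float with recent history, for example by flagging values that fall several standard deviations outside a rolling baseline:

```python
from statistics import mean, stdev

def is_anomalous(today: float, history: list[float],
                 num_sigmas: float = 3.0) -> bool:
    """Flag today's value if it sits more than `num_sigmas` standard
    deviations away from the mean of the recent history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > num_sigmas * sigma

# A static "within 10%" rule goes stale when traffic shifts; this baseline
# adapts because it is recomputed from whatever the recent window shows.
recent_row_counts = [98_000, 101_500, 99_200, 102_300, 100_800]
print(is_anomalous(83_000, recent_row_counts))   # True: far outside the recent band
print(is_anomalous(100_100, recent_row_counts))  # False: within normal variation
```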
Once an alert fired, data engineers faced another challenge: determining the root cause of the problem and deciding who would do something about it. Too many alerts, or too many false positives, was a problem as well. Teams that were flooded with alerts became desensitized and simply ignored them; some went as far as turning alerts off entirely because they couldn’t handle the volume. But they did so at their peril. Ignoring or switching off alerts is dangerous because you risk missing real problems.
And then there was the cost issue. Data teams were regularly exceeding their data warehouse compute quotas. Though the total number of these data quality checks was small, the costs added up fast: with something like a billion events arriving every day, a check might have to scan all of them and then compute the unique values. At the same time, data teams were under pressure from Meta’s management to reduce their warehouse usage, since everybody wanted a share of those resources.