[HN Gopher] How Airbnb Built "Wall" to prevent data bugs
___________________________________________________________________
How Airbnb Built "Wall" to prevent data bugs
Author : charlysl
Score : 23 points
Date : 2022-05-18 18:43 UTC (2 days ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| SOLAR_FIELDS wrote:
| I've been in the end stage of this (worked on data validation for
| a good chunk of my career) and these are my thoughts on the
| article:
|
| Determining blocking vs non-blocking is a big issue - deciding
| which checks should be stoppers and which shouldn't is often a
| matter of extensive debate. In my experience, only a few data
| checks are absolute show stoppers under any circumstance and a
| lot of things need to spawn tickets that should be routed to the
| correct team and followed up on. Some type of tracking system is
| necessary for this.
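|
| A minimal sketch of that split in Python, under stated assumptions:
| `Severity`, `Check`, `Ticket`, `Report` and `run_checks` are all
| hypothetical names made up for illustration, not anything from the
| article or from Wall. Blocking failures tell the caller to halt;
| everything else becomes a ticket routed to an owning team.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Severity(Enum):
    BLOCKING = "blocking"          # a failure should stop the pipeline
    NON_BLOCKING = "non_blocking"  # a failure should only spawn a ticket


@dataclass
class Check:
    name: str
    severity: Severity
    owner_team: str                          # hypothetical routing key for tickets
    predicate: Callable[[List[dict]], bool]  # returns True when the data passes


@dataclass
class Ticket:
    check_name: str
    team: str


@dataclass
class Report:
    blocking_failures: List[str] = field(default_factory=list)
    tickets: List[Ticket] = field(default_factory=list)

    @property
    def should_halt(self) -> bool:
        # Only blocking failures are show stoppers.
        return bool(self.blocking_failures)


def run_checks(rows: List[dict], checks: List[Check]) -> Report:
    """Run every check; collect blocking failures and tickets separately."""
    report = Report()
    for check in checks:
        if check.predicate(rows):
            continue  # check passed, nothing to record
        if check.severity is Severity.BLOCKING:
            report.blocking_failures.append(check.name)
        else:
            report.tickets.append(Ticket(check.name, check.owner_team))
    return report
```

| The caller would stop the pipeline when `report.should_halt` is
| true and hand `report.tickets` to whatever tracking system is in
| use; that hand-off is left out here because it is system specific.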
|
| Defining the logic of checks themselves in YAML is a trap. We
| went down this DSL route first and it basically just completely
| falls apart once you want to add moderately complex logic to your
| check. Airbnb will almost certainly discover this eventually.
| YAML does work well for the specification of how the check should
| behave though (e.g. metadata of the data check). The solution we
| were eventually able to scale up with was coupling specifications
| in a human readable but parseable file with code in a single unit
| known as the check. These could then be grouped according to
| various pipeline use cases.
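|
| A rough sketch of what such a unit could look like, not the actual
| system described above. Assumptions: JSON stands in for YAML so
| the example needs only the standard library, and `CheckSpec`,
| `Check`, `load_spec`, `group_checks` and the spec fields are
| hypothetical names. The spec file carries metadata only; the logic
| stays in ordinary code.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Human-readable, parseable specification: metadata only, no logic.
SPEC_TEXT = """
{
  "name": "booking_dates_ordered",
  "description": "check-out must be after check-in",
  "blocking": true,
  "groups": ["bookings_daily"]
}
"""


@dataclass(frozen=True)
class CheckSpec:
    name: str
    description: str
    blocking: bool
    groups: List[str]  # pipeline use cases this check belongs to


@dataclass
class Check:
    """A single unit: the specification coupled with the code it describes."""
    spec: CheckSpec
    logic: Callable[[List[dict]], bool]  # returns True when the data passes


def load_spec(text: str) -> CheckSpec:
    # Raises TypeError if the file has missing or unexpected fields.
    return CheckSpec(**json.loads(text))


def booking_dates_ordered(rows: List[dict]) -> bool:
    # Arbitrary Python is allowed here, which is what a pure DSL struggles with.
    # ISO 8601 date strings compare correctly as plain strings.
    return all(r["check_in"] < r["check_out"] for r in rows)


def group_checks(checks: List[Check]) -> Dict[str, List[Check]]:
    """Index checks by the pipeline use cases named in their specs."""
    groups: Dict[str, List[Check]] = {}
    for check in checks:
        for group in check.spec.groups:
            groups.setdefault(group, []).append(check)
    return groups
```

| A pipeline would then look up its group, e.g.
| `group_checks(all_checks)["bookings_daily"]`, and run each
| check's `logic` against its data.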
|
| A model that plugs into an Airflow DAG, as Airbnb has designed,
| seems like a good approach. Often when it was time to incorporate
| checks into the pipeline we had heterogeneous strategies to invoke
| our check engines. Having a standardized approach helps drive
| adoption across the organization. Oftentimes I've found that
| people are reluctant to run non-critical checks if it's a
| significant time and effort cost, and will only run critical ones
| to try and push data quality accountability either upstream or
| downstream. If it's really easy to turn on and incorporate, that's
| one less excuse that can be used to not run the checks.
| testbjjl wrote:
| Maybe Jim Buckmaster and Craig Newmark are taking notes.
___________________________________________________________________
(page generated 2022-05-20 23:00 UTC)