[HN Gopher] Using an ETL framework vs. writing yet another ETL script
___________________________________________________________________
Using an ETL framework vs. writing yet another ETL script
Author : Arimbr
Score : 81 points
Date : 2021-12-17 16:03 UTC (6 hours ago)
(HTM) web link (airbyte.io)
(TXT) w3m dump (airbyte.io)
| johanneskanybal wrote:
| Well this is just an ad. At a glance it looks like an Airflow
| knockoff.
| hadlock wrote:
| Reminds me of the xkcd "how standards proliferate"
|
| Arguably you could use Kubernetes as a scheduler, or "ETL
| framework/kit": it supports cron jobs, has a RESTful API, local
| and remote secrets storage, native support for cloud storage,
| support for multiple logging solutions, distributed workloads,
| etc.
|
| Years ago I worked for a financial services company and they
| would run their batch ETL jobs via a product called Tidal that
| later got bought by Cisco. I really liked using Tidal for the
| longest time, but 100% of what Tidal does, you can replicate with
| the scheduler features of Kubernetes.
| kartoshechka wrote:
| I just got out of a job where we were working on a legacy ETL
| "script" in Elixir, and terrible code architecture decisions
| aside, I think the pattern where you have to launch and monitor
| long-lasting jobs is a breeze on the BEAM. You just spawn one
| process for each job; jobs report back via mailbox and are
| monitored via a Supervisor. Unfortunately, making changes to
| that system, where all sources were hardcoded, was to say the
| least abysmal, but the job-running core was quite elegant.
|
| Hopefully Elixir and other BEAM-compiled languages will gain
| enough traction; I can't imagine rewriting something that's
| available out of the box in Erlang in OOP languages with
| mutable objects.
| chasers wrote:
| Definitely feels like the BEAM was built for this kind of work.
| at_a_remove wrote:
| I am just an amateur, but gosh I love doing ETL. There's
| something about the specification of the data you're supposed to
| be getting, and then the harsh reality of the filthy trash you
| will actually be fed; making something that can carefully
| examine each of the assumptions in the spec to test for the ways
| it won't go right; making tests for things that will, you are
| told, "never happen" (only to get an email from your program
| three years later that this has, in fact, happened);
| interpolating data where you can; and in general making a
| "simple load process" into one of those grinders people can feed
| cows and tungsten carbide rods into.
|
| I feel like Data, mentioning that he just ... loves scanning for
| life forms.
| eatonphil wrote:
| I had this same kind of impression writing product/business
| reports as an engineering manager at Oracle and elsewhere. But my
| way of solving it was to build a smarter IDE that helps you do
| this stuff without ETL.
|
| You shouldn't need to waste time every time you build a report
| figuring out again how to query Postgres/MySQL/Elastic in Python
| and how to generate graphs and where to host it or how to set up
| recurring alerts on changes. The only part you should care about
| is the actual queries to write, the joins to do between datasets,
| and the fields to graph. The actual integration code (connecting
| to the database, copying a file from a remote server, etc.)
| should be handled for you.
|
| The tool I'm building to solve this is open source if you're
| curious!
|
| https://github.com/multiprocessio/datastation
| elchief wrote:
| I gave up on all these frameworks and just use cron, Make, and
| Python.
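|
| A minimal sketch of that pattern (endpoint, table, and file
| names are all hypothetical): cron invokes a plain Python
| script, and Make just wires up dependencies.
|
|     # etl_orders.py - extract from an API, load into SQLite
|     import requests, sqlite3
|
|     rows = requests.get("https://api.example.com/orders").json()
|     db = sqlite3.connect("warehouse.db")
|     db.execute(
|         "CREATE TABLE IF NOT EXISTS orders (id PRIMARY KEY, total)")
|     db.executemany(
|         "INSERT OR REPLACE INTO orders (id, total) VALUES (?, ?)",
|         [(r["id"], r["total"]) for r in rows],
|     )
|     db.commit()
|
| with a crontab entry like "0 * * * * make -C /opt/etl orders".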
| johanneskanybal wrote:
| Depends on what you need. It's the classic trap of thinking you
| need a bigger tool than you do. I can imagine that setup working
| great for many cases.
| Cwizard wrote:
| Do you use some kind of library to help with transformations?
| With some volume, plain Python can become a bottleneck, no? Any
| tips in this area would be appreciated.
| WhyCause wrote:
| I'm not the person you were asking, but I, too, use mostly
| Python for ETL tasks; it's a good chunk of my job.
|
| Python's speed, or lack thereof, is rarely an issue. Most
| tasks that I write/run finish in less than a minute,
| including the time for transformations. Some of those can get
| fairly hairy, as well. Most of the time is spent querying the
| DB and uploading to the remote.
|
| For the tasks where the transformations are too complex, you
| can greatly decrease the time necessary by tweaking the queries
| ( _e.g._ , limiting rows to just changed rows, _etc._ ). And
| frankly, once you realize most data doesn't need to update in
| real time, it doesn't matter how long the transformation step
| takes (as long as it's less than about 23 1/2 hours).
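|
| For example, a sketch of the changed-rows trick (table and
| column names are hypothetical): persist a high-water mark and
| only pull rows past it.
|
|     import sqlite3
|
|     db = sqlite3.connect("app.db")
|     # high-water mark persisted from the previous run
|     (last_run,) = db.execute("SELECT mark FROM etl_state").fetchone()
|     changed = db.execute(
|         "SELECT id, payload, updated_at FROM orders"
|         " WHERE updated_at > ?", (last_run,)
|     ).fetchall()
|     # ... transform/upload only `changed`, then advance the mark
|     if changed:
|         db.execute("UPDATE etl_state SET mark = ?",
|                    (max(r[2] for r in changed),))
|     db.commit()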
| elchief wrote:
| pyspark if big-ol' data sets
|
| pandas otherwise
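|
| For instance, a tiny pandas transform step (file and column
| names are hypothetical):
|
|     import pandas as pd
|
|     df = pd.read_csv("extract.csv")            # extract
|     df = df.dropna(subset=["amount"])          # transform: drop nulls
|     df["amount_usd"] = df["amount"] * df["fx_rate"]
|     df.to_parquet("load/orders.parquet")       # load (needs pyarrow)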
| MichaelRazum wrote:
| Please correct me if I'm doing it wrong:
|
| Scripts collect data from external sources. Scripts insert that
| data into the DB. Other scripts query the DB and add data to
| other tables.
|
| Sure, as the article mentions, data goes missing and so on. But
| a mature project already has some monitoring in place, so you
| could just use the existing solution instead of a new framework.
|
| Is there something wrong with such an approach?
| jumpman500 wrote:
| I've always found ETL frameworks to have their own problems.
| They seem great on paper, but usually they don't account for a
| specific source system, API, application, data size, data
| distribution, or scheduling situation. If your project is using
| one, then developers end up hacking the framework instead of
| writing simple code that does the specific thing they need to
| do.
|
| Before you know it you have super long and super inefficient
| code just to fit the framework. It takes about the same time to
| read and understand an ETL framework as it does to write your
| own Python/Bash script, and at least with your own code it's
| easier to see bottlenecks.
| mcguire wrote:
| Couple that with the way ETL frameworks quickly become
| undocumented, featuriferous messes that are opaque to anyone who
| isn't deeply embedded in the framework, yeah.
| aarreedd wrote:
| I agree. I've used AWS Data Pipeline for some jobs, but there
| is a steep learning curve. It is good for launching servers on
| demand to run your ETL jobs if you need that.
|
| The best solution I have found is writing ETL scripts in
| Laravel which I use for most projects anyway. The framework has
| built-in scheduling and error reporting.
| dillondoyle wrote:
| Not sure if this provides any insight or value, but I had the
| same experience.
|
| The first example was connecting Iterable - which it looks like
| Airbyte supports - to BigQuery.
|
| In the past I had someone help me set up Snowflake, which was
| too complicated for me to maintain/understand myself; AWS
| especially is too complicated for me compared to the simpler
| Google Cloud.
|
| I've also tried Stitch and Fivetran at different times, mostly
| to save time setting up non-webhook syncs from FB Marketing and
| Front. The volume of Iterable data would be prohibitively
| expensive for us on those paid platforms.
|
| In the end I was able to do FB Marketing myself in less than 1k
| lines of Python, modified from a script I found on GitHub, using
| Google Cloud Scheduler and Cloud Functions. I don't know Python,
| so that felt somewhat satisfying.
|
| Another nuance in favor of a hosted/paid platform is that it
| looks like Airbyte uses an API lookup sync instead of webhooks.
| That lets Airbyte get some more metadata to join against that I
| don't collect. That's valuable!
|
| For Iterable I ended up making a GAE app that accepts incoming
| webhook data -> pushes to Pub/Sub -> which pushes to a Cloud
| Function -> which writes to BigQuery.
|
| The latency for BQ writes was too much to try to do it all at
| once, and I don't think Iterable does webhook retries. Also,
| Iterable is MEGA bursty: I've seen our GAE scale up to sometimes
| 40+ instances within minutes after we hit send on a campaign.
| Getting the latency down for cold starts and scaling was the
| hardest problem to figure out; Cloud Functions didn't work. It's
| not perfect, but it's good enough for our needs. The simpler FB
| function grabs data 100% correctly each day, which feels good.
| Last I talked to some of the paid ETL vendors, it was a flat
| $500 minimum a month - not worth it.
|
| From learning all this I've been able to reuse this GAE,
| Pub/Sub, Function, BQ/Spanner pattern for other stuff I build,
| and it has saved a lot of time and headache.
| ibains wrote:
| Prophecy.io lets you create visual components from any Spark
| function. Same with Airflow. So you can use standard components
| (built-in or your new ones) without being restricted.
|
| Founder here - we're working to solve this exact problem.
| wobblykiwi wrote:
| What's the difference between Prophecy and the multitude of
| other ETL tools out there, like StreamSets, Talend, Ab
| Initio, and plenty more?
| ibains wrote:
| We're very different from the ETL tools in that we're bringing
| software development best practices to data.
|
| When you do visual drag-and-drop, Prophecy generates high-
| quality code on git that is 100% open source (Spark, Airflow),
| and you have tests and CI/CD - so you're visually doing solid
| data engineering.
|
| You can toggle between code and visual - if you change the code
| (some), the visual graph updates - so small edits made directly
| to git don't break the visual layer.
|
| All visual components are generated from a spec - think a Spark
| function with some more info. So data platform teams can create
| their own library/framework and roll it out to the wider teams.
| How it works is that in the visual editor you start with the
| standard Spark library, but can load visual components for
| Delta, encryption, or data quality.
|
| Our customers are typically fed up with these ETL tools and are
| moving to us. We can also import the ETL formats (Ab Initio,
| Informatica, ...) in an automated way (we reverse engineered
| their formats and created source-to-source compilers).
| switch007 wrote:
| Indeed. A classic one is dealing with OAuth2...
|
| Airbyte docs:
|
| > Note that the OAuth2Authenticator currently only supports
| refresh tokens and not the full OAuth2.0 loop.
| sherifnada wrote:
| I think this is saying that particular class expects to
| receive a refresh token as input. The "full oauth loop" means
| the UI needs to produce a refresh token via user consent in
| the browser.
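|
| For what it's worth, the refresh-token half is simple to
| script (the token URL and credentials here are hypothetical
| placeholders):
|
|     import requests
|
|     REFRESH_TOKEN = "..."   # placeholder credentials
|     CLIENT_ID = "..."
|     CLIENT_SECRET = "..."
|
|     resp = requests.post(
|         "https://auth.example.com/oauth2/token",  # hypothetical URL
|         data={
|             "grant_type": "refresh_token",
|             "refresh_token": REFRESH_TOKEN,
|             "client_id": CLIENT_ID,
|             "client_secret": CLIENT_SECRET,
|         },
|     )
|     access_token = resp.json()["access_token"]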
| huetius wrote:
| I started writing, and never completed, a dead-simple ETL
| framework that left most of the work up to the programmer. It
| was basically just an HTTP server that you could hit with cron
| jobs, a DAG data structure for mapping the process, and some
| annotations that you could use to specify that a node/program
| step was concurrency-safe, blocking, etc. You're entirely right
| that there's way more to ETL than meets the eye, but this still
| makes me want to dig it back up.
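|
| The core was roughly this shape (a from-memory sketch, not the
| original code; assumes an acyclic graph):
|
|     from dataclasses import dataclass, field
|
|     @dataclass
|     class Node:
|         name: str
|         run: callable               # the program step
|         deps: list = field(default_factory=list)
|         concurrency_safe: bool = False   # the "annotation"
|
|     def execute(nodes):
|         # run each node once all of its dependencies have run
|         done = set()
|         while len(done) < len(nodes):
|             for n in nodes:
|                 if n.name not in done and all(d in done for d in n.deps):
|                     n.run()
|                     done.add(n.name)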
| travisgriggs wrote:
| ETL was actually a new acronym for me: Extract, Transform, Load.
|
| https://en.m.wikipedia.org/wiki/Extract,_transform,_load
| mcguire wrote:
| Welcome to the enterprise! :-)
| elchief wrote:
| It really stands for Excel Taking Longtime
| cheriot wrote:
| Can any of these ETL frameworks kick off an ETL script without
| a rewrite? Something that would handle scheduling and retries
| and emit metrics around those actions, but let me use my own
| tools for the actual data manipulation.
| jasonpbecker wrote:
| We use Jenkins for essentially this right now, and it works
| fine (especially if you have someone who knows Jenkins
| already).
| johanneskanybal wrote:
| One of the standard patterns is Airflow with the Kubernetes
| executor, and then the pod can do whatever. There's the
| occasional side effect and some setup, but more or less it
| works as advertised.
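|
| A minimal sketch of the related pod-per-task pattern, using
| Airflow's KubernetesPodOperator (image, namespace, and script
| names are hypothetical):
|
|     from datetime import datetime
|     from airflow import DAG
|     from airflow.providers.cncf.kubernetes.operators.kubernetes_pod \
|         import KubernetesPodOperator
|
|     with DAG("legacy_etl_wrapper", schedule_interval="@hourly",
|              start_date=datetime(2021, 12, 1), catchup=False) as dag:
|         run_script = KubernetesPodOperator(
|             task_id="run_etl_script",
|             name="etl-script",
|             namespace="etl",                  # hypothetical
|             image="acme/etl-script:latest",   # hypothetical image
|             cmds=["python", "etl.py"],
|             retries=3,                        # retry handling for free
|         )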
| timwis wrote:
| Celery?
| 0xbadcafebee wrote:
| The work companies put into ETL is absolutely bizarre from a
| business standpoint.
|
| Say you want to build a new kind of hammer. Normally what you do
| is pay a company for access to some facility that has a forging
| process, and you put your effort into designing the hammer,
| working with the facility to get it forged the way you want, and
| selling it.
|
| Building ETL pipelines from scratch is like building an iron ore
| mining facility, building diggers, trucks, shipping containers,
| trains, boats, iron smelting and forging equipment, and
| warehouses, on top of designing and selling the hammer. People
| today are so brainwashed by the idea that they need to be
| writing software that they go to these extreme lengths to build
| everything from scratch, when they should almost always be
| paying somebody else to do most of it and only spend time
| working on their business's actual problem. Building factories
| and logistics chains should not be a problem your business needs
| to solve if all you need to do is build and sell a hammer.
| davinchia wrote:
| Airbyte Engineer here.
|
| I think some of the points made here about ETL scripts being
| just 'ETL scripts' are very relevant. I've definitely been on
| the other side of the table arguing for a quick 3-hour script.
|
| Having written plenty of ETL scripts - in Java with
| Hadoop/Spark, Python with Airflow, and pure Bash - that later
| morphed into tech debt monsters, I think many people
| underestimate how quickly these can snowball into proper
| products with actual requirements.
|
| Unless one is extremely confident an ETL script will remain a
| non-critical nice-to-have part of the stack, I believe
| evaluating and adopting a good ETL framework, especially one
| with pre-built integrations, is a good case of 'sharpening the
| axe before cutting the tree' and well worth the time.
|
| We've been very careful to minimise Airbyte's learning curve.
| Starting up Airbyte is as easy as checking out the git repo and
| running 'docker compose up'. A UI allows users to select,
| configure and schedule jobs from a list of 120+ supported
| connectors. It's not uncommon to see users successfully using
| Airbyte within tens of minutes.
|
| If a connector is not supported, we offer a Python CDK that
| lets anyone develop their own connector in a matter of hours. We
| have a commitment to supporting community-contributed
| connectors, so there is no worry about contributions going to
| waste.
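|
| Roughly, a CDK stream looks like this (the endpoint and field
| names here are hypothetical; see our CDK docs for the exact
| base-class details):
|
|     import requests
|     from airbyte_cdk.sources.streams.http import HttpStream
|
|     class Customers(HttpStream):
|         url_base = "https://api.example.com/v1/"  # hypothetical API
|         primary_key = "id"
|
|         def path(self, **kwargs) -> str:
|             return "customers"
|
|         def parse_response(self, response: requests.Response, **kwargs):
|             yield from response.json()["results"]  # hypothetical envelope
|
|         def next_page_token(self, response: requests.Response):
|             return None  # single page, for this sketch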
|
| Everything is open source, so anyone is free to dive as deep as
| they need or want to.
|
| We also build in the open and have single-digit hour Slack
| response time on weekdays. Do check us out -
| https://github.com/airbytehq/airbyte!
| 0xbadcafebee wrote:
| _" Definitely been on the other side of the table arguing for a
| quick 3-hour script."_
|
| But that's probably the time it took to write it, right? 80% of
| the cost of software is in maintenance, so there's another 12
| hours' worth of maintenance left to account for. If you know
| you're going to spend 15 hours on it, then you might as well
| use a system you know will cost less to extend or scale over
| time.
|
| _" We've been very careful to minimise Airbyte's learning
| curve."_
|
| That's good for quickly onboarding new customers, but not
| necessarily for the system to be scalable or extensible.
|
| _" Starting up Airbyte is as easy as checking out the git repo
| and running 'docker compose up'."_
|
| I'm always curious about this. Docker Compose doesn't run on
| more than a single host, unless you're using it with the AWS
| ECS integration (and maybe Swarm?). So sure, the developer can
| "get it running" quickly to look at it, but to actually deploy
| something to production they'll have to rewrite the docker-
| compose file in something else. If you provide them with
| Terraform modules or a Helm chart, that would get them into
| production faster. And maybe even a canned CI/CD pipeline in a
| container so they can start iterating on it immediately. It's
| more work for your company, but it reduces the friction for
| developers getting to production, and enables businesses to
| start using your product in production _immediately_ , which I
| think is a pretty big differentiator of business value.
| unklefolk wrote:
| Of course, the flip side is that sometimes that initial step
| of:
|
| > "We are doing a prototype on our new mapping product, we need
| to write a one-off script to pull in some map data."
|
| ... is just that - a one-off script. And it can prove a lot
| quicker to write a one-off script than to get involved with an
| ETL framework. I am not arguing against ETL frameworks (Airbyte
| etc.), just that over-engineering carries its own costs, just
| like under-engineering does.
| devnull255 wrote:
| At our company, we actually built ETL-framework-agnostic
| wrappers and monitoring, logging, and scheduling tooling around
| the four different ETL products we used: Micro Focus COBOL,
| Torrent Orchestrate, DataStage (which incorporated Torrent),
| and Ab Initio. The wrappers invoked the ETL command and
| reformatted and consolidated logs. For scheduling, we relied
| mostly on CA AutoSys instead of whatever scheduling mechanism
| came with the ETL product.
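|
| A minimal sketch of such a wrapper (the command and log format
| are hypothetical): invoke the ETL command, consolidate its
| output into one log stream, and surface the exit code.
|
|     import subprocess, sys, datetime
|
|     def run_etl(cmd):
|         proc = subprocess.run(cmd, capture_output=True, text=True)
|         stamp = datetime.datetime.utcnow().isoformat()
|         for stream, text in (("OUT", proc.stdout), ("ERR", proc.stderr)):
|             for line in text.splitlines():
|                 print(f"{stamp} [{stream}] {line}")  # consolidated log
|         return proc.returncode
|
|     if __name__ == "__main__":
|         sys.exit(run_etl(sys.argv[1:]))  # e.g. a DataStage CLI command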
|
| We found this approach made it easier to transition from one
| product to another, as it was consistently faster to plug a new
| ETL product into the supporting framework than to implement
| everything the new product offered.
|
| As we move from our on-prem environment to the cloud, we hope we
| can implement a similar strategy even if we have to switch the
| support frameworks.
| cgardens wrote:
| Hiya, I'm the original author. tl;dr for those deciding whether
| or not to read it:
|
| If you are thinking about build versus buy for your ETL
| solution, in this day and age there are enough great tools out
| there that buy is almost always the right option. You may think
| you can write a "simple" little ETL script to solve your
| problem, but invariably it grows into a monster that will be a
| reliability liability and an engineering time suck. The post
| goes into more depth on why that is. Enjoy!
| bob1029 wrote:
| I've just been spinning up new C# console projects in VS and
| pulling down Dapper to do most of the nasty ad-hoc things
| between databases.
|
| I never bother to save any of these to source control because
| the boilerplate is negligible and it's usually a one-time deal.
| computershit wrote:
| I have been working a lot in this space for the last two years,
| but especially in the last 6 months. I believe we're about to
| enter a phase where much more elegant and less restrictive ETL
| platforms and frameworks become as commonplace as modern
| software CI/CD offerings. Prefect and Dagster both stand out to
| me as viable replacements for Airflow.
| mrbungie wrote:
| We're currently using Airflow for ELT/ETL jobs that ingest from
| different Postgres databases into BigQuery/Google Cloud
| Storage. Airbyte looks sweet for the same task and would free
| us from a big effort burden, but its Postgres source only
| supports SELECT * statements (i.e. you can't deselect columns).
|
| That's kind of a dealbreaker for us, because for security
| reasons our Postgres users' permissions are granularly
| configured with column-level security. I hope the Airbyte team
| solves this eventually, because the software is looking great.
| lloydatkinson wrote:
| That's crazy; why would it take the approach of "select all
| columns" when you might never need all of them? Could you
| create a view for it to select from instead?
|
| Either way, if someone told me an ETL tool supports only
| literally all the columns or nothing, I'd assume it was an MVP
| product to demonstrate the idea... not a production system.
| davinchia wrote:
| We are working on it. This should be out next quarter!
| lixtra wrote:
| It's strange to advertise a framework that doesn't support
| more than "select *". A framework should make easy things
| easy and complicated things possible. Ideally there is a
| natural way to do stuff, so different people understand each
| other's approach.
|
| The whole article sounds like you target total amateurs in
| the ETL domain. Every decent ETL engineer already knows
| everything that is revealed after "a few months".
|
| That said, I always welcome new competition in the field.
| Tools still suck.
| jermaustin1 wrote:
| I was once tasked with replacing a dying set of ETLs composed
| in "ScribeSoft"; apparently the built-in scheduling and speed
| left much to be desired, and calling other jobs from inside a
| job would halt the current job. I ended up replacing everything
| with a C# console application that ran every minute unless it
| was already running. There were a lot of bugs on both ends, but
| they were tired of paying $5k/yr for the ETL to run.
|
| After I wrote the initial application, they handed it off to
| their South African dev team to maintain.
| weego wrote:
| Their standard scenario to avoid is actually a perfectly
| acceptable process to grow through before wedging another bit
| of infrastructure into your org that needs its own
| provisioning, maintenance and support, redundancy planning,
| etc.
|
| In that scenario the situation is so nebulous that the original
| implementers had no way to know where the business/use case
| would end up, so why would they immediately jump to a belt-and-
| braces solution?
|
| It's the infrastructure version of my go-to phrase: don't code
| for every future.
| pphysch wrote:
| My favorite ETL framework is CAP.
|
| C - curl+jq or CLI - extract
|
| A - awk/jq - transform
|
| P - postgres/prometheus(VictoriaMetrics) - load
|
| I recently reduced an inherited 1000-line PHP ETL mess into a
| CAP "one-liner" (my awk script is a dozen lines unrolled).
|
| Frankly, this covers a lot of ETL cases. You can do basic
| cleaning in AWK (drop nulls/outliers) and format the output
| however you please, and it's blazingly fast.
| 89vision wrote:
| Argo Workflows looks like the best of both worlds to me. You
| can easily build up complex ETL/data-processing DAGs where each
| step is a Docker container, so you can choose the best tool for
| the job. Argo has all the retry/backoff logic built in, and you
| can plug the workflows into Argo Events. It runs on Kubernetes,
| so you can scale your compute according to your workflow
| demands.
| mcguire wrote:
| Is this an advertisement?
___________________________________________________________________
(page generated 2021-12-17 23:00 UTC)