[HN Gopher] Data Engineering Design Patterns
___________________________________________________________________
Data Engineering Design Patterns
Author : sebg
Score : 156 points
Date : 2023-12-07 15:51 UTC (7 hours ago)
(HTM) web link (www.dedp.online)
(TXT) w3m dump (www.dedp.online)
| hdespiritu wrote:
| DAMA-DMBOK2 covers this very comprehensively
|
| https://www.dama.org/cpages/body-of-knowledge
| datadrivenangel wrote:
| Data engineering is cool and new while data management is old
| school and enterprise.
|
| Specifically, data engineering in some tech companies is truly
| a revenue driver, so data engineering in other organizations is
| not viewed as a cost center so much, even if it is the same
| work at most organizations.
| esafak wrote:
| How do you define the two terms?
| datadrivenangel wrote:
| Data engineering versus data management?
|
| Data engineering is nominally more pipeline oriented and
| less concerned with the governance & people side of things,
| but good data engineering people end up driving a lot of
| data management work because that's what makes the data
| engineering less painful (eliminate root cause of data
| errors and annoying data requests) and data overall more
| useful and valuable.
| tremon wrote:
| Data Engineering is an engineering discipline -- it can
| involve anything from data ingestion, transformation,
| storage, enrichment, aggregation up to presentation in
| operational reports. But it's still a manufacturing process
| with "data" as an input and "data" as its output.
|
| Data Management is an organizational discipline -- it is
| about how the enterprise manages data as an asset and how
| data is embedded in the organization. This includes data
| governance issues like common data models, and a chain of
| command (which person/role is responsible for which piece
| of data), but also second-tier data processes such as
| quality control and data valuation.
| erhserhdfd wrote:
| This may be nitpicking, but technologies being described as
| "cool" versus "enterprise" or "new" versus "old" I find
| meaningless. I don't necessarily want to have the "coolest"
| or "newest" tech stack; I want to have the tech stack that
| reasonably and reliably solves my business problems.
| If that means leveraging "old" or "enterprise" technologies
| and practices, that could be totally fine.
| bfors wrote:
| Could be interesting once there's more content; in its current
| state the content is mostly just definitions.
| articsputnik wrote:
| Author of the book here -- 100%. The idea is to release early
| and update on the go. You might wonder if it's worth delving
| into the book in its current, unfinished form. I'd say it
| depends.
|
| There's quite a bit of content around general DE knowledge.
| However, the anticipated design patterns are still in the
| works. If that's what you're most excited about, you'll find
| the beginnings of exploring the patterns of `caching` and
| `ad-hoc querying` in the first Convergent Evolution chapter,
| but otherwise, you need to wait for more.
| mdaniel wrote:
| The "please let me know"
| <https://www.dedp.online/appendix/feedback> link at the bottom of
| the page gives an ugly Apache 404
|
| heh, I would send feedback about that but ...
|
| ---
|
| _update_ : so, it seems the Feedback in the sidebar
| <https://www.dedp.online/appendix/feedback.html> works so it was
| just a missing page extension. While reading that I discovered
| the GitHub repo for the project is private, which explains why I
| couldn't find it, either
| articsputnik wrote:
| Author of the book here -- yes, I used mdBook
| (https://github.com/rust-lang/mdBook), and the links strangely
| end with `.html`, as you correctly updated. Initially, I wanted
| to change it but left it as it was.
| JAlexoid wrote:
| It's interesting to note that when I was first called a DE, it
| just meant a software engineer in the data domain.
|
| As in writing full software that happens to focus on data.
|
| Just 6 years ago I would be tinkering with PrestoDB code, looking
| at optimizing the scheduler and building Hadoop extensions.
|
| Between that and today the field swung to people who came from
| BI, with considerably less software engineering background. To
| the point that just 2 years ago, when applying for DE roles I
| would be confused why the majority of my screening questions came in
| the form of "how well do you know SQL".
|
| Today I do the same as I did 3-4 years ago, but I am no longer a
| data engineer.
| lysecret wrote:
| Yea same. Also, honestly feels like the way the field is
| progressing it will just be eaten up by an SWE role. Feel the
| same for ML engineer and many other specialized roles.
| JAlexoid wrote:
| I don't think that SWEs will.
|
| The software and services are going to be getting advanced
| enough to just eliminate the need for a dedicated team to
| build ETL. People with relevant domain knowledge will have an
| easier time delivering their work product, without the
| overhead of a building phase.
|
| To get a reasonably good data platform, a point-and-click ETL
| service, a SaaS offering and the likes of Metabase are
| already good enough for medium enterprises... and beat
| Databricks offerings for speed (setup, delivery and operation)
| in reporting and operational data access.
|
| I am absolutely sure that there will be a massive contraction
| in the DS, DE and ML opportunity market in the next few
| years. The major companies will consolidate and jobs in those
| domains will be available at only a handful of
| companies... or extremely specialized startups (much like
| chip design is now consolidated).
|
| Long story short, for companies - you probably don't need DS,
| ML and DE departments.
| qsort wrote:
| The BI world is honestly kind of weird.
|
| You have people who are at the intersection of "understands
| databases, the relational model, query optimization etc. at the
| level of a very senior SWE" ∩ "needs to be told how git works
| in the year of our lord 2023".
| ddol wrote:
| I had a similar experience at Airbnb.
|
| My title at Airbnb was "Data Engineer" in 2016, then "Software
| Engineer - Data" 2016-2019, then just "Software Engineer"
| 2019-2023.
|
| When I joined the DE team we were not in the Engineering Org,
| our manager reported to the head of Analytics (Chief Data
| Scientist). The DE perf cycle, job levels and comp were all
| tied to the Analytics Org levels. There was a Data Infra team
| (DI) under Engineering > Infrastructure who managed Presto,
| HBase, HIVE, &c. but didn't touch pipelines, that was DE's job.
|
| Most of the DEs owned more than pipelines though, many of us
| also wrote and owned services. Max on our team built Airflow
| and Caravel/Panoramix/Superset during hackathons, Johnathan
| built our Data Quality tool, Amit built the Minerva semantic
| metrics layer (which Nick, James and Paul spun out as
| Transform), Aaron built our Anomaly Detection platform, John
| built Dataportal, I built our Customer Support Roster service
| and a Kafka indexing service.
|
| Our manager was awesome. She saw that we were undervalued in
| Analytics and lobbied successfully to move the team to the
| Engineering Infrastructure org. We were all retitled in
| Workday, our perf structure changed to align with the rest of
| Engineering, as did our levels.
|
| DE living as a whole org team under Infra lasted less than a
| year before we were split up and distributed into the
| respective product teams we supported, as Software Engineers
| with a focus on building & maintaining pipelines, schemas,
| logging libraries... and the existing tools we had built. The
| intention was to be embedded into the product teams (Homes,
| Trips, Support Tools, &c.), skill up these teammates and share
| the oncall load. In reality what happened was that (at least) 3
| DE teams then grew in the various product orgs.
| gigatexal wrote:
| Maybe this is different at the highest levels of the game, but
| for engineers in the more mainstream parts of the bell curve,
| at companies below Google's level of craziness and volume, it's
| been my experience that Data Engineers -- folks that have come
| up as former DBAs, data warehouse devs, DB-heavy backend devs,
| analytics / reporting folks -- tend to solve problems in a more
| straightforward, data-centric, practical sort of way. And in my
| experience folks who enter a data role from the software side
| of things tend to come up with rather convoluted solutions to
| simple things.
|
| Therefore I think the title distinction is warranted. It
| frames that the company is looking for engineers with skills
| in the software space -- source control mastery, knowledge of
| a language or two other than SQL, but also experience looking
| at query plans, designing large scale data systems, dealing
| with BI tools etc etc. A sw engineer from a traditional
| background CAN do this but I'd rather someone that fits the
| DE role more.
| kozikow wrote:
| It's the least defined role.
|
| Currently, I am in a funny situation where all teams agree we
| need an additional data engineer. But basically:
|
| - Sales and finance want more of a business intelligence analyst
|
| - Devs want more of a backend engineer
|
| - ML researchers want a data analyst proficient in ETL to do
| pipelines on the training dataset
|
| All of those 3 have only one thing in common - they need to
| know SQL very well. I've worked extensively with various
| technologies to analyze data - pandas, sql, spark. And still, I
| find SQL (especially recently BigQuery) getting me what I want
| the quickest.
| ed_elliott_asc wrote:
| Yeah I'm thinking of changing my title back to software dev
| instead of DE - it's sort of getting a bad rep.
| fifilura wrote:
| How do you define "bad rep"?
| opportune wrote:
| A lot of "data engineers" are former db analysts and such
| that don't know much of anything technically outside of SQL
| and even that might be something they only are "certified"
| to know rather than actually good at.
|
| It's basically becoming a title I'd associate with being
| low-skill. I used to be a "software engineer in data" and
| never called myself a data engineer because people would
| think I don't know how to write/maintain production
| services, just write ETL pipelines.
| g0xA52A2A wrote:
| Maybe I'm just of a different vintage but I'm just not a fan of
| these e-mail newsletters that seem to be the trend these days. I'd
| much rather follow a GitHub repo for this sort of thing.
| maddynator wrote:
| Does anyone know what tech is used here to create the online
| book?
| emlos wrote:
| looks like mdBook https://github.com/rust-lang/mdBook
___________________________________________________________________
(page generated 2023-12-07 23:00 UTC)