[HN Gopher] Data Engineering Design Patterns
       ___________________________________________________________________
        
       Data Engineering Design Patterns
        
       Author : sebg
       Score  : 156 points
       Date   : 2023-12-07 15:51 UTC (7 hours ago)
        
 (HTM) web link (www.dedp.online)
 (TXT) w3m dump (www.dedp.online)
        
       | hdespiritu wrote:
       | DAMA-DMBOK2 covers this very comprehensively
       | 
       | https://www.dama.org/cpages/body-of-knowledge
        
         | datadrivenangel wrote:
         | Data engineering is cool and new while data management is old
         | school and enterprise.
         | 
         | Specifically, data engineering in some tech companies is truly
         | a revenue driver, which makes data engineering in other
         | organizations less likely to be viewed as a cost center, even
         | if it is the same work at most organizations.
        
           | esafak wrote:
           | How do you define the two terms?
        
             | datadrivenangel wrote:
             | Data engineering versus data management?
             | 
             | Data engineering is nominally more pipeline oriented and
             | less concerned with the governance & people side of things,
             | but good data engineering people end up driving a lot of
             | data management work because that's what makes the data
             | engineering less painful (eliminate root cause of data
             | errors and annoying data requests) and data overall more
             | useful and valuable.
        
             | tremon wrote:
             | Data Engineering is an engineering discipline -- it can
             | involve anything from data ingestion, transformation,
             | storage, enrichment, aggregation up to presentation in
             | operational reports. But it's still a manufacturing process
             | with "data" as an input and "data" as its output.
             | 
             | Data Management is an organizational discipline -- it is
             | about how the enterprise manages data as an asset and how
             | data is embedded in the organization. This includes data
             | governance issues like common data models, and a chain of
             | command (which person/role is responsible for which piece
             | of data), but also second-tier data processes such as
             | quality control and data valuation.
        
           | erhserhdfd wrote:
           | This may be nitpicking, but technologies being described as
           | "cool" versus "enterprise" or "new" versus "old" I find
           | meaningless. I don't necessarily want to have the "coolest"
           | or "newest" tech stack; I want to have the tech stack that
           | reasonably and reliably solves my business problems.
           | If that means leveraging "old" or "enterprise" technologies
           | and practices, that could be totally fine.
        
       | bfors wrote:
       | Could be interesting once there's more content; in its current
       | state it is mostly just definitions.
        
         | articsputnik wrote:
         | Author of the book here -- 100%. The idea is to release early
         | and update on the go. You might wonder if it's worth delving
         | into the book in its current, unfinished form. I'd say it
         | depends.
         | 
         | There's quite a bit of content around general DE knowledge.
         | However, the anticipated design patterns are still in the
         | works. If that's what you're most excited about, you'll find
         | the beginnings of exploring the patterns of `caching` and
         | `ad-hoc querying` in the first Convergent Evolution chapter,
         | but otherwise, you need to wait for more.
        
       | mdaniel wrote:
       | The "please let me know"
       | <https://www.dedp.online/appendix/feedback> link at the bottom of
       | the page gives an ugly Apache 404
       | 
       | heh, I would send feedback about that but ...
       | 
       | ---
       | 
       |  _update_ : so, it seems the Feedback in the sidebar
       | <https://www.dedp.online/appendix/feedback.html> works so it was
       | just a missing page extension. While reading that I discovered
       | the GitHub repo for the project is private, which explains why I
       | couldn't find it, either
        
         | articsputnik wrote:
         | Author of the book here -- yes, I used mdBook
         | (https://github.com/rust-lang/mdBook), and the links strangely
         | end with `.html`, as you correctly updated. Initially, I wanted
         | to change it but left it as it was.
        
       | JAlexoid wrote:
       | It's interesting to note that when I was first called a DE, it
       | just meant software engineer in the data domain.
       | 
       | As in writing full software that happens to focus on data.
       | 
       | Just 6 years ago I would be tinkering with PrestoDB code, looking
       | at optimizing the scheduler and building Hadoop extensions.
       | 
       | Between that and today the field swung to people who came from
       | BI, with considerably less software engineering background. To
       | the point that just 2 years ago, when applying for DE roles, I
       | would be confused why the majority of my screening questions came
       | in the form of "how well do you know SQL".
       | 
       | Today I do the same as I did 3-4 years ago, but I am no longer a
       | data engineer.
        
         | lysecret wrote:
         | Yea same. Also, honestly feels like the way the field is
         | progressing it will just be eaten up by an SWE role. Feel the
         | same for ML engineer and many other specialized roles.
        
           | JAlexoid wrote:
           | I don't think that SWEs will.
           | 
           | The software and services are going to be getting advanced
           | enough to just eliminate the need for a dedicated team to
           | build ETL. People with relevant domain knowledge will have an
           | easier time delivering their work product, without the
           | overhead of a build phase.
           | 
           | For a reasonably good data platform, point-and-click ETL
           | services, SaaS offerings and the likes of Metabase are
           | already good enough for medium enterprises... and beat
           | Databricks offerings for speed (setup, delivery and operation)
           | in reporting and operational data access.
           | 
           | I am absolutely sure that there will be a massive contraction
           | in the DS, DE and ML opportunity market in the next few
           | years. The major companies will consolidate and jobs in those
           | domains will be available at only a handful of companies...
           | or extremely specialized startups (much like chip design is
           | now consolidated).
           | 
           | Long story short, for companies - you probably don't need DS,
           | ML and DE departments.
        
         | qsort wrote:
         | The BI world is honestly kind of weird.
         | 
         | You have people who are at the intersection of "understands
         | databases, the relational model, query optimization etc. at the
         | level of a very senior SWE" and "needs to be told how git works
         | in the year of our lord 2023".
        
         | ddol wrote:
         | I had a similar experience at Airbnb.
         | 
         | My title at Airbnb was "Data Engineer" in 2016, then "Software
         | Engineer - Data" 2016-2019, then just "Software Engineer"
         | 2019-2023.
         | 
         | When I joined the DE team we were not in the Engineering Org,
         | our manager reported to the head of Analytics (Chief Data
         | Scientist). The DE perf cycle, job levels and comp were all
         | tied to the Analytics Org levels. There was a Data Infra team
         | (DI) under Engineering > Infrastructure who managed Presto,
         | HBase, Hive, &c. but didn't touch pipelines; that was DE's job.
         | 
         | Most of the DEs owned more than pipelines though; many of us
         | also wrote and owned services. Max on our team built Airflow
         | and Caravel/Panoramix/Superset during hackathons, Johnathan
         | built our Data Quality tool, Amit built the Minerva semantic
         | metrics layer (which Nick, James and Paul spun out as
         | Transform), Aaron built our Anomaly Detection platform, John
         | built Dataportal, I built our Customer Support Roster service
         | and a Kafka indexing service.
         | 
         | Our manager was awesome. She saw that we were undervalued in
         | Analytics and lobbied successfully to move the team to the
         | Engineering Infrastructure org. We were all retitled in
         | Workday, our perf structure changed to align with the rest of
         | Engineering, as did our levels.
         | 
         | DE living as a whole org team under Infra lasted less than a
         | year before we were split up and distributed into the
         | respective product teams we supported, as Software Engineers
         | with a focus on building & maintaining pipelines, schemas,
         | logging libraries... and the existing tools we had built. The
         | intention was to be embedded into the product teams (Homes,
         | Trips, Support Tools, &c.), skill up these teammates and share
         | the oncall load. In reality what happened was that (at least) 3
         | DE teams then grew in the various product orgs.
        
           | gigatexal wrote:
           | Maybe this is different at the highest levels of the game, but
           | for engineers in the more mainstream parts of the bell curve,
           | at companies below the Google level of craziness and volume,
           | it's been my experience that Data Engineers (folks that have
           | come up as former DBAs, data warehouse devs, DB-heavy backend
           | devs, analytics / reporting folks) tend to solve problems in a
           | more straightforward, data-centric, practical sort of way. And
           | in my experience folks who enter a data role from the software
           | side of things tend to come up with rather convoluted
           | solutions to simple things.
           | 
           | Therefore I think the title distinction is warranted. It
           | signals that the company is looking for engineers with skills
           | in the software space (source control mastery, knowledge of a
           | language or two other than SQL) but also experience looking at
           | query plans, designing large-scale data systems, dealing with
           | BI tools, etc. A SW engineer from a traditional background CAN
           | do this, but I'd rather have someone who fits the DE role
           | more.
        
         | kozikow wrote:
         | It's the least defined role.
         | 
         | Currently, I am in a funny situation where all teams agree we
         | need an additional data engineer. But basically:
         | 
         | - Sales and finance want more of a business intelligence analyst
         | 
         | - Devs want more of a backend engineer
         | 
         | - ML researchers want a data analyst proficient in ETL to do
         | pipelines on the training dataset
         | 
         | All 3 of those have only one thing in common - they need to
         | know SQL very well. I've worked extensively with various
         | technologies to analyze data - pandas, SQL, Spark. And still, I
         | find SQL (especially, recently, BigQuery) gets me what I want
         | the quickest.
        
         | ed_elliott_asc wrote:
         | Yeah I'm thinking of changing my title back to software dev
         | instead of DE - it's sort of getting a bad rep.
        
           | fifilura wrote:
           | How do you define "bad rep"?
        
             | opportune wrote:
             | A lot of "data engineers" are former DB analysts and such
             | who don't know much of anything technical outside of SQL,
             | and even that might be something they are only "certified"
             | in rather than actually good at.
             | 
             | It's basically becoming a title I'd associate with being
             | low-skill. I used to be a "software engineer in data" and
             | never called myself a data engineer because people would
             | think I didn't know how to write/maintain production
             | services, just how to write ETL pipelines.
        
       | g0xA52A2A wrote:
       | Maybe I'm just of a different vintage but I'm just not a fan of
       | these e-mail newsletters that seem to be the trend these days. I'd
       | much rather follow a GitHub repo for this sort of thing.
        
       | maddynator wrote:
       | Does anyone know what tech is used here to create the online
       | book?
        
         | emlos wrote:
         | looks like mdBook https://github.com/rust-lang/mdBook
        
       ___________________________________________________________________
       (page generated 2023-12-07 23:00 UTC)