[HN Gopher] The evolution of the data engineer role
___________________________________________________________________
The evolution of the data engineer role
Author : Arimbr
Score : 123 points
Date : 2022-10-24 14:30 UTC (8 hours ago)
(HTM) web link (airbyte.com)
(TXT) w3m dump (airbyte.com)
| unholyguy001 wrote:
| It's fascinating that the conclusion / forecast is that tools
| will abstract engineering problems and DE will move closer to the
| business, while over the last 20 years the exact opposite has
| happened: the toolset has actually become harder (not easier) to
| use but orders of magnitude more powerful, and DE has moved
| closer to engineering, to the point where a good data engineer
| basically is a specialized software engineer.
|
| The absolute pinnacle of "easy to use" was probably the
| Informatica / Oracle stack of the late 90's and early 00's. It
| just wasn't powerful or scalable enough to meet the needs of the
| Big Data shift.
|
| Of course I guess this makes sense given the author works for a
| company with a vested interest in reversing that trend.
| hnews_account_1 wrote:
| I think those tools were easy to use within the zeitgeist of
| their day. Even advanced versions of those tools would struggle
| against the data needs of today, which have become incredibly
| bespoke. My skill
| set extends all the way from my actual industry (finance) to
| the boundary of software development. I also have data, big
| data and cluster usage skills (Slurm, etc.). I don't use
| everything every day and obviously I cannot be a specialist in
| most of this stuff (I concentrate on finance more than anything
| else) considering the incredible range, but this is just the
| past 2 years for me.
|
| Looking around today, I cannot imagine a less specialized future
| where some nice tool does 80% of my work. Not because the work
| I do is difficult to automate. But because the work I do won't
| match the work other industries may do (beyond existing
| generalizations of pandas, regression toolkits and other low
| level stuff). There's no point building a full automation suite
| just for my single work profile which itself will differ from
| other areas of finance.
| prions wrote:
| IMO Data engineering is already a specialized form of software
| engineering. However, what people interpret as DEs being slow to
| adopt best practices from traditional software engineering is
| more about the unique difficulties of working with data
| (especially at scale) and less about the awareness or desire to
| use best practices.
|
| Speaking from my DE experience at Spotify and previously in
| startup land, the biggest challenge is the slow and distant
| feedback loop. The vast majority of data pipelines don't run on
| your machine and don't behave like they do on a local machine.
| They run as massively distributed processes and their state is
| opaque to the developer.
|
| Validating the correctness of a large scale data pipeline can be
| incredibly difficult as the successful operation of a pipeline
| doesn't conclusively determine whether the data is actually
| correct for the end user. People working seriously in this space
| understand that traditional practices here like unit testing only
| go so far. And integration testing really needs to work at scale
| with easily recyclable infrastructure (and data) to not be a
| massive drag on developer productivity. Even getting the correct
| kind of data to be fed into a test can be very difficult if the
| ops/infra of the org isn't designed for it.
|
| The best data tooling isn't going to look exactly like
| traditional swe tooling. Tools that vastly reduce the feedback
| loop of developing (and debugging) distributed pipelines running
| in the cloud and also provide means of validating the output on
| meaningful data is where tooling should be going. Traditional SWE
| best practices will really only take hold once that kind of
| developer experience is realized.
| mywittyname wrote:
| > Validating the correctness of a large scale data pipeline can
| be incredibly difficult as the successful operation of a
| pipeline doesn't conclusively determine whether the data is
| actually correct for the end user. People working seriously in
| this space understand that traditional practices here like unit
| testing only go so far.
|
| I'm glad to see someone calling this out because the comments
| here are a sea of "data engineering needs more unit tests."
| Reliably getting data into a database is rarely where I've
| experienced issues. That's the easy part.
|
| This is the biggest opportunity in this space, IMHO, since
| validation and data completeness/accuracy is where I spend the
| bulk of my work. Something that can analyze datasets and
| provide some sort of ongoing monitoring for confidence on the
| completeness and accuracy of the data would be great. These
| tools seem to exist mainly in the network security realm, but
| I'm sure they could be generalized to the DE space. When I
| can't leverage a second system for validation, I will generally
| run some rudimentary statistics to check whether the volume and
| types of data I'm getting are similar to what is expected.
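|
| For what it's worth, a minimal sketch of that kind of rudimentary
| check in pandas (column names and thresholds are made up):
|
|       import pandas as pd
|
|       def sanity_check(df: pd.DataFrame, expected_rows: int,
|                        tol: float = 0.2) -> list:
|           # Flag loads whose volume or null rates drift from
|           # what past loads lead you to expect.
|           issues = []
|           if abs(len(df) - expected_rows) > tol * expected_rows:
|               issues.append(f"row count {len(df)} outside "
|                             f"+/-{tol:.0%} of {expected_rows}")
|           for col, rate in df.isna().mean().items():
|               if rate > 0.05:  # arbitrary null-rate threshold
|                   issues.append(f"{col}: {rate:.1%} null")
|           return issues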
| abrazensunset wrote:
| There is a huge round of "data observability" startups that
| address exactly this. As a category it was overfunded prior
| to the VC squeeze. Some of them are actually good.
|
| They all have various strengths and weaknesses with respect
| to anomaly detection, schema change alerts, rules-based
| approaches, sampled diffs on PRs, incident management,
| tracking lineage for impact analysis, and providing
| usage/performance monitoring.
|
| Datafold, Metaplane, Validio, Monte Carlo, Bigeye
|
| Great Expectations has always been an open source standby as
| well and is being turned into a product.
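|
| As a flavor of what these checks look like, a minimal Great
| Expectations sketch against a pandas DataFrame (the classic API
| as of this writing; column names and bounds are made up):
|
|       import great_expectations as ge
|       import pandas as pd
|
|       orders = pd.DataFrame({"order_id": ["o1"], "amount": [42.0]})
|       df = ge.from_pandas(orders)
|       df.expect_column_values_to_not_be_null("order_id")
|       df.expect_column_values_to_be_between("amount", min_value=0,
|                                             max_value=10_000)
|       print(df.validate())  # reports which expectations passed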
| robertlagrant wrote:
| I've worked with medium-sized ETL, and not only does it have
| unique challenges, it's a sub-domain that seems to reward quick
| and dirty and "it works" over strong validation.
|
| The key problem is that the more you validate incoming data, the
| more you can demonstrate correctness, but also the more often
| data coming in will be rejected, and you will be paged out of
| hours :)
| conkeisterdoor wrote:
| I also manage a medium-sized set of ETL pipelines (approx 40
| pipelines across 13k-ish lines of Python) and have a very
| similar experience.
|
| I've never been in a SWE role before, but am related to and
| have known a number of them, and have a general sense of what
| being a SWE entails. That disclaimer out of the way, it's my
| gut feeling that a DE typically does a more "hacky" kind of
| coding than a SWE, whereas SWEs have much more clearly
| established standards for how to do certain things.
|
| My first modules were a hot nasty mess. I've been refactoring
| and refining them over the past 1.5 years so they're more
| effective, efficient, and easier to maintain. But they've
| always just worked, and that has been good enough for my
| employer.
|
| I have one 1600 line module solely dedicated to validating a
| set of invoices from a single source. It took me months of
| trial and error to get that monster working reliably.
| azurezyq wrote:
| This is actually a great observation. Data pipelines are often
| written in various languages, running on heterogeneous systems,
| with different time alignment schemes. I've always found it
| tricky to "fully trust" a result. Hmm, any best practices from
| your side?
| oa335 wrote:
| Not OP, but a Data Engineer with 4 years experience in the
| space - I think the key is to first build the feedback loop -
| i.e. anything that helps you answer how you know the data
| pipeline is flowing and that the data is correct - then
| getting sign-off from both the producers and consumers of the
| data. Actually getting the data flowing is usually pretty
| easy after both parties agree about what that actually means.
| Tycho wrote:
| I would describe myself as a _dataframe_ engineer.
| usgroup wrote:
| I live in the world of data lakes and elaborate pipelines. Now
| and again I get to use a traditional star schema data warehouse
| and ... it is an absolute pleasure to use in contrast to modern
| data access patterns.
| sherifnada wrote:
| In some sense, data engineering today is where software
| engineering was a decade ago:
|
| - Infrastructure as code is not the norm. Most tools are UI-
| focused. It's the equivalent of setting up your infra via the AWS
| UI.
|
| - Prod/Staging/Dev environments are not the norm
|
| - Version Control is not a first class concept
|
| - DRY and component re-use is exceedingly difficult (how many
| times did you walk into a meeting where 3 people had 3 different
| definitions of the same metric?)
|
| - API Interfaces are rarely explicitly defined, and fickle when
| they are (the hot name for this nowadays is "data contracts";
| see the sketch at the end of this comment)
|
| - unit/integration/acceptance testing is not nearly as
| ubiquitous as it is in software
|
| On the bright side, I think this means DE doesn't need to re-
| invent the wheel on a lot of these issues. We can borrow a lot
| from software engineering.
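|
| To make the "data contracts" point concrete, one lightweight
| sketch is a schema enforced at the producer/consumer boundary,
| e.g. with pydantic (the event and its fields are hypothetical):
|
|       from datetime import datetime
|       from pydantic import BaseModel, Field, ValidationError
|
|       class OrderEvent(BaseModel):
|           # The agreed-upon interface for a hypothetical feed.
|           order_id: str
|           amount_cents: int = Field(ge=0)
|           created_at: datetime
|
|       record = {"order_id": "o1", "amount_cents": -5,
|                 "created_at": "2022-10-24T14:30:00"}
|       try:
|           OrderEvent.parse_obj(record)
|       except ValidationError as err:
|           print(err)  # contract breach caught at the boundary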
| beckingz wrote:
| You're talking about analytics, not data engineering.
|
| But yes, Data Analysis still needs more of this, though the
| smarter folks are getting on the Analytics Engineering /
| DataOps trains.
| mywittyname wrote:
| My DE team has all of these, and I've never worked on a team
| without them. I speak as someone whose official title has been
| Data Engineer since 2015 and I've consulted for lots of F500
| companies.
|
| Unit testing is the only thing we tend to skip, mainly because
| it's more reliable to allow for fluidity in the data that's
| being ingested - which is really easy now that so many databases
| can support automatic schema detection. External APIs can
| change without notice, so it's better to just design for that,
| then use the time you would spend on unit tests to build alerts
| around automated data validation.
| swyx wrote:
| > Titles and responsibilities will also morph, potentially
| deeming the "data engineer" term obsolete in favor of more
| specialized and specific titles.
|
| "analytics engineer" is mentioned but also just had its first
| conference at dbt's conference. all the talks are already up
| https://coalesce.getdbt.com/agenda/keynote-the-end-of-the-ro...
| pjot wrote:
| Just to clarify, last week was dbt's first _in person_
| conference. Third overall.
| iblaine wrote:
| A full history of DE should include some original low code tools
| (Cognos, Informatica, SSIS). To some extent, the failure of these
| tools to adapt to the evolution of the DE role has led to our
| modern data stack.
| Eumenes wrote:
| Agreed. This is the first thing I thought about - the evolution
| from reporting systems to ETL code to Hadoop to Spark, etc.
| MrPowers wrote:
| Great article.
|
| > data engineers have been increasingly adopting software
| engineering best practices
|
| I think the data engineering field is starting to adopt some
| software engineering best practices, but it's still really early
| days. I am the author of popular Spark testing libraries (spark-
| fast-tests, chispa) and they definitely have a large userbase,
| but could also grow a lot.
|
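| For context, a minimal chispa-style equality test looks roughly
| like this (the transform under test is hypothetical; `spark` is
| a SparkSession test fixture):
|
|       import pyspark.sql.functions as F
|       from chispa import assert_df_equality
|
|       def with_upper_name(df):  # hypothetical transform under test
|           return df.withColumn("name", F.upper(F.col("name")))
|
|       def test_with_upper_name(spark):
|           source = spark.createDataFrame([("jose",)], ["name"])
|           expected = spark.createDataFrame([("JOSE",)], ["name"])
|           assert_df_equality(with_upper_name(source), expected)
|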
| > The way organizations structure data teams has changed over the
| years. Now we see a shift towards decentralized data teams, self-
| serve data platforms, and ways to store data beyond the data
| warehouse - such as the data lake, data lakehouse, or the
| previously mentioned data mesh - to better serve the needs of
| each data consumer.
|
| I think the Lakehouse architecture is the real future of data
| engineering, see the paper:
| https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
|
| Disclosure: I am on the Delta Lake team, but joined because I
| believe in the Lakehouse architecture vision.
|
| It will take a long time for folks to understand all the
| differences between data lakes, Lakehouses, data warehouses, etc.
| Over time, I think mass adoption of the Lakehouse architecture is
| inevitable (benefits of open file formats, no lock in, separating
| compute from storage, cost management, scalability, etc.).
| victor106 wrote:
| > It will take a long time for folks to understand all the
| differences between data lakes, Lakehouses, data warehouses,
| etc.
|
| What are some good resources that can help educate folks on
| these differences?
| claytonjy wrote:
| Short version:
|
| - data warehouse: schema on write. you have to know the end
| form before you load it. breaks every time upstream changes
| (a lot, in this world)
|
| - data lake: schema on read. load everything into S3 and deal
| with it later. Mongo for data platforms
|
| - data lakehouse: something in between. store everything
| loosely like a lake, but have in-lakehouse processes present
| user-friendly transforms or views like a warehouse. Made
| possible by cheap storage (parquet on S3), reduces schema
| breakage by keeping both sides of the T in the same place
| DougBTX wrote:
| Materialised views for cloud storage?
| MrPowers wrote:
| I am working on some blogs / videos that will hopefully help
| clarify the differences. I'm working on a Delta Lake vs
| Parquet blog post right now and gave a 5 Reasons Parquet
| files are better than CSV talk last year:
| https://youtu.be/9LYYOdIwQXg
|
| Most of the content that I've seen in this area is really
| high-level. I'm trying to write posts that are a bit more
| concrete with some code snippets / high level benchmarks,
| etc. Hopefully this will help.
| abrazensunset wrote:
| "Lakehouse" usually means a data lake (bunch of files in
| object storage with some arbitrary structure) that has an
| open source "table format" making it act like a database.
| E.g. using Iceberg or Delta Lake to handle deletes,
| transactions, concurrency control on top of parquet (the
| "file format").
|
| The advantage is that various query engines will make it
| quack like a database, but you have a completely open interop
| layer that will let any combination of query engines (or just
| SDKs that implement the table format, or whatever) coexist.
| And in addition, you can feel good about "owning" your data
| and not being overtly locked in to Snowflake or Databricks.
| swyx wrote:
| (OP's coworker) We actually published a guide on data
| lakes/lakehouses last month! https://airbyte.com/blog/data-
| lake-lakehouse-guide-powered-b...
|
| covering:
|
| - What's a Data Lake and Why Do You Need One?
|
| - What Are the Differences between a Data Lake, Data Warehouse,
| and Data Lakehouse
|
| - Components of a Data Lake
|
| - Data Lake Trends in the Market
|
| - How to Turn Your Data Lake into a Lakehouse
| oa335 wrote:
| I am a data engineer, and I STILL don't understand the
| differences between the following terms:
|
| 1. Data Warehouse
|
| 2. Datalake
|
| 3. Data Lakehouse
|
| 4. Data Mesh
|
| Can someone please clearly explain the differences between
| these concepts?
| chrisjc wrote:
| You mean Lake 'Shanty' Architecture (think DataSwamp vs
| DataLake) am I right?
|
| But in all seriousness, I totally agree with your opinion on
| LakeHouse Architecture and am especially excited about Apache
| Iceberg (external table format) and the support and attention
| it's getting.
|
| Although I don't think that selecting any of these data
| technologies/philosophies comes down to making a mutually
| exclusive decision. In my opinion, they either build on or
| complement each other quite nicely.
|
| For those that are interested, here are my descriptions of
| each...
|
| Data Lake Arch - all of your data is stored on blob-storage
| (S3, etc) in a way that is partitioned thoughtfully and easily
| accessible, along with a meta-index/catalogue of what data is
| there, and where it is.
|
| Lake House Arch - similar to a DataLake but data is structured
| and mutable, and hopefully allows for transactions/atomic-ops,
| schema evolution/drift, time-travel/rollback, so on... Ideally
| all of the properties that you usually assume to get with any
| sort of OLAP (maybe even OLTP) DB table. But the most important
| property in my opinion is that the table is accessible through
| any compatible compute/query engine/layer. Separating storage
| and compute has revolutionized the Data Warehouse as we know
| it, and this is the next iteration of this movement in my
| opinion.
|
| Data Mesh/Grid Arch - designing how the data moves from a
| source all the way through each and every target while
| storing/capturing this information in an accessible
| catalogue/meta-database even as things transpire and change. As
| a result it provides data lineage and provenance, potentially
| labeling/tagging, inventory, data-dictionary-like information
| etc... This one is the most ambiguous and maybe most difficult
| to describe and probably design/implement, and to be honest
| I've never seen a real-life working example. I do think this
| functionality is a critical missing piece of the data stack,
| whether the solution is a Data Mesh/Grid or something else.
| Data Engineers have their work cut out for them on this one,
| mostly because this is where their paths cross with those of
| Application/Service Developers, Software Engineers. In my
| opinion, developers are usually creating services/applications
| that are glorified CRUD wrappers around some kind of
| operational/transactional data store like MySQL, Postgres,
| Mongo, etc. Analytics, reporting, retention, volume, etc. are
| usually an afterthought and not their problem. Until someone
| hooks the operational data store up to their SQL IDE or
| Tableau/Looker and takes down prod. Then along comes the data
| engineer to come up with yet another ETL/ELT to get the data
| out of the operational data store and into a data warehouse so
| that reports and analytics can be run without taking prod down
| again.
|
| Data Warehouse (modern) - Massive Parallel Processing (MPP)
| over detached/separated columnar (for now) data. Some Data
| Warehouses are already somewhat compatible with Data Lakes
| since they can use their MPP compute to index and access
| external tables. Some are already planning to be even more Lake
| House compatible by not only leveraging their own MPP compute
| against externally managed tables (eg), but also managing
| external tables in the first place. That includes managing
| schemas and running all of the DDLs (CREATE, ALTER, DROP, etc)
| as well as DQLs (SELECT) and DMLs (MERGE, INSERT, UPDATE,
| DELETE, ...). Querying data across native DB tables, external
| tables (potentially from multiple Lake Houses, Data Lakes) all
| becomes possible with a join in a SQL statement. Additionally
| this allows for all kinds of governance related functionality
| as well. Masking, row/column level security, logging, auditing,
| so on.
|
| As you might be able to tell from this post (and my post
| history), I'm a big fan of Snowflake. I'm excited for
| Snowflake-managed Iceberg tables whose data can then be consumed
| with a different compute/query engine. Snowflake (or other modern
| DW) could prepare the data (ETL/calc/crunch/etc) and then
| manage (DDL & DML) it in an Iceberg table. Then something like
| DuckDB could consume the Iceberg table schema and listen for
| table changes (oplog?), and then read/query the data performing
| last-mile analytics (pagination, order, filter, aggs, etc).
|
| DuckDB doesn't support Apache Iceberg, but it can read parquet
| files which are used internally in Iceberg. Obviously
| supporting external tables is far more complex than just
| reading a parquet file, but I don't see why this isn't in their
| future. DuckDB guys, I know you're out there reading this :)
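|
| (For the curious, reading an Iceberg table's underlying data
| files from DuckDB today looks something like this sketch. Note
| that it scans the raw parquet and bypasses Iceberg's metadata
| layer - snapshots, delete files - so it's not real table
| support. The path is made up.)
|
|       import duckdb
|
|       con = duckdb.connect()
|       con.execute(
|           "SELECT count(*) "
|           "FROM read_parquet('warehouse/orders/data/*.parquet')"
|       )
|       print(con.fetchall())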
|
| https://iceberg.apache.org/
|
| https://www.snowflake.com/guides/olap-vs-oltp
|
| https://www.snowflake.com/blog/iceberg-tables-powering-open-...
|
| Finally one of my favorite articles:
|
| https://engineering.linkedin.com/distributed-systems/log-wha...
| beckingz wrote:
| I'm going to use "Lake Shanty" in the future. Powerful phrase
| to describe what happens when you run aground on the shore of
| a data swamp.
| oa335 wrote:
| Great write-up. I would add that I actually have seen
| something like a "Data Mesh" architecture, at a bank of all
| places. The key was a very stable, solid infrastructure and
| dev platform, as well as a custom Python library that worked
| across that Platform which was capable of ELT across all
| supported datastores and would properly log/annotate/catalog
| the data flows. Such a thing is really only possible when the
| platform is actually very stable and devs are somewhat forced
| to use the library.
| z3c0 wrote:
| For those of you who are genuinely curious why this field has so
| many similarly-named roles, here's a sincere, non-snarky, non-
| ironic explanation:
|
| A Data Analyst is distinct from a Systems Analyst or a Business
| Analyst. They may perform both systems and business analysis
| tasks, but their distinction comes from their understanding of
| statistics and how they apply that to other forms of analysis.
|
| An ML specialist is not a Data Scientist. Did you successfully
| build and deploy an ML model to production? Great! That's still
| uncommon, despite the hype. However, that would land you in the
| former position. You can claim the latter once you've walked that
| model through the scientific method, complete with hypothesis
| verification and democratization of your methodology.
|
| A BI Engineer and a Data Engineer are going to overlap a lot, but
| the former is going to lean more towards report development,
| where the latter will spend more time with ELTs/ETLs. As a data
| engineer, most of the report development that I do is to report
| on the state of data pipelines. BI BI, I like to call it.
|
| A Big Data Engineer or Specialist is a subset of data engineers
| and architects angled towards the problems of big data. This
| distinction actually matters now, because I'm encountering data
| professionals these days who have never worked outside the cloud
| or with small enterprise datasets (unthinkable only half a decade
| ago).
|
| It doesn't help that a lack of understanding often leads to
| misnamed positions, but anybody who has spent time in this field
| gets used to the subtle differences quickly.
| itsoktocry wrote:
| What about _Analytics Engineer_ , the hypiest-of-the-hyped
| right now?
| claytonjy wrote:
| BI engineer that knows dbt
| claytonjy wrote:
| This strikes me as incredibly rosy; I want to live in this
| world, but I don't. The world I live in:
|
| - Data Analyst: someone who knows some SQL but not enough
| programming, so we can pay < 6 figures
|
| - ML specialist: someone who figured out DS is a race to the
| bottom and ML in a title gets you paid more. Spends most of
| their time installing pytorch in various places
|
| - BI Engineer: Data Analyst but paid a bit more
|
| - Data Engineer: Airflow babysitter
|
| - Big Data Engineer: middle-aged Scala user, Hadoop babysitter
| tdj wrote:
| In my experience, I have started to believe ML Engineer is
| short for "YAML Engineer".
| z3c0 wrote:
| Out of all the snark in this thread, this is the only bit
| to elicit a chuckle from me. Thank you.
| mywittyname wrote:
| > The declarative concept is highly tied to the trend of moving
| away from data pipelines and embracing data products -
|
| Of course an Airbyte article would say this, because they are
| selling these tools, but my experience has been the opposite.
| People buy these tools because they claim to make it easier for
| non-software people to build pipelines. But the problem is that
| these tools seem to end up being far more complicated and less
| reliable than pipelines built in code.
|
| There's a reason that this domain is saturated with so. many.
| tools. None of them do a great job. And when a company invariably
| hits the limits of one, they start shopping for a replacement,
| which will have its own set of limitations. Lather-rinse-repeat.
|
| I built a solid career over the past 8 or so years of replacing
| these "no code" pipeline tools with code once companies hit the
| ceilings of these tools. You can get surprisingly far in the data
| world with Airflow + a large scale database, but all of the major
| cloud providers have great tool offerings in this space. Plus,
| for platforms that these tools don't interface with, you're going
| to have to write code anyway.
| muspimerol wrote:
| > I built a solid career over the past 8 or so years of
| replacing these "no code" pipeline tools with code once
| companies hit the ceilings of these tools.
|
| I'm sure you earn a nice living doing this, but surely this is
| not a convincing argument against using off-the-shelf data
| products. It will always come down to the cost (including
| ongoing maintenance) for the business. Bespoke in-house
| software is always the most flexible route, but rarely the
| cheaper one.
| Arimbr wrote:
| Oh, declarative doesn't necessarily mean no-code. Airbyte data
| integration connectors are built with SDKs in Python and Java,
| plus a low-code SDK that was just released...
|
| You can then build custom connectors on top of these and many
| users actually need to modify an existing connector, but would
| rather start from a template than from scratch.
|
| Airbyte also provides a CLI and YAML configuration language
| that you can use to declare sources, destinations and
| connections without the UI:
| https://github.com/airbytehq/airbyte/blob/master/octavia-cli...
|
| I agree with you that code is here to stay and power users need
| to see the code and modify it. That's why Airbyte's code is
| open source.
| buscoquadnary wrote:
| Business Analyst, Big Data Specialist, Data Mining Engineer, Data
| Scientist, Data Engineer.
|
| Why is this field so prone to hype and repeating the same things
| with a new coat of paint? I mean, whatever happened to OLAP, data
| cubes, Big Data, and whatever other super big next thing that has
| happened in the past 2 decades?
|
| Methinks the problem with Business Intelligence solving problems
| is the first part of the term and not the second.
| hatware wrote:
| > Why is this field so prone to hype and repeating the same
| things with a new coat of paint.
|
| Money and Marketing. It's no different from how Hadoop was a
| big deal around 2010, or how Functional Programming became the
| new thing from 2015 onwards.
|
| Personally I think this is a failure of regulatory agencies.
| MonkeyMalarky wrote:
| I dunno, I have to first put my data somewhere though. But
| where.. In a warehouse? Silo? Data lake? Lake house? (I really
| despise that last one, who could coin that phrase with a
| straight face..)
| MrPowers wrote:
| Data warehouse: bundles compute & storage and comes at a
| comparatively high price point. Great option for certain
| workflows. Not as great for scaling & non-SQL workflows.
|
| Data lake: typically refers to Parquet files / CSV files in
| some storage system (cloud or HDFS). Data lakes are better
| for non-SQL workflows compared to data warehouses, but have a
| number of disadvantages.
|
| Lakehouse storage formats: Based on OSS files and solve a
| number of data lake limitations. Options are Delta Lake,
| Iceberg, and Hudi. Lakehouse storage formats offer a ton of
| advantages and basically no downsides compared to Parquet
| tables for example.
|
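| As a rough sketch of what a lakehouse storage format buys you
| over plain Parquet (assuming a SparkSession `spark`, a DataFrame
| `df`, and the delta-spark package; the path is made up):
|
|       # ACID append to an open table format on object storage.
|       df.write.format("delta").mode("append") \
|           .save("s3://bucket/events")
|
|       # Time travel: read the table as of an earlier version.
|       old = spark.read.format("delta") \
|           .option("versionAsOf", 0).load("s3://bucket/events")
|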
| Lakehouse architecture: An architectural paradigm to store
| data in a way such that it's easily accessible for SQL-
| based and non-SQL-based workflows; see the paper:
| https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
|
| There are a variety of tradeoffs to be weighed when selecting
| the optimal solution for your particular needs.
| marktangotango wrote:
| If this is satire it's brilliant. I don't doubt it's
| factual, but the last sentence is a slayer.
| fabatka wrote:
| Can you explain why you find the above explanation
| amusing? I honestly don't see the absurdity of it,
| although my livelihood may depend on me not seeing it :)
| victor106 wrote:
| It seems like you are just being negative to a reply
| that was meant to genuinely clarify confusing
| terminology.
|
| If not, please elaborate on why you doubt this is
| factual.
| PubliusMI wrote:
| IMHO,
|
| "data lake" = collection of all data sources: HDFS, S3, SQL
| DB, etc
|
| "lake house" = tool to query and combine results from all
| such sources: Dremio
| buscoquadnary wrote:
| But what lake house is complete without a boat.
|
| That's why my company is looking for investors who are
| interested in being at the forefront of the data
| revolution, using our data rowboat that will allow you to
| proactively leverage your data synergies to break down
| organizational data silos and use analytics to address your
| core competencies in order to leverage a strategic
| advantage to become the platform of choice in a holistic
| environment.
|
| Tell me if this sounds familiar, your company has tons of
| data but it is spread out all over the place and you can't
| seem to get good info, you end up hounding engineers to get
| your reports and provide you information so you can look
| like you are making data driven decisions. Maybe you've
| implemented a data lake but now have no idea how to use it.
| We've got you covered with our patent pending data rowboat
| solution.
|
| This will allow you to impress everyone else in the mid-
| level staff meetings by allowing you to say you are doing
| something around the "data revolution" in your org. The
| best part is that every implementation will come with a
| team of our in house consultants that will allow the
| project to drag on forever so that you always have
| something to report on in staff meetings and make you look
| good to your higher ups.
|
| Now you may be an engineer looking to revolutionize your
| career and get involved in the next step of the glorious
| October data revolution. Well we've got you covered for a
| very reasonable price you can enroll in our "data rowboat
| boot camp", where you will spend hours locked in a room
| where someone who barely speaks English will read
| documentation to you.
|
| But act quick otherwise you'll end up as one of the data
| kulaks as the new data rowboat revolution proceeds into a
| glorious future with our 5 year plan.
| jdgoesmarching wrote:
| Brb, running to trademark every nautical data metaphor I
| can get my hands on.
|
| What happens when your data rowboat runs ashore?
| Introducing Data Tugboat(tm), your single pane of glass
| solution for shoring up undocumented ETLs and reeling
| your data lineage safely into harbor.
| MonkeyMalarky wrote:
| Need to run ML on your data? Try our DeepSea data
| drilling rigs, delivered in containers!
| MonkeyMalarky wrote:
| Sir, I'm sorry, but a rowboat just won't scale, my needs
| are too vast. What I'm proposing is the next level of
| data extraction. You've heard of data mining? Well meet
| the aquatic equivalent, the Data Trawler. To find out
| more, contact our solution consultants today!
| MikeDelta wrote:
| 'Tis a field riddled with yachts... but where are all the
| customers' yachts?
| Avicebron wrote:
| I think the real interesting point is slapping the title of
| engineer/scientist onto anything and everything regardless of
| the accreditation actually handed out. Soon coming up:
| "cafeteria engineer", "janitorial engineer"...
| the-printer wrote:
| Woah, woah, woah. Cool it, buddy.
|
| That's already begun.
| Test0129 wrote:
| The difference of course being that other types of engineers
| have to take a PE. The idea of requiring a PE to have that
| title is protectionism no different than limiting the number
| of graduating doctors to keep salaries high. No one will ask
| a software engineer to build a bridge - relax. Your
| protection racket is safe. Software engineer is a title
| conferred on someone who builds systems. It is fitting. And,
| if we're being honest, the average "job threat level" of a
| so-called "real" engineer is about the same as a software
| engineer these days anyway. With the exception of some niche
| jobs, every engineer I know is just a CADIA/SW/etc jockey and
| the real work is gatekept by greybeards.
|
| No one will call someone a cafeteria engineer or janitorial
| engineer. The premise is ridiculous. There is a title called
| "operations engineer" that uses math to optimize processes.
| Does this one bother you too?
| iamjk wrote:
| I won't be surprised if DE ends up just falling under the
| "software engineering" umbrella as the jobs grow closer together.
| With hybrid OLAP/OLTP databases becoming more popular, the
| skillset delta is definitely smaller than it used to be. Data
| Engineers are higher leverage assets to an organization than they
| ever have been before.
| snapetom wrote:
| I think it's mostly already there, but your big enterprise
| houses were late getting the memo. About 12 years ago, I
| switched to a DE role/title and held it ever since. I worked in
| a variety of startups doing DE - moving data from over here to
| over there, with a variety of tools from orchestration
| frameworks to homegrown code in a variety of languages.
|
| About six years ago, I walked into a local hospital to
| interview for a DE role and it was very clear that their
| definition of DE was different than mine. The whole dept worked
| in nothing but SQL. I thought I was good with SQL, but they
| absolutely crucified me on SQL and data architecture theories.
| I ended up getting kicked over to a software engineering role,
| doing DE in another capacity, which made more sense for me.
|
| Only now I'm hearing that they're migrating to other tools like
| dbt and requiring their DEs to learn programming languages.
| buscoquadnary wrote:
| Well my understanding is that a Data Engineer is basically just
| a DevOps engineer but instead of building infra to run
| applications they build infra to process, sanitize and
| normalize data.
| Avalaxy wrote:
| Imho that is absolutely not doing the role justice. For some
| people that may hold true, but I would expect a data engineer
| to know everything about distributed systems, database
| indexes, how different databases work and why you pick them,
| partitioning, replication, transactions/locking. These are
| topics a software engineer is typically familiar with. A
| DevOps engineer wouldn't be.
| hbarka wrote:
| Or to denormalize data - a distinction the data engineer
| would be most familiar with, including the why and the how.
| tbarrera wrote:
| Author here - Of course, data engineering involves building
| infra and being knowledgeable about DevOps practices, but
| that's not the only area data engineers should be familiar
| with. There are many, many more! In my personal experience,
| sometimes we end up not using DevOps best practices because
| we are spread too thin. That's why I believe in specialization
| within data engineering and the surge of "data reliability
| engineer" and similar titles.
| 3minus1 wrote:
| Yeah, maybe this will happen. Where I work (FAANG), I know that
| DEs get lower compensation than SWEs and SREs.
| rectang wrote:
| I'm a software dev who's been bumping up against the data
| engineering field lately, and I've been dismayed as to how many
| popular tools shunt you towards unmaintainable, unevolvable
| system design.
|
| - A predilection for SQL, yielding "get the answer right once"
| big-ball-of-SQL solutions which are infeasible to debug or modify
| without causing regressions.
|
| - Poor support for unit testing.
|
| - Poor support for version control.
|
| - Frameworks over libraries (because the vendors want to lock you
| in).
|
| > data engineers have been increasingly adopting software
| engineering best practices
|
| We can only hope. I think it's more likely that in the near term
| data engineers will get better and better at prototyping within
| low-code frameworks, and that transitioning from the prototype to
| an evolvable system will get harder.
| MrPowers wrote:
| > A predilection for SQL, yielding "get the answer right once"
| big-ball-of-SQL solutions which are infeasible to debug or
| modify without causing regressions.
|
| Yea, thankfully some frameworks have Python / Scala APIs that
| let you abstract "SQL logic" into programmatic functions that
| can be chained and reused independently to avoid the big-ball-
| of-SQL problem. The best ones also allow for SQL because that's
| the best way to express some logic.
|
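| For example, a minimal sketch of that chaining pattern in
| PySpark (the column names are made up; `df` is some input
| DataFrame):
|
|       from pyspark.sql import DataFrame
|       import pyspark.sql.functions as F
|
|       def with_full_name(df: DataFrame) -> DataFrame:
|           return df.withColumn(
|               "full_name", F.concat_ws(" ", "first", "last"))
|
|       def with_is_adult(df: DataFrame) -> DataFrame:
|           return df.withColumn("is_adult", F.col("age") >= 21)
|
|       # Each piece is testable on its own, then composed:
|       result = df.transform(with_full_name).transform(with_is_adult)
|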
| > Poor support for unit testing.
|
| I've written pandas, PySpark, Scala Spark, and Dask testing
| libs. Not sure which framework you're referring to.
|
| > Poor support for version control.
|
| The execution platform should make it easy to package up code
| in JAR files / Python wheel files and attach them to your
| cluster, so you can version control the business logic. If not,
| yea, that's a huge limitation.
|
| > Frameworks over libraries (because the vendors want to lock
| you in)
|
| Not sure what you're referring to, but interested in learning
| more.
| forgetfulness wrote:
| They were talking about the "modern data stack" no doubt.
|
| The trend has been to shift as much work as possible to the
| current generation of Data Warehouses, which abstract the
| programming model that Spark on columnar storage provided
| behind a SQL-only interface, reducing the space where you'd use
| Spark.
|
| It makes it very accessible to write data pipelines using dbt
| (which outcompeted Dataform, though the latter is still
| kicking), but then you don't have the richer programming
| facilities, stricter type systems, tooling, and practices of
| Python or Scala programming. You're in the world of SQL, set
| back a decade or two in testing, checking, and a culture of
| using them, and with few tools to organize your code.
|
| That is, if the team has resisted the siren songs of a myriad
| of cloud low-code platforms for this or the other, with even
| fewer facilities to keep the data pipelines under control, be
| it that we call control any of: access, versioning,
| monitoring, data quality, anything really.
| morelisp wrote:
| Let me offer the more blunt materialist analysis: Data
| engineers are being deskilled into data analysts and too
| blinded by shiny cloud advertisements to notice.
|
| (In this view though, "lack of tests" or whatever is the
| least concern - until someone figures out how to spin up
| another expensive cloud tool selling "testable queries".)
| forgetfulness wrote:
| The "data engineer" became a distinct role to bring over
| Software Engineering practices to data processing; such
| as those practices are, they were a marked improvement
| over their absence.
|
| Building a bridge from one shore to the other with
| application programming languages and data processing
| tools that worked much closer to other forms of
| programming was a huge part of that push.
|
| Of course, the big data tools were intricate machines
| that were easy to learn and very hard to master, and data
| engineers had to be pretty sophisticated.
|
| So, it became cheaper to move much of that apparatus to
| data warehouses and, as you said, commoditize the
| building of data pipelines that way.
|
| Software is as widespread as it is today because in every
| generation the highly skilled priestly classes that were
| needed to get the job done were displaced by people with
| less training enabled by new tools or hardware; else it'd
| be all rocket simulations done by PhD physicists still.
|
| But the technical debt will be hefty from this shift.
| chrisjc wrote:
| FYI
|
| > write data pipelines using dbt (which outcompeted
| Dataform, though the latter is still kicking), but then you
| don't have the richer programming facilities, stricter type
| systems, tooling, and practices of Python or Scala
| programming. You're in the world of SQL...
|
| Recently announced and limited to only a handful of data
| platforms, but dbt now supports Python models.
|
| https://docs.getdbt.com/docs/building-a-dbt-
| project/building...
| MrPowers wrote:
| > The trend has been to shift as much work as possible to the
| current generation of Data Warehouses, which abstract the
| programming model that Spark on columnar storage provided
| behind a SQL-only interface, reducing the space where you'd
| use Spark.
|
| I feel like there are some data professionals that
| only want to use SQL. Other data professionals only want to
| use Python. I feel like the trend is to provide users with
| interfaces that let them be productive. I could be
| misreading the trend of course.
| morelisp wrote:
| It's very unclear to me that anyone is more productive
| under these new tooling stacks. I'm certain they're not
| more productive commensurately with new costs and long-
| term risks.
| marcosdumay wrote:
| > stricter type systems ... the practices of Python or
| Scala
|
| I do understand what you are talking about. But I really
| think you and the OP are both complaining about the wrong
| problem.
|
| SQL doesn't require bad practices, doesn't inherently harm
| composability (in the way the OP was referring to), and doesn't
| inherently harm verification. Instead, it has stronger
| support for many of those than the languages you want to
| replace it with.
|
| The problems you are talking about are very real. But they
| do not come from the language. (SQL does bring a few
| problems by itself, but they are much more subtle than
| those.)
| forgetfulness wrote:
| At least BigQuery does a fair bit of typechecking, and
| gives error messages in a way that's on par with
| application programming (e.g. not letting you pass a
| timestamp to a DATE function and stating that there's no
| matching signature).
|
| But a tool that doesn't "require" bad practices but
| doesn't require good practices either makes your work
| harder in the long run.
|
| Tooling is poor: the best IDE-similes you got until
| recently were of the type that connects to a live
| environment but doesn't tie into your codebase, and
| encourages you to put your code directly on the database
| rather than in version control - the problems of developing
| with a REPL, and little in the way of mitigating them. I'm
| talking of course of the problem of having view and
| function definitions live in the database with no tools
| to statically navigate the code.
|
| Testing used to be completely hand rolled if anyone
| bothered with it at all.
|
| That was until now, when data pipeline orchestration
| tools exist and let you navigate the pipeline as a
| dependency graph - a marked improvement - but until dbt's
| Python version is ready for production, we're talking
| here of a graph of Jinja templates and YAML definitions,
| with modest support for unit testing.
|
| Dataform is a bit better but virtually unknown and was
| greatly hindered by the Google acquisition.
|
| Functions have always been clunky and still are.
|
| RDDs and then, to a lesser extent, Dataframes offered a
| much stronger programming model, but they were still
| subject to a lack of programming discipline from data
| engineers in many shops. The results of that, however,
| are on a different scale from those of undisciplined SQL
| programming, and it's downright hard to be disciplined
| when using SQL.
|
| I feel the trend of moving from ETL to ELT shouldn't have
| been an unquestioning transition to untyped Dataframes and
| then SQL.
| mynameisash wrote:
| I'm also a software engineer, though I've had the unofficial
| title "data engineer" applied to me for quite some time now.
|
| The more I work with tools like Spark, the more dissatisfied I
| become with the data engineering world. Spark is a hot mess of
| fiddling with configuration - I've lost more productivity to
| dealing with memory limit, executor count, and other
| configuration than I think is reasonable.
|
| Pandas is another one. It was good enough at making quick
| processing concise that it got significant uptake and became the
| de facto standard. The API is a pain, though, and processing is
| slow. Now,
| couple Pandas and Spark in your day-to-day job and you get what
| I see from my data science colleagues: "I'll hack together some
| processing in Pandas until my machine can't handle any more
| data, at which point I'll throw it into Spark and provision a
| bunch of nodes to do that work for me." I don't mean that to
| sound pejorative, as they're generally just trying to be
| productive, but there's so little attention paid to real
| efficiency in the frameworks and infrastructure that we're
| blowing through compute, memory, and person-hours unnecessarily
| (IMHO).
| rectang wrote:
| > Pandas is another one.
|
| But at least if I write a transform in pandas it's
| straightforward to unit test it: create a DataFrame with some
| dummy data, send it through the function which wraps the
| transform, test that what comes out is what's expected.
|
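| For instance, a minimal sketch of that kind of test
| (hypothetical transform and columns):
|
|       import pandas as pd
|
|       def add_total(df: pd.DataFrame) -> pd.DataFrame:
|           return df.assign(total=df["price"] * df["qty"])
|
|       def test_add_total():
|           out = add_total(pd.DataFrame({"price": [2.0], "qty": [3]}))
|           expected = pd.DataFrame(
|               {"price": [2.0], "qty": [3], "total": [6.0]})
|           pd.testing.assert_frame_equal(out, expected)
|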
| Validating a transform done in SQL is not nearly as
| straightforward. For starters it needs to be an integration
| test not a unit test (because you need a database). And
| that's assuming there's even a way to hook unit tests into
| your framework.
|
| I'm not a huge fan of Pandas -- it's _way_ too prone to
| silent failure. I've written wrappers around e.g. read_csv
| which are there to defeat all the magic. But at least I can
| do that with Python code instead of being stuck with the big-
| ball-of-SQL (e.g. some complicated view SELECT statement that
| does a bunch of JOINs, CASE tests and casts).
| dang wrote:
| Please stick to plain text - formatting gimmicks are
| distracting, not the convention here, and tend to start an arms
| race of copycats.
|
| I've put hyphens there now. Your comment is otherwise fine of
| course.
| nmarinov wrote:
| Out of curiosity, could you give or point to an example of
| "formatting gimmicks"?
|
| I tried searching the FAQ but only found the formatdoc guide:
| https://news.ycombinator.com/formatdoc
| rectang wrote:
| OT...
|
| I used one of the Unicode bullets instead of a hyphen (*).
| I understand what dang is getting at here -- it's in the
| same spirit as stripping emojis (because otherwise we'd be
| drowning in them). I'm a past graphic designer so I'm
| accustomed to working with the medium to achieve maximum
| communicative impact, but I don't mind operating within
| constraints.
|
| The convention of using hyphens to start a bulleted list
| isn't officially documented AFAIK. Having to separate
| bulleted list items with double newlines is a little weird,
| but it's fine.
| webshit2 wrote:
| As someone who knows nothing about this stuff, I'm looking at the
| "Data Mart" wiki page: https://en.wikipedia.org/wiki/Data_mart.
| Ok, so the entire diagram here is labelled "Data Warehouse", and
| within that there's a "Data Warehouse" block which seems to be
| solely comprised of a "Data Vault". Do you need a special data
| key to get into the data vault in the data warehouse? Okay,
| naturally the data marts are divided into normal marts and
| strategic marts - seems smart. But all the arrows between
| everything are labelled "ETL". Seems redundant. What does it mean
| anyway? Ok apparently it's just... moving data.
|
| Now I look at
| https://en.wikipedia.org/wiki/Online_analytical_processing.
| What's that? First sentence: "is an approach to answer multi-
| dimensional analytical (MDA)". I click through to
| https://en.wikipedia.org/wiki/Multidimensional_analysis ... MDA
| "is a data analysis process that groups data into two categories:
| data dimensions and measurements". What the fuck? Who wrote this?
| Alright, back on the OLAP wiki page... "The measures are placed
| at the intersections of the hypercube, which is spanned by the
| dimensions as a vector space." Ah yes, the intersections... why
| not keep math out of it if you have no idea how to talk about it?
| Also, there's no actual mention of why this is considered
| "online" in the first place. I feel like I'm in a nightmare where
| the pandas documentation was rewritten in MBA-speak.
| rjbwork wrote:
| It's a difficult sphere of knowledge to penetrate. All of that
| is perfectly coherent to me, FWIW.
|
| From first principles, I can highly recommend Ralph Kimball's
| primer, The Data Warehouse Toolkit: The Definitive Guide to
| Dimensional Modeling.[1]
|
| [1]https://www.amazon.com/gp/product/B00DRZX6XS
___________________________________________________________________
(page generated 2022-10-24 23:00 UTC)