[HN Gopher] How to Become a Data Engineer in 2021
       ___________________________________________________________________
        
       How to Become a Data Engineer in 2021
        
       Author : adilkhash
       Score  : 175 points
       Date   : 2021-01-11 12:49 UTC (9 hours ago)
        
 (HTM) web link (khashtamov.com)
 (TXT) w3m dump (khashtamov.com)
        
       | josephmosby wrote:
       | (source for everything following: I recently hired entry-level
       | data engineers)
       | 
        | The experience required differs dramatically between
        | [semi]structured transactional data moving into data warehouses
        | and highly unstructured data that the data engineer has to do
       | a lot of munging on.
       | 
       | If you're working in an environment where the data is mostly
       | structured, you will be primarily working in SQL. A LOT of SQL.
       | You'll also need to know a lot about a particular database stack
       | and how to squeeze it. In this scenario, you're probably going to
       | be thinking a lot about job-scheduling workflows, query
       | optimization, data quality. It is a very operations-heavy
       | workflow. There are a lot of tools available to help make this
       | process easier.
       | 
       | If you're working in a highly unstructured data environment,
       | you're going to be munging a lot of this data yourself. The
       | "operations" focus is still useful, but at the entry level data
       | engineer, you're going to be spending a lot more time thinking
       | about writing parsers and basic jobs. If you're focusing your
       | practice time on writing scripts that move data in Structure A in
       | Place X to Structure B in Place Y, you're setting yourself up for
       | success.
       | 
       | I agree with a few other commentators here that Hadoop/Spark
       | isn't being used a lot in their production environments - but -
       | there are a lot of useful concepts in Hadoop/Spark that are
       | helpful for data engineers to be familiar with. While you might
       | not be using those tools on a day-to-day basis, chances are your
        | hiring manager used them when she was in your position, and it
        | will give you an opportunity to get to know a few tools at a
        | deeper level.
        
         | Mauricebranagh wrote:
          | It does rather depend on what sort of data. I bet a data
          | engineer at CERN or JPL has quite a different set of required
          | skills to, say, Google or a company playing at data science
          | because it's the next big thing.
         | 
          | I should imagine at CERN etc. knowing which end of a soldering
          | iron gets hot might still be required in some cases.
         | 
          | I recall back in the _mumble_ extracting data from b&w film
          | shot with a high-speed camera by projecting it onto graph
          | paper taped to the wall and manually marking the position of
          | the "object".
        
         | llbeansandrice wrote:
         | > I agree with a few other commentators here that Hadoop/Spark
         | isn't being used a lot in their production environments
         | 
          | I guess I'm the odd man out because that's all I've used for
         | this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.
        
           | teddyuk wrote:
            | I'm also the odd one out; so many enterprises are moving to
            | Spark on Databricks.
        
           | josephmosby wrote:
           | I should have specified more thoroughly.
           | 
           | I am not seeing Spark being chosen for _new_ data eng roll-
           | outs. It is still very prevalent in existing environments
           | because it still works well. (used at $lastjob myself)
           | 
           | However - I am still seeing a lot of Spark for machine-
           | learning work by data scientists. Distributed ML feels like
           | it is getting split into a different toolkit than distributed
           | DE.
        
             | llbeansandrice wrote:
             | I guess it depends on what jobs you're looking for. There's
              | a lot of existing companies/teams (like mine) looking to
             | hire people but we're on the "old stack" using Kafka,
             | Scala, Spark, etc. We don't do any ML stuff but I'm on the
             | pipeline side of it. The data scientists down the line tend
             | to use Hive/SparkSQL/Athena for a lot of work but I'm much
             | less involved with that.
             | 
              | Not all jobs are new pastures, and I think that's forgotten
              | very frequently.
        
         | dominotw wrote:
          | Agree 100% with this comment.
         | 
         | Old stack: Hadoop, spark, hive, hdfs.
         | 
         | New stack: kafka/kinesis, fivetran/stitch/singer,
         | airflow/dagster, dbt/dataform, snowflake/redshift
        
           | disgruntledphd2 wrote:
           | Huh, what replaces Spark in those lists?
           | 
            | For my money, it's the best distributed ML system out there,
           | so I'd be interested to know what new hotness I'm missing.
        
             | dominotw wrote:
             | > best distributed ML system out there
             | 
              | I was comparing it for the "traditional" data engineering
              | stack that used Spark for data munging, transformations,
              | etc.
             | 
             | I don't have much insight into ML systems or how spark fits
              | there. Not all data teams are building 'ML systems' though.
              | The parent comment wasn't referring to any 'ML systems';
              | not sure why that would be automatically inferred when
              | someone mentions a data stack.
        
               | disgruntledphd2 wrote:
               | Yeah, I suppose. I kinda think that distributed SQL is a
               | mostly commoditised space, and wondered what replaced
               | Spark for distributed training.
               | 
               | For context, I'm a DS who's spent far too much time not
               | being able to run useful models because of hardware
               | limitations, and a Spark cluster is incredibly good for
               | that.
               | 
               | Additionally, I'd argue in favour of Spark even for ETL,
               | as the ability to write (and test!) complicated SQL
               | queries in R, Python and Scala was super, super
               | transformative.
               | 
               | We don't really use Spark at my current place, and every
               | time I write Snowflake (which is great, to be fair), I'm
               | reminded of the inherent limitations of SQL and how
               | wonderful Spark SQL was.
               | 
               | I'm weird though, to be fair.
        
             | sails wrote:
             | Snowflake I suppose for the average ML use case. Not for
             | your high-performance ML, but for your average data
             | scientist, maybe?
             | 
             | Edit: I may be wrong[1], would be curious to know what
             | users who've used Spark AND Snowflake would add to the
             | conversation.
             | 
             | [1] https://www.snowflake.com/blog/snowflake-and-spark-
             | part-1-wh...
        
               | marcinzm wrote:
               | Snowflake hits its limits with complex transformations I
                | feel, and not just due to using SQL. Its "type system" is
                | simpler than Spark's, which makes certain operations
               | annoying. There's a lack of UDFs for working with complex
               | types (lists, structs, etc.). Having to write UDFs in
               | Javascript is also not the greatest experience.
        
               | dominotw wrote:
               | > There's a lack of UDFs for working with complex types
               | (lists, structs, etc.). Having to write UDFs in
               | Javascript is also not the greatest experience.
               | 
               | We load our data into SF in json and do plenty of
               | list/struct manipulation using their inbuilt
                | functions[1]. I guess you might have to write a UDF if
                | you are doing something super weird, but the inbuilt
                | functions should get you pretty far 90% of the time.
               | 
               | https://docs.snowflake.com/en/sql-reference/functions-
               | semist...
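                | 
                | For a concrete sketch of that pattern (table/column
                | names made up, assuming snowflake-connector-python):
                | 
                |   import snowflake.connector
                | 
                |   # placeholder credentials
                |   conn = snowflake.connector.connect(
                |       account="acct", user="me", password="...",
                |       warehouse="wh", database="db", schema="public")
                | 
                |   # payload is a VARIANT column loaded from json;
                |   # FLATTEN explodes the nested items array into rows
                |   rows = conn.cursor().execute("""
                |       SELECT f.value:sku::string, f.value:qty::int
                |       FROM raw_orders,
                |            LATERAL FLATTEN(input => payload:items) f
                |   """).fetchall()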
        
             | somurzakov wrote:
             | distributed ML != Distributed DWH.
             | 
                | Distributed ML is tough to train because you have very
                | little control over the training loop. I personally
                | prefer single-server training even on large datasets, or
                | switching to online learning algos that do
                | train/inference/retrain at the same time.
             | 
                | As for Snowflake, I haven't heard of people using
                | Snowflake to train ML, but Snowflake is a killer in
                | managed distributed DWH that you don't have to tinker
                | with and tune.
        
               | disgruntledphd2 wrote:
               | So do I, theoretically at least.
               | 
               | But Spark is super cool and actually has algorithms which
               | complete in a reasonable time frame on hardware I can get
               | access to.
               | 
               | Like, I understand that the SQL portion is pretty
               | commoditised (though even there, SparkSQL python and R
               | API's are super nice), but I'm not aware of any other
               | frameworks for doing distributed training of ML models.
               | 
               | Have all the hipsters moved to GPUs or something? \s
               | 
                | > Snowflake is a killer in managed distributed DWH that
                | > you don't have to tinker with and tune
               | 
               | It's so very expensive though, and their pricing model is
               | frustratingly annoying (why the hell do I need tickets?).
               | 
               | That being said, tuning Spark/Presto or any of the non-
               | managed alternatives is no fun either, so I wonder if
               | it's the right tradeoff.
               | 
               | One thing I really, really like about Spark is the
               | ability to write Python/R/Scala code to solve the
               | problems that cannot be usefully expressed in SQL.
               | 
               | All the replies to my original comment seem to forget
               | that, or maybe Snowflake has such functionality and I'm
               | unaware of it.
        
               | marcinzm wrote:
               | >I'm not aware of any other frameworks for doing
               | distributed training of ML models.
               | 
               | Tensorflow, PyTorch (not sure if Ray is needed) and Mxnet
               | all support distributed training across CPUs/GPUs in a
               | single machine or multiple machines. So does XGBoost if
               | you don't want deep learning. You can then run them with
               | KubeFlow or on whatever platform your SaaS provider has
               | (GCP AI Platform, AWS Sagemaker, etc.).
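                | 
                | E.g. a minimal PyTorch DDP loop is roughly this (a
                | sketch; RANK/MASTER_ADDR etc. are env vars set by the
                | torch.distributed.launch launcher):
                | 
                |   import torch
                |   import torch.distributed as dist
                |   from torch.nn.parallel import \
                |       DistributedDataParallel as DDP
                | 
                |   dist.init_process_group("gloo")  # "nccl" on GPUs
                |   model = DDP(torch.nn.Linear(10, 1))
                |   opt = torch.optim.SGD(model.parameters(), lr=0.01)
                |   for _ in range(100):
                |       x, y = torch.randn(32, 10), torch.randn(32, 1)
                |       loss = ((model(x) - y) ** 2).mean()
                |       opt.zero_grad()
                |       loss.backward()  # grads sync across workers here
                |       opt.step()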
               | 
               | edit:
               | 
               | >All the replies to my original comment seem to forget
               | that, or maybe Snowflake has such functionality and I'm
               | unaware of it.
               | 
               | Snowflake has support for custom Javascript UDFs and a
               | lot of built in features (you can do absurd things with
               | window functions). I also found it much faster than
               | Spark.
        
               | disgruntledphd2 wrote:
               | > Snowflake has support for custom Javascript UDFs and a
               | lot of built in features (you can do absurd things with
               | window functions). I also found it much faster than
               | Spark.
               | 
                | UDF support isn't really the same, to be honest. You're
                | still a prisoner of the select-from pattern. Don't get me
               | wrong, SQL is wonderful where it works, but it doesn't
               | work for everything that I need.
               | 
               | I completely agree that it's faster than Spark, but it's
               | also super-expensive and more limited. I suspect it would
               | probably be cheaper to run a managed Spark cluster vs
               | Snowflake and just eat the performance hit by scaling up.
               | 
                | > Tensorflow, PyTorch (not sure if Ray is needed) and
                | > Mxnet all support distributed training across CPUs/GPUs
                | > in a single machine or multiple machines. So does
                | > XGBoost if you don't want deep learning.
               | 
               | I forgot about Xgboost, but I'm a big fan of unsupervised
               | methods (as input to supervised methods, mostly) and
               | Spark has a bunch of these. I haven't ever tried to do
               | it, but based on my experience of running deep learning
               | frameworks and distributed ML, I suspect the combination
                | of both to be exponentially more annoying ;) (And I deal
               | mostly with structured data, so it doesn't buy me as
               | much).
               | 
               | > You can then run them with KubeFlow or on whatever
               | platform your SaaS provider has (GCP AI Platform, AWS
               | Sagemaker, etc.).
               | 
               | Do people really find these tools useful? Again, I'm not
               | really sure what SageMaker (for example) buys me on AWS,
               | and their pricing structure is so opaque that I'm
               | hesitant to even invest time in it.
        
               | marcinzm wrote:
               | >UDF support isn't really the same, to be honest. You're
               | still prisoner of the select from pattern. Don't get me
               | wrong, SQL is wonderful where it works, but it doesn't
               | work for everything that I need.
               | 
               | Not sure how it's different from what you can do in Spark
               | in terms of data transformations. Taking a list of
               | objects as an argument basically allows your UDF to do
               | arbitrary computations on tabular data.
               | 
               | > I forgot about Xgboost, but I'm a big fan of
               | unsupervised methods (as input to supervised methods,
               | mostly) and Spark has a bunch of these.
               | 
               | That's true, distributed unsupervised methods aren't done
                | in most other places I know of. I'm guessing there are
                | ways to do that with neural networks, although I haven't
                | looked into it. The datasets I deal with have structure
                | in them between events even if they're unlabeled.
               | 
               | >I completely agree that it's faster than Spark, but it's
               | also super-expensive and more limited. I suspect it would
               | probably be cheaper to run a managed Spark cluster vs
               | Snowflake and just eat the performance hit by scaling up.
               | 
               | I used to do that on AWS. For our use case, Athena ate
               | its lunch in terms of performance, latency and cost by an
               | order of magnitude. Snowflake is priced based on demand
               | so I suspect it'd do likewise.
        
           | sseppola wrote:
            | Can you elaborate more on the "roles" of the "new stack"? To
            | me dbt/dataform and airflow/dagster are quite similar, so why
            | do you need one of each? fivetran/stitch/singer are all new
            | to me.
        
             | Grimm1 wrote:
              | Airflow allows for more complex transformations of data
              | that SQL may not be suited for. DBT is largely stuck
              | utilizing the SQL capabilities of the warehouse it sits on;
              | so, for instance, with Redshift you have a really bad time
              | working with JSON-based data in DBT, and Airflow can solve
              | this problem. That's one example, but last I was working
              | with it we found DBT was great for analytical modeling-type
              | transformations, but for getting whatever munged-up data
              | into a usable format in the first place, Airflow was king.
             | 
             | We also trained our analysts to write the more analytical
             | DBT transformations which was nice, shifted that work onto
             | them.
             | 
              | Don't get me wrong though, you can get really far with just
              | DBT + Fivetran; in fact, it removes like 80% of the really
              | tedious but trivial ETL work. Airflow is just there for
              | the last 20%.
             | 
             | (Plus you can then utilize airflow as a general job
             | scheduler)
        
             | pjot wrote:
              | I've used all of these, so I might be able to offer some
              | perspective here.
             | 
             | In an ELT/ETL pipeline:
             | 
              | Airflow covers the "extract" portion of the pipeline
             | and is great for scheduling tasks and provides the high-
             | level view for understanding state changes and status of a
             | given system. I'll typically use airflow to schedule a job
             | that will get raw data from xyz source(s), do something
             | else with it, then drop it into S3. This can then trigger
             | other tasks/workflows/slack notifications as necessary.
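              | 
              | A bare-bones version of that DAG looks something like this
              | (a sketch; names made up, Airflow 2.x imports):
              | 
              |   from datetime import datetime
              |   from airflow import DAG
              |   from airflow.operators.python import PythonOperator
              | 
              |   def extract_to_s3():
              |       ...  # pull from xyz source(s), drop into S3
              | 
              |   with DAG("raw_ingest",
              |            start_date=datetime(2021, 1, 1),
              |            schedule_interval="@daily") as dag:
              |       extract = PythonOperator(
              |           task_id="extract_to_s3",
              |           python_callable=extract_to_s3)
              |       # downstream tasks/notifications chain off it,
              |       # e.g. extract >> transform >> notify_slack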
             | 
             | You can think of dbt as the "transform" part. It really
             | shines with how it enables data teams to write modular,
             | testable, and version controlled SQL - similar to how a
             | more traditional type developer writes code. For example,
              | when modeling a schema in a data warehouse, all of the
              | various source tables, transformation and aggregation
              | logic, as well as materialization methods, are able to
              | live in their own files and be referenced elsewhere
              | through templating. All of the table/view dependencies are
             | handled under the hood by dbt. For my organization, it
             | helped untangle the web of views building views building
             | views and made it simpler to grok exactly what and where
             | might be changing and how something may affect something
             | else downstream. Airflow could do this too in theory, but
             | given you write SQL to interface with dbt, it makes it far
             | more accessible for a wider audience to contribute.
             | 
             | Fivetran/Stitch/Singer can serve as both the "extract" and
             | "load" parts of the equation. Fivetran "does it for you"
             | more or less with their range of connectors for various
             | sources and destinations. Singer simply defines a spec for
             | sources (taps) and destinations (targets) to be used as a
             | standard when writing a pipeline. I think the way Singer
             | drew a line in the sand and approached defining a way of
             | doing things is pretty cool - however active development on
              | it really took a hit when the company was acquired. Stitch
              | came up with the Singer spec, and the service they offer
              | manages and schedules the various taps and targets for
              | you.
        
         | alexpetralia wrote:
         | Great points. It depends on where the business is at, the scale
         | of their data, how processed their data is, and the
         | timeliness/accuracy requirements of that data.
        
       | justinzollars wrote:
       | Amazon introduced Step Functions, which are very nice to dig into
       | and a helpful skill for Data Engineering.
        
       | [deleted]
        
       | dominotw wrote:
        | I've been in this space for the last 6 yrs or so and my Scala
        | usage has gone down to zero. Not worth learning Scala.
        
         | switch007 wrote:
         | What languages are worth learning?
        
           | dominotw wrote:
           | SQL has taken over the space completely. 90% of data munging
           | and transforms happen via SQL.
           | 
              | I would learn Python. It's the number one language outside
              | SQL.
        
             | sonofaragorn wrote:
             | What do you use if you need to process hundreds of GBs of
             | data?
        
               | dominotw wrote:
                | SQL on Snowflake.
        
           | johanneskanybal wrote:
           | sql, python, terraform, maybe some basic java. Airflow is
           | pretty common. Whatever the company is migrating away from.
            | As long as you're good at one of those and can pick up the
           | rest on the fly you should be fine to start out.
           | 
           | edit: Guess this was pretty much in the post.
        
         | pgoggijr wrote:
         | This is an anecdote - plenty of firms are using Scala in their
         | data engineering stacks and it's a great tool for the job.
         | 
          | While maybe not strictly necessary per se, it's a great way to
          | get a foot in the door, and it offers exposure to advanced
          | type systems and functional programming (I personally find it
          | to be a really fun language to write in, to boot).
        
           | st1x7 wrote:
           | > plenty of firms are using Scala in their data engineering
           | stacks
           | 
           | Isn't that just a result of everyone being into Spark a few
           | years ago?
        
           | dominotw wrote:
           | > it's a great tool for the job.
           | 
            | What job can this do that can't be done via SQL? Dealing
            | with unstructured data?
        
         | sidlls wrote:
         | Scala, when it's not used because it's just what someone
         | learned the ropes with, is the Haskell of data science and
         | machine learning: it's what people use when they want to
         | inflate their credentials and/or egos.
        
       | ectoplasmaboiii wrote:
       | Is anyone here using kdb+/q for data engineering, specifically
       | outside of finance?
        
       | darth_avocado wrote:
        | We want all these skills, yet we'll give you a separate title
        | and pay you less than a software engineer. Meanwhile, front-end
        | software engineers are still software engineers and get high pay.
        
         | alexpetralia wrote:
         | I don't think data engineers are paid less than software
         | engineers.
        
           | darth_avocado wrote:
            | They are. I should know; I've worked as one for years,
            | including at big tech companies. E.g., FB has lower pay
            | than SWE, lower RSUs, etc., and you can only get SWE pay if
            | you transition into one, and that requires you to go through
            | an internal interview process.
        
             | red_hare wrote:
             | That's totally contrary to my experience where Data
             | Engineers are considered specialized SWE and paid more.
             | 
             | But I've never worked for FB.
        
               | darth_avocado wrote:
                | You are right in the sense that if you look at average
                | SWE salaries and data engineering salaries, the average
                | salary is higher for data engineers, because starting
                | salaries for data engineers tend to be higher given all
                | the skills that are needed, while plenty of SWE positions
                | require no more than a degree in CS. But
                | if you start comparing salaries at maybe a senior level
                | (4-5 yoe+), the salaries for SWEs start becoming a lot
                | higher than DEs'. And again, I've worked at different
                | companies, big and small, and this holds true for all
                | companies that have "Data Engineer" titles. There are of
                | course companies like Netflix where you are a data
                | engineer but still get a SWE title and get paid the same.
        
               | somurzakov wrote:
                | Experienced data engineers should graduate to data
                | architect/ML engineer roles, and this way they can get on
                | par with SWEs. Pls correct me if I am wrong.
        
               | darth_avocado wrote:
               | You're right, but most companies do not have those
               | positions formalized and therefore you're expected to do
               | those as part of your job, but not gain the financial
               | benefits. Also, there is a big disparity in these titles
               | and what the duties entail, which inherently again feeds
               | into the problem.
        
           | tharne wrote:
           | In my experience, they get paid more.
        
             | Grimm1 wrote:
             | Same here
        
       | wheaties wrote:
       | ...and nothing of basic statistics? Data Science people want to
       | know about your data pipeline and have some quantification of the
       | quality of that data. Also, monitoring data pipelines for data
        | integrity often relies upon a statistical test. You don't need to
        | go as far as Bayesian, but you do need to understand when a median
        | goes way off or if the data is bi-modal, etc.
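        | 
        | Even something as dumb as this catches a lot (a sketch;
        | threshold made up):
        | 
        |   from statistics import median
        | 
        |   def check_batch(values, hist_median, tol=0.5):
        |       """Flag a batch whose median drifts more than
        |       tol (50%) from the historical median."""
        |       m = median(values)
        |       if abs(m - hist_median) > tol * abs(hist_median):
        |           raise ValueError(
        |               f"median {m} vs historical {hist_median}")
        |       return m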
        
         | diehunde wrote:
         | That should be assumed in the "engineer" part of the role.
        
           | ZephyrBlu wrote:
            | Yeah, I would definitely expect an engineer to have a grasp
            | of basic statistics such as mean, median, and mode, and to be
            | able to interpret statistical graphs on a basic level
            | (modality, skewness, shape, etc.).
        
         | adilkhash wrote:
         | a good catch, thanks!
        
       | sseppola wrote:
       | Great resource, thanks for sharing it! I will dig deeper into the
       | resources linked here as there's a lot I have never seen before.
       | The main topics are more or less exactly what I've found to be
       | key in this space in the last 2 months trying to wrap my head
       | around data engineering in my new job.
       | 
        | What I'm still trying to grasp is, first, how to assess the big
        | data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases
        | (mostly ETL). It just seems like Spark wins because it's the most
        | used, but I have no idea how to differentiate these tools beyond
        | the general streaming/batch/real-time taglines. Secondly, how to
        | assess the "pipeline orchestrator" for our use cases, where, like
        | Spark, Airflow usually comes out on top because of usage. Would
        | love to read more about this.
       | 
        | Currently I'm reading Designing Data-Intensive Applications by
        | Kleppmann, which is great. I hope this will teach me the
       | fundamentals of this space so it becomes easier to reason about
       | different tools.
        
       | diehunde wrote:
       | Nice article. From experience I would say the SQL knowledge
       | should be advanced though. Not intermediate.
        
       | StreamBright wrote:
       | 2021? More like 2010. Hadoop is getting deprecated rapidly and
       | more companies split their write and read workloads. Separated
        | storage and compute are also popular. Scala is not used that
        | much; I think it is not worth the time investment. More and more
        | companies go for Kotlin instead of Java when they want to tap
        | into the Java ecosystem.
        
         | adilkhash wrote:
          | Hadoop is still widely used in enterprises (especially in
          | banks); if you have experience working with the Hadoop
          | ecosystem, it is a big plus anyway.
        
           | StreamBright wrote:
           | Yes, that is the status quo.
           | 
            | There are also some trends:
           | 
           | https://trends.google.com/trends/explore?date=today%205-y&ge.
           | ..
        
         | smattiso wrote:
          | Are you working in this field? I am looking for a consultant to
          | set up a modern data processing pipeline for a data-driven
          | hardware product I am building.
        
           | StreamBright wrote:
            | Yes, I am moving companies to their next data pipeline; that
            | is my specialty. I have added my email to my profile, you can
            | reach out to me.
        
       | u678u wrote:
        | Incidentally, does anyone have resources for SMALL data? E.g. a
        | few MB at a time, but requiring the same ETL, scheduling, and
        | traceability. I'd love some lite versions of big-data tools, but
        | it needs to be simple, small and cheap.
        
         | musingsole wrote:
         | Most orgs working with small data I've seen will just fall back
          | to the tutorial version of some big data tools (oftentimes
         | just eating the unused infrastructure cost for something like
         | Hadoop when they're generating biweekly reports). Most project
         | managers have a dream of their project scaling up and want to
         | be prepared should a dream become a reality. And if you "under-
         | engineer" (by which I mean specifically engineer for the
         | problem the company is facing), you'll get called out by every
         | armchair developer for not going with the "obvious, best
         | solution."
         | 
         | I'm not bitter; you're bitter. /s
        
           | u678u wrote:
            | Yeah, I'm bitter. I want to bring structure, tools and
            | discipline to small-scale data gathering, but the big tools
            | are just too time-consuming to get up and keep running with
            | just a few hours to spend.
        
         | hermitcrab wrote:
         | Take a look at our https://www.easydatatransform.com tool. It
         | is a drag and drop data munging tool for datasets up to a few
         | million rows. It runs locally on Windows or Mac. You should be
         | able to install it and start transforming your data within a
         | few minutes. It doesn't have a built in scheduler (yet), but
         | you can run it from the command line.
         | 
         | Excel Power Query is also quite lightweight. But is pretty
         | klunky in my (biased) opinion.
        
           | u678u wrote:
            | Thanks, there are enough workstation tools, but I want an
            | automated tool that runs on a server.
        
         | kfk wrote:
          | I have been working with small data for a few years. We built
          | an internal library to move data in/out of systems and then
          | schedule this into jobs. We mostly leverage S3 and Spectrum.
         | The major complexities we found were in scheduling, proxies and
         | fetching raw data from legacy applications.
        
         | ABeeSea wrote:
          | In AWS, Lambdas and Step Functions.
         | 
         | https://aws.amazon.com/step-functions/
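          | 
          | Each step can be a tiny handler (a sketch; bucket/key names
          | made up), and Step Functions chains, retries and traces them:
          | 
          |   import json
          |   import boto3
          | 
          |   s3 = boto3.client("s3")
          | 
          |   def handler(event, context):
          |       # read the small batch named in the event
          |       obj = s3.get_object(Bucket=event["bucket"],
          |                           Key=event["key"])
          |       records = json.loads(obj["Body"].read())
          |       cleaned = [r for r in records if r.get("id")]
          |       out_key = event["key"] + ".clean.json"
          |       s3.put_object(Bucket=event["bucket"],
          |                     Key=out_key,
          |                     Body=json.dumps(cleaned))
          |       # the returned dict feeds the next state
          |       return {"bucket": event["bucket"], "key": out_key}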
        
           | u678u wrote:
           | Thanks, presumably this is similar to AWS Airflow too.
        
       | snidane wrote:
       | In data engineering your goal is "standardization". You can't
       | afford every team using their unique tech stack, their own
       | databases, coding styles, etc. People leave the company all the
       | time and you as a data engineer always end up with their mess
       | which now becomes your responsibility to maintain. You'd at least
       | be grateful if those people had used the same methods to code
       | stuff as you and your team so that you wouldn't have to become a
       | Bletchley Park decoding expert any time someone leaves. Or you'd
        | hope the tech stack was powerful and flexible enough that
        | people other than engineer types could pick it up and maintain
        | it themselves. They mostly cannot do that, because there is no
        | such powerful system out there. Even when some modern ELT
        | systems get you 80% of the way there, you, the data engineer, are
        | still needed to bridge the
       | gap for the 20% of the cases.
       | 
       | Data Engineering really comes down to being a set of hacks and
       | workarounds, because there is no data processing system which you
       | could use in a standardized systematic way that data analysts,
       | engineers, scientists and anyone else could use. It's kind of a
       | blue-collar "dirty job" of the software world, which nobody
       | really wants to do, but which pays the highest.
       | 
       | There are of course other parts to it, such as managing multiple
       | data products in a systematic way, which engineering minds seem
       | to be best suited for. But the core of data engineering in 2020,
       | I believe, is still implementing hacks and gluing several systems
       | together so as to have a standardized processing system.
       | 
       | Snowflake or Databricks Spark bring you closest to the ideal
       | unified system despite all their shortcomings. But still, you
       | sometimes need to process unstructured jsons, extract stuff from
       | html and xml files, unzip a bunch of zip archives and put them
        | into something that these systems recognize, and only then can
        | you run SQL on it. It is much better than the ETL of the past,
        | where
       | you really had to hack and glue 50% of the system yourself, but
       | it is still nowhere near the ideal system in which you'd simply
       | tell your data analysts: you can do it all yourself, I'm going to
       | show you how. And I won't have to run and maintain a
        | preprocessing job to munge some data into something Spark-
        | recognizable for you.
       | 
        | It is not that difficult to imagine a world where such a system
        | exists and data engineering is not even needed. But you can be
        | damn sure that, before this happens, this position is here to
        | stay and will keep paying well, when 90% of ML and data science
        | is data engineering and cleaning, and all these companies have
        | hired a shitton of data science and ML people who are now trying
        | to justify their salaries by desperately trying to do data
        | engineers' jobs.
        
       | mywittyname wrote:
       | For GCP, our stacks tend to be Composer (Airflow), BigQuery,
       | Cloud Functions, and Tensorflow.
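        | 
        | A typical BigQuery step from Python looks something like this
        | (a sketch; project/table names made up):
        | 
        |   from google.cloud import bigquery
        | 
        |   client = bigquery.Client()  # ambient GCP credentials
        |   job = client.query("""
        |       SELECT channel, COUNT(*) AS sessions
        |       FROM `my_project.analytics.sessions`
        |       WHERE date = CURRENT_DATE()
        |       GROUP BY channel
        |   """)
        |   for row in job:  # waits for the job, then iterates
        |       print(row.channel, row.sessions)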
       | 
       | There's the occasional Hadoop/Spark platform out there, but
       | clients using those tend to have older platforms.
        
         | smattiso wrote:
          | What is your product? I am looking for a consultant to help me
          | set up a good process for a data-driven hardware product.
        
           | mywittyname wrote:
            | The work I do is almost entirely Google Analytics/Ads related.
           | So probably not what you're looking for, but if so, leave
           | your email and I'll reach out!
        
       | zaptheimpaler wrote:
        | Somewhat outdated view. This may be the current stack, but it's
        | outdated now and is slowly being replaced. The new view is not
        | big data pipelines and ETL jobs; it's lambda architecture, live
        | aggregations/materialized views and simple SQL queries on large
        | data warehouses that hide the underlying details. The batch model
        | may still apply to ML, I guess, but I'm no expert there.
        
         | molsongolden wrote:
         | Any resources/guides you'd recommend?
        
       | laichzeit0 wrote:
        | I think it's missing resources for one of the hardest
        | sections: data modelling, like Kimball and Data Vault. That, and
        | maybe a section on modern data infrastructure. I'd put a link to
       | [1] and [2] for a quick overview and probably [3] for more
       | detail.
       | 
       | [1] https://www.holistics.io/books/setup-analytics/ [2]
       | https://a16z.com/2020/10/15/the-emerging-architectures-for-m...
       | [3] https://awesomedataengineering.com/
        
         | markus_zhang wrote:
          | This. I also think modern columnar databases and other
          | techniques somehow make Kimball obsolete, or at least relax it
          | somehow, but I could be very wrong.
          | 
          | For example, we use Vertica, and our DBA told us that Vertica
          | loves wide tables with many columns, which doesn't look very
          | Kimball to me. This gives me some trouble, as I'm not really
          | sure how to model data properly.
        
       | [deleted]
        
       | prions wrote:
       | SQL proficiency is important but I wouldn't say it supersedes
       | programming experience. To me, Data Engineering is a
       | specialization of software engineering, and not something like an
       | analyst who writes SQL all day.
       | 
        | As DE has evolved, the role has transitioned away from
        | traditional low-code ETL tools towards code-heavy tools: Airflow,
        | Dagster, DBT, to name a few.
       | 
       | I work on a small DE team. We don't have the human power to grind
       | out SQL queries for analysts and other teams. Our solutions are
       | platforms and tools we build on top of more fundamental tools
        | that allow other people to get the data themselves. Think
       | tables-as-a-service.
        
         | otter-in-a-suit wrote:
         | I wholeheartedly agree with the "specialization" comment.
         | 
         | Unless you are in a position where you can entirely rely on
         | managed tools that do the work for you and all effort is
         | centered around managing the data, rather than the holistic
         | view of your data pipelines (Talend ETL, Informatica - the
         | "pre-Hadoop" world, if you will, and maybe some modern tools
         | like Snowflake), then a good Data Engineer needs a deep
         | understanding of programming languages, networking, some
          | sysadmin stuff, distributed systems, containerization,
          | statistics, and of course a good "architect" view on the ever-
          | growing zoo of tools and languages with different pros and
          | cons.
         | 
         | Given that at the end of the day, most "Data Pipelines" run on
         | distributed Linux machines, I've seen and solved endless issues
         | with Kernel and OS configurations (noexec flags, ulimits,
         | permissions, keyring limits ...), network bottlenecks,
         | hotspotting (both in networks and databases), overflowing
         | partitions, odd issues on odd file systems, bad partition
         | schemes, a myriad of network issues, JVM flags, needs for
         | auditing and other compliance topics, heavily multi-threaded
         | custom implementations that don't use "standard" tools and rely
         | on language features (goroutines, multiprocessing in Python,
         | Threadpools in Java ...), encoding problems, various TLS and
          | other security challenges, and of course, endless use of GNU
          | tools and other CLI fun that I would not necessarily expect for
         | a pure SQL use case (not discounting the fact that SQL is, in
         | fact, very important).
         | 
         | Not to mention that a lot of jobs / workflows Data Engineers
         | design and write tend to be very, very expensive, especially on
          | managed Clouds - it's generally a good idea to make sure
          | everything works and your engineers understand what they are
          | doing.
        
         | garciasn wrote:
         | TL;DR: SQL is still a DE's best and most ubiquitous
         | tool/language in the space--hands down.
         | 
         | I've led DE teams for the last decade. I have lived through
         | shifts in toolsets, languages, etc. Regardless of platform,
         | languages, model types, etc, etc, etc, etc, the one constant
          | has been SQL with some sort of scripting around it.
         | 
          | Right now, it seems Python is the big wrapper language, whether
          | it's via a DAG or some other means, but that's just the
          | preferred method TODAY. Considering SQL has been around for
          | decades and
         | has outlasted just about every other language and system, many
         | of which have opted for a SQL-like interface on top of their
         | system, I would highly recommend DEs be very strong there.
        
         | cbdumas wrote:
         | SQL proficiency is something I've seen developers of all sorts
         | neglect, which I think is a huge mistake. And relegating SQL to
         | something that just an "analyst" does is an even bigger
         | mistake.
         | 
         | Several times over my career I've been brought in on a project
         | where the team was considering replacing their RDBMS entirely
         | with a no-SQL data store (a huge undertaking!) because they
         | were having "performance problems". In many cases the solution
         | is as simple as adding an index or modifying a query to use an
         | index, but the devs regard it as some kind of wizardry to read
         | a query plan.
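          | 
          | It isn't wizardry - even a toy sqlite session shows the
          | before/after (a sketch):
          | 
          |   import sqlite3
          | 
          |   db = sqlite3.connect(":memory:")
          |   db.execute("CREATE TABLE events"
          |              " (user_id INT, payload TEXT)")
          |   q = ("EXPLAIN QUERY PLAN "
          |        "SELECT * FROM events WHERE user_id = 42")
          |   print(db.execute(q).fetchall())  # SCAN: full scan
          |   db.execute("CREATE INDEX ix_user"
          |              " ON events(user_id)")
          |   print(db.execute(q).fetchall())  # SEARCH ... ix_user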
        
           | Grimm1 wrote:
           | I spent way too large a portion of my last position teaching
           | developers about indexes, query plans and underlying join
           | types and their impact on performance and memory consumption.
        
       | somurzakov wrote:
       | advanced proficiency in SQL and in any scripting language of your
       | choice (C#/powershell, python) is enough to be a data engineer on
       | any technical stack: windows/linux, on-prem/cloud, vendor
       | specific/opensource, literally anything.
        
         | knur wrote:
         | I disagree. That's not enough these days.
         | 
         | If you want to build anything mildly interesting, you need to
         | have a solid background on software engineering (building data
         | pipelines in Spark, Flink, etc. goes way beyond knowing SQL),
         | you need to really understand your runtime (e.g. the JVM, and
         | how to tune it when working with massive amounts of data), you
         | need a bit of knowledge about infrastructure, because some of
         | the most specialized and powerful tools do not have yet an
         | established "way of doing things", and the statefulness nature
         | of them make them different from your typical web app
         | deployment.
         | 
         | Maybe if you want to become a data analyst you only need SQL,
         | and I would still doubt it. But data engineering is a bit
         | different.
        
           | somurzakov wrote:
            | I believe what you described is the job of a Platform
            | Engineer/Systems Engineer/Data Lake Architect, especially the
            | JVM aspect of it. The interesting work is in the beginning,
            | when you build the cluster initially or do a major extension;
            | after that, the ops/maintenance is usually outsourced to
            | cheap labor offshore - so this kind of job is personally not
            | for me.
           | 
            | Spark has a dataframe API which is similar to the pandas API
            | and can be learned in one day, especially if you know Python.
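            | 
            | E.g. a quick sketch (path made up):
            | 
            |   from pyspark.sql import SparkSession
            |   from pyspark.sql import functions as F
            | 
            |   spark = SparkSession.builder.getOrCreate()
            |   df = spark.read.json("s3://bucket/raw/")
            |   (df.filter(F.col("status") == "ok")
            |      .groupBy("country")
            |      .agg(F.count("*").alias("n"))
            |      .show())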
           | 
            | Same for Airflow and other frameworks; it's just a fancy
            | scheduler that anyone can pick up in a couple of days.
        
           | dominotw wrote:
           | > building data pipelines in Spark, Flink, etc. goes way
           | beyond knowing SQL
           | 
            | What if you build your data pipelines in SQL? Curious if you
            | have an example of a data pipeline that needs Spark.
        
       | ABeeSea wrote:
       | I think learning Scala is a bit of a waste of time, but I don't
       | know everyone's stack. Maybe it's a west coast bubble, but
        | serverless seems to be the most popular choice for new ETL stacks
        | even if the rest of the cloud tech stack isn't serverless. AWS
        | tools like Kinesis, Glue (PySpark), Step Functions, pipelines,
        | Lambdas, etc.
       | 
       | If you are working in that domain, being able to use the CDK in
       | TypeScript becomes way more important than being able to build a
       | Hadoop cluster from scratch using Scala.
        
         | Grimm1 wrote:
          | Glue is both more of a pain in the butt than regular old Spark
          | with PySpark and way more expensive. From my experience, I
          | would seriously question someone suggesting to use it.
         | 
          | We could have been using it wrong, but porting our Glue scripts
          | to standard EMR after our initial POC cut our costs by over
          | 10x, and it was substantially faster.
        
           | ABeeSea wrote:
           | Both pricing and start-up times are significantly better in
           | Glue 2.0 (assuming one can migrate). But even on Glue 1.0,
           | orchestrating an ETL process with with several dozen jobs is
           | a non-trivial amount of configuration and labor. (Jobs
           | failures, job restarts, paging, job run history, cloudwatch
           | logs, re-usable infrastructure as code when creating a new
           | jobs, permissions and security, etc) that the increased cost
           | is more than worth it for us.
           | 
           | https://aws.amazon.com/blogs/aws/aws-glue-
           | version-2-0-featur...
        
       | freebee16 wrote:
        | In my experience, teams operating under the "AI Hierarchy of
        | Needs" principles are optimized for generating white papers.
        
       ___________________________________________________________________
       (page generated 2021-01-11 22:01 UTC)