[HN Gopher] How to Become a Data Engineer in 2021
___________________________________________________________________
How to Become a Data Engineer in 2021
Author : adilkhash
Score : 175 points
Date : 2021-01-11 12:49 UTC (9 hours ago)
(HTM) web link (khashtamov.com)
(TXT) w3m dump (khashtamov.com)
| josephmosby wrote:
| (source for everything following: I recently hired entry-level
| data engineers)
|
| The experience required differs dramatically between
| [semi]structured transactional data moving into data warehouses
| versus highly unstructured data that the data engineer has to do
| a lot of munging on.
|
| If you're working in an environment where the data is mostly
| structured, you will be primarily working in SQL. A LOT of SQL.
| You'll also need to know a lot about a particular database stack
| and how to squeeze it. In this scenario, you're probably going to
| be thinking a lot about job-scheduling workflows, query
| optimization, and data quality. It is a very operations-heavy
| workflow. There are a lot of tools available to help make this
| process easier.
|
| If you're working in a highly unstructured data environment,
| you're going to be munging a lot of this data yourself. The
| "operations" focus is still useful, but at the entry level data
| engineer, you're going to be spending a lot more time thinking
| about writing parsers and basic jobs. If you're focusing your
| practice time on writing scripts that move data in Structure A in
| Place X to Structure B in Place Y, you're setting yourself up for
| success.
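That practice loop is small enough to sketch with nothing but the standard library. The file layout and field names below are invented for illustration; the point is the Structure A (CSV in Place X) to Structure B (JSON lines in Place Y) move:

```python
import csv
import json
from pathlib import Path

def move_csv_to_jsonl(src: Path, dest: Path) -> int:
    """Read rows from a CSV file (Structure A) and land them as
    newline-delimited JSON (Structure B). Returns the row count."""
    count = 0
    with src.open(newline="") as f_in, dest.open("w") as f_out:
        for row in csv.DictReader(f_in):
            # Light normalization: strip whitespace from every field.
            cleaned = {k: v.strip() for k, v in row.items()}
            f_out.write(json.dumps(cleaned) + "\n")
            count += 1
    return count
```

A scheduler (cron, Airflow, whatever) calling a script like this on a timer is, at heart, most of an entry-level pipeline job.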
|
| I agree with a few other commentators here that Hadoop/Spark
| isn't being used a lot in their production environments - but -
| there are a lot of useful concepts in Hadoop/Spark that are
| helpful for data engineers to be familiar with. While you might
| not be using those tools on a day-to-day basis, chances are your
| hiring manager used them when she was in your position, and it
| will give you an opportunity to show you know a few tools at a
| deeper level.
| Mauricebranagh wrote:
| It rather depends on what sort of data: I bet a data engineer
| at CERN or JPL has quite a different set of required skills
| from one at, say, Google, or a company playing at data science
| because it's the next big thing.
|
| I should imagine at CERN etc. knowing which end of a soldering
| iron gets hot might still be required in some cases.
|
| I recall, back in the _mumble_, extracting data from b&w film
| shot with a high-speed camera by projecting it onto graph
| paper taped to the wall and manually marking the position of
| the "object".
| llbeansandrice wrote:
| > I agree with a few other commentators here that Hadoop/Spark
| isn't being used a lot in their production environments
|
| I guess I'm the odd man out because that's all I've used for
| this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.
| teddyuk wrote:
| I'm also the odd one out; so many enterprises are moving to
| Spark on Databricks.
| josephmosby wrote:
| I should have specified more thoroughly.
|
| I am not seeing Spark being chosen for _new_ data eng
| rollouts. It is still very prevalent in existing environments
| because it still works well. (I used it at $lastjob myself.)
|
| However - I am still seeing a lot of Spark for machine-
| learning work by data scientists. Distributed ML feels like
| it is getting split into a different toolkit than distributed
| DE.
| llbeansandrice wrote:
| I guess it depends on what jobs you're looking for. There are
| a lot of existing companies/teams (like mine) looking to
| hire people, but we're on the "old stack" using Kafka,
| Scala, Spark, etc. We don't do any ML stuff but I'm on the
| pipeline side of it. The data scientists down the line tend
| to use Hive/SparkSQL/Athena for a lot of work but I'm much
| less involved with that.
|
| Not all jobs are new pastures, and I think that's forgotten
| very frequently.
| dominotw wrote:
| Agree 100% with this comment.
|
| Old stack: Hadoop, spark, hive, hdfs.
|
| New stack: kafka/kinesis, fivetran/stitch/singer,
| airflow/dagster, dbt/dataform, snowflake/redshift
| disgruntledphd2 wrote:
| Huh, what replaces Spark in those lists?
|
| For my money, it's the best distributed ML system out there,
| so I'd be interested to know what new hotness I'm missing.
| dominotw wrote:
| > best distributed ML system out there
|
| I was comparing it for the "traditional" data engineering
| stack that used Spark for data munging, transformations, etc.
|
| I don't have much insight into ML systems or how Spark fits
| there. Not all data teams are building 'ML systems', though.
| The parent comment wasn't referring to any 'ML systems'; not
| sure why that would be automatically inferred when someone
| mentions a data stack.
| disgruntledphd2 wrote:
| Yeah, I suppose. I kinda think that distributed SQL is a
| mostly commoditised space, and wondered what replaced
| Spark for distributed training.
|
| For context, I'm a DS who's spent far too much time not
| being able to run useful models because of hardware
| limitations, and a Spark cluster is incredibly good for
| that.
|
| Additionally, I'd argue in favour of Spark even for ETL,
| as the ability to write (and test!) complicated SQL
| queries in R, Python and Scala was super, super
| transformative.
|
| We don't really use Spark at my current place, and every
| time I write Snowflake (which is great, to be fair), I'm
| reminded of the inherent limitations of SQL and how
| wonderful Spark SQL was.
|
| I'm weird though, to be fair.
| sails wrote:
| Snowflake I suppose for the average ML use case. Not for
| your high-performance ML, but for your average data
| scientist, maybe?
|
| Edit: I may be wrong[1], would be curious to know what
| users who've used Spark AND Snowflake would add to the
| conversation.
|
| [1] https://www.snowflake.com/blog/snowflake-and-spark-
| part-1-wh...
| marcinzm wrote:
| Snowflake hits its limits with complex transformations, I
| feel, and not just due to using SQL. Its "type system" is
| simpler than Spark's, which makes certain operations
| annoying. There's a lack of UDFs for working with complex
| types (lists, structs, etc.). Having to write UDFs in
| Javascript is also not the greatest experience.
| dominotw wrote:
| > There's a lack of UDFs for working with complex types
| (lists, structs, etc.). Having to write UDFs in
| Javascript is also not the greatest experience.
|
| We load our data into SF in json and do plenty of
| list/struct manipulation using their inbuilt
| functions[1]. I guess you might have to write a UDF if you
| are doing something super weird, but the built-in functions
| should get you pretty far 90% of the time.
|
| https://docs.snowflake.com/en/sql-reference/functions-
| semist...
| somurzakov wrote:
| distributed ML != Distributed DWH.
|
| Distributed ML is tough to train because of the very limited
| control over the training loop. I personally prefer single-
| server training even on large datasets, or switching to
| online learning algorithms that do train/inference/retrain at
| the same time.
|
| As for Snowflake, I haven't heard of people using Snowflake
| to train ML, but Snowflake is a killer in managed
| distributed DWH that you don't have to tinker with and tune.
| disgruntledphd2 wrote:
| So do I, theoretically at least.
|
| But Spark is super cool and actually has algorithms which
| complete in a reasonable time frame on hardware I can get
| access to.
|
| Like, I understand that the SQL portion is pretty
| commoditised (though even there, the SparkSQL Python and R
| APIs are super nice), but I'm not aware of any other
| frameworks for doing distributed training of ML models.
|
| Have all the hipsters moved to GPUs or something? \s
|
| > Snowflake is a killer in managed distributed DWH that
| you don't have to tinker with and tune
|
| It's so very expensive though, and their pricing model is
| frustratingly annoying (why the hell do I need tickets?).
|
| That being said, tuning Spark/Presto or any of the non-
| managed alternatives is no fun either, so I wonder if
| it's the right tradeoff.
|
| One thing I really, really like about Spark is the
| ability to write Python/R/Scala code to solve the
| problems that cannot be usefully expressed in SQL.
|
| All the replies to my original comment seem to forget
| that, or maybe Snowflake has such functionality and I'm
| unaware of it.
| marcinzm wrote:
| >I'm not aware of any other frameworks for doing
| distributed training of ML models.
|
| Tensorflow, PyTorch (not sure if Ray is needed) and Mxnet
| all support distributed training across CPUs/GPUs in a
| single machine or multiple machines. So does XGBoost if
| you don't want deep learning. You can then run them with
| KubeFlow or on whatever platform your SaaS provider has
| (GCP AI Platform, AWS Sagemaker, etc.).
|
| edit:
|
| >All the replies to my original comment seem to forget
| that, or maybe Snowflake has such functionality and I'm
| unaware of it.
|
| Snowflake has support for custom Javascript UDFs and a
| lot of built in features (you can do absurd things with
| window functions). I also found it much faster than
| Spark.
| disgruntledphd2 wrote:
| > Snowflake has support for custom Javascript UDFs and a
| lot of built in features (you can do absurd things with
| window functions). I also found it much faster than
| Spark.
|
| UDF support isn't really the same, to be honest. You're
| still a prisoner of the SELECT ... FROM pattern. Don't get me
| wrong, SQL is wonderful where it works, but it doesn't
| work for everything that I need.
|
| I completely agree that it's faster than Spark, but it's
| also super-expensive and more limited. I suspect it would
| probably be cheaper to run a managed Spark cluster vs
| Snowflake and just eat the performance hit by scaling up.
|
| > Tensorflow, PyTorch (not sure if Ray is needed) and Mxnet
| all support distributed training across CPUs/GPUs in a
| single machine or multiple machines. So does XGBoost if
| you don't want deep learning.
|
| I forgot about Xgboost, but I'm a big fan of unsupervised
| methods (as input to supervised methods, mostly) and
| Spark has a bunch of these. I haven't ever tried to do
| it, but based on my experience of running deep learning
| frameworks and distributed ML, I suspect the combination
| of both to be exponentially more annoying ;) (And I deal
| mostly with structured data, so it doesn't buy me as
| much.)
|
| > You can then run them with KubeFlow or on whatever
| platform your SaaS provider has (GCP AI Platform, AWS
| Sagemaker, etc.).
|
| Do people really find these tools useful? Again, I'm not
| really sure what SageMaker (for example) buys me on AWS,
| and their pricing structure is so opaque that I'm
| hesitant to even invest time in it.
| marcinzm wrote:
| >UDF support isn't really the same, to be honest. You're
| still prisoner of the select from pattern. Don't get me
| wrong, SQL is wonderful where it works, but it doesn't
| work for everything that I need.
|
| Not sure how it's different from what you can do in Spark
| in terms of data transformations. Taking a list of
| objects as an argument basically allows your UDF to do
| arbitrary computations on tabular data.
|
| > I forgot about Xgboost, but I'm a big fan of
| unsupervised methods (as input to supervised methods,
| mostly) and Spark has a bunch of these.
|
| That's true, distributed unsupervised methods aren't done
| in most other places I know of. I'm guessing there are ways
| to do that with neural networks, although I haven't looked
| into it. The datasets I deal with have structure in them
| between events even if they're unlabeled.
|
| >I completely agree that it's faster than Spark, but it's
| also super-expensive and more limited. I suspect it would
| probably be cheaper to run a managed Spark cluster vs
| Snowflake and just eat the performance hit by scaling up.
|
| I used to do that on AWS. For our use case, Athena ate
| its lunch in terms of performance, latency and cost by an
| order of magnitude. Snowflake is priced based on demand
| so I suspect it'd do likewise.
| sseppola wrote:
| Can you elaborate more on the "roles" of the "new stack"? To
| me dbt/dataform and airflow/dagster are quite similar, so why
| do you need one of each? fivetran/stitch/singer are all new
| to me.
| Grimm1 wrote:
| Airflow allows for more complex transformations of data
| that SQL may not be suited for. DBT is largely stuck
| utilizing the SQL capabilities of the warehouse it sits on,
| so, for instance, with Redshift you have a really bad time
| working with JSON-based data in DBT; Airflow can solve
| this problem. That's one example, but last time I was working
| with it, we found DBT was great for analytical-modeling-type
| transformations, but for getting whatever munged-up data
| into a usable format in the first place, Airflow was king.
|
| We also trained our analysts to write the more analytical
| DBT transformations, which was nice; it shifted that work
| onto them.
|
| Don't get me wrong though, you can get really far with just
| DBT + Fivetran, in fact, it removes like 80% of the really
| tedious but trivial ETL work. Airflow is just there for
| the last 20%.
|
| (Plus you can then utilize airflow as a general job
| scheduler)
| pjot wrote:
| I've used all of these, so I might be able to offer some
| perspective here.
|
| In an ELT/ETL pipeline:
|
| Airflow handles the "extract" portion of the pipeline:
| it is great for scheduling tasks and provides the high-
| level view for understanding state changes and the status of
| a given system. I'll typically use Airflow to schedule a job
| that will get raw data from xyz source(s), do something
| else with it, then drop it into S3. This can then trigger
| other tasks/workflows/slack notifications as necessary.
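The orchestration idea underneath that - run tasks only after the tasks they depend on - can be sketched without Airflow itself. The task names here are invented, and this toy shows only the ordering concept; real Airflow adds scheduling, retries, state tracking, and notifications on top:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A toy DAG in the Airflow sense: task name -> the tasks it depends on.
dag = {
    "extract_from_sources": set(),
    "land_raw_in_s3": {"extract_from_sources"},
    "notify_slack": {"land_raw_in_s3"},
    "kick_off_downstream": {"land_raw_in_s3"},
}

def execution_order(dag):
    """Return a valid run order: every task appears after its dependencies."""
    return list(TopologicalSorter(dag).static_order())
```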
|
| You can think of dbt as the "transform" part. It really
| shines with how it enables data teams to write modular,
| testable, and version controlled SQL - similar to how a
| more traditional type developer writes code. For example,
| when modeling a schema in a data warehouse all of the
| various source tables, transformation and aggregation
| logic, as well as materialization methods, are able to
| live in their own files and be referenced elsewhere
| through templating. All of the table/view dependencies are
| handled under the hood by dbt. For my organization, it
| helped untangle the web of views building views building
| views and made it simpler to grok exactly what and where
| might be changing and how something may affect something
| else downstream. Airflow could do this too in theory, but
| given you write SQL to interface with dbt, it makes it far
| more accessible for a wider audience to contribute.
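A hypothetical mini-version of that templating idea (not dbt's real implementation) shows why it untangles the web: each model is its own file-sized snippet, references are explicit, and the dependency graph falls out of the references:

```python
import re

# Hypothetical dbt-style models: each entry stands in for a model file,
# referencing other models via {{ ref('...') }}. Names are invented.
models = {
    "stg_orders": "select * from raw.orders where status != 'void'",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": (
        "select o.*, c.region from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
}

REF = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")

def compile_model(name, schema="analytics"):
    """Replace each ref() with a schema-qualified relation name."""
    return REF.sub(lambda m: f"{schema}.{m.group(1)}", models[name])

def dependencies(name):
    """Models that must be built before this one (the dependency graph)."""
    return REF.findall(models[name])
```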
|
| Fivetran/Stitch/Singer can serve as both the "extract" and
| "load" parts of the equation. Fivetran "does it for you"
| more or less with their range of connectors for various
| sources and destinations. Singer simply defines a spec for
| sources (taps) and destinations (targets) to be used as a
| standard when writing a pipeline. I think the way Singer
| drew a line in the sand and defined a way of doing things
| is pretty cool; however, active development on it really
| took a hit when the company was acquired. Stitch
| came up with the Singer spec, and their offering manages
| and schedules the various taps and targets for you.
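The Singer spec really is that small a line in the sand: newline-delimited JSON messages with `type`, `stream`, and `record` fields. The tap/target pair below is an in-memory toy following that message shape, not a production connector:

```python
import io
import json

def tap(rows, stream="users"):
    """A toy tap: emit a SCHEMA message, then one RECORD per row."""
    out = io.StringIO()
    out.write(json.dumps({"type": "SCHEMA", "stream": stream,
                          "schema": {"type": "object"},
                          "key_properties": ["id"]}) + "\n")
    for row in rows:
        out.write(json.dumps({"type": "RECORD", "stream": stream,
                              "record": row}) + "\n")
    return out.getvalue()

def target(raw):
    """A toy target: load RECORD messages into an in-memory destination."""
    return [json.loads(line)["record"]
            for line in raw.splitlines()
            if json.loads(line)["type"] == "RECORD"]
```

Because both sides speak the same message stream, any tap can be piped into any target - that interchangeability is the whole point of the spec.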
| alexpetralia wrote:
| Great points. It depends on where the business is at, the scale
| of their data, how processed their data is, and the
| timeliness/accuracy requirements of that data.
| justinzollars wrote:
| Amazon introduced Step Functions, which are very nice to dig into
| and a helpful skill for Data Engineering.
| [deleted]
| dominotw wrote:
| I've been in this space for the last 6 yrs or so, and my
| Scala usage has gone down to zero. Not worth learning Scala.
| switch007 wrote:
| What languages are worth learning?
| dominotw wrote:
| SQL has taken over the space completely. 90% of data munging
| and transforms happen via SQL.
|
| I would learn Python. It's the number one language outside
| of SQL.
| sonofaragorn wrote:
| What do you use if you need to process hundreds of GBs of
| data?
| dominotw wrote:
| SQL on Snowflake.
| johanneskanybal wrote:
| sql, python, terraform, maybe some basic java. Airflow is
| pretty common. Whatever the company is migrating away from.
| As long as you're good at one of those and can pick up the
| rest on the fly you should be fine to start out.
|
| edit: Guess this was pretty much in the post.
| pgoggijr wrote:
| That's an anecdote; plenty of firms are using Scala in their
| data engineering stacks, and it's a great tool for the job.
|
| While maybe not strictly necessary per se, it's a great way
| to get a foot in the door, and a great way to learn advanced
| type systems and functional programming (I personally
| find it to be a really fun language to write in, to boot).
| st1x7 wrote:
| > plenty of firms are using Scala in their data engineering
| stacks
|
| Isn't that just a result of everyone being into Spark a few
| years ago?
| dominotw wrote:
| > it's a great tool for the job.
|
| What job can it do that can't be done via SQL? Dealing with
| unstructured data?
| sidlls wrote:
| Scala, when it's not used because it's just what someone
| learned the ropes with, is the Haskell of data science and
| machine learning: it's what people use when they want to
| inflate their credentials and/or egos.
| ectoplasmaboiii wrote:
| Is anyone here using kdb+/q for data engineering, specifically
| outside of finance?
| darth_avocado wrote:
| We want all these skills, yet we'll give you a separate title
| and pay you less than a software engineer. Meanwhile, front-
| end software engineers are still software engineers and get
| high pay.
| alexpetralia wrote:
| I don't think data engineers are paid less than software
| engineers.
| darth_avocado wrote:
| They are. I should know; I've worked as one for years,
| including at big tech companies. E.g., FB has lower pay
| than SWE, lower RSUs, etc., and you can only get SWE pay if
| you transition into one, and that requires you to go through
| an internal interview process.
| red_hare wrote:
| That's totally contrary to my experience where Data
| Engineers are considered specialized SWE and paid more.
|
| But I've never worked for FB.
| darth_avocado wrote:
| You are right in the sense that if you look at average
| SWE salaries and data engineering salaries, the average
| salary is higher for data engineers, because the starting
| salaries for data engineers tend to be higher because of
| all the skills that are needed, and there are plenty of SWE
| positions that require more than just a degree in CS. But
| if you start comparing salaries at maybe a senior level
| (4-5 YOE+), the salaries for SWEs start becoming a lot
| higher than DEs'. And again, I've worked at different
| companies, big and small; this holds true for all
| companies that have "Data Engineer" titles. There are, of
| course, companies like Netflix where you are a data
| engineer but still get a SWE title and get paid the same.
| somurzakov wrote:
| Experienced data engineers should graduate to data
| architect/ML engineer roles, and this way they can get on
| par with SWEs. Please correct me if I am wrong.
| darth_avocado wrote:
| You're right, but most companies do not have those
| positions formalized and therefore you're expected to do
| those as part of your job, but not gain the financial
| benefits. Also, there is a big disparity in these titles
| and what the duties entail, which inherently again feeds
| into the problem.
| tharne wrote:
| In my experience, they get paid more.
| Grimm1 wrote:
| Same here
| wheaties wrote:
| ...and nothing of basic statistics? Data Science people want to
| know about your data pipeline and have some quantification of the
| quality of that data. Also, monitoring data pipelines for data
| integrity often relies upon a statistical test. You don't need to
| go as far as Bayesian but you do need to understand when a median
| goes way off or if it bi-modal, etc.
| diehunde wrote:
| That should be assumed in the "engineer" part of the role.
| ZephyrBlu wrote:
| Yeah I would definitely expect an engineer to have a grasp of
| basic statistics such as mean, median, and mode, and be able
| to interpret statistical graphs on a basic level (modality,
| skewness, shape, etc.).
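A sketch of that kind of basic statistics in practice, using the stdlib `statistics` module; the drift check and its 3-standard-deviation tolerance are illustrative choices, not a standard test:

```python
import statistics

def profile(values):
    """The quick distribution summary a data engineer might compute
    to sanity-check a pipeline's output."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # >1 entry hints at multimodality
        "stdev": statistics.stdev(values),
    }

def median_drifted(baseline, current, tolerance=3.0):
    """Crude data-quality gate: flag when the current median sits more
    than `tolerance` baseline standard deviations from the baseline
    median. The 3-sigma default here is an arbitrary illustration."""
    spread = statistics.stdev(baseline) or 1.0
    return abs(statistics.median(current) - statistics.median(baseline)) > tolerance * spread
```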
| adilkhash wrote:
| a good catch, thanks!
| sseppola wrote:
| Great resource, thanks for sharing it! I will dig deeper into the
| resources linked here as there's a lot I have never seen before.
| The main topics are more or less exactly what I've found to be
| key in this space in the last 2 months trying to wrap my head
| around data engineering in my new job.
|
| What I'm still trying to grasp is, first, how to assess the
| big data tools (Spark/Flink/Synapse/BigQuery et al.) for my
| use cases (mostly ETL). It just seems like Spark wins because
| it's the most used, but I have no idea how to differentiate
| these tools beyond the general streaming/batch/real-time
| taglines. Second, assessing the "pipeline orchestrator" for
| our use cases, where, like Spark, Airflow usually comes out
| on top because of usage. Would love to read more about this.
|
| Currently I'm reading Designing Data-Intensive Applications by
| Kleppmann, which is great. I hope this will teach me the
| fundamentals of this space so it becomes easier to reason about
| different tools.
| diehunde wrote:
| Nice article. From experience, I would say the SQL knowledge
| should be advanced, though, not intermediate.
| StreamBright wrote:
| 2021? More like 2010. Hadoop is getting deprecated rapidly and
| more companies split their write and read workloads. Separated
| storage and compute is also popular. Scala is not used that
| much; I think it is not worth the time investment. More and
| more companies go for Kotlin instead of Java when they want
| to tap into the Java ecosystem.
| adilkhash wrote:
| Hadoop is still widely used in enterprises (especially in
| banks); if you have experience working with the Hadoop
| ecosystem, it is a big plus anyway.
| StreamBright wrote:
| Yes, that is the status quo.
|
| There are also some trends:
|
| https://trends.google.com/trends/explore?date=today%205-y&ge.
| ..
| smattiso wrote:
| Are you working in this field? I am looking for a consultant
| to set up a modern data processing pipeline for a data-driven
| hardware product I am building.
| StreamBright wrote:
| Yes, moving companies to their next data pipeline is my
| specialty. I have added my email to my profile; you can
| reach out to me.
| u678u wrote:
| Incidentally, does anyone have resources for SMALL data?
| E.g., a few MB at a time, but requiring the same ETL,
| scheduling, and traceability. I'd love some lite versions of
| big-data tools, but they need to be simple, small, and cheap.
| musingsole wrote:
| Most orgs working with small data I've seen will just fall back
| to the tutorial version of some big data tools (oftentimes
| just eating the unused infrastructure cost for something like
| Hadoop when they're generating biweekly reports). Most project
| managers have a dream of their project scaling up and want to
| be prepared should a dream become a reality. And if you "under-
| engineer" (by which I mean specifically engineer for the
| problem the company is facing), you'll get called out by every
| armchair developer for not going with the "obvious, best
| solution."
|
| I'm not bitter; you're bitter. /s
| u678u wrote:
| Yeah, I'm bitter. I want to bring structure, tools, and
| discipline to small-scale data gathering, but the big tools
| are just too time-consuming to get up and keep running with
| only a few hours to spare.
| hermitcrab wrote:
| Take a look at our https://www.easydatatransform.com tool. It
| is a drag and drop data munging tool for datasets up to a few
| million rows. It runs locally on Windows or Mac. You should be
| able to install it and start transforming your data within a
| few minutes. It doesn't have a built in scheduler (yet), but
| you can run it from the command line.
|
| Excel Power Query is also quite lightweight, but it is pretty
| clunky in my (biased) opinion.
| u678u wrote:
| Thanks there are enough workstation tools, but I want an
| automated tool that runs on a server.
| kfk wrote:
| I have been working with small data for a few years. We built
| an internal library to move data in/out of systems and to
| schedule this work as jobs. We mostly leverage S3 and
| Spectrum. The major complexities we found were in scheduling,
| proxies, and fetching raw data from legacy applications.
| ABeeSea wrote:
| In AWS, Lambdas and Step Functions.
|
| https://aws.amazon.com/step-functions/
| u678u wrote:
| Thanks, presumably this is similar to AWS Airflow too.
| snidane wrote:
| In data engineering your goal is "standardization". You can't
| afford every team using their unique tech stack, their own
| databases, coding styles, etc. People leave the company all the
| time and you as a data engineer always end up with their mess
| which now becomes your responsibility to maintain. You'd at least
| be grateful if those people had used the same methods to code
| stuff as you and your team so that you wouldn't have to become a
| Bletchley Park decoding expert any time someone leaves. Or you'd
| hope the tech stack was powerful and flexible enough that
| people other than engineer types could pick it up and
| maintain it themselves. They mostly cannot, because there is
| no such
| powerful system out there. Even when some modern ELT systems get
| you 80% there, you, data engineer, are still needed to bridge the
| gap for the 20% of the cases.
|
| Data Engineering really comes down to being a set of hacks and
| workarounds, because there is no data processing system which you
| could use in a standardized systematic way that data analysts,
| engineers, scientists and anyone else could use. It's kind of a
| blue-collar "dirty job" of the software world, which nobody
| really wants to do, but which pays the highest.
|
| There are of course other parts to it, such as managing multiple
| data products in a systematic way, which engineering minds seem
| to be best suited for. But the core of data engineering in 2020,
| I believe, is still implementing hacks and gluing several systems
| together so as to have a standardized processing system.
|
| Snowflake or Databricks Spark bring you closest to the ideal
| unified system despite all their shortcomings. But still, you
| sometimes need to process unstructured JSONs, extract stuff
| from HTML and XML files, or unzip a bunch of ZIP archives and
| put them into something that these systems recognize, and
| only then can you run SQL on it. It is much better than the
| ETL of the past, where
| you really had to hack and glue 50% of the system yourself, but
| it is still nowhere near the ideal system in which you'd simply
| tell your data analysts: you can do it all yourself, I'm going to
| show you how. And I won't have to run and maintain a
| preprocessing job to munge some data into something spark
| recognizable for you.
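That preprocessing glue often looks something like this hypothetical stdlib-only job (file names and fields invented): unzip an archive of JSON files and flatten the records into CSV that a warehouse-style system recognizes:

```python
import csv
import io
import json
import zipfile

def zipped_json_to_csv(zip_bytes, fields):
    """Glue job of the kind described above: open a zip archive of
    JSON files, keep only the requested fields from each record, and
    emit CSV that a warehouse can load directly."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue
            for record in json.loads(zf.read(name)):
                writer.writerow(record)
    return out.getvalue()
```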
|
| It is not that difficult to imagine a world where such a
| system exists and data engineering is not even needed. But
| you can be damn sure that, before this happens, this position
| will be here to stay, and will keep paying highly, while 90%
| of ML and data science is data engineering and cleaning, and
| all these companies have hired a shitton of data science and
| ML people who are now trying to justify their salaries by
| desperately trying to do data engineers' job.
| mywittyname wrote:
| For GCP, our stacks tend to be Composer (Airflow), BigQuery,
| Cloud Functions, and Tensorflow.
|
| There's the occasional Hadoop/Spark platform out there, but
| clients using those tend to have older platforms.
| smattiso wrote:
| What is your product? I am looking for a consultant to help
| me set up a good process for a data-driven hardware product.
| mywittyname wrote:
| The work I do is almost entirely Google Analytics/Ads related.
| So probably not what you're looking for, but if so, leave
| your email and I'll reach out!
| zaptheimpaler wrote:
| Somewhat outdated view. This may be the current stack, but
| it's outdated now and is slowly being replaced. The new view
| is not big data pipelines and ETL jobs; it's lambda
| architecture, live aggregations/materialized views, and
| simple SQL queries on large data warehouses that hide the
| underlying details. The batch model may still apply to ML, I
| guess, but I'm no expert there.
| molsongolden wrote:
| Any resources/guides you'd recommend?
| laichzeit0 wrote:
| I think it's missing the resources for one of the hardest
| sections: data modeling, like Kimball and Data Vault. That,
| and maybe a section on modern data infrastructure. I'd put a
| link to [1] and [2] for a quick overview and probably [3] for
| more detail.
|
| [1] https://www.holistics.io/books/setup-analytics/ [2]
| https://a16z.com/2020/10/15/the-emerging-architectures-for-m...
| [3] https://awesomedataengineering.com/
| markus_zhang wrote:
| This. I also think modern columnar databases and other
| techniques somehow make Kimball obsolete, or at least relax
| it, but I could be very wrong.
|
| For example, we use Vertica, and the DBA told us that Vertica
| loves wide tables with many columns, which doesn't look very
| Kimball to me. This gives me some trouble, as I'm not really
| sure how to model data properly.
| [deleted]
| prions wrote:
| SQL proficiency is important but I wouldn't say it supersedes
| programming experience. To me, Data Engineering is a
| specialization of software engineering, and not something like an
| analyst who writes SQL all day.
|
| As DE has evolved, the role has transitioned away from
| traditional low code ETL tools towards code heavy tools. Airflow,
| Dagster, DBT, to name a few.
|
| I work on a small DE team. We don't have the human power to grind
| out SQL queries for analysts and other teams. Our solutions
| are platforms and tools, built on top of more fundamental
| tools, that allow other people to get the data themselves.
| Think tables-as-a-service.
| otter-in-a-suit wrote:
| I wholeheartedly agree with the "specialization" comment.
|
| Unless you are in a position where you can entirely rely on
| managed tools that do the work for you, and all effort is
| centered around managing the data rather than the holistic
| view of your data pipelines (Talend ETL, Informatica - the
| "pre-Hadoop" world, if you will, and maybe some modern tools
| like Snowflake), a good Data Engineer needs a deep
| understanding of programming languages, networking, some
| sysadmin stuff, distributed systems, containerization,
| statistics, and of course a good "architect" view of the
| ever-growing zoo of tools and languages with different pros
| and cons.
|
| Given that at the end of the day, most "Data Pipelines" run on
| distributed Linux machines, I've seen and solved endless issues
| with Kernel and OS configurations (noexec flags, ulimits,
| permissions, keyring limits ...), network bottlenecks,
| hotspotting (both in networks and databases), overflowing
| partitions, odd issues on odd file systems, bad partition
| schemes, a myriad of network issues, JVM flags, needs for
| auditing and other compliance topics, heavily multi-threaded
| custom implementations that don't use "standard" tools and rely
| on language features (goroutines, multiprocessing in Python,
| Threadpools in Java ...), encoding problems, various TLS and
| other security challenges, and of course, endless use of GNU
| tools and other CLI fun that I would not necessarily expect
| in a pure-SQL use case (not discounting the fact that SQL is,
| in
| fact, very important).
|
| Not to mention that a lot of the jobs/workflows Data
| Engineers design and write tend to be very, very expensive,
| especially on managed clouds - it's generally a good idea to
| make sure everything works and your engineers understand what
| they are doing.
| garciasn wrote:
| TL;DR: SQL is still a DE's best and most ubiquitous
| tool/language in the space--hands down.
|
| I've led DE teams for the last decade. I have lived through
| shifts in toolsets, languages, etc. Regardless of platform,
| languages, model types, etc, etc, etc, etc, the one constant
| has been SQL with some sort of scripting around it.
|
| Right now, it seems Python is the big wrapper language, whether
| it's via a DAG runner or some other means, but that's just the preferred
| method TODAY. Considering SQL has been around for decades and
| has outlasted just about every other language and system, many
| of which have opted for a SQL-like interface on top of their
| system, I would highly recommend DEs be very strong there.
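|
| To make the "SQL with scripting around it" pattern concrete,
| here is a minimal sketch using Python's built-in sqlite3; the
| table names and data are made up for illustration. The
| transformation itself is pure SQL, and Python only sequences
| the steps:

```python
import sqlite3

# Hypothetical source/target tables; the point is that SQL does the
# heavy lifting while Python is only a thin wrapper around it.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT);
    CREATE TABLE daily_revenue (status TEXT, total_cents INTEGER);
    INSERT INTO raw_orders VALUES
        (1, 500, 'paid'), (2, 300, 'paid'), (3, 900, 'refunded');
""")

# The "pipeline step" is a single SQL statement.
db.execute("""
    INSERT INTO daily_revenue (status, total_cents)
    SELECT status, SUM(amount_cents) FROM raw_orders GROUP BY status
""")

totals = dict(db.execute("SELECT status, total_cents FROM daily_revenue"))
print(totals)  # {'paid': 800, 'refunded': 900}
```

| The same shape scales up: swap sqlite3 for a warehouse driver
| and the wrapper script for a scheduler, and SQL remains the
| center of gravity.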
| cbdumas wrote:
| SQL proficiency is something I've seen developers of all sorts
| neglect, which I think is a huge mistake. And relegating SQL to
| something that just an "analyst" does is an even bigger
| mistake.
|
| Several times over my career I've been brought in on a project
| where the team was considering replacing their RDBMS entirely
| with a NoSQL data store (a huge undertaking!) because they
| were having "performance problems". In many cases the solution
| is as simple as adding an index or modifying a query to use an
| index, but the devs regard it as some kind of wizardry to read
| a query plan.
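|
| The index point is easy to demonstrate with Python's built-in
| sqlite3 (table and data are hypothetical): the same query moves
| from a full table scan to an index search once the index
| exists, and EXPLAIN QUERY PLAN shows which path is taken:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany("INSERT INTO users (email, name) VALUES (?, ?)",
                 [(f"user{i}@example.com", f"user{i}") for i in range(1000)])

def plan(query):
    # EXPLAIN QUERY PLAN rows carry a human-readable strategy in
    # their last column, e.g. "SCAN users" or "SEARCH users USING INDEX ...".
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

q = "SELECT name FROM users WHERE email = 'user42@example.com'"
before = plan(q)  # full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(q)   # index search on email
print(before)
print(after)
```

| Reading that output is the whole "wizardry": SCAN means every
| row is touched, SEARCH ... USING INDEX means the index is doing
| its job.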
| Grimm1 wrote:
| I spent way too large a portion of my last position teaching
| developers about indexes, query plans and underlying join
| types and their impact on performance and memory consumption.
| somurzakov wrote:
Advanced proficiency in SQL and in any scripting language of
your choice (C#/PowerShell, Python) is enough to be a data
engineer on any technical stack: Windows/Linux, on-prem/cloud,
vendor-specific/open source, literally anything.
| knur wrote:
| I disagree. That's not enough these days.
|
| If you want to build anything mildly interesting, you need a
| solid background in software engineering (building data
| pipelines in Spark, Flink, etc. goes way beyond knowing SQL),
| you need to really understand your runtime (e.g. the JVM, and
| how to tune it when working with massive amounts of data), and
| you need some knowledge of infrastructure, because some of the
| most specialized and powerful tools do not yet have an
| established "way of doing things", and their stateful nature
| makes them different from your typical web app deployment.
|
| Maybe if you want to become a data analyst you only need SQL,
| and I would still doubt it. But data engineering is a bit
| different.
| somurzakov wrote:
| I believe what you described is the job of a Platform
| Engineer/Systems Engineer/Data Lake Architect, especially the
| JVM aspect of it. The interesting work is at the beginning,
| when you build the cluster or do a major extension; after that,
| the ops/maintenance is usually outsourced to cheap offshore
| labor - so that kind of job is personally not for me.
|
| Spark has a DataFrame API which is similar to the pandas API
| and can be learned in a day, especially if you know Python.
|
| Same for Airflow and other frameworks: it's just a fancy
| scheduler that anyone can pick up in a couple of days.
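|
| The "fancy scheduler" framing can be made concrete: at its
| core, a DAG runner topologically sorts tasks by their upstream
| dependencies and executes them in order. A toy stand-in (not
| Airflow's actual API; task names are made up) using Python's
| standard graphlib:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it runs.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

ran = []
def run(task):
    # A real scheduler would invoke an operator / shell out here;
    # we just record the execution order.
    ran.append(task)

# static_order() yields tasks only after all their dependencies.
for task in TopologicalSorter(dag).static_order():
    run(task)
print(ran)  # ['extract', 'transform', 'load', 'report']
```

| What Airflow adds on top - retries, backfills, scheduling
| intervals, a UI - is real work, but the core mental model is
| this small.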
| dominotw wrote:
| > building data pipelines in Spark, Flink, etc. goes way
| beyond knowing SQL
|
| What if you build your data pipelines in SQL? Curious if you
| have an example of a data pipeline that actually needs Spark.
| ABeeSea wrote:
| I think learning Scala is a bit of a waste of time, but I don't
| know everyone's stack. Maybe it's a west coast bubble, but
| serverless seems to be the most popular choice for new ETL stacks
even if the rest of the cloud tech stack isn't serverless. AWS tools
| like kinesis, glue (pyspark), step functions, pipelines, lambdas,
| etc.
|
| If you are working in that domain, being able to use the CDK in
| TypeScript becomes way more important than being able to build a
| Hadoop cluster from scratch using Scala.
| Grimm1 wrote:
| Glue is both more of a pain in the butt than regular old Spark
| with PySpark and way more expensive; from my experience, I
| would seriously question someone suggesting to use it.
|
| We could have been using it wrong, but porting our Glue scripts
| to standard EMR after our initial POC cut our costs by more
| than 10x, and it was substantially faster.
| ABeeSea wrote:
| Both pricing and start-up times are significantly better in
| Glue 2.0 (assuming one can migrate). But even on Glue 1.0,
orchestrating an ETL process with several dozen jobs involves a
non-trivial amount of configuration and labor (job failures,
restarts, paging, job run history, CloudWatch logs, reusable
infrastructure-as-code when creating new jobs, permissions and
security, etc.), so the increased cost is more than worth it
for us.
|
https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...
| freebee16 wrote:
In my experience, teams operating under "The AI Hierarchy of
Needs" principles are optimized for generating white papers.
___________________________________________________________________
(page generated 2021-01-11 22:01 UTC)