[HN Gopher] We don't need data scientists, we need data engineers
___________________________________________________________________
We don't need data scientists, we need data engineers
Author : winkywooster
Score : 487 points
Date : 2021-01-14 13:09 UTC (9 hours ago)
(HTM) web link (www.mihaileric.com)
(TXT) w3m dump (www.mihaileric.com)
| tfehring wrote:
| One annoying thing about being a generalist is that domain
| experts in any given area that you need familiarity with can't
| help but complain about how little you know about that domain,
| ignoring the fact that your job requires equally deep knowledge
| of several other domains simultaneously.
|
| In the case of data scientists, I think the business folks that
| want them to understand the business domain better generally have
| the strongest argument, followed by the statisticians - good data
| scientists need to personally understand both of those things
| well, while the engineering and ops stuff that data scientists
| are also expected to do is easier to compartmentalize on other
| teams. So I agree that we should have more data engineers, but
| apparently for the opposite reason as most people in this thread.
| patothon wrote:
| why not both?
| b0rsuk wrote:
| Oh, but maybe we don't need either?
|
| It seems that data science is used primarily for advertising.
| Local internet communities are dead. Message boards are dying.
| Everything is either a reddit / discord / Steam forums /
| Boardgamegeek. In 2021, GOG forums pass for a small forum. Only
| the biggest can float in the ocean of spam.
| mywittyname wrote:
| I always felt that tech-focused data scientists should also be
| required to know how to process data end-to-end; at minimum, from a
| SQL database to deployed model, but knowing how to collect &
| clean data is important too. It seems like the industry is trying
| to fill the gap that was created by a glut of people without math/cs
| backgrounds going into 5-week data science courses who then need
| hand-holding when they get real jobs.
|
| Data science & engineering should be treated as a single
| collection of skill-sets. Lacking ETL experience is a major
| deficit, considering how prevalent that kind of work is.
|
| This might just be my personal biases coming through. I consider
| myself a "full-stack" data scientist & engineer. But because data
| scientists who can work on the backends are rare, I always end up
| doing the plumbing while other people do the fun analysis work.
|
| I think companies that are data "science" heavy are going to be
| at a huge disadvantage soon. Tools like Rekognition and Google AI
| APIs are making the model training & deployment aspect almost
| trivial. At some point, the only real work involved in this space
| will be the data "engineering."
| superbcarrot wrote:
| > Data science & engineering should be treated as a single
| collection of skill-sets.
|
| This can be tough because there could be _a lot_ in that skill
| set. You can't realistically expect someone to have solid
| knowledge of statistics including specialising in the sub-field
| and type of algorithms that your product needs, and also be
| able to write good code and act as a developer, and also have
| solid knowledge of all the tools for data
| streaming/processing/ETL. There is a point at which you're just
| stretching yourself too thin if you try to do all of these at
| once.
|
| Of course, stuff like knowing how to interact with a database
| or employing good software development practices should be a
| very basic prerequisite and some scientists certainly shift
| things too far in the other direction and use their academic
| knowledge as an excuse to write poor code and not learn new
| tools.
|
| I guess what I'm trying to say is that they are distinct skills
| but you still need all of them to some extent and striking the
| correct balance in one's skillset is really difficult.
| mywittyname wrote:
| These are all skills taught in standard computer science
| programs. Granted, some are electives, like high-level stats.
| But even back in 2010, data science electives were available
| to fill the gaps. I took three DS&E classes in college with
| projects that were end-to-end platforms, where you'd have to
| collect, clean, and analyze the data, then build, test, and
| deploy models from it.
|
| I would certainly hope that college courses are even more
| comprehensive after 10 years and an explosion in interest for
| the field.
|
| Also, much like being a full stack developer, a full stack
| data engineer doesn't need to know everything at a master
| level. You just need to be able to handle tasks at most points
| in the chain.
| mongol wrote:
| Why is data scientist a profession in IT but not for example
| computer scientist? Many IT professionals studied computer
| science but they don't call themselves scientists in their line
| of work.
| eeZah7Ux wrote:
| Don't give developers ideas. People with a javascript bootcamp
| and 2 years experience are already called "senior engineer".
| globular-toast wrote:
| "Computer science" is a well-known misnomer as it is not about
| computers nor is it science. It's generally reserved for the
| academic study of computing. Data science is much closer to
| science.
| zzbzq wrote:
| Basically this manager guy had like a dozen job titles under
| him, such as Business Analyst, Data Analyst, ML Engineer, and
| sundry more. Then his HR team came to him and told him to just
| boil it down to 1. So he made up this title "Data Scientist"
| because it sounded badass, and then everyone from both the
| business analytics side and the software engineering side
| decided this sounds awesome and they want it to be the next
| step of their career.
|
| Then the federal government came up with this title something
| like "Secretary of Data Scientist" and the guy who made up the
| term worked at that.
| noahmoss wrote:
| What does CV stand for in this context?
| macksd wrote:
| Computer vision (i.e. deep learning on images / video)
| jimsparkman wrote:
| Keeping in mind DE can mean different things at different
| companies, I spend a lot of time working on infrastructural
| components to just get at data reliably. Working in a product
| company with disparate generators of data, I'm often building out
| network connectivity (VPC peering, VPNs, etc.), subnets, ACLs,
| firewalls and load balancers across our visualization tools,
| managing job flows, controlling AWS costs, building read replicas
| for production databases, yadda yadda. There might be a ton of
| hoops to jump through before I can even start to process data and
| it's the type of work that wouldn't make sense to hand off to my
| DS counterparts.
| jh88 wrote:
| Just like with data science, don't we expect data engineering to
| get to a point where they codify best practices into a tool, so
| 1x data engineers can be closer to being as effective as 10x data
| engineers? The amount of data engineering work won't change, but
| the number of people needed to do it will reduce, reducing the
| open headcounts.
| giantg2 wrote:
| I don't find it too surprising. I think more people would want to
| work with the conceptual work in the data science space, like
| models. The ETL stuff seems extremely boring and it seems it pays
| less too.
| analog31 wrote:
| Disclosure: I'm a just plain scientist, without the data. ;-)
|
| Since "data science" appeared on my radar (in other words, on
| HN), I've noticed that we fling the "scientist" and "engineer"
| terms around without asking whether the practitioners have
| science or engineering backgrounds, or something else such as
| statistics, math, programming, etc.
|
| It strikes me that "do we need scientists or engineers" is not
| unique to data science/engineering. I think we need both, but of
| course it's an open question as to how many of each are needed.
| Also, both "scientist" and "engineer" are loosely defined in
| practice, with some overlap.
|
| Painting with a broad brush, a scientist wants to learn how
| things work. An engineer wants to make things that work. If
| you're in the business of making things that work in order to
| sell them, then you need lots of engineers, but maybe a few
| scientists. It's about 10:1 at my workplace.
|
| Overlap? Of course. Engineers use scientific knowledge and
| methodology. Scientists have to make things work in order to make
| experiments and theoretical computations work. Also, scientists
| often show up in emerging fields before they become recognizable
| engineering disciplines, for many reasons: 1) We need the newest
| stuff, right away. 2) We are opportunists by nature and
| necessity. For instance many of the older "software engineers" at
| my workplace have science degrees, but the younger ones all have
| computer science degrees. The oldest people I know who did
| programming, digital logic, and embedded systems, did not have
| degrees in those areas. The youngest ones all do. At my college,
| the physics professors had personal computers, and were ripping
| them apart, while the CS professors used the mainframe. That's
| part of why I chose to major in physics, though I was interested
| in programming.
|
| Another reason for overlap is that both "scientist" and
| "engineer" titles include things that are not strictly within
| either area. A lot of people with science degrees end up working
| as technicians, marketeers, salesmen, managers, etc. A lot of
| people with engineering degrees do little or no quantitative
| engineering, but work as programmers, designers, marketeers,
| salesmen, managers, etc.
|
| Something to ponder is if there are differences between how
| scientists and engineers think and approach problems. Naturally
| there's a lot of folklore about that, probably little hard
| evidence. That's where I think you should start if you want to
| know whether to hire scientists, engineers, or both.
| [deleted]
| ska wrote:
| In many cases, what is needed isn't even more data engineers;
| it's data janitors.
| basseq wrote:
| The problem that I've seen is often "data scientists" are
| expected to be the equivalent of full-stack engineers (or maybe
| more accurately: one-man CTO shops)--to understand data
| architecture, understand _business_ architecture, ensure data
| quality, build data into product, build dashboards, derive
| insights, posit hypotheses, set strategy, and drive business
| value.
|
| Thus many "data scientists" are juiced-up report-builders who
| can't analyze their way out of a paper bag.
| nerdponx wrote:
| This is uncharitable.
|
| In my experience, this is true:
|
| > "data scientists" are expected to be the equivalent of full-
| stack engineers (or maybe more accurately: one-man CTO
| shops)--to understand data architecture, understand business
| architecture, ensure data quality, build data into product,
| build dashboards, derive insights, posit hypotheses, set
| strategy, and drive business value.
|
| But this is not:
|
| > Thus many "data scientists" are juiced-up report-builders who
| can't analyze their way out of a paper bag.
|
| Rather, the data scientists are trained in only two of the
| requirements you mentioned: derive insights, posit hypotheses.
| The rest is all self-study and on-the-job experience. This
| means that we are putting unrealistic expectations on data
| scientists and/or their training is insufficient, _not_ that
| data scientists are somehow morons.
| basseq wrote:
| Now it's my turn to claim uncharitability.
|
| Indeed, my context here is that people who wear the data
| scientist title come from multiple backgrounds, and are often
| asked to wear too many hats. They are _not_ morons--they may
| be _darn good_ report-builders, but haven't been trained in
| insights, for instance.
|
| If you're reacting to my word choice in that last sentence,
| know that I _am_ frustrated with people who claim to be data
| scientists but can't derive insight. (And we can argue about
| "many".) But that's not a broad denouncement against all data
| scientists, either.
| kevin_thibedeau wrote:
| It's the new CIS.
| spacemanmatt wrote:
| Data engineer, here. Or at least, that has been my title a couple
| of times.
|
| Some data is inherently trash but a huge part of the data quality
| problem is sources who are allowed to produce trash that everyone
| else has to clean up, when it would be way more efficient for
| them to quit producing trash.
|
| Not to pick on any one institution, but SOAP seems to be a red
| flag that the service will also deliver some screwy data.
| jaegerpicker wrote:
| Every time I see SOAP in a ticket/task I die a little inside.
| Without fail every SOAP related project I've dealt with in the
| last 10 years has been a shit show. I've actually and
| unfortunately gotten pretty good at dealing with it but I so
| hate it.
| a_zaydak wrote:
| My view is from a small startup with little to no room for single
| purpose employees.
|
| When I first started hiring and working with data scientists my
| view was this: If you can only manipulate data and run it through
| pipelines to generate models then you can't do enough to be
| highly valuable. You either need to have a strong enough
| background in CS to build the pipelines / tools or a strong
| enough mathematics background to be able to propose cutting edge
| new ideas. From my experience it is hard to find someone who has
| one of these skills just from a university "data science" program.
| At a small company (at least ones that I have worked with) being
| only proficient in R and basic Python isn't enough. That being
| said, I have met a handful of data scientists who were very
| smart and self motivated enough to pick up on the lacking skills
| when given the chance.
|
| My question to HN is this; are there roles at these larger
| companies for a Data Scientist who primarily just crunches
| data in R and Python without the ability to actually build the
| pipelines / tools or conduct research?
| proverbialbunny wrote:
| I would be cautious about that. I've worked in the startup
| space for over 10 years now as a data scientist, often the
| first one hired on, working on the pipes.
|
| From my experience, there are two types of data scientists who
| do infrastructure work: 1) Those who do not make the
| best data scientists because their skill set is too far in
| engineering land, leaving them weak where it counts. If the
| startup is relying on the data scientist to be profitable, I'd
| be cautious with these types. Or 2) Someone who is senior,
| beyond senior really, who has worked both jobs, and doesn't
| mind doing both jobs. This unicorn is so rare it is mythical.
| The joke when the terminology was created is they're so rare no
| one has ever seen one, hence unicorn.
|
| Me, I can not do the work I need to do if I'm on call. That is
| where I draw the line. That means hiring someone to monitor the
| infrastructure. Furthermore, I'm an okay architect, but you
| really do want to hire a specialist if you can help it for
| that. Do I help them with the infrastructure? Absolutely, but
| they're on call if a server is on fire. They have the admin
| login credentials, not me.
|
| I get wearing multiple hats, but keep in mind to be a data
| scientist you're already wearing multiple hats. Being a data
| scientist is like double majoring and getting a phd. At what
| point are they stretched too thin? The consensus in the
| industry is they're already stretched too thin and should be
| broken up into different specialized roles.
|
| >My question to HN is this; are there roles at these larger
| companies for a Data Scientist who primarily just crunches
| data in R and Python without the ability to actually build the
| pipelines / tools or conduct research?
|
| That is the standard role, even at startups. However, the
| industry consensus these days is data scientists should have
| more responsibility when it comes to deploying models than
| previous standards.[1] So data scientists are being pushed in a
| more engineering direction, not with hosting sql servers and
| infrastructure, but with working with engineers to make sure
| the models are monitored properly. This change comes from model
| deployment being further automated as time goes on, making it
| easier for the data scientist to have more responsibility
| during this stage.
|
| [1] source:
| https://www.dominodatalab.com/static/gfx/uploads/domino-mana...
| page 9. Suboptimal organization and incentive structures.
| a_zaydak wrote:
| Thanks for the feedback! Seems like you and I both have had a
| bit of experience being first engineering hires at startups
| but have had very different experiences when it comes to
| roles of a data scientist. I appreciate that.
| proverbialbunny wrote:
| Np. There is a common trend in the industry where a company
| hires on a data scientist, doesn't know the data
| prerequisites (specifically labeled data), the data
| scientist struggles, and after a while the company fires the
| data scientist. This leaves the company with a bad taste in
| their mouth. In recent years I tend to get hired on as a
| specialist to help fix this. (And yes, I've been the first
| engineer hired on too.)
|
| What's interesting is they tend to struggle in two
| different ways: 1) The data scientist that is gung ho about
| infrastructure work, jumps in, and then ends up doing a bad
| job, because it's not their strength. They end up getting
| let go for not being ideal at that work. 2) The data
| scientist who struggles with the idea of infrastructure
| work at all, jumps into other roles they're good at like
| data analyst work, helps the company in that way, but
| ultimately because they did not push to get an
| infrastructure engineer hired, they end up let go as well.
|
| Me, I go out of my way to get an infrastructure engineer /
| data engineer hired early on. Also, I have worked as an
| engineer, so I tend to do a lot of the "hard" stuff most
| software engineers struggle with early on, if applicable.
| Eg, at one job I wrote a compression format to reduce
| battery drain on our devices that were collecting data.
|
| Most data scientists struggle when it comes to
| CS/engineering skills (4/5ths of them), so it's not uncommon
| for them early on while the pipes are being built to do
| data analyst and BI work. BI work to automate reports,
| which management loves, and DA work to show some amazing
| future service the company might be able to provide to its
| customers. It's selling the sun and the moon really, but it
| gets management inspired, and helps them know what data to
| collect. It's not unheard of to need a minimum of two years
| of collected data before building a model that can be
| deployed becomes feasible. This can be hard on the data
| scientist, because there is a lot of down time before that.
| Many get fired during this time even when they're doing a
| good job. They have to wear multiple hats, but it's analyst
| roles (like BI work). Technically a data scientist is a
| kind of analyst, not engineer, so it makes sense that
| wearing multiple hats for them tilts in the analyst
| direction, not the engineering direction.
|
| I've been writing code since I was 8 years old, so I'm one
| of the unusual ones that tilts in the engineering
| direction, but I think it is unreasonable to expect that
| from the average data scientist. Let them do what they do
| best, and hire someone else who can round everything out
| and you'll be in a good place. Unicorns aside, you'll need
| a minimum of two professionals for a data project to
| succeed.
| _RPL5_ wrote:
| Thank you for your comments! They are very insightful. To
| piggyback a bit:
|
| Assuming you are a competent data "analyst" who wants to
| become a data engineer, how would you go about it? Is "go
| back to school and get a CS degree" the answer? I suppose
| this question is very broad, but I am curious if a
| practitioner like you has an opinion.
|
| ---
|
| To give some context:
|
| I recently graduated with a STEM PhD, and am looking to move
| into data science. Reading the comments, I feel like I
| fall into the "pointless data scientist" cohort derided
| in this thread. Eg: I am very comfortable doing typical
| analytical work & occasionally training models inside a
| notebook, but I am neither a cutting-edge theoretical
| statistician nor a data engineer.
|
| I've been trying to improve on the engineering side. For
| example, I did a project recently where I set up a
| rudimentary pipeline that continuously pings an API,
| uploads the data to a cloud database, then serves up the
| analysis via a Flask app. For me this was a big step up
| from just doing notebooks on a csv file :)
|
| But moving beyond the basics, I am not sure what to study
| next. Hence my question. If you have any suggestions, I
| would greatly appreciate it!
| avs733 wrote:
| I teach engineers for a living. I struggle to see how this is not
| just a straw man argument based on colloquial usage of terms. It
| is just inferences drawn based on job ads that are rarely written
| by people doing the job and instead are effectively
| human-as-SEO-optimized so the best candidates can find the job they
| hopefully fit and not be too confused to apply for it.
| hn_throwaway_99 wrote:
| It's not a straw man, I've seen it clear as day in several
| companies. When it comes to data science, it's "garbage in,
| garbage out". I've seen companies do lots of "data science"
| with a bunch of data scientists skilled in python and jupyter
| notebooks, only to discover a ton of work was useless because
| the incoming event data was tagged incorrectly due to a bug.
|
| The actual process of collecting, aggregating, cleaning and
| verifying data is a hugely important skill, and not one I've
| really seen typical data scientists possess.
| avs733 wrote:
| I have experienced the same thing...but I just don't think it
| has anything to do with whether the positions are labeled
| data scientist or data engineer.
|
| And I would warn you from my experience teaching statistics
| to undergraduate engineers...they are not going to be much
| better. I regularly get 'hey we have this data what test can we
| run?' 'what are you trying to show?' 'we don't care we just
| need to run a statistical test' conversations.
| bonniemuffin wrote:
| I suspect this may actually be an issue of school vs real
| world rather than scientist vs engineer.
|
| Data in the classroom setting is pristine and beautiful; data
| in the real world is messy and buggy. You have to get burned
| by buggy data a few times (or maybe a bunch of times) in the
| real world to learn to look for bad data smells -- I don't
| think schools effectively teach this kind of intuition,
| regardless of whether the students are training as data
| engineers or data scientists.
|
| If data scientists are spending more time in school getting
| advanced degrees, they're not getting as much exposure to
| buggy data, whereas data engineers with a BS and a few years
| of industry experience would already have built up this
| skill.
| avs733 wrote:
| >Data in the classroom setting is pristine and beautiful;
| data in the real world is messy and buggy.
|
| I got to take over our department's undergraduate
| statistics course a few years back.
|
| The first change I made was that all homework, tests, and
| projects use real data sets. I intentionally have them
| collect bad data (they don't know it's bad beforehand).
| First day of class we collect data using the board game
| Operation...I give basic instructions and then halfway
| through ask everyone to stop and agree on how they are
| entering data for the variable of 'success or failure' of
| the surgery. Oops...
|
| In my experience teaching the course, the reason the
| students (engineers) find statistical reasoning hard is:
|
| * They have never been given anything 'broken', everything
| is curated to avoid things not working. The result is they
| think data has inherent meaning. A right answer.
|
| * Their entire learning experience has been stripped of
| context and the need to make decisions with information.
| They can give me a p value but are terrified (not unable,
| just unwilling) to interpret it or give it meaning.
|
| * They have never encountered the concept of
| variability...everything is presented as systems with exact
| inputs and outputs.
|
| When I work with postdocs, I sometimes (less frequently)
| encounter many of the same challenges. Data is treated as
| sacred and external and inherent. It's wild to me.
| finnthehuman wrote:
| >The actual process of collecting, aggregating, cleaning and
| verifying data is a hugely important skill, and not one I've
| really seen typical data scientists possess.
|
| Then they are not scientists. They have the label "scientist"
| but lack the rigor of actual science.
|
| I don't see why changing the label to "engineer" would
| suddenly make them have rigor.
| [deleted]
| avs733 wrote:
| Right?!
|
| This is sort of the meta failure of the argument. They are
| arguing that people's data skillsets are wrong. To make
| that argument they are analyzing based on the wrong
| variable in a data set.
| antipaul wrote:
| The article is so true, my latest mantra at work is
| "engineering is more important than data science".
|
| Everyone is buzzing about the latter, and few even realize what
| the former is.
| avs733 wrote:
| eh...I think this can be analogized to what we already see in
| code...
|
| You need architecture, you need backends, you need a front
| end, you need product design...all with data.
|
| Why are computer scientists computer scientists not
| engineers? Why is computer science about the code side? Why
| did computer engineering end up being more on the hardware
| end of the spectrum?
|
| Words, especially newly coined terms are pointers to meaning.
| That meaning is socially mediated, it is not inherent.
|
| You're saying this (and I think the author is too) because
| there is a need for this group of people to look beyond
| titles to skillsets, and the existing titles carry linguistic
| baggage of the difference between science and engineering
| that has existed for decades.
| jariel wrote:
| So I think that the delineation between the scientist working
| with the content, and the Engineers who actually provide the
| mechanics for it is very fair.
|
| If there is a question mark here - it's really how much value
| are we deriving from all of these data people?
|
| Where is all the ML that's changing our lives? Search, Alexa
| and TikTok, I can see it.
|
| In the future obviously vision systems for autonomous cars
| etc.
|
| But I'm really wary about the heavily decreasing marginal
| returns after that.
|
| It will surely change the world, but I think in specific areas.
| Most of the entire field seems like an optimization on
| something rather than anything new.
|
| Washing machines freed up an immense amount of labour and toil.
| Alexa telling me the weather does not.
| avs733 wrote:
| Most engineering and science jobs aren't a binary as much as
| they are a spectrum.
|
| If the article is trying to make a point about skill
| development and diversification, I'm totally on board.
| Bifurcating the roles instead is going to be less effective.
|
| To the value point...my sense has been we are seeing the
| Webcommerce 1.0 bubble Machine Learning edition. Lots of uses
| of it, not all of them have value. I am excited for where we
| will be in 10 or 15 years, but I suspect the difference will
| be huge. If you put me to a guess, I would say better data
| handling practices and ethics will likely be the linchpins of
| value creation vs. using tools for the sake of tools.
| serjester wrote:
| I used to work at a legacy automaker and you'd be shocked at
| how much ML has changed certain areas of the business. It
| used to take an entire department to sort warranty claims and
| it's now mostly automated. Aluminum part defects are now
| spotted automatically on the plant floor. Don't even get me
| started with telematics data.
|
| Most software isn't consumer facing but just because you
| don't see it doesn't mean it's not changing things around
| you. ML tends to be overhyped but your assessment is too
| pessimistic.
| bart_spoon wrote:
| The vast majority of applications of machine learning that are
| changing the world aren't happening at a consumer level. They're
| happening in factories, warehouses, farms, logistics chains,
| etc.
| superbcarrot wrote:
| The overall point that there is demand for data engineering
| skills seems valid but in reality you don't have to pick between
| data scientists and data engineers, it's not one or the other. I
| don't know if the argument is set up this way just to get more
| clicks or to encourage arguments but it would have been better to
| just focus on the overall state of the market and the demand for
| certain skills.
| shrumm wrote:
| this post is so timely. How do you guys handle hiring? I'm
| looking to hire more data engineers (Singapore only for now),
| especially senior ones and it's not easy. My gut is to just find
| good engineers who have good first principles and let them learn
| on the job. Feedback/experiences welcome!
| yaster wrote:
| Curious to hear this from the other side. I'm an SE manager
| interested in pivoting to Data Engineering. I've built modest
| pipelines in R / Postgres for an operations analyst job I did
| before I became a programmer, but I'm not experienced with most
| of the technologies I see listed on DE roles (e.g. Airflow) and
| my statistics etc. are rusty.
| Dumblydorr wrote:
| I'm a scientist first, data person second. I'll use the label
| data scientist if that's what the employer wants. I'll use data
| analyst, scientist, researcher, or any other term, so long as
| they pay me.
|
| What matters is: are we contributing to the betterment of society
| with data-driven decision making?
| Dumblydorr wrote:
| Another aspect: so many PhDs and postdocs in the sciences do
| similar work but get paid 1/3 or less of the wages. We
| shouldn't be surprised they snap up industry jobs in data
| science, it's way more remuneration than academics, with less
| stress IMO
| gigatexal wrote:
| Why then are data scientists paid a ton more? Seems they are in
| more demand.
| slt2021 wrote:
| Full Stack Data Scientist (data janitor + data engineer + ML
| engineer + ML Ops + Business Analyst) is the future
| alexpetralia wrote:
| These are incredibly disparate skill sets. Of course anyone
| would want to hire someone like this, and far more would claim
| to possess such a broad skill set, but in practice it is
| extremely rare.
|
| You'd need someone with excellent communication skills
| (presentation, memo writing, teamwork), project management
| skills (identifying & overcoming workflow bottlenecks),
| professional skills (timely responses, political savvy),
| technical skills (application programming, advanced databases,
| advanced machine learning, Excel modeling) and finally some
| business domain knowledge.
|
| This is an uncommon intersection of skills.
| marcinzm wrote:
| I, interestingly enough, have that skill set (mostly) and
| probably a broader set of technical skills than you're
| imagining. I use it to hire a team of specialists under me
| and interact with other specialized teams (ie: I speak their
| language) since I lack depth in too many areas. I wouldn't
| ever imagine hiring a clone of myself except in cases where I
| can't build out a larger team for a long period of time.
| nerdponx wrote:
| > You'd need someone with excellent communication skills
| (presentation, memo writing, teamwork), project management
| skills (identifying & overcoming workflow bottlenecks),
| professional skills (timely responses, political savvy),
| technical skills (application programming, advanced
| databases, advanced machine learning, Excel modeling) and
| finally some business domain knowledge.
|
| This is pretty much the _bare minimum_ requirement for any
| data scientist job I've ever interviewed for or held.
| alexpetralia wrote:
| In my experience, software engineers do not make good
| business analysts (and data/machine learning engineering is
| a subset of software engineering). Most business analysts
| cannot program.
|
| However, it's likely that our experiences simply diverge
| here.
| nerdponx wrote:
| I'm talking specifically about data science, not business
| analyst or software engineer.
| slt2021 wrote:
| exactly, these are characteristics of a unicorn and I think
| most of these skills are trivial to build up over time
| through practice and self-learning and these skills can yield
| great benefits both for employers and employees
| valarauko wrote:
| I think the point is that these skills are not trivial to
| build up over time
| SirSourdough wrote:
| I guess in a vacuum each of those skills is easy to build
| up through practice and self-learning (which, let's
| remember, many people struggle with to begin with).
| However, I think the fact that you refer to people
| possessing all of them as "unicorns" should be telling as
| far as how trivial it actually is to build all these skills
| beyond a simply passable level.
| bcrosby95 wrote:
| Or you can recognize that they're the characteristics of a
| unicorn and split the role into multiple positions.
| bearjaws wrote:
| I've seen this in leadership who want to move to "Devops",
| it's the classic "if we find this one person who can do
| everything we will have no problems!"
|
| The reality is of course, nobody can be amazing at the full
| lifecycle of an application. Some do better in infra, some
| better in backend, front end, etc.
|
| A successful leader must find what is needed for the
| product/application pipeline and hire appropriate skill sets,
| trying to find the one candidate to rule them all is giving
| up on planning IMO.
| beckingz wrote:
| As the saying goes:
|
| If you're looking for a data scientist with XYZABC skills,
| that's not a data scientist, that's a data science team.
| esyir wrote:
| Ugh, all this gets you is being mediocre at all of this.
| slt2021 wrote:
| yes, if you need to roll-up everything on your own from
| scratch.
|
| NO, if you use the right amount of automation and software
| (usable data science workbench with MLOps built in, usable
| and scalable ETL/ELT framework, usable AutoML, etc, etc.)
| marcinzm wrote:
| You still get someone mediocre at everything; they just
| cover up the gaps for a bit longer. Eventually things they
| don't understand will interact in ways they don't
| understand and cause production issues. It's okay to be a
| generalist, one should however understand the blind spots a
| generalist has.
| nerdponx wrote:
| Maybe my sense of terminology is warped, but I always thought
| of DataEngineer = DataJanitor ∪ MlEngineer ∪ MlOps ∪
| BusinessAnalyst
|
| Data Scientist is more like some combination of statistician,
| "whatever ML is if it isn't statistics", lightweight
| mathematician, data janitor (yes there is overlap), business
| domain specialist, and code monkey.
| civilized wrote:
| Just my personal experience, but the people at my company titled
| "Data Engineers" basically can only be trusted to (very slowly)
| move data around, while the people titled "Data Scientists" have
| to do all the cleaning work to make the data suitable for
| analysis and modeling, in addition to doing that analysis and
| modeling.
|
| Is the point here that data scientists are doing too much of the
| work that should be handled by data engineers? If so I agree, but
| there are some org barriers. For one thing, our data engineers
| are not accountable to any particular project. All we can do is
| send them tickets to move data around, and it already takes them
| weeks to move a table from one database to another. I can't
| imagine the headaches if we gave them anything nontrivial to do.
| C4stor wrote:
| I can't recommend the Data Engineer career enough for junior
| developers. It's how I started and what I pursued for 6 years
| (and I would love doing it again), and I feel like it gave me
| such an incredible foundation for future roles :
|
| - Actually big data (so, not something you could grep...) will
| trigger your code in every possible way. You quickly learn that
| with trillions of inputs, the probability of hitting a bug is either
| 0% or 100%. In turn, you quickly learn to write good tests.
|
| - You will learn distributed processing at a macro level, which
| in turn enlightens your thinking at a micro level. For example,
| even though the orders of magnitude are different, hitting data
| over network versus on disk is very much like hitting data on
| disk versus in cache. Except that when the difference ends up
| being in hours or days, you become much more sensitive to that, so
| it's good training for your thoughts.
|
| - Data engineering is full of product decisions. What's often
| called data "cleaning" is in fact one of the important product
| decisions made in a company, and a data engineer will be
| consistently exposed to their company's product, which I think makes
| for great personal development
|
| - Data engineering is fascinating. In adtech for example, logs of
| where ads are displayed are an unfiltered window on the rest of
| humanity, for the better or the worse. But it definitely expands
| your views on what the "average" person actually does on their
| computer (spoiler : it's mainly watching porn...), and challenges
| quite a bit what you might think is "normal"
|
| - You'll be plumbing technologies from all over the web, which
| might or might not be good news for you.
|
| So yeah, data engineering is great ! It's not harder than other
| specialties for developers, but imo, it's one of the fun ones !
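The "either 0% or 100%" remark in the first bullet above follows from simple arithmetic. As a rough sketch, assume each record independently triggers a given bug with probability p (an illustrative assumption, not something stated in the comment); the chance of at least one hit in N records is then 1 - (1 - p)^N. The function name `prob_bug_hit` is hypothetical.

```python
import math


def prob_bug_hit(p_per_record: float, n_records: int) -> float:
    """Probability that at least one of n_records triggers the bug.

    Assumes records are independent (an illustrative simplification).
    Uses log1p/expm1 so tiny probabilities do not round away.
    """
    return -math.expm1(n_records * math.log1p(-p_per_record))


if __name__ == "__main__":
    n = 10**12  # "trillions of inputs"
    for p in (0.0, 1e-15, 1e-9):
        print(f"p={p:g}: P(at least one hit) = {prob_bug_hit(p, n):.6f}")
```

Under these assumptions, even a one-in-a-billion edge case is all but certain to occur in a trillion-record run, while a code path that can never be reached stays at exactly zero.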
| secondcoming wrote:
| Indeed, adtech is a great place to work for anyone interested
| in working with data. And yes, people working in adtech hate,
| and block, ads too.
| pricci wrote:
| And where would you recommend someone to start a data
| engineering path. Any book, learning source?
| edmundsauto wrote:
| Long time DE here. I recommend trying to build your own data
| warehouse around something you're interested in. Don't worry
| about the scaling - focus on the core engineering, taking
| data from different places, combining it into a sensible data
| model, update it automatically every day. Add in more
| sources.
|
| It's shockingly difficult, and something that only experience
| can teach.
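A minimal sketch of the exercise described above, assuming two made-up in-memory "sources" and Python's built-in `sqlite3` as the warehouse. The table names (`dim_user`, `fact_order`) and the sample rows are hypothetical; a real project would pull from APIs or files and run the load on a scheduler.

```python
import sqlite3

# Two hypothetical "sources"; in practice these would be API pulls or file drops.
SOURCE_ORDERS = [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)]
SOURCE_USERS = [("alice", "DE"), ("bob", "US")]


def load(conn: sqlite3.Connection) -> None:
    """Idempotent load: re-running it (e.g. daily) does not duplicate rows."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS dim_user (name TEXT PRIMARY KEY, country TEXT);
        CREATE TABLE IF NOT EXISTS fact_order (
            order_id INTEGER PRIMARY KEY,
            user_name TEXT REFERENCES dim_user(name),
            amount REAL
        );
        """
    )
    conn.executemany("INSERT OR REPLACE INTO dim_user VALUES (?, ?)", SOURCE_USERS)
    conn.executemany("INSERT OR REPLACE INTO fact_order VALUES (?, ?, ?)", SOURCE_ORDERS)
    conn.commit()


def revenue_by_country(conn: sqlite3.Connection) -> dict:
    """Combine both sources through the data model."""
    rows = conn.execute(
        """
        SELECT u.country, SUM(o.amount)
        FROM fact_order AS o JOIN dim_user AS u ON u.name = o.user_name
        GROUP BY u.country
        """
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(conn)
    load(conn)  # second run simulates tomorrow's scheduled job
    print(revenue_by_country(conn))
```

Keying every table on a stable identifier is one simple way to make the daily update safe to re-run; the hard parts the comment alludes to (late data, schema drift, conflicting sources) are not covered here.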
| thecolorgreen wrote:
| I have the same question and I believe the answer is in the
| same vein as someone who asks about software engineering.
| Books/courses are great for the concepts, but your goal
| should be to build something ASAP since that's where actual
| learning will come from.
| khaledh wrote:
| Work at a company that has a good data engineering
| discipline. Shopify is hiring:
| https://www.shopify.ca/careers/2021
| Karrot_Kream wrote:
| A lot of these are just "garden variety" (distributed)
| systems problems. Dealing with systems with differing latency
| distributions, recovering from failure, acceptable tradeoffs
| between speed and accuracy, etc
| alexpetralia wrote:
| The other thing I'd emphasize here is dealing with "state".
| Data is effectively state.
|
| As application engineers build increasingly "stateless" code
| (e.g. pure functions, serverless deployments, etc), that state
| gets pushed elsewhere. Someone has to manage the queues, file
| versions/locations, logs, databases, configurations and so on.
| That is all "data".
|
| State management is a tricky problem even in a single-threaded
| application. It's doubly so in distributed systems, where state
| can be inconsistent between all the moving pieces. This is the
| source of endless data integrity issues. I think data
| engineering is a great way to get some exposure to all of this.
| darksaints wrote:
| > As application engineers build increasingly "stateless"
| code (e.g. pure functions, serverless deployments, etc), that
| state gets pushed elsewhere.
|
| Exactly. You can't magically make a stateful problem
| stateless, you can merely move that state around. Sometimes
| moving state around means moving it somewhere that is
| appropriate and capable of expertly handling that data. But
| if you make those choices wrong, it makes every aspect of
| your application more complex.
|
| UI programming tried going down this idea of stateless
| programming, and for a while it was trendy to do stuff
| like redux. The problem is that UIs are state machines.
| That's not an analogy, that is a literal statement. And it is
| true of all UIs...it's just as true of the transmission
| lever in your car as it is for your saas dashboard. You can't
| program stateless UIs...they would cease to be a UI. So at
| best, you can move that state around. And with most of these
| solutions (eg. redux), you end up pushing that state into a
| massive global singleton, where even simple things like the
| state of a single radio button needs to be fed through dozens
| of tightly coupled components in order to "statelessly"
| render. And even worse, you lose the extremely helpful
| distinction between UI state and domain state, mixing them
| both together into a gigantic shit stew.
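To make the "UIs are state machines" point concrete, here is a small sketch using the gear lever mentioned above. The set of gears and allowed transitions is a simplified assumption for illustration, not a model of any real transmission, and `GearLever` is a hypothetical name.

```python
# Allowed moves for a simplified gear lever (an illustrative assumption,
# not a model of any real transmission).
TRANSITIONS = {
    "P": {"R"},
    "R": {"P", "N"},
    "N": {"R", "D"},
    "D": {"N"},
}


class GearLever:
    """The control *is* its state plus the rules for changing it."""

    def __init__(self) -> None:
        self.state = "P"

    def shift(self, target: str) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"cannot shift from {self.state} to {target}")
        self.state = target


if __name__ == "__main__":
    lever = GearLever()
    for gear in ("R", "N", "D"):
        lever.shift(gear)
    print(lever.state)
```

Whether `state` lives on the object, as here, or in a global store, it still has to live somewhere; the design question is only where.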
| jsinai wrote:
| >The other thing I'd emphasize here is dealing with "state".
| Data is effectively state.
|
| It gets even more complicated. It's not just the current
| state that matters, but also the history (sometimes the
| entire history) up to that state.
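One common way to keep the history as well as the current state is an append-only event log from which state is rebuilt on demand. The sketch below assumes a toy account-balance domain; `Event` and `replay` are hypothetical names, not taken from the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    delta: int  # signed change in balance


def replay(events, upto=None):
    """Rebuild balances from the first `upto` events (all of them if None)."""
    balances = {}
    for event in events[:upto]:
        balances[event.account] = balances.get(event.account, 0) + event.delta
    return balances


if __name__ == "__main__":
    log = [Event("a", 10), Event("b", 5), Event("a", -3)]
    print(replay(log))          # current state
    print(replay(log, upto=1))  # state as it was after the first event
```

Storing only the latest balances would answer the first question but not the second, which is the extra complication the comment describes.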
| theflyinghorse wrote:
| I wonder how: 1. one finds organizations that have data
| engineering 2. gets hired to said organization with software
| engineering background.
| khaledh wrote:
| Shopify is hiring 2,021 engineers (not just data engineers)
| in 2021: https://www.shopify.ca/careers/2021
| walleeee wrote:
| Nearly any field of computational science likely needs
| skilled data engineers. You could search for topics that
| interest you online and contact people accordingly.
|
| I cold-emailed my current lab's P.I. and just asked for work.
| Search for "research software engineer" or "scientific
| computing professional" positions. Plenty of data engineering
| goes on in many fields (environmental science, climate
| modeling, high energy physics, physical chemistry, etc), and
| plenty of fields desperately need to develop an engineering
| culture (e.g., plant biology, my field), whatever interests
| you. Availability and compensation will vary by discipline.
| [deleted]
| nerdponx wrote:
| This hits me at a personal level.
|
| This is how I imagine programmers must have felt in the 80s and
| 90s.
| jesseryoung wrote:
| 4 years ago I moved from a role where I primarily wrote C# as an
| architect on a web application, to an architect helping to build
| a data warehouse. The contrast in tooling, discipline and
| information available to build anything in the data world is so
| stark it had me questioning my career decisions. Sure, you can
| read Kimball and Inmon and I'm sure there are a handful of others
| out there - but there are drastically fewer than what you can
| find in the application development space.
|
| Things are getting better: visual ETL tools are falling out of
| favor, replaced by proper coded ETL (spark, dbt, etc), and data
| teams are starting to see the value of actually engineering a
| solution instead of just throwing it over the wall to a DBA to
| deal with. But tooling and general information on the web are
| still lacking. Pushing data engineers over "etl developers" or
| "bi developers" (or "data scientists") will drastically improve
| any organization's ability to actually deliver real analytics, and
| hopefully an industry-wide push will raise all ships.
| tpoacher wrote:
| We don't need [old but still relatively new definition, whose
| meaning still isn't fully agreed on or established]!
|
| We need [brand new definition of the same, which most people are
| even more confused what it means and how it's different from the
| old]!
| itsoktocry wrote:
| My first job as a "Data Scientist" (it wasn't called that, but
| the work was the same) was for a small gaming shop, around 2011.
| It involved applying econometric analysis and doing simple
| statistical testing on the player data sets. I realized quickly
| that knowing how to do statistical testing was only a very small
| portion of what it took to create value in such a role. At the
| time, I didn't even know (but learned) SQL. Everything I wanted
| to do involved teaming up with a developer, which wasn't
| efficient in a small operation. So I learned to program. I
| continue to enjoy skilling-up, most recently learning cloud-tech
| to enable me to deploy data tools I develop.
|
| The most valuable people in the data chain will be those that can
| take idea to near-production. Running ML libraries over clean
| datasets is overrated. The fact is, 80% of the value of "Data
| Science" comes from KPIs and basic stuff.
| hardtke wrote:
| > The most valuable people in the data chain will be those that
| can take idea to near-production.
|
| Having hired many data scientists/ML engineers over the years,
| people that build robust automated intelligence directly into
| products are extremely rare. I've estimated a maximum of 10K
| people in the entire world. They also command the highest
| salaries, not coincidentally. Very few people have both the
| statistics and engineering backgrounds as well as temperament
| to be successful, particularly when the problem requires new
| data sources or new types of models. There are some real simple
| practical hurdles such as the need to implement robust tracking
| that allows data snapshotting at the time when a decision needs
| to be made without affecting product performance, as well as
| figuring out how to gather data on users/situations that are
| actually important for moving the needle (first time users,
| casual users, etc.) There is also a mismatch between the best
| frameworks for prototyping/research and implementation which
| (at least at the companies I've worked for) can be summarized
| as "Java is good for application development, not ML. Python is
| good for ML, not application development."
| tristor wrote:
| Agreed. In my decently long career the types of data problems
| I've seen be most impactful on the business are not head-in-the-
| clouds ML issues, but more mundane yet more far-reaching:
|
| 1. Appropriately identifying what data needs to be captured from
| a product to correctly operationalize it.
|
| 2. Understanding and modeling data structures in internal
| applications to identify and tune backend data storage mechanisms
| (including DBMS). Inclusive in this is helping the application
| development team pick the correct structure and implement it
| correctly.
|
| 3. Validating implementation of instrumentation within the
| application so that data cleaning isn't necessary and that
| telemetry can be appropriately reported on. Building said
| reports.
|
| 4. Doing ETL and taking care of out of band data management to
| link disparate systems within the business to help build holistic
| views of the business overall.
|
| 5. Be a safeguard against the over-collection of data, because
| data engineers understand that data isn't an asset, it's a
| liability that increases costs and risks as a business or product
| scales, and when there's not a specific need that can be
| articulated clearly for that data, collecting it is a
| user/customer-hostile action.
|
| My experience has been that data is a crucial element to
| understand the health and state of the business with both breadth
| and depth at a given point in time and identify trends. However,
| it's mostly used by folks in management as a crutch to try to de-
| risk decision making, or worse as a political tool to give a faux
| support to a decision that's already been made but not yet
| publicized. Decisions carry inherent risk, including the decision
| to do nothing, you cannot eliminate this, it's one of the
| components of decision trade-offs. This sort of broken use of
| data by management is supported by "Data Scientists" that see the
| field as a cash-cow they can milk while they work on pie-in-the-
| sky ML strategies which are often unnecessary, even when they
| actually work.
|
| Done correctly a strong data culture in a company can increase
| decision velocity, empower engineers, and reduce overhead on
| management to understand the business. Done improperly, data
| culture in a business can easily destroy decision velocity,
| empower dysfunctional politics, and increase engineering overhead
| to understand systems. Getting it right is the main test for
| businesses in the new era.
| lamename wrote:
| This sounds reasonable. The trick is to identify
| companies/cultures of each type at interview time. But is that
| even possible?
| hehehaha wrote:
| I want to live in a world where data scientists making nearly
| $500K can understand and correctly implement simple concepts such
| as fixed effects. Is that asking for too much?
| beckingz wrote:
| Sure, but what about at $80k?
|
| None of this discussion is helped by the fact that companies
| want to get into data science without actually having much data
| strategy.
| donkeyd wrote:
| This is interesting. I was the technical founder for a data
| startup that used NLP, elastic and some other stuff for analysis.
| It's still active, growing and approaching profitability, is used
| by fortune 500 companies and has had some media attention.
| However, I've never been approached for a related role and have
| never been invited after applying for similar roles.
|
| Maybe my resume is bad, maybe my experience doesn't really fit
| anywhere, but I thought it was an interesting observation in
| light of this article.
| dqpb wrote:
| workera.ai has a very interesting approach to measuring,
| categorizing, and visualizing where you are on the skill spectrum
| and what employers are looking for on that spectrum.
| 6d6b73 wrote:
| I would go even farther. We don't need data, we need knowledge.
| shoshin8 wrote:
| I think just replace this with: Need Engineers
| monksy wrote:
| This has been a bit of an annoying thing for me for quite a
| while. There is a huge difference between a data engineer and a
| data scientist. A data scientist is not and should not be a data
| engineer. These are 2 different specializations that work
| together.
|
| A data engineer is more of what we used to refer to as people who
| wrote utility processes to process data or do system
| optimizations. At some point the industry decided to do away with
| a lot of the things we used to do (desktop apps, distributed
| systems, etc) and moved to REST services only. Then people
| realized oh wait.. we can't process data on a rest/web app. In
| typical inexperienced fashion, people tried to cram it in there,
| but it doesn't work. (See Javascript neural networks)
|
| What is data engineering? It's all about moving data around
| efficiently and processing it in a way that is performant and
| reactive. A lot of people tie hadoop/spark to being a data
| engineer. That's a terrible way of going about it. The more
| modern approach is to use streaming platforms and
| react to events. (Sadly a lot of the ML stuff is tied to
| either python/tensorflow or spark)
|
| At times data engineering is pushed towards data maintenance and
| pushing all of the data in a bucket. This isn't a very valuable
| use of effort.. but people want to be buzzword compliant.
|
| Note: There are use cases for hadoop and spark.. but those are
| rarer now. (They're better for very large datasets and for merging
| data where you have a much longer timeframe for the answer).
| sbpayne wrote:
| +1. I generally say that data scientists bring an amazing skill set
| to the table, but companies can only leverage 10% of it.
| AndrewKemendo wrote:
| Preach!
|
| The data lifecycle is waaay overpopulated with Data Scientists
| who are not empowered or knowledgeable enough to work with
| product designers and engineers to do everything that empowers
| Data Science and ML.
|
| We need more Data Engineers involved at time zero in projects to
| help:
|
| 1. Plan out what data should be produced/captured by the product
|
| 2. Instrument systems to actually generate data consistently and
| effectively
|
| 3. Build ETL pipelines and data management systems
|
| 4. Manage enterprise data sharing and resiliency
|
| etc...
|
| What ends up happening is you have a bunch of Data Scientists
| just handed a pg_dump or flat file from some ops team. That is
| typically missing data or poorly formatted and they spend 90% of
| their time cleaning it up then running some basic regression with
| numpy or whatever.
|
| Need better understanding of the data lifecycle by organizations
| and investment in instrumentation and data management.
| bitcharmer wrote:
| This same sentiment (which I personally agree with) applies to
| software engineering. As in: engineers deliver more practical
| value than comp scientists. Now you can down-vote me to
| oblivion.
| spacemanmatt wrote:
| Is this sentiment perhaps due to someone "practicing CS" on
| your engineering schedule? What's the real harm you're
| describing?
| vbtemp wrote:
| You and me both, friend. Except I normally get down-voted to
| oblivion for saying the opposite.
| adamisom wrote:
| I guess depends on what valuable means. I imagine most comp
| scientists are less replaceable than most software
| engineers, so point for compsci.
| collyw wrote:
| Depends if you have a computer scientist doing a software
| engineer's job
| johnqpub wrote:
| The two are complementary. Engineers can't do anything
| without the fundamental insights scientists provide. But
| scientists don't have the practical experience of writing
| end products that real users use.
|
| Obviously this is a huge generalization but I think it's
| a useful way to think about it. And when I say scientist,
| I mean "Professor of CS" not "24 year old with a BS in
| CS".
| jimbokun wrote:
| I think generally, Computer Science is a degree and Software
| Engineer is a job description. So many people get Computer
| Science degrees, then have a career as a Software Engineer.
|
| Yes, there are Software Engineering degrees. But I think a
| minority of Software Engineers have a Software Engineering
| degree.
|
| What this means in practice, is that Computer Science majors
| need to learn the engineering skills on the job or on their
| own after they graduate. Although some programs help students
| pick up some of those skills as part of the degree program.
| Fellshard wrote:
| Anecdotal: University of Washington considers (considered?)
| them two separate degrees, holding CS as more theory and
| research-driven, and CE as more practice and career-driven.
| TheCoelacanth wrote:
| I think it would apply if companies were hiring large numbers
| of computer scientists and using them to try to build usable
| software. I don't see many making that mistake. Most
| recognize that computer scientists belong in a research or
| academic setting.
| mgh2 wrote:
| Who will do the proper cleaning then?
| nerdponx wrote:
| It doesn't matter, as long as you don't make the person with
| the PhD in biostatistics spend their time writing ETL
| pipelines, which is a wildly inefficient use of a very
| expensive resource.
| names_are_hard wrote:
| Do people with PhDs in biostatistics earn significantly
| more than programmers? I honestly know nothing about the
| market for biostatisticians, but my impression was that
| advanced degrees in the natural sciences don't really pay
| that well compared to software engineers, especially given
| that they're much more educated.
| aldanor wrote:
| If they work e.g. in a hedge fund / trading firm, then -
| yea. And you see lots of PhDs from unrelated fields
| working as quants there.
| Enginerrrd wrote:
| Not to worry, corporate will just outsource to a firm which
| hires Data Janitors
| alex_anglin wrote:
| The aspiration that GP was getting to was that less cleaning
| is required as a result of better data engineering, I
| believe.
| AndrewKemendo wrote:
| Correct. If you build your instrumentation correctly, then
| you don't really need to do any "cleaning."
|
| That doesn't mean you won't need to transform data for
| different uses, but ideally you wouldn't need to, for example,
| change data types like turning a bool into an int.
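One way to read "build your instrumentation correctly" is to validate each event's types at the moment it is emitted, so downstream consumers never see a bool stored as 0/1 or a timestamp without a timezone. The sketch below is an assumption about what that could look like; `ClickEvent` and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ClickEvent:  # hypothetical event, for illustration only
    user_id: int
    clicked: bool
    emitted_at: datetime

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.user_id, bool) or not isinstance(self.user_id, int):
            raise TypeError("user_id must be an int")
        if not isinstance(self.clicked, bool):
            raise TypeError("clicked must be a bool, not 0/1 or 'true'")
        if self.emitted_at.tzinfo is None:
            raise ValueError("emitted_at must be timezone-aware")


if __name__ == "__main__":
    event = ClickEvent(7, True, datetime(2021, 1, 14, tzinfo=timezone.utc))
    print(event)
```

Rejecting a malformed event at the source costs the producing team a little, which is the trade-off raised elsewhere in this thread, but it removes a whole class of later cleaning work.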
| mgh2 wrote:
| Do data engineers have good analysis skills? Do business
| analysts have good engineering skills? I don't think
| either of them can fill the data scientist role.
|
| The scientific training and mindset (scientific method,
| hypothesis, experiment setup, etc.) to even create an
| accurate model is an undervalued skill here no? Even if
| data cleaning is automated, these skills cannot be easily
| learned.
|
| There is a reason why so many PhDs get into the field,
| because they were trained in the exploratory/research
| mindset that no engineering or analytics skills can fill.
| Correct me if I am wrong.
| dijksterhuis wrote:
| > Do data engineers have good analysis skills?
|
| Yes.
|
| > Do business analysts have good engineering skills?
|
| Depends on the analyst.
|
| > I don't think either of them can fill the data
| scientist role.
|
| > The scientific training and mindset (scientific method,
| hypothesis, experiment setup, etc.) to even create an
| accurate model is an undervalued skill here no? Even if
| data cleaning is automated, these skills cannot be easily
| learned.
|
| It's not about replacing data scientists with data
| engineers, it's about both roles working together to make
| everything more efficient.
|
| The hiring rate for data scientists has plateaued. The
| industry doesn't need any more of them. Why? Because data
| scientists often can't solve problems fast enough. It's a
| commonly quoted statistic that 70% of any data science
| task is data cleansing and/or etl. A data engineer's job
| is to take that 70% and turn it into 10%. The data
| engineer saves the data scientist time, meaning they can
| focus on what they're supposed to do -- build models.
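The 70%-to-10% figure can be turned into a rough speedup estimate. This sketch assumes the modeling work itself stays fixed and that both percentages are shares of total task time; that is one possible reading of the comment, not something it states. The function name is hypothetical.

```python
def task_time_after(cleaning_share_before: float, cleaning_share_after: float) -> float:
    """New task time as a fraction of the old, holding modeling work fixed."""
    modeling = 1.0 - cleaning_share_before
    return modeling / (1.0 - cleaning_share_after)


if __name__ == "__main__":
    fraction = task_time_after(0.70, 0.10)
    print(f"new task time: {fraction:.2f} of the old, about {1 / fraction:.1f}x faster")
```

Under that reading, each task takes roughly a third of the time it did before, so the same data scientist finishes about three times as many tasks.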
| dpayonk wrote:
| If we only had to use 1st party data, that might be
| easier. But then again, if you're building your product
| incrementally, you're still going to have instrumentation
| holes that you may or may not be able to partially
| backfill.
| darksaints wrote:
| The problem is that data engineers that are geared
| towards analytics very very rarely control the systems
| that create the data. If you're lucky, you have the task
| of hounding a team within your company to get their data
| management practices in order. And the conversation there
| is whether they should make their job harder in order to
| make your job easier.
|
| Unfortunately, data engineers rarely deal with purely in-
| house data. You're gonna be pulling data from a variety
| of data sources. I can assure you that if you're pulling
| from government data sources, you're gonna have a hell of
| a time. Speaking from direct experience, my team is
| probably going to spend $10M/year just trying to keep a
| government dataset in order, because they won't do it
| themselves. I'm talking lawyers, legal analysts, data
| engineers, data scientists, data entry personnel, etc..
| just to fix data that should have never been broken in
| the first place.
|
| It shouldn't be a shock that cleaning the data is the
| path of least resistance for many.
| AndrewKemendo wrote:
| Hence why I said DE need to be involved as early as
| possible. Aspirational sure, but that's what I've seen
| work the best and repeatably. It's the only scalable
| solution IMO otherwise you're perpetually playing catch-
| up.
|
| On the point about the govt I literally built a
| completely new contract type and civilian hiring
| practices for the DoD to bring in Data Engineers so they
| could do exactly what I describe to make your life
| easier.
| [deleted]
| antipaul wrote:
| I have a dream - and it looks like this!
| alexfromapex wrote:
| I agree, if anything the data engineers (folks with engineering
| backgrounds) should be doing the applied work while a
| department of data scientists works on the theoretical or novel
| data analysis methods.
|
| Right now our product has accumulated a lot of technical debt
| on the data validation side because data scientists designed
| the test code in a way that dramatically slows the development
| process.
| listenallyall wrote:
| > novel data analysis methods
|
| Many "data scientists" (not all, but many) have little to no
| ability to do anything other than apply "recipes" of
| algorithms or classification methods or logistic regressions,
| etc. Asking them to develop a "novel" method would be
| fruitless. Asking them to clean and scrub the source data set
| is like telling an amateur pie-baker the store was out of pie
| crusts, you'll have to make your own from scratch -- it's not
| going to happen, they just don't have that skill, the
| instructions on the box don't account for that possibility.
| As soon as the task diverges from the simple step 1, step 2,
| step 3 that they were originally taught, you realize they
| have very little ability to adapt. YMMV of course.
| oivey wrote:
| Yep. The key is really software skills. If you're unable to
| even filter the data yourself, you're also probably
| unlikely to be able to implement novel analysis techniques,
| especially if the analysis algorithm has many complicated
| steps or is computationally expensive.
|
| In all fairness, it's basically impossible for a new grad
| to have those skills. 4 years of a bachelor's in any field
| isn't enough to cover such a wide area. Even for people
| with graduate degrees it's a stretch.
| jimbokun wrote:
| The hope is that the 4 year degree gave you the ability
| to quickly pick up those skills on your own.
|
| If your four year degree didn't give you the ability to
| learn and expand your knowledge on your own, it's a
| colossal waste of your time and money.
| oivey wrote:
| Sure, but depending on what you're doing, "quick" might
| be years. You can get a PhD in understanding the theory,
| a PhD in designing fast numerical algorithms, or spend
| many years becoming a strong software engineer. I think
| the willingness to learn a diverse set of things is much
| more important than learning narrow areas fast. The short
| length of a bachelor's usually isn't enough to get this
| diversity.
| tchalla wrote:
| > Many "data scientists" (not all, but many) have little to
| no ability to do anything other than apply "recipes" of
| algorithms or classification methods or logistic
| regressions, etc.
|
| This is because they rarely hire people with scientific
| thinking ability. They just hire people who can code and
| program from set recipes. Once you hire such people you can
| not expect them to do non-recipe work. If you don't want
| recipe work, don't hire people with recipe skills. Do not
| have job interviews that select for recipe people. But,
| that is exactly what most companies do.
| simo7 wrote:
| Precisely.
| superhuzza wrote:
| >Asking them to clean and scrub the source data set...it's
| not going to happen, they just don't have that skill
|
| I think you've been working with conmen/conwomen. I've
| never seen a data science project that _doesn't_ involve
| data cleaning or wrangling of some sort.
| listenallyall wrote:
| Have you read through the comment thread? Did you read
| the article? Most everyone is in agreement that projects
| require a lot of cleaning & wrangling and a lot more --
| the point is that data _scientists_ are generally not
| doing that stuff, they expect academic-quality, pre-processed,
| pristine data, so it's data _engineers_ who
| are stuck preparing the data, and who are in high demand.
| superhuzza wrote:
| Yes I read both the article and the comments.
|
| I meant a data science project in terms of a project
| completed _by_ data scientists. In my experience, all
| data scientists are accustomed to doing extensive
| cleaning etc.
| watermanio wrote:
| This feels especially true when you have access to things
| like BigQuery ML.
|
| It's very easy for an average engineer (like me) to start
| using ML using these tools, but a lot harder to explain how
| it works, or exactly which type of models to use.
|
| In my mind a DS would be really useful to just point us in
| the right direction and check work. Like a super specialist
| QA...
| fatnoah wrote:
| >The data lifecycle is waaay overpopulated with Data Scientists
| who are not empowered or knowledgeable enough to work with
| product designers and engineers to do everything that empowers
| Data Science and ML.
|
| Reading this thread has made me realize just how lucky I am to
| work very closely with a very strong Data Scientist, who
| is complemented by a very strong Data Engineer. Conversations
| with the Data Scientist are always about strategy, product
| alignment, and ensuring we're optimizing what we build for
| learning. The Data Engineer works very closely to ensure we're
| actually capturing the data we think we are, getting it to
| analysis systems, and making sure those data pipelines stay
| healthy.
| dumb1224 wrote:
| In specific research areas such as biomedical science it is
| certainly tricky to get involved because of the data governance
| / confidentiality issue... so we have to do both roles to some
| extent
| mynameisash wrote:
| > What ends up happening is you have a bunch of Data Scientists
| just handed a pg_dump or flat file from some ops team
|
| Not to disparage the amazing data scientists I've worked with,
| but I've been on teams where this is very much the approach to
| operationalizing models. It's basically, "Here's the sklearn
| model and some fragile featurization scripts we built. Can you
| take this to prod ASAP?"
|
| The problem I've seen is that DS & DE teams were in different
| parts of the org and had their own sprints that were in no way
| connected. So they kept chucking models over the wall and we
| kept trying to faithfully operationalize. Once we convinced
| leadership that we had to collaborate from the get-go, things
| went a whole lot better. It also improved the working
| relationship of engineers and scientists.
|
| I learned a hell of a lot from the scientists; they learned how
| to write better code. They also learned what code they didn't
| need to write because I could do it faster or better than them,
| leaving them to focus on more important things. It was pretty
| amazing to find what manual processes they would set up in lieu
| of proper (or even _any_) engineering support. Again, these
| are amazingly smart people, but they were being square-pegged
| into a lot of round-hole engineering tasks.
|
| Now, the much more frustrating issue I had was being in a very
| data-heavy organization and being told by a distinguished
| engineer (my skip-level) plus my direct manager that, "data
| engineering isn't a real discipline." I left that org very
| shortly thereafter.
| civilized wrote:
| This is 100% my experience as a data scientist. The
| engineering support we get is restricted to submitting a
| ticket for database access or moving data from one system to
| another. Wouldn't dream of involving an engineer in a data
| science project team, because I have no evidence that they
| have any experience or expertise in anything other than
| tickets to move data around.
| Mauricebranagh wrote:
| That's first line support not engineering
| hinkley wrote:
| This is a systemic problem. We ask non software engineers to
| write code, and then we expect them to apply a level of
| robustness and long term planning that even we have difficulty
| achieving. Not because we're being picky, but because we know
| the failure modes that are likely, and we know that people
| convince themselves that they aren't.
|
| We've been through this with installer writers, database
| admins, test automation, operations people, and now 'devops'
| people who were supposed to be the answer to these problems. It
| never stops.
| AndrewKemendo wrote:
| It's a two way street. SWE need to learn some data practices
| and data folks need to learn some SWE practices.
| hinkley wrote:
| Oh absolutely. We'll build completely the wrong thing, but
| build it well (which just makes it all the harder to throw
| it away).
| aqme28 wrote:
| > you have a bunch of Data Scientists just handed a pg_dump or
| flat file from some ops team.
|
| I feel seen. At a previous job, our output after some cleaning
| and transforming was a pg_dump for the data scientists to load.
| We had little visibility of what they did to that database once
| they got it.
| nitrogen wrote:
| I suspect in rare cases this is by design, because engineers
| would object to the behavior of the Business Intelligence
| department on ethical grounds.
| aqme28 wrote:
| If anything, we would have objected to the quality of the
| code they were writing.
| coding123 wrote:
| A couple of us inherited a machine learning project a while back.
| The code was horrible. Riddled with copy pasta (nearly half of
| the entire thing was copy paste and no code reuse). We basically
| refactored everything, standardized input and output file names.
| We put up a small Flask service to allow outside services to hit it
| easily and wrapped it up in a Docker container so it was
| ultimately easy to deploy. Yes it was all the plumbing. However
| we also looked at the code, and the ML strategies, and while
| there was "some" level of competence, it was nothing more than
| word2vec add and divide. Totally horrible for actually finding
| key phrases that matter to the subject we're matching. So we
| started tackling that too with LSTM but our time got cut short
| and shifted off to another area. So not only was the "scientist"
| they hired completely crappy at the engineering, they weren't
| really helpful in the ML either.
|
| This is obviously of lesser value to the topic at hand, and more
| about making sure you hire good people I think.
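|
| For readers unfamiliar with the phrase, "word2vec add and divide"
| presumably means averaging word vectors: add up the vectors of the
| words in a text and divide by the count. A minimal Python sketch,
| assuming toy 3-dimensional vectors (TOY_VECTORS, average_embedding
| and cosine are hypothetical names, not the project's actual code):

```python
import math

# Toy 3-dimensional "embeddings" (an assumption for illustration);
# real word2vec vectors are learned and have far more dimensions.
TOY_VECTORS = {
    "data": [1.0, 0.0, 0.0],
    "engineer": [0.0, 1.0, 0.0],
    "scientist": [0.0, 0.0, 1.0],
}


def average_embedding(tokens, vectors=TOY_VECTORS):
    """Add the vectors of all known tokens and divide by their count."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no known words, so no embedding
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    forward = average_embedding(["data", "engineer"])
    backward = average_embedding(["engineer", "data"])
    # Word order is lost: both phrases map to the same vector.
    print(forward, backward, cosine(forward, backward))
```

| Averaging ignores word order and weights every word equally, which
| may be part of why it did poorly at picking out key phrases here.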
| wpietri wrote:
| The phrase "If you can't dazzle them with brilliance, baffle
| them with bullshit" comes to mind.
| mlthoughts2018 wrote:
| I am curious what your take is on things like this article:
|
| https://managingml.substack.com/p/the-myth-that-machine-lear...
|
| It has been my experience too. Basically, ML / DS engineers are
| thrown under the bus for being poor general software engineers,
| but in practice it's totally the opposite.
| nerdponx wrote:
| The problem is that ML engineers are not the people who wrote
| GP's garbage code. Data scientists wrote it, and I know at
| least a few of my very intelligent, high-functioning data
| scientist colleagues who are alarmingly, astoundingly bad
| programmers.
| asimjalis wrote:
| Who gets paid more? Maybe the question to ask is why market
| forces are not naturally producing more of what is needed?
| program-iscuous wrote:
| That's an interesting post; however, in my opinion there's a large
| bias in how the analysis is done.
|
| You have a stratified sample of companies in their early stages,
| I think it's quite normal for most companies in their early
| stages to prioritize data engineers rather than data scientists.
|
| Data scientist comes after the data engineer, and if you have a
| data scientist and not a data engineer then probably the data
| scientist does both jobs. On the other hand, data engineer is not
| dependent on a data scientist.
|
| To conclude, I think that indeed there are more data engineer
| positions because there are too many "data scientists", however
| the true difference is not as large as in your analysis.
| hankchinaski wrote:
| the article is long overdue. so many times i have seen data
| scientists glue together non production ready code and libraries
| expecting someone else to finish the job and put the project to
| production. in various places i worked at they have soon realised
| that we mostly and exclusively needed data engineers - glorified
| DevOps people that make production ready data pipelines - more
| than we needed data scientists to do "modelling"
| thecolorgreen wrote:
| I'm a SWE and data engineering actually sounds super interesting
| to me. Unfortunately, my day-to-day doesn't provide opportunities
| to work with the massive amounts of data we generate. I've looked
| into learning this stuff online but courses like DataCamp seem
| too basic (I have experience with Python and data cleaning in a
| research setting along with some academic ML experience) or
| downright a bit scammy. Many of the articles I read online about
| this also frame data engineering as a way to transition into the
| tech industry, which isn't my blocker. Does anyone have advice to
| help me transition away from pure software to a job as a data
| engineer?
| nojito wrote:
| We just hired a SWE turned Data Engineer. You don't need to
| handle massive data to make the transition. I believe all the
| person did was build a small but robust pipeline that took some
| API response data, cleaned it, populated some sqlite dbs,
| replicated it for a small team to use and kept it updated every
| few days automatically.
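|
| A minimal sketch of that kind of pipeline, using only the Python
| standard library. The payload, table name and function names are
| assumptions for illustration (not what the hire actually built); a
| real job would fetch the response over HTTP and be scheduled with
| cron or similar:

```python
import json
import sqlite3

# Hypothetical API payload; a real pipeline would fetch this over HTTP.
RAW_RESPONSE = json.dumps([
    {"id": 1, "name": " alice ", "score": "10"},
    {"id": 2, "name": "bob", "score": None},    # missing score: dropped
    {"id": 1, "name": "Alice", "score": "12"},  # same id again: replaces the first row
])


def clean(records):
    """Strip whitespace, coerce types, and drop rows without a score."""
    cleaned = []
    for rec in records:
        if rec.get("score") is None:
            continue
        cleaned.append(
            (int(rec["id"]), rec["name"].strip().title(), int(rec["score"]))
        )
    return cleaned


def load(conn, rows):
    """Upsert rows by primary key so re-running the job is idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores "
        "(id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO scores (id, name, score) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def run(conn, raw=RAW_RESPONSE):
    """Parse, clean and load one API response; return the table contents."""
    load(conn, clean(json.loads(raw)))
    return conn.execute("SELECT id, name, score FROM scores ORDER BY id").fetchall()


if __name__ == "__main__":
    print(run(sqlite3.connect(":memory:")))
```

| Keying the upsert on the primary key means the scheduled job can be
| re-run every few days without creating duplicate rows.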
| thecolorgreen wrote:
| That's good news for me. For a minute, I genuinely thought
| I'd have to waste time going through a bunch of stuff I
| already knew to learn a tiny bit and earn a certificate to
| prove my skills.
| eVoLInTHRo wrote:
| Agree with the general idea. Data engineers are like the
| essential workers of the data world, people who today may not
| receive the appropriate level of appreciation that they deserve.
|
| I think the glut of data scientists occurs because we clump so
| many different skills and disciplines under the single term "data
| scientist". Data scientists today come from so many different
| backgrounds that the definition means something different to
| everyone. Because of this, the surface area of possible skills
| that could be expected of a data scientist is vast, to the point
| where it's pretty unlikely to be sufficiently competent in all of
| them, let alone a majority.
|
| I'd like to go back to a world where we had a little more
| specificity about what kind of data scientist you are (e.g. I had
| no problem with terms like statistician and data miner), which
| could help ground expectations that others have of us, and it'd
| also help clearly define the scope of various career paths for
| the next generation.
|
| Sadly the individual who coined the term shows no contrition for
| the degree of confusion that the rest of us have been left to
| deal with:
| https://observer.com/2019/11/data-scientist-inventor-dj-pati...
| Ansil849 wrote:
| Genuine question: why is there so much pure teeming hatred for
| data scientists in this comment thread? Almost every comment
| comes off as full of snark and vitriol against data scientists.
| danbrooks wrote:
| I would guess that it's a reaction to job title hype.
|
| There's a huge variety in DS responsibility and background
| between companies.
| bart_spoon wrote:
| It tends to happen any time something becomes trendy. Data
| science has/had a lot of hype over the last decade, and people
| seem to have an inherent tendency to want trendy things to fail.
| Combined with that, you have a bunch of people hopping on the
| data science bandwagon, so you get a lot of grifters, snake oil
| salesmen, or simply individuals whose output is poor quality.
| Seems to have created a feedback loop, where there is always a
| new example of some AI solution failing, or a data science
| initiative that didn't work out that everyone can point at and
| say "See! I always knew this trend was dumb!".
|
| Reality is that data science is here to stay. It's coming out
| of the honeymoon period, and things may never be as hyped up as
| it has been the last decade, but that's probably a good thing
| for the field. Everyone will probably move on to hating the
| next up and coming thing. I have a hunch it could be something
| in data engineering because, while not exactly new, it is
| absolutely the next "data science" in terms of demand, and with
| products like Snowflake having so much hype behind them, it
| seems the backlash will be inevitable.
| andrewnc wrote:
| I came to this thread interested in the discussion, but I feel
| now like the Homer Simpson meme retreating into the hedge.
|
| Maybe I'll come back in a few hours, but for now I'll stay
| away.
| lottin wrote:
| Yes, the tide is turning now... who came up with the term "data
| scientist" anyway? It's a made up profession. If you need
| someone who understands statistics, get a statistician, or
| maybe a mathematician. If you need someone that designs and
| writes computer programs, get a computer programmer. But a
| "data scientist"? No, thanks.
| astrophysician wrote:
| I'm guessing just venting personal frustrations due to their
| own experiences, plus maybe poor hiring and guidance of data
| scientists in their own teams? I have definitely seen DS people
| in my experience that fit some of the descriptions here, but I
| think it's a mistake to trivialize the DS position itself. A
| good DS is a valuable asset, but depending on your
| company/data, maybe not worth the cost. Plus there is no
| "single" DS candidate or role, a lot of these roles (data
| engineer, DS, analyst, swe) blend together at times, and it's
| about finding the right balance of skills.
|
| Sometimes I think a company (not having the DS experience
| themselves) mistakenly over-hires DS roles in today's hype of
| "AI" when their data is mostly run-of-the-mill and only
| requires simple linear models that can be architected and
| understood by a stats/math-savvy engineer. Even then, a good DS
| is still useful (even linear models can be complex: e.g. what
| priors do you want to use? Do you want a multi-task solution?
| etc.), but maybe not worth the cost.
| mywittyname wrote:
| I think a lot of people feel like data scientists get all the
| credit and the fun work, while we have to do all the heavy
| lifting and boring stuff that they need. There's this idea that
| data science is a special and unique skillset that couldn't
| possibly be possessed by a simple software engineer. Despite
| being largely a subdomain of CS.
|
| I remember an era before "data scientist" was a job title. When
| we (programmers) would analyze data to see if we had enough
| information available to identify the problem, if not, fix
| that, then come up with a strategy to solve it, test, and
| finally deploy the model. The fun part was trying different
| solutions and analyzing the data. It also felt awesome to
| deploy a product that worked like "magic." Product owners
| didn't know or care what a neural net was, they were just happy
| it worked.
|
| Now there are tons of data scientists out there who take the
| easy, fun, rewarding work and try to skip over the nitty gritty
| implementation details. Then management thinks engineering is
| incapable of doing such work, and the only time we get the
| opportunity to do something fun is to do so behind the scenes.
| bartleby_ wrote:
| Probably just people blowing off steam and the target du jour
| are data scientists. If people had to sit down with their
| company's data scientists to air their grievances face to face
| I doubt they would be so condescending.
| aaroncrayford wrote:
| What scientist doesn't use data? Such terrible naming.
| LawrenceHecht wrote:
| Building or running data infrastructure is an important part of
| 55% of 372 data engineers' jobs, according to the "2020 Kaggle
| Machine Learning & Data Science Survey." ...Check out three
| charts I created showing differences between data scientists,
| data engineers and machine learning engineers:
| https://thenewstack.io/software-engineers-use-spreadsheets-d...
| nlbrown wrote:
| What are the typical entry level Data Science/Engineering
| positions like? Are they available to fresh college or bootcamp
| graduates?
| marcinzm wrote:
| From everything I'm hearing Data Science is right now flooded
| with entry level applicants to the point where they're applying
| for even pure Data Analyst jobs. Data Engineering a lot less
| so.
| brd wrote:
| As an industry we're letting history repeat itself and making all
| the same mistakes.
|
| There are different kinds of developers. At its most basic form,
| you have systems focused developers and algorithmic focused
| developers. Sure there is a grey area but I think those two
| buckets are pretty defensible.
|
| In the data science world you have an exact parallel. Those who
| build the systems and those who optimize the thing the system
| supports.
|
| In the ML world you have another parallel. Those who build the
| systems and those who optimize and pioneer the model
| architectures and parameters.
|
| We never reached consensus on the titles for different kinds of
| developer/programmer/computer scientists. And we're failing now
| to reach consensus on sane titles for ML and DS.
| ziml77 wrote:
| Having to deal with data scientists, I absolutely agree. The
| thing that I've seen that lands in the "lab" vs production
| distinction is that these people expect their data to be
| pristine. They flip out when the world isn't as perfect as their
| models want. Leads to me as just a normal software developer
| having to do the data analysis and figure out how to clean it up.
|
| I also end up having to be the one to talk to data vendors to
| understand their data feeds and essentially translate that for
| the data scientists. Having to sit in the middle is annoying for
| me and suboptimal for the business.
| baron_harkonnen wrote:
| The data science field has been flooded with PhDs with nowhere
| else to go that have _no_ background in engineering, and sadly
| often have a very poor understanding of both machine learning
| and statistics.
|
| Companies were in a rush to hire "data scientists" and boot camps
| like Insight were more than happy to pump out very impressive
| PhDs with just enough understanding to build a Keras model.
|
| I've worked in industry awhile doing DS work and have been
| astounded at the number of PhDs that both don't know how to
| write Python that doesn't live in a notebook _and_ throw away
| years of disciplined experimentation experience to just throw
| keras models at data until the needle moves.
|
| There do exist excellent data scientists out there, who are
| both very solid software engineers and really know their stuff
| mathematically, but I've found most of these people can't
| reliably find jobs because the people interviewing them know so
| little that good data scientists will be penalized for answering a
| stock question _correctly_.
|
| The field has been so flooded with amateurs that have no idea
| what they're doing, that potential mentors have been driven
| out, and now it's just a mess. To get a job doing DS if you do
| know what you are doing you have to play a weird game where you
| guess the incorrect answer the interviewer has in mind.
| fock wrote:
| I do an introductory Python lab course at my university. It's
| targeted at engineers who still create graphs from Excel and
| then normally level up to MATLAB, if things get complicated
| (think insets, ...). I guess about 30% of the people
| previously did at least some of the YT/Udemy "courses" on
| datascience. It's really horrifying for me (not being an
| engineer myself, but imo having a relatively engineering-like
| mindset) to see these people horrified at simple tasks like
| writing a variadic function. "What do I need this for?".
| Well, it's using the programming environment. And then let
| them code up a simple version of Levenberg-Marquardt. The
| level of "why do I need to do this" is astonishing again...
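|
| For reference, a variadic function is simply one that accepts a
| variable number of arguments. A minimal Python sketch (mean and
| describe are hypothetical names, not material from the course):

```python
def mean(first, *rest):
    """Average one or more numbers; *rest collects any extra positional arguments."""
    values = (first, *rest)
    return sum(values) / len(values)


def describe(label, *values, **options):
    """Format the mean of values; **options collects optional keyword arguments."""
    precision = options.get("precision", 2)
    return f"{label}: {mean(*values):.{precision}f}"


if __name__ == "__main__":
    print(mean(1, 2, 3))
    print(describe("x", 1, 2, precision=1))
```

| Requiring `first` separately makes a call with no numbers fail with
| a TypeError instead of a division by zero.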
| nitrogen wrote:
| _why do I need to do this_
|
| IMO this is _the_ number one problem of our modern culture
| around education. Popular culture makes it popular to treat
| education as pointless, and this even affects students who
| are pursuing difficult degrees. "Why do I need to study
| humanities? Why should I learn to code if I think I am born
| to be someone else's boss?"
|
| On the other hand, many teachers in K12 and early
| university have no ability to connect the "what" with the
| "why." "The curriculum is the curriculum. The test is the
| test."
|
| If we can solve these problems, our societies will be much
| better off.
| wickedsickeune wrote:
| If the educator cannot explain why the knowledge is
| useful, then he is unfit to teach it.
| theflyinghorse wrote:
| I work at a place with a very high count of PhDs. Some of
| them write code. All of them view writing code as something
| menial and unimportant and it shows in the resulting work,
| which from my experience is atrocious.
|
| Of course I understand that YMMV, but I will forever be
| skeptical of anyone writing code with a PhD after working
| here.
| lordgrenville wrote:
| Are they CS/EE PhDs?
| nerdponx wrote:
| Not to mention the dark pattern of giving data scientist
| candidates an unsolved industry problem as their interview
| take-home task, and then telling them to only spend 4 hours
| on it. Data science hiring often feels like a competition
| where the winner is the one who has the most free time and
| willingness to do other people's work without compensation.
|
| It's kind of a fucked up field right now.
| boterham wrote:
| Do you have any suggestions for where to start looking for
| good places to apply that don't suffer from this?
| glitchc wrote:
| This implies a lack of rigorous training. In the physical
| sciences, one wouldn't become an applied scientist without
| conducting an experiment to test a phenomenon, and the teeth
| gnashing that goes with making that experiment work.
|
| Those who have been fed pristine data without having to undergo
| the trials and tribulations of actually having to collect the
| data have missed a crucial part of scientific training. Like
| you, I find this lack of rigour is rather common among data
| scientists. Not all, but quite a few.
| astrophysician wrote:
| Just want to say that while the data science profession
| definitely includes a wide range of people and skillsets, a
| _good_ data scientist should be practical and able to work with
| the available data in whatever state it's in.
|
| No good data scientist should ever expect data to be pristine.
| And a good data scientist, even if they don't have quite the
| engineering chops necessary to build a production-quality ETL,
| should know enough about the process to help guide it. If they
| aren't a part of that process, they're not being a good DS.
| They can't expect someone not involved with their problem to
| know what tradeoffs to make, and if they don't know _exactly_
| how their data went from raw form to the ETL-ed form, they're
| probably going to make bad assumptions, and those assumptions
| may very well make their architected solution a complete pile
| of garbage. Not to mention, how can a DS offer suggestions for
| solutions if they aren't deeply familiar with the raw data
| that's available?
|
| To me, a good data scientist should, at bare minimum, have
| several skills.
|
| * They should first and foremost (but _not solely_) be an
| in-house expert in statistics and machine learning to know what
| can be done with data, and what _can't_ be done with data.
| They should arrive with that knowledge. Engineers I think have
| a tendency to trivialize this, but true expertise in this
| domain comes only with years of experience.
|
| * They should strive to find modeling solutions that are
| _right_ for a _particular business problem_. If they seem to be
| only applying the hottest research regardless of the tradeoffs
| for the particular business problem, that's a red flag.
|
| * Their focus should be on integrating themselves with the
| product/business as much as possible, and with the engineering
| team as much as possible. If they're expecting to be handed
| directives, that's a recipe for a ton of wasted time.
|
| DS should never, ever be siloed into their own little DS world.
| They will be useless without a deeply intimate knowledge of the
| business goals, the needs of product, and the capabilities of
| the engineering team.
|
| As they progress, they should become more and more "full-
| stack", otherwise they are stagnating.
| tchalla wrote:
| A good data scientist should also be good at science.
| Otherwise, you can simply hire people with engineering skills
| - you don't need scientists. If you hire scientists and then
| are surprised they aren't good at engineering, the hiring
| process needs a reality check.
| jsinai wrote:
| Statistics is a science as well. Unfortunately it's
| overloaded in business terms and can mean anything from
| "knows means and regressions" to "has a copy of _Meyn and
| Tweedie_ on their shelf".
| superbcarrot wrote:
| I think that this is more of a problem with the specific people
| that you have worked with and it isn't inherent to the role of
| a data scientist.
| ct0 wrote:
| Doesn't sound like a modern Data Scientist, sounds more like
| a statistician with 30+ years of experience.
| sidlls wrote:
| It's becoming more inherent, especially as the field is
| populated with people who have no experience with the
| "science" part. That is, with the very real and ubiquitous
| problem of collecting and cleaning data to make it fit for
| scientific study. Even theoretical physicists, for example,
| participate in and rely on empirical data collection, and
| understand deeply how messy and fraught with error it is.
|
| I don't see the same appreciation or consideration in general
| in the field of data "science."
| Mauricebranagh wrote:
| I remember working with someone who has a PhD in Physics and
| who worked at CERN - and one comment I loved "a key skill
| is knowing how to place the legend, so it obscures that
| annoying outlier data point"
| superbcarrot wrote:
| > with people who have no experience with the "science"
| part
|
| It's interesting that you put it that way because a lot of
| the other complaints in this thread are that the people who
| expect their data to be ready for use are exactly the
| people with science experience but without the relevant
| technical background.
| nerdponx wrote:
| Instead of sneering at "having to deal with" data scientists,
| consider that the data scientists themselves would often much
| rather have data engineers and dev ops people involved in the
| process.
|
| Data scientists like to quip that 80% of the job is data
| cleaning, with the remaining 20% divided up arbitrarily among
| other tasks as suited the joke. In some shops nowadays, it's
| more like 45% data cleaning, 45% data
| engineering/ops/programming just trying to make your results
| available to the rest of your organization, and 10% research.
|
| If I can spend less time learning/doing software engineering
| and devops and more time doing actual data science, that's
| great. At a previous job, my team was _clamoring_ for more data
| engineer hiring, and part of the reason our projects were
| slipping and starting to fail was lack of data engineering
| support. Our tooling was shit, our processes were shit, our
| code was shit, and access to (and trust of) our data sources
| was especially wet and stinky shit.
|
| It made the daily work of doing data science a miserable slog
| of ad-hoc duct-tape solutions, and it contributed to us being
| generally ineffective as a team.
|
| All of this would have been fixed if we had _one_ competent
| data engineer with some actual real-world data/ML engineering
| experience and good communication/advocacy skills. Let alone
| two or three!
| ramraj07 wrote:
| If the DE tooling was shit and you couldn't hire more fast
| enough, why didn't your team members start addressing these
| problems? Surely spending half the time cleaning up the pipes
| would increase the value of what you do with the other half?
| prionassembly wrote:
| I'm late to the comment party, but: this is classic "commoditize
| your complement".
|
| This guy would have you believe that Pytorch has Solved the
| entire, vast field of data analysis as inherited from Newton, de
| Moivre, Laplace, Bayes, Fisher, Neyman, Pearson, Wald, Savage,
| Jaynes, Breiman, Pearl.
|
| This is a lot like saying that photography has Solved art, and
| now we need people who can climb ladders and glue the posters on
| them big billboards. It would be delusional if it didn't have a
| self-interested angle.
|
| What, we with math degrees are fully confident that the plumbing
| problem is easier to commoditize than the problem of making sense
| of data.
| zeku wrote:
| I'm a data engineer for most of my day right now, and a lot of it
| is done with ruby/python/shell scripts into postgres DBs.
|
| What learning path should I go down? I'm a solo actor at work
| with a lot of agency to decide my workflows.
|
| I see myself building small to medium size data collections over
| the next year or two at my job.
|
| Can someone point me to some learning?
|
| I have a CS degree etc. and my title is software engineer etc.
| etc.
|
| End users of my data _usually_ like their data as a CSV that is
| then read using R or Python. However there is also a use case
| where I will build an app to view my data in a simple way.
|
| All of this is completely doable with my current
| knowledge/workflow but I can't help feel like I do a data
| engineering job with very different tools than I see "data
| engineers" speaking about online.
| langitbiru wrote:
| A couple of days ago, there was a thread about "How to become a
| data engineer in 2021":
| https://news.ycombinator.com/item?id=25728198
| zeku wrote:
| thanks!
| ageofwant wrote:
| Adopt a good workflow tool, like Apache Airflow. Easiest is to
| rent a service from AWS:
| https://aws.amazon.com/managed-workflows-for-apache-airflow/
| mywittyname wrote:
| Seconding this, Apache Airflow is awesome.
|
| I can't believe how much time it saves during development.
| zeku wrote:
| thanks!
| glitchc wrote:
| Good data is hard. Anyone conducting research in the physical
| sciences knows this firsthand. It takes painstaking effort to
| conduct carefully controlled experiments and collect a batch of
| good data that could then be used for analysis.
|
| The promise of ML has always been to churn out good results from
| not-so-good data. If I now need to sanitize my data carefully,
| what's the advantage?
| claytonjy wrote:
| What makes you think that was the promise? I would say the
| promise has been to replace code with data, and to build things
| with data that would be practically impossible to build with
| code.
|
| "Garbage in, garbage out" has been the mantra for a long time.
| Sure, there are tools and techniques to deal with not-so-good
| data, but those are add-ons, not a core part of the value
| proposition of ML.
| lordnacho wrote:
| My experience is in quant hedge funds, where sometimes you get
| some guys who develop the strategy and some guys who put it into
| production.
|
| Yes, I do admit there can be some specialization in terms of time
| spent on science vs engineering.
|
| But you really need people who understand both. Particularly if
| you have a strategist who thinks his job is just to dream up
| profitable models, he ends up carving that role out in a way
| that's detrimental to the rest of the team. You get people who
| just don't appreciate that there's other work to do than finding
| models, and that models depend on that other work to function.
|
| You also get a huge prestige gap, because inevitably management
| will think that there's a magician and a blacksmith. One guy
| needs to be paid a lot, and the other guy needs to be paid
| enough.
|
| These effects feed each other. Magician will say "where's my
| data" and expect blacksmith to make it, promptly. He won't do it
| himself, because spending time on mundane stuff makes the magic
| disappear. And not doing it yourself, or taking the time to
| understand it, will eventually lead to problems with the magic.
| hntrader wrote:
| To add, quants that can't do the data engineering work are
| always crappy quants. I haven't seen a counter-example to that.
| Profitable models aren't going to be delivered on a silver
| platter. They need to be able to process pretty low level data
| effectively and build ad-hoc custom tools and data pipelines
| around that to test out their ideas. Otherwise they're
| constrained to the tools others have built and that massively
| narrows the search space that they're capable of traversing.
|
| The best quants are 1/3 statistician, 1/3 developer and 1/3
| trader, in my view.
| [deleted]
| Karrot_Kream wrote:
| > 1/3 statistician, 1/3 developer and 1/3 trader
|
| How is being a trader different from being a statistician?
| Curious as I've never worked in finance before.
| hntrader wrote:
| By trader, I mean domain knowledge about the markets.
| Statistics is the toolbox that this domain expert uses to
| test their hypotheses and turn them into a profitable
| model. But if the person isn't a domain expert and only
| knows statistics, their ideas about what to test won't be
| good.
| wpietri wrote:
| And to knowledge, I'd add disposition. It's been years
| since I've been in finance, but the best traders I worked
| with were all very driven to succeed, to dominate, to
| win. Markets were really interesting to me, but I never
| cared much about that part.
| hntrader wrote:
| Yeah, it's a performance discipline like any other
| (competitive gaming, athletics, etc) where only the top
| few % can succeed. If someone isn't very driven then they
| won't make it.
| twic wrote:
| I'm not sure about _crappy_ quants. Some people of the
| "quantitatively inclined trader who has learned Python"
| variety are never going to be good at the engineering side -
| it takes years to learn to be a good software engineer, and
| that's not a good use of time, for them, or for their
| employer. But they can still do useful work.
|
| The trick is to figure out how to work effectively with those
| people. Build infrastructure that keeps them on the rails,
| refactor their code, push them in the right direction, tell
| them when they've fucked up, teach them little things with
| high leverage. As long as that doesn't turn into being their
| slave, that's fine.
| [deleted]
| proverbialbunny wrote:
| If they're using a dynamically typed language to do
| monetary calculations, it's not going to be ideal.
|
| Researchers do not need to have deep programming
| experience, but they have to be comfortable enough to use
| an environment that can lend itself to the problem
| at hand. On the quant side, unlike on the data science
| side, the barrier of entry on the programming side is a bit
| higher. To solve this problem many firms have their own
| internal programming language.
| greenshackle2 wrote:
| > If they're using a dynamically typed language to do
| monetary calculations, it's not going to be ideal.
|
| And yet, Q.
| chadash wrote:
| > "To solve this problem many firms have their own
| internal programming language."
|
| Any examples other than Jane Street?
| anonymousDan wrote:
| Goldman Sachs (Slang)
| singhrac wrote:
| > If they're using a dynamically typed language to do
| monetary calculations, it's not going to be ideal.
|
| I think this is an inaccurate take. No one in finance is
| doing accounting or model estimation using Python's
| floats; they are using numpy's float32 (or float64) type
| instead. I think a more accurate version of what you're
| saying is that static type checking is useful when
| modeling complicated contracts; this might be true, but I
| think it's not that important, as those things aren't
| that liquid anyway.
|
| Jane Street's decision to use OCaml is almost as much
| about hiring and history as it is about language
| features.
| twic wrote:
| > No one in finance is doing accounting or model
| estimation using Python's floats
|
| We are. When your input data only has five significant
| figures, and probably less than that of real information,
| numerical accuracy is the least of your worries.
| [deleted]
| aldanor wrote:
| Or, they're using _ints_ instead, at least for market
| data.
| proverbialbunny wrote:
| Fixed precision types technically. Internally they are an
| int under the hood, so yah basically that.
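|
| A minimal sketch of the "ints for market data" idea, assuming a
| tick size of one cent. The helper names (to_ticks,
| to_price_text) are invented for this example and do not come
| from any particular firm's tooling. Whether the extra care
| matters for research work is exactly what is being debated in
| this subthread.

```python
from fractions import Fraction

TICK = Fraction(1, 100)  # assumed tick size: one cent


def to_ticks(price_text: str) -> int:
    """Parse a decimal price string into a whole number of ticks."""
    ticks = Fraction(price_text) / TICK
    if ticks.denominator != 1:
        raise ValueError(f"{price_text} is not a multiple of the tick size")
    return int(ticks)


def to_price_text(ticks: int) -> str:
    """Format a tick count back into a two-decimal price string."""
    sign = "-" if ticks < 0 else ""
    units, cents = divmod(abs(ticks), 100)
    return f"{sign}{units}.{cents:02d}"


# Integer ticks add exactly.
total = to_ticks("1.10") + to_ticks("2.20")
print(to_price_text(total))          # 3.30
print(total == to_ticks("3.30"))     # True

# The same sum in binary floating point is off in the last bit.
print(1.10 + 2.20 == 3.30)           # False
```

| Prices are parsed from text rather than from floats so that no
| binary rounding sneaks in before the conversion to ticks.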
| oivey wrote:
| This is dogmatism swung too far in the other direction,
| IMO. There are many, many successful production code
| bases written in dynamic languages. In my own experience
| as a vision scientist/engineer, there is tremendous value
| in being able to quickly whip up a concept in Python and
| then being able to easily visualize the results. Doing
| this exploration in C++ is wasteful. Implementation takes
| much longer, the correctness brought by static typing is
| dubious since the code isn't in prod, and the canned
| CV/visualization libraries are fewer and frequently suck
| in at least some way. That said, there is also tremendous
| value in understanding how to map your Python prototype
| into production code, too. Someone strong in this field
| can do both.
| proverbialbunny wrote:
| This was addressed in the previous comment
|
| >On the quant side, *unlike on the data science side*,
|
| Vision scientist is on the data science side. You're not
| dealing with monetary values where floating point error
| compounds on itself to the point your models become
| garbage. Quant work is its own unique field with its own
| unique prerequisites.
| oivey wrote:
| Nothing precludes you from doing integer arithmetic in a
| dynamic language.
|
| I'm not a quant and this isn't my area of expertise, but,
| for example, I'm pretty sure various differential
| equation solving methods depend on variables taking on
| continuous values, so floating point basically must be
| used. Understanding the impact of that is definitely very
| important. Analogously, I frequently run into numerical
| precision issues in image processing. Understanding how
| numbers are represented on a computer isn't unique to
| being a quant. Understanding how the choice of
| representation can impact prod is also not unique to
| being a quant. The dynamicness of the language isn't
| particularly relevant, either.
| proverbialbunny wrote:
| >Nothing precludes you from doing integer arithmetic in a
| dynamic language.
|
| You would be surprised. The second you use pandas with a
| custom data type (let alone any other library you'd want
| to use) it can randomly auto convert it to a float.
| Furthermore identifying when it randomly converts the
| type on you is a pain.
|
| >so floating point basically must be used.
|
| Quants tend to use fixed precision types. It is like a
| float in every way, except base 10 instead of base 2 so
| there is no floating point error.
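|
| To make the base-10 versus base-2 point concrete, here is a
| small sketch using Python's standard decimal module. Note the
| assumptions: decimal.Decimal is decimal floating point rather
| than a true fixed-point type, and it stands in here for
| whatever in-house type a firm might use. It removes the
| representation error for decimal literals, but results that
| have no finite decimal expansion are still rounded.

```python
from decimal import Decimal, ROUND_HALF_EVEN

binary_sum = 0.1 + 0.2                          # binary floating point
decimal_sum = Decimal("0.1") + Decimal("0.2")   # base-10 arithmetic

print(binary_sum == 0.3)                # False
print(decimal_sum == Decimal("0.3"))    # True

# Base 10 is still finite precision: 1/3 cannot be stored exactly.
one_third = Decimal(1) / Decimal(3)
print(one_third * 3 == Decimal(1))      # False

# Monetary code therefore rounds explicitly, at a chosen point,
# with a chosen rule (the rate below is a made-up example value).
fee = (Decimal("19.99") * Decimal("0.0725")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_EVEN
)
print(fee)                              # 1.45
```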
| hntrader wrote:
| Quants don't care about floating point precision in
| research. It's just applied stats
| oivey wrote:
| I think this insight exists across a lot of fields. Basically,
| if you want to be a really excellent magician you also better
| be a decent blacksmith. More concretely in this case, if you're
| unable to do the data "engineering" yourself then it will close
| a lot of doors for interesting and novel work on the "science"
| side. Beyond that, if the scientist's job just involves gluing
| sklearn models together I think that job is more on the
| engineering side of things than the supposed scientist usually
| wants to admit.
| inthewoods wrote:
| That's interesting - I just completed a book on Jim
| Simons/Renaissance (The Man Who Solved the Market). One of their
| early advantages was having a person who was just focused on
| acquiring and cleaning data. I expect that advantage has
| largely gone away at this point due to wide availability of
| market data but I thought it was interesting in the context of
| this article.
| curiousgal wrote:
| Same for CFM too, they have an entire team working on
| alternative data and they feed it to a modeling team.
| 1vuio0pswjnm7 wrote:
| What must be communicated to management: It is easy to find
| other magicians. It is not easy to find another blacksmith.
| Without the right blacksmith, there can be no magic.
|
| Magicians will be magicians, always hustling (bullshitting),
| but they will never have the value and job security of the
| blacksmith. The blacksmith can see the fruits of her own
| labour, whilst the magician must lie to herself and others in
| order to claim the blacksmith's value as her own.
|
| If the blacksmith is good enough, she will earn the trust of
| management and management may consult the blacksmith in the
| selection of magicians. Management may ask the blacksmith to
| interview the magician and seek her advice on the final hiring
| decision.
|
| The blacksmith may not carry the "prestige" of the hustling,
| bullshitting magician but she can command a high salary and
| dictate her own working conditions. This is only if management
| understands her value. What the magician thinks of the
| blacksmith is irrelevant.
|
| Reliable blacksmiths are hard to find. Magicians are a dime-a-
| dozen.
| legerdemain wrote:
| > It is easy to find other magicians. It is not easy to find
| another blacksmith. Without the right blacksmith, there can
| be no magic.
|
| What? That runs counter to my experience at every company
| where I've either seen data engineers or worked as one. My
| observation of how management treats the two groups is this:
|
| Data engineers ("blacksmiths"): Blacksmiths are paid less.
| People think of them as less highly educated. Their work is
| less creative. When they are successful, their work is mostly
| invisible. They are interchangeable. People think of what
| blacksmiths do as more like scripting than writing code.
| Blacksmiths mostly work on configuring systems they didn't
| build. Blacksmiths do more troubleshooting than building.
| Their roles are focused on support.
|
| Data scientists ("magicians"): Magicians are paid more.
| People think of them as more highly educated. By definition,
| what they do is magic. They work on prominent projects. Their
| successes are highly visible. They build large systems that
| only they comprehend. They use support staff to clear away
| mundane obstacles so they can focus on unique, highly creative
| aspects of work.
|
| Saying that we need more data engineers than data scientists
| is like saying that we need more janitors than CEOs. That's
| true, but it's true because _we made it true_ by structuring
| projects around one prominent, well-paid person supported by
| a staff of invisible drudges.
| notretarded wrote:
| Please don't change the status quo. I love my cushy job.
| lumost wrote:
| This problem only grows as the company scales and the science
| and engineering pieces are formally split along some role
| guideline.
|
| Inevitably if you treat a job role as a support role, you'll
| attract weaker individuals into that role than you would get if
| it wasn't considered a support role. The problem with Science
| oriented teams is that all roles other than the science role
| morph into science support roles over time. The same pattern
| used to occur with Engineers and QA, or Engineers and ops.
| chadash wrote:
| I worked in investment banking (as an analyst, not an
| engineer), so very different part of finance, but this was my
| take as well. Companies might love to talk about how important
| engineers are, but at the end of the day, if you can't directly
| link someone to revenue, they get viewed as a cost center and
| take on second tier status in the organization. Then the same
| companies complain that they can't find enough (or retain)
| engineering talent. Not many places get the balance right.
| Silicon valley treats engineers well because for the most part,
| the value they bring is more obvious (and also, they don't
| threaten the existing hierarchy in the company). Curious to
| hear if anyone has had the opposite experience.
| isolli wrote:
| Yes, quite a few developers left our investment bank and went
| to work for our suppliers (of trading software), stating
| they'd rather work somewhere where they're seen as value
| creators rather than a cost center.
| mushbino wrote:
| Engineers get paid well in SV because they are in demand,
| have lots of employment opportunities, and therefore are more
| difficult to retain.
| mywittyname wrote:
| And because their contributions can be tied back to
| revenue. You need both, demand for talent, as well as the
| ability & justification to pay for it.
|
| Engineers are in high demand all over the world. But most
| companies do not profit enough from technology to justify
| paying similar, SV-level salaries.
| pmiller2 wrote:
| Not always. Frequently, the connection between code
| that's written today and revenue tomorrow is tenuous and
| difficult to package in a way that says "look at me! I'm
| valuable!"
|
| And, then there are those somewhat rare occasions where a
| project is not intended to increase revenue, and may even
| decrease it. At my last employer, we guesstimated that a
| project I worked on for months could possibly have ended
| up costing us $2M per year in revenue. That was both
| accepted and expected, because we were doing it to gain
| goodwill with users, but in such a way that it might end
| up pissing off a small minority of our customers.
|
| I really wish, just once, I could work on a project and
| put underneath it on my resume "Increased revenue by X%,"
| because I've never worked on anything that was so easy to
| directly trace back to the top line.
|
| Cost savings are another story, because engineers _can_
| fairly easily quantify how much less money is being spent
| by doing $THING a bit more efficiently....
| AtomicOrbital wrote:
| I worked for 15 years as a software engineer at Morgan
| Stanley where they valued the process of taking a 3 martini
| lunch idea into a production platform so value of engineers
| was recognized and rewarded as such ... it's somewhat easier
| to whip up a new financial wrinkle; it's a whole other level of
| magic to design and implement that idea when it takes 60
| software developers 3 years to get that idea to market before
| the rest of the street ... of course the IT department was/is
| the largest budgeted portion at the entire bank and for a
| good reason
| marcinzm wrote:
| As I see it you need people who have shallow knowledge of many
| areas and deep knowledge of one area. That lets you have a
| group of experts but ones that know enough about other areas of
| expertise to work with those other experts.
| wpietri wrote:
| > Particularly if you have a strategist who thinks his job is
| just to dream up profitable models, he ends up carving that
| role out in a way that's detrimental to the rest of the team.
|
| My god, this. These people make me bonkers. Especially because
| I feel like I have a bit of this tendency myself, the desire
| just to think big thoughts and do no actual work. Happily, I
| long ago learned that ideas were approximately worthless
| without labor, and that I anyway had much better ideas when
| laboring because it forced me to engage with the details.
|
| And yes, those people can poison a team. My best working
| experiences have all been with people who a) all valued actual
| work and b) believed that everybody could have good ideas.
| jnwatson wrote:
| "I'm the idea guy" out of someone's mouth is the stark red-
| flag warning that their net contribution is 0.
| mumblemumble wrote:
| Or even negative. I've seen situations where the idea
| person is so busy being Mr. Toad that everyone around them
| is regularly scrambling to clean up messes and it ends up
| being a constant distraction from actually pushing projects
| through to completion.
| LeifCarrotson wrote:
| I really like your magician/blacksmith analogy.
|
| I'm in industrial automation, but it's much the same. Projects
| where someone developed a strategy but has never been involved
| in the details of a machine are doomed to failure (or at best
| to be unreliable and producing low quality parts). Projects
| built by machine fabricators are over-engineered, frequently
| late, and sometimes unprofitable, but damn if they don't work
| well.
|
| The main trouble, I think, is that when a shiny new contraption
| is brought to the king, it's too often the magicians doing the
| talking - whether they're speaking words of power or Common,
| their job is to talk. Meanwhile, the blacksmith is probably
| busy in his workshop at some ornate scroll work for the next
| thing, or repairing the previous gizmo, because he'd rather be
| hammering away at his anvil than talking.
|
| The higher you go in an org chart, the fewer the number of
| people who understand the work their company actually does, and
| the more voices you have between the workers and the decision-
| makers to take some of the credit for work as it passes up the
| chain.
| hef19898 wrote:
| That seems to be true in every field I can think of. The
| smaller the gap, or rather the more practical experience the
| strategy people have, the better a given org seems to be.
|
| One common issue I run into is that when the blacksmiths
| start talking, nobody listens.
| zzzeek wrote:
| maybe hedge funds would be able to find more people if they
| didn't only hire "guys".
| noodlenotes wrote:
| Also, a lot of data scientists find the science fun and the
| engineering boring. But they have overlapping skill sets - if
| you aren't good at one, you're probably not good at the other
| either. Somebody who shows up to a team with the goal of only
| modeling and pushing all the dirty engineering work to their
| teammates is basically a worst case scenario because
|
| 1) They probably aren't going to produce good models since
| they're not sensitive to data nuances, but now they've taken
| over ALL the modeling work.
|
| 2) They bring down the job satisfaction of everyone else on the
| team who would like to be doing at least some modeling.
|
| 3) They're sucking up the prestige that should be distributed
| over the entire team and management thinks they should be paid
| more for work that it turns out everybody thinks is more fun
| anyway.
|
| My number one advice to entry level data scientists is to not
| be this guy. Don't give your interviewers the impression that
| you won't do your own engineering work because they won't want
| someone who brings negative value to the team.
| NikolaNovak wrote:
| Here's the tricky thing:
|
| I love your post; I agree with your post; but it takes a 90
| degree turn at the end:
|
| "My number one advice to entry level data scientists is to
| not be this guy. "
|
| Everything most people are saying here indicates it's GREAT
| to be that guy. You're paid, you're respected, you get the
| fun parts, you love your job and it's pretty safe. It just
| happens to suck for everybody else including team and
| business... but it feels that in a practical sense, gist of
| everybody's actual unwitting message is "BE that guy, if you
| can" :-<<<
| proverbialbunny wrote:
| It sucks being that guy because everyone else ends up
| hating you.
|
| Depending on the work environment it's not a stretch to see
| software engineers complaining to management, sometimes
| going as far as to create rumors to get the jr data scientist
| fired.
|
| So, no the grass is not greener. It's best to not be that
| person. This is why I go out of my way to prevent that
| scenario when I lead a team.
| urthor wrote:
| Not really.
|
| You just get seen as the product owner/project manager.
| proverbialbunny wrote:
| That's a really good point.
|
| I tend to be seen as a product lead / owner /
| stakeholder, so I feel like I'm being called out. lol
|
| I think one difference is the software engineers see me
| as someone who is helping them by making their life
| easier. I'm not just throwing work at them blindly. I'm
| working with them. Also, they like it when I include them
| in the data science brainstorming sessions to solve
| difficult problems. I guess it's seen as exotic or
| something, but whatever the reason, they really love to
| be a part of it.
| rhizome wrote:
| I think it's probably seen more as just being a decent
| boss.
| RobRivera wrote:
| easy to ignore hate when you're pulling a 300k bonus at
| comp season and can jet to st. barts to go deep sea
| fishing and drink claws.
| proverbialbunny wrote:
| Data scientists do not pull that kind of bonus. Today
| many of them get paid less than the data engineers do.
| RobRivera wrote:
| news to me, and welcome news to hear at that since I'm
| more in the data plumbing and packaging business, not
| algo publications.
|
| my personal data points are from folks on buyside.
| trading margins have been downward trending for years
| proverbialbunny wrote:
| Quant research work isn't data science work which is
| probably where the mix up is.
|
| On the quant side bonuses are distributed to the team.
| ramraj07 wrote:
| If you're that guy and you have a secure job it means you
| write models no one ever sees in a company which doesn't
| know or respect data, or you work in some data science
| factory as a small cog of a fairly well oiled team. The
| latter does happen from time to time, but it's often the
| former.
|
| In every other place, your job is on the line to be erased
| because people will soon realize no one wants a wise-ass
| who doesn't actually contribute much to the bottom-line in
| the end.
| urthor wrote:
| The flipside is that there are 4x the job posts for data
| engineering as there are for "that guy".
|
| Companies understand that you can't hire five of that guy
| and get things done. If you have 5-8 years of experience as
| a technical product manager/data science combo then you are
| very happy as the magician. But very few magicians are
| being hired out of college, and a lot of "software
| engineers in data"
| mywittyname wrote:
| Pretty soon companies are going to start realizing that
| the 4x DEs can largely replace that 1 DS, and they will
| be more than happy to do so.
|
| I went into DE because I was kind of forced into the
| space, but I'd strongly prefer doing full-stack DE.
| Anymore, I still have the opportunity to build models,
| they just aren't client-facing stuff, but instead are
| kind of Data Plumber Bots that help me do my job better
| so I can waste more time building other fun bots that I
| can't otherwise be paid for.
|
| Seems like a waste of resources, but my manager could
| have another DS tomorrow, but my role would take months
| to fill.
| noodlenotes wrote:
| Specifically, this is my advice to ENTRY level data
| scientists who are trying to find a job and compete against
| a flood of candidates hot off the bootcamps. I guess once
| you get your foot in the door, you can be that guy if you
| want. It seems to be a successful strategy at companies
| without technical leadership.
| ramraj07 wrote:
| It's hard for most people entering this field because the
| incentives are perverted - there's this perception that DS is
| sexy and you actually don't need to know coding that much
| (just enough to scikit learn). Thus people with pipe dreams
| of tweaking model hyperparameters to spin gold come in and
| get a rude awakening. Not unlike people flocking to LA to
| become actors.
| proverbialbunny wrote:
| Back in the day (3 years ago and earlier) at every company I
| was at we used the term 'productionization' to describe
| someone making a model aka a proof of concept, and then
| someone else, a machine learning engineer or some kind of
| engineer rewriting it to work on a server.
|
| This process is horrible, and not just because it doubles the
| work, but because it introduces bugs. When the version up in
| the cloud does not work as intended, is it a bug in
| productionizing or is it in the original model? Fixing bugs
| in this space can take longer than the initial model
| development and the initial productionization. Many companies
| have failed over this.
|
| So what's the solution? In recent years the industry has
| turned to deployment over productionization. The idea is you
| deploy the model to the cloud directly. Both engineers and
| scientists work together on the process. The scientist
| defines what cells in the notebook get called for the final
| algorithm (as there are EDA / plotting cells and
| documentation cells too). The engineer sets up the amazon IO
| stuff, database login stuff, and monitoring services. The
| scientist works with them to create tests and what to monitor
| so they get notified if there is a problem with the service.
|
| No more mystery bugs. The model gets directly deployed, the
| work load is minimal, and it brings people together. The
| downside is often the engineers and scientists are on
| different teams, and sometimes companies will not let them
| merge for a while, so it becomes a telephone game instead of
| everyone feeling like they're on the same team working
| together. imo moving the scientist to the engineering team
| during this time can be helpful, or moving the engineer to
| the data team.
|
| Some companies have services where entire notebooks get put
| up into the cloud and all of it gets called, so the scientist
| has to write the notebook in a way that works for the cloud.
| It's rarer, but how I prefer it is a wrapper py file is
| created that calls just the relevant parts of the notebook,
| kind of like a header file. This process works well for me,
| but as far as I know it is not standardized in the
| industry yet.
|
| In short, if you end up in this situation, there is a better
| way. Import the notebook into a .py file or into the cloud,
| don't rewrite it. This (hopefully) will remove this scenario
| you're describing (comment this is replying to) so those
| issues will become a historical footnote.
| carabiner wrote:
| How do you maintain notebooks in production? You use
| papermill? What about versioning?
| proverbialbunny wrote:
| Most libraries load entire notebooks from top to bottom
| when executing, and I believe papermill does too. (Please
| correct me if I'm wrong, as I've not used papermill.)
|
| This is great for making a dashboard, a report, or some
| other kind of analytics, but when it comes to a service
| the customer uses, you typically never want to load the
| whole notebook. This is where the industry standard way
| of loading the whole notebook tends to fall on its face.
|
| What we do is the cells that will end up in prod are
| written as functions inside of the notebook. This helps
| reduce globals when writing the notebook, so it is good
| form when prototyping, but also it allows just those
| functions to be called from the notebook, instead of
| running the entire notebook.
|
| You will probably want to write your own library to do
| this, but in the mean time there is one that works for
| this purpose https://github.com/grst/nbimporter
| (Ironically the author doesn't recognize this use case.)
|
| Using nbimporter you can import a notebook without
| loading it. You can then call functions within that
| notebook and only those functions get loaded and called.
|
| In my notebooks I have a process function which is like
| main(), but for feature engineering. On the prod side
| the process function is called from the notebook. Process
| calls all of the necessary cells/functions for me in the
| correct order. This way the py wrapper only has to call
| one function, then the ML predict function gets called,
| so it's pretty small on the .py wrapper side. There are
| tests written on the .py side, IO functions and what not
| too.
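|
| As a rough illustration of "import the notebook's functions
| without running the notebook", here is a standard-library
| sketch. It is not nbimporter's implementation, and the name
| load_notebook_defs is made up for this example. It assumes the
| usual .ipynb JSON layout (a top-level "cells" list whose code
| cells carry a "source" field) and keeps only imports, function
| definitions and class definitions, so EDA and plotting cells
| never execute.

```python
import ast
import json
import types

# Node types that are safe to execute on import: they define names
# but do not run analysis code.
_KEEP = (
    ast.Import,
    ast.ImportFrom,
    ast.FunctionDef,
    ast.AsyncFunctionDef,
    ast.ClassDef,
)


def load_notebook_defs(path, module_name="notebook_defs"):
    """Return a module holding only the definitions found in a notebook."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)

    module = types.ModuleType(module_name)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source is often stored as a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells that are not plain Python (for example IPython
            # magics) are skipped entirely in this sketch.
            continue
        # Drop every top-level statement that is not a definition or import.
        tree.body = [node for node in tree.body if isinstance(node, _KEEP)]
        exec(compile(tree, path, "exec"), module.__dict__)
    return module
```

| A wrapper .py file would then do something like
| `nb = load_notebook_defs("model.ipynb")` followed by
| `nb.process(raw_input)`. One consequence of this design is that
| functions relying on notebook-level global variables will fail,
| which fits the advice above to keep prod cells free of globals.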
|
| Data engineers love their classes, so it's easy to write
| a class that calls the notebook, and best of all calling
| a single function this way does not load globals, so the
| data engineers are happy. It's a nice library, because
| otherwise you'd have to write your own (which you may end
| up wanting to do).
|
| This way if the model doesn't work as intended in
| production it's my fault. We log everything, so I can run
| the instance prod caught on my local machine, figure out
| what is going on, update the model, and then it can be
| deployed instantly.
|
| Version numbers on the engineering side I can't comment
| on as they have their own method, but on my end the
| second the model writes to a database then I strongly
| push for having a version number column or a version
| number metadata table in the database, so it's easy for
| me to access for future analysis.
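|
| The version-number column could look something like the sketch
| below. The table name, column names and version strings are all
| hypothetical, and SQLite is used only because it ships with
| Python; the same idea applies to whatever database the model
| actually writes to.

```python
import sqlite3

MODEL_VERSION = "1.4.0"  # hypothetical label for the deployed model


def record_prediction(conn, entity_id, score, model_version=MODEL_VERSION):
    """Store one prediction together with the model version that made it."""
    conn.execute(
        "INSERT INTO predictions (entity_id, score, model_version) "
        "VALUES (?, ?, ?)",
        (entity_id, score, model_version),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictions (
        id            INTEGER PRIMARY KEY,
        entity_id     TEXT NOT NULL,
        score         REAL NOT NULL,
        model_version TEXT NOT NULL,
        created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)

record_prediction(conn, "order-1", 0.82)
record_prediction(conn, "order-2", 0.40)
record_prediction(conn, "order-1", 0.79, model_version="1.5.0")

# Later analysis can split results by the model that produced them.
rows_per_version = dict(
    conn.execute(
        "SELECT model_version, COUNT(*) FROM predictions GROUP BY model_version"
    )
)
print(rows_per_version)
```

| Making the column NOT NULL means a writer that forgets to tag
| its output fails loudly instead of leaving unattributable rows.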
| [deleted]
| z3ncyberpunk wrote:
| Then you end up with terrible engineers with zero practical
| experience who make terrible designs due to said lack of
| practical experience. Engineers out of college today have
| zero clue how to run machines or do anything but create
| drawings that half work which is pathetic
| [deleted]
| yters wrote:
| While we like to pooh pooh at the theoreticians, there are some
| remarkable results proven just through thought experiments and
| math, which are only confirmed and used many, many years later.
| People would not even think to look into such things if
| theoreticians did not come up with the original abstract proof.
| afryer wrote:
| I recommend the book: Agile Data Science by Russell Jurney[1].
| The tech stack is circa 2017, but the chapters on the Agile Data
| Science Process and Teams are timeless.
|
| He clearly articulates team roles: from Biz Devs, marketers, PMs,
| UX designers, UI designers, Web Developers, API Engineers, Data
| scientists, Applied researchers, Platform/Data engineers, QA
| engineers, DevOps Engineers.
|
| Then he talks about different ways to increase agility by
| combining these roles into generalists empowered to iteratively
| explore the "pyramid of data value" until the right product-
| market fit is found.
|
| Building Data-science Intensive Web Applications is inherently
| waterfall, not agile, and I find this book to be a fascinating
| reference.
|
| [1] https://www.oreilly.com/library/view/agile-data-
| science/9781...
| Wonnk13 wrote:
| What's the career trajectory for data engineers?
|
| I enjoy the pipeline building and business stakeholder
| interfacing, but I'm not sure I want to be a SWE a decade from
| now...
| claytonjy wrote:
| How is that different from a SWE? I see DE as a specialized
| SWE; tons of overlap, but DEs focus on different tools and
| concepts than other SWEs.
| amelius wrote:
| In other words: we need plumbers.
| gregw2 wrote:
| As a data engineer I've made the same joke.
|
| But the statement is also a bit like saying you can use
| plumbers to design and build a chemical refinement plant which
| also just moves chemicals from point A to point B. Or you can
| design a citywide sewer system with a bunch of plumbers.
|
| There are many cleansing, refining, orchestration, dependency,
| data quality, governance and optimization problems to be solved
| and a wide variety of tools that have for whatever reason never
| grown into higher-level open source frameworks and are thus
| reimplemented in various forms in many places.
|
| Data engineering (somewhat like software engineering actually)
| doesn't require much if any of the math and physics I took in
| engineering courses in college, but it does require rigorous
| systems thinking about how to design and build structures that
| withstand adverse conditions, a thought pattern common to
| other engineering practices, so I don't think it's a totally
| crazy title for the role.
| digitalsushi wrote:
| Or, if the data is food, we have chefs making incredible
| plates, but we need wait staff to get it to the people who want
| to eat it.
|
| I would love to be on that wait staff; as an infradev I feel
| like the process is very close to me but I am struggling to
| break into it.
| spacemanmatt wrote:
| I would put data engineering on the supply side of the chef:
| This would be ingredients, delivery scheduling, and pre-prep
| functions. That sort of thing.
___________________________________________________________________
(page generated 2021-01-14 23:00 UTC)