[HN Gopher] We don't need data scientists, we need data engineers
       ___________________________________________________________________
        
       We don't need data scientists, we need data engineers
        
       Author : winkywooster
       Score  : 487 points
       Date   : 2021-01-14 13:09 UTC (9 hours ago)
        
 (HTM) web link (www.mihaileric.com)
        
       | tfehring wrote:
       | One annoying thing about being a generalist is that domain
       | experts in any given area that you need familiarity with can't
       | help but complain about how little you know about that domain,
       | ignoring the fact that your job requires equally deep knowledge
       | of several other domains simultaneously.
       | 
       | In the case of data scientists, I think the business folks that
       | want them to understand the business domain better generally have
       | the strongest argument, followed by the statisticians - good data
       | scientists need to personally understand both of those things
       | well, while the engineering and ops stuff that data scientists
       | are also expected to do is easier to compartmentalize on other
       | teams. So I agree that we should have more data engineers, but
       | apparently for the opposite reason as most people in this thread.
        
       | patothon wrote:
       | why not both?
        
       | b0rsuk wrote:
       | Oh, but maybe we don't need either?
       | 
       | It seems that data science is used primarily for advertising.
       | Local internet communities are dead. Message boards are dying.
       | Everything is either a reddit / discord / Steam forums /
       | Boardgamegeek. In 2021, GOG forums pass for a small forum. Only
       | the biggest can float in the ocean of spam.
        
       | mywittyname wrote:
       | I always felt that tech-focused data scientists should also be
       | required to know how to process data end-to-end; at minimum, from a
       | SQL database to deployed model, but knowing how to collect &
       | clean data is important too. It seems like the industry is trying
       | to fill the gap that was created by a glut of people without math/cs
       | backgrounds going into 5-week data science courses who then need
       | hand-holding when they get real jobs.
       | 
       | Data science & engineering should be treated as a single
       | collection of skill-sets. Lacking ETL experience is a major
       | deficit, considering how prevalent that kind of work is.
       | 
       | This might just be my personal biases coming through. I consider
       | myself a "full-stack" data scientist & engineer. But because data
       | scientists who can work on the backends are rare, I always end up
       | doing the plumbing while other people do the fun analysis work.
       | 
       | I think companies that are data "science" heavy are going to be
       | at a huge disadvantage soon. Tools like Rekognition and Google AI
       | APIs are making the model training & deployment aspect almost
       | trivial. At some point, the only real work involved in this space
       | will be the data "engineering."
        
         | superbcarrot wrote:
         | > Data science & engineering should be treated as a single
         | collection of skill-sets.
         | 
         | This can be tough because there could be _a lot_ in that skill
         | set. You can't realistically expect someone to have solid
         | knowledge of statistics including specialising in the sub-field
         | and type of algorithms that your product needs, and also be
         | able to write good code and act as a developer, and also have
         | solid knowledge of all the tools for data
         | streaming/processing/ETL. There is a point at which you're just
         | stretching yourself too thin if you try to do all of these at
         | once.
         | 
         | Of course, stuff like knowing how to interact with a database
         | or employing good software development practices should be a
         | very basic prerequisite and some scientists certainly shift
         | things too far in the other direction and use their academic
         | knowledge as an excuse to write poor code and not learn new
         | tools.
         | 
         | I guess what I'm trying to say is that they are distinct skills
         | but you still need all of them to some extent and striking the
         | correct balance in one's skillset is really difficult.
        
           | mywittyname wrote:
           | These are all skills taught in standard computer science
           | programs. Granted, some are electives, like high-level stats.
           | But even back in 2010, data science electives were available
           | to fill the gaps. I took three DS&E classes in college with
           | projects that were end-to-end platforms, where you'd have to
           | collect, clean, and analyze the data, then build, test, and
           | deploy models from it.
           | 
           | I would certainly hope that college courses are even more
           | comprehensive after 10 years and an explosion in interest for
           | the field.
           | 
           | Also, much like a full stack developer, a full stack data
           | engineer doesn't need to know everything at a master level.
           | They just need to be able to handle tasks at most points in
           | the chain.
        
       | mongol wrote:
       | Why is data scientist a profession in IT but not for example
       | computer scientist? Many IT professionals studied computer
       | science but they don't call themselves scientists in their line
       | of work.
        
         | eeZah7Ux wrote:
         | Don't give developers ideas. People with a javascript bootcamp
         | and 2 years experience are already called "senior engineer".
        
         | globular-toast wrote:
         | "Computer science" is a well-known misnomer as it is not about
         | computers nor is it science. It's generally reserved for the
         | academic study of computing. Data science is much closer to
         | science.
        
         | zzbzq wrote:
         | Basically this manager guy had like a dozen job titles under
         | him, such as Business Analyst, Data Analyst, ML Engineer, and
         | sundry more. Then his HR team came to him and told him to just
         | boil it down to 1. So he made up this title "Data Scientist"
         | because it sounded badass, and then everyone from both the
         | business analytics side and the software engineering side
         | decided this sounds awesome and they want it to be the next
         | step of their career.
         | 
         | Then the federal government came up with this title something
         | like "Secretary of Data Scientist" and the guy who made up the
         | term worked at that.
        
       | noahmoss wrote:
       | What does CV stand for in this context?
        
         | macksd wrote:
         | Computer vision (i.e. deep learning on images / video)
        
       | jimsparkman wrote:
       | Keeping in mind DE can mean different things at different
       | companies, I spend a lot of time working on infrastructural
       | components to just get at data reliably. Working in a product
       | company with disparate generators of data, I'm often building out
       | network connectivity (VPC peering, VPNs, etc.), subnets, ACLs,
       | firewalls and load balancers across our visualization tools,
       | managing job flows, controlling AWS costs, building read replicas
       | for production databases, yadda yadda. There might be a ton of
       | hoops to jump through before I can even start to process data and
       | it's the type of work that wouldn't make sense to hand off to my
       | DS counterparts.
        
       | jh88 wrote:
       | Just like with data science, don't we expect data engineering to
       | get to a point where they codify best practices into a tool, so
       | 1x data engineers can be closer to being as effective as 10x data
       | engineers? The amount of data engineering work won't change, but
       | the number of people needed to do it will reduce, reducing the
       | open headcounts.
        
       | giantg2 wrote:
       | I don't find it too surprising. I think more people would want to
       | do the conceptual work in the data science space, like
       | models. The ETL stuff seems extremely boring and it seems it pays
       | less too.
        
       | analog31 wrote:
       | Disclosure: I'm a just plain scientist, without the data. ;-)
       | 
       | Since "data science" appeared on my radar (in other words, on
       | HN), I've noticed that we fling the "scientist" and "engineer"
       | terms around without asking whether the practitioners have
       | science or engineering backgrounds, or something else such as
       | statistics, math, programming, etc.
       | 
       | It strikes me that "do we need scientists or engineers" is not
       | unique to data science/engineering. I think we need both, but of
       | course it's an open question as to how many of each are needed.
       | Also, both "scientist" and "engineer" are loosely defined in
       | practice, with some overlap.
       | 
       | Painting with a broad brush, a scientist wants to learn how
       | things work. An engineer wants to make things that work. If
       | you're in the business of making things that work in order to
       | sell them, then you need lots of engineers, but maybe a few
       | scientists. It's about 10:1 at my workplace.
       | 
       | Overlap? Of course. Engineers use scientific knowledge and
       | methodology. Scientists have to make things work in order to make
       | experiments and theoretical computations work. Also, scientists
       | often show up in emerging fields before they become recognizable
       | engineering disciplines, for many reasons: 1) We need the newest
       | stuff, right away. 2) We are opportunists by nature and
       | necessity. For instance many of the older "software engineers" at
       | my workplace have science degrees, but the younger ones all have
       | computer science degrees. The oldest people I know who did
       | programming, digital logic, and embedded systems, did not have
       | degrees in those areas. The youngest ones all do. At my college,
       | the physics professors had personal computers, and were ripping
       | them apart, while the CS professors used the mainframe. That's
       | part of why I chose to major in physics, though I was interested
       | in programming.
       | 
       | Another reason for overlap is that both "scientist" and
       | "engineer" titles include things that are not strictly within
       | either area. A lot of people with science degrees end up working
       | as technicians, marketeers, salesmen, managers, etc. A lot of
       | people with engineering degrees do little or no quantitative
       | engineering, but work as programmers, designers, marketeers,
       | salesmen, managers, etc.
       | 
       | Something to ponder is if there are differences between how
       | scientists and engineers think and approach problems. Naturally
       | there's a lot of folklore about that, probably little hard
       | evidence. That's where I think you should start if you want to
       | know whether to hire scientists, engineers, or both.
        
       | [deleted]
        
       | ska wrote:
       | In many cases, what is needed isn't even more data engineers;
       | it's data janitors.
        
       | basseq wrote:
       | The problem that I've seen is often "data scientists" are
       | expected to be the equivalent of full-stack engineers (or maybe
       | more accurately: one-man CTO shops)--to understand data
       | architecture, understand _business_ architecture, ensure data
       | quality, build data into product, build dashboards, derive
       | insights, posit hypotheses, set strategy, and drive business
       | value.
       | 
       | Thus many "data scientists" are juiced-up report-builders who
       | can't analyze their way out of a paper bag.
        
         | nerdponx wrote:
         | This is uncharitable.
         | 
         | In my experience, this is true:
         | 
         | > "data scientists" are expected to be the equivalent of full-
         | stack engineers (or maybe more accurately: one-man CTO
         | shops)--to understand data architecture, understand business
         | architecture, ensure data quality, build data into product,
         | build dashboards, derive insights, posit hypotheses, set
         | strategy, and drive business value.
         | 
         | But this is not:
         | 
         | > Thus many "data scientists" are juiced-up report-builders who
         | can't analyze their way out of a paper bag.
         | 
         | Rather, the data scientists are trained in only two of the
         | requirements you mentioned: derive insights, posit hypotheses.
         | The rest is all self-study and on-the-job experience. This
         | means that we are putting unrealistic expectations on data
         | scientists and/or their training is insufficient, _not_ that
         | data scientists are somehow morons.
        
           | basseq wrote:
           | Now it's my turn to claim uncharitability.
           | 
           | Indeed, my context here is that people who wear the data
           | scientist title come from multiple backgrounds, and are often
           | asked to wear too many hats. They are _not_ morons--they may
           | be _darn good_ report-builders, but haven't been trained in
           | insights, for instance.
           | 
           | If you're reacting to my word choice in that last sentence,
           | know that I _am_ frustrated with people who claim to be data
           | scientists but can't derive insight. (And we can argue about
           | "many".) But that's not a broad denouncement against all data
           | scientists, either.
        
         | kevin_thibedeau wrote:
         | It's the new CIS.
        
       | spacemanmatt wrote:
       | Data engineer, here. Or at least, that has been my title a couple
       | of times.
       | 
       | Some data is inherently trash but a huge part of the data quality
       | problem is sources who are allowed to produce trash that everyone
       | else has to clean up, when it would be way more efficient for
       | them to quit producing trash.
       | 
       | Not to pick on any one institution, but SOAP seems to be a red
       | flag that the service will also deliver some screwy data.
        
         | jaegerpicker wrote:
         | Every time I see SOAP in a ticket/task I die a little inside.
         | Without fail every SOAP related project I've dealt with in the
         | last 10 years has been a shit show. I've actually and
         | unfortunately gotten pretty good at dealing with it but I so
         | hate it.
        
       | a_zaydak wrote:
       | My view is from a small startup with little to no room for single
       | purpose employees.
       | 
       | When I first started hiring and working with data scientists, my
       | view was this: If you can only manipulate data and run it through
       | pipelines to generate models then you can't do enough to be
       | highly valuable. You either need to have a strong enough
       | background in CS to build the pipelines / tools or a strong
       | enough mathematics background to be able to propose cutting edge
       | new ideas. From my experience it is hard to find someone who has
       | one of these skills just from a University "data science" program.
       | At a small company (at least ones that I have worked with) being
       | only proficient in R and basic Python isn't enough. That being
       | said, I have met a handful of Data Scientists who were very
       | smart and self motivated enough to pick up on the lacking skills
       | when given the chance.
       | 
       | My question to HN is this: are there roles at these larger
       | companies for a Data Scientist who primarily just crunches
       | data in R and Python without the ability to actually build the
       | pipelines / tools or conduct research?
        
         | proverbialbunny wrote:
         | I would be cautious about that. I've worked in the startup
         | space for over 10 years now as a data scientist, often the
         | first one hired on, working on the pipes.
         | 
         | From my experience, there are two types of data scientists who
         | do infrastructure work: 1) Those who do not make the
         | best data scientists because their skill set is too far in
         | engineering land, leaving them weak where it counts. If the
         | startup is relying on the data scientist to be profitable, I'd
         | be cautious with these types. or 2) Someone who is senior,
         | beyond senior really, who has worked both jobs, and doesn't
         | mind doing both jobs. This unicorn is so rare it is mythical.
         | The joke when the terminology was created is they're so rare no
         | one has ever seen one, hence unicorn.
         | 
         | Me, I can not do the work I need to do if I'm on call. That is
         | where I draw the line. That means hiring someone to monitor the
         | infrastructure. Furthermore, I'm an okay architect, but you
         | really do want to hire a specialist if you can help it for
         | that. Do I help them with the infrastructure? Absolutely, but
         | they're on call if a server is on fire. They have the admin
         | login credentials, not me.
         | 
         | I get wearing multiple hats, but keep in mind to be a data
         | scientist you're already wearing multiple hats. Being a data
         | scientist is like double majoring and getting a phd. At what
         | point are they stretched too thin? The consensus in the
         | industry is they're already stretched too thin and should be
         | broken up into different specialized roles.
         | 
         | >My question to HN is this: are there roles at these larger
         | companies for a Data Scientist who primarily just crunches
         | data in R and Python without the ability to actually build the
         | pipelines / tools or conduct research?
         | 
         | That is the standard role, even at startups. However, the
         | industry consensus these days is data scientists should have
         | more responsibility when it comes to deploying models than
         | previous standards.[1] So data scientists are being pushed in a
         | more engineering direction, not with hosting sql servers and
         | infrastructure, but with working with engineers to make sure
         | the models are monitored properly. This change comes from model
         | deployment being further automated as time goes on, making it
         | easier for the data scientist to have more responsibility
         | during this stage.
         | 
         | [1] source:
         | https://www.dominodatalab.com/static/gfx/uploads/domino-mana...
         | page 9. Suboptimal organization and incentive structures.
        
           | a_zaydak wrote:
           | Thanks for the feedback! Seems like you and I both have had a
           | bit of experience being first engineering hires at startups
           | but have had very different experiences when it comes to
           | the role of a data scientist. I appreciate that.
        
             | proverbialbunny wrote:
             | Np. There is a common trend in the industry where a company
             | hires on a data scientist, doesn't know the data
             | prerequisites (specifically labeled data), the data
             | scientist struggles, after a while the company fires the
             | data scientist. This leaves the company with a bad taste in
             | their mouth. In recent years I tend to get hired on as a
             | specialist to help fix this. (And yes, I've been the first
             | engineer hired on too.)
             | 
             | What's interesting is they tend to struggle in two
             | different ways: 1) The data scientist that is gung ho about
             | infrastructure work, jumps in, and then ends up doing a bad
             | job, because it's not their strength. They end up getting
             | let go for not being ideal at that work. 2) The data
             | scientist who struggles with the idea of infrastructure
             | work at all, jumps into other roles they're good at like
             | data analyst work, helps the company in that way, but
             | ultimately because they did not push to get an
             | infrastructure engineer hired, they end up let go as well.
             | 
             | Me, I go out of my way to get an infrastructure engineer /
             | data engineer hired early on. Also, I have worked as an
             | engineer, so I tend to do a lot of the "hard" stuff most
             | software engineers struggle with early on, if applicable.
             | Eg, at one job I wrote a compression format to reduce
             | battery drain on our devices that were collecting data.
             | 
             | Most data scientists struggle when it comes to
             | CS/engineering skills (4/5th of them), so it's not uncommon
             | for them early on, while the pipes are being built, to do
             | data analyst and BI work. BI work to automate reports,
             | which management loves, and DA work to show some amazing
             | future service the company might be able to provide to its
             | customers. It's selling the sun and the moon really, but it
             | gets management inspired, and helps them know what data to
             | collect. It's not unheard of to need a minimum of two years
             | of collected data before building a model that can be
             | deployed becomes feasible. This can be hard on the data
             | scientist, because there is a lot of down time before that.
             | Many get fired during this time even when they're doing a
             | good job. They have to wear multiple hats, but it's analyst
             | roles (like BI work). Technically a data scientist is a
             | kind of analyst, not engineer, so it makes sense that
             | wearing multiple hats for them tilts in the analyst
             | direction, not the engineering direction.
             | 
             | I've been writing code since I was 8 years old, so I'm one
             | of the unusual ones that tilts in the engineering
             | direction, but I think it is unreasonable to expect that
             | from the average data scientist. Let them do what they do
             | best, and hire someone else who can round everything out
             | and you'll be in a good place. Unicorns aside, you'll need
             | a minimum of two professionals for a data project to
             | succeed.
        
               | _RPL5_ wrote:
               | Thank you for your comments! They are very insightful. To
               | piggyback a bit:
               | 
               | Assuming you are a competent data "analyst" who wants to
               | become a data engineer, how would you go about it? Is "go
               | back to school and get a CS degree" the answer? I suppose
               | this question is very broad, but I am curious if a
               | practitioner like you has an opinion.
               | 
               | ---
               | 
               | To give some context:
               | 
               | I recently graduated with a STEM PhD, and am looking to move
               | into data science. Reading the comments, I feel like I
               | fall into the "pointless data scientist" cohort derided
               | in this thread. Eg: I am very comfortable doing typical
               | analytical work & occasionally training models inside a
               | notebook, but I am neither a cutting-edge theoretical
               | statistician nor a data engineer.
               | 
               | I've been trying to improve on the engineering side. For
               | example, I did a project recently where I set up a
               | rudimentary pipeline that continuously pings an API,
               | uploads the data to a cloud database, then serves up the
               | analysis via a Flask app. For me this was a big step up
               | from just doing notebooks on a csv file :)
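               | 
               | A minimal sketch of a pipeline of that shape, assuming a
               | stubbed API call and SQLite in place of the cloud database.
               | `fake_fetch`, the `readings` table, and its columns are all
               | hypothetical; the Flask layer is left out, and `summary`
               | stands in for what a route handler would return.

```python
import json
import sqlite3
from typing import Callable


def fake_fetch() -> str:
    """Hypothetical stand-in for the real API call; returns a JSON payload."""
    return json.dumps([
        {"id": 1, "value": 10.0},
        {"id": 2, "value": 14.0},
    ])


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (assumed) target table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "id INTEGER PRIMARY KEY, value REAL NOT NULL)"
    )


def ingest(conn: sqlite3.Connection, fetch: Callable[[], str]) -> int:
    """Poll the source once and load the rows; returns rows received."""
    rows = json.loads(fetch())
    # INSERT OR REPLACE keeps the load idempotent when a poll repeats rows.
    conn.executemany(
        "INSERT OR REPLACE INTO readings (id, value) VALUES (:id, :value)",
        rows,
    )
    conn.commit()
    return len(rows)


def summary(conn: sqlite3.Connection) -> dict:
    """The 'analysis' a web endpoint could serve as JSON."""
    count, mean = conn.execute(
        "SELECT COUNT(*), AVG(value) FROM readings"
    ).fetchone()
    return {"count": count, "mean": mean}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    ingest(conn, fake_fetch)
    ingest(conn, fake_fetch)  # second poll with the same rows adds nothing
    print(summary(conn))
```

               | Keying the table on the record id is one way to make repeated
               | polls safe; whether the real API exposes a stable id is an
               | assumption.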
               | 
               | But moving beyond the basics, I am not sure what to study
               | next. Hence my question. If you have any suggestions, I
               | would greatly appreciate it!
        
       | avs733 wrote:
       | I teach engineers for a living. I struggle to see how this is not
       | just a straw man argument based on colloquial usage of terms. It
       | is just inferences drawn based on job ads that are rarely written
       | by people doing the job and instead are effectively human-as-seo-
       | optimized so the best candidates can find the job they are hopefully
       | fit for and not be too confused to apply for it.
        
         | hn_throwaway_99 wrote:
         | It's not a straw man, I've seen it clear as day in several
         | companies. When it comes to data science, it's "garbage in,
         | garbage out". I've seen companies do lots of "data science"
         | with a bunch of data scientists skilled in python and jupyter
         | notebooks, only to discover a ton of work was useless because
         | the incoming event data was tagged incorrectly due to a bug.
         | 
         | The actual process of collecting, aggregating, cleaning and
         | verifying data is a hugely important skill, and not one I've
         | really seen typical data scientists possess.
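         | 
         | A minimal sketch of the kind of check that can catch mis-tagged
         | events before they reach analysis. The `ALLOWED_TAGS` set and the
         | `tag` field are hypothetical; a real tracking plan would define
         | its own schema.

```python
from collections import Counter

# Hypothetical tracking plan: the only event tags considered valid.
ALLOWED_TAGS = {"page_view", "click", "purchase"}


def validate_events(events):
    """Split events into valid ones and (event, reason) rejects."""
    valid, rejected = [], []
    for event in events:
        tag = event.get("tag")
        if tag is None:
            rejected.append((event, "missing tag"))
        elif tag not in ALLOWED_TAGS:
            rejected.append((event, f"unknown tag: {tag!r}"))
        else:
            valid.append(event)
    return valid, rejected


def tag_counts(events):
    """Count events per tag; a sudden shift here often signals a tagging bug."""
    return Counter(event["tag"] for event in events)


if __name__ == "__main__":
    incoming = [
        {"tag": "click", "user": 1},
        {"tag": "clik", "user": 2},  # typo introduced by a buggy client
        {"user": 3},                 # tag dropped entirely
    ]
    valid, rejected = validate_events(incoming)
    print(tag_counts(valid))
    for event, reason in rejected:
        print(reason, event)
```

         | Rejecting loudly at ingestion, rather than silently dropping
         | rows, is a design choice: it surfaces the bug to whoever
         | produces the data.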
        
           | avs733 wrote:
           | I have experienced the same thing...but I just don't think it
           | has anything to do with whether the positions are labeled
           | data scientist or data engineer.
           | 
           | And I would warn you from my experience teaching statistics
           | to undergraduate engineers...they are not going to be much
           | better. Regularly get 'hey we have this data what test can we
           | run?' 'what are you trying to show?' 'we don't care we just
           | need to run a statistical test' conversations.
        
           | bonniemuffin wrote:
           | I suspect this may actually be an issue of school vs real
           | world rather than scientist vs engineer.
           | 
           | Data in the classroom setting is pristine and beautiful; data
           | in the real world is messy and buggy. You have to get burned
           | by buggy data a few times (or maybe a bunch of times) in the
           | real world to learn to look for bad data smells -- I don't
           | think schools effectively teach this kind of intuition,
           | regardless of whether the students are training as data
           | engineers or data scientists.
           | 
           | If data scientists are spending more time in school getting
           | advanced degrees, they're not getting as much exposure to
           | buggy data, whereas data engineers with a BS and a few years
           | of industry experience would already have built up this
           | skill.
        
             | avs733 wrote:
             | >Data in the classroom setting is pristine and beautiful;
             | data in the real world is messy and buggy.
             | 
             | I got to take over our department's undergraduate
             | statistics course a few years back.
             | 
             | The first change I made was all homework, tests, and
             | projects use real data sets. I intentionally have them
             | collect bad data (they don't know it's bad beforehand).
             | First day of class we collect data using the board game
             | operation...I give basic instructions and then halfway
             | through ask everyone to stop and agree on how they are
             | entering data for the variable of 'success or failure' of
             | the surgery. Oops...
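             | 
             | A minimal sketch of what cleaning up that variable can look
             | like once everyone has entered it differently. The spellings
             | in `CANONICAL` are hypothetical examples, not the actual class
             | data.

```python
# Hypothetical spellings different people might use for the same outcome.
CANONICAL = {
    "success": "success", "s": "success", "1": "success", "yes": "success",
    "failure": "failure", "fail": "failure", "f": "failure",
    "0": "failure", "no": "failure",
}


def normalize_outcomes(raw):
    """Map free-form entries to 'success'/'failure'.

    Returns (cleaned, unknown): cleaned has None wherever an entry could
    not be mapped, and unknown lists those original entries for review.
    """
    cleaned, unknown = [], []
    for value in raw:
        key = str(value).strip().lower()
        if key in CANONICAL:
            cleaned.append(CANONICAL[key])
        else:
            cleaned.append(None)
            unknown.append(value)
    return cleaned, unknown


if __name__ == "__main__":
    entries = ["Success", 1, " fail ", "buzzed", "N"]
    cleaned, unknown = normalize_outcomes(entries)
    print(cleaned)
    print(unknown)
```

             | Entries that cannot be mapped are kept for a human to review
             | rather than guessed at, since deciding what they mean is a
             | judgment call about the data.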
             | 
             | In my experience teaching the course, the reason the
             | students (engineers) find statistical reasoning hard is:
             | 
             | * They have never been given anything 'broken', everything
             | is curated to avoid things not working. The result is they
             | think data has inherent meaning. A right answer.
             | 
             | * Their entire learning experience has been stripped of
             | context and the need to make decisions with information.
             | They can give me a p value but are terrified (not unable,
             | just unwilling) to interpret it or give it meaning.
             | 
             | * They have never encountered the concept of
             | variability...everything is presented as systems with exact
             | inputs and outputs.
             | 
             | When I work with postdocs, I sometimes (less frequently)
             | encounter many of the same challenges. Data is treated as
             | sacred and external and inherent. It's wild to me.
        
           | finnthehuman wrote:
           | >The actual process of collecting, aggregating, cleaning and
           | verifying data is a hugely important skill, and not one I've
           | really seen typical data scientists possess.
           | 
           | Then they are not scientists. They have the label "scientist"
           | but lack the rigor of actual science.
           | 
           | I don't see why changing the label to "engineer" would
           | suddenly make them have rigor.
        
             | [deleted]
        
             | avs733 wrote:
             | Right?!
             | 
             | This is sort of the meta failure of the argument. They are
             | arguing that people's data skillsets are wrong. To make
             | that argument they are analyzing based on the wrong
             | variable in a data set.
        
         | antipaul wrote:
         | The article is so true, my latest mantra at work is
         | "engineering is more important than data science".
         | 
         | Everyone is buzzing about the latter, and few even realize what
         | is the former.
        
           | avs733 wrote:
           | eh...I think this can be analogized to what we already see in
           | code...
           | 
           | You need architecture, you need backends, you need a front
           | end, you need product design...all with data.
           | 
           | Why are computer scientists computer scientists not
           | engineers? Why is computer science about the code side? Why
           | did computer engineering end up being more on the hardware
           | end of the spectrum?
           | 
           | Words, especially newly coined terms are pointers to meaning.
           | That meaning is socially mediated, it is not inherent.
           | 
           | You're saying this (and I think the author is too) because
           | there is a need for this group of people to look beyond
           | titles to skillsets, and the existing titles carry linguistic
           | baggage of the difference between science and engineering
           | that has existed for decades.
        
         | jariel wrote:
         | So I think that the delineation between the scientist working
         | with the content, and the Engineers who actually provide the
         | mechanics for it is very fair.
         | 
         | If there is a question mark here - it's really how much value
         | are we deriving from all of these data people?
         | 
         | Where is all the ML that's changing our lives? Search, Alexa
         | and TikTok, I can see it.
         | 
         | In the future obviously vision systems for autonomous cars
         | etc..
         | 
         | But I'm really wary about the heavily decreasing marginal
         | returns after that.
         | 
         | It will surely change the world, but I think in specific areas.
         | Most of the entire field seems like an optimization on
         | something rather than anything new.
         | 
         | Washing machines freed up an immense amount of labour and toil.
         | Alexa telling me the weather does not.
        
           | avs733 wrote:
           | Most engineering and science jobs aren't a binary as much as
           | they are a spectrum.
           | 
           | If the article is trying to make a point about skill
           | development and diversification, I'm totally on board.
           | Bifurcating the roles instead is going to be less effective.
           | 
           | To the value point...my sense has been we are seeing the
           | Webcommerce 1.0 bubble Machine Learning edition. Lots of uses
           | of it, not all of them have value. I am excited for where we
           | will be in 10 or 15 years, but I suspect the difference will
           | be huge. If you put me to a guess, I would say better data
           | handling practices and ethics will likely be the linchpins of
           | value creation vs. using tools for the sake of tools.
        
           | serjester wrote:
           | I used to work at a legacy automaker and you'd be shocked at
           | how much ML has changed certain areas of the business. It
           | used to take an entire department to sort warranty claims and
           | it's now mostly automated. Aluminum part defects are now
           | spotted automatically on the plant floor. Don't even get me
           | started with telematics data.
           | 
           | Most software isn't consumer facing but just because you
           | don't see it doesn't mean it's not changing things around
           | you. ML tends to be overhyped but your assessment is too
           | pessimistic.
        
           | bart_spoon wrote:
           | The vast majority of applications of machine learning that are
           | changing the world aren't happening on a consumer level. They're
           | happening in factories, warehouses, farms, logistics chains,
           | etc.
        
       | superbcarrot wrote:
       | The overall point that there is demand for data engineering
       | skills seems valid but in reality you don't have to pick between
       | data scientists and data engineers, it's not one or the other. I
       | don't know if the argument is set up this way just to get more
       | clicks or to encourage arguments but it would have been better to
       | just focus on the overall state of the market and the demand for
       | certain skills.
        
       | shrumm wrote:
       | this post is so timely. How do you guys handle hiring? I'm
       | looking to hire more data engineers (Singapore only for now),
       | especially senior ones and it's not easy. My gut is to just find
       | good engineers who have good first principles and let them learn
       | on the job. Feedback/experiences welcome!
        
         | yaster wrote:
         | Curious to hear this from the other side. I'm an SE manager
         | interested in pivoting to Data Engineering. I've built modest
         | pipelines in R / Postgres for an operations analyst job I did
         | before I became a programmer, but I'm not experienced with most
         | of the technologies I see listed on DE roles (e.g. Airflow) and
         | my statistics etc. are rusty.
        
       | Dumblydorr wrote:
       | I'm a scientist first, data person second. I'll use the label
       | data scientist if that's what the employer wants. I'll use data
       | analyst, scientist, researcher, or any other term, so long as
       | they pay me.
       | 
       | What matters is: are we contributing to the betterment of society
       | with data-driven decision making?
        
         | Dumblydorr wrote:
         | Another aspect: so many PhDs and postdocs in the sciences do
         | similar work but get paid 1/3 or less of the wages. We
         | shouldn't be surprised they snap up industry jobs in data
         | science, it's way more remuneration than academics, with less
         | stress IMO
        
       | gigatexal wrote:
       | Why then are data scientists paid a ton more? Seems they are in
       | more demand.
        
       | slt2021 wrote:
       | Full Stack Data Scientist (data janitor + data engineer + ML
       | engineer + ML Ops + Business Analyst) is the future
        
         | alexpetralia wrote:
         | These are incredibly disparate skill sets. Of course anyone
         | would want to hire someone like this, and far more would claim
         | to possess such a broad skill set, but in practice it is
         | extremely rare.
         | 
         | You'd need someone with excellent communication skills
         | (presentation, memo writing, teamwork), project management
         | skills (identifying & overcoming workflow bottlenecks),
         | professional skills (timely responses, political savvy),
         | technical skills (application programming, advanced databases,
         | advanced machine learning, Excel modeling) and finally some
         | business domain knowledge.
         | 
         | This is an uncommon intersection of skills.
        
           | marcinzm wrote:
           | I, interestingly enough, have that skill set (mostly) and
           | probably a broader set of technical skills than you're
           | imagining. I use it to hire a team of specialists under me
           | and interact with other specialized teams (ie: I speak their
           | language) since I lack depth in too many areas. I wouldn't
           | ever imagine hiring a clone of myself except in cases where I
           | can't build out a larger team for a long period of time.
        
           | nerdponx wrote:
           | > You'd need someone with excellent communication skills
           | (presentation, memo writing, teamwork), project management
           | skills (identifying & overcoming workflow bottlenecks),
           | professional skills (timely responses, political savvy),
           | technical skills (application programming, advanced
           | databases, advanced machine learning, Excel modeling) and
           | finally some business domain knowledge.
           | 
           | This is pretty much the _bare minimum_ requirement for any
           | data scientist job I've ever interviewed for or held.
        
             | alexpetralia wrote:
             | In my experience, software engineers do not make good
             | business analysts (and data/machine learning engineering is
             | a subset of software engineering). Most business analysts
             | cannot program.
             | 
             | However, it's likely that our experiences simply diverge
             | here.
        
               | nerdponx wrote:
               | I'm talking specifically about data science, not business
               | analyst or software engineer.
        
           | slt2021 wrote:
           | exactly, these are characteristics of a unicorn and I think
           | most of these skills are trivial to build up over time
           | through practice and self-learning and these skills can yield
           | great benefits both for employers and employees
        
             | valarauko wrote:
             | I think the point is that these skills are not trivial to
             | build up over time
        
             | SirSourdough wrote:
             | I guess in a vacuum each of those skills is easy to build
             | up through practice and self-learning (which, let's
             | remember, many people struggle with to begin with).
             | However, I think the fact that you refer to people
             | possessing all of them as "unicorns" should be telling as
             | far as how trivial it actually is to build all these skills
             | beyond a simply passable level.
        
             | bcrosby95 wrote:
             | Or you can recognize that they're the characteristics of a
             | unicorn and split the role into multiple positions.
        
           | bearjaws wrote:
           | I've seen this in leadership who want to move to "Devops",
           | it's the classic "if we find this one person who can do
           | everything we will have no problems!"
           | 
           | The reality is of course, nobody can be amazing at the full
           | lifecycle of an application. Some do better in infra, some
           | better in backend, front end, etc.
           | 
           | A successful leader must find what is needed for the
           | product/application pipeline and hire appropriate skill sets,
           | trying to find the one candidate to rule them all is giving
           | up on planning IMO.
        
           | beckingz wrote:
           | As the saying goes:
           | 
           | If you're looking for a data scientist with XYZABC skills,
           | that's not a data scientist, that's a data science team.
        
         | esyir wrote:
         | Ugh, all this gets you is being mediocre at all of this.
        
           | slt2021 wrote:
           | yes, if you need to roll-up everything on your own from
           | scratch.
           | 
           | NO, if you use right amount of automation and software
           | (usable data science workbench with MLOps built in, usable
           | and scalable ETL/ELT framework, usable AutoML, etc, etc.)
        
             | marcinzm wrote:
             | You still get someone mediocre at everything just they
             | cover up the gaps for a bit longer. Eventually things they
             | don't understand will interact in ways they don't
             | understand and cause production issues. It's okay to be a
             | generalist, one should however understand the blind spots a
             | generalist has.
        
         | nerdponx wrote:
         | Maybe my sense of terminology is warped, but I always thought
         | of
         | 
         |     DataEngineer = DataJanitor
         |                  ∪ MlEngineer
         |                  ∪ MlOps
         |                  ∪ BusinessAnalyst
         | 
         | Data Scientist is more like some combination of statistician,
         | "whatever ML is if it isn't statistics", lightweight
         | mathematician, data janitor (yes there is overlap), business
         | domain specialist, and code monkey.
        
       | civilized wrote:
       | Just my personal experience, but the people at my company titled
       | "Data Engineers" basically can only be trusted to (very slowly)
       | move data around, while the people titled "Data Scientists" have
       | to do all the cleaning work to make the data suitable for
       | analysis and modeling, in addition to doing that analysis and
       | modeling.
       | 
       | Is the point here that data scientists are doing too much of the
       | work that should be handled by data engineers? If so I agree, but
       | there are some org barriers. For one thing, our data engineers
       | are not accountable to any particular project. All we can do is
       | send them tickets to move data around, and it already takes them
       | weeks to move a table from one database to another. I can't
       | imagine the headaches if we gave them anything nontrivial to do.
        
       | C4stor wrote:
       | I can't recommend the Data Engineer career enough for junior
       | developers. It's how I started and what I pursued for 6 years
       | (and I would love doing it again), and I feel like it gave me
       | such an incredible foundation for future roles :
       | 
       | - Actually big data (so, not something you could grep...) will
       | trigger your code in every possible way. You quickly learn that
       | with trillions of inputs, the probability of hitting a bug is either
       | 0% or 100%. In turn, you quickly learn to write good tests.
       | 
       | - You will learn distributed processing at a macro level, which
       | in turn enlightens your thinking at a micro level. For example,
       | even though the orders of magnitude are different, hitting data
       | over network versus on disk is very much like hitting data on
       | disk versus in cache. Except that when the difference ends up
       | being in hours or days, you become much more sensitive to that, so
       | it's good training for your thoughts.
       | 
       | - Data engineering is full of product decisions. What's often
       | called data "cleaning" is in fact one of the important product
       | decisions made in a company, and a data engineer will be
       | consistently exposed to their company's product, which I think makes
       | for great personal development
       | 
       | - Data engineering is fascinating. In adtech for example, logs of
       | where ads are displayed are an unfiltered window on the rest of
       | humanity, for better or worse. But it definitely expands
       | your views on what the "average" person actually does on their
       | computer (spoiler : it's mainly watching porn...), and challenges
       | quite a bit what you might think is "normal"
       | 
       | - You'll be plumbing technologies from all over the web, which
       | might or might not be good news for you.
       | 
       | So yeah, data engineering is great ! It's not harder than other
       | specialties for developers, but imo, it's one of the fun ones !
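
The "0% or 100%" point above can be made concrete with a little arithmetic. The sketch below assumes each input record independently triggers a given bug with some small probability; the function name and the example probability are illustrative and not taken from the comment.

```python
import math


def prob_at_least_one_hit(p_per_record: float, n_records: int) -> float:
    """Chance that at least one of n independent records triggers a bug
    that any single record triggers with probability p_per_record."""
    if p_per_record <= 0.0:
        return 0.0
    if p_per_record >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny p values stay accurate.
    return -math.expm1(n_records * math.log1p(-p_per_record))


if __name__ == "__main__":
    # Assumed one-in-a-billion bug, at three different data volumes.
    for n in (1_000, 1_000_000, 1_000_000_000_000):
        print(n, prob_at_least_one_hit(1e-9, n))
```

Under these assumptions a one-in-a-billion edge case is nearly invisible at a thousand records and practically certain at a trillion, which is one way to read the remark about learning to write good tests.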
        
         | secondcoming wrote:
         | Indeed, adtech is a great place to work for anyone interested
         | in working with data. And yes, people working in adtech hate,
         | and block, ads too.
        
         | pricci wrote:
         | And where would you recommend someone to start a data
         | engineering path. Any book, learning source?
        
           | edmundsauto wrote:
           | Long time DE here. I recommend trying to build your own data
           | warehouse around something you're interested in. Don't worry
           | about the scaling - focus on the core engineering, taking
           | data from different places, combining it into a sensible data
           | model, update it automatically every day. Add in more
           | sources.
           | 
           | It's shockingly difficult, and something that only experience
           | can teach.
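
A toy version of the exercise described above, as a sketch only: two made-up sources with different shapes are loaded into one table using Python's built-in sqlite3 module. The table name, column names and sample rows are all hypothetical. The load is keyed so that rerunning the daily job does not double-count.

```python
import sqlite3

# Hypothetical rows as they might arrive from two unrelated sources.
SOURCE_A = [("2021-01-14", "widgets", 3), ("2021-01-14", "gadgets", 5)]
SOURCE_B = [{"day": "2021-01-14", "item": "widgets", "qty": 4}]


def load(conn, source_name, rows):
    """Idempotently load (day, product, units) rows for one source."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_units (
               day TEXT, product TEXT, source TEXT, units INTEGER,
               PRIMARY KEY (day, product, source))"""
    )
    # The primary key makes a rerun overwrite rows instead of duplicating them.
    conn.executemany(
        "INSERT OR REPLACE INTO daily_units VALUES (?, ?, ?, ?)",
        [(day, product, source_name, units) for day, product, units in rows],
    )
    conn.commit()


def totals(conn):
    """Combined view across sources: total units per day and product."""
    return conn.execute(
        "SELECT day, product, SUM(units) FROM daily_units "
        "GROUP BY day, product ORDER BY day, product"
    ).fetchall()


def run_daily(conn):
    """One 'daily' run: normalize each source to a common shape, then load."""
    load(conn, "a", SOURCE_A)
    load(conn, "b", [(r["day"], r["item"], r["qty"]) for r in SOURCE_B])
    return totals(conn)
```

Calling `run_daily(sqlite3.connect(":memory:"))` twice should give the same totals both times. The hard parts the comment alludes to (sources that change shape, late data, scheduling) are deliberately left out of this sketch.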
        
           | thecolorgreen wrote:
           | I have the same question and I believe the answer is in the
           | same vein as someone who asks about software engineering.
           | Books/courses are great for the concepts, but your goal
           | should be to build something ASAP since that's where actual
           | learning will come from.
        
           | khaledh wrote:
           | Work at a company that has a good data engineering
           | discipline. Shopify is hiring:
           | https://www.shopify.ca/careers/2021
        
           | Karrot_Kream wrote:
           | A lot of these are just "garden variety" (distributed)
           | systems problems. Dealing with systems with differing latency
           | distributions, recovering from failure, acceptable tradeoffs
           | between speed and accuracy, etc
        
         | alexpetralia wrote:
         | The other thing I'd emphasize here is dealing with "state".
         | Data is effectively state.
         | 
         | As application engineers build increasingly "stateless" code
         | (e.g. pure functions, serverless deployments, etc), that state
         | gets pushed elsewhere. Someone has to manage the queues, file
         | versions/locations, logs, databases, configurations and so on.
         | That is all "data".
         | 
         | State management is a tricky problem even in a single-threaded
         | application. It's doubly so in distributed systems, where state
         | can be inconsistent between all the moving pieces. This is the
         | source of endless data integrity issues. I think data
         | engineering is a great way to get some exposure to all of this.
        
           | darksaints wrote:
           | > As application engineers build increasingly "stateless"
           | code (e.g. pure functions, serverless deployments, etc), that
           | state gets pushed elsewhere.
           | 
           | Exactly. You can't magically make a stateful problem
           | stateless, you can merely move that state around. Sometimes
           | moving state around means moving it somewhere that is
           | appropriate and capable of expertly handling that data. But
           | if you make those choices wrong, it makes every aspect of
           | your application more complex.
           | 
           | UI programming tried going down this path of stateless
           | programming, and for a while it was trendy to do so with stuff
           | like redux. The problem is that UIs are state machines.
           | That's not an analogy, that is a literal statement. And it is
           | true of all UI's...it's just as true of the transmission
           | lever in your car as it is for your saas dashboard. You can't
           | program stateless UIs...they would cease to be a UI. So at
           | best, you can move that state around. And with most of these
           | solutions (eg. redux), you end up pushing that state into a
           | massive global singleton, where even simple things like the
           | state of a single radio button needs to be fed through dozens
           | of tightly coupled components in order to "statelessly"
           | render. And even worse, you lose the extremely helpful
           | distinction between UI state and domain state, mixing them
           | both together into a gigantic shit stew.
        
           | jsinai wrote:
           | >The other thing I'd emphasize here is dealing with "state".
           | Data is effectively state.
           | 
           | It gets even more complicated. It's not just the current
           | state that matters, but also the history (sometimes the
           | entire history) up to that state.
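
One common way to keep both the current state and its history is to store an append-only list of events and derive state by replaying them. The sketch below is a generic illustration of that idea and is not something the commenters describe; the `Event` fields and the `replay` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int    # position in the history
    key: str    # which piece of state the event touches
    delta: int  # change applied to that piece of state


def replay(events, up_to_seq=None):
    """Rebuild state by applying events in sequence order.

    Passing up_to_seq reconstructs the state as of an earlier point,
    which is only possible because the history itself was kept.
    """
    state = {}
    for e in sorted(events, key=lambda e: e.seq):
        if up_to_seq is not None and e.seq > up_to_seq:
            break
        state[e.key] = state.get(e.key, 0) + e.delta
    return state
```

With events arriving out of order, `replay(events)` gives the latest state and `replay(events, up_to_seq=2)` gives the state after the second event. Storing only the final dictionary would make the second question unanswerable.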
        
         | theflyinghorse wrote:
         | I wonder how: 1. one finds organizations that have data
         | engineering 2. gets hired to said organization with software
         | engineering background.
        
           | khaledh wrote:
           | Shopify is hiring 2,021 engineers (not just data engineers)
           | in 2021: https://www.shopify.ca/careers/2021
        
           | walleeee wrote:
           | Nearly any field of computational science likely needs
           | skilled data engineers. You could search for topics that
           | interest you online and contact people accordingly.
           | 
           | I cold-emailed my current lab's P.I. and just asked for work.
           | Search for "research software engineer" or "scientific
           | computing professional" positions. Plenty of data engineering
           | goes on in many fields (environmental science, climate
           | modeling, high energy physics, physical chemistry, etc), and
           | plenty of fields desperately need to develop an engineering
           | culture (e.g., plant biology, my field), whatever interests
           | you. Availability and compensation will vary by discipline.
        
       | [deleted]
        
         | nerdponx wrote:
         | This hits me at a personal level.
         | 
         | This is how I imagine programmers must have felt in the 80s and
         | 90s.
        
       | jesseryoung wrote:
       | 4 years ago I moved from a role where I primarily wrote C# as an
       | architect on a web application, to an architect helping to build
       | a data warehouse. The contrast in tooling, discipline and
       | information available to build anything in the data world is so
       | stark it had me questioning my career decisions. Sure, you can
       | read Kimball and Inmon and I'm sure there are a handful of others
       | out there - but there are drastically fewer than what you can
       | find in the application development space.
       | 
       | Things are getting better: visual ETL tools are falling out of
       | favor, replaced by proper coded ETL (spark, dbt, etc), and data teams
       | are starting to see the value of actually engineering a solution
       | instead of just throwing it over the wall to a DBA to deal with.
       | But tooling and general information on the web are still lacking.
       | Pushing data engineers over "etl developers" or "bi developers"
       | (or "data scientists") will drastically improve any organization's
       | ability to actually deliver real analytics, and hopefully an
       | industry-wide push will raise all ships.
        
       | tpoacher wrote:
       | We don't need [old but still relatively new definition, whose
       | meaning still isn't fully agreed on or established]!
       | 
       | We need [brand new definition of the same, which most people are
       | even more confused what it means and how it's different from the
       | old]!
        
       | itsoktocry wrote:
       | My first job as a "Data Scientist" (it wasn't called that, but
       | the work was the same) was for a small gaming shop, around 2011.
       | It involved applying econometric analysis and doing simple
       | statistical testing on the player data sets. I realized quickly
       | that knowing how to do statistical testing was only a very small
       | portion of what it took to create value in such a role. At the
       | time, I didn't even know (but learned) SQL. Everything I wanted
       | to do involved teaming up with a developer, which wasn't
       | efficient in a small operation. So I learned to program. I
       | continue to enjoy skilling-up, most recently learning cloud-tech
       | to enable me to deploy data tools I develop.
       | 
       | The most valuable people in the data chain will be those that can
       | take idea to near-production. Running ML libraries over clean
       | datasets is overrated. The fact is, 80% of the value of "Data
       | Science" comes from KPIs and basic stuff.
        
         | hardtke wrote:
         | > The most valuable people in the data chain will be those that
         | can take idea to near-production.
         | 
         | Having hired many data scientists/ML engineers over the years,
         | people that build robust automated intelligence directly into
         | products are extremely rare. I've estimated a maximum of 10K
         | people in the entire world. They also command the highest
         | salaries, not coincidentally. Very few people have both the
         | statistics and engineering backgrounds as well as temperament
         | to be successful, particularly when the problem requires new
         | data sources or new types of models. There are some real simple
         | practical hurdles such as the need to implement robust tracking
         | that allows data snapshotting at the time when a decision needs
         | to be made without affecting product performance, as well as
         | figuring out how to gather data on users/situations that are
         | actually important for moving the needle (first time users,
         | casual users, etc.) There is also a mismatch between the best
         | frameworks for prototyping/research and implementation which
         | (at least at the companies I've worked for) can be summarized
         | as "Java is good for application development, not ML. Python is
         | good for ML, not application development."
        
       | tristor wrote:
       | Agreed. In my decently long career the types of data problems
       | I've seen be most impactful on the business are not head-in-the-
       | clouds ML issues, but more mundane yet more far-reaching:
       | 
       | 1. Appropriately identifying what data needs to be captured from
       | a product to correctly operationalize it.
       | 
       | 2. Understanding and modeling data structures in internal
       | applications to identify and tune backend data storage mechanisms
       | (including DBMS). Inclusive in this is helping the application
       | development team pick the correct structure and implement it
       | correctly.
       | 
       | 3. Validating implementation of instrumentation within the
       | application so that data cleaning isn't necessary and that
       | telemetry can be appropriately reported on. Building said
       | reports.
       | 
       | 4. Doing ETL and taking care of out of band data management to
       | link disparate systems within the business to help build holistic
       | views of the business overall.
       | 
       | 5. Be a safeguard against the over-collection of data, because
       | data engineers understand that data isn't an asset, it's a
       | liability that increases costs and risks as a business or product
       | scales, and when there's not a specific need that can be
       | articulated clearly for that data, collecting it is a
       | user/customer-hostile action.
       | 
       | My experience has been that data is a crucial element to
       | understand the health and state of the business with both breadth
       | and depth at a given point in time and identify trends. However,
       | it's mostly used by folks in management as a crutch to try to de-
       | risk decision making, or worse as a political tool to give a faux
       | support to a decision that's already been made but not yet
       | publicized. Decisions carry inherent risk, including the decision
       | to do nothing, you cannot eliminate this, it's one of the
       | components of decision trade-offs. This sort of broken use of
       | data by management is supported by "Data Scientists" that see the
       | field as a cash-cow they can milk while they work on pie-in-the-
       | sky ML strategies which are often unnecessary, even when they
       | actually work.
       | 
       | Done correctly a strong data culture in a company can increase
       | decision velocity, empower engineers, and reduce overhead on
       | management to understand the business. Done improperly, data
       | culture in a business can easily destroy decision velocity,
       | empower dysfunctional politics, and increase engineering overhead
       | to understand systems. Getting it right is the main test for
       | businesses in the new era.
        
         | lamename wrote:
         | This sounds reasonable. The trick is to identify
         | companies/cultures of each type at interview time. But is that
         | even possible?
        
       | hehehaha wrote:
       | I want to live in a world where data scientists making nearly
       | $500K can understand and correctly implement simple concepts such
       | as fixed effects. Is that asking for too much?
        
         | beckingz wrote:
         | Sure, but what about at $80k?
         | 
         | None of this discussion is helped by the fact that companies
         | want to get into data science without actually having much data
         | strategy.
        
       | donkeyd wrote:
       | This is interesting. I was the technical founder for a data
       | startup that used NLP, elastic and some other stuff for analysis.
       | It's still active, growing and approaching profitability, is used
       | by fortune 500 companies and has had some media attention.
       | However, I've never been approached for a related role and have
       | never been invited after applying for similar roles.
       | 
       | Maybe my resume is bad, maybe my experience doesn't really fit
       | anywhere, but I thought it was an interesting observation in
       | light of this article.
        
       | dqpb wrote:
       | workera.ai has a very interesting approach to measuring,
       | categorizing, and visualizing where you are on the skill spectrum
       | and what employers are looking for on that spectrum.
        
       | 6d6b73 wrote:
       | I would go even farther. We don't need data, we need knowledge.
        
       | shoshin8 wrote:
       | I think just replace this with. Need Engineers
        
       | monksy wrote:
       | This has been a bit of an annoying thing for me for quite a
       | while. There is a huge difference between a data engineer and a
       | data scientist. A data scientist is not and should not be a data
       | engineer. These are 2 different specializations that work
       | together.
       | 
       | A data engineer is more of what we used to refer to as people who
       | wrote utility processes to process data or do system
       | optimizations. At some point the industry decided to do away with
       | a lot of the things we used to do (desktop apps, distributed
       | systems, etc) and moved to REST services only. Then people
       | realized oh wait.. we can't process data on a rest/web app. In
       | typical inexperienced fashion, people tried to cram it in there, but
       | it doesn't work. (See Javascript neural networks)
       | 
       | What is data engineering? It's all about moving data around
       | efficiently and processing it in a way that is performant and
       | reactive. A lot of people tie hadoop/spark to being a data
       | engineer. That's a terrible way of going about it. The more
       | modern approach to this is using streaming platforms and
       | reacting to events. (Sadly a lot of the ML stuff is tied to
       | either python/tensorflow or spark)
       | 
       | At times data engineering is pushed towards data maintenance and
       | pushing all of the data in a bucket. This isn't a very valuable
       | use of effort.. but people want to be buzzword compliant.
       | 
       | Note: There are use cases for hadoop and spark.. but those are rarer
       | now. (They're better for very large datasets and for merging data
       | where you have a much longer timeframe for the answer).
        
       | sbpayne wrote:
       | +1. I generally say that data scientists bring an amazing skillset
       | to the table, but companies can only leverage 10% of it.
        
       | AndrewKemendo wrote:
       | Preach!
       | 
       | The data lifecycle is waaay overpopulated with Data Scientists
       | who are not empowered or knowledgeable enough to work with
       | product designers and engineers to do everything that empowers
       | Data Science and ML.
       | 
       | We need more Data Engineers involved at time zero in projects to
       | help:
       | 
       | 1. Plan out what data should be produced/captured by the product
       | 
       | 2. Instrument systems to actually generate data consistently and
       | effectively
       | 
       | 3. Build ETL pipelines and data management systems
       | 
       | 4. Manage enterprise data sharing and resiliency
       | 
       | etc...
       | 
       | What ends up happening is you have a bunch of Data Scientists
       | just handed a pg_dump or flat file from some ops team. That is
       | typically missing data or poorly formatted and they spend 90% of
       | their time cleaning it up then running some basic regression with
       | numpy or whatever.
       | 
       | Need better understanding of the data lifecycle by organizations
       | and investment in instrumentation and data management.
        
         | bitcharmer wrote:
         | This same sentiment (which I personally agree with) applies to
         | software engineering. As in: engineers deliver more practical
         | value than comp scientists. Now you can down-vote me to
         | oblivion.
        
           | spacemanmatt wrote:
           | Is this sentiment perhaps due to someone "practicing CS" on
           | your engineering schedule? What's the real harm you're
           | describing?
        
           | vbtemp wrote:
           | You and me both, friend. Except I normally get down-voted to
           | oblivion for saying the opposite.
        
             | adamisom wrote:
             | I guess depends on what valuable means. I imagine most comp
             | scientists are less replaceable than most software
             | engineers, so point for compsci.
        
               | collyw wrote:
               | Depends if you have a computer scientist doing a software
               | engineer's job
        
               | johnqpub wrote:
               | The two are complementary. Engineers can't do anything
               | without the fundamental insights scientists provide. But
               | scientists don't have the practical experience of writing
               | end products that real users use.
               | 
               | Obviously this is a huge generalization but I think it's
               | a useful way to think about it. And when I say scientist,
               | I mean "Professor of CS" not "24 year old with a BS in
               | CS".
        
           | jimbokun wrote:
           | I think generally, Computer Science is a degree and Software
           | Engineer is a job description. So many people get Computer
           | Science degrees, then have a career as a Software Engineer.
           | 
           | Yes, there are Software Engineering degrees. But I think a
           | minority of Software Engineers have a Software Engineering
           | degree.
           | 
           | What this means in practice, is that Computer Science majors
           | need to learn the engineering skills on the job or on their
           | own after they graduate. Although some programs help students
           | pick up some of those skills as part of the degree program.
        
             | Fellshard wrote:
             | Anecdotal: University of Washington considers (considered?)
             | them two separate degrees, holding CS as more theory and
             | research-driven, and CE as more practice and career-driven.
        
           | TheCoelacanth wrote:
           | I think it would apply if companies were hiring large numbers
           | of computer scientists and using them to try to build usable
           | software. I don't see many making that mistake. Most
           | recognize that computer scientists belong in a research or
           | academic setting.
        
         | mgh2 wrote:
         | Who will do the proper cleaning then?
        
           | nerdponx wrote:
           | It doesn't matter, as long as you don't make the person with
           | the PhD in biostatistics spend their time writing ETL
           | pipelines, which is a wildly inefficient use of a very
           | expensive resource.
        
             | names_are_hard wrote:
             | Do people with PhDs in biostatistics earn significantly
             | more than programmers? I honestly know nothing about the
             | market for biostatisticians, but my impression was that
             | advanced degrees in the natural sciences don't really pay
             | that well compared to software engineers, especially given
             | that they're much more educated.
        
               | aldanor wrote:
               | If they work e.g. in a hedge fund / trading firm, then -
               | yea. And you see lots of PhDs from unrelated fields
               | working as quants there.
        
           | Enginerrrd wrote:
           | Not to worry, corporate will just outsource to a firm which
           | hires Data Janitors
        
           | alex_anglin wrote:
           | The aspiration that GP was getting to was that less cleaning
           | is required as a result of better data engineering, I
           | believe.
        
             | AndrewKemendo wrote:
             | Correct. If you build your instrumentation correctly, then
             | you don't really need to do any "cleaning."
             | 
             | Doesn't mean you might not need to do transformation for
             | different uses but ideally wouldn't need to, for example
             | change data types like turning a bool into an int.
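             | 
             | A minimal sketch of what I mean by getting types right at
             | the source. Everything here is hypothetical (the
             | CheckoutEvent name and its fields are made up for
             | illustration); the only point is that a wrongly typed value
             | is rejected at capture time, so nobody downstream has to
             | guess or re-cast it:

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class CheckoutEvent:
    """Hypothetical product event; the field names are illustrative only."""

    user_id: int
    is_trial: bool
    amount_cents: int

    def __post_init__(self):
        # Reject wrongly typed values when the event is created,
        # instead of cleaning them up after they reach the warehouse.
        for f in fields(self):
            value = getattr(self, f.name)
            # bool is a subclass of int in Python, so compare exact types.
            if type(value) is not f.type:
                raise TypeError(
                    f"{f.name}: expected {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )


if __name__ == "__main__":
    event = CheckoutEvent(user_id=7, is_trial=True, amount_cents=1999)
    print(asdict(event))
```

             | Turning is_trial into 0/1 for a model is then a one-line
             | transformation, int(event.is_trial), rather than cleaning.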
        
               | mgh2 wrote:
               | Do data engineers have good analysis skills? Do business
               | analysts have good engineering skills? I don't think
               | either of them can fill the data scientist role.
               | 
               | The scientific training and mindset (scientific method,
               | hypothesis, experiment setup, etc.) to even create an
               | accurate model is an undervalued skill here no? Even if
               | data cleaning is automated, these skills cannot be easily
               | learned.
               | 
               | There is a reason why so many PhDs get into the field,
               | because they were trained in the exploratory/research
               | mindset that no engineering or analytics skills can fill.
               | Correct me if I am wrong.
        
               | dijksterhuis wrote:
               | > Do data engineers have good analysis skills?
               | 
               | Yes.
               | 
               | > Do business analysts have good engineering skills?
               | 
               | Depends on the analyst.
               | 
               | > I don't think either of them can fill the data
               | scientist role.
               | 
               | > The scientific training and mindset (scientific method,
               | hypothesis, experiment setup, etc.) to even create an
               | accurate model is an undervalued skill here no? Even if
               | data cleaning is automated, these skills cannot be easily
               | learned.
               | 
               | It's not about replacing data scientists with data
               | engineers, it's about both roles working together to make
               | everything more efficient.
               | 
               | The hiring rate for data scientists has plateaued. The
               | industry doesn't need any more of them. Why? Because data
               | scientists often can't solve problems fast enough. It's a
               | commonly quoted statistic that 70% of any data science
               | task is data cleansing and/or etl. A data engineer's job
               | is to take that 70% and turn it into 10%. The data
               | engineer saves the data scientist time, meaning they can
               | focus on what they're supposed to do -- build models.
        
               | dpayonk wrote:
               | If we only had to use 1st party data, that might be
               | easier. But then again, if you're building your product
               | incrementally, you're still going to have instrumentation
               | holes that you may or may not be able to partially
               | backfill.
        
               | darksaints wrote:
               | The problem is that data engineers that are geared
               | towards analytics very very rarely control the systems
               | that create the data. If you're lucky, you have the task
               | of hounding a team within your company to get their data
               | management practices in order. And the conversation there
               | is whether they should make their job harder in order to
               | make your job easier.
               | 
               | Unfortunately, data engineers rarely deal with purely in-
               | house data. You're gonna be pulling data from a variety
               | of data sources. I can assure you that if you're pulling
               | from government data sources, you're gonna have a hell of
               | a time. Speaking from direct experience, my team is
               | probably going to spend $10M/year just trying to keep a
               | government dataset in order, because they won't do it
               | themselves. I'm talking lawyers, legal analysts, data
               | engineers, data scientists, data entry personnel, etc..
               | just to fix data that should have never been broken in
               | the first place.
               | 
               | It shouldn't be a shock that cleaning the data is the
               | path of least resistance for many.
        
               | AndrewKemendo wrote:
               | Hence why I said DE need to be involved as early as
               | possible. Aspirational sure, but that's what I've seen
               | work the best and repeatably. It's the only scalable
               | solution IMO otherwise you're perpetually playing catch-
               | up.
               | 
               | On the point about the govt I literally built a
               | completely new contract type and civilian hiring
               | practices for the DoD to bring in Data Engineers so they
               | could do exactly what I describe to make your life
               | easier.
        
         | [deleted]
        
         | antipaul wrote:
         | I have a dream - and it looks like this!
        
         | alexfromapex wrote:
         | I agree, if anything the data engineers (folks with engineering
         | backgrounds) should be doing the applied work while a
         | department of data scientists works on the theoretical or novel
         | data analysis methods.
         | 
         | Right now our product has accumulated a lot of technical debt
         | on the data validation side because data scientists designed
         | the test code in a way that dramatically slows the development
         | process.
        
           | listenallyall wrote:
           | > novel data analysis methods
           | 
           | Many "data scientists" (not all, but many) have little to no
           | ability to do anything other than apply "recipes" of
           | algorithms or classification methods or logistic regressions,
           | etc. Asking them to develop a "novel" method would be
           | fruitless. Asking them to clean and scrub the source data set
           | is like telling an amateur pie-baker the store was out of pie
           | crusts, you'll have to make your own from scratch -- it's not
           | going to happen, they just don't have that skill, the
           | instructions on the box don't account for that possibility.
           | As soon as the task diverges from the simple step 1, step 2,
           | step 3 that they were originally taught, you realize they
           | have very little ability to adapt. YMMV of course.
        
             | oivey wrote:
             | Yep. The key is really software skills. If you're unable to
             | even filter the data yourself, you're also probably
             | unlikely to be able to implement novel analysis techniques,
             | especially if the analysis algorithm has many complicated
             | steps or is computationally expensive.
             | 
             | In all fairness, it's basically impossible for a new grad
             | to have those skills. 4 years of a bachelors in any field
             | isn't enough to cover such a wide area. Even for people
             | with graduate degrees it's a stretch.
        
               | jimbokun wrote:
               | The hope is that the 4 year degree gave you the ability
               | to quickly pick up those skills on your own.
               | 
               | If your four year degree didn't give you the ability to
               | learn and expand your knowledge on your own, it's a
               | colossal waste of your time and money.
        
               | oivey wrote:
               | Sure, but depending on what you're doing, "quick" might
               | be years. You can get a PhD in understanding the theory,
               | a PhD in designing fast numerical algorithms, or spend
               | many years becoming a strong software engineer. I think
               | the willingness to learn a diverse set of things is much
               | more important than learning narrow areas fast. The short
               | length of a bachelors usually isn't enough to get this
               | diversity.
        
             | tchalla wrote:
             | > Many "data scientists" (not all, but many) have little to
             | no ability to do anything other than apply "recipes" of
             | algorithms or classification methods or logistic
             | regressions, etc.
             | 
             | This is because they rarely hire people with scientific
             | thinking ability. They just hire people who can code and
             | program from set recipes. Once you hire such people you can
             | not expect them to do non-recipe work. If you don't want
             | recipe work, don't hire people with recipe skills. Do not
             | have job interviews that select for recipe people. But,
             | that is exactly what most companies do.
        
               | simo7 wrote:
               | Precisely.
        
             | superhuzza wrote:
             | >Asking them to clean and scrub the source data set...it's
             | not going to happen, they just don't have that skill
             | 
             | I think you've been working with conmen/conwomen. I've
             | never seen a data science project that _doesn't_ involve
             | data cleaning or wrangling of some sort.
        
               | listenallyall wrote:
               | Have you read through the comment thread? Did you read
               | the article? Most everyone is in agreement that projects
               | require a lot of cleaning & wrangling and a lot more --
               | the point is that data _scientists_ are generally not
               | doing that stuff, they expect academic-quality, pre-
               | processed, pristine data, so it's data _engineers_ who
               | are stuck preparing the data, and who are in high demand.
        
               | superhuzza wrote:
               | Yes I read both the article and the comments.
               | 
               | I meant a data science project in terms of a project
               | completed _by_ data scientists. In my experience, all
               | data scientists are accustomed to doing extensive
               | cleaning etc.
        
           | watermanio wrote:
           | This feels especially true when you have access to things
           | like BigQuery ML.
           | 
           | It's very easy for an average engineer (like me) to start
           | using ML using these tools, but a lot harder to explain how
           | it works, or exactly which type of models to use.
           | 
           | In my mind a DS would be really useful to just point us in
           | the right direction and check work. Like a super specialist
           | QA...
        
         | fatnoah wrote:
         | >The data lifecycle is waaay overpopulated with Data Scientists
         | who are not empowered or knowledgeable enough to work with
         | product designers and engineers to do everything that empowers
         | Data Science and ML.
         | 
         | Reading this thread has made me realize just how lucky I am to
         | work very closely with a very strong Data Scientist, who
         | is complemented by a very strong Data Engineer. Conversations
         | with the Data Scientist are always about strategy, product
         | alignment, and ensuring we're optimizing what we build for
         | learning. The Data Engineer works very closely to ensure we're
         | actually capturing the data we think we are, getting it to
         | analysis systems, and making sure those data pipelines stay
         | healthy.
        
         | dumb1224 wrote:
         | In specific research areas such as biomedical science it is
         | certainly tricky to get involved because of the data governance
         | / confidentiality issue... so we have to do both roles to some
         | extent
        
         | mynameisash wrote:
         | > What ends up happening is you have a bunch of Data Scientists
         | just handed a pg_dump or flat file from some ops team
         | 
         | Not to disparage the amazing data scientists I've worked with,
         | but I've been on teams where this is very much the approach to
         | operationalizing models. It's basically, "Here's the sklearn
         | model and some fragile featurization scripts we built. Can you
         | take this to prod ASAP?"
         | 
         | The problem I've seen is that DS & DE teams were in different
         | parts of the org and had their own sprints that were in no way
         | connected. So they kept chucking models over the wall and we
         | kept trying to faithfully operationalize. Once we convinced
         | leadership that we had to collaborate from the get-go, things
         | went a whole lot better. It also improved the working
         | relationship of engineers and scientists.
         | 
         | I learned a hell of a lot from the scientists; they learned how
         | to write better code. They also learned what code they didn't
         | need to write because I could do it faster or better than them,
         | leaving them to focus on more important things. It was pretty
         | amazing to find what manual processes they would set up in lieu
         | of proper (or even _any_) engineering support. Again, these
         | are amazingly smart people, but they were being square-pegged
         | into a lot of round-hole engineering tasks.
         | 
         | Now, the much more frustrating issue I had was being in a very
         | data-heavy organization and being told by a distinguished
         | engineer (my skip-level) plus my direct manager that, "data
         | engineering isn't a real discipline." I left that org very
         | shortly thereafter.
        
           | civilized wrote:
           | This is 100% my experience as a data scientist. The
           | engineering support we get is restricted to submitting a
           | ticket for database access or moving data from one system to
           | another. Wouldn't dream of involving an engineer in a data
           | science project team, because I have no evidence that they
           | have any experience or expertise in anything other than
           | tickets to move data around.
        
             | Mauricebranagh wrote:
             | That's first line support not engineering
        
         | hinkley wrote:
         | This is a systemic problem. We ask non software engineers to
         | write code, and then we expect them to apply a level of
         | robustness and long term planning that even we have difficulty
         | achieving. Not because we're being picky, but because we know
         | the failure modes that are likely, and we know that people
         | convince themselves that they aren't.
         | 
         | We've been through this with installer writers, database
         | admins, test automation, operations people, and now 'devops'
         | people who were supposed to be the answer to these problems. It
         | never stops.
        
           | AndrewKemendo wrote:
           | It's a two way street. SWE need to learn some data practices
           | and data folks need to learn some SWE practices.
        
             | hinkley wrote:
             | Oh absolutely. We'll build completely the wrong thing, but
             | build it well (which just makes it all the harder to throw
             | it away).
        
         | aqme28 wrote:
         | > you have a bunch of Data Scientists just handed a pg_dump or
         | flat file from some ops team.
         | 
         | I feel seen. At a previous job, our output after some cleaning
         | and transforming was a pg_dump for the data scientists to load.
         | We had little visibility of what they did to that database once
         | they got it.
        
           | nitrogen wrote:
           | I suspect in rare cases this is by design, because engineers
           | would object to the behavior of the Business Intelligence
           | department on ethical grounds.
        
             | aqme28 wrote:
             | If anything, we would have objected to the quality of the
             | code they were writing.
        
       | coding123 wrote:
       | A couple of us inherited a machine learning project a while back.
       | The code was horrible. Riddled with copy pasta (nearly half of
       | the entire thing was copy paste and no code reuse). We basically
       | refactored everything, standardized input and output file names.
       | We put up a small Flask service to let outside services hit it
       | easily and wrapped it up in a Docker container so it was
       | ultimately easy to deploy. Yes it was all the plumbing. However
       | we also looked at the code, and the ML strategies, and while
       | there was "some" level of competence, it was nothing more than
       | word2vec add and divide. Totally horrible for actually finding
       | key phrases that matter to the subject we're matching. So we
       | started tackling that too with LSTM but our time got cut short
       | and shifted off to another area. So not only was the "scientist"
       | they hired completely crappy at the engineering, they weren't
       | really helpful in the ML either.
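       | 
       | For anyone unfamiliar, "add and divide" just means averaging the
       | word vectors of a text. A rough sketch with made-up 3-dimensional
       | vectors standing in for real word2vec embeddings (this is an
       | illustration, not the code we inherited):

```python
# Toy 3-dimensional vectors standing in for real word2vec embeddings.
# The words and values are made up purely for illustration.
TOY_VECTORS = {
    "data": [1.0, 0.0, 0.0],
    "engineer": [0.0, 1.0, 0.0],
    "scientist": [0.0, 0.0, 1.0],
}


def average_vector(tokens, vectors=TOY_VECTORS):
    """Add the vectors of the known tokens and divide by how many were found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        raise ValueError("no known tokens to average")
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


if __name__ == "__main__":
    # Word order is lost: both phrases collapse to the same vector.
    print(average_vector(["data", "engineer"]))
    print(average_vector(["engineer", "data"]))
```

       | Every token gets equal weight and word order disappears, which
       | is one reason this kind of averaging tends to blur the phrases
       | you actually care about.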
       | 
       | This is obviously of lesser value to the topic at hand, and more
       | about making sure you hire good people I think.
        
         | wpietri wrote:
         | The phrase "If you can't dazzle them with brilliance, baffle
         | them with bullshit" comes to mind.
        
         | mlthoughts2018 wrote:
         | I am curious what your take is on things like this article:
         | 
         | https://managingml.substack.com/p/the-myth-that-machine-lear...
         | 
         | It has been my experience too. Basically, ML / DS engineers are
         | thrown under the bus for being poor general software engineers,
         | but in practice it's totally the opposite.
        
           | nerdponx wrote:
           | The problem is that ML engineers are not the people who wrote
           | GP's garbage code. Data scientists wrote it, and I know at
           | least a few of my very intelligent, high-functioning data
           | scientist colleagues who are alarmingly, astoundingly bad
           | programmers.
        
       | asimjalis wrote:
       | Who gets paid more? Maybe the question to ask is why market
       | forces are not naturally producing more of what is needed?
        
       | program-iscuous wrote:
       | That's an interesting post, however there's a large bias in my
       | opinion in how the analysis is done.
       | 
       | You have a stratified sample of companies in their early stages,
       | I think it's quite normal for most companies in their early
       | stages to prioritize data engineers rather than data scientists.
       | 
       | Data scientist comes after the data engineer, and if you have a
       | data scientist and not a data engineer then probably the data
       | scientist does both jobs. On the other hand, data engineer is not
       | dependent on a data scientist.
       | 
       | To conclude, I think that indeed there are more data engineer
       | positions because there are too many "data scientists", however
       | the true difference is not as large as in your analysis.
        
       | hankchinaski wrote:
       | the article is long overdue. so many times i have seen data
       | scientists glue together non production ready code and libraries
       | expecting someone else to finish the job and put the project to
       | production. in various places i worked at they have soon realised
       | that we mostly and exclusively needed data engineers - glorified
       | DevOps people that make production ready data pipelines - more
       | than we needed data scientists to do "modelling"
        
       | thecolorgreen wrote:
       | I'm a SWE and data engineering actually sounds super interesting
       | to me. Unfortunately, my day-to-day doesn't provide opportunities
       | to work with the massive amounts of data we generate. I've looked
       | into learning this stuff online but courses like DataCamp seem
       | too basic (I have experience with Python and data cleaning in a
       | research setting along with some academic ML experience) or
       | downright a bit scammy. Many of the articles I read online about
       | this also frame data engineering as a way to transition into the
       | tech industry, which isn't my blocker. Does anyone have advice to
       | help me transition away from pure software to a job as a data
       | engineer?
        
         | nojito wrote:
         | We just hired a SWE turned Data Engineer. You don't need to
         | handle massive data to make the transition. I believe all the
         | person did was build a small but robust pipeline that took some
         | API response data, cleaned it, populated some sqlite dbs,
         | replicated it for a small team to use and kept it updated every
         | few days automatically.
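         | 
         | To give a sense of scale, something like that fits in a few
         | dozen lines. This is only a sketch, not what the person
         | actually built: the record fields, the scores table, and the
         | hard-coded sample standing in for the API response are all
         | assumptions.

```python
import json
import sqlite3

# Hypothetical sample standing in for a real API response body.
SAMPLE_RESPONSE = json.dumps([
    {"id": 1, "name": " Alice ", "score": "42"},
    {"id": 2, "name": "Bob", "score": None},    # missing score: dropped
    {"id": 1, "name": "Alice", "score": "43"},  # same id: later row wins
])


def clean(records):
    """Strip whitespace, cast types, and drop rows that have no score."""
    for rec in records:
        if rec.get("score") is None:
            continue
        yield int(rec["id"]), rec["name"].strip(), int(rec["score"])


def load(conn, rows):
    """Upsert the cleaned rows so that re-running the job is idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, score INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO scores (id, name, score) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def run(conn, payload=SAMPLE_RESPONSE):
    """Parse, clean, and load one payload, then return the table contents."""
    load(conn, clean(json.loads(payload)))
    return conn.execute(
        "SELECT id, name, score FROM scores ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    print(run(sqlite3.connect(":memory:")))
```

         | Keying the table on id and upserting is what makes it safe to
         | re-run on a schedule; the scheduling and replication parts are
         | left out here.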
        
           | thecolorgreen wrote:
           | That's good news for me. For a minute, I genuinely thought
           | I'd have to waste time going through a bunch of stuff I
           | already knew to learn a tiny bit and earn a certificate to
           | prove my skills.
        
       | eVoLInTHRo wrote:
       | Agree with the general idea. Data engineers are like the
       | essential workers of the data world, people who today may not
       | receive the appropriate level of appreciation that they deserve.
       | 
       | I think the glut of data scientists occurs because we clump so
       | many different skills and disciplines under the single term "data
       | scientist". Data scientists today come from so many different
       | backgrounds that the definition means something different to
       | everyone. Because of this, the surface area of possible skills
       | that could be expected of a data scientist is vast, to the point
       | where it's pretty unlikely to be sufficiently competent in all of
       | them, let alone a majority.
       | 
       | I'd like to go back to a world where we had a little more
       | specificity about what kind of data scientist you are (e.g. I had
       | no problem with terms like statistician and data miner), which
       | could help ground expectations that others have of us, and it'd
       | also help clearly define the scope of various career paths for
       | the next generation.
       | 
       | Sadly the individual who coined the term shows no contrition for
       | the degree of confusion that the rest of us have been left to
       | deal with: https://observer.com/2019/11/data-scientist-inventor-
       | dj-pati...
        
       | Ansil849 wrote:
       | Genuine question: why is there so much pure teeming hatred for
       | data scientists in this comment thread? Almost every comment
       | comes off as full of snark and vitriol against data scientists.
        
         | danbrooks wrote:
         | I would guess that it's a reaction to job title hype.
         | 
         | There's a huge variety in DS responsibility and background
         | between companies.
        
         | bart_spoon wrote:
         | It tends to happen any time something becomes trendy. Data
         | science has/had a lot of hype over the last decade, and people
         | seem to have an inherent tendency to want trendy things to fail.
         | Combined with that, you have a bunch of people hopping on the
         | data science bandwagon, so you get a lot of grifters, snake oil
         | salesmen, or simply individuals whose output is poor quality.
         | Seems to have created a feedback loop, where there is always a
         | new example of some AI solution failing, or a data science
         | initiative that didn't work out that everyone can point at and
         | say "See! I always knew this trend was dumb!".
         | 
         | Reality is that data science is here to stay. It's coming out
         | of the honeymoon period, and things may never be as hyped up as
         | it has been the last decade, but that's probably a good thing
         | for the field. Everyone will probably move on to hating the
         | next up and coming thing. I have a hunch it could be something
         | in data engineering because, while not exactly new, it is
         | absolutely the next "data science" in terms of demand, and with
         | products like Snowflake having so much hype behind them, it
         | seems the backlash will be inevitable.
        
         | andrewnc wrote:
         | I came to this thread interested in the discussion, but I feel
         | now like the Homer Simpson meme retreating into the hedge.
         | 
         | Maybe I'll come back in a few hours, but for now I'll stay
         | away.
        
         | lottin wrote:
         | Yes, the tide is turning now... who came up with the term "data
         | scientist" anyway? It's a made up profession. If you need
         | someone who understands statistics, get a statistician, or
         | maybe a mathematician. If you need someone that designs and
         | writes computer programs, get a computer programmer. But a
         | "data scientist"? No, thanks.
        
         | astrophysician wrote:
         | I'm guessing just venting personal frustrations due to their
         | own experiences, plus maybe poor hiring and guidance of data
         | scientists in their own teams? I have definitely seen DS people
         | in my experience that fit some of the descriptions here, but I
         | think it's a mistake to trivialize the DS position itself. A
         | good DS is a valuable asset, but depending on your
         | company/data, maybe not worth the cost. Plus there is no
         | "single" DS candidate or role, a lot of these roles (data
         | engineer, DS, analyst, swe) blend together at times, and it's
         | about finding the right balance of skills.
         | 
         | Sometimes I think a company (not having the DS experience
         | themselves) mistakenly over-hires DS roles in today's hype of
         | "AI" when their data is mostly run-of-the-mill and only
         | requires simple linear models that can be architected and
         | understood by a stats/math-savvy engineer. Even then, a good DS
         | is still useful (even linear models can be complex: e.g. what
         | priors do you want to use? Do you want a multi-task solution?
         | etc.), but maybe not worth the cost.
        
         | mywittyname wrote:
         | I think a lot of people feel like data scientists get all the
         | credit and the fun work, while we have to do all the heavy
         | lifting and boring stuff that they need. There's this idea that
         | data science is a special and unique skillset that couldn't
         | possibly be possessed by a simple software engineer. Despite
         | being largely a subdomain of CS.
         | 
         | I remember an era before "data scientist" was a job title. When
         | we (programmers) would analyze data to see if we had enough
         | information available to identify the problem, if not, fix
         | that, then come up with a strategy to solve it, test, and
         | finally deploy the model. The fun part was trying different
         | solutions and analyzing the data. It also felt awesome to
         | deploy a product that worked like "magic." Product owners
         | didn't know or care what a neural net was, they were just happy
         | it worked.
         | 
         | Now there are tons of data scientists out there who take the
         | easy, fun, rewarding work and try to skip over the nitty gritty
         | implementation details. Then management thinks engineering is
         | incapable of doing such work, and the only time we get the
         | opportunity to do something fun is to do so behind the scenes.
        
         | bartleby_ wrote:
         | Probably just people blowing off steam and the target du jour
         | is data scientists. If people had to sit down with their
         | company's data scientists to air their grievances face to face
         | I doubt they would be so condescending.
        
       | aaroncrayford wrote:
       | What scientist doesn't use data? Such terrible naming.
        
       | LawrenceHecht wrote:
       | Building or running data infrastructure is an important part of
       | 55% of 372 data engineers' jobs, according to the "2020 Kaggle
       | Machine Learning & Data Science Survey." ...Check out three
       | charts I created showing differences between data scientists,
       | data engineers and machine learning engineers:
       | https://thenewstack.io/software-engineers-use-spreadsheets-d...
        
       | nlbrown wrote:
       | What are the typical entry level Data Science/Engineering
       | positions like? Are they available to fresh college or bootcamp
       | graduates?
        
         | marcinzm wrote:
         | From everything I'm hearing Data Science is right now flooded
         | with entry level applicants to the point where they're applying
         | for even pure Data Analyst jobs. Data Engineering a lot less
         | so.
        
       | brd wrote:
       | As an industry we're letting history repeat itself and making all
       | the same mistakes.
       | 
       | There are different kinds of developers. At its most basic form,
       | you have systems focused developers and algorithmic focused
       | developers. Sure there is a grey area but I think those two
       | buckets are pretty defensible.
       | 
       | In the data science world you have an exact parallel. Those who
       | build the systems and those who optimize the thing the system
       | supports.
       | 
       | In the ML world you have another parallel. Those who build the
       | systems and those who optimize and pioneer the model
       | architectures and parameters.
       | 
       | We never reached consensus on the titles for different kinds of
       | developer/programmer/computer scientists. And we're failing now
       | to reach consensus on sane titles for ML and DS.
        
       | ziml77 wrote:
       | Having to deal with data scientists, I absolutely agree. The
       | thing that I've seen that lands in the "lab" vs production
       | distinction is that these people expect their data to be
       | pristine. They flip out when the world isn't as perfect as their
       | models want. Leads to me as just a normal software developer
       | having to do the data analysis and figure out how to clean it up.
       | 
       | I also end up having to be the one to talk to data vendors to
       | understand their data feeds and essentially translate that for
       | the data scientists. Having to sit in the middle is annoying for
       | me and suboptimal for the business.
        
         | baron_harkonnen wrote:
         | The data science field has been flooded with PhDs with nowhere
         | else to go that have _no_ background in engineering, and sadly
         | often have a very poor understanding of both machine learning
         | and statistics.
         | 
         | Companies were in a rush to hire "data scientists" and boot camps
         | like Insight were more than happy to pump out very impressive
         | PhDs with just enough understanding to build a Keras model.
         | 
         | I've worked in industry awhile doing DS work and have been
         | astounded at the number of PhDs that both don't know how to
         | write Python that doesn't live in a notebook _and_ throw away
         | years of disciplined experimentation experience to just throw
         | keras models at data until the needle moves.
         | 
         | There do exist excellent data scientists out there, who are
         | both very solid software engineers and really know their stuff
         | mathematically, but I've found most of these people can't
         | reliably find jobs because the people interviewing them know so
         | little that good data scientists will be penalized for answering a
         | stock question _correctly_.
         | 
         | The field has been so flooded with amateurs that have no idea
         | what they're doing, that potential mentors have been driven
         | out, and now it's just a mess. To get a job doing DS if you do
         | know what you are doing you have to play a weird game where you
         | guess the incorrect answer the interviewer has in mind.
        
           | fock wrote:
           | I do an introductory Python lab course at my university. It's
           | targeted at engineers who still create graphs from Excel and
           | then normally level up to MATLAB, if things get complicated
           | (think insets, ...). I guess about 30% of the people
           | previously did at least some of the YT/Udemy "courses" on
           | datascience. It's really horrifying for me (not being an
           | engineer myself, but imo having a relatively engineering-like
           | mindset) to see these people horrified at simple tasks like
           | writing a variadic function. "What do I need this for?".
           | Well, it's using the programming environment. And then let
           | them code up a simple version of Levenberg-Marquardt. The
           | level of "why do I need to do this" is astonishing again...
        
             | nitrogen wrote:
             | _why do I need to do this_
             | 
             | IMO this is _the_ number one problem of our modern culture
             | around education. Popular culture makes it popular to treat
             | education as pointless, and this even affects students who
             | are pursuing difficult degrees.  "Why do I need to study
             | humanities? Why should I learn to code if I think I am born
             | to be someone else's boss?"
             | 
             | On the other hand, many teachers in K12 and early
             | university have no ability to connect the "what" with the
             | "why." "The curriculum is the curriculum. The test is the
             | test."
             | 
             | If we can solve these problems, our societies will be much
             | better off.
        
               | wickedsickeune wrote:
               | If the educator cannot explain why the knowledge is
               | useful, then he is unfit to teach it.
        
           | theflyinghorse wrote:
           | I work at a place with a very high count of PhDs. Some of
           | them write code. All of them view writing code as something
           | menial and unimportant and it shows in the resulting work,
           | which from my experience is atrocious.
           | 
           | Of course I understand that YMMV, but I will forever be
           | skeptical of anyone with a PhD writing code after working
           | here.
        
             | lordgrenville wrote:
             | Are they CS/EE PhDs?
        
           | nerdponx wrote:
           | Not to mention the dark pattern of giving data scientist
           | candidates an unsolved industry problem as their interview
           | take-home task, and then telling them to only spend 4 hours
           | on it. Data science hiring often feels like a competition
           | where the winner is the one who has the most free time and
           | willingness to do other people's work without compensation.
           | 
           | It's kind of a fucked up field right now.
        
           | boterham wrote:
           | Do you have any suggestions for where to start looking for
           | good places to apply that don't suffer from this?
        
         | glitchc wrote:
         | This implies a lack of rigorous training. In the physical
         | sciences, one wouldn't become an applied scientist without
         | conducting an experiment to test a phenomenon, and the teeth
         | gnashing that goes with making that experiment work.
         | 
         | Those who have been fed pristine data without having to undergo
         | the trials and tribulations of actually having to collect the
         | data have missed a crucial part of scientific training. Like
         | you, I find this lack of rigour is rather common among data
         | scientists. Not all, but quite a few.
        
         | astrophysician wrote:
         | Just want to say that while the data science profession
         | definitely includes a wide range of people and skillsets, a
         | _good_ data scientist should be practical and able to work with
         | the available data in whatever state it's in.
         | 
         | No good data scientist should ever expect data to be pristine.
         | And a good data scientist, even if they don't have quite the
         | engineering chops necessary to build a production-quality ETL,
         | should know enough about the process to help guide it. If they
         | aren't a part of that process, they're not being a good DS.
         | They can't expect someone not involved with their problem to
         | know what tradeoffs to make, and if they don't know _exactly_
         | how their data went from raw form to the ETL-ed form, they're
         | probably going to make bad assumptions, and those assumptions
         | may very well make their architected solution a complete pile
         | of garbage. Not to mention, how can a DS offer suggestions for
         | solutions if they aren't deeply familiar with the raw data
         | that's available?
         | 
         | To me, a good data scientist should, at bare minimum, have
         | several skills.
         | 
         | * They should first and foremost (but _not solely_) be an in
         | house expert in statistics and machine learning to know what
         | can be done with data, and what _can't_ be done with data.
         | They should arrive with that knowledge. Engineers I think have
         | a tendency to trivialize this, but true expertise in this
         | domain comes only with years of experience.
         | 
         | * They should strive to find modeling solutions that are
         | _right_ for a _particular business problem_. If they seem to be
         | only applying the hottest research regardless of the tradeoffs
         | for the particular business problem, that's a red flag.
         | 
         | * Their focus should be on integrating themselves with the
         | product/business as much as possible, and with the engineering
         | team as much as possible. If they're expecting to be handed
         | directives, that's a recipe for a ton of wasted time.
         | 
         | DS should never, ever be siloed into their own little DS world.
         | They will be useless without a deeply intimate knowledge of the
         | business goals, the needs of product, and the capabilities of
         | the engineering team.
         | 
         | As they progress, they should become more and more "full-
         | stack", otherwise they are stagnating.
        
           | tchalla wrote:
           | A good data scientist should also be good at science.
           | Otherwise, you can simply hire people with engineering skills
           | - you don't need scientists. If you hire scientists and then
           | are surprised they aren't good at engineering, the hiring
           | process needs a reality check.
        
             | jsinai wrote:
             | Statistics is a science as well. Unfortunately it's
             | overloaded in business terms and can mean anything from
             | "knows means and regressions" to "has a copy of _Meyn and
             | Tweedie_ on their shelf".
        
         | superbcarrot wrote:
         | I think that this is more of a problem with the specific people
         | that you have worked with and it isn't inherent to the role of
         | a data scientist.
        
           | ct0 wrote:
           | Doesn't sound like a modern Data Scientist, sounds more like
           | a statistician with 30+ years of experience.
        
           | sidlls wrote:
           | It's becoming more inherent, especially as the field is
           | populated with people who have no experience with the
           | "science" part. That is, with the very real and ubiquitous
           | problem of collecting and cleaning data to make it fit for
           | scientific study. Even theoretical physicists, for example,
           | participate in and rely on empirical data collection, and
           | understand deeply how messy and fraught with error it is.
           | 
           | I don't see the same appreciation or consideration in general
           | in the field of data "science."
        
             | Mauricebranagh wrote:
             | I remember working with someone who has a PhD in Physics and
             | who worked at CERN - and one comment I loved "a key skill
             | is knowing how to place the legend, so it obscures that
             | annoying outlier data point"
        
             | superbcarrot wrote:
             | > with people who have no experience with the "science"
             | part
             | 
             | It's interesting that you put it that way because a lot of
             | the other complaints in this thread are that the people who
             | expect their data to be ready for use are exactly the
             | people with science experience but without the relevant
             | technical background.
        
         | nerdponx wrote:
         | Instead of sneering at "having to deal with" data scientists,
         | consider that the data scientists themselves would often much
         | rather have data engineers and dev ops people involved in the
         | process.
         | 
         | Data scientists like to quip that 80% of the job is data
         | cleaning, with the remaining 20% divided up arbitrarily among
         | other tasks as suited the joke. In some shops nowadays, it's
         | more like 45% data cleaning, 45% data
         | engineering/ops/programming just trying to make your results
         | available to the rest of your organization, and 10% research.
         | 
         | If I can spend less time learning/doing software engineering
         | and devops and more time doing actual data science, that's
         | great. At a previous job, my team was _clamoring_ for more data
         | engineer hiring, and part of the reason our projects were
         | slipping and starting to fail was lack of data engineering
         | support. Our tooling was shit, our processes were shit, our
         | code was shit, and access to (and trust of) our data sources
         | was especially wet and stinky shit.
         | 
         | It made the daily work of doing data science a miserable slog
         | of ad-hoc duct-tape solutions, and it contributed to us being
         | generally ineffective as a team.
         | 
         | All of this would have been fixed if we had _one_ competent
         | data engineer with some actual real-world data/ML engineering
         | experience and good communication/advocacy skills. Let alone
         | two or three!
        
           | ramraj07 wrote:
           | If the DE tooling was shit and you couldn't hire more fast
           | enough, why didn't your team members start addressing these
           | problems? Surely spending half the time cleaning up the pipes
           | would increase the value of what you do with the other half?
        
       | prionassembly wrote:
       | I'm late to the comment party, but: this is classic "commoditize
       | your complement".
       | 
       | This guy would have you believe that Pytorch has Solved the
       | entire, vast field of data analysis as inherited from Newton, de
       | Moivre, Laplace, Bayes, Fisher, Neyman, Pearson, Wald, Savage,
       | Jaynes, Breiman, Pearl.
       | 
       | This is a lot like saying that photography has Solved art, and
       | now we need people who can climb ladders and glue the posters on
       | them big billboards. It would be delusional if it didn't have a
       | self-interested angle.
       | 
       | What, we with math degrees are fully confident that the plumbing
       | problem is easier to commoditize than the problem of making sense
       | of data.
        
       | zeku wrote:
       | I'm a data engineer for most of my day right now, and a lot of it
       | is done with ruby/python/shell scripts into postgres DBs.
       | 
       | What learning path should I go down? I'm a solo actor at work
       | with a lot of agency to decide my workflows.
       | 
       | I see myself building small to medium size data collections over
       | the next year or two at my job.
       | 
       | Can someone point me to some learning?
       | 
       | I have a CS degree etc. and my title is software engineer etc.
       | etc.
       | 
       | End users of my data _usually_ like their data as a CSV that is
       | then read using R or Python. However there is also a use case
       | where I will build an app to view my data in a simple way.
       | 
       | All of this is completely doable with my current
       | knowledge/workflow but I can't help feel like I do a data
       | engineering job with very different tools than I see "data
       | engineers" speaking about online.
        
         | langitbiru wrote:
         | A couple of days ago, there was a thread about "How to become a
         | data engineer in 2021":
         | https://news.ycombinator.com/item?id=25728198
        
           | zeku wrote:
           | thanks!
        
         | ageofwant wrote:
         | Adopt a good workflow tool, like Apache Airflow. Easiest is to
         | rent a service from AWS:
         | https://aws.amazon.com/managed-workflows-for-apache-airflow/
        
           | mywittyname wrote:
           | Seconding this, Apache Airflow is awesome.
           | 
           | I can't believe how much time it saves during development.
        
           | zeku wrote:
           | thanks!
        
       | glitchc wrote:
       | Good data is hard. Anyone conducting research in the physical
       | sciences knows this firsthand. It takes painstaking effort to
       | conduct carefully controlled experiments and collect a batch of
       | good data that could then be used for analysis.
       | 
       | The promise of ML has always been to churn out good results from
       | not-so-good data. If I now need to sanitize my data carefully,
       | what's the advantage?
        
         | claytonjy wrote:
         | What makes you think that was the promise? I would say the
         | promise has been to replace code with data, and to build things
         | with data that would be practically impossible to build with
         | code.
         | 
         | "Garbage in, garbage out" has been the mantra for a long time.
         | Sure, there are tools and techniques to deal with not-so-good
         | data, but those are add-ons, not a core part of the value
         | proposition of ML.
        
       | lordnacho wrote:
       | My experience is in quant hedge funds, where sometimes you get
       | some guys who develop the strategy and some guys who put it into
       | production.
       | 
       | Yes, I do admit there can be some specialization in terms of time
       | spent on science vs engineering.
       | 
       | But you really need people who understand both. Particularly if
       | you have a strategist who thinks his job is just to dream up
       | profitable models, he ends up carving that role out in a way
       | that's detrimental to the rest of the team. You get people who
       | just don't appreciate that there's other work to do than finding
       | models, and that models depend on that other work to function.
       | 
       | You also get a huge prestige gap, because inevitably management
       | will think that there's a magician and a blacksmith. One guy
       | needs to be paid a lot, and the other guy needs to be paid
       | enough.
       | 
       | These effects feed each other. Magician will say "where's my
       | data" and expect blacksmith to make it, promptly. He won't do it
       | himself, because spending time on mundane stuff makes the magic
       | disappear. And not doing it yourself, or taking the time to
       | understand it, will eventually lead to problems with the magic.
        
         | hntrader wrote:
         | To add, quants that can't do the data engineering work are
         | always crappy quants. I haven't seen a counter-example to that.
         | Profitable models aren't going to be delivered on a silver
         | platter. They need to be able to process pretty low level data
         | effectively and build ad-hoc custom tools and data pipelines
         | around that to test out their ideas. Otherwise they're
         | constrained to the tools others have built and that massively
         | narrows the search space that they're capable of traversing.
         | 
         | The best quants are 1/3 statistician, 1/3 developer and 1/3
         | trader, in my view.
        
           | [deleted]
        
           | Karrot_Kream wrote:
           | > 1/3 statistician, 1/3 developer and 1/3 trader
           | 
           | How is being a trader different from being a statistician?
           | Curious as I've never worked in finance before.
        
             | hntrader wrote:
             | By trader, I mean domain knowledge about the markets.
             | Statistics is the toolbox that this domain expert uses to
             | test their hypotheses and turn them into a profitable
             | model. But if the person isn't a domain expert and only
             | knows statistics, their ideas about what to test won't be
             | good.
        
               | wpietri wrote:
               | And to knowledge, I'd add disposition. It's been years
               | since I've been in finance, but the best traders I worked
               | with were all very driven to succeed, to dominate, to
               | win. Markets were really interesting to me, but I never
               | cared much about that part.
        
               | hntrader wrote:
               | Yeah, it's a performance discipline like any other
               | (competitive gaming, athletics, etc) where only the top
               | few % can succeed. If someone isn't very driven then they
               | won't make it.
        
           | twic wrote:
           | I'm not sure about _crappy_ quants. Some people of the
           | "quantitatively inclined trader who has learned Python"
           | variety are never going to be good at the engineering side -
           | it takes years to learn to be a good software engineer, and
           | that's not a good use of time, for them, or for their
           | employer. But they can still do useful work.
           | 
           | The trick is to figure out how to work effectively with those
           | people. Build infrastructure that keeps them on the rails,
           | refactor their code, push them in the right direction, tell
           | them when they've fucked up, teach them little things with
           | high leverage. As long as that doesn't turn into being their
           | slave, that's fine.
        
             | [deleted]
        
             | proverbialbunny wrote:
             | If they're using a dynamically typed language to do
             | monetary calculations, it's not going to be ideal.
             | 
             | Researchers do not need to have deep programming
             | experience, but they have to be comfortable enough to use
             | an environment that can lend itself to the problem
             | at hand. On the quant side, unlike on the data science
             | side, the barrier of entry on the programming side is a bit
             | higher. To solve this problem many firms have their own
             | internal programming language.
        
               | greenshackle2 wrote:
               | > If they're using a dynamically typed language to do
               | monetary calculations, it's not going to be ideal.
               | 
               | And yet, Q.
        
               | chadash wrote:
               | > "To solve this problem many firms have their own
               | internal programming language."
               | 
               | Any examples other than Jane Street?
        
               | anonymousDan wrote:
               | Goldman Sachs (Slang)
        
               | singhrac wrote:
               | > If they're using a dynamically typed language to do
               | monetary calculations, it's not going to be ideal.
               | 
               | I think this is an inaccurate take. No one in finance is
               | doing accounting or model estimation using Python's
               | floats; they are using numpy's float32 (or float64) type
               | instead. I think a more accurate version of what you're
               | saying is that static type checking is useful when
               | modeling complicated contracts; this might be true, but I
               | think it's not that important, as those things aren't
               | that liquid anyway.
               | 
               | Jane Street's decision to use OCaml is almost as much
               | about hiring and history as it is about language
               | features.
        
               | twic wrote:
               | > No one in finance is doing accounting or model
               | estimation using Python's floats
               | 
               | We are. When your input data only has five significant
               | figures, and probably less than that of real information,
               | numerical accuracy is the least of your worries.
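
The point above can be checked with a rough sketch. It assumes a handful of made-up prices quoted to five significant figures and compares the rounding error of a plain float sum against the uncertainty already present in the inputs. The price list and the 5e-6 bound are illustrative assumptions, not figures from the thread.

```python
import sys
from fractions import Fraction

# Python's built-in float is an IEEE 754 double: 53 mantissa bits,
# i.e. roughly 15 to 17 significant decimal digits.
mantissa_bits = sys.float_info.mant_dig

# Hypothetical inputs, each quoted to five significant figures.
prices = [101.37, 99.842, 100.05, 98.761, 102.19]

float_sum = sum(prices)
# Exact sum of the decimal values as written (no binary rounding).
exact_sum = sum(Fraction(str(p)) for p in prices)

# Relative error introduced by doing the sum in binary floating point.
rounding_rel_error = abs(Fraction(float_sum) - exact_sum) / exact_sum

# Quoting a value to five significant figures already implies a relative
# uncertainty of at least roughly 5e-6 (half a unit in the last digit).
input_rel_uncertainty = Fraction(5, 10**6)

print(float(rounding_rel_error))
print(rounding_rel_error < input_rel_uncertainty)
```

Under these assumptions the float rounding error sits many orders of magnitude below the input uncertainty. Long chains of dependent operations can behave differently, so this only illustrates the simple case.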
        
               | [deleted]
        
               | aldanor wrote:
               | Or, they're using _ints_ instead, at least for market
               | data.
        
               | proverbialbunny wrote:
               | Fixed precision types technically. Internally they are an
               | int under the hood, so yah basically that.
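
A minimal sketch of the "ints for market data" idea, assuming prices arrive as non-negative decimal strings and a tick of one hundredth of a currency unit. `TICKS_PER_UNIT`, `to_ticks`, and `format_ticks` are hypothetical names made up for illustration, not any firm's or library's API.

```python
TICKS_PER_UNIT = 100  # hypothetical tick size: one hundredth of a currency unit


def to_ticks(price: str) -> int:
    """Parse a non-negative decimal string such as '19.99' into integer ticks."""
    units, _, frac = price.partition(".")
    # Pad to two decimal places; digits beyond the second are silently dropped.
    frac = (frac + "00")[:2]
    return int(units) * TICKS_PER_UNIT + int(frac)


def format_ticks(ticks: int) -> str:
    """Render non-negative integer ticks back as a decimal string."""
    return f"{ticks // TICKS_PER_UNIT}.{ticks % TICKS_PER_UNIT:02d}"


fills = ["0.10", "0.20", "19.99"]  # made-up example prices
total = sum(to_ticks(p) for p in fills)
print(format_ticks(total))
```

Because every intermediate value is an integer, addition and comparison are exact. The trade-off is that the tick size has to be chosen up front and anything finer is lost at parse time.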
        
               | oivey wrote:
               | This is dogmatism swung too far in the other direction,
               | IMO. There are many, many successful production code
               | bases written in dynamic languages. In my own experience
               | as a vision scientist/engineer, there is tremendous value
               | in being able to quickly whip up a concept in Python and
               | then being able to easily visualize the results. Doing
               | this exploration in C++ is wasteful. Implementation takes
               | much longer, the correctness brought by static typing is
               | dubious since the code isn't in prod, and the canned
               | CV/visualization libraries are fewer and frequently suck
               | in at least some way. That said, there is also tremendous
               | value in understanding how to map your Python prototype
               | into production code, too. Someone strong in this field
               | can do both.
        
               | proverbialbunny wrote:
               | This was addressed in the previous comment
               | 
               | >On the quant side, *unlike on the data science side*,
               | 
               | Vision scientist is on the data science side. You're not
               | dealing with monetary values where floating point error
               | compounds on itself to the point your models become
               | garbage. Quant work is its own unique field with its own
               | unique prerequisites.
        
               | oivey wrote:
               | Nothing precludes you from doing integer arithmetic in a
               | dynamic language.
               | 
               | I'm not a quant and this isn't my area of expertise, but,
               | for example, I'm pretty sure various differential
               | equation solving methods depend on variables taking on
               | continuous values, so floating point basically must be
               | used. Understanding the impact of that is definitely very
               | important. Analogously, I frequently run into numerical
               | precision issues in image processing. Understanding how
               | numbers are represented on a computer isn't unique to
               | being a quant. Understanding how the choice of
               | representation can impact prod is also not unique to
               | being a quant. The dynamicness of the language isn't
               | particularly relevant, either.
        
               | proverbialbunny wrote:
               | >Nothing precludes you from doing integer arithmetic in a
               | dynamic language.
               | 
               | You would be surprised. The second you use pandas with a
               | custom data type (let alone any other library you'd want
               | to use) it can randomly auto convert it to a float.
               | Furthermore identifying when it randomly converts the
               | type on you is a pain.
               | 
               | >so floating point basically must be used.
               | 
               | Quants tend to use fixed precision types. It is like a
               | float in every way, except base 10 instead of base 2 so
               | there is no floating point error.
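
A small sketch of the base-10 versus base-2 distinction, using Python's standard `decimal` module as a stand-in for whatever fixed-precision type a firm might use (an assumption; the thread does not name one). It also shows that a decimal type removes representation error for decimal inputs but still rounds when a result needs more digits than it can hold.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so this familiar comparison fails.
float_ok = (0.1 + 0.2 == 0.3)

# Decimal stores base-10 digits, so decimal inputs are held exactly.
decimal_ok = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# Rounding still occurs when a result has no finite decimal expansion:
# 1/3 is cut off at the context precision (28 digits by default).
third = Decimal(1) / Decimal(3)
round_trip_ok = (third * 3 == Decimal(1))

print(float_ok, decimal_ok, round_trip_ok)
```

With the default context this prints `False True False`: the decimal type fixes the 0.1 + 0.2 case, while division can still lose information.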
        
               | hntrader wrote:
               | Quants don't care about floating point precision in
               | research. It's just applied stats
        
         | oivey wrote:
         | I think this insight exists across a lot of fields. Basically,
         | if you want to be a really excellent magician you also better
         | be a decent blacksmith. More concretely in this case, if you're
         | unable to do the data "engineering" yourself then it will close
         | a lot of doors for interesting and novel work on the "science"
         | side. Beyond that, if the scientist's job just involves gluing
         | sklearn models together I think that job is more on the
         | engineering side of things than the supposed scientist usually
         | wants to admit.
        
         | inthewoods wrote:
         | That's interesting - I just completed a book on Jim
         | Simons/Renaissance (The Man Who Solved the Market). One of their
         | early advantages was having a person who was just focused on
         | acquiring and cleaning data. I expect that advantage has
         | largely gone away at this point due to wide availability of
         | market data but I thought it was interesting in the context of
         | this article.
        
           | curiousgal wrote:
           | Same for CFM too, they have an entire team working on
           | alternative data and they feed it to a modeling team.
        
         | 1vuio0pswjnm7 wrote:
         | What must be communicated to management: It is easy to find
         | other magicians. It is not easy to find another blacksmith.
         | Without the right blacksmith, there can be no magic.
         | 
         | Magicians will be magicians, always hustling (bullshitting),
         | but they will never have the value and job security of the
         | blacksmith. The blacksmith can see the fruits of her own
         | labour, whilst the magician must lie to herself and others in
         | order to claim the blacksmith's value as her own.
         | 
         | If the blacksmith is good enough, she will earn the trust of
         | management and management may consult the blacksmith in the
         | selection of magicians. Management may ask the blacksmith to
         | interview the magician and seek her advice on the final hiring
         | decision.
         | 
         | The blacksmith may not carry the "prestige" of the hustling,
         | bullshitting magician but she can command a high salary and
         | dictate her own working conditions. This is only if management
         | understands her value. What the magician thinks of the
         | blacksmith is irrelevant.
         | 
         | Reliable blacksmiths are hard to find. Magicians are a dime-a-
         | dozen.
        
           | legerdemain wrote:
           | > It is easy to find other magicians. It is not easy to find
           | another blacksmith. Without the right blacksmith, there can
           | be no magic.
           | 
           | What? That runs counter to my experience at every company
           | where I've either seen data engineers or worked as one. My
           | observations of how management treats the two groups are these:
           | 
           | Data engineers ("blacksmiths"): Blacksmiths are paid less.
           | People think of them as less highly educated. Their work is
           | less creative. When they are successful, their work is mostly
           | invisible. They are interchangeable. People think of what
           | blacksmiths do as more like scripting than writing code.
           | Blacksmiths mostly work on configuring systems they didn't
           | build. Blacksmiths do more troubleshooting than building.
           | Their roles are focused on support.
           | 
           | Data scientists ("magicians"): Magicians are paid more.
           | People think of them as more highly educated. By definition,
           | what they do is magic. They work on prominent projects. Their
           | successes are highly visible. They build large systems that
           | only they comprehend. They use support staff to clear away
           | mundane obstacles so they can focus on unique, highly creative
           | aspects of work.
           | 
           | Saying that we need more data engineers than data scientists
           | is like saying that we need more janitors than CEOs. That's
           | true, but it's true because _we made it true_ by structuring
           | projects around one prominent, well-paid person supported by
           | a staff of invisible drudges.
        
         | notretarded wrote:
         | Please don't change the status quo. I love my cushy job.
        
         | lumost wrote:
         | This problem only grows as the company scales and the science
         | and engineering pieces are formally split along some role
         | guideline.
         | 
         | Inevitably if you treat a job role as a support role, you'll
         | attract weaker individuals into that role than you would get if
         | it wasn't considered a support role. The problem with Science
         | oriented teams is that all roles other than the science role
         | morph into science support roles over time. The same pattern
         | used to occur with Engineers and QA, or Engineers and ops.
        
         | chadash wrote:
         | I worked in investment banking (as an analyst, not an
         | engineer), so very different part of finance, but this was my
         | take as well. Companies might love to talk about how important
         | engineers are, but at the end of the day, if you can't directly
         | link someone to revenue, they get viewed as a cost center and
         | take on second tier status in the organization. Then the same
         | companies complain that they can't find enough (or retain)
         | engineering talent. Not many places get the balance right.
         | Silicon valley treats engineers well because for the most part,
         | the value they bring is more obvious (and also, they don't
         | threaten the existing hierarchy in the company). Curious to
         | hear if anyone has had the opposite experience.
        
           | isolli wrote:
           | Yes, quite a few developers left our investment bank and went
           | to work for our suppliers (of trading software), stating
           | they'd rather work somewhere where they're seen as value
           | creators rather than a cost center.
        
           | mushbino wrote:
           | Engineers get paid well in SV because they are in demand,
           | have lots of employment opportunities, and therefore are more
           | difficult to retain.
        
             | mywittyname wrote:
             | And because their contributions can be tied back to
             | revenue. You need both, demand for talent, as well as the
             | ability & justification to pay for it.
             | 
             | Engineers are in high demand all over the world. But most
             | companies do not profit enough from technology to justify
               | paying SV-level salaries.
        
               | pmiller2 wrote:
               | Not always. Frequently, the connection between code
               | that's written today and revenue tomorrow is tenuous and
               | difficult to package in a way that says "look at me! I'm
               | valuable!"
               | 
               | And, then there are those somewhat rare occasions where a
               | project is not intended to increase revenue, and may even
               | decrease it. At my last employer, we guesstimated that a
               | project I worked on for months could possibly have ended
               | up costing us $2M per year in revenue. That was both
               | accepted and expected, because we were doing it to gain
               | goodwill with users, but in such a way that it might end
               | up pissing off a small minority of our customers.
               | 
               | I really wish, just once, I could work on a project and
               | put underneath it on my resume "Increased revenue by X%,"
               | because I've never worked on anything that was so easy to
               | directly trace back to the top line.
               | 
               | Cost savings are another story, because engineers _can_
               | fairly easily quantify how much less money is being spent
               | by doing $THING a bit more efficiently....
        
           | AtomicOrbital wrote:
           | I worked for 15 years as a software engineer at Morgan
           | Stanley where they valued the process of taking a 3 martini
           | lunch idea into a production platform so value of engineers
           | was recognized and rewarded as such ... it's somewhat easier
           | to whip up a new financial wrinkle; it's a whole other level of
           | magic to design and implement that idea when it takes 60
           | software developers 3 years to get that idea to market before
           | the rest of the street ... of course the IT department was/is
           | the largest budgeted portion at the entire bank and for a
           | good reason
        
         | marcinzm wrote:
         | As I see it you need people who have shallow knowledge of many
         | areas and deep knowledge of one area. That lets you have a
         | group of experts but ones that know enough about other areas of
         | expertise to work with those other experts.
        
         | wpietri wrote:
         | > Particularly if you have a strategist who thinks his job is
         | just to dream up profitable models, he ends up carving that
         | role out in a way that's detrimental to the rest of the team.
         | 
         | My god, this. These people make me bonkers. Especially because
         | I feel like I have a bit of this tendency myself, the desire
         | just to think big thoughts and do no actual work. Happily, I
         | long ago learned that ideas were approximately worthless
         | without labor, and that I anyway had much better ideas when
         | laboring because it forced me to engage with the details.
         | 
         | And yes, those people can poison a team. My best working
         | experiences have all been with people who a) all valued actual
         | work and b) believed that everybody could have good ideas.
        
           | jnwatson wrote:
           | "I'm the idea guy" out of someone's mouth is the stark red-
           | flag warning that their net contribution is 0.
        
             | mumblemumble wrote:
             | Or even negative. I've seen situations where the idea
             | person is so busy being Mr. Toad that everyone around them
             | is regularly scrambling to clean up messes and it ends up
             | being a constant distraction from actually pushing projects
             | through to completion.
        
         | LeifCarrotson wrote:
         | I really like your magician/blacksmith analogy.
         | 
         | I'm in industrial automation, but it's much the same. Projects
         | where someone developed a strategy but has never been involved
         | in the details of a machine are doomed to failure (or at best
         | to be unreliable and producing low quality parts). Projects
         | built by machine fabricators are over-engineered, frequently
         | late, and sometimes unprofitable, but damn if they don't work
         | well.
         | 
         | The main trouble, I think, is that when a shiny new contraption
         | is brought to the king, it's too often the magicians doing the
         | talking - whether they're speaking words of power or Common,
         | their job is to talk. Meanwhile, the blacksmith is probably
         | busy in his workshop at some ornate scroll work for the next
         | thing, or repairing the previous gizmo, because he'd rather be
         | hammering away at his anvil than talking.
         | 
         | The higher you go in an org chart, the fewer the number of
         | people who understand the work their company actually does, and
         | the more voices you have between the workers and the decision-
         | makers to take some of the credit for work as it passes up the
         | chain.
        
           | hef19898 wrote:
           | That seems to be true in every field I can think of. The
           | smaller the gap, or rather the more practical experience the
           | strategy people have, the better a given org seems to be.
           | 
           | One common issue I run into is that when the blacksmiths
           | start talking, nobody listens.
        
         | zzzeek wrote:
         | maybe hedge funds would be able to find more people if they
         | didn't only hire "guys".
        
         | noodlenotes wrote:
         | Also, a lot of data scientists find the science fun and the
         | engineering boring. But they have overlapping skill sets - if
         | you aren't good at one, you're probably not good at the other
         | either. Somebody who shows up to a team with the goal of only
         | modeling and pushing all the dirty engineering work to their
         | teammates is basically a worst case scenario because
         | 
         | 1) They probably aren't going to produce good models since
         | they're not sensitive to data nuances, but now they've taken
         | over ALL the modeling work.
         | 
         | 2) They bring down the job satisfaction of everyone else on the
         | team who would like to be doing at least some modeling.
         | 
         | 3) They're sucking up the prestige that should be distributed
         | over the entire team and management thinks they should be paid
         | more for work that it turns out everybody thinks is more fun
         | anyway.
         | 
         | My number one advice to entry level data scientists is to not
         | be this guy. Don't give your interviewers the impression that
         | you won't do your own engineering work because they won't want
         | someone who brings negative value to the team.
        
           | NikolaNovak wrote:
           | Here's the tricky thing:
           | 
           | I love your post; I agree with your post; but it takes a 90
           | degree turn at the end:
           | 
           | "My number one advice to entry level data scientists is to
           | not be this guy. "
           | 
           | Everything most people are saying here indicates it's GREAT
           | to be that guy. You're paid, you're respected, you get the
           | fun parts, you love your job and it's pretty safe. It just
           | happens to suck for everybody else including team and
           | business... but it feels that in a practical sense, gist of
           | everybody's actual unwitting message is "BE that guy, if you
           | can" :-<<<
        
             | proverbialbunny wrote:
             | It sucks being that guy because everyone else ends up
             | hating you.
             | 
             | Depending on the work environment it's not a stretch to see
             | software engineers complaining to management, sometimes
             | going as far as to create rumors to get the jr data scientist
             | fired.
             | 
             | So, no the grass is not greener. It's best to not be that
             | person. This is why I go out of my way to prevent that
             | scenario when I lead a team.
        
               | urthor wrote:
               | Not really.
               | 
               | You just get seen as the product owner/project manager.
        
               | proverbialbunny wrote:
               | That's a really good point.
               | 
               | I tend to be seen as a product lead / owner /
               | stakeholder, so I feel like I'm being called out. lol
               | 
               | I think one difference is the software engineers see me
               | as someone who is helping them by making their life
               | easier. I'm not just throwing work at them blindly. I'm
               | working with them. Also, they like it when I include them
               | in the data science brainstorming sessions to solve
               | difficult problems. I guess it's seen as exotic or
               | something, but whatever the reason, they really love to
               | be a part of it.
        
               | rhizome wrote:
               | I think it's probably seen more as just being a decent
               | boss.
        
               | RobRivera wrote:
               | easy to ignore hate when you're pulling a 300k bonus at
               | comp season and can jet to st. barts to go deep sea
               | fishing and drink claws.
        
               | proverbialbunny wrote:
               | Data scientists do not pull that kind of bonus. Today
               | many of them get paid less than the data engineers do.
        
               | RobRivera wrote:
               | news to me, and welcome news to hear at that since I'm
               | more in the data plumbing and packaging business, not
               | algo publications.
               | 
               | my personal data points are from folks on buyside.
               | trading margins have been downward trending for years
        
               | proverbialbunny wrote:
               | Quant research work isn't data science work which is
               | probably where the mix up is.
               | 
               | On the quant side bonuses are distributed to the team.
        
             | ramraj07 wrote:
             | If you're that guy and you have a secure job it means you
             | write models no one ever sees in a company which doesn't
             | know or respect data, or you work in some data science
             | factory as a small cog of a fairly well oiled team. The
             | latter does happen from time to time, but it's often the
             | former.
             | 
             | In every other place, your job is on the line to be erased
             | because people will soon realize no one wants a wise-ass
             | who doesn't actually contribute much to the bottom-line in
             | the end.
        
             | urthor wrote:
             | The flipside is that there are 4x the job posts for data
             | engineering as there are for "that guy".
             | 
             | Companies understand that you can't hire five of that guy
             | and get things done. If you have 5-8 years of experience as
             | a technical product manager/data science combo then you are
             | very happy as the magician. But very few magicians are
             | being hired out of college, and a lot of "software
             | engineers in data"
        
               | mywittyname wrote:
               | Pretty soon companies are going to start realizing that
               | the 4x DEs can largely replace that 1 DS, and they will
               | be more than happy to do so.
               | 
               | I went into DE because I was kind of forced into the
               | space, but I'd strongly prefer doing full-stack DE.
               | Anymore, I still have the opportunity to build models,
               | they just aren't client-facing stuff, but instead are
               | kind of Data Plumber Bots that help me do my job better
               | so I can waste more time building other fun bots that I
               | can't otherwise be paid for.
               | 
               | Seems like a waste of resources, but my manager could
               | have another DS tomorrow, but my role would take months
               | to fill.
        
             | noodlenotes wrote:
             | Specifically, this is my advice to ENTRY level data
             | scientists who are trying to find a job and compete against
             | a flood of candidates hot off the bootcamps. I guess once
             | you get your foot in the door, you can be that guy if you
             | want. It seems to be a successful strategy at companies
             | without technical leadership.
        
           | ramraj07 wrote:
           | It's hard for most people entering this field because the
           | incentives are perverted - there's this perception that DS is
           | sexy and you actually don't need to know coding that much
           | (just enough to scikit learn). Thus people with pipe dreams
           | of tweaking model hyperparameters to spin gold come in and
           | get a rude awakening. Not unlike people flocking to LA to
           | become actors.
        
           | proverbialbunny wrote:
           | Back in the day (3 years ago and earlier) at every company I
           | was at we used the term 'productionization' to describe
           | someone making a model aka a proof of concept, and then
           | someone else, a machine learning engineer or some kind of
           | engineer rewriting it to work on a server.
           | 
           | This process is horrible, and not just because it doubles the
           | work, but because it introduces bugs. When the version up in
           | the cloud does not work as intended, is it a bug in
           | productionizing or is it in the original model? Fixing bugs
           | in this space can take longer than the initial model
           | development and the initial productionization. Many companies
           | have failed over this.
           | 
           | So what's the solution? In recent years the industry has
           | turned to deployment over productionization. The idea is you
           | deploy the model to the cloud directly. Both engineers and
           | scientists work together on the process. The scientist
           | defines what cells in the notebook get called for the final
           | algorithm (as there are EDA / plotting cells and
           | documentation cells too). The engineer sets up the amazon IO
           | stuff, database login stuff, and monitoring services. The
           | scientist works with them to create tests and what to monitor
           | so they get notified if there is a problem with the service.
           | 
           | No more mystery bugs. The model gets directly deployed, the
           | work load is minimal, and it brings people together. The
           | downside is often the engineers and scientists are on
           | different teams, and sometimes companies will not let them
           | merge for a while, so it becomes a telephone game instead of
           | everyone feeling like they're on the same team working
           | together. imo moving the scientist to the engineering team
           | during this time can be helpful, or moving the engineer to
           | the data team.
           | 
           | Some companies have services where entire notebooks get put
           | up into the cloud and all of it gets called, so the scientist
           | has to write the notebook in a way that works for the cloud.
           | It's rarer, but how I prefer it is a wrapper py file is
           | created that calls just the relevant parts of the notebook,
           | kind of like a header file. This process works well for me,
           | but as far as I know it is not standardized in the
           | industry yet.
           | 
           | In short, if you end up in this situation, there is a better
           | way. Import the notebook into a .py file or into the cloud,
           | don't rewrite it. This (hopefully) will remove this scenario
           | you're describing (comment this is replying to) so those
           | issues will become a historical footnote.
        
             | carabiner wrote:
             | How do you maintain notebooks in production? You use
             | papermill? What about versioning?
        
               | proverbialbunny wrote:
               | Most libraries load entire notebooks from top to bottom
               | when executing, and I believe papermill does too. (Please
               | correct me if I'm wrong, as I've not used papermill.)
               | 
               | This is great for making a dashboard, a report, or some
               | other kind of analytics, but when it comes to a service
               | the customer uses, you typically never want to load the
               | whole notebook. This is where the industry standard way
               | of loading the whole notebook tends to fall on its face.
               | 
               | What we do is the cells that will end up in prod are
               | written as functions inside of the notebook. This helps
               | reduce globals when writing the notebook, so it is good
               | form when prototyping, but also it allows just those
               | functions to be called from the notebook, instead of
               | running the entire notebook.
               | 
               | You will probably want to write your own library to do
               | this, but in the meantime there is one that works for
               | this purpose https://github.com/grst/nbimporter
               | (Ironically the author doesn't recognize this use case.)
               | 
               | Using nbimporter you can import a notebook without
               | loading it. You can then call functions within that
               | notebook and only those functions get loaded and called.
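               | 
               | A rough standard-library sketch of the same idea, for
               | anyone writing their own loader. This is not nbimporter's
               | API; load_notebook_functions is a hypothetical helper,
               | and it assumes the usual nbformat 4 JSON layout ("cells",
               | "cell_type", "source"). Only imports and function
               | definitions are executed, so EDA and plotting cells never
               | run.

```python
import ast
import json
import os
import tempfile
import types


def load_notebook_functions(path, module_name="notebook_functions"):
    """Build a module from only the imports and function definitions in a notebook."""
    with open(path, encoding="utf-8") as handle:
        notebook = json.load(handle)

    module = types.ModuleType(module_name)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as a list of lines
            source = "".join(source)
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # e.g. cells that use IPython magics
        # Keep definitions only; drop every other top-level statement.
        keep = [
            node for node in tree.body
            if isinstance(node, (ast.Import, ast.ImportFrom, ast.FunctionDef))
        ]
        if keep:
            code = compile(ast.Module(body=keep, type_ignores=[]), path, "exec")
            exec(code, module.__dict__)
    return module


# Demo notebook: one cell that must never run in prod, one function cell.
demo_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Features"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["raise RuntimeError('EDA cell must not run in prod')\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["def process(row):\n",
                    "    return {'spread': row['ask'] - row['bid']}\n"]},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    nb_path = os.path.join(tmp, "model.ipynb")
    with open(nb_path, "w", encoding="utf-8") as handle:
        json.dump(demo_notebook, handle)
    nb = load_notebook_functions(nb_path)

features = nb.process({"bid": 99.5, "ask": 100.0})
```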
               | 
               | In my notebooks I have a process function which is like
               | main(), but for feature engineering. On the prod side
               | the process function is called from the notebook. Process
               | calls all of the necessary cells/functions for me in the
               | correct order. This way the py wrapper only has to call
               | one function, then the ML predict function gets called,
               | so it's pretty small on the .py wrapper side. There are
               | tests written on the .py side, IO functions and what not
               | too.
               | 
               | Data engineers love their classes, so it's easy to write
               | a class that calls the notebook, and best of all calling
               | a single function this way does not load globals, so the
               | data engineers are happy. It's a nice library, because
               | otherwise you'd have to write your own (which you may end
               | up wanting to do).
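               | 
               | A hedged sketch of what such a wrapper class might look
               | like. ModelService, process, and predict are all
               | hypothetical names; here the two functions are local
               | stand-ins for what would come from the notebook and the
               | trained model.

```python
from typing import Any, Callable, Mapping


class ModelService:
    """Thin .py-side wrapper: feature engineering first, then prediction."""

    def __init__(
        self,
        process: Callable[[Mapping[str, Any]], Mapping[str, Any]],
        predict: Callable[[Mapping[str, Any]], Any],
    ) -> None:
        self._process = process
        self._predict = predict

    def handle(self, raw_row: Mapping[str, Any]) -> Any:
        # One call into the notebook's entry point, then one into the model.
        features = self._process(raw_row)
        return self._predict(features)


# Stand-in for the notebook's process() function (assumption).
def process(row):
    return {"spread": row["ask"] - row["bid"]}


# Stand-in for the ML predict function (assumption).
def predict(features):
    return "wide" if features["spread"] > 0.25 else "tight"


service = ModelService(process, predict)
label = service.handle({"bid": 99.5, "ask": 100.0})
```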
               | 
               | This way if the model doesn't work as intended in
               | production it's my fault. We log everything, so I can run
               | the instance prod caught on my local machine, figure out
               | what is going on, update the model, and then it can be
               | deployed instantly.
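               | 
               | One way the log-and-replay loop could look, as a sketch
               | only. The JSON-lines format and the names log_request,
               | replay, model_v1, and model_v2 are assumptions made for
               | illustration, not a description of any particular stack.

```python
import io
import json


def log_request(log_file, payload, prediction):
    """Append one JSON line holding the input and what prod predicted."""
    record = {"input": payload, "prediction": prediction}
    log_file.write(json.dumps(record, sort_keys=True) + "\n")


def replay(log_file, model):
    """Re-run each logged input locally; return records whose output differs."""
    mismatches = []
    for line in log_file:
        record = json.loads(line)
        rerun = model(record["input"])
        if rerun != record["prediction"]:
            mismatches.append({**record, "rerun": rerun})
    return mismatches


def model_v1(payload):
    return payload["x"] * 2


def model_v2(payload):
    # Updated model: negative inputs are clamped to zero.
    return payload["x"] * 2 if payload["x"] >= 0 else 0


# "Production" logs every request served by model_v1.
prod_log = io.StringIO()
for payload in ({"x": 3}, {"x": -1}):
    log_request(prod_log, payload, model_v1(payload))

# Locally, replay the same inputs against the updated model.
prod_log.seek(0)
mismatches = replay(prod_log, model_v2)
```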
               | 
               | Version numbers on the engineering side I can't comment
               | on as they have their own method, but on my end the
               | second the model writes to a database then I strongly
               | push for having a version number column or a version
               | number metadata table in the database, so it's easy for
               | me to access for future analysis.
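               | 
               | A minimal sketch of the version number column, using an
               | in-memory SQLite database as a stand-in for whatever
               | database is actually in use. The table layout, the
               | version labels, and write_prediction are hypothetical.

```python
import sqlite3

MODEL_VERSION = "2021.01.14-a"  # hypothetical version label

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE predictions (
        id INTEGER PRIMARY KEY,
        input_id TEXT NOT NULL,
        score REAL NOT NULL,
        model_version TEXT NOT NULL
    )
    """
)


def write_prediction(conn, input_id, score, model_version=MODEL_VERSION):
    """Store every prediction together with the model version that made it."""
    conn.execute(
        "INSERT INTO predictions (input_id, score, model_version) VALUES (?, ?, ?)",
        (input_id, score, model_version),
    )


write_prediction(conn, "order-1", 0.82)
write_prediction(conn, "order-2", 0.40)
write_prediction(conn, "order-3", 0.91, model_version="2021.01.15-b")

# Later analysis can slice results by the version that produced them.
counts = dict(
    conn.execute(
        "SELECT model_version, COUNT(*) FROM predictions GROUP BY model_version"
    )
)
```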
        
         | [deleted]
        
           | z3ncyberpunk wrote:
           | Then you end up with terrible engineers with zero practical
           | experience who make terrible designs due to said lack of
           | practical experience. Engineers out of college today have
           | zero clue how to run machines or do anything but create
           | drawings that half work which is pathetic
        
       | [deleted]
        
       | yters wrote:
       | While we like to pooh pooh at the theoreticians, there are some
       | remarkable results proven just through thought experiments and
       | math, which are only confirmed and used many, many years later.
       | People would not even think to look into such things if
       | theoreticians did not come up with the original abstract proof.
        
       | afryer wrote:
       | I recommend the book: Agile Data Science by Russell Jurney[1].
       | The tech stack is circa 2017, but the chapters on the Agile Data
       | Science Process and Teams are timeless.
       | 
       | He clearly articulates team roles: from Biz Devs, marketers, PMs,
       | UX designers, UI designers, Web Developers, API Engineers, Data
       | scientists, Applied researchers, Platform/Data engineers, QA
       | engineers, DevOps Engineers.
       | 
       | Then he talks about different ways to increase agility by
       | combining these roles into generalists empowered to iteratively
       | explore the "pyramid of data value" until the right product-
       | market fit is found.
       | 
       | Building Data-science Intensive Web Applications is inherently
       | waterfall, not agile, and I find this book to be a fascinating
       | reference.
       | 
       | [1] https://www.oreilly.com/library/view/agile-data-
       | science/9781...
        
       | Wonnk13 wrote:
       | What's the career trajectory for data engineers?
       | 
       | I enjoy the pipeline building and business stakeholder
       | interfacing, but I'm not sure I want to be a SWE a decade from
       | now...
        
         | claytonjy wrote:
         | How is that different from a SWE? I see DE as a specialized
         | SWE; tons of overlap, but DEs focus on different tools and
         | concepts than other SWEs.
        
       | amelius wrote:
       | In other words: we need plumbers.
        
         | gregw2 wrote:
         | As a data engineer I've made the same joke.
         | 
         | But the statement is also a bit like saying you can use
         | plumbers to design and build a chemical refinement plant which
         | also just moves chemicals from point A to point B. Or you can
         | design a citywide sewer system with a bunch of plumbers.
         | 
         | There are many cleansing, refining, orchestration, dependency,
         | data quality, governance and optimization problems to be solved
         | and a wide variety of tools that have for whatever reason never
         | grown into higher-level open source frameworks and are thus
         | reimplemented in various forms in many places.
         | 
         | Data engineering (somewhat like software engineering actually)
         | doesn't require much if any of the math and physics I took in
         | engineering courses in college, but it does require rigorous
         | systems thinking about how to design and build structures that
         | withstand adverse conditions that are thought patterns common
         | to other engineering practices, so I don't think it's a totally
         | crazy title for the role.
        
         | digitalsushi wrote:
         | Or, if the data is food, we have chefs making incredible
         | plates, but we need wait staff to get it to the people who want
         | to eat it.
         | 
         | I would love to be on that wait staff; as an infradev I feel
         | like the process is very close to me but I am struggling to
         | break into it.
        
           | spacemanmatt wrote:
           | I would put data engineering on the supply side of the chef:
           | This would be ingredients, delivery scheduling, and pre-prep
           | functions. That sort of thing.
        
       ___________________________________________________________________
       (page generated 2021-01-14 23:00 UTC)