[HN Gopher] Tips for Shipping Data Products Fast
       ___________________________________________________________________
        
       Tips for Shipping Data Products Fast
        
       Author : oedmarap
       Score  : 139 points
       Date   : 2021-03-04 11:13 UTC (11 hours ago)
        
 (HTM) web link (shopify.engineering)
 (TXT) w3m dump (shopify.engineering)
        
       | pbronez wrote:
       | Nice write up, and a good example of the kinds of best practices
       | that data analysts should adopt from the world of software
       | engineering. The main recommendations are:
       | 
       | > Utilize design sprints to help focus your team's efforts and
       | remove the stress of the ticking clock
       | 
       | > Don't skip on prototyping, it's a great way to fail early
       | 
       | > Avoid machine learning (for first iterations) to avoid being
       | slowed down by unnecessary complexity
       | 
       | > Talk to your users so you can get a better sense of what
       | problem they're facing and what they need in a product
       | 
       | It's useful to consider these recommendations in the context of
       | the AI Hierarchy of Needs [0]. You need to sort out the base of
       | your pyramid as quickly as possible.
       | 
       | [0] https://hackernoon.com/the-ai-hierarchy-of-
       | needs-18f111fcc00...
        
       | nvilcins wrote:
       | > 3. Avoid Machine Learning (on First Iterations)
       | 
       | For many years I worked on building data products at a start-up
       | as the data guy (encompassing analytics, ML, data engineering).
       | We started off pretty much all-in with ML, even implementing
       | bleeding-edge models from scratch based on the newest research
       | papers, and building crazy infrastructures around them (we had
       | loads of fun tbh). In the end (after ~5 years), however, our
       | product was a UI displaying a handful of "simple" stats, which
       | was facilitated by a robust but relatively simple data ETL
       | pipeline in the background.
       | 
       | Essentially, as we gained more experience and learned more about
       | the domain and customers' needs, we found more value in "the
       | basics" rather than fancy ML models.
       | 
       | That is not to say ML isn't a powerful tool in the right context,
       | but I feel it is grossly over-hyped and over-used. And that seems
       | to be a controversial stance. Even within start-up circles I've
       | encountered push-back when suggesting going the simpler route and
       | saving ML for much much later.
       | 
        | Is that hype generally perpetuated by ML people not wanting to
        | lose out on ML opportunities? (I guess ML sounds better, is
        | more fun, and probably pays more than the "basic stuff")
        
         | nvilcins wrote:
         | An open follow-up question:
         | 
         | (Assuming we are indeed living in a world overly enthusiastic
         | about ML)
         | 
         | How do you - a professional that doesn't just throw ML at
         | everything but focuses on the "boring stuff" - position
         | yourself in the job market?
         | 
         | Sounds like a bit of a hard sell, especially if you're also
         | charging more than the other ML-eager prospects. (i.e., you
         | know you are the best person for the gig, but the proposal is
         | likely to fall flat due to the expectations of the people
         | hiring you)
        
           | tixocloud wrote:
           | For me, I'd position myself as one who's focused on value and
           | focused on results. For new clients, you can lower your rates
           | but ultimately clients buy trust that you can deliver what
           | they need.
           | 
           | Soft skills are highly undervalued in the tech community yet
           | if you start speaking with management and other non-tech
           | stakeholders, you'll quickly find out how valuable you are.
        
         | importantbrian wrote:
         | I think you hit the nail on the head. In my experience, a lot
         | of the more BI/Analyst work I've done has been far more
         | valuable to the company than the ML work, and most of the value
          | of the ML work actually came from insights gained while doing
          | EDA that had direct business impact, not from the model
          | itself.
        
         | RHSman2 wrote:
          | You gotta start with a model that does something using averages
          | (and I say that in humour too), as otherwise you don't have a
          | baseline to compare against. Most ML is done because someone
          | thinks it's what is required.
        
           | jimbokun wrote:
           | Right, it's often surprising how well a stupidly simple model
           | does. And a complex ML model doesn't always beat that
           | baseline.
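A mean-of-history baseline like the one joked about above is only a few lines. This is a minimal sketch with made-up numbers, purely for illustration:

```python
# Hypothetical next-day demand data (made up for illustration).
history = [12, 15, 11, 14, 13]
actual_next = 14

# The "stupidly simple" baseline: predict the average of what we've seen.
baseline_prediction = sum(history) / len(history)

# Any fancier model has to beat this error to justify its complexity.
error = abs(actual_next - baseline_prediction)
```

If a complex model can't clearly beat this number on held-out data, the baseline wins by default.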
        
         | akg_67 wrote:
          | I believe companies are starting to figure this out. I recently
          | sent out my resume (heavy on data engineering and analytics,
          | light on AI/ML/DL) for data scientist positions. I have an 80%
          | response rate. Previously, with the same resume, the response
          | rate was less than 5%.
        
         | boringds wrote:
          | I think you nailed it. Often companies and execs want ML but
          | don't have the basics: robust ETL pipeline, clean data, solid
          | analytics foundation (dashboards, automated reporting, etc.).
          | These appear boring to most but they will be the difference
          | between a useless ML department that can't ship anything to
          | production and a successful one that builds on top of the
          | aforementioned foundations.
         | 
         | In addition I believe it's time we drop the data science term.
         | It's an umbrella of different roles ranging from data engineer
         | to DL researcher. Companies need to identify what they REALLY
         | need and not go for the shiny PhD in ML.
         | 
          | The emergence of analytics engineering is the perfect example
          | of this shift towards creating robust data pipelines first and
          | enabling "data scientists" to build on them.
         | 
         | I wrote a blog post about it yesterday, I don't want to post it
         | here and self-promote too much, so check it in my profile if
         | you want to.
        
           | nerdponx wrote:
           | I love the idea of "boring data science." I will steal that
           | term for my own use.
        
           | EricMausler wrote:
           | Who do you think has the best 3rd party solutions for data
           | cleaning?
        
             | klmadfejno wrote:
             | Data cleaning is domain specific. Hire someone to do it and
             | accumulate wisdom over time.
        
           | PebblesRox wrote:
           | Here's boringds's post, for anyone else who's curious:
           | https://boringdatascience.com/post/data-science-is-dead-
           | long...
        
         | datenhorst wrote:
          | AI is simultaneously over-hyped and a game-changer. To
          | paraphrase a famous marketing quip, the problem is that you
          | need to be a domain expert as well as an ML expert in order to
          | judge which parts are over-hyped.
         | 
          | As a consultant specializing in building data science teams,
          | I've been shouting from the rooftops for years that you can
          | only build AI products on top of a robust data pipeline and,
          | more importantly, culture. Monica Rogati's formulation of it,
          | the "AI hierarchy of needs"[0], helps a lot to instill a
          | mental image in managers.
         | 
         | [0] https://hackernoon.com/the-ai-hierarchy-of-
         | needs-18f111fcc00...
        
         | scribu wrote:
         | Perhaps iterating on the ML model allowed you to learn about
         | the business domain faster or more systematically.
         | 
         | When you're learning something new, you don't even know what
         | questions to ask (unknown unknowns). And the domain experts
         | don't know what to tell you first (curse of knowledge).
         | 
         | If, on the other hand, you present a model to a domain expert,
         | they can start to poke it and tell you what it got wrong.
        
           | disgruntledphd2 wrote:
            | I dunno, ML/statistical models have really, really, really
            | slow iteration times. Matrix multiplication is O(N^3) and
            | you'll need to do that a lot.
           | 
           | EDA (exploratory data analysis) is the best way to learn
           | about the business domain and has the advantage of being
           | much, much, much faster than even the fastest ML approach.
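The EDA point can be illustrated without any modeling library at all. A toy aggregation over made-up rows (hypothetical regions and revenues) is enough to surface the kind of per-segment summary that often drives business decisions:

```python
from collections import defaultdict

# Hypothetical (region, revenue) records, made up for illustration.
rows = [
    ("north", 120), ("south", 80), ("north", 130),
    ("south", 95), ("east", 60),
]

# Group revenue by region.
by_region = defaultdict(list)
for region, revenue in rows:
    by_region[region].append(revenue)

# Simple summary: (count, mean revenue) per region -- instant feedback,
# no training loop required.
summary = {r: (len(v), sum(v) / len(v)) for r, v in by_region.items()}
```

Each run of a summary like this takes milliseconds, versus hours per iteration for a heavy model, which is exactly the speed advantage being described.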
        
         | qsort wrote:
         | I'm more on the development side of things, but I'm working on
         | similar products right now and I completely agree with your
         | sentiment.
         | 
         | What's more, there's usually a middle ground between spitting
         | data out of postgres and full-on ML models. For example, if
         | you're making forecasts, maybe a simple statistical model will
         | get you 90% of the way there and won't require you to reach out
         | for RNNs or something. In the end it's a tradeoff, and
         | personally I'm delighted when simple math that mostly works can
         | be used in place of complex models with little marginal value.
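As one sketch of that middle ground, single exponential smoothing is the kind of "simple math that mostly works" forecaster the comment has in mind. The demand numbers and the alpha value are assumptions for illustration, not from the article:

```python
def exponential_smoothing(series, alpha=0.5):
    """Single exponential smoothing: the forecast is a weighted blend
    of the latest observation and the previous forecast."""
    forecast = series[0]
    for x in series[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
    return forecast

# Hypothetical demand history (made up for illustration).
demand = [100, 110, 105, 115]
next_forecast = exponential_smoothing(demand, alpha=0.5)
```

A few lines like this often get most of the way to an RNN's accuracy on short business series, while staying trivially interpretable and cheap to run.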
        
           | pbronez wrote:
           | This has been my experience as well. My intuition is that
           | complex models have several pitfalls that undermine their
           | practical value:
           | 
           | 1) complex models make more assumptions and may be less
           | robust to the world behaving in unexpected ways. It takes
           | more effort for practitioners to thoroughly review all these
           | assumptions.
           | 
            | 2) complex models have more parameters, and thus require more
            | data to train. This limits the scenarios where they can be
            | deployed.
            | 
            | 3) complex models are harder to interpret, which makes them
            | less useful for convincing people to take action AND
            | increases the likelihood that a practitioner will make an
            | interpretation error
           | 
           | Of course, complex models have plenty of advantages too.
           | Under-specified models can be wrong just by failing to
           | account for obvious, well-documented variations. Human
           | judgement is required to select the appropriate model for a
           | given situation.
        
         | twelfthnight wrote:
         | > 3. Avoid Machine Learning
         | 
          | Agreed, this usually makes sense. My only qualm is that I've
          | seen many teams without any ML experience develop tools that
          | (1) throw away or don't maintain data integrity and (2) cannot
          | rank or handle uncertainty in the underlying data. These
          | systems often bake in heuristics that cannot be validated, and
          | it's very difficult to change them later. Users feel like they
          | are losing something if you migrate from a tool that
          | automatically chooses the best option for them to one where
          | they need to choose among 5 alternatives, for example, even if
          | that "best option" was actually just noise.
         | 
         | I think the spirit of "avoid machine learning" is great, but I
         | do think having some forethought about how ML might integrate
         | into the system later on is pretty important.
        
           | srcreigh wrote:
           | What are some resources to learn more about this?
           | 
            | This is not "How to do ML". It's the much more interesting
            | question of "How to build products that can be extended with
            | ML later." I wouldn't even know how to Google for this!
           | 
           | Very interesting. Thank you so much.
        
             | twelfthnight wrote:
             | I'm not aware of any resources for that specific question.
             | The closest resource I can think of would be "Designing
             | Data Intensive Applications" which is great for designing
             | systems with data integrity that will lend themselves to ML
             | later on. My next recommendation would actually be to work
             | on a Kaggle project (esp the toy starter project
             | https://www.kaggle.com/c/titanic). Looking at Kaggle
             | notebooks, sklearn documentation, etc is really valuable
             | for understanding how to pose problems in a way solvable
             | with ML.
        
           | lumost wrote:
           | I think the problems you mention start off a bit of a rabbit
           | hole trap which most ML teams fall into.
           | 
           | Effectively, many problems do not have formal "correct"
           | solutions which can easily be applied. One can spend an
           | unbounded amount of time maintaining data-integrity or
           | improving uncertainty handling. As you add more people
           | familiar with this depth to the team, the "best/correct
           | approach" becomes harder to achieve. In the end, a naive
           | estimate of uncertainty is likely about as valuable to the
           | customer as a formal estimate in _most_ problem domains.
           | 
           | "Avoid ML" can easily be code for avoid ambiguous research
           | oriented tasks while building an industrial project, or
           | alternately "don't trust that a magic algorithm will solve
           | your customer problem".
        
         | molsongolden wrote:
         | What would your stack and first steps look like at a fresh
         | startup today?
        
         | ngc248 wrote:
          | People also think ML is some kind of panacea and try to
          | shoehorn it into everything to make the product "sexy". I also
          | wasted 1.5 years on an ML-based log analysis product where
          | they wanted everything to be learnt; it was death by a
          | thousand cuts. A little bit of coding and writing parsers
          | would have made the product successful.
        
         | darksaints wrote:
         | > That is not to say ML isn't a powerful tool in the right
         | context, but I feel it is grossly over-hyped and over-used. And
         | that seems to be a controversial stance. Even within start-up
         | circles I've encountered push-back when suggesting going the
         | simpler route and saving ML for much much later.
         | 
          | Exactly. Many times when people are looking for an intelligent
          | solution to a hard problem, Machine Learning is actually the
          | _wrong_ solution...but ML enthusiasts (as opposed to experts)
          | are so eager to use it that they end up prescribing it as a
          | solution to every problem, and that can really mess with non-
          | technical managers' heads.
         | 
          | I recently advised a high-level non-technical manager who was
          | looking for a solution to a hard combinatorial optimization
          | problem, and I told him he should be looking to hire some
          | expertise from the field of Operations Research. But he
         | also had a dozen software engineers that were chomping at the
         | bit to put some machine learning on their resumes, who were all
         | advising that he needed a machine learning solution. And when
         | the reqs went out, they were all for data scientists...and when
         | the first data scientist came on board, he quit after three
         | months and told the manager that he didn't need machine
         | learning, but that he needed a Gurobi license and someone who
         | could model Integer Programming problems...also known as an
         | Operations Researcher.
         | 
         | I know I know, /r/ThatHappened. But it's a real problem and ML
         | enthusiasts are actually feeding the next AI winter by
         | overprescribing it where it is the least applicable solution to
         | the problem, ultimately generating disappointment in ML.
         | Sometimes you don't need Machine Learning...sometimes you need
         | Linear Programming, sometimes Econometrics, sometimes
         | Statistics, sometimes Constraint Programming, and sometimes you
         | just need an Excel Pivot Table.
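To make the contrast concrete, here is a minimal sketch of an exact combinatorial search, a toy project-selection knapsack with made-up costs, values, and budget. It's the kind of problem OR tooling (or, at this scale, brute force) solves directly, with no ML anywhere:

```python
from itertools import combinations

# Hypothetical projects: name -> (cost, value). Numbers are made up.
projects = {"a": (4, 10), "b": (3, 7), "c": (5, 12)}
budget = 7

# Exhaustively check every subset of projects and keep the best
# feasible one -- an exact answer, not a learned approximation.
best_value, best_pick = 0, ()
for r in range(len(projects) + 1):
    for pick in combinations(projects, r):
        cost = sum(projects[p][0] for p in pick)
        value = sum(projects[p][1] for p in pick)
        if cost <= budget and value > best_value:
            best_value, best_pick = value, pick
```

At real-world scale this is exactly where Integer Programming solvers like the Gurobi license mentioned above come in: same problem statement, specialized exact machinery instead of brute force.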
        
           | twelfthnight wrote:
           | > but ML enthusiasts (as opposed to experts) are so eager to
           | use it that they end up prescribing it as a solution to every
           | problem
           | 
           | Interesting you say "as opposed to experts". I think that's
           | right on. It's been my experience on data science teams that
           | the data scientists are often the ones pushing back on ML
           | solutions and it's leadership who want GPT-3 or whatever
           | latest ML model in the stack. I had one team where we were
           | told we needed to call our ML work "AI" because if we didn't,
           | leadership would think we weren't cutting edge and spin-up a
           | competing team who was doing "AI".
        
           | klmadfejno wrote:
           | When I was at MIT, the Op. Research Center was working on
           | globally optimal decision trees. That is, a simple, single
           | decision tree, that is devised by an optimization model such
           | that it is going to be the best possible tree you could make
           | with a given training set (vs. regular trees which are
           | trained in a dumb greedy way). That always felt like the holy
           | grail of actual useful business "ML" to me.
           | 
            | But either way, you're often correct. For anything but the
            | simplest decisions, the bulk of the value will come from
            | better decision making. ML can help refine the estimates of
            | the value of each decision, but making the final choice is
            | an easily overlooked and super important step.
        
           | edot wrote:
           | I absolutely believe that happened.
           | 
           | OR is criminally underrated by businesses today, in favor of
           | ML and AI hype. For proof, search for an OR-titled job and
           | see how many come up, compared to ML-titled.
           | 
           | Lots of "ML" and "AI" is prettied up 50 year old OR
           | techniques.
        
           | EricMausler wrote:
           | I have a bachelor's in operations research (labeled
           | Information & Systems Engineering) and have been finding
           | myself looking up ML certifications because the jobs posted
           | in relation to the problems I am trained to solve are asking
           | for it.
           | 
           | I have been all-in on the data wave since picking my major in
           | 2010, and honestly most of the time all you need is SQL and a
           | pivot table.
           | 
            | I already learned Python to smooth some edges on what people
            | think the job is about. Biting the bullet on ML makes me
            | worried I am going to spread myself too thin. I spent hours
            | researching in my own time and it is so vast. Like you said,
            | you need to be committed as a data scientist at that point.
           | Has operations research become a specialization of data
           | science now?
        
         | _the_inflator wrote:
         | >> Essentially, as we gained more experience and learned more
         | about the domain and customers' needs, we found more value in
         | "the basics" rather than fancy ML models
         | 
         | Absolutely agree. Our current model evolved from some
         | supermessy JavaScript scripts that from the outside looked like
         | fancy ML/AI stuff.
         | 
         | I am talking about Financial Services.
        
       | egr wrote:
       | I am looking for additional best practices regarding the
       | development of data and ML based products. Would be grateful for
       | any pointers in this direction.
        
       | [deleted]
        
       | artembugara wrote:
        | Nice, but I do not see how this is related to data products
        | specifically. You can change "data" to pretty much any relevant
        | word and this article is still useful.
        
       | rodolphoarruda wrote:
       | > Typically, a sprint lasts up to five days.
       | 
       | I (am probably too old...) remember when sprints were no shorter
       | than 15 days.
        
         | jschmitz28 wrote:
          | The problem with 15-day sprints is that you tend to get tired
          | and slow down before the end of the sprint, whereas with 5-day
          | sprints it's easier to go full velocity because you're only
          | sprinting for 1/3rd of the time.
        
       | CapriciousCptl wrote:
       | I'll add one: you don't need much data to make important
        | insights. I'm talking N < 100. Focus on starting with a good
        | sample and building up a priori domain knowledge instead. Get the
        | data when it's time to build out complex models/ML, IF that time
        | ever comes.
        
         | pbronez wrote:
         | > Focus on starting with a good sample and building up a priori
         | domain knowledge
         | 
         | Yes! Experimental design is king. Sampling strategies matter.
        
       ___________________________________________________________________
       (page generated 2021-03-04 23:01 UTC)