[HN Gopher] Tips for Shipping Data Products Fast
___________________________________________________________________
Tips for Shipping Data Products Fast
Author : oedmarap
Score : 139 points
Date : 2021-03-04 11:13 UTC (11 hours ago)
(HTM) web link (shopify.engineering)
(TXT) w3m dump (shopify.engineering)
| pbronez wrote:
| Nice write up, and a good example of the kinds of best practices
| that data analysts should adopt from the world of software
| engineering. The main recommendations are:
|
| > Utilize design sprints to help focus your team's efforts and
| remove the stress of the ticking clock
|
| > Don't skip on prototyping, it's a great way to fail early
|
| > Avoid machine learning (for first iterations) to avoid being
| slowed down by unnecessary complexity
|
| > Talk to your users so you can get a better sense of what
| problem they're facing and what they need in a product
|
| It's useful to consider these recommendations in the context of
| the AI Hierarchy of Needs [0]. You need to sort out the base of
| your pyramid as quickly as possible.
|
| [0] https://hackernoon.com/the-ai-hierarchy-of-
| needs-18f111fcc00...
| nvilcins wrote:
| > 3. Avoid Machine Learning (on First Iterations)
|
| For many years I worked on building data products at a start-up
| as the data guy (encompassing analytics, ML, data engineering).
| We started off pretty much all-in with ML, even implementing
| bleeding-edge models from scratch based on the newest research
| papers, and building crazy infrastructures around them (we had
| loads of fun tbh). In the end (after ~5 years), however, our
| product was a UI displaying a handful of "simple" stats, which
| was facilitated by a robust but relatively simple data ETL
| pipeline in the background.
|
| Essentially, as we gained more experience and learned more about
| the domain and customers' needs, we found more value in "the
| basics" rather than fancy ML models.
|
| That is not to say ML isn't a powerful tool in the right context,
| but I feel it is grossly over-hyped and over-used. And that seems
| to be a controversial stance. Even within start-up circles I've
| encountered push-back when suggesting going the simpler route and
| saving ML for much much later.
|
 | Is that hype generally perpetuated by ML people not wanting to
 | lose out on ML opportunities? (I guess ML sounds better, is more
 | fun, and probably pays more than the "basic stuff".)
| nvilcins wrote:
| An open follow-up question:
|
| (Assuming we are indeed living in a world overly enthusiastic
| about ML)
|
 | How do you - a professional who doesn't just throw ML at
 | everything but focuses on the "boring stuff" - position
 | yourself in the job market?
|
| Sounds like a bit of a hard sell, especially if you're also
| charging more than the other ML-eager prospects. (i.e., you
| know you are the best person for the gig, but the proposal is
| likely to fall flat due to the expectations of the people
| hiring you)
| tixocloud wrote:
 | For me, I'd position myself as someone focused on value and
 | results. For new clients, you can lower your rates, but
 | ultimately clients buy trust that you can deliver what they
 | need.
|
| Soft skills are highly undervalued in the tech community yet
| if you start speaking with management and other non-tech
| stakeholders, you'll quickly find out how valuable you are.
| importantbrian wrote:
| I think you hit the nail on the head. In my experience, a lot
| of the more BI/Analyst work I've done has been far more
| valuable to the company than the ML work, and most of the value
| of the ML work actually came from insights gained while doing
| EDA that had direct business impact and not from the model
| itself.
| RHSman2 wrote:
 | You gotta start with a model that does something using averages
 | (and I say that in humour too), as otherwise you don't have a
 | baseline to compare against. Most ML is done because someone
 | thinks it's what's required.
| jimbokun wrote:
| Right, it's often surprising how well a stupidly simple model
| does. And a complex ML model doesn't always beat that
| baseline.
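 | A minimal sketch of what such a "stupidly simple" baseline can
 | look like, using made-up daily order counts (all numbers here
 | are hypothetical):

```python
import random
import statistics

def mae(preds, actuals):
    """Mean absolute error between predictions and actuals."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(actuals)

random.seed(0)
# Hypothetical daily order counts: a stable mean plus noise.
history = [100 + random.gauss(0, 10) for _ in range(90)]
future = [100 + random.gauss(0, 10) for _ in range(30)]

# The baseline: always predict the historical mean.
baseline = statistics.mean(history)
baseline_error = mae([baseline] * len(future), future)
print(f"mean-baseline MAE: {baseline_error:.1f}")
```

 | Any fancier model has to beat that number by enough to justify
 | its extra complexity.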
| akg_67 wrote:
 | I believe companies are starting to figure this out. I recently
 | sent out my resume (heavy on data engineering and analytics,
 | light on AI/ML/DL) for data scientist positions. I got an 80%
 | response rate. Previously, with the same resume, the response
 | rate was less than 5%.
| boringds wrote:
 | I think you nailed it. Often companies and execs want ML but
 | don't have the basics: a robust ETL pipeline, clean data, a
 | solid analytics foundation (dashboards, automated reporting,
 | etc.). These appear boring to most, but they will be the
 | difference between a useless ML department that can't ship
 | anything to production and a successful one that builds on top
 | of the aforementioned foundations.
|
| In addition I believe it's time we drop the data science term.
| It's an umbrella of different roles ranging from data engineer
| to DL researcher. Companies need to identify what they REALLY
| need and not go for the shiny PhD in ML.
|
 | The emergence of analytics engineering is the perfect example
 | of this shift towards building robust data pipelines first, so
 | that "data scientists" can build on top of them.
|
| I wrote a blog post about it yesterday, I don't want to post it
| here and self-promote too much, so check it in my profile if
| you want to.
| nerdponx wrote:
| I love the idea of "boring data science." I will steal that
| term for my own use.
| EricMausler wrote:
| Who do you think has the best 3rd party solutions for data
| cleaning?
| klmadfejno wrote:
| Data cleaning is domain specific. Hire someone to do it and
| accumulate wisdom over time.
| PebblesRox wrote:
| Here's boringds's post, for anyone else who's curious:
| https://boringdatascience.com/post/data-science-is-dead-
| long...
| datenhorst wrote:
 | AI is simultaneously over-hyped and a game-changer. To
 | paraphrase a famous marketing quip, the problem is that you
 | need to be a domain expert as well as an ML expert in order to
 | judge which parts are the over-hyped ones.
|
 | As a consultant specializing in building data science teams,
 | I've been shouting from the rooftops for years that you can
 | only build AI products on top of a robust data pipeline and,
 | more importantly, a robust data culture. Monica Rogati's
 | formulation of it, the "AI hierarchy of needs" [0], helps a lot
 | to instill a mental image in managers.
|
| [0] https://hackernoon.com/the-ai-hierarchy-of-
| needs-18f111fcc00...
| scribu wrote:
| Perhaps iterating on the ML model allowed you to learn about
| the business domain faster or more systematically.
|
| When you're learning something new, you don't even know what
| questions to ask (unknown unknowns). And the domain experts
| don't know what to tell you first (curse of knowledge).
|
| If, on the other hand, you present a model to a domain expert,
| they can start to poke it and tell you what it got wrong.
| disgruntledphd2 wrote:
 | I dunno, ML/statistical models have really, really, really,
 | really slow iteration times. Matrix multiplication is O(N^3),
 | and you'll need to do that a lot.
|
| EDA (exploratory data analysis) is the best way to learn
| about the business domain and has the advantage of being
| much, much, much faster than even the fastest ML approach.
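 | As a sketch of how lightweight that EDA loop can be - a
 | group-and-summarize pass over some made-up order records, stdlib
 | only, seconds to run and no model to train:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order records; in practice these come from your warehouse.
orders = [
    {"region": "EU", "value": 120.0},
    {"region": "EU", "value": 80.0},
    {"region": "US", "value": 200.0},
    {"region": "US", "value": 160.0},
    {"region": "APAC", "value": 40.0},
]

# Group order values by region.
by_region = defaultdict(list)
for o in orders:
    by_region[o["region"]].append(o["value"])

# Summarize each group: count and average value.
summary = {r: (len(v), mean(v)) for r, v in by_region.items()}
for region, (count, avg) in sorted(summary.items()):
    print(f"{region}: n={count}, avg={avg:.1f}")
```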
| qsort wrote:
| I'm more on the development side of things, but I'm working on
| similar products right now and I completely agree with your
| sentiment.
|
| What's more, there's usually a middle ground between spitting
| data out of postgres and full-on ML models. For example, if
| you're making forecasts, maybe a simple statistical model will
| get you 90% of the way there and won't require you to reach out
| for RNNs or something. In the end it's a tradeoff, and
| personally I'm delighted when simple math that mostly works can
| be used in place of complex models with little marginal value.
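 | A minimal sketch of that middle ground - one-step simple
 | exponential smoothing over hypothetical weekly sales figures,
 | no RNN required (the numbers and the alpha value are made up):

```python
def exp_smooth_forecast(series, alpha=0.3):
    """One-step-ahead forecast via simple exponential smoothing.

    alpha controls how heavily recent observations are weighted.
    """
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

# Hypothetical weekly sales figures.
sales = [102, 98, 105, 110, 107, 111, 115]
print(f"next-week forecast: {exp_smooth_forecast(sales):.1f}")
```

 | Ten lines of math that mostly works, versus a deep model with
 | little marginal value - exactly the tradeoff above.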
| pbronez wrote:
| This has been my experience as well. My intuition is that
| complex models have several pitfalls that undermine their
| practical value:
|
| 1) complex models make more assumptions and may be less
| robust to the world behaving in unexpected ways. It takes
| more effort for practitioners to thoroughly review all these
| assumptions.
|
| 2) complex models have more parameters, and thus require more
| data to train. This limits the scenarios where they can be
| deployed
|
 | 3) complex models are harder to interpret, which makes them
 | less useful for convincing people to take action AND increases
 | the likelihood that a practitioner will make an
 | interpretation error
|
| Of course, complex models have plenty of advantages too.
| Under-specified models can be wrong just by failing to
| account for obvious, well-documented variations. Human
| judgement is required to select the appropriate model for a
| given situation.
| twelfthnight wrote:
| > 3. Avoid Machine Learning
|
 | Agreed, this usually makes sense. My only qualm is that I've
 | seen many teams without any ML experience develop tools that
 | (1) throw away or don't maintain data integrity and (2) cannot
 | rank or handle uncertainty in the underlying data. These
 | systems often bake in heuristics that cannot be validated, and
 | it's very difficult to change them later. Users feel like they
 | are losing something if you migrate from a tool that
 | automatically chooses the best option for them to one where
 | they need to choose among 5 alternatives, for example, even if
 | that "best option" was actually just noise.
|
| I think the spirit of "avoid machine learning" is great, but I
| do think having some forethought about how ML might integrate
| into the system later on is pretty important.
| srcreigh wrote:
| What are some resources to learn more about this?
|
 | This is not "How to do ML". It's the much more interesting
 | question of "How to build products that can be extended with ML
 | later." I wouldn't even know how to Google for this!
|
| Very interesting. Thank you so much.
| twelfthnight wrote:
| I'm not aware of any resources for that specific question.
| The closest resource I can think of would be "Designing
| Data Intensive Applications" which is great for designing
| systems with data integrity that will lend themselves to ML
| later on. My next recommendation would actually be to work
| on a Kaggle project (esp the toy starter project
| https://www.kaggle.com/c/titanic). Looking at Kaggle
| notebooks, sklearn documentation, etc is really valuable
| for understanding how to pose problems in a way solvable
| with ML.
| lumost wrote:
 | I think the problems you mention are a bit of a rabbit-hole
 | trap which most ML teams fall into.
|
| Effectively, many problems do not have formal "correct"
| solutions which can easily be applied. One can spend an
| unbounded amount of time maintaining data-integrity or
| improving uncertainty handling. As you add more people
| familiar with this depth to the team, the "best/correct
| approach" becomes harder to achieve. In the end, a naive
| estimate of uncertainty is likely about as valuable to the
| customer as a formal estimate in _most_ problem domains.
|
| "Avoid ML" can easily be code for avoid ambiguous research
| oriented tasks while building an industrial project, or
| alternately "don't trust that a magic algorithm will solve
| your customer problem".
| molsongolden wrote:
| What would your stack and first steps look like at a fresh
| startup today?
| ngc248 wrote:
 | People also think ML is some kind of panacea and try to
 | shoehorn it into everything to make the product "sexy". I
 | wasted 1.5 years on an ML-based log analysis product where they
 | wanted everything to be learnt; it was death by a thousand
 | cuts. A little bit of coding and writing parsers would have
 | made the product successful.
| darksaints wrote:
| > That is not to say ML isn't a powerful tool in the right
| context, but I feel it is grossly over-hyped and over-used. And
| that seems to be a controversial stance. Even within start-up
| circles I've encountered push-back when suggesting going the
| simpler route and saving ML for much much later.
|
 | Exactly. Many times when people are looking for an intelligent
 | solution to a hard problem, Machine Learning is actually the
 | _wrong_ solution... but ML enthusiasts (as opposed to experts)
 | are so eager to use it that they end up prescribing it as a
 | solution to every problem, and that can really mess with non-
 | technical managers' heads.
|
 | I recently advised a high-level non-technical manager who was
 | looking for a solution to a hard combinatorial optimization
 | problem, and I told him he should be looking to hire some
 | expertise from the field of Operations Research. But he
| also had a dozen software engineers that were chomping at the
| bit to put some machine learning on their resumes, who were all
| advising that he needed a machine learning solution. And when
| the reqs went out, they were all for data scientists...and when
| the first data scientist came on board, he quit after three
| months and told the manager that he didn't need machine
| learning, but that he needed a Gurobi license and someone who
| could model Integer Programming problems...also known as an
| Operations Researcher.
|
| I know I know, /r/ThatHappened. But it's a real problem and ML
| enthusiasts are actually feeding the next AI winter by
| overprescribing it where it is the least applicable solution to
| the problem, ultimately generating disappointment in ML.
| Sometimes you don't need Machine Learning...sometimes you need
| Linear Programming, sometimes Econometrics, sometimes
| Statistics, sometimes Constraint Programming, and sometimes you
| just need an Excel Pivot Table.
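 | As a toy illustration of the OR framing - a brute-force 0/1
 | knapsack (pick projects to maximize value under a budget) over
 | hypothetical costs and values. Real problems at scale need a
 | proper ILP solver (e.g. Gurobi or CBC), since enumeration
 | explodes; the point is only that exact optimization needs no ML:

```python
from itertools import combinations

# Hypothetical projects: name -> (cost, value).
projects = {"A": (4, 40), "B": (3, 35), "C": (5, 30), "D": (2, 15)}
budget = 7

# Exhaustively enumerate every subset and keep the best feasible one.
best_value, best_pick = 0, ()
for r in range(len(projects) + 1):
    for pick in combinations(projects, r):
        cost = sum(projects[p][0] for p in pick)
        value = sum(projects[p][1] for p in pick)
        if cost <= budget and value > best_value:
            best_value, best_pick = value, pick

print(best_pick, best_value)
```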
| twelfthnight wrote:
| > but ML enthusiasts (as opposed to experts) are so eager to
| use it that they end up prescribing it as a solution to every
| problem
|
| Interesting you say "as opposed to experts". I think that's
| right on. It's been my experience on data science teams that
| the data scientists are often the ones pushing back on ML
| solutions and it's leadership who want GPT-3 or whatever
| latest ML model in the stack. I had one team where we were
| told we needed to call our ML work "AI" because if we didn't,
| leadership would think we weren't cutting edge and spin-up a
| competing team who was doing "AI".
| klmadfejno wrote:
 | When I was at MIT, the Op. Research Center was working on
 | globally optimal decision trees: a single, simple decision tree
 | devised by an optimization model such that it is the best
 | possible tree you could make with a given training set (vs.
 | regular trees, which are trained in a dumb greedy way). That
 | always felt like the holy grail of actually useful business
 | "ML" to me.
 |
 | But either way, you're often correct. For anything but the
 | simplest decisions, the bulk of the value will come from better
 | decision making. ML can help refine the estimates of the value
 | of each decision, but making the final choice is an easily
 | overlooked and super important step.
| edot wrote:
| I absolutely believe that happened.
|
| OR is criminally underrated by businesses today, in favor of
| ML and AI hype. For proof, search for an OR-titled job and
| see how many come up, compared to ML-titled.
|
 | Lots of "ML" and "AI" is prettied-up 50-year-old OR
 | techniques.
| EricMausler wrote:
| I have a bachelor's in operations research (labeled
| Information & Systems Engineering) and have been finding
| myself looking up ML certifications because the jobs posted
| in relation to the problems I am trained to solve are asking
| for it.
|
| I have been all-in on the data wave since picking my major in
| 2010, and honestly most of the time all you need is SQL and a
| pivot table.
|
 | I already learned Python to smooth some edges on what people
 | think the job is about. Biting the bullet on ML makes me
 | worried I am going to spread myself too thin. I've spent hours
 | researching in my own time and it is so vast. Like you said,
 | you need to be committed as a data scientist at that point.
| Has operations research become a specialization of data
| science now?
| _the_inflator wrote:
| >> Essentially, as we gained more experience and learned more
| about the domain and customers' needs, we found more value in
| "the basics" rather than fancy ML models
|
 | Absolutely agree. Our current model evolved from some
 | super-messy JavaScript scripts that, from the outside, looked
 | like fancy ML/AI stuff.
|
| I am talking about Financial Services.
| egr wrote:
 | I am looking for additional best practices regarding the
 | development of data and ML-based products. I'd be grateful for
 | any pointers in this direction.
| [deleted]
| artembugara wrote:
 | Nice, but I do not see how this is specific to data products.
 | You could replace "data" with pretty much any relevant word and
 | this article would still be useful.
| rodolphoarruda wrote:
| > Typically, a sprint lasts up to five days.
|
| I (am probably too old...) remember when sprints were no shorter
| than 15 days.
| jschmitz28 wrote:
 | The problem with 15 day sprints is that you tend to get tired
 | and slow down before the end of the sprint, whereas with 5 day
 | sprints it's easier to go full velocity because you're only
 | sprinting for 1/3rd of the time.
| CapriciousCptl wrote:
 | I'll add one: you don't need much data to surface important
 | insights. I'm talking N < 100. Focus on starting with a good
 | sample and building up a priori domain knowledge instead. Get
 | the data when it's time to build out complex models/ML, IF that
 | time ever comes.
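 | A sketch of how far a small, well-drawn sample can go - a rough
 | normal-approximation interval on a hypothetical N = 30 sample
 | (the numbers are made up; with N this small, a t quantile would
 | be more rigorous than the 1.96 used here):

```python
import math
import statistics

# Hypothetical small sample (N = 30) of per-order processing minutes.
sample = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.9, 12.4, 11.1, 10.6,
          9.9, 11.8, 10.4, 12.7, 11.3, 10.1, 9.7, 12.2, 10.8, 11.6,
          10.3, 11.0, 12.9, 9.6, 10.7, 11.9, 10.5, 11.2, 12.0, 10.0]

n = len(sample)
sample_mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
# Rough 95% interval via the normal approximation.
lo, hi = sample_mean - 1.96 * sem, sample_mean + 1.96 * sem
print(f"n={n}, mean={sample_mean:.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
```

 | Even at this size, the interval is tight enough to act on - if
 | the sampling was done well.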
| pbronez wrote:
| > Focus on starting with a good sample and building up a priori
| domain knowledge
|
| Yes! Experimental design is king. Sampling strategies matter.
___________________________________________________________________
(page generated 2021-03-04 23:01 UTC)