[HN Gopher] Why is machine learning 'hard'? (2016)
___________________________________________________________________
Why is machine learning 'hard'? (2016)
Author : jxmorris12
Score : 58 points
Date : 2024-01-23 20:43 UTC (2 hours ago)
(HTM) web link (ai.stanford.edu)
(TXT) w3m dump (ai.stanford.edu)
| jruohonen wrote:
| "An aspect of this difficulty involves building an intuition for
| what tool should be leveraged to solve a problem."
|
| While I agree with the good point about debugging, like many
| others, I am rather worried that we're increasingly deploying
| AI/ML where we shouldn't be deploying it. Hence, the above quote.
| sjwhevvvvvsj wrote:
| I'm old enough to have learned that the secret to success is
| much less knowing the tool of the moment than picking the right
| tool for a job.
|
| The right tool may in fact be the new one, and LLMs do open a
| lot of doors with zero-shot capabilities, but oftentimes they
| can underperform a well-tuned heuristic. It's the ability to
| pick the right tool that is key.
| dang wrote:
| Discussed at the time:
|
| _Why is machine learning 'hard'?_ -
| https://news.ycombinator.com/item?id=12936891 - Nov 2016 (88
| comments)
| epistasis wrote:
| Love that thread. The top comment is excellent:
|
| > Like picking hyperparameters - time and time again I've asked
| experts/trainers/colleagues: "How do I know what type of model
| to use? How many layers? How many nodes per layer? Dropout or
| not?" etc. And the answer is always along the lines of "just
| try a load of stuff and pick the one that works best".
|
| > To me, that feels weird and worrying. It's like we don't yet
| understand ML well enough to definitively say, for a given
| data set, what sort of model we'll need.
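|
| In concrete terms, "just try a load of stuff and pick the one
| that works best" is a search loop over candidate settings. A
| minimal sketch in Python (the search space is arbitrary, and
| train_and_eval is a hypothetical helper that trains a model
| with the given config and returns a validation score):
|
|     import itertools, random
|
|     # Illustrative search space for the questions in the quote
|     grid = {
|         "n_layers": [1, 2, 3],
|         "units_per_layer": [64, 128, 256],
|         "dropout": [0.0, 0.2, 0.5],
|     }
|
|     candidates = list(itertools.product(*grid.values()))
|     random.shuffle(candidates)
|
|     best_score, best_config = float("-inf"), None
|     for values in candidates[:20]:      # "try a load of stuff"
|         config = dict(zip(grid.keys(), values))
|         score = train_and_eval(config)  # hypothetical helper
|         if score > best_score:
|             best_score, best_config = score, config
|     # best_config is "the one that works best"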
|
| This embodies a fundamental difference between science and
| engineering. In science, you make a discovery, but rarely do
| you ask "what was the magical combination that let me find the
| needle in the haystack today?" You just pass on the needle and
| show everyone you found it.
|
| Should we work on finding out the magic behind hyperparameters?
| In bioinformatics, the brilliant mathematician Lior Pachter
| once attacked the problem of sequence alignment using the tools
| of tropical algebra: which parameters to the alignment
| algorithms resulted in which regimes of solutions? It was
| beautiful. It was great to understand. But I'm not sure it ever
| got published (though it likely did). Having reasonable
| parameters is more important than understanding how to pick
| them from first principles: even if you know all the possible
| output regimes for different segments of the hyperparameter
| space, the only thing we really care about is getting a
| functionally trained model at the end.
|
| Sometimes deeper understanding provides deeper insight into the
| problems at hand. But often it doesn't, even when the deeper
| understanding is beautiful. If the hammer works when you hold
| it a certain way, that's great, but understanding all possible
| ways to hold a hammer doesn't always help get the nail in
| better.
| amelius wrote:
| The hammer analogy doesn't make much sense because for a
| hammer we can actually use our scientific knowledge to
| compute the best possible way to hold the tool, and we can
| make instruments that are better than hammers, like pneumatic
| hammers, pile drivers, etc.
|
| With your argument, we would be stuck with the good old but
| basic hammer for the rest of time.
| epistasis wrote:
| That seems like a different analogy; making better hammers
| is a different thing than understanding why holding a
| hammer a certain way works well. We did eventually invent
| enough physics to understand why we hold hammers where we
| do, but we got really far just experimenting without first
| principles. And even if we use first principles, we are
| going to discover a lot more by actually using the modified
| hand-held hammer and testing it than by hitting it out of
| the park with great physical modeling of the hammer and the
| biomechanics of the human body.
|
| And in any case, I'm not saying we shouldn't search for a
| deep understanding of which hyperparameters work on a first
| try; I'm just saying there's a good chance that even if the
| principles are fully discovered, calculating from those
| principles may be more expensive than a bunch of
| experimentation and won't matter in the end.
|
| That's the trick about science: it's more about finding the
| right question to answer than about how to find answers, and
| oftentimes the best questions only become apparent
| afterwards.
| sjwhevvvvvsj wrote:
| I do a lot of model tuning and I'm almost ashamed to say I
| tell GPT what performance I'm aiming for and have it generate
| the hyperparameters (as in, just literally give me a code
| block). Then I see what works, tell GPT, and try again.
|
| I'm deeply uncomfortable with such a method...but my models
| perform quite well. Note I spend a TON of time generating the
| right training data, so it's not random.
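|
| Roughly, the loop looks like this (a sketch, not my actual
| code; ask_gpt is a hypothetical wrapper around whatever chat
| API you use, and train_and_eval is a hypothetical helper that
| trains on my data and returns the metric I care about):
|
|     import json
|
|     target = "validation F1 >= 0.90"
|     history = []          # (hyperparams, score) pairs so far
|
|     for _ in range(5):
|         prompt = (
|             f"I'm aiming for {target}. "
|             f"Previous attempts and results: {history}. "
|             "Reply with only a JSON dict of hyperparameters "
|             "to try next."
|         )
|         hyperparams = json.loads(ask_gpt(prompt))  # hypothetical
|         score = train_and_eval(hyperparams)        # hypothetical
|         history.append((hyperparams, score))  # tell GPT, retry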
| doctorM wrote:
| I'm a bit sceptical of the exponentially harder debugging claim.
|
| First, it looks only polynomially harder for the given example
| :p.
|
| Second, other engineering domains arguably have additional
| dimensions that correspond to the machine learning ones
| mentioned in the article. The choice of which high-level
| algorithm to implement is a dimension of traditional software
| engineering that corresponds to the model dimension; it is
| often codified as 'design'.
|
| The data dimension often exists in standard software
| engineering as well (think of a system that is 'downstream' of
| others).
|
| It's probably a lot simpler to deal with these dimensions in
| standard software engineering - but then this is what makes
| machine learning harder, not that there are simply 'more
| dimensions'.
|
| The delayed debugging cycles point seems a lot more valid.
| pizzaknife wrote:
| I would subscribe to your newsletter if you offered one.
| PaulHoule wrote:
| The #1 thing that makes it 'hard' in real life is that nobody
| wants to make training and test sets. So we have 50,000 papers on
| the NIST digits but no insight into 'would this work for a
| different problem?' (Ironically the latter might have been
| exactly what academics would have needed to understand why these
| algorithms work!)
| ramesh31 wrote:
| Would there be enough of a financial incentive to do so? Seems
| like a prime startup opportunity.
| rzzzz wrote:
| I believe that Scale.ai was founded to do exactly this.
| whiplash451 wrote:
| You're not giving credit to MNIST-1D and the many other
| datasets out there (including the massive segmentation dataset
| released by Meta with SAM). Read the literature before
| lecturing the community.
| Cacti wrote:
| there are uncountable sets all over the place, and in practical
| terms, the REPL loop may have a week-long training lag after
| you hit enter.
|
| also, the data is almost always complete shit.
|
| lol. there's no mystery why it's hard.
| phkahler wrote:
| >> It becomes essential to build an intuition for where something
| went wrong based on the signals available.
|
| This has always been my approach. I learned programming way
| before I had access to debuggers and other methods to dig in, set
| breakpoints and step through code to see where it was going
| wrong. As a result, when I got into the real world I kind of
| looked down on people using those tools (mostly because I
| hate tools, actually). But then I saw people get to the root
| of problems that
| I don't think I ever could have solved, and I started to
| appreciate those tools and the detail you could get to. My
| preference is still to have a great understanding of how
| algorithms work, how the code is written, and what the problem
| is, and noodle out what and where things may be going wrong. I
| only switch to detailed monitoring of the insides when "thinking
| about it" fails. Maybe I should have gone into this ML stuff ;-)
| error9348 wrote:
| One thing that has gotten better since 2016 is more
| standardized architectures, with transformers now used across
| domains.
| Xcelerate wrote:
| With regard to model selection, one thing I learned a long time
| ago that provides powerful intuitive guidance on which model to
| use is the question: "How could this be compressed further?"
|
| There are some deep connections between data compression and
| generalized learning, both at the statistical level and even
| lower at the algorithmic level (see Solomonoff induction).
|
| For a specific example at the statistical level, suppose you
| fit a linear trendline to some data points using OLS. Now
| compute the standard deviation of the residuals and, using the
| CDF of the normal distribution, map each residual's value into
| the interval [0, 1]. Sum together the logarithms of the
| trendline coefficients, the standard deviation, and the
| normalized residuals. This value is approximately proportional to
| "sizeof(data | model) + sizeof(model)". It represents how well
| you compressed the data using the OLS model.
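|
| A rough sketch of that accounting in Python, treating the
| negative log-likelihood of the residuals under the fitted
| Gaussian as "sizeof(data | model)" and a crude fixed-precision
| cost for the parameters as "sizeof(model)" (the exact
| bookkeeping above differs slightly):
|
|     import numpy as np
|     from scipy import stats
|
|     def ols_description_length(x, y):
|         # Fit y = m*x + b by ordinary least squares
|         m, b = np.polyfit(x, y, 1)
|         residuals = y - (m * x + b)
|         sigma = residuals.std(ddof=1)
|
|         # sizeof(data | model): bits to encode the residuals
|         # under the assumed N(0, sigma^2) noise model
|         nll = -stats.norm.logpdf(residuals, scale=sigma).sum()
|         data_bits = nll / np.log(2)
|
|         # sizeof(model): crude 32-bit cost for m, b, and sigma
|         model_bits = 3 * 32
|         return data_bits + model_bits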
|
| But now suppose you plot the distribution of the residuals and
| find out that they do not in fact resemble a Gaussian
| distribution. This is a problem because our model assumed the
| error terms were distributed normally, and since this is not the
| case, our compression is suboptimal.
|
| So you back out some function f that closely maps between the
| uniform distribution on [0, 1] and the distribution that the
| residuals form and use this f to define a new model: y = m*x + b
| + e, with e distributed according to f(x;Th), Th being a
| parameter vector. When you sum the logarithms again, you will
| find that the new total is smaller than the original total
| obtained using OLS. The new trendline coefficients will slightly
| mess up the residual distribution again, so iterate on this
| process until you've converged on stable values for m, b, and Th.
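|
| A sketch of that loop, with a Student-t standing in for
| f(x; Th) (in practice f would be whatever family actually fits
| the residuals):
|
|     import numpy as np
|     from scipy import stats, optimize
|
|     def iterate_fit(x, y, n_iter=5):
|         m, b = np.polyfit(x, y, 1)      # OLS starting point
|         for _ in range(n_iter):
|             resid = y - (m * x + b)
|             theta = stats.t.fit(resid)  # Th = (df, loc, scale)
|
|             # Refit m, b by maximum likelihood under the new
|             # residual distribution
|             def nll(p):
|                 e = y - (p[0] * x + p[1])
|                 return -stats.t.logpdf(e, *theta).sum()
|
|             m, b = optimize.minimize(nll, [m, b]).x
|         return m, b, theta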
|
| At the algorithmic level, this approach even applies to LLMs, but
| it's a bit harder to use in practice because "sizeof(data |
| model) + sizeof(model)" isn't the entire story here. Suppose you
| had a "perfect" language model. In this case, you would achieve
| minimization of K(data | training data), where K is Kolmogorov
| complexity. In practice, what is being minimized upon the release
| of each new LLM is "sizeof(data | LLM model) + sizeof(training
| data | LLM model) + sizeof(LLM model)". You can assume
| "sizeof(LLM model)" is the smallest Turing machine equivalent to
| the LLM program.
___________________________________________________________________
(page generated 2024-01-23 23:00 UTC)