[HN Gopher] Why is machine learning 'hard'? (2016)
       ___________________________________________________________________
        
       Why is machine learning 'hard'? (2016)
        
       Author : jxmorris12
       Score  : 58 points
       Date   : 2024-01-23 20:43 UTC (2 hours ago)
        
 (HTM) web link (ai.stanford.edu)
 (TXT) w3m dump (ai.stanford.edu)
        
       | jruohonen wrote:
       | "An aspect of this difficulty involves building an intuition for
       | what tool should be leveraged to solve a problem."
       | 
       | While I agree with the good point about debugging, like many
       | others, I am rather worried that we're increasingly deploying
       | AI/ML where we shouldn't be deploying it. Hence, the above quote.
        
         | sjwhevvvvvsj wrote:
         | I'm old enough to have learned that the secret to success is
         | much less knowing the tool of the moment than picking the right
         | tool for a job.
         | 
          | The right tool may in fact be the new one, and LLMs do open a
          | lot of doors with zero-shot capabilities, but oftentimes they
          | can underperform a well-tuned heuristic. It's the ability to
          | pick the right tool that is key.
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _Why is machine learning 'hard'?_ -
       | https://news.ycombinator.com/item?id=12936891 - Nov 2016 (88
       | comments)
        
         | epistasis wrote:
         | Love that thread. The top comment is excellent:
         | 
          | > Like picking hyperparameters - time and time again I've
          | asked experts/trainers/colleagues: "How do I know what type of
          | model to use? How many layers? How many nodes per layer?
          | Dropout or not?" etc., etc. And the answer is always along the
          | lines of "just try a load of stuff and pick the one that works
          | best".
         | 
          | > To me, that feels weird and worrying. It's like we don't
          | understand ML well enough yet to definitively say, for a given
          | data set, what sort of model we'll need.
         | 
          | This embodies the fundamental difference between science and
          | engineering. With science, you make a discovery, but rarely do
          | you ask "what was the magical combination that let me find the
          | needle in the haystack today?" You instead just pass on the
          | needle and show everyone you found it.
         | 
         | Should we work on finding out the magic behind hyperparameters?
         | In bioinformatics, the brilliant mathematician Lior Pachter
         | once attacked the problem of sequence alignment using the tools
         | of tropical algebra: what parameters to the alignment
         | algorithms resulted in which regimes of solutions? It was
          | beautiful. It was great to understand. But I'm not sure it
          | ever got published (though it likely did). Having reasonable
          | parameters is more important than understanding how to pick
          | them from first principles: even if you know all the possible
          | output regimes for different segments of the hyperparameter
          | space, the only thing we really care about is getting a
          | functionally trained model at the end.
         | 
          | Sometimes deeper understanding provides deeper insight into the
          | problems at hand. But often it doesn't, even when the deeper
          | understanding is beautiful. If the hammer works when you hold
          | it a certain way, that's great, but understanding all possible
          | ways to hold a hammer doesn't always help get the nail in
          | better.
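          | 
          | For what it's worth, "just try a load of stuff and pick the
          | one that works best" does have a mechanical form: random
          | search over the hyperparameter space. A minimal scikit-learn
          | sketch, with a toy data set and an arbitrary search space
          | chosen purely for illustration:
          | 
          |     import numpy as np
          |     from sklearn.datasets import load_digits
          |     from sklearn.model_selection import RandomizedSearchCV
          |     from sklearn.neural_network import MLPClassifier
          | 
          |     X, y = load_digits(return_X_y=True)
          | 
          |     # Sample 25 random configurations and keep the one with
          |     # the best cross-validated score.
          |     search = RandomizedSearchCV(
          |         MLPClassifier(max_iter=300),
          |         param_distributions={
          |             "hidden_layer_sizes": [(32,), (64,), (64, 32)],
          |             "alpha": np.logspace(-5, -1, 20),
          |             "learning_rate_init": np.logspace(-4, -1, 20),
          |         },
          |         n_iter=25,
          |         cv=3,
          |         random_state=0,
          |     )
          |     search.fit(X, y)
          |     print(search.best_params_, search.best_score_)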
        
           | amelius wrote:
           | The hammer analogy doesn't make much sense because for a
           | hammer we can actually use our scientific knowledge to
           | compute the best possible way to hold the tool, and we can
           | make instruments that are better than hammers, like pneumatic
           | hammers, pile drivers, etc.
           | 
           | With your argument, we would be stuck with the good old, but
           | basic hammer for the rest of time.
        
             | epistasis wrote:
             | That seems like a different analogy; making better hammers
             | is a different thing than understanding why holding a
             | hammer a certain way works well. We did eventually invent
             | enough physics to understand why we hold hammers where we
             | do, but we got really far just experimenting without first
              | principles. And even if we use first principles, we are
              | going to discover a lot more by actually using and testing
              | the modified hand-held hammer than by trying to get it
              | right up front with great physical modeling of the hammer
              | and the biomechanics of the human body.
             | 
              | And in any case, I'm not saying we shouldn't search for a
              | deep understanding of which hyperparameters work on a first
              | try; I'm just saying there's a good chance that even if the
              | principles are fully discovered, calculating from them may
              | be more expensive than a bunch of experimentation and won't
              | matter in the end.
             | 
              | That's the trick about science: it's more about finding the
              | right question to answer than about how to find answers,
              | and oftentimes the best questions only become apparent
              | afterwards.
        
           | sjwhevvvvvsj wrote:
            | I do a lot of model tuning and I'm almost ashamed to say I
            | tell GPT what performance I'm aiming for and have it generate
            | the hyperparameters (as in, just literally give me a code
            | block). Then I see what works, tell GPT, and try again.
           | 
           | I'm deeply uncomfortable with such a method...but my models
           | perform quite well. Note I spend a TON of time generating the
           | right training data, so it's not random.
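            | 
            | A rough sketch of that loop, assuming the openai v1 Python
            | client; the model name, the metric, and the prompt wording
            | are arbitrary placeholders, not the actual setup:
            | 
            |     import json
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            | 
            |     def ask_for_hyperparams(target, last_result=None):
            |         # Ask the model for a JSON dict of hyperparameters
            |         # to try next, given the target metric and the
            |         # result of the previous run.
            |         prompt = (
            |             f"I am tuning a classifier. Target: {target} "
            |             f"validation AUC. Last run: {last_result}. "
            |             "Reply with only a JSON object of "
            |             "hyperparameters to try next."
            |         )
            |         reply = client.chat.completions.create(
            |             model="gpt-4o",
            |             messages=[{"role": "user", "content": prompt}],
            |         )
            |         return json.loads(reply.choices[0].message.content)
            | 
            |     # params = ask_for_hyperparams(target=0.90)
            |     # ...train with params, measure, then call again with
            |     # the measured result to close the loop...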
        
       | doctorM wrote:
       | I'm a bit sceptical of the exponentially harder debugging claim.
       | 
        | First, it looks polynomially harder for the given example :p.
       | 
        | Second, other engineering domains arguably have additional
        | dimensions which correspond to the machine learning ones
        | mentioned in the article. The choice of which high-level
        | algorithm to implement is a dimension of traditional software
        | engineering that corresponds to the model dimension. This is
        | often codified as 'design'.
       | 
        | The data dimension often exists in standard software engineering
        | as well. [Think of a system that is 'downstream' of others.]
       | 
       | It's probably a lot simpler to deal with these dimensions in
       | standard software engineering - but then this is what makes
       | machine learning harder, not that there are simply 'more
       | dimensions'.
       | 
       | The delayed debugging cycles point seems a lot more valid.
        
         | pizzaknife wrote:
         | i would subscribe to your newsletter if you offered one.
        
       | PaulHoule wrote:
       | The #1 thing that makes it 'hard' in real life is that nobody
       | wants to make training and test sets. So we have 50,000 papers on
       | the NIST digits but no insight into 'would this work for a
       | different problem?' (Ironically the latter might have been
       | exactly what academics would have needed to understand why these
       | algorithms work!)
        
         | ramesh31 wrote:
         | Would there be enough of a financial incentive to do so? Seems
         | like a prime startup opportunity.
        
           | rzzzz wrote:
           | I believe that Scale.ai was founded to do exactly this.
        
         | whiplash451 wrote:
          | You're not giving credit to MNIST-1D and many other datasets
          | (including the massive segmentation dataset released by Meta
          | with SAM). Read the literature before lecturing the community.
        
       | Cacti wrote:
        | there are uncountable sets all over the place, and in practical
        | terms, the REPL loop may have a week-long training lag after you
        | hit enter.
       | 
       | also, the data is almost always complete shit.
       | 
       | lol. there's no mystery why it's hard.
        
       | phkahler wrote:
       | >> It becomes essential to build an intuition for where something
       | went wrong based on the signals available.
       | 
       | This has always been my approach. I learned programming way
       | before I had access to debuggers and other methods to dig in, set
       | breakpoints and step through code to see where it was going
        | wrong. As a result, when I got into the real world I kind of looked
        | down on people using those tools (mostly because I hate tools,
        | actually). But then I saw people get to the root of problems that
       | I don't think I ever could have solved, and I started to
       | appreciate those tools and the detail you could get to. My
       | preference is still to have a great understanding of how
       | algorithms work, how the code is written, and what the problem
       | is, and noodle out what and where things may be going wrong. I
       | only switch to detailed monitoring of the insides when "thinking
       | about it" fails. Maybe I should have gone into this ML stuff ;-)
        
       | error9348 wrote:
        | One thing that has gotten better since 2016 is more standardized
        | architectures, with transformers used across domains.
        
       | Xcelerate wrote:
       | With regard to model selection, one thing I learned a long time
       | ago that provides powerful intuitive guidance on which model to
       | use is the question: "How could this be compressed further?"
       | 
       | There are some deep connections between data compression and
       | generalized learning, both at the statistical level and even
       | lower at the algorithmic level (see Solomonoff induction).
       | 
        | For a specific example at the statistical level, suppose you fit
        | a linear trendline to some data points using OLS. Now compute the
        | residual for each data point and the standard deviation of those
        | residuals, and, using the CDF of the normal distribution, map
        | each residual's value into the interval [0, 1]. Sum together the
        | logarithms of the trendline coefficients, the standard deviation,
        | and the normalized residuals. This value is approximately
        | proportional to "sizeof(data | model) + sizeof(model)". It
        | represents how well you compressed the data using the OLS model.
       | 
       | But now suppose you plot the distribution of the residuals and
       | find out that they do not in fact resemble a Gaussian
       | distribution. This is a problem because our model assumed the
       | error terms were distributed normally, and since this is not the
       | case, our compression is suboptimal.
       | 
       | So you back out some function f that closely maps between the
       | uniform distribution on [0, 1] and the distribution that the
       | residuals form and use this f to define a new model: y = m*x + b
       | + e, with e distributed according to f(x;Th), Th being a
       | parameter vector. When you sum the logarithms again, you will
       | find that the new total is smaller than the original total
       | obtained using OLS. The new trend line coefficients will slightly
       | mess up the residual distribution again, so iterate on this
       | process until you've converged on stable values for m, b, and Th.
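        | 
        | To make the bookkeeping concrete, here is a rough sketch of the
        | "sizeof(data | model) + sizeof(model)" comparison. It uses the
        | standard negative log-likelihood of the residuals as the data
        | code length (a simplification of the CDF-normalization recipe
        | above), a crude fixed cost per parameter, and made-up toy data
        | with heavy-tailed noise:
        | 
        |     import numpy as np
        |     from scipy.stats import norm, laplace
        | 
        |     rng = np.random.default_rng(0)
        | 
        |     # Toy data: a linear trend with heavy-tailed (Laplace)
        |     # noise, so Gaussian residuals are the "wrong" model.
        |     x = np.linspace(0, 10, 200)
        |     y = 2.0 * x + 1.0 + rng.laplace(scale=1.5, size=x.size)
        | 
        |     # OLS fit and residuals.
        |     m, b = np.polyfit(x, y, 1)
        |     resid = y - (m * x + b)
        | 
        |     def description_length_bits(
        |             resid, dist, params, n_params, precision_bits=32):
        |         # Two-part code: bits to encode the residuals under the
        |         # assumed distribution, plus a crude fixed cost for the
        |         # model parameters.
        |         nll_nats = -dist.logpdf(resid, *params).sum()
        |         data_bits = nll_nats / np.log(2)   # nats -> bits
        |         model_bits = n_params * precision_bits
        |         return data_bits + model_bits
        | 
        |     # Gaussian residual model: parameters m, b, sigma.
        |     sigma = resid.std()
        |     gauss_bits = description_length_bits(
        |         resid, norm, (0.0, sigma), n_params=3)
        | 
        |     # Laplace residual model: parameters m, b, scale.
        |     scale = np.abs(resid).mean()  # MLE of the Laplace scale
        |     lap_bits = description_length_bits(
        |         resid, laplace, (0.0, scale), n_params=3)
        | 
        |     # The smaller total is the better "compressor", which is
        |     # the criterion for preferring one residual model (and,
        |     # after refitting, one trendline) over another.
        |     print(f"Gaussian: {gauss_bits:.0f} bits")
        |     print(f"Laplace:  {lap_bits:.0f} bits")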
       | 
       | At the algorithmic level, this approach even applies to LLMs, but
       | it's a bit harder to use in practice because "sizeof(data |
       | model) + sizeof(model)" isn't the entire story here. Suppose you
       | had a "perfect" language model. In this case, you would achieve
       | minimization of K(data | training data), where K is Kolmogorov
       | complexity. In practice, what is being minimized upon the release
       | of each new LLM is "sizeof(data | LLM model) + sizeof(training
       | data | LLM model) + sizeof(LLM model)". You can assume
       | "sizeof(LLM model)" is the smallest Turing machine equivalent to
       | the LLM program.
        
       ___________________________________________________________________
       (page generated 2024-01-23 23:00 UTC)