[HN Gopher] Do simpler machine learning models exist and how can...
___________________________________________________________________
Do simpler machine learning models exist and how can we find them?
Author : luu
Score : 90 points
Date : 2022-12-22 18:56 UTC (4 hours ago)
(HTM) web link (statmodeling.stat.columbia.edu)
(TXT) w3m dump (statmodeling.stat.columbia.edu)
| cs702 wrote:
| _> I wonder whether it would make sense to separate the concepts
| of "simpler" and "interpretable."_
|
| Interesting. I was thinking the same, after coming across a
| preprint proposing a credit-assignment mechanism that seems to
| make it possible to build deep models in a way that enables
| interpretability: https://arxiv.org/abs/2211.11754 (please note:
| the results look interesting/significant to me, but I'm still
| making my way through the preprint and its accompanying code).
|
| Consider that our brains are incredibly complex organs, yet they
| are really good at answering questions in a way that other brains
| find interpretable. Meanwhile, large language models (LLMs) keep
| getting better and better at explaining their answers with
| natural language in a way that our brains find interpretable. If
| you ask ChatGPT to explain its answers, it will generate
| explanations that a human being can interpret -- even if the
| explanations are wrong!
|
| Could it be that "model simplicity" and "model interpretability"
| are actually _orthogonal_ to each other?
| joe_the_user wrote:
| I don't think anyone has come up with an unambiguous definition
| of "interpretable". I mean, often people assume that, for
| example, a statement like "it's a cat because it has fur,
| whiskers and pointy ears" is interpretable because it's a
| logical conjunction of conditions. But a logical conjunction of
| a thousand vague conditions could easily be completely opaque.
| It's a bit like the way SQL was originally promoted, years ago,
| as a "natural language interface": simple SQL statements do read
| a bit like natural language, but large SQL statements tend to be
| even more incomprehensible than ordinary computer programs.
|
| _If you ask ChatGPT to explain its answers, it will generate
| explanations that a human being can interpret -- even if the
| explanations are wrong!_
|
| The funny thing is that yeah, LLMs often come up with correct
| method-descriptions for wrong answers and wrong method-
| descriptions for right answers. Human language is quite
| slippery and humans do this too. Human beings tend to start
| loose but tighten things up over time - LLMs are kind of
| randomly tight and loose. Maybe this can be tuned but I think
| "lack of actual understanding" will make this difficult.
| kylevedder wrote:
| Humans give explanations that other humans find convincing, but
| those explanations are often mechanistically wrong, or entirely
| acausal.
|
| As a famous early example, a patient offered unprompted
| explanations for some of her preferences (using only the
| information available to the conscious part of her brain via
| her good eye), even though the actual mechanism was subconscious
| observation through her blind eye.
|
| https://www.nature.com/articles/336766a0
| not2b wrote:
| A key reason that we want models at least for some applications
| to be interpretable is to watch out for undesirable features.
| For example, suppose we want to train a model to figure out
| whether to grant or deny a loan, and we train it to match the
| decisions of human loan officers. Now, suppose it turns out
| that many loan officers have unconscious prejudices that cause
| them to deny loans more often to green people and grant loans
| more often to blue people (substitute whatever categories you
| like for blue/green). The model might wind up with an explicit
| weight that encodes this implicit discrimination. If the
| model is relatively small and interpretable, this weight can be
| found and perhaps eliminated.
|
| But if that model could chat with us it would replicate the
| speech of the loan officers, many of whom sincerely believe
| that they treat green people and blue people fairly. So
| interpretability can't be about somehow asking the model to
| justify itself. We may need the equivalent of a debugger.
| visarga wrote:
| The fact that both humans and LMs can give interpretable
| justifications makes me think intelligence was actually in the
| language. It comes from language learning and problem solving
| with language, and gets saved back into language as we validate
| more of our ideas.
| bilsbie wrote:
| I think you're on to something. I wonder if there's anyone
| working on this idea. I'd be curious to research it more.
| [deleted]
| aputsiak wrote:
| Petar Velickovic et al. have a concept of geometric deep
| learning; see this forthcoming book:
| https://geometricdeeplearning.com/ There is also the Categories
| for AI course (cats.for.ai), which deals with applying category
| theory to ML.
| satvikpendem wrote:
| I just submitted an article about a paper by Deepmind whose main
| conclusion is that "data, not size, is the currently active
| constraint on language modeling performance" [0]. This means that
| even if we have bigger models, with billions and trillions of
| parameters, they are unlikely to be better than our current ones,
| because our amount of data is the bottleneck.
|
| TFA also reminds me of the phenomenon in mathematical proofs
| where a long-winded proof appears first and then gets simplified
| over time as more mathematicians try to optimize it, much the
| way programmers work down technical debt ("make it work, make it
| right, make it fast"). The four color theorem is an example: its
| proof was computer-assisted until now, but it seems a non-
| computer-assisted proof is out [1].
|
| I wonder if the problem in TFA could itself be solved by machine
| learning, where models would create, train, and test other
| models, changing them along the way, similar to genetic
| programming but with "artificial selection" and not "natural
| selection" so to speak.
|
| [0] https://news.ycombinator.com/item?id=34098087
|
| [1] https://news.ycombinator.com/item?id=34082022
| bilsbie wrote:
| Yet humans learn on way less language data.
| Enginerrrd wrote:
| I disagree.
|
| On pure # of words, sure.
|
| But for humans, language is actually just a compressed version
| of reality perceived through multiple senses / prediction +
| observation cycles / model paradigms / scales / contexts /
| social cues, etc., and we get full access to the entire thing.
| So a single sentence is wrapped in orders of magnitude more
| data.
|
| We also get multiple modes of interconnected feedback. How to
| describe this? Let me use an analogy. In poker, a player's
| different statistical properties take different amounts of data
| to reach convergence: some become evident in 10s of hands, some
| take 100s, some take 1000s, and some even take 10,000s before
| you get over 90% confidence. And yet, if you let a good human
| player see your behavior on just one single hand that goes to
| showdown, they will be able to estimate your playing style, your
| skill, and where your stats will converge, with remarkable
| accuracy. They get to see how you acted pre-flop, on the flop,
| turn, and river, with the rich context of position, pot size,
| and what the other players were doing at those times, along with
| the stakes and location you're playing at, what you're wearing,
| how you move and handle your chips, etc.
| marcosdumay wrote:
| It's less data anyway. There's no way to add up the data a
| person senses and get to a volume anywhere near what's on the
| internet.
|
| But it may be better data.
| numpad0 wrote:
| We also eat. It feels to me that better food with divergent
| micronutrients has positive performance implications. Maybe
| I'm just schizophrenic, but to me it just feels that way.
| fshbbdssbbgdd wrote:
| Try training the model with free-range, locally-sourced
| electrons.
| wincy wrote:
| Couldn't we start a website that just has humans tag stuff for
| machine learning, and make that tagged data set open? Does such
| a thing exist? I've heard one issue with Stable Diffusion and
| others is that the LAION-5B dataset is of pretty terrible
| quality.
| hooande wrote:
| Two thoughts:
|
| People love to say "it's early" and "it will improve" about
| ChatGPT. But the amount of training data IS the dominant factor
| in determining the quality of the output, usually in logarithmic
| terms. It's already trained on the entire internet, and it's
| hard to see how they'll be able to significantly increase that.
|
| And having models build models is drastically overrated. Again,
| the accuracy/quality improvements are largely driven by the
| scale and diversity of the dataset; that's like 90% of the
| solution to any ML problem. Choosing the right model and
| parameters is often only a minor relative improvement.
| [deleted]
| tlarkworthy wrote:
| All the books. I think books might be better
| recuter wrote:
| There are only about 100m books (in English), roughly the same
| volume of text as the whole web, all told.
| MonkeyClub wrote:
| Generally a book has deeper thinking than a webpage,
| though; I think that's the crux of the GP's
| clarification.
| theGnuMe wrote:
| > it's hard to see how they'll be able to significantly
| increase that.
|
| With feedback.
| not2b wrote:
| Perhaps some sort of adversarial network approach could work
| better; models that learn to generate text and other models
| that try to distinguish AIs from humans, competing against
| each other. Also, children learning language benefit from
| constant feedback from people who have their best interest at
| heart ... that last part is important because of episodes
| like Microsoft's Tay where 4chan folks thought it would be
| fun to turn the chatbot into a fascist.
| geysersam wrote:
| The data is the bottleneck _for the current generation of
| models_. Better models/training strategies could very well
| change that in the next couple of decades.
| donkeyboy wrote:
| Yes, they exist, and they are called Linear Regression and
| Decision Tree. Not everything needs to be a neural network.
|
| Anyway, residual connections in NNs, as well as distillation
| being only a ~1% hit to performance, imply our models are way
| too big.
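|
| A minimal scikit-learn sketch (a toy example of my own) of what
| that interpretability buys you: the fitted model can be read off
| directly.
|
|   # pip install scikit-learn
|   from sklearn.datasets import load_diabetes
|   from sklearn.linear_model import LinearRegression
|   from sklearn.tree import DecisionTreeRegressor, export_text
|
|   X, y = load_diabetes(return_X_y=True, as_frame=True)
|
|   # Linear regression: each feature gets one readable coefficient.
|   linear = LinearRegression().fit(X, y)
|   print(dict(zip(X.columns, linear.coef_.round(2))))
|
|   # Shallow decision tree: the learned rules print as plain text.
|   tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
|   print(export_text(tree, feature_names=list(X.columns)))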
| PartiallyTyped wrote:
| > Anyway, residual connections in NNs as well as distillation
| being only a 1% hit to performance imply our models are way too
| big.
|
| I disagree with the conclusion.
|
| It indicates that our optimisers are just not good enough,
| likely because gradient descent is just weak.
|
| The argument for residual connections is that we can create a
| nested family of models, which lets us express more models while
| also embedding the smaller ones into the larger ones.
|
| The smaller models can be recovered if our model learns to
| produce the identity function at later layers.
|
| The problem, though, is that this is very difficult, meaning
| that our optimisers are simply not good enough at constructing
| identity functions. With residual layers, we embed the identity
| function into the structure of the model itself: since a
| residual block computes f(x) = x + g(x), we only need to learn
| g(x) = 0 to recover the identity.
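|
| A minimal PyTorch sketch of that structure (my own illustration,
| not from any particular paper): the block reduces to the identity
| exactly when g collapses to zero.
|
|   import torch
|   import torch.nn as nn
|
|   class ResidualBlock(nn.Module):
|       def __init__(self, dim):
|           super().__init__()
|           # g(x): the residual the optimiser actually has to learn
|           self.g = nn.Sequential(
|               nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
|           )
|
|       def forward(self, x):
|           # f(x) = x + g(x); if g(x) == 0, the block is the identity
|           return x + self.g(x)
|
|   block = ResidualBlock(16)
|   nn.init.zeros_(block.g[2].weight)   # zero the last layer of g...
|   nn.init.zeros_(block.g[2].bias)
|   x = torch.randn(4, 16)
|   assert torch.allclose(block(x), x)  # ...and the block is the identity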
|
| As for our optimisers being bad, the argument is that with an
| overparameterised network there is always a descent direction,
| but we land on local minima that are very close to the global
| one. A descent direction may exist within a given batch, but
| when considering all the batches together, we are at a local
| minimum.
|
| We can find many such local minima via certain symmetries.
|
| The general problem however is that even with the full dataset,
| we can only make local improvements in the landscape.
|
| Thus, the better models are embedded within the larger ones,
| and more parameters make them easier to find, because of the
| nested families, the symmetries, and the fact that there is
| always a descent direction.
| visarga wrote:
| > It indicates that our optimisers are just not good enough,
| likely because gradient descent is just weak.
|
| No, the networks are ok; what is wrong is the paradigm. If
| you want rule- or code-based exploration and learning, it is
| possible. You need to train a model to generate code from
| text instructions, then fine-tune it with RL on problem
| solving. The code generated by the model is interpretable and
| generalises better than running the computation in the network
| itself.
|
| Neural nets can also generate problems, tests, and evaluations
| of the test outputs. They can form a data-generation loop. As
| an analogy, AlphaGo generated its own training data through
| self-play and reached very strong skill.
| PartiallyTyped wrote:
| I did say that the networks are okay. In fact, I am arguing
| that the networks are even overcompensating for the
| weakness of the optimisers. Neural nets are great given
| that they are differentiable, so we can propagate gradients
| through them without affecting the parameters themselves.
|
| I don't think that this reply takes into consideration just
| _how_ inefficient RL and the like are. In fact, RL is so
| inefficient that the current SOTA in RL is ... causal
| transformers that perform in-context learning without
| gradient updates.
|
| Depending on the approach one takes with RL, be it policy
| gradients or value networks, it still relies on gradient
| descent (and backprop).
|
| Policy gradients are _just_ increasing the likelihood of
| useful actions given the current state. It's a likelihood
| model increasing probabilities based on observed random
| walks.
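|
| A bare-bones sketch of that likelihood bump (my own toy
| illustration of a REINFORCE-style update; note it still runs on
| plain gradient descent):
|
|   import torch
|   import torch.nn as nn
|
|   policy = nn.Linear(4, 2)    # toy policy: 4-dim state, 2 actions
|   opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
|
|   state = torch.randn(4)      # a fake observation
|   dist = torch.distributions.Categorical(logits=policy(state))
|   action = dist.sample()
|   ret = 1.0                   # pretend the rollout returned +1
|
|   # Push up the log-probability of the sampled action, scaled by
|   # the observed return.
|   loss = -dist.log_prob(action) * ret
|   opt.zero_grad()
|   loss.backward()
|   opt.step()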
|
| Value networks are even worse, because one needs not only to
| derive the quality of the behaviour but also to select an
| action.
|
| Sure enough, alternative methods exist, such as model-based
| RL, etc., and ChatGPT for example uses RL to train some value
| functions and learn how to rank options, but all of these
| rely on gradient descent.
|
| Gradient descent, especially stochastic, is just garbage
| compared to stuff that we have for fixed functions that are
| not very expensive to evaluate.
|
| With stochastic gradient descent, your loss landscape
| depends on the example or mini-batch, so a way to think
| about it is that the full landscape is a linear combination
| of the per-example losses, but at any time you observe only
| some of them and hope that the gradient doesn't mess things
| up too badly.
|
| But in general, gradient descent shows a linear convergence
| rate (cf. Nocedal et al.'s Numerical Optimization, or Boyd
| and Vandenberghe's proof where they bound the improvement of
| the iterates), and that's the best-case scenario (meaning
| non-stochastic, full-batch).
|
| Second-order methods can get a quadratic convergence rate,
| but they are prohibitively expensive for large models, or
| require Hessians (good luck lol).
|
| None of these, though, address the limitations imposed by
| loss functions, e.g. needing exponentially higher values to
| further increase a prediction optimised by cross-entropy
| (see the logarithm). Nor do they address the bound on the
| information that we have about the minima.
|
| So needing exponentially more steps (assuming each update
| is fixed in length) while relying on linear convergence is
| ... problematic, to say the least.
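|
| A tiny numerical illustration of "linear convergence" (my own
| toy example): plain gradient descent on f(x) = x^2 shrinks the
| error by a constant factor per step, so the number of correct
| digits grows only linearly with the iteration count.
|
|   x, lr = 1.0, 0.1
|   for i in range(10):
|       grad = 2 * x      # f'(x) = 2x
|       x -= lr * grad    # each step multiplies the error by (1 - 2*lr) = 0.8
|       print(i, x)       # 0.8, 0.64, 0.512, ... geometric, not quadratic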
| dr_dshiv wrote:
| Take the example of creating an accurate ontology. You could
| try to use a large language model to develop simpler, human-
| readable conceptual relations out of whatever mess of
| complexity currently constitutes an LLM concept. You could
| use ratings of the accuracy or reasonability of rules and
| cross-validated tests against the structure of human hand-
| crafted ontologies (i.e., iteratively derive Wikidata from LLMs
| trying to predict Wikidata).
| heyitsguay wrote:
| I think this is one of those issues where it's easy to observe
| from the sidelines that models "should" be smaller (it'd make
| my life a whole lot easier), but it's not so clear how to
| actually create small models that work as well as these larger
| models, without having the larger models first (as in
| distillation).
|
| If you have any ideas for doing better and aren't idly wealthy,
| I'd suggest pursuing them. Create a model that's within a
| percentage point or two of GPT3 on the big NLP benchmarks, and
| fame and fortune will be yours.
|
| [Edit] This of course only applies to domains like NLP or
| computer vision where neural networks have proven very hard to
| beat. If you're working on a problem that doesn't need deep
| learning to achieve adequate performance, don't use it!
| z3c0 wrote:
| I've always thought it was abundantly clear how to make
| smaller models perform as well as large models: keep labeling
| data and build a human-in-the-loop support process to keep it
| on track.
|
| My perspective is more pessimistic. I think people opt for
| huge unsupervised models because they believe that tuning a
| few thousand more input features is easier than labeling
| copious amounts of data. Plus (in my experience) supervised
| models often require a more involved understanding of the
| math, whereas there are so many NN frameworks that ask very
| little of their users.
| janef0421 wrote:
| Supervised models would also require a lot more human
| labour, and the goal of most machine learning projects is
| to achieve cost-savings by eliminating human labour.
| z3c0 wrote:
| Up front, yes, but long term I wholly disagree. A model
| that performs at 95% or higher will assuredly eliminate more
| human work than the labeling consumes, no matter how many
| interns you enlist to label the data.
| heyitsguay wrote:
| People have tried (and continue to try) that human-in-the-
| loop data growth. Basically any applied AI company is doing
| something like that every day, if they're getting their own
| training data in the course of business. It helps but it
| won't turn your bag-of-words model into GPT3.
|
| Companies like Google have even spent huge amounts of time
| and money on enormous labeled datasets -- JFT-300M or
| something like that for computer vision tasks which, as you
| might guess, is ~300M labeled images. It creates value, but
| it creates more value for larger models with higher capacity.
| mrguyorama wrote:
| It's almost like we have no clue what we are doing with NNs
| and are just tweaking knobs and hoping it works out in the
| end.
|
| And yet people still like to push this idea that we will
| magically and accidentally build a superintelligence on top
| of these systems. It's so frustrating how deep into its own
| koolaid the ML industry is. We don't even know how the brain
| learns, we don't understand intelligence, there's no valid
| reason to believe an NN "learns" the same way a human brain
| learns, and individual human neurons are infinitely more
| complex, and more capable of "learning", than even a single
| layer of an NN.
| heyitsguay wrote:
| As someone in the ML industry, who knows many people in the
| ML industry, we all know this. It's non-technical
| fundraisers that spread the hype, and non-technical
| laypeople that buy into it. Meanwhile, the folks building
| things and solving problems plug right along, aware of
| where limitations are and aren't.
| hooande wrote:
| > It's almost like we have no clue what we are doing with
| NN and are just tweaking knobs and hoping it works out in
| the end.
|
| No, we understand very well how NNs work. Look at
| PartiallyTyped's comment in this thread. It's a great
| explanation of the basic concepts behind modern machine
| learning.
|
| You're quite correct that modern neural networks have
| nothing to do with how the brain learns or with any kind of
| superintelligence. And people know this. But these
| technologies have valuable practical applications. They're
| good at what they were made to do.
| scrumlord wrote:
| [dead]
| tbalsam wrote:
| I recently released a codebase in beta that modernizes a tiny
| model that gets really good performance on CIFAR-10 in roughly
| 18.1 seconds on the right single GPU -- a number of years ago
| the world record was 10 minutes, itself down from several days
| a few years before that.
|
| While most of my work was porting and cleaning up certain parts
| of the code for a different purpose (a just-clone-and-hack
| experimentation workbench), I've spent years optimizing neural
| networks at a very fine-grained level, and many of the lessons
| learned in debugging this reflect that.
|
| Unfortunately, I believe there are fundamentally a few big
| NP-hard layers to this (at least two that I can define, and
| likely several smaller ones), but they are not hard blockers to
| progress. The model I mentioned above is extremely simple and
| carries little "extra fat" where it is not needed. Importantly,
| it also seems to have good gradient (and related) flow
| throughout, something that matters for a model to be able to
| learn quickly. There are a few reasonable priors, like
| initializing and freezing the first convolution to whiten the
| inputs based upon some statistics from the training data. That
| does a shocking amount of work in stabilizing and speeding up
| training.
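|
| A rough sketch of that whitening-init idea (a generic ZCA-style
| version written for illustration; the actual hlb-CIFAR10 code
| differs in its details):
|
|   import torch
|   import torch.nn as nn
|   import torch.nn.functional as F
|
|   def frozen_whitening_conv(train_images, k=2, eps=1e-2):
|       # train_images: (N, C, H, W) float tensor of training data
|       patches = F.unfold(train_images, k)       # (N, C*k*k, L)
|       d = patches.shape[1]
|       patches = patches.transpose(1, 2).reshape(-1, d)
|       patches = patches - patches.mean(0, keepdim=True)
|       cov = patches.T @ patches / (patches.shape[0] - 1)
|       evals, evecs = torch.linalg.eigh(cov)
|       # ZCA whitening matrix; each row becomes one conv filter
|       W = evecs @ torch.diag((evals + eps).rsqrt()) @ evecs.T
|       C = train_images.shape[1]
|       conv = nn.Conv2d(C, d, k, bias=False)
|       conv.weight.data = W.reshape(d, C, k, k)
|       conv.weight.requires_grad_(False)          # freeze it
|       return conv
|
|   conv = frozen_whitening_conv(torch.rand(512, 3, 32, 32))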
|
| Ultimately, the network is simple, and there are a number of
| other methods to help it reach near-SOTA, but they are as simple
| as can be. I think as this project evolves and we get nearer to
| the goal (<2 seconds in a year or two), we'll keep uncovering
| good puzzle pieces showing exactly what it is that's allowing
| such a tiny network to perform so well. There's a kind of
| exponential value to having ultra-short training times -- you can
| somewhat open-endedly barrage-test your algorithm, something
| that's already led to a few interesting discoveries that I'd like
| to refine before publishing to the repo.
|
| If you're interested, the code is below. The running code is a
| single .py file, with the upsides and downsides that come with
| that. If you have any questions, let me know! :D :))))
|
| https://github.com/tysam-code/hlb-CIFAR10
| danuker wrote:
| If interpretability is sufficiently important, you could
| straight-up search for mathematical formulae.
|
| My SymReg library pops to mind. I'm thinking of rewriting it in
| multithreaded Julia this holiday season.
|
| https://github.com/danuker/symreg
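|
| A minimal toy sketch of such a search (an illustration, not the
| SymReg API): enumerate a few candidate functional forms, fit
| each by least squares, and keep the best one.
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   x = np.linspace(0.1, 10, 200)
|   y = 3.0 * np.log(x) + 1.5 + rng.normal(0, 0.05, x.shape)
|
|   candidates = {                  # y ~ a*f(x) + b for a few f
|       "a*x + b": x,
|       "a*x**2 + b": x**2,
|       "a*log(x) + b": np.log(x),
|       "a*sqrt(x) + b": np.sqrt(x),
|   }
|
|   def fit_error(f):
|       a, b = np.polyfit(f, y, 1)  # least-squares fit of a and b
|       return np.mean((a * f + b - y) ** 2)
|
|   best = min(candidates, key=lambda name: fit_error(candidates[name]))
|   print("best form:", best)       # best form: a*log(x) + b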
| UncleOxidant wrote:
| Would be interested to see this in Julia.
| moelf wrote:
| https://github.com/MilesCranmer/SymbolicRegression.jl
| danuker wrote:
| Wow! I should probably join forces with this project
| instead.
| heyitsguay wrote:
| How often are closed-form equations actually useful for real-
| world problem domains? When I did my PhD in applied math, they
| mostly came up in abstracted toy problems. Then you get into
| real-world data, or a need for realistic modeling, and it's
| numerical methods everywhere.
| danuker wrote:
| I find them most useful when there are many variables, or
| when I can see there's a relationship but I don't feel like
| trying out equation forms manually.
|
| It is indeed of limited use, since often I can spot the
| relationship visually. And once I get the general equation I
| can easily transform the data to get a linear regression.
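|
| For instance (a toy example): if the relationship looks like a
| power law y ~ a * x**b, taking logs of both sides turns it into
| a straight line that ordinary linear regression can fit.
|
|   import numpy as np
|
|   rng = np.random.default_rng(1)
|   x = np.linspace(1, 50, 100)
|   y = 2.5 * x ** 1.7 * rng.lognormal(0, 0.02, x.shape)
|
|   # log(y) = log(a) + b*log(x): linear in log-log space
|   b, log_a = np.polyfit(np.log(x), np.log(y), 1)
|   print("a ~", np.exp(log_a), "b ~", b)   # close to 2.5 and 1.7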
| chimeracoder wrote:
| > How often are closed-form equations actually useful for
| real world problem domains? When i did my PhD in applied
| math, they mostly came up in abstracted toy problems. Then
| you get into the real world data or a need for realistic
| modeling and it's numerical methods everywhere.
|
| And closed-form equations are themselves almost always
| simplified or abstracted models derived from real-world
| observations.
| nsxwolf wrote:
| "black box models have led to mistakes in bail and parole
| decisions in criminal justice"
|
| Lolwut? Does your average person know machine learning is
| used to make these decisions _at all_?
| derbOac wrote:
| Does anyone have recommendations on papers on current definitions
| of interpretability and explainability?
| WhitneyLand wrote:
| Rather than a rigorous CS-oriented paper, it (the article
| referenced by Dr. Rudin) seems more like an editorial on the
| risks of using AI for consequential decisions. It proposes using
| simpler models and discusses the benefits of explainable vs.
| interpretable AI in these cases.
|
| However, it seems to deal more with problems of perception of AI
| and how things might be better in an ideal world than with
| presenting any specific results.
|
| Maybe I'm missing something; I'm not sure what the insight is
| here. I agree it's an important issue and a laudable goal.
| HWR_14 wrote:
| Isn't TikTok's recommendation engine famously a fairly simple
| machine learning model? Where simple means they really honed it
| down to the most important factors?
| fxtentacle wrote:
| We blow up model sizes to reduce the risk of overfitting and to
| speed up training. So yes, usually you can shrink the finished
| model by 99% with a bit of normalization, quantization, and
| sparsification.
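|
| A rough PyTorch sketch of two of those knobs (a toy example, not
| a 99% recipe by itself): magnitude-pruning most of the linear
| weights, then int8 dynamic quantization of the linear layers.
|
|   import torch
|   import torch.nn as nn
|   import torch.nn.utils.prune as prune
|
|   model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
|                         nn.Linear(256, 10))
|
|   # Sparsify: zero out the 90% smallest-magnitude weights per layer.
|   for m in model:
|       if isinstance(m, nn.Linear):
|           prune.l1_unstructured(m, name="weight", amount=0.9)
|           prune.remove(m, "weight")   # make the pruning permanent
|
|   # Quantize: run the linear layers in int8 at inference time.
|   small = torch.quantization.quantize_dynamic(
|       model, {nn.Linear}, dtype=torch.qint8
|   )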
|
| Also, plenty of "deep learning" tasks work equally well with
| decision trees if you use the right feature extractors.
| jakearmitage wrote:
| What are feature extractors?
| danuker wrote:
| I suspect features created manually from the data (as opposed
| to solely using the raw data): https://en.wikipedia.org/wiki/
| Feature_(computer_vision)#Extr...
| londons_explore wrote:
| Are people 'interpretable'?
|
| If you ask an art expert 'how much will this painting sell for at
| auction', he might reply '$450k'. And when questioned, he'll
| probably have a long answer about the brush strokes being more
| detailed than this other painting by the same artist, but it
| being worth less due to surface damage...
|
| If our 'black box' ML models could give a similar long answer
| when asked 'why', would that solve the need? Because ChatGPT is
| getting close to being able to do just that...
| ketralnis wrote:
| If you tell that same art expert that it actually sold for
| $200k, they'll happily give you a post-hoc justification for
| that too. ChatGPT is equally good at that: you can ask it all
| sorts of "why" questions about falsehoods and it will
| confidently muse along with the best armchair experts.
___________________________________________________________________
(page generated 2022-12-22 23:00 UTC)