[HN Gopher] Why do random forests work? They are self-regularizi...
___________________________________________________________________
Why do random forests work? They are self-regularizing adaptive
smoothers
Author : sebg
Score : 272 points
Date : 2024-10-17 21:24 UTC (3 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| PoignardAzur wrote:
| Reading the abstract alone, I have no idea whether it's talking
| about algorithmic trees or, like, the big brown things with small
| green bits.
| vatys wrote:
| I had the exact same reaction: biology or computers?
|
| The only hint I can see anywhere on the page is "Statistics >
| Machine Learning" above the abstract title.
|
| I really want it to be about actual biological trees being
| studied on the scale of forests growing with smooth edges over
| long periods of time, but I suspect that's not what it is
| about.
| bonoboTP wrote:
| There's also "Subjects: Machine Learning (stat.ML); Machine
| Learning (cs.LG)"
|
| Also, the very first sentence of the actual paper (after the
| abstract) is
|
| > Random forests (Breiman, 2001) have emerged as one of the
| most reliable off-the-shelf supervised learning algorithms
| [...]
|
| arxiv.org is overwhelmingly used for math and computer
| science papers, though not exclusively.
|
| The paper will also likely be submitted to a machine learning
| venue.
| ibgeek wrote:
| Biological trees don't make predictions. Second or third
| sentence contains the phrase "randomized tree ensembles not
| only make predictions."
| visarga wrote:
| Even single cells are able to sense and adapt to their
| environment. That is, to recognize and react.
| mhuffman wrote:
| It's talking about these[0]
|
| [0]https://en.wikipedia.org/wiki/Random_forest
| defjm wrote:
| Jeremy Howard has a great video explaining Random Forests:
| https://youtu.be/blyXCk4sgEg .
| krystofee wrote:
| You have to know some machine learning fundamentals to figure
| that out - "Random Forest" is a specific machine learning
| algorithm, which does not need further explanation. To take it a
| step further, they should really not have to describe "machine
| learning" either - no, it's not like the machine takes a book and
| learns; it's a term.
| bigmadshoe wrote:
| The tree is an incredibly common data structure in computer
| science. Decision trees are well known. Random forests are
| ubiquitous in Machine Learning. Should the authors really have
| to dumb their paper down so people who don't work in this
| domain avoid confusing it with work in arborism?
| avazhi wrote:
| Pretty sure the guy you're replying to was half-joking, but
| adding the words 'machine learning' in the first sentence
| would have cleared this up pretty simply and wouldn't have
| resulted in dumbing down anything.
| visarga wrote:
| > avoid confusing it with work in arborism
|
| funny!
| youoy wrote:
| Nice article, although I find it a bit overly complex if you are
| not familiar with ML and mathematics at the same time.
|
| I will leave here a geometrical intuition of why they work in
| case it can help someone:
|
| To simplify things, I will talk about regression, and say we want
| to predict some value y, that we know depends on x, and we have
| some noisy measurements of y.
|
|     y = f(x)
|     y_observed = f(x) + noise
|
| If you want some mental image, think about f(x)=sin(x).
|
| Now, an (overfitted) regression tree in this case is just a step
| function whose value at x is y_observed. If there is no noise,
| we know that by doing more measurements we can approximate y
| with as much precision as we want. But if there is noise, the
| regression tree will overfit the noisy values, creating some
| artificial spikes.
|
| If you want to avoid this overfitting, you resample the data many
| times, build a regression tree for each sample, and then average
| them. Every tree will contain its own particular artificial
| spikes, and since the spikes come from noise, they won't appear
| in the majority of the other trees. So when you average, the
| spikes attenuate, creating the smoother behaviour that the
| article talks about.
|
| I hope it helps!
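|
| (A minimal sketch of that picture, assuming scikit-learn and
| NumPy; the noisy sin(x) setup and the sample sizes are just
| illustrative, not anything from the paper.)
|
|     import numpy as np
|     from sklearn.tree import DecisionTreeRegressor
|
|     rng = np.random.default_rng(0)
|     x = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
|     y = np.sin(x).ravel() + rng.normal(0, 0.3, 200)  # y_observed = f(x) + noise
|     grid = np.linspace(0, 6, 500).reshape(-1, 1)
|     truth = np.sin(grid).ravel()
|
|     # One fully grown tree: a step function through the noisy
|     # points, with artificial spikes.
|     single = DecisionTreeRegressor().fit(x, y).predict(grid)
|
|     # Average many trees, each fit on a resample: each tree's
|     # spikes are different, so they attenuate in the average.
|     preds = []
|     for _ in range(200):
|         i = rng.integers(0, 200, 200)  # bootstrap resample
|         preds.append(DecisionTreeRegressor().fit(x[i], y[i]).predict(grid))
|     avg = np.mean(preds, axis=0)
|
|     print(np.abs(single - truth).mean())  # spiky single tree: larger error
|     print(np.abs(avg - truth).mean())     # averaged trees: smoother, smaller error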
| hammock wrote:
| Thanks, that helps. The way I think about your example is that
| it's like (not the same, obv) taking a bunch of moving averages
| of different durations at different starting points and throwing
| those into your regression model along with the actual data.
| dbetteridge wrote:
| So: overlay lots of really squiggly lines, and the average (most
| points in common) is the actual function you're looking for?
| math_dandy wrote:
| This is good intuition for why ensembling overparametrized models
| is a good idea. It doesn't speak to why ensembles of
| tree-structured estimators in particular perform so well compared
| to ensembles of other nonparametric estimators.
| youoy wrote:
| If you look at what makes it work well in the example, I
| would say it is being able to easily approximate a function
| with whatever degree of precision that you want, which
| translates to being able to isolate spikes in the
| approximation.
|
| For example, one could ask: what if, instead of an approximation
| by step functions, we use a piecewise linear approximation (which
| is just as good)? You can do that with a fully connected
| artificial neural network with ReLU nonlinearity, and if you
| check it experimentally, you will see that the results are
| equivalent.
|
| Why do people often use ensembles of tree structures? The
| ensembling part is built into the standard packages, which is not
| the case for ANNs, so it is quicker to experiment with. Apart
| from that, if you have some features that behave like categorical
| variables, trees also behave better in training.
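|
| (A rough way to check that claim experimentally, sketched with
| scikit-learn; the network size, sample count and noise level are
| arbitrary choices of mine.)
|
|     import numpy as np
|     from sklearn.ensemble import BaggingRegressor
|     from sklearn.neural_network import MLPRegressor
|     from sklearn.tree import DecisionTreeRegressor
|
|     rng = np.random.default_rng(0)
|     x = rng.uniform(0, 6, 300).reshape(-1, 1)
|     y = np.sin(x).ravel() + rng.normal(0, 0.3, 300)
|     grid = np.linspace(0, 6, 400).reshape(-1, 1)
|     truth = np.sin(grid).ravel()
|
|     # Averaged step functions: bagged, fully grown regression trees.
|     trees = BaggingRegressor(DecisionTreeRegressor(),
|                              n_estimators=100).fit(x, y)
|     # Piecewise-linear approximation: fully connected ReLU network.
|     mlp = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
|                        max_iter=5000, random_state=0).fit(x, y)
|
|     # The claim above is that these come out about the same.
|     print(np.abs(trees.predict(grid) - truth).mean())
|     print(np.abs(mlp.predict(grid) - truth).mean())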
| 3abiton wrote:
| It would also be interesting to see the limitations of random
| forests and where they struggle to learn and produce good
| results.
| nurettin wrote:
| I get pretty abysmal results for scarce categories even
| though they have well defined preconditions.
| sillying wrote:
| So it seems that when you have different sources of error,
| averaging them cancels the noise. I think some property of the
| error sources is necessary, though: in some sense the sources
| should be independent.
| levocardia wrote:
| Good to see more research exploring the connection between trees,
| ensembles, and smoothing. Way back in Trevor Hastie's ESL book
| there's a section on how gradient boosting using "stumps" (trees
| with only one split) is equivalent to an additive spline model
| (GAM, technically) with a step function as a spline basis and
| adaptive knot placement.
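|
| (A small illustration of that equivalence, assuming scikit-learn;
| the additive toy target is my own, not Hastie's. With max_depth=1
| every tree is a stump that splits on a single feature, so the
| boosted model is a sum of one-dimensional step functions with
| adaptively chosen split points.)
|
|     import numpy as np
|     from sklearn.ensemble import GradientBoostingRegressor
|
|     rng = np.random.default_rng(0)
|     X = rng.uniform(-2, 2, size=(1000, 2))
|     # Additive target: one smooth term per feature, plus noise.
|     y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 1000)
|
|     # Stumps only: each boosting round adds a one-feature step
|     # function, so the fit is an additive (GAM-like) model.
|     gam_like = GradientBoostingRegressor(max_depth=1, n_estimators=500,
|                                          learning_rate=0.1).fit(X, y)
|     print(gam_like.score(X, y))  # R^2 on the training data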
|
| I've always thought there should be a deep connection between
| ReLU neural nets and regularized adaptive smoothers as well,
| since the ReLU function is itself a spline basis (a so-called
| truncated linear spline) and happens to span the same functional
| basis as B-splines of the same degree.
| almostgotcaught wrote:
| One of my biggest pet peeves is flagrant overuse of "deep".
| Everything is so deep around here these days...
|
| > since the ReLU function is itself a spline basis (a so-called
| truncated linear spline) and happens to span the same
| functional basis as B-splines of the same degree.
|
| ... you literally just spelled out the entire "depth" of it.
| abhgh wrote:
| I don't know if you have come across this: _A Spline Theory of
| Deep Networks_ [1]. Has been on my To-Read list forever :(
|
| [1]
| http://proceedings.mlr.press/v80/balestriero18b/balestriero1...
| tylerneylon wrote:
| Here's some context and a partial summary (youoy also has a nice
| summary) --
|
| Context:
|
| A random forest is an ML model that can be trained to predict an
| output value based on a list of input features: eg, predicting a
| house's value based on square footage, location, etc. This paper
| focuses on regression models, meaning the output value is a real
| number (or a vector thereof). Classical ML theory suggests that
| models with many learned parameters are more likely to overfit
| the training data, meaning that when you predict an output for a
| test (non-training) input, the predicted value is less likely to
| be correct because the model is not generalizing well (it does
| well on training data, but not on test data - aka, it has
| memorized, but not understood).
|
| Historically, a surprise is that random forests can have many
| parameters yet don't overfit. This paper explores the surprise.
|
| What the paper says:
|
| The perspective of the paper is to see random forests (and
| related models) as _smoothers_, which is a kind of model that
| essentially memorizes the training data and then makes
| predictions by combining training output values that are relevant
| to the prediction-time (new) input values. For example, k-nearest
| neighbors is a simple kind of smoother. A single decision tree
| counts as a smoother because each final/leaf node in the tree
| predicts a value based on combining training outputs that could
| possibly reach that node. The same can be said for forests.
|
| So the authors see a random forest as a way to use a subset of
| training data and a subset of (or set of weights on) training
| features, to provide an averaged output. While a single decision
| tree can overfit (become "spiky") because some leaf nodes can be
| based on single training examples, a forest gives a smoother
| prediction function since it is averaging across many trees, and
| often other trees won't be spiky for the same input (their leaf
| node may be based on many training points, not a single one).
|
| Finally, the authors refer to random forests as _adaptive
| smoothers_ to point out that random forests become even better at
| smoothing in locations in the input space that either have high
| variation (intuitively, that have a higher slope), or that are
| far from the training data. The word "adaptive" indicates that
| the predicted function changes behavior based on the nature of
| the data -- eg, with k-NN, an adaptive version might increase the
| value of k at some places in the input space.
|
| The way random forests act adaptively is that (a) the prediction
| function is naturally more dense (can change value more quickly)
| in _areas of high variability_ because those locations will have
| more leaf nodes, and (b) the prediction function is typically a
| combination of a wider variety of possible values _when the input
| is far from the training data_ because in that case the trees are
| likely to provide a variety of output values. These are both ways
| to avoid overfitting to training data and to generalize better to
| new inputs.
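|
| (To make the smoother view concrete, here's a sketch, assuming
| scikit-learn, that recovers the weights a fitted forest places on
| the training outputs; bootstrap is turned off so that the
| weighted average reproduces the forest's prediction exactly.)
|
|     import numpy as np
|     from sklearn.ensemble import RandomForestRegressor
|
|     rng = np.random.default_rng(0)
|     X = rng.uniform(-2, 2, size=(400, 3))
|     y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, 400)
|
|     # bootstrap=False: every tree sees the full training set, so the
|     # leaf averages below are exactly the values the trees store.
|     forest = RandomForestRegressor(n_estimators=100, bootstrap=False,
|                                    max_features=1, random_state=0).fit(X, y)
|
|     x_new = np.array([[0.5, -1.0, 0.3]])
|     train_leaves = forest.apply(X)      # (n_train, n_trees) leaf ids
|     new_leaves = forest.apply(x_new)    # (1, n_trees)
|
|     # Per tree, a training point gets weight 1/leaf_size if it shares
|     # the new point's leaf, else 0; the forest averages over trees.
|     same_leaf = train_leaves == new_leaves
|     weights = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
|
|     print(weights @ y)               # weighted average of training outputs
|     print(forest.predict(x_new)[0])  # matches the forest's prediction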
|
| Disclaimer: I did not carefully read the paper; this is my quick
| understanding.
| abetusk wrote:
| I think this is specifically coming to terms with an insight
| that's taught to statisticians about a bias-variance tradeoff.
|
| From my understanding, in a statistical setting, low bias tends
| to come with high variance, whereas low variance tends to come
| with high bias. The example I saw was with K-means, where K = N
| gives high variance (the predicted cluster is highly variable)
| but low bias (take an input point, and you get that exact point
| back), vs. K = 1, which gives low variance (there's only one
| cluster) but high bias (the input point is far away from the
| cluster center/representative point).
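|
| (The tradeoff is easy to simulate; here's a sketch using k-NN
| regression rather than K-means, assuming scikit-learn, since it
| makes bias and variance straightforward to estimate by refitting
| on resampled noisy data.)
|
|     import numpy as np
|     from sklearn.neighbors import KNeighborsRegressor
|
|     rng = np.random.default_rng(0)
|     x_test = np.linspace(0, 6, 100).reshape(-1, 1)
|     truth = np.sin(x_test).ravel()
|
|     def fit_once(k):
|         x = rng.uniform(0, 6, 200).reshape(-1, 1)
|         y = np.sin(x).ravel() + rng.normal(0, 0.5, 200)
|         return KNeighborsRegressor(n_neighbors=k).fit(x, y).predict(x_test)
|
|     for k in (1, 25, 200):
|         preds = np.array([fit_once(k) for _ in range(200)])
|         bias_sq = ((preds.mean(axis=0) - truth) ** 2).mean()
|         variance = preds.var(axis=0).mean()
|         # k=1: low bias, high variance; k=200 (all points): the reverse.
|         print(k, round(bias_sq, 3), round(variance, 3))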
|
| I'm not sure I've characterized it well but there's a Twitter
| post from Alicia Curth that explains it [0] as well as a paper
| that goes into it [1].
|
| [0] https://x.com/AliciaCurth/status/1841817856142348529
|
| [1] https://arxiv.org/abs/2409.18842
| bonoboTP wrote:
| The same team wrote another interesting paper arguing that
| there's no "double descent" in linear regression, trees, and
| boosting, despite what many argued before (in that paper they
| don't tackle double descent in deep learning, but remark that the
| case may be similar there, with different complexity measures
| being conflated).
| selimthegrim wrote:
| https://arxiv.org/abs/2310.18988
| StableAlkyne wrote:
| Random forests are incredibly cool algorithms.
|
| The idea that you can take hundreds of bad models that overfit
| (the individual decision trees), add even more randomness by
| randomly picking training data and features*, and average them
| together - it's frankly amazing that this leads to consistently
| OK models. Often not the best, but rarely the worst. There's a
| reason they're such a common baseline to compare against.
|
| *Unless you're using Sklearn, whose implementation of
| RandomForestRegressor is, with its default settings, not a random
| forest. It's actually bagged trees, because it doesn't randomly
| select features at each split. Why they kept the misleading class
| name is beyond me.
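|
| (For reference, the relevant knob is max_features: with
| scikit-learn's defaults for regression, every split considers all
| features, and per-split feature subsampling has to be requested
| explicitly. The 1/3 fraction below is just Breiman's usual
| suggestion for regression, not an sklearn default.)
|
|     from sklearn.ensemble import RandomForestRegressor
|
|     # Default for regression: all features considered at every split,
|     # so the ensemble is effectively bagged trees.
|     bagged_trees = RandomForestRegressor(n_estimators=200, random_state=0)
|
|     # Breiman-style random forest: subsample features at every split
|     # (~p/3 is common for regression, sqrt(p) for classification).
|     random_forest = RandomForestRegressor(n_estimators=200,
|                                           max_features=1/3, random_state=0)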
| jncfhnb wrote:
| With a relatively small variation to make it gradient-boosted
| trees, it's pretty much as good as it gets for tabular data these
| days.
| hanslovsky wrote:
| TIL sklearn doesn't actually implement random forest. Good to
| know, thank you!
| StableAlkyne wrote:
| It does implement one, just the default settings do not.
| alexfromapex wrote:
| Quantized random sampling regression
| foundry27 wrote:
| I like this article. Randomness in system design is one of the
| most practical ways to handle the messiness of real-world inputs,
| and I think random forests nail this by using randomness to
| produce useful outputs for varied inputs without overfitting and
| to adapt to the unexpected. Yeah, you can always engineer a
| system that explicitly handles every possible situation, but the
| important question is "how long/costly will that process be?".
| Deterministic systems aren't good on that front, and when edge
| cases hit, sometimes those rigid models crack. Controlled
| randomness (load balancing, feature selection, etc.) makes
| systems more flexible and resilient. You don't get stuck in the
| same predictable ruts, and that's exactly why randomness works
| where pure determinism fails.
| SomaticPirate wrote:
| Any suggestions for a 101 introduction to random forests? In
| university I encountered some ML courses but never random
| forests.
| jph00 wrote:
| Yeah I did a deep dive on them here -- doesn't require any
| particular math background beyond high school:
| https://www.youtube.com/watch?v=blyXCk4sgEg
| sunshinesnacks wrote:
| Statquest is really good. Maybe a little basic if you've taken
| ML courses, though?
|
| https://youtube.com/playlist?list=PLblh5JKOoLUIE96dI3U7oxHaC...
| RandomThoughts3 wrote:
| Grab any ML book and read the chapter on random forests. If you
| have the maths background (which is not particularly demanding
| for random forests), which you should if you took ML courses,
| it's all going to be pretty straightforward. Someone in this
| thread already mentioned Hastie's _The Elements of Statistical
| Learning_, which you can download for free and would be a good
| start.
| __mharrison__ wrote:
| My book, Effective XGBoost, covers tree theory from stumps to
| decision trees, to bagging (random forests) to boosting.
| xiaodai wrote:
| XGBoost, LightGBM, CatBoost, and JLBoost?
| Iwan-Zotow wrote:
| Cool, more randomness at work
| oli5679 wrote:
| My understanding of why bagging works well is that it's a
| variance reduction technique.
|
| If you have a particular algorithm, the bias will not increase if
| you train n versions in an ensemble, but the variance will
| decrease, because anomalous observations won't persistently show
| up across the submodels' random samples and so won't persist
| through the bagging process.
|
| You can test this: the difference between train and test AUC will
| not increase dramatically as you increase the number of trees in
| an sklearn random forest for the same data and hyperparameters.
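|
| (A quick version of that check, sketched with scikit-learn on a
| synthetic dataset; the sizes and tree counts are arbitrary.)
|
|     from sklearn.datasets import make_classification
|     from sklearn.ensemble import RandomForestClassifier
|     from sklearn.metrics import roc_auc_score
|     from sklearn.model_selection import train_test_split
|
|     X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
|     X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
|
|     for n in (10, 50, 200, 800):
|         clf = RandomForestClassifier(n_estimators=n,
|                                      random_state=0).fit(X_tr, y_tr)
|         train_auc = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
|         test_auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
|         # The train/test gap should stay roughly flat as n grows.
|         print(n, round(train_auc, 3), round(test_auc, 3))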
___________________________________________________________________
(page generated 2024-10-20 23:02 UTC)