[HN Gopher] Why do random forests work? They are self-regularizi...
       ___________________________________________________________________
        
       Why do random forests work? They are self-regularizing adaptive
       smoothers
        
       Author : sebg
       Score  : 272 points
       Date   : 2024-10-17 21:24 UTC (3 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | PoignardAzur wrote:
       | Reading the abstract alone, I have no idea whether it's talking
       | about algorithmic trees or, like, the big brown things with small
       | green bits.
        
         | vatys wrote:
         | I had the exact same reaction: biology or computers?
         | 
         | The only hint I can see anywhere on the page is "Statistics >
         | Machine Learning" above the abstract title.
         | 
         | I really want it to be about actual biological trees being
         | studied on the scale of forests growing with smooth edges over
         | long periods of time, but I suspect that's not what it is
         | about.
        
           | bonoboTP wrote:
           | There's also "Subjects: Machine Learning (stat.ML); Machine
           | Learning (cs.LG)"
           | 
           | Also, the very first sentence of the actual paper (after the
           | abstract) is
           | 
           | > Random forests (Breiman, 2001) have emerged as one of the
           | most reliable off-the-shelf supervised learning algorithms
           | [...]
           | 
           | arxiv.org is overwhelmingly used for math and computer
           | science papers, though not exclusively.
           | 
           | The paper will also likely be submitted to a machine learning
           | venue.
        
           | ibgeek wrote:
            | Biological trees don't make predictions. The second or third
            | sentence contains the phrase "randomized tree ensembles not
            | only make predictions."
        
             | visarga wrote:
              | Even single cells are able to sense and adapt to their
              | environment. That is, to recognize and react.
        
         | mhuffman wrote:
         | It's talking about these[0]
         | 
         | [0]https://en.wikipedia.org/wiki/Random_forest
        
           | defjm wrote:
           | Jeremy Howard has a great video explaining Random Forests:
           | https://youtu.be/blyXCk4sgEg .
        
         | krystofee wrote:
          | You have to know some machine learning fundamentals to figure
          | that out - "Random Forest" is a specific machine learning
          | algorithm, which does not need further explanation. To take it
          | a step further, should they really have to explain "machine
          | learning" too? No, it's not as if the machine takes a book and
          | learns; it's an established term.
        
         | bigmadshoe wrote:
         | The tree is an incredibly common data structure in computer
         | science. Decision trees are well known. Random forests are
         | ubiquitous in Machine Learning. Should the authors really have
         | to dumb their paper down so people who don't work in this
         | domain avoid confusing it with work in arborism?
        
           | avazhi wrote:
           | Pretty sure the guy you're replying to was half-joking, but
           | adding the words 'machine learning' in the first sentence
           | would have cleared this up pretty simply and wouldn't have
           | resulted in dumbing down anything.
        
           | visarga wrote:
           | > avoid confusing it with work in arborism
           | 
           | funny!
        
       | youoy wrote:
        | Nice article, although I find it a bit overly complex if you
        | are not familiar with both ML and mathematics.
       | 
       | I will leave here a geometrical intuition of why they work in
       | case it can help someone:
       | 
       | To simplify things, I will talk about regression, and say we want
       | to predict some value y, that we know depends on x, and we have
       | some noisy measurements of y.
       | 
        | y = f(x)
        | 
        | y_observed = f(x) + noise
       | 
       | If you want some mental image, think about f(x)=sin(x).
       | 
        | Now, an (overfitted) regression tree in this case is just a step
        | function where the value at x is y_observed. If there is no
        | noise, we know that by taking more measurements we can
        | approximate y with as much precision as we want. But if there is
        | noise, the regression tree will overfit the noisy values,
        | creating artificial spikes.
       | 
        | If you want to avoid this overfitting, you resample the values
        | of X many times, build a regression tree for each sample, and
        | then average them. Every tree will contain its own particular
        | artificial spikes, and if they come from noise, they won't
        | appear in the majority of the other trees. So when you average,
        | the spikes attenuate, creating the smoother behaviour that the
        | article talks about.
       | 
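        | If you prefer code, here is a minimal sketch of that picture
        | (the data, the tree settings and the number of trees are just
        | illustrative): a single fully grown tree chases the noise,
        | while the average of many trees fit on bootstrap resamples is
        | much closer to sin(x).
        | 
        |   import numpy as np
        |   from sklearn.tree import DecisionTreeRegressor
        | 
        |   rng = np.random.default_rng(0)
        |   x = np.sort(rng.uniform(0, 2 * np.pi, 200))
        |   y_obs = np.sin(x) + rng.normal(scale=0.3, size=x.size)
        |   X = x.reshape(-1, 1)
        |   x_grid = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)
        |   truth = np.sin(x_grid.ravel())
        | 
        |   # One fully grown tree: a step function that interpolates
        |   # the noisy observations, so it has artificial spikes.
        |   single_tree = DecisionTreeRegressor().fit(X, y_obs)
        |   single = single_tree.predict(x_grid)
        | 
        |   # Average many trees, each fit on a bootstrap resample:
        |   # the spikes disagree between trees and cancel out.
        |   preds = []
        |   for _ in range(200):
        |       idx = rng.integers(0, x.size, x.size)
        |       tree = DecisionTreeRegressor().fit(X[idx], y_obs[idx])
        |       preds.append(tree.predict(x_grid))
        |   averaged = np.mean(preds, axis=0)
        | 
        |   print(np.abs(single - truth).mean())    # larger error
        |   print(np.abs(averaged - truth).mean())  # smaller error
        | 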
       | I hope it helps!
        
         | hammock wrote:
          | Thanks, that helps. The way I think about your example is that
          | it's like (not the same, obviously) taking a bunch of moving
          | averages of different durations at different starting points
          | and throwing those into your regression model along with the
          | actual data.
        
         | dbetteridge wrote:
          | So you overlay lots of really squiggly lines, and the average
          | (most points in common) is the actual function you're looking
          | for?
        
         | math_dandy wrote:
          | This is good intuition for why ensembling overparametrized
          | estimators is a good idea. It doesn't speak to why ensembles of
          | tree-structured estimators in particular perform so well
          | compared to ensembles of other nonparametric estimators.
        
           | youoy wrote:
            | If you look at what makes it work well in the example, I
            | would say it is being able to easily approximate a function
            | to whatever degree of precision you want, which translates
            | into being able to isolate spikes in the approximation.
           | 
            | For example, one could ask: what if, instead of an
            | approximation by step functions, we use a piecewise linear
            | approximation (which is just as good)? You can do that with
            | a fully connected artificial neural network with a ReLU
            | nonlinearity, and if you check it experimentally, you will
            | see that the results are equivalent.
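            | 
            | As a rough way to check that claim (model sizes, seeds and
            | noise level here are arbitrary), fit a small ReLU network
            | and a random forest on the same noisy sin(x) data and
            | compare held-out errors:
            | 
            |   import numpy as np
            |   from sklearn.ensemble import RandomForestRegressor
            |   from sklearn.neural_network import MLPRegressor
            | 
            |   rng = np.random.default_rng(0)
            |   x = rng.uniform(0, 2 * np.pi, 400)
            |   y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
            |   X = x.reshape(-1, 1)
            |   X_tr, y_tr = X[:300], y[:300]
            |   X_te, y_te = X[300:], y[300:]
            | 
            |   rf = RandomForestRegressor(n_estimators=300,
            |                              random_state=0)
            |   net = MLPRegressor(hidden_layer_sizes=(64,),
            |                      activation="relu", solver="lbfgs",
            |                      max_iter=5000, random_state=0)
            | 
            |   truth_te = np.sin(X_te.ravel())
            |   for model in (rf, net):
            |       model.fit(X_tr, y_tr)
            |       err = np.abs(model.predict(X_te) - truth_te).mean()
            |       print(type(model).__name__, round(err, 3))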
           | 
            | Why do people often use ensembles of trees? The ensembling
            | part is included in the standard packages, which is not the
            | case for ANNs, so it is quicker to experiment with. Apart
            | from that, if you have features that behave like categorical
            | variables, trees also behave better in training.
        
         | 3abiton wrote:
            | It would also be interesting to see the limitations of
            | random forests and where they struggle to learn and produce
            | good results.
        
           | nurettin wrote:
            | I get pretty abysmal results for rare categories, even
            | though they have well-defined preconditions.
        
         | sillying wrote:
          | So it seems that when you have different sources of error,
          | averaging them cancels the noise. I think some property of the
          | error sources is necessary; in some sense the sources should
          | be independent.
        
       | levocardia wrote:
       | Good to see more research exploring the connection between trees,
       | ensembles, and smoothing. Way back in Trevor Hastie's ESL book
       | there's a section on how gradient boosting using "stumps" (trees
       | with only one split) is equivalent to an additive spline model
       | (GAM, technically) with a step function as a spline basis and
       | adaptive knot placement.
       | 
       | I've always thought there should be a deep connection between
       | ReLU neural nets and regularized adaptive smoothers as well,
       | since the ReLU function is itself a spline basis (a so-called
       | truncated linear spline) and happens to span the same functional
       | basis as B-splines of the same degree.
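        | 
        | A tiny illustration of the ReLU-as-spline point (the knot
        | location here is arbitrary): relu(x - k) is exactly the
        | truncated linear basis function (x - k)_+, so a one-hidden-layer
        | ReLU network is a linear combination of such hinge functions,
        | i.e. a piecewise linear spline with learned knots.
        | 
        |   import numpy as np
        | 
        |   x = np.linspace(-2.0, 2.0, 9)
        |   knot = 0.5
        | 
        |   relu = np.maximum(0.0, x - knot)               # ReLU(x - k)
        |   truncated = np.where(x > knot, x - knot, 0.0)  # (x - k)_+
        |   assert np.allclose(relu, truncated)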
        
         | almostgotcaught wrote:
          | One of my biggest pet peeves is flagrant overuse of "deep".
          | Everything is so deep around here these days...
         | 
         | > since the ReLU function is itself a spline basis (a so-called
         | truncated linear spline) and happens to span the same
         | functional basis as B-splines of the same degree.
         | 
         | ... you literally just spelled out the entire "depth" of it.
        
         | abhgh wrote:
         | I don't know if you have come across this: _A Spline Theory of
         | Deep Networks_ [1]. Has been on my To-Read list forever :(
         | 
         | [1]
         | http://proceedings.mlr.press/v80/balestriero18b/balestriero1...
        
       | tylerneylon wrote:
       | Here's some context and a partial summary (youoy also has a nice
       | summary) --
       | 
       | Context:
       | 
       | A random forest is an ML model that can be trained to predict an
       | output value based on a list of input features: eg, predicting a
       | house's value based on square footage, location, etc. This paper
       | focuses on regression models, meaning the output value is a real
       | number (or a vector thereof). Classical ML theory suggests that
       | models with many learned parameters are more likely to overfit
       | the training data, meaning that when you predict an output for a
       | test (non-training) input, the predicted value is less likely to
       | be correct because the model is not generalizing well (it does
       | well on training data, but not on test data - aka, it has
       | memorized, but not understood).
       | 
       | Historically, a surprise is that random forests can have many
       | parameters yet don't overfit. This paper explores the surprise.
       | 
       | What the paper says:
       | 
       | The perspective of the paper is to see random forests (and
       | related models) as _smoothers_, which is a kind of model that
       | essentially memorizes the training data and then makes
       | predictions by combining training output values that are relevant
       | to the prediction-time (new) input values. For example, k-nearest
       | neighbors is a simple kind of smoother. A single decision tree
       | counts as a smoother because each final/leaf node in the tree
       | predicts a value based on combining training outputs that could
       | possibly reach that node. The same can be said for forests.
       | 
       | So the authors see a random forest as a way to use a subset of
       | training data and a subset of (or set of weights on) training
       | features, to provide an averaged output. While a single decision
        | tree can overfit (become "spiky") because some leaf nodes can be
        | based on single training examples, a forest gives a smoother
        | prediction function since it averages across many trees, and
        | often the other trees won't be spiky for the same input (their
        | leaf node may be based on many training points, not a single
        | one).
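        | 
        | A rough sketch of this "smoother" view in code (toy data; I
        | turn bootstrapping off and subsample features so the identity
        | below is exact; with bootstrapping it holds tree by tree over
        | each bootstrap sample): the forest's prediction at a new point
        | is a weighted average of training targets, weighted by how
        | often the point shares a leaf with each training example.
        | 
        |   import numpy as np
        |   from sklearn.ensemble import RandomForestRegressor
        | 
        |   rng = np.random.default_rng(0)
        |   X = rng.normal(size=(300, 3))
        |   y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)
        | 
        |   rf = RandomForestRegressor(n_estimators=100,
        |                              max_features=0.5,
        |                              bootstrap=False,
        |                              random_state=0).fit(X, y)
        | 
        |   x_new = rng.normal(size=(1, 3))
        |   train_leaves = rf.apply(X)    # (n_train, n_trees) leaf ids
        |   new_leaves = rf.apply(x_new)  # (1, n_trees)
        | 
        |   # Weight of training point i: average over trees of
        |   # 1[same leaf as x_new] / (size of that leaf).
        |   same = train_leaves == new_leaves
        |   leaf_sizes = same.sum(axis=0)
        |   weights = (same / leaf_sizes).mean(axis=1)
        | 
        |   print(weights @ y)           # weighted average of train y
        |   print(rf.predict(x_new)[0])  # same number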
       | 
       | Finally, the authors refer to random forests as _adaptive
       | smoothers_ to point out that random forests become even better at
       | smoothing in locations in the input space that either have high
       | variation (intuitively, that have a higher slope), or that are
       | far from the training data. The word "adaptive" indicates that
       | the predicted function changes behavior based on the nature of
       | the data -- eg, with k-NN, an adaptive version might increase the
       | value of k at some places in the input space.
       | 
       | The way random forests act adaptively is that (a) the prediction
       | function is naturally more dense (can change value more quickly)
       | in _areas of high variability_ because those locations will have
       | more leaf nodes, and (b) the prediction function is typically a
       | combination of a wider variety of possible values _when the input
       | is far from the training data_ because in that case the trees are
       | likely to provide a variety of output values. These are both ways
       | to avoid overfitting to training data and to generalize better to
       | new inputs.
       | 
       | Disclaimer: I did not carefully read the paper; this is my quick
       | understanding.
        
         | abetusk wrote:
         | I think this is specifically coming to terms with an insight
         | that's taught to statisticians about a bias-variance tradeoff.
         | 
          | From my understanding, in a statistical setting, lowering the
          | bias tends to increase the variance, whereas lowering the
          | variance tends to increase the bias. The example I saw was
          | with K-means, where K = N leads to high variance (the
          | predicted cluster is highly variable) but low bias (take an
          | input point, you get that exact input point back), vs. K = 1,
          | which gives low variance (there's only one cluster) but bad
          | bias (an input point is far away from the cluster
          | center/representative point).
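          | 
          | A rough sketch of the same tradeoff, using k-NN regression
          | instead of K-means since it is the standard regression
          | analogue (the data, noise level and k values are arbitrary):
          | k=1 gives low bias and high variance, k=n gives high bias and
          | low variance.
          | 
          |   import numpy as np
          |   from sklearn.neighbors import KNeighborsRegressor
          | 
          |   rng = np.random.default_rng(0)
          |   x_test = np.array([[np.pi / 4]])
          |   truth = np.sin(np.pi / 4)
          | 
          |   def preds_over_datasets(k, n_datasets=500, n=100):
          |       preds = []
          |       for _ in range(n_datasets):
          |           x = rng.uniform(0, 2 * np.pi, n)
          |           y = np.sin(x) + rng.normal(scale=0.3, size=n)
          |           knn = KNeighborsRegressor(n_neighbors=k)
          |           knn.fit(x.reshape(-1, 1), y)
          |           preds.append(knn.predict(x_test)[0])
          |       return np.array(preds)
          | 
          |   for k in (1, 100):   # k=1 vs k=n
          |       p = preds_over_datasets(k)
          |       print(k, "bias^2", round((p.mean() - truth) ** 2, 3),
          |             "variance", round(p.var(), 3))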
         | 
         | I'm not sure I've characterized it well but there's a Twitter
         | post from Alicia Curth that explains it [0] as well as a paper
         | that goes into it [1].
         | 
         | [0] https://x.com/AliciaCurth/status/1841817856142348529
         | 
         | [1] https://arxiv.org/abs/2409.18842
        
       | bonoboTP wrote:
        | The same team wrote another interesting paper arguing that
        | there's no "double descent" in linear regression, trees, and
        | boosting, despite what many argued before (in that paper they
        | don't tackle double descent in deep learning, but remark that
        | the case may be similar, with different complexity measures
        | being conflated).
        
         | selimthegrim wrote:
         | https://arxiv.org/abs/2310.18988
        
       | StableAlkyne wrote:
        | Random forests are incredibly cool algorithms.
        | 
        | The idea that you can take hundreds of bad models that overfit
        | (the individual decision trees), add even more randomness by
        | randomly picking training data and features*, and average them
        | together - it's frankly amazing that this leads to consistently
        | OK models. Often not the best but rarely the worst. There's a
        | reason they're such a common baseline to compare against.
       | 
        | *Unless you're using sklearn, whose RandomForestRegressor is,
        | by default, not a random forest but bagged trees: the default
        | max_features considers all features at every split. Why they
        | kept the misleading class name is beyond me.
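        | 
        | If that reading of the defaults is right, the fix is just to
        | set max_features yourself (a sketch; the exact value is a
        | tuning choice):
        | 
        |   from sklearn.ensemble import RandomForestRegressor
        | 
        |   # Default: all features at each split, i.e. bagged trees.
        |   bagged = RandomForestRegressor(n_estimators=500)
        | 
        |   # Subsample features at each split: a "true" random forest.
        |   forest = RandomForestRegressor(n_estimators=500,
        |                                  max_features="sqrt")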
        
         | jncfhnb wrote:
          | With a relatively small variation (gradient-boosted trees),
          | it's pretty much as good as it gets for tabular data these
          | days.
        
         | hanslovsky wrote:
          | TIL sklearn doesn't actually implement random forest. Good to
          | know, thank you!
        
           | StableAlkyne wrote:
           | It does implement one, just the default settings do not
        
       | alexfromapex wrote:
       | Quantized random sampling regression
        
       | foundry27 wrote:
        | I like this article. Randomness in system design is one of the
        | most practical ways to handle the messiness of real-world
        | inputs, and I think random forests nail this by using randomness
        | to produce useful outputs for varied inputs without overfitting,
        | and to adapt to the unexpected. Yeah, you can always engineer a
        | system that explicitly handles every possible situation, but the
        | important question is "how long/costly will that process be?"
        | Deterministic systems aren't good on that front, and when edge
        | cases hit, sometimes those rigid models crack. Controlled
        | randomness (load balancing, feature selection, etc.) makes
        | systems more flexible and resilient. You don't get stuck in the
        | same predictable ruts, and that's exactly why randomness works
        | where pure determinism fails.
        
       | SomaticPirate wrote:
        | Any suggestions for a 101 introduction to random forests? I
        | took some ML courses in university but never covered random
        | forests.
        
         | jph00 wrote:
         | Yeah I did a deep dive on them here -- doesn't require any
         | particular math background beyond high school:
         | https://www.youtube.com/watch?v=blyXCk4sgEg
        
         | sunshinesnacks wrote:
         | Statquest is really good. Maybe a little basic if you've taken
         | ML courses, though?
         | 
         | https://youtube.com/playlist?list=PLblh5JKOoLUIE96dI3U7oxHaC...
        
         | RandomThoughts3 wrote:
          | Grab any ML book and read the chapter on random forests. If
          | you have the maths background (which is not particularly
          | demanding for random forests), which you should if you took
          | ML courses, it's all going to be pretty straightforward. I
          | think someone already mentioned Hastie's _The Elements of
          | Statistical Learning_ in this thread, which you can download
          | for free and would be a good start.
        
         | __mharrison__ wrote:
          | My book, Effective XGBoost, covers tree theory from stumps to
          | decision trees, to bagging (random forests), to boosting.
        
       | xiaodai wrote:
        | XGBoost, LightGBM, CatBoost, and JLBoost?
        
       | Iwan-Zotow wrote:
        | Cool, more randomness at work.
        
       | oli5679 wrote:
        | My understanding is that bagging works well because it's a
        | variance reduction technique.
        | 
        | For a given base algorithm, the bias will not increase if you
        | train n versions in an ensemble, but the variance will decrease:
        | anomalous observations won't persistently show up in the
        | submodels' random samples, and so their influence won't persist
        | through the bagging process.
        | 
        | You can test this: the gap between train and test AUC will not
        | increase dramatically as you increase the number of trees in an
        | sklearn random forest on the same data and hyperparameters.
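        | 
        | A rough version of that check, as a sketch (synthetic data;
        | warm_start is used so trees are added incrementally; all
        | numbers are illustrative):
        | 
        |   from sklearn.datasets import make_classification
        |   from sklearn.ensemble import RandomForestClassifier
        |   from sklearn.metrics import roc_auc_score
        |   from sklearn.model_selection import train_test_split
        | 
        |   X, y = make_classification(n_samples=4000,
        |                              n_features=20,
        |                              n_informative=5,
        |                              random_state=0)
        |   split = train_test_split(X, y, random_state=0)
        |   X_tr, X_te, y_tr, y_te = split
        | 
        |   rf = RandomForestClassifier(warm_start=True, random_state=0)
        |   for n in (10, 50, 200, 800):
        |       rf.set_params(n_estimators=n)
        |       rf.fit(X_tr, y_tr)
        |       p_tr = rf.predict_proba(X_tr)[:, 1]
        |       p_te = rf.predict_proba(X_te)[:, 1]
        |       gap = (roc_auc_score(y_tr, p_tr)
        |              - roc_auc_score(y_te, p_te))
        |       # Gap stays roughly flat as trees are added.
        |       print(n, round(gap, 3))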
        
       ___________________________________________________________________
       (page generated 2024-10-20 23:02 UTC)