[HN Gopher] Why do random forests work? They are self-regularizi...
       ___________________________________________________________________
        
       Why do random forests work? They are self-regularizing adaptive
       smoothers
        
       Author : sebg
       Score  : 113 points
       Date   : 2024-10-17 21:24 UTC (2 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | PoignardAzur wrote:
       | Reading the abstract alone, I have no idea whether it's talking
       | about algorithmic trees or, like, the big brown things with small
       | green bits.
        
         | vatys wrote:
         | I had the exact same reaction: biology or computers?
         | 
         | The only hint I can see anywhere on the page is "Statistics >
         | Machine Learning" above the abstract title.
         | 
         | I really want it to be about actual biological trees being
         | studied on the scale of forests growing with smooth edges over
         | long periods of time, but I suspect that's not what it is
         | about.
        
           | bonoboTP wrote:
           | There's also "Subjects: Machine Learning (stat.ML); Machine
           | Learning (cs.LG)"
           | 
           | Also, the very first sentence of the actual paper (after the
           | abstract) is
           | 
           | > Random forests (Breiman, 2001) have emerged as one of the
           | most reliable off-the-shelf supervised learning algorithms
           | [...]
           | 
           | arxiv.org is overwhelmingly used for math and computer
           | science papers, though not exclusively.
           | 
           | The paper will also likely be submitted to a machine learning
           | venue.
        
           | ibgeek wrote:
            | Biological trees don't make predictions. The second or third
            | sentence contains the phrase "randomized tree ensembles not
            | only make predictions."
        
         | mhuffman wrote:
         | It's talking about these[0]
         | 
          | [0] https://en.wikipedia.org/wiki/Random_forest
        
           | defjm wrote:
           | Jeremy Howard has a great video explaining Random Forests:
           | https://youtu.be/blyXCk4sgEg .
        
         | krystofee wrote:
          | You have to know some machine learning fundamentals to figure
          | that out - "Random Forest" is the name of a specific machine
          | learning algorithm and doesn't need further explanation. By
          | the same logic, they shouldn't have to define "machine
          | learning" either; no, it's not as if the machine takes a book
          | and learns, it's a term of art.
        
         | bigmadshoe wrote:
         | The tree is an incredibly common data structure in computer
         | science. Decision trees are well known. Random forests are
         | ubiquitous in Machine Learning. Should the authors really have
         | to dumb their paper down so people who don't work in this
         | domain avoid confusing it with work in arborism?
        
           | avazhi wrote:
           | Pretty sure the guy you're replying to was half-joking, but
           | adding the words 'machine learning' in the first sentence
           | would have cleared this up pretty simply and wouldn't have
           | resulted in dumbing down anything.
        
       | youoy wrote:
        | Nice article, although I find it a bit overly complex if you are
        | not familiar with both ML and mathematics.
       | 
       | I will leave here a geometrical intuition of why they work in
       | case it can help someone:
       | 
       | To simplify things, I will talk about regression, and say we want
       | to predict some value y, that we know depends on x, and we have
       | some noisy measurements of y.
       | 
        | y = f(x)
        | y_observed = f(x) + noise
       | 
       | If you want some mental image, think about f(x)=sin(x).
       | 
        | Now, an (overfitted) regression tree in this case is just a step
        | function whose value at x is y_observed. If there is no noise, we
        | know that by taking more measurements we can approximate y with
        | as much precision as we want. But if there is noise, the
        | regression tree will overfit the noisy values, creating
        | artificial spikes.
       | 
        | To avoid this overfitting, you sample the values of X many times,
        | build a regression tree on each sample, and then average the
        | trees. Every tree will contain its own particular artificial
        | spikes, and since the spikes come from noise, they won't appear
        | in the majority of the other trees. So when you average, the
        | spikes attenuate, producing the smoother behaviour that the
        | article talks about.
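        | 
        | A minimal sketch of this picture (my own, assuming numpy and
        | scikit-learn's DecisionTreeRegressor; the constants are
        | arbitrary):
        | 
        |   import numpy as np
        |   from sklearn.tree import DecisionTreeRegressor
        | 
        |   rng = np.random.default_rng(0)
        |   x = np.sort(rng.uniform(0, 2 * np.pi, 200))
        |   y_observed = np.sin(x) + rng.normal(scale=0.3, size=x.shape)
        |   X = x.reshape(-1, 1)
        | 
        |   # One fully grown tree: a step function that chases the noise.
        |   single = DecisionTreeRegressor().fit(X, y_observed)
        | 
        |   # Average many trees, each grown on a bootstrap resample.
        |   grid = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)
        |   preds = []
        |   for _ in range(100):
        |       idx = rng.integers(0, len(x), len(x))
        |       tree = DecisionTreeRegressor().fit(X[idx], y_observed[idx])
        |       preds.append(tree.predict(grid))
        |   averaged = np.mean(preds, axis=0)
        | 
        |   # The average is much closer to sin(x) than the single tree.
        |   truth = np.sin(grid.ravel())
        |   print(np.abs(single.predict(grid) - truth).mean())
        |   print(np.abs(averaged - truth).mean())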
       | 
       | I hope it helps!
        
         | hammock wrote:
          | Thanks, that helps. The way I think about your example is that
          | it's like (not the same, obviously) taking a bunch of moving
          | averages of different durations at different starting points
          | and throwing those into your regression model along with the
          | actual data.
        
       | levocardia wrote:
       | Good to see more research exploring the connection between trees,
       | ensembles, and smoothing. Way back in Trevor Hastie's ESL book
       | there's a section on how gradient boosting using "stumps" (trees
       | with only one split) is equivalent to an additive spline model
       | (GAM, technically) with a step function as a spline basis and
       | adaptive knot placement.
       | 
       | I've always thought there should be a deep connection between
       | ReLU neural nets and regularized adaptive smoothers as well,
       | since the ReLU function is itself a spline basis (a so-called
       | truncated linear spline) and happens to span the same functional
       | basis as B-splines of the same degree.
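        | 
        | A small illustration of the stumps-are-additive point (my own
        | sketch, assuming scikit-learn; not from the paper). Depth-1
        | trees each split on a single feature, so the boosted model is a
        | sum of one-dimensional step functions, i.e. it has no
        | interaction terms:
        | 
        |   import numpy as np
        |   from sklearn.ensemble import GradientBoostingRegressor
        | 
        |   rng = np.random.default_rng(0)
        |   X = rng.uniform(-2, 2, size=(2000, 2))
        |   y = np.sin(X[:, 0]) + X[:, 1] ** 2   # additive target
        | 
        |   # "Stumps": max_depth=1, so each tree uses one feature only.
        |   model = GradientBoostingRegressor(max_depth=1,
        |                                     n_estimators=500,
        |                                     learning_rate=0.1).fit(X, y)
        | 
        |   # For a purely additive f, f(a,b) + f(c,d) == f(a,d) + f(c,b).
        |   a, b, c, d = rng.uniform(-2, 2, size=4)
        |   f = lambda u, v: model.predict(np.array([[u, v]]))[0]
        |   print(f(a, b) + f(c, d), "vs", f(a, d) + f(c, b))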
        
       | tylerneylon wrote:
       | Here's some context and a partial summary (youoy also has a nice
       | summary) --
       | 
       | Context:
       | 
       | A random forest is an ML model that can be trained to predict an
       | output value based on a list of input features: eg, predicting a
       | house's value based on square footage, location, etc. This paper
       | focuses on regression models, meaning the output value is a real
       | number (or a vector thereof). Classical ML theory suggests that
       | models with many learned parameters are more likely to overfit
       | the training data, meaning that when you predict an output for a
       | test (non-training) input, the predicted value is less likely to
       | be correct because the model is not generalizing well (it does
       | well on training data, but not on test data - aka, it has
       | memorized, but not understood).
       | 
        | Historically, the surprise is that random forests can have many
        | parameters yet don't overfit. This paper explores that surprise.
       | 
       | What the paper says:
       | 
       | The perspective of the paper is to see random forests (and
       | related models) as _smoothers_, which is a kind of model that
       | essentially memorizes the training data and then makes
       | predictions by combining training output values that are relevant
       | to the prediction-time (new) input values. For example, k-nearest
       | neighbors is a simple kind of smoother. A single decision tree
       | counts as a smoother because each final/leaf node in the tree
       | predicts a value based on combining training outputs that could
       | possibly reach that node. The same can be said for forests.
       | 
       | So the authors see a random forest as a way to use a subset of
       | training data and a subset of (or set of weights on) training
       | features, to provide an averaged output. While a single decision
       | tree can overfit (become "spikey") because some leaf nodes can be
       | based on single training examples, a forest gives a smoother
       | prediction function since it is averaging across many trees, and
       | often other trees won't be spikey for the same input (their leaf
       | node may be based on many training points, not a single one).
       | 
       | Finally, the authors refer to random forests as _adaptive
       | smoothers_ to point out that random forests become even better at
       | smoothing in locations in the input space that either have high
       | variation (intuitively, that have a higher slope), or that are
       | far from the training data. The word "adaptive" indicates that
       | the predicted function changes behavior based on the nature of
       | the data -- eg, with k-NN, an adaptive version might increase the
       | value of k at some places in the input space.
       | 
       | The way random forests act adaptively is that (a) the prediction
       | function is naturally more dense (can change value more quickly)
       | in _areas of high variability_ because those locations will have
       | more leaf nodes, and (b) the prediction function is typically a
       | combination of a wider variety of possible values _when the input
       | is far from the training data_ because in that case the trees are
       | likely to provide a variety of output values. These are both ways
       | to avoid overfitting to training data and to generalize better to
       | new inputs.
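        | 
        | To make the "smoother" view concrete, here is a small sketch (my
        | own, assuming scikit-learn) that recovers a forest's prediction
        | as an explicit weighted average of the training targets. I use
        | bootstrap=False so each leaf value is a mean over original
        | training rows, which keeps the bookkeeping exact; tree diversity
        | then comes from max_features:
        | 
        |   import numpy as np
        |   from sklearn.ensemble import RandomForestRegressor
        | 
        |   rng = np.random.default_rng(0)
        |   X = rng.uniform(-3, 3, size=(300, 3))
        |   y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.2, 300)
        | 
        |   forest = RandomForestRegressor(n_estimators=200,
        |                                  max_features=2,
        |                                  bootstrap=False,
        |                                  random_state=0).fit(X, y)
        | 
        |   x_new = np.array([[0.5, -1.0, 2.0]])
        |   train_leaves = forest.apply(X)      # (n_train, n_trees)
        |   new_leaves = forest.apply(x_new)    # (1, n_trees)
        | 
        |   # Each tree predicts the mean y of the training points sharing
        |   # the new point's leaf; averaging over trees gives one weight
        |   # per training point.
        |   n_trees = forest.n_estimators
        |   weights = np.zeros(len(X))
        |   for t in range(n_trees):
        |       in_leaf = train_leaves[:, t] == new_leaves[0, t]
        |       weights[in_leaf] += 1 / (in_leaf.sum() * n_trees)
        | 
        |   print(weights @ y)               # smoother-style prediction
        |   print(forest.predict(x_new)[0])  # matches the forest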
       | 
       | Disclaimer: I did not carefully read the paper; this is my quick
       | understanding.
        
         | abetusk wrote:
         | I think this is specifically coming to terms with an insight
         | that's taught to statisticians about a bias-variance tradeoff.
         | 
          | From my understanding, in a classical statistical setting, low
          | bias tends to come with high variance, whereas low variance
          | tends to come with high bias. The example I saw was with
          | K-means, where K = N leads to high variance (the predicted
          | cluster is highly variable from sample to sample) but low bias
          | (take an input point and you get that exact point back), vs.
          | K = 1, which has low variance (there's only one cluster) but
          | bad bias (an input point can be far away from the cluster
          | center/representative point).
         | 
         | I'm not sure I've characterized it well but there's a Twitter
         | post from Alicia Curth that explains it [0] as well as a paper
         | that goes into it [1].
         | 
         | [0] https://x.com/AliciaCurth/status/1841817856142348529
         | 
         | [1] https://arxiv.org/abs/2409.18842
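          | 
          | Not the clustering example, but a quick way to see the
          | tradeoff numerically is k-nearest-neighbour regression (my own
          | sketch, assuming scikit-learn): k=1 gives low bias and high
          | variance, k=n gives low variance and high bias.
          | 
          |   import numpy as np
          |   from sklearn.neighbors import KNeighborsRegressor
          | 
          |   rng = np.random.default_rng(0)
          | 
          |   def fit_and_predict(k, x0=0.25, n=100):
          |       # Train k-NN on a fresh noisy sample, predict at x0.
          |       x = rng.uniform(0, 1, n)
          |       y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
          |       model = KNeighborsRegressor(n_neighbors=k)
          |       model.fit(x.reshape(-1, 1), y)
          |       return model.predict([[x0]])[0]
          | 
          |   truth = np.sin(2 * np.pi * 0.25)   # = 1.0
          |   for k in (1, 10, 100):
          |       preds = np.array([fit_and_predict(k)
          |                         for _ in range(200)])
          |       print(f"k={k:3d}"
          |             f"  bias^2={(preds.mean() - truth)**2:.3f}"
          |             f"  variance={preds.var():.3f}")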
        
       | bonoboTP wrote:
       | The same team wrote another interesting paper arguing that
       | there's no "double descent" in linear regression, trees, and
        | boosting, despite what many argued before (that paper doesn't
        | tackle double descent in deep learning, but remarks that the
        | situation may be similar, with different notions of complexity
        | being conflated).
        
       | StableAlkyne wrote:
        | Random forests are incredibly cool algorithms.
       | 
        | The idea that you can take hundreds of bad models that overfit
        | (the individual decision trees), add even more randomness by
        | randomly picking training data and features*, and average them
        | together - it's frankly amazing that this leads to consistently
        | OK models. Often not the best, but rarely the worst. There's a
        | reason they're such a common baseline to compare against.
       | 
        | *Unless you're using sklearn, whose RandomForestRegressor is not
        | really a random forest out of the box: with the default
        | max_features it considers every feature at every split, so it's
        | actually bagged trees. Why they kept the misleading class name
        | is beyond me.
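        | 
        | A minimal sketch of the difference (assuming scikit-learn; the
        | dataset is synthetic and the 1/3 fraction is just a commonly
        | suggested value for regression):
        | 
        |   import numpy as np
        |   from sklearn.datasets import make_regression
        |   from sklearn.ensemble import RandomForestRegressor
        |   from sklearn.model_selection import cross_val_score
        | 
        |   X, y = make_regression(n_samples=500, n_features=20,
        |                          noise=10.0, random_state=0)
        | 
        |   # Default max_features=1.0: every split sees all features, so
        |   # the only randomness is the bootstrap -- i.e. bagged trees.
        |   bagged = RandomForestRegressor(n_estimators=200,
        |                                  random_state=0)
        | 
        |   # Per-split feature subsampling, closer to Breiman's recipe.
        |   forest = RandomForestRegressor(n_estimators=200,
        |                                  max_features=1 / 3,
        |                                  random_state=0)
        | 
        |   print(np.mean(cross_val_score(bagged, X, y, cv=5)))
        |   print(np.mean(cross_val_score(forest, X, y, cv=5)))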
        
       | alexfromapex wrote:
       | Quantized random sampling regression
        
       ___________________________________________________________________
       (page generated 2024-10-19 23:00 UTC)