[HN Gopher] Why do tree-based models still outperform deep learn...
___________________________________________________________________
Why do tree-based models still outperform deep learning on tabular
data?
Author : isolli
Score : 194 points
Date : 2022-08-03 16:08 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| BenoitEssiambre wrote:
| I like decision trees and this helps support my case for using
| them. I often go even further and don't use an algorithm to build
| the trees but build trees myself along intuitive causal lines and
| use the data to train its parameters. I sometimes build a few
| models manually and see which fits better with the data.
|
| Prior knowledge can prevent the pitfalls of automatically built
| models.
|
| Trees may be better than NNs because they overfit less but you
| can overfit even less with a bespoke model. For example, I've
| seen an automatically generated tree made to tune the efficiency
| of a factory end up using as a main feature "be after a specific
| date" because a machine was upgraded on that date and so the
| learning algorithm latched on to that unactionable piece of data
| as a main predictor for the model.
|
| This was an easy fix, to not feed timestamp data to the model but
| there are lots of more subtle cases like this and I've seen
| people spend much time cleaning and "tweaking" the input data to
| get the answers they want out of their ML models.
|
| If you have to try to make your ML model behave by manually
| selecting what data to feed it, you might as well go all the way
| and build a clean causal model yourself that reflect your priors
| and domain knowledge about the subject.
|
| I have an ML background but I often get more performance out of
| my models by doing something along the lines of what a Bayesian
| statistician would do.
|
| Of course with highly dimensional data like pixels in images you
| almost have no choice to use NNs. There's no way to hand-build
| these models.
| ersiees wrote:
| Our lab works on changing this. I think it might still take some
| years for a full solution, but so far we are successful with NNs
| for small datasets with meta-learning
| (https://arxiv.org/abs/2207.01848) and large datasets with
| regularisation (https://arxiv.org/abs/2106.11189). The first is
| very new, but the second is also cited in this paper, but they
| didn't run it as a baseline.
| civilized wrote:
| Why not run your own method on their benchmark? If your method
| is general-purpose, it should be very easy. And it would be
| very impressive to see your method succeed on a task that was
| not chosen by you to write your paper.
| ersiees wrote:
| Yes, good point! The paper I am responsible for is targeted
| at smaller datasets, but I will propose this to the authors
| of the second paper :)
| freemint wrote:
| Tree based algorithms have the advantage that global or robust
| optimization of them is possible. Do your approaches offer
| similar guarantees?
| ersiees wrote:
| The first work, I linked, arguably gets around this issue
| completely. There we train a neural network to approximate
| Bayesian prediction directly. It accepts the dataset as
| input. Thus, there is no optimisation procedure per dataset,
| just a forward pass. So far, this seems to be pretty stable.
| isolli wrote:
| Related, an article from 2019 [0] on how neural network finally
| beat statistical models (e.g. ARIMA) in time-series forecasting.
|
| [0] https://towardsdatascience.com/n-beats-beating-
| statistical-m...
| nomel wrote:
| I can only assume this was achieved years ago, as a black
| project in the financial market.
| jmmcd wrote:
| A great paper and an important result.
|
| However, it omits to cite the highly relevant SRBench paper from
| 2021, which also carefully curates a suitable set of regression
| benchmarks and shows that Genetic Programming approaches also
| tend to be better than deep learning.
|
| https://github.com/cavalab/srbench
|
| cc u/optimalsolver
| CapmCrackaWaka wrote:
| I have a theory - tree based models require minimal feature
| engineering. They are capable of handling categorical data in
| principled ways, they can handle the most
| skewed/multimodal/heteroskedastic continuous numeric data just as
| easily as a 0-1 scaled normal distribution, and they are easy to
| regularize compared to a DL model (which could have untold
| millions of possible parameter combinations, let alone getting
| the thing to train to a global optimum).
|
| I think if you spent months getting your data and model structure
| to a good place, you could certainly get a DL model to out-
| perform a gradient boosted tree. But why do that, when the GBT
| will be done today?
| [deleted]
| username_exists wrote:
| occam's razor
| oofbey wrote:
| I think you're on the right track that trees are good at
| feature engineering. But the key problem is that DL researchers
| are horrible at feature engineering, because they have never
| had to do it. These folks included.
|
| The feature engineering they do here is absolutely horrible!
| They use a QuantileTransform and that's it. They don't even
| tune the critical hyper parameter of the number of quantiles.
| Do they always use the scikitlearn default of 1,000 quantiles?
| No wonder uninformative features are hurting- they are getting
| expanded into 1000 even more uninformative features! Also with
| a single quantile transform like that, the relative values of
| the quantiles are completely lost! If the values 86 and 87 fall
| into different bins, the model has literally no information
| that the two bins are similar to each other, or even that they
| come from the same raw input.
|
| For a very large dataset a NN would learn its way around this
| kind of bone headed mistake. But for this size dataset, these
| researchers have absolutely crippled the nets with this
| thoughtless approach to feature engineering.
|
| There is plenty more to criticize about their experiments, but
| it's probably less important. E.g. Their HP ranges are too
| small to allow for the kind of nets that are known to work best
| in the modern era (after Double Descent theory has been worked
| out) - large heavily regularized nets. They don't let the nets
| get very big and they don't let the regularization get nearly
| big enough.
|
| So. Bad comparison. But it's also very true that XGB "just
| works" most of the time. NN's are finicky and complicated and
| very few people really understand them well enough to apply
| them to novel situations. Those who do are working on fancy AI
| problems, not writing poor comparison papers like this one.
| a-dub wrote:
| this is along the lines of my thinking. people organize and
| summarize data before throwing it into spreadsheets, where deep
| learning models do their thing by generating new
| representations from raw data.
|
| in a sense, most data in spreadsheets is compressed and deep
| learning models prefer to find their own compression that best
| suits the task at hand.
|
| or in human terms: "these spreadsheets are garbage. i can't
| work with this. can you bring me the raw data please?" :)
| ellisv wrote:
| I agree. The majority of DL layers are about feature
| engineering, not performing classification.
| wpietri wrote:
| Could you say more about this?
|
| One of the things that interests me about nominal AI
| applications is the extent to which they're sort of a
| Mechanical Turk or what I've heard called Artificial
| Artificial Intelligence. By which I mean it's sold as
| computer magic, but most of the magic is actually humans
| sneaking in a lot of human judgement. That can come through
| humans directly massaging the output or through human-driven
| selection of results. But I've also been wondering to what
| extent natural human intelligence is getting put in at the
| lower layers of the system, like feature engineering.
| drzoltar wrote:
| I think another aspect is that most modern GBT models prefer
| the _entire_ dataset to be in memory, thereby doing a full scan
| of the data for each iteration to calculate the optimal split
| point. That's hard to compete with if your batch size is small
| in a NN model.
| a-dub wrote:
| that's an interesting idea. but at the end of the paper they
| do an analysis of the effect of different hyperparameters for
| the nets with their dataset and find that the batch size
| doesn't seem to matter much. (although they're trying size
| ranges like [256, 512, 1024] as opposed to turning batching
| off entirely)
| alexcnwy wrote:
| The issue isn't batch size as a parameter but rather
| needing to load the entire dataset into memory
| a-dub wrote:
| > thereby doing a full scan of the data for each
| iteration to calculate the optimal split point
|
| > (although they're trying size ranges like [256, 512,
| 1024] as opposed to turning batching off entirely)
|
| > The issue isn't batch size as a parameter but rather
| needing to load the entire dataset into memory
|
| what's stored in memory is an implementation detail. the
| key idea is that the tree algorithms are choosing an
| optimal based on the entire dataset, where sgd is working
| on small randomly chosen batches. turning off batching
| means computing gradients on the entire dataset instead.
|
| although the typical bottleneck in gpu computing is
| moving data to and from the gpu's workarea (which is
| probably why you mention memory), there is nothing
| theoretical that says these computations could not be
| implemented in a streaming manner.
| moffkalast wrote:
| Well iirc a convenient trait of a random forest classifier is
| that it cannot overfit onto learning data. Something that's not
| exactly true for deep learning.
| omegalulw wrote:
| Any reference for this claim? In my opinion this is most
| certainly not the case - random forests are hard to overfit
| compared to gradient boosted trees but you can overfit with
| them too if you don't tune your parameters right.
|
| Overfitting is generally a function of the the size of your
| data and the complexity expressible in your model.
| moffkalast wrote:
| My reference would be my machine learning prof mentioning it
| at the lectures a few years ago, something about
| decorrelation as the other guy in the replies mentions. It's
| possible he meant hard to overfit in comparison to something,
| not completely impossible.
| sweezyjeezy wrote:
| If you keep adding more and more trees to a random forest,
| variance will go down (they stop being 'random' and become
| more like the average over all possible trees). In this
| respect they can't overfit, but in practice there are other
| tweaks you can make such as tree depth and number of
| features) that you often want to tweak to increase model
| performance. These can certainly cause orverfitting.
| taylorius wrote:
| A random forest is an ensemble model - i.e. lots of instances
| of a model, all separately fitted to the data, with some
| randomness so they're all different (e.g. bootstrapping,
| random subset of the data used to train etc). These models
| are then all used to classify, and their results averaged.
| This averaging will work to mitigate any overfitting that a
| single model might suffer from.
| ellisv wrote:
| RFs _can_ overfit but in practice usually won 't.
|
| I think Hastie & Tibshirani have an article on this.
| lupire wrote:
| taeric wrote:
| I'm curious on this claim. Feels like any model can over fit.
| bbstats wrote:
| Random Forests can overfit, but with sufficient data (and no
| data leakage) it is difficult to do so.
|
| The final output of a random forest regression for example is
| just the average of all the final prediction nodes of the
| individual trees, which thanks to bootstrapping are
| decorrelated. So - adding more trees does not tend to alter
| predictions very much.
| marcosdumay wrote:
| A model can only overfit if it has enough parameters. But
| indeed, any model with enough parameters can overfit.
|
| The thing about deep learning is that the number of
| parameters is astronomical.
| disgruntledphd2 wrote:
| Hence, the interesting question is why deep learning models
| don't overfit all the time.
| DougBTX wrote:
| This is from 2019 so I suspect there's a more recent
| paper now, but this is interesting:
| https://openai.com/blog/deep-double-descent/
| disgruntledphd2 wrote:
| Agreed, but it doesn't really explain it in any real
| sense.
| marcosdumay wrote:
| I really like how AI research is all about empirical
| analysis of mathematics.
| melony wrote:
| If you train your ensemble long enough, won't it overfit too?
| empyrrhicist wrote:
| No, it's not sequential. For a given number of bootstrap
| replicates you can over fit by using trees that are too
| complex, but you won't overfit in the number of bootstrap
| replicates like you would with Gradient Boosted Trees.
| melony wrote:
| Can you elaborate? I am unfamiliar with gradient boosted
| trees.
| leto_ii wrote:
| I think the idea is that gbt's are trained sequentially
| on the "residuals" left from the previous tree, while
| rf's are uncorrelated (can be trained in parallel, don't
| depend on one another). This means that in rf's there's
| no "compounding" effect that can lead to overfitting. The
| main way to overfit rf's would be to train many large
| trees.
| VoVAllen wrote:
| bilsbie wrote:
| I think K nearest neighbor is an underrated learning model.
|
| Assuming you can define a good distance measurement how cool is
| it to use every piece of data. And with no training.
|
| I wonder how they compare against tree based models on tabular
| data just in accuracy.
| teruakohatu wrote:
| I like using KNN as much as the next person but tabular data
| can be quite wide, and KNN suffers from the curse of
| dimensionality.
| marcodiego wrote:
| Wait until we see deep learning AI creating tree-based models. /s
| karmasimida wrote:
| triggering co-pilot to use xgboost might count as one today
| lol?
| PartiallyTyped wrote:
| Technically you can transform any finite / not runtime defined
| computation graph into a "tree" model.
| optimalsolver wrote:
| See symbolic regression and genetic programming.
| uoaei wrote:
| Not actually that complicated of an approach! Hard to implement
| correctly, which is why it's not more popular right now.
| _pastel wrote:
| It's baffling to me how little research attention there has been
| to improving tree-based methods, considering their effectiveness.
|
| For example, LightGBM and XGBoost allow some regularization
| terms, but the variance/bias is still mostly controlled by
| globally setting the max depth and max node count (and then
| parameter searching to find good settings).
|
| Surely there must be more powerful and sophisticated ways of
| deciding when to stop building each tree than counting the number
| of nodes? If this was neural nets there would be a hundred
| competing papers proposing different methods and arguing over
| their strengths and weaknesses.
|
| I'm not sure whether the problem is that neural nets are just
| fundamentally more sexy, or that in order to make SOTA
| improvements in GBMs you need to dive into some gnarly C++.
| Probably both.
| natalyarostova wrote:
| I think part of the problem is that the upper bound on neural
| nets, as far as we can tell, might very well be general
| intelligence, and things like self-driving cars, and other
| nearly magical use-cases that seem within reach. Whereas tree
| based models, for a series of reasons, many related to scaling,
| don't offer that feeling of limitless potential.
| zelphirkalt wrote:
| Maybe. But then again we often try to solve very specific
| problems, which are very far from requiring anything close to
| a general intelligence. Heck, a general intelligence might
| even be bad at things, just like humans are not as good as
| computers at certain things.
| mistrial9 wrote:
| > LightGBM not improving, meanwhile 8-figure budgets builds GPU
| and auto-logins..
|
| My take? management agenda to build plug-and-play researchers
| (humans on jobs), rather than domain specialists. DeepLearning
| fits that description with all-plumbing, all-the-time.. domain
| specialists want graduate school, weekends and health
| benefits..
| gwern wrote:
| Why do you think there has been little research attention? Time
| was, 'machine learning' was little _but_ tree-based methods
| (and that was how they distinguished themselves from 'AI'). Go
| look at Breiman's CV or random conference proceedings. Or as
| tree-based method proponents love to point out, pretty much
| everyone on Kaggle up until recently used trees for everything
| non-image-based; that's a ton of effort invested in tweaking
| trees. And there _were_ hardware efforts to accelerate them (I
| recall MS talking about how they were investing in FPGAs for MS
| Azure to run trees better), so 'GPUs' isn't an excuse.
| zelphirkalt wrote:
| I think people throwing algorithms at problems at Kaggle or
| combining them is not the kind of research, which the GP was
| referring to.
|
| Limiting a tree by its depth is a very general global
| parameter for a tree. One could try to use any kind of
| criteria for deciding when to stop making more child nodes in
| a tree, depending on what information is locally available
| and that depends on how the tree algorithm is actually
| implemented. So people doing Kaggle challenges would have to
| dig into the source code of the tree implementation, then
| change things there, to modify locally available knowledge,
| in order to allow for more fine grained decision making at
| each node.
|
| That is only the constructive side of things, when the tree
| is created. Even more powerful is the post processing /
| destructive / prunning side of things, because theoretically
| the whole tree structure can be taken into account, when
| deciding what branch to cut.
|
| I think the GP is referring to research in the area of what
| other useful things one can come up with here. As far as I
| know, these are not the usual things people do in Kaggle
| challenges. Correct me if I am wrong.
| _pastel wrote:
| Because anytime I search for literature on basic tweaks to
| the structure of decision trees, I find nothing.
|
| For example: modern GBM implementations all use binary trees.
| How would they perform with ternary trees? Or k-way trees for
| larger k plus some additional soft penalty that encourages
| minimizing the number branches, unless the information gain
| is really worth it?
|
| (You can simulate ternary trees with binary, but the
| splitting behavior is different because ternary can more
| easily identify good splitting regions in the middle range of
| the histogram values.)
|
| This seems like such a basic structural question, but the
| only relevant search result was this downvoted Stack Exchange
| question from 5 years ago:
| https://stats.stackexchange.com/questions/305685/ternary-
| dec...
|
| There are lots of papers on ternary trees in old-school
| contexts like Ternary Decision Diagrams etc., but nothing
| relevant to the context of modern tree ensemble performance.
| Or maybe I'm just bad at searching?
|
| (I implemented this and saw a small accuracy increase from
| ternary on large datasets, but much worse training speed
| because you get less mileage from the histogram subtraction
| trick. Maybe the accuracy would be even better with a more
| clever choice of soft penalty.)
| orasis wrote:
| We use XGBoost as the core learner for reinforcement learning at
| https://improve.ai despite the popularity of neural networks in
| academia.
|
| With tabular or nested data a human has already done a lot of
| work to organize that data in a machine friendly form - much of
| the feature engineering is performed by the data schema itself.
| ramraj07 wrote:
| Your app sounds amazing, as is the ethics model of it. Can't
| wait to test it out!
| wills_forward wrote:
| Does anyone see explainability as another good reason to trees on
| tabular data, for which I think users would expect more
| digestable outputs?
| oofbey wrote:
| The kinds of trees that come out of these algorithms are so
| huge they really aren't any more interpretable than a NN.
| [deleted]
| hatmatrix wrote:
| I didn't get to what extent they compared capability of these
| methods for extrapolation, which tree-based methods are not
| really suited for.
| jwilber wrote:
| If you're interested in how tree-based models work, I wrote an
| interactive explanation on decision trees here: https://mlu-
| explain.github.io/decision-tree/
|
| and random forests here: https://mlu-explain.github.io/random-
| forest/
|
| It's also worth noting that a recentish paper shows neural
| networks can perform well on tabular data if well-regularized:
| https://arxiv.org/abs/2106.11189v1?utm_source=jesper&utm_med...
| isabellat wrote:
| Really nice interactive explanations!
| lnenad wrote:
| That was super easy to digest, thank you!
| fdgsdfogijq wrote:
| Because high tabular data doesnt have enough complexity compared
| to language or images.
| Permit wrote:
| > Results show that tree-based models remain state-of-the-art on
| medium-sized data (~10K samples) even without accounting for
| their superior speed.
|
| Is that really "medium"? That seems very small to me. MNIST has
| 60,000 samples and ImageNet has millions.
|
| I think the title overstates the findings. I'd be interested to
| hear how these methods compare on much larger datasets. Is there
| a threshold at which deep learning outperforms tree-based models?
|
| Edit: They touch on this in the appendix:
|
| > A.2.2 Large-sized datasets
|
| > We extend our benchmark to large-scale datasets: in Figures 9,
| 10, 11 and 12, we compare the results of our models on the same
| set of datasets, in large-size (train set truncated to 50,000
| samples) and medium-size (train set truncated to 10,000 samples)
| settings.
|
| > We only keep datasets with more than 50,000 samples and
| restrict the train set size to 50,000 samples (vs 10,000 samples
| for the medium-sized benchmark). Unfortunately, this excludes a
| lot of datasets, which makes the comparison less clear. However,
| it seems that, in most cases, increasing the train set size
| reduces the gap between neural networks and tree-based models. We
| leave a rigorous study of this trend to future work.
| mochomocha wrote:
| I've put in production numerous models with millions of tabular
| data points and a 10^5-10^6 feature space where tree-based
| models (or FF nets) outperform more complex DL approaches.
| adamsmith143 wrote:
| What kind of freakish tabular data do you have with a million
| columns??
| ramraj07 wrote:
| A badly defined one probably? At least for one of the test
| arms.
| mochomocha wrote:
| Categorical variables in the datasets of large tech
| companies can take a lot of different values.
| whymauri wrote:
| I'll chime in with billions of data points and 100-300
| feature space with some smart feature engineering
| outperforming DL in runtime/compute (by orders of magnitude)
| and performance. But the domain was very specific and
| everything prior to the tree was doable with matrix
| operations, with the tree model summarizing a mixture of
| experts that chose optimal features.
| beckingz wrote:
| Many real world problems that result in data are decidedly
| medium: small enough to fit in excel, large enough to be too
| big to comfortable handle in excel.
| ramraj07 wrote:
| Then these real world problems don't actually warrant deep
| learning ?
|
| I thought the biggest leap in NN and deep learning in recent
| history was the realization that we need a ton of data to get
| maximal effectiveness from them; it now sounds
| counterproductive to forget this and cry they don't work well
| with 10,000 rows.
| riedel wrote:
| MNIST is not your typical real world tabular data. Many if not
| most data science problems out there are still in the range of
| a few k samples from my perspective (trying to "sell" ML to the
| average company) From a statistical point of view I would not
| call the datasets small (you can decently compare two means
| from subsets without needing a student's distribution).
| nonameiguess wrote:
| Assuming the categories are meant to apply to any data sets,
| anything amenable to machine learning at all is at least medium
| data. "Small" data would be something like a human trial with
| n=6 because the length and compliance of the protocol is so
| onerous. There are entirely different statistical techniques
| for finding significance in the face of extremely low power.
| spywaregorilla wrote:
| Tree based models seem like obvious approaches that should just
| work. It's more or less how a human would parse a problem. Throw
| some X and O's on a 2d plot. Draw lines to partition it into X
| sections and O sections. That's a tree based model.
| lupire wrote:
| melenaboija wrote:
| Now you can contact the authors and tell them to replace the
| publication by your 4 lines.
|
| And tree based models is not what you described, at most that
| could be a tree classifier.
|
| EDIT: First of all sorry for my comment, I think you did not
| like it. My point is that what you state as obvious on how tree
| based methods work is most probably not what makes them
| powerful but the fact that you have a bunch of them. To me if
| there is some intuition, statistics is more prevalent than
| separating spaces in hyperplanes.
|
| Here you can see the boundaries of a random forest compared to
| a single tree, the more dimensions you add the more blurry and
| unintuitive in terms of hyperplanes it gets:
|
| https://scikit-learn.org/stable/_images/sphx_glr_plot_classi...
| stonemetal12 wrote:
| To be fair "Obvious" doesn't mean correct or proven. Even if
| their results boil down to obvious method is better than more
| complicated method it is still worth a paper.
| spywaregorilla wrote:
| abduhl wrote:
| So it sounds like you'd need your 4 lines of pseudo-code
| from your first post and then another few lines of comments
| to describe how your 4 lines replace the authors'
| publication? Something like "All tree models are partitions
| of the space based on training data, and the extension to
| any particular tree model is an obvious exercise left to
| the reader"?
| spywaregorilla wrote:
| Why have 2 separate people asserted I have issue with the
| author's publication?
|
| My point is that tree based models are models that
| partition the space. This is not obvious. But the concept
| of partitioning the space is a very obvious thing to try
| because it just makes visual sense. Take a look at the
| example linked above: https://scikit-
| learn.org/stable/_images/sphx_glr_plot_classi...
|
| All models indirectly partition the space. Tree based
| models partition the space explicitly. Other model types
| do so implicitly.
| JoshCole wrote:
| Patience will be necessary; a bad reading of your meaning
| will be a bit sticky because it allows others to pick up
| computation where others left off. In the short term it
| looks like misunderstanding. Over time it means a great
| deal more thoughtfulness than any one person would have
| been able to give.
| JoshCole wrote:
| You are saying it sounds as if he his _explicit denial_
| is what he was saying? You 're like a kid that hears
| another kid getting upset at a mispronunciation of their
| name. "My name is John," he whines. "Sounds like you are
| Joonnna," you taunt. It is an exceedingly childish thing
| to do. Maybe others here aren't mature enough to realize
| you are being a bully, but I'm not stupid enough to think
| that "not X" said with emotional force means something
| "sounds like X". I'd also rather give this response
| rather than making spyware go through the humiliation of
| repeating himself. He got downvoted because he cussed,
| not because intentionally distorting what people say is
| funny.
| spywaregorilla wrote:
| I did not like the comment. It is needlessly aggressive and
| implies I have the arrogant view that my post is better than
| the author's paper. I do not.
|
| I also suspect your hypothesis is not correct. The top row of
| your link is a good example. The NN tries to infer a gradient
| but that's really tough to do with limited data. That is to
| say, the tree based models will locally fit their partitions
| to the exact training data and the NN will try to view it in
| the big picture filling in the gaps. Tree based model works
| better for most real world tabular data.
|
| The paper concludes something similar:
|
| > This superiority is explained by specific features of
| tabular data: irregular patterns in the target function,
| uninformative features, and non rotationally-invariant data
| where linear combinations of features misrepresent the
| information.
|
| I daresay my perspective is better aligned with the paper
| than yours. Are YOU trying to replace the publication with
| your 4 lines?
| melenaboija wrote:
| Totally agree and sorry again for my wording as I probably
| miss understood what you meant.
|
| And no, I am trying to replace what I consider is a wrong
| intuition (tree methods are strong models mostly because
| single trees separate data in hyperplanes) with my 4 lines.
| This is just my opinion.
___________________________________________________________________
(page generated 2022-08-03 23:00 UTC)