[HN Gopher] Regularization is all you need: simple neural nets c...
___________________________________________________________________
Regularization is all you need: simple neural nets can excel on
tabular data
Author : tracyhenry
Score : 105 points
Date : 2021-06-22 18:29 UTC (4 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| lern_too_spel wrote:
| A lot of these papers can be titled "Tuning hyperparameters on
| the evaluation dataset is all you need." I see a few cases in
| this paper.
| nightcracker wrote:
| The paper is missing the control: how good is this 'cocktail of
| regularization' when applied to traditional methods like XGBoost?
|
| At best the result here shows that neural networks with
| regularization methods can beat traditional methods without them;
| to be apples to apples, both methods must have access to the
| same 'cocktail of regularization'.
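|
| For concreteness, a minimal sketch of the missing control,
| assuming xgboost and scikit-learn are available (this search
| space is my own illustration, not the paper's):
|
|     import numpy as np
|     from sklearn.datasets import make_classification
|     from sklearn.model_selection import RandomizedSearchCV
|     from xgboost import XGBClassifier
|
|     X, y = make_classification(n_samples=2000, n_features=20,
|                                random_state=0)
|
|     # Give XGBoost the same tuning budget the NN receives,
|     # sweeping its own regularizers (a GBDT "cocktail").
|     space = {
|         "learning_rate": np.logspace(-3, 0, 20),
|         "n_estimators": [100, 300, 1000],
|         "max_depth": [3, 5, 7, 9],
|         "subsample": [0.5, 0.7, 1.0],
|         "colsample_bytree": [0.5, 0.7, 1.0],
|         "reg_alpha": np.logspace(-8, 1, 10),   # L1 penalty
|         "reg_lambda": np.logspace(-8, 1, 10),  # L2 penalty
|     }
|     search = RandomizedSearchCV(XGBClassifier(), space, n_iter=25,
|                                 cv=5, random_state=0)
|     search.fit(X, y)
|     print(search.best_score_, search.best_params_)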
| audi0slave wrote:
| This paper compares XGBoost vs. neural nets and an ensemble:
| [Tabular Data: Deep Learning is Not All You Need]
| https://arxiv.org/abs/2106.03253
| ipsum2 wrote:
| From the paper:
|
| > This paper is the first to provide compelling evidence that
| well-regularized neural networks (even simple MLPs!) indeed
| surpass the current state-of-the-art models in tabular
| datasets, including recent neural network architectures and
| GBDT (Section 6).
|
| > Next, we analyze the empirical significance of our well-
| regularized MLPs against the GBDT implementations in Figure 2b.
| The results show that our MLPs outperform both GBDT variants
| (XGBoost and auto-sklearn) with a statistically significant
| margin.
|
| They test against XGBoost, GBDT Auto-sklearn, and others. Did
| you read the paper?
| nightcracker wrote:
| > They test against XGBoost, GBDT Auto-sklearn, and others.
| Did you read the paper?
|
| Yes. Did you read my comment?
|
| They compare NN + Cocktail vs. vanilla XGB. They don't
| compare NN + Cocktail vs. XGB + Cocktail.
|
| To make it crystal clear: if I wrote a paper "existing
| medicine A enhanced with novel method B is more effective
| than existing medicine C" and I did not include the control
| "C + B" (assuming it's relevant, which is the case here),
| that'd be bad science. It's very much possible that novel
| method B is doing the heavy lifting and A isn't all that
| relevant. s/A/NN, s/B/Cocktail, s/C/XGBoost.
| ipsum2 wrote:
| How would you even apply layer normalization or SWA to XGB?
| These methods are neural net specific.
| nightcracker wrote:
| There's nothing neural-network-specific about batch
| normalization if you use it on the input layer. I don't
| think it matters for a tree algorithm like XGBoost either
| way, though.
|
| SWA is pretty NN-specific, so leave it out for XGB. There
| are a bunch of methods that _are_ relevant, and they could
| be very important.
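|
| For the input-layer case, batch norm amounts to standardizing
| features, which can sit in front of any model; a minimal
| sketch, assuming scikit-learn and xgboost:
|
|     from sklearn.datasets import make_classification
|     from sklearn.pipeline import make_pipeline
|     from sklearn.preprocessing import StandardScaler
|     from xgboost import XGBClassifier
|
|     X, y = make_classification(n_samples=1000, random_state=0)
|
|     # Standardizing inputs is the model-agnostic analogue of
|     # batch norm on the input layer; tree splits are largely
|     # invariant to it, which is the point made above.
|     model = make_pipeline(StandardScaler(), XGBClassifier())
|     model.fit(X, y)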
| mandelken wrote:
| GBDT have their own set of hyperparameters such as
| learning rate, number of trees, min samples per bin, l0,
| l1, etc. So you could definitely also create an
| appropriate cocktail to optimize on, although GBDT are
| typically more robust wrt hyperparameters.
| nightcracker wrote:
| The authors do claim to do a hyperparameter sweep but
| only for vanilla XGB hyperparams.
| blt wrote:
| I don't think every single regularization method in the
| cocktail can be applied to non-neural-network methods, but
| I'm pretty sure some of them can, like data augmentation.
| The authors could have figured out which methods can be
| applied to non-NN models, or considered whether
| equivalent/analogous methods exist. I agree it would make a
| fairer comparison.
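|
| As one illustration, a model-agnostic augmentation is jittering
| numeric features with Gaussian noise before fitting; a hedged
| sketch (the noise scale is an arbitrary choice):
|
|     import numpy as np
|
|     def augment_gaussian(X, y, n_copies=2, sigma=0.05, seed=0):
|         """Append noisy copies of X; any downstream model works."""
|         rng = np.random.default_rng(seed)
|         scale = sigma * X.std(axis=0)  # per-feature noise scale
|         X_aug = [X] + [X + rng.normal(0.0, scale, X.shape)
|                        for _ in range(n_copies)]
|         y_aug = [y] * (n_copies + 1)
|         return np.vstack(X_aug), np.concatenate(y_aug)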
| zozbot234 wrote:
| Ridiculous clickbait title. How the heck did this neural network
| "can" Excel wrt. this tabular dataset? What's the underlying
| objective and the evaluation benchmark? And what feature of Excel
| was this tested against in the first place?
|
| I suppose the follow up will be titled "One Weird Trick is All
| You Need To Destroy SOTA On This Dataset!"
| TFortunato wrote:
| Not Excel, the spreadsheet software; excel, the verb.
|
| They aren't saying that they "canned" Excel (the software);
| they are saying that neural nets have the potential to perform
| well on tasks involving tabular data that are traditionally
| handled by other ML techniques.
| bno1 wrote:
| I, for one, very much appreciate the comedy of this post.
| joe_the_user wrote:
| From paper: _Tabular datasets are the last "unconquered castle"
| for deep learning, with traditional ML methods like Gradient-
| Boosted Decision Trees still performing strongly even against
| recent specialized neural architectures._
|
| My, but this statement seems more than a little grandiose.
|
| Never mind that XGBoost still does well on a substantial portion
| of ML challenges (supposedly). The bigger problem is that
| there's a confusion of maps and territories in this way of
| talking about machine learning. The field of ML has made a
| certain level of palpable progress by creating a number of
| challenges and benchmarks and then doing well on them. But
| success on a benchmark isn't _necessarily_ the same as success
| at the "task" broadly. An NLP test doesn't imply mastering real
| language, a driving benchmark doesn't imply mastery of
| real-road driving, etc. Notably, success on a benchmark also
| "isn't nothing". In a situation like the game of Go, the
| possibilities can be fully captured "in the lab", and success
| at tests indeed became success against humans. But with driving
| or language, things are much more complicated.
|
| What I would say is that benchmark success seems to produce at
| least a situation where the machine can achieve human-like
| performance for some neighborhood (or tile, etc.) limited in
| time, space, and subject. Of course, driving is the poster
| child for the limitations of "works most of the time", but a
| lot of "intelligent" activities require an ongoing thread of
| "reasonableness" aside from having an immediate logic.
|
| Anyway, it would be nice if our ML folks looked at this stuff
| more as a beginning than as a situation where they're already
| poised on success.
| DSingularity wrote:
| Attention, regularization, someone tell me what I really need.
| karxxm wrote:
| I strongly agree with this paper. Regularizing a NN is the key
| to better performance. But first one needs to know what
| exactly to regularize. I don't think trial-and-error
| hyperparameter bingo should be the way to go. We need better
| insight into and understanding of these networks: analyze
| their structure and find out what exactly is wrong with it.
| Then a TARGETED regularization (layerwise, or maybe per
| neuron) has huge potential to let very simple networks perform
| extremely well. I'd even suggest that adaptive regularization
| (on/off/strength) should be researched even more; it is not
| necessary that a network be regularized the same way all the
| time.
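|
| A minimal PyTorch sketch of the layerwise version of this idea,
| using per-group weight decay (the strengths are placeholders,
| not tuned values):
|
|     import torch.nn as nn
|     import torch.optim as optim
|
|     net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
|                         nn.Linear(64, 64), nn.ReLU(),
|                         nn.Linear(64, 2))
|
|     # Targeted regularization: each layer gets its own decay
|     # strength instead of one global setting; the strengths
|     # could in principle be adapted during training.
|     groups = [
|         {"params": net[0].parameters(), "weight_decay": 1e-2},
|         {"params": net[2].parameters(), "weight_decay": 1e-3},
|         {"params": net[4].parameters(), "weight_decay": 0.0},
|     ]
|     opt = optim.AdamW(groups, lr=1e-3)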
| nabusman wrote:
| Any references/recommendations on the best practices to analyze
| a network, weights, etc.?
| toxik wrote:
| Would you please stop it with the "$whatever is all you need"?
| jasonzemos wrote:
| What if the paper that finally gives us The Singularity is
| titled "All you need is all you need"?
| melling wrote:
| Isn't it a reference to the "Attention is All You Need" machine
| learning paper?
|
| https://arxiv.org/abs/1706.03762
| toxik wrote:
| Yes, and this is the fourth or fifth paper I've seen copying
| that format. The attention paper was pretty damn significant.
| This one won't be, because it just shows that hyperparameters
| are important. We know that.
| lux wrote:
| I often interpret that pattern as a reference to the lyric
| "love is all you need" by the Beatles (i.e., an attempt at
| being playful and relatable), but that may be my musical bias.
|
| Either way, totally agree. Overused, almost always incorrect,
| and easily misconstrued, especially by people who don't speak
| English as their first language.
| prionassembly wrote:
| "All you need considered harmful"
| joe_the_user wrote:
| "The Unreasonable Effectiveness of All You Need Is Considered
| Harmful"
|
| -- All interpretations worth considering...
| ayyy wrote:
| "The Unreasonable Effectiveness of All You Need Is
| Considered Harmful for Fun and Profit"
| philipswood wrote:
| FANG doesn't want you to know how this one weird trick
| about the unreasonable effectiveness of all you need
| considered harmful for fun and profit and you will never
| guess what happened next.
| [deleted]
| The_rationalist wrote:
| Sometimes it feels like paper authors consider their audience
| to be children.
| toxik wrote:
| It's such a cheap trick too, basically clickbait.
| bonoboTP wrote:
| But it works; for example, it got to the top of HN.
| ausbah wrote:
| to be fair, many of the tasks of machine learning are things
| children do, like telling if something is a dog, coloring in
| a picture, solving a maze, etc.
| jgalt212 wrote:
| I don't know about that. It's probably driven by the need
| for a short attention*-gathering title.
|
| * I see what I did there.
| bonoboTP wrote:
| The field is certainly moving in a direction of
| infantilization due to the social media attention economy.
| Your paper competes with cute flashy cat gifs etc. on
| Twitter. And Twitter is ridiculously important in AI/ML,
| except perhaps above 50-60 years of age.
| edmundsauto wrote:
| Communication is based on the lowest common denominator. The
| best communication can be understood by a wide range of
| audiences. Have you considered that you weren't the intended
| audience?
| amilios wrote:
| This is interesting, but the paper notes that in most
| "real-life" applications people will likely still prefer
| gradient-boosted trees, simply because even the
| MLP-with-regularization-cocktail requires allocating
| significant computation to hyperparameter tuning. For just getting
| something off the ground quickly based on tabular data, GBDTs are
| still unbeatable.
| quantum_mcts wrote:
| An overlooked advantage of the MLP is that it is
| differentiable. Essentially, one trades the extra CPU for a
| classifier that one can pipe gradients through. That can be
| extremely useful in larger NN architectures.
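|
| A sketch of that composability in PyTorch: the tabular MLP is
| one differentiable branch of a larger network, so gradients
| from the joint loss flow into it (the architecture here is
| invented for illustration):
|
|     import torch
|     import torch.nn as nn
|
|     class Joint(nn.Module):
|         """Fuse a tabular MLP with, say, a text/image embedding."""
|         def __init__(self, n_tab, n_emb):
|             super().__init__()
|             self.tab = nn.Sequential(nn.Linear(n_tab, 64), nn.ReLU(),
|                                      nn.Linear(64, 32))
|             self.head = nn.Linear(32 + n_emb, 1)
|
|         def forward(self, x_tab, emb):
|             # Gradients from the head flow back through self.tab;
|             # a GBDT in the same position could not be trained
|             # end-to-end like this.
|             z = torch.cat([self.tab(x_tab), emb], dim=-1)
|             return self.head(z)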
| [deleted]
| not_jd_salinger wrote:
| > GBDTs are still unbeatable.
|
| You'd be surprised how many times I've replaced a GBDT with
| logistic regression and had a negligible drop-off in model
| performance, with a dramatic improvement in both training time
| and in debugging and fixing production models.
|
| I've had plenty of cases where a bit of reasonable feature
| transformation can get a logistic model to outperform a GBDT.
| Any non-linearity you're picking up with a GBDT can often
| easily be captured with some very simple feature tweaking.
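|
| A hedged sketch of that kind of feature tweaking with
| scikit-learn (the binning transform is illustrative, not a
| recipe):
|
|     from sklearn.datasets import make_classification
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|     from sklearn.preprocessing import KBinsDiscretizer
|
|     X, y = make_classification(n_samples=5000, n_features=10,
|                                random_state=0)
|
|     # One-hot binning lets the linear model capture simple
|     # non-linearities a GBDT would otherwise find via splits.
|     model = make_pipeline(
|         KBinsDiscretizer(n_bins=10, encode="onehot"),
|         LogisticRegression(max_iter=1000),
|     )
|     model.fit(X, y)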
|
| My experience has been that GBDTs are only particularly useful
| in Kaggle contests, where minuscule improvements in an
| arbitrary metric are valuable and training time and model
| debugging are completely unimportant.
|
| There are absolutely cases where NNs can go places that
| logistic regression can't touch (CV and NLP), but I have yet
| to see a real-world production pipeline where a GBDT provides
| enough improvement over logistic regression to throw out all
| of the performance and engineering benefits of linear models.
| flyers_research wrote:
| What are the sizes of the datasets? I have a hard time
| conceptualizing tabular business data large enough to be a problem.
| huac wrote:
| consider the problem of "online advertising"
| kimukasetsu wrote:
| I strongly agree with this. Not to mention parameter
| interpretability and, in the case of Bayesian models,
| uncertainty estimates and convergence diagnostics. Such
| things are very important when making decisions under
| uncertainty. Kaggle competitions and empirical benchmarks are
| very biased samples of model performance in real life.
|
| I feel these two things have too much influence on the course
| of machine learning research and its communities, and this is
| not good. Most ML researchers and practitioners are barely
| aware of the latest advances in parametric modelling, which is a
| shame. Multilevel models allow you to model response
| variables with explicit dependence structures. This is done
| through random (sometimes hierarchical) effects constrained
| by variance parameters. These parameters regularize the
| effects themselves and converge really well when fitting
| factors with high cardinality.
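|
| A minimal sketch of such a fit, assuming statsmodels and a toy
| long-format DataFrame (the column names are invented):
|
|     import numpy as np
|     import pandas as pd
|     import statsmodels.formula.api as smf
|
|     rng = np.random.default_rng(0)
|     df = pd.DataFrame({
|         "y": rng.normal(size=200),
|         "x": rng.normal(size=200),
|         "group": rng.integers(0, 20, size=200),  # high-cardinality factor
|     })
|
|     # Random intercept per group; the estimated variance
|     # parameter shrinks group effects toward the grand mean,
|     # i.e. the regularization described above.
|     fit = smf.mixedlm("y ~ x", df, groups=df["group"]).fit()
|     print(fit.summary())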
|
| Also, multilevel models are very interesting when it comes to
| the bias-variance tradeoff. Having more levels in a
| distribution of random effects actually DECREASES [1]
| overfitting, which is fascinating.
|
| [1] https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixe...
| whatshisface wrote:
| If you want to use NNs on tabular data, look up the work
| that's being done on point clouds. Both share the same major
| symmetry: invariance under permutations.
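|
| For context, the point-cloud recipe is a per-point encoder plus
| a symmetric pooling; a DeepSets-style sketch in PyTorch (the
| sizes are arbitrary):
|
|     import torch
|     import torch.nn as nn
|
|     class DeepSet(nn.Module):
|         """Output is invariant to permutations of the points."""
|         def __init__(self, n_feat, n_out):
|             super().__init__()
|             self.phi = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU())
|             self.rho = nn.Linear(64, n_out)
|
|         def forward(self, x):  # x: (batch, n_points, n_feat)
|             # Sum pooling over the point dimension is what makes
|             # the result permutation-invariant.
|             return self.rho(self.phi(x).sum(dim=1))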
| eachro wrote:
| Oh that's interesting. Can you say more? Haven't seen much
| relating the two topics before.
| whatshisface wrote:
| A row in a table of data is a point in R^n. I'm not sure how
| much there is to write about it other than to say: that's a
| point cloud.
___________________________________________________________________
(page generated 2021-06-22 23:00 UTC)