[HN Gopher] Regularization is all you need: simple neural nets c...
       ___________________________________________________________________
        
       Regularization is all you need: simple neural nets can excel on
       tabular data
        
       Author : tracyhenry
       Score  : 105 points
       Date   : 2021-06-22 18:29 UTC (4 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | lern_too_spel wrote:
       | A lot of these papers can be titled "Tuning hyperparameters on
       | the evaluation dataset is all you need." I see a few cases in
       | this paper.
        
       | nightcracker wrote:
       | Paper is missing the control: how good is this 'cocktail of
       | regularization' when applied to traditional methods like XGBoost?
       | 
        | At best you can claim that neural networks with a cocktail of
        | regularization methods can beat traditional methods without
        | one, but to be apples to apples, both methods must have access
        | to the same 'cocktail of regularization'.
        
         | audi0slave wrote:
          | This paper compares XGBoost vs. neural nets and an ensemble:
          | [Tabular Data: Deep Learning is Not All You Need]
         | https://arxiv.org/abs/2106.03253
        
         | ipsum2 wrote:
         | From the paper:
         | 
          | > This paper is the first to provide compelling evidence that
          | well-regularized neural networks (even simple MLPs!) indeed
         | surpass the current state-of-the-art models in tabular
         | datasets, including recent neural network architectures and
         | GBDT (Section 6).
         | 
         | > Next, we analyze the empirical significance of our well-
         | regularized MLPs against the GBDT implementations in Figure 2b.
         | The results show that our MLPs outperform both GBDT variants
         | (XGBoost and auto-sklearn) with a statistically significant
         | margin.
         | 
          | They test against XGBoost, the auto-sklearn GBDT, and others.
          | Did you read the paper?
        
           | nightcracker wrote:
           | > They test against XGBoost, GBDT Auto-sklearn, and others.
           | Did you read the paper?
           | 
           | Yes. Did you read my comment?
           | 
           | They compare NN + Cocktail vs. vanilla XGB. They don't
           | compare NN + Cocktail vs. XGB + Cocktail.
           | 
            | To make it crystal clear: if I wrote a paper "existing
            | medicine A enhanced with novel method B is more effective
            | than existing medicine C" and I did not include the control
            | "C + B" (assuming B is applicable to C, which is largely
            | the case here), that'd be bad science. It's very much
            | possible that novel method B is doing the heavy lifting and
            | A isn't all that relevant. s/A/NN, s/B/Cocktail,
            | s/C/XGBoost.
        
             | ipsum2 wrote:
             | How would you even apply layer normalization or SWA to XGB?
             | These methods are neural net specific.
        
               | nightcracker wrote:
                | There's nothing neural-network-specific about batch
                | normalization if you use it on the input layer. I don't
                | think it matters for a tree algorithm like XGBoost
                | either way, though.
               | 
               | SWA is pretty NN specific. So leave it out for XGB.
               | There's a bunch that _are_ relevant, and they could be
               | very important.
        
               | mandelken wrote:
                | GBDTs have their own set of hyperparameters, such as
                | learning rate, number of trees, min samples per bin,
                | L0/L1 penalties, etc. So you could definitely also
                | create an appropriate cocktail to optimize over,
                | although GBDTs are typically more robust w.r.t.
                | hyperparameters.
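                | 
                | As a rough sketch (assuming xgboost and scikit-learn;
                | the search ranges are made up, not from the paper):
                | 
                |   from scipy.stats import loguniform, randint
                |   from sklearn.model_selection import RandomizedSearchCV
                |   from xgboost import XGBClassifier
                | 
                |   # Hypothetical "cocktail" over XGBoost's own
                |   # regularizers (ranges invented for illustration).
                |   space = {
                |       "learning_rate": loguniform(1e-3, 0.3),
                |       "n_estimators": randint(100, 2000),
                |       "min_child_weight": randint(1, 20),
                |       "reg_alpha": loguniform(1e-8, 10.0),   # L1
                |       "reg_lambda": loguniform(1e-8, 10.0),  # L2
                |       "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                |   }
                |   search = RandomizedSearchCV(XGBClassifier(), space,
                |                               n_iter=100, cv=5)
                |   # search.fit(X_train, y_train)  # data assumed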
        
               | nightcracker wrote:
               | The authors do claim to do a hyperparameter sweep but
               | only for vanilla XGB hyperparams.
        
             | blt wrote:
             | I don't think every single regularization method in the
             | cocktail can be applied to non-neural-network methods, but
             | I'm pretty sure some of them can, like data augmentation.
              | The authors could have figured out which methods can be
              | applied to non-NN models, or considered whether
              | equivalent/analogous methods exist. I agree it would make
              | a fairer comparison.
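              | 
              | A minimal sketch of what that could look like (nothing
              | here is from the paper; sigma is a made-up knob):
              | 
              |   import numpy as np
              |   from xgboost import XGBClassifier
              | 
              |   def augment(X, y, n_copies=2, sigma=0.05, seed=0):
              |       """Jitter numeric features with Gaussian noise
              |       scaled per column: a crude tabular analogue of
              |       data augmentation."""
              |       rng = np.random.default_rng(seed)
              |       scale = sigma * X.std(axis=0)
              |       Xs = [X] + [X + rng.normal(0, scale, X.shape)
              |                   for _ in range(n_copies)]
              |       ys = np.concatenate([y] * (n_copies + 1))
              |       return np.vstack(Xs), ys
              | 
              |   # X_aug, y_aug = augment(X_train, y_train)  # assumed
              |   # XGBClassifier().fit(X_aug, y_aug)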
        
       | zozbot234 wrote:
       | Ridiculous clickbait title. How the heck did this neural network
       | "can" Excel wrt. this tabular dataset? What's the underlying
       | objective and the evaluation benchmark? And what feature of Excel
       | was this tested against in the first place?
       | 
       | I suppose the follow up will be titled "One Weird Trick is All
       | You Need To Destroy SOTA On This Dataset!"
        
         | TFortunato wrote:
          | Not Excel the spreadsheet software; "excel" the verb.
          | 
          | They aren't saying that they "canned" Excel (the software);
          | they're saying that neural nets have the potential to perform
          | well on tasks involving tabular data that are traditionally
          | handled by other ML techniques.
        
         | bno1 wrote:
         | I, for one, very much appreciate the comedy of this post.
        
       | joe_the_user wrote:
       | From paper: _Tabular datasets are the last "unconquered castle"
       | for deep learning, with traditional ML methods like Gradient-
       | Boosted Decision Trees still performing strongly even against
       | recent specialized neural architectures._
       | 
        | My, but this statement seems more than a little grandiose.
       | 
        | Never mind that XGBoost still does well on a substantial
        | portion of ML challenges (supposedly). The bigger problem is
        | that there's a confusion of maps and territories in this way of
        | talking about machine learning. The field of ML has made a
        | certain level of palpable progress by creating a number of
        | challenges and benchmarks and then doing well on them. But
        | success on a benchmark isn't _necessarily_ the same as success
        | at the "task" broadly. An NLP test doesn't imply mastering real
        | language; a driving benchmark doesn't imply mastery of on-the-
        | road driving; etc. Notably, success on a benchmark also "isn't
        | nothing". In a situation like the game of go, the possibilities
        | can be fully captured "in the lab", and success at tests indeed
        | became success against humans. But with driving or language,
        | things are much more complicated.
       | 
        | What I would say is that benchmark success seems to produce at
        | least a situation where the machine can achieve human-like
        | performance for some neighborhood (or tile, etc.) limited in
        | time, space and subject. Of course, driving is the poster child
        | for the limitations of "works most of the time", but a lot of
        | "intelligent" activities require an ongoing thread of
        | "reasonableness" aside from having an immediate logic.
       | 
        | Anyway, it would be nice if our ML folks looked at this stuff
        | more as a beginning than as a point where success is already in
        | hand.
        
       | DSingularity wrote:
        | Attention, regularization, someone tell me what I really need.
        
       | karxxm wrote:
        | I strongly agree with this paper. Regularizing a NN is the key
        | to better performance. But first, one needs to know what
        | exactly to regularize. I don't think trial-and-error
        | hyperparameter bingo should be the way to go. We need better
        | insight into and understanding of these networks: analyze
        | their structure and find out what exactly is wrong. Then a
        | TARGETED regularization (layerwise, or maybe per neuron) has
        | huge potential to let very simple networks perform extremely
        | well. I would even suggest that adaptive regularization
        | (on/off/strength) should be researched more. It is not
        | necessary that a network be regularized the same way all the
        | time.
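        | 
        | As a sketch of what I mean by targeted regularization (PyTorch;
        | the decay values are invented for illustration):
        | 
        |   import torch
        |   import torch.nn as nn
        | 
        |   net = nn.Sequential(
        |       nn.Linear(32, 64), nn.ReLU(),
        |       nn.Linear(64, 64), nn.ReLU(),
        |       nn.Linear(64, 1),
        |   )
        |   # Layerwise weight decay via optimizer parameter groups.
        |   opt = torch.optim.AdamW([
        |       {"params": net[0].parameters(), "weight_decay": 1e-2},
        |       {"params": net[2].parameters(), "weight_decay": 1e-3},
        |       {"params": net[4].parameters(), "weight_decay": 0.0},
        |   ], lr=1e-3)
        | 
        | Adaptive on/off/strength then amounts to mutating the
        | "weight_decay" entries of opt.param_groups during training.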
        
         | nabusman wrote:
         | Any references/recommendations on the best practices to analyze
         | a network, weights, etc.?
        
       | toxik wrote:
       | Would you please stop it with the "$whatever is all you need"?
        
         | jasonzemos wrote:
          | What if the paper that finally gives us The Singularity is
          | titled "All you need is all you need"?
        
         | melling wrote:
         | Isn't it a reference to the "Attention is All You Need" machine
         | learning paper?
         | 
         | https://arxiv.org/abs/1706.03762
        
           | toxik wrote:
            | Yes, and this is the fourth or fifth paper I've seen
            | copying that format. The attention paper was pretty damn
            | significant. This one won't be, because it just shows that
            | hyperparameters are important. We know that.
        
         | lux wrote:
          | I often interpret that pattern as a reference to the Beatles
          | lyric "love is all you need" (i.e., an attempt at being
          | playful and relatable), but that may be my musical bias.
          | 
          | Either way, totally agree. Overused, almost always incorrect,
          | and easily misconstrued, especially by people who don't speak
          | English as their first language.
        
         | prionassembly wrote:
         | "All you need considered harmful"
        
           | joe_the_user wrote:
           | "The Unreasonable Effectiveness of All You Need Is Considered
           | Harmful"
           | 
           | -- All interpretations worth considering...
        
             | ayyy wrote:
             | "The Unreasonable Effectiveness of All You Need Is
             | Considered Harmful for Fun and Profit"
        
               | philipswood wrote:
               | FANG doesn't want you to know how this one weird trick
               | about the unreasonable effectiveness of all you need
               | considered harmful for fun and profit and you will never
               | guess what happened next.
        
               | [deleted]
        
         | The_rationalist wrote:
          | Sometimes it feels like paper authors treat their audience as
          | children.
        
           | toxik wrote:
           | It's such a cheap trick too, basically clickbait.
        
             | bonoboTP wrote:
              | But it works; for example, it got to the top of HN.
        
           | ausbah wrote:
            | To be fair, many of the tasks of machine learning are
            | things children do, like telling if something is a dog,
            | coloring in a picture, solving a maze, etc.
        
           | jgalt212 wrote:
            | I don't know about that. It's probably driven by the need
            | for a short, attention*-gathering title.
            | 
            | * I see what I did there.
        
           | bonoboTP wrote:
            | The field is certainly moving in a direction of
            | infantilization due to the social-media attention economy.
            | Your paper competes with cute flashy cat gifs etc. on
            | Twitter. And Twitter is ridiculously important in AI/ML,
            | except perhaps for those above 50-60 years of age.
        
           | edmundsauto wrote:
           | Communication is based on the lowest common denominator. The
           | best communication can be understood by a wide range of
           | audiences. Have you considered that you weren't the intended
           | audience?
        
       | amilios wrote:
        | This is interesting, but the paper still notes that in most
        | "real-life" applications people will likely still prefer
        | gradient-boosted trees, because even the MLP-with-
        | regularization-cocktail requires allocating significant
        | computation to hyperparameter tuning. For just getting
        | something off the ground quickly on tabular data, GBDTs are
        | still unbeatable.
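        | 
        | A first GBDT baseline really is a few lines with defaults
        | (sketch; X_train etc. are assumed to already exist):
        | 
        |   from sklearn.metrics import roc_auc_score
        |   from xgboost import XGBClassifier
        | 
        |   model = XGBClassifier(n_estimators=300).fit(X_train, y_train)
        |   print(roc_auc_score(y_test,
        |                       model.predict_proba(X_test)[:, 1]))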
        
         | quantum_mcts wrote:
         | An overlooked advantage of the MLP is that it is
         | differentiable. Essentially, one trades the extra CPU for a
         | classifier that one can pipe gradients through. That can be
         | extremely useful in larger NN architectures.
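          | 
          | A sketch of what that buys you (PyTorch; shapes invented):
          | the tabular branch sits inside a bigger model and receives
          | gradients end-to-end, which a tree ensemble cannot.
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   class Fusion(nn.Module):
          |       """Image encoder + tabular MLP trained jointly;
          |       the loss backpropagates through both branches."""
          |       def __init__(self):
          |           super().__init__()
          |           self.img = nn.Sequential(nn.Flatten(),
          |                                    nn.Linear(784, 32),
          |                                    nn.ReLU())
          |           self.tab = nn.Sequential(nn.Linear(10, 32),
          |                                    nn.ReLU())
          |           self.head = nn.Linear(64, 1)
          | 
          |       def forward(self, image, tabular):
          |           z = torch.cat([self.img(image),
          |                          self.tab(tabular)], dim=-1)
          |           return self.head(z)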
        
           | [deleted]
        
         | not_jd_salinger wrote:
         | > GBDTs are still unbeatable.
         | 
          | You'd be surprised how many times I've replaced a GBDT with
          | logistic regression and had a negligible drop-off in model
          | performance, with a dramatic improvement in training time as
          | well as in debugging and fixing production models.
         | 
          | I've had plenty of cases where a bit of reasonable feature
          | transformation can get a logistic model to outperform a GBDT.
          | Any non-linearity you're picking up with a GBDT can often
          | easily be captured with some very simple feature tweaking.
         | 
         | My experience has been that GBDTs are only particularly useful
         | in Kaggle contests, where minuscule improvements in an
         | arbitrary metric are valuable and training time and model
         | debugging are completely unimportant.
         | 
          | There are absolutely cases where NNs can go places that
          | logistic regression can't touch (CV and NLP), but I have yet
          | to see a real-world production pipeline where a GBDT provides
          | enough improvement over logistic regression to throw out all
          | of the performance and engineering benefits of linear models.
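          | 
          | To make the feature-transformation point concrete, a quick
          | sketch (scikit-learn; the transforms are illustrative, not a
          | recipe):
          | 
          |   from sklearn.linear_model import LogisticRegression
          |   from sklearn.pipeline import make_pipeline
          |   from sklearn.preprocessing import KBinsDiscretizer
          | 
          |   # Quantile binning hands the linear model the kind of
          |   # per-feature non-linearity a GBDT learns from splits.
          |   model = make_pipeline(
          |       KBinsDiscretizer(n_bins=10, encode="onehot-dense"),
          |       LogisticRegression(max_iter=1000),
          |   )
          |   # model.fit(X_train, y_train)  # data assumed to exist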
        
           | flyers_research wrote:
            | What are the sizes of the datasets? I have a hard time
            | conceptualizing tabular business data large enough to be a
            | problem.
        
             | huac wrote:
             | consider the problem of "online advertising"
        
           | kimukasetsu wrote:
            | I strongly agree with this. Not to mention parameter
            | interpretability and, in the case of Bayesian models,
            | uncertainty estimates and convergence diagnostics. Such
            | things are very important when making decisions under
            | uncertainty. Kaggle competitions and empirical benchmarks
            | are very biased samples of model performance in real life.
            | 
            | I feel these two things often influence the course of
            | Machine Learning research and communities too much, and
            | this is not good. Most ML researchers and practitioners
            | are barely aware of the latest advances in parametric
            | modelling, which is a shame. Multilevel models allow you
            | to model response variables with explicit dependence
            | structures. This is done through random (sometimes
            | hierarchical) effects constrained by variance parameters.
            | These parameters regularize the effects themselves and
            | converge really well when fitting factors with high
            | cardinality.
           | 
           | Also, multilevel models are very interesting when it comes to
           | the bias-variance tradeoff. Having more levels in a
           | distribution of random effects actually DECREASES [1]
           | overfitting, which is fascinating.
           | 
           | [1] https://m-clark.github.io/posts/2019-05-14-shrinkage-in-
           | mixe...
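            | 
            | A minimal example of the random-effect shrinkage I mean
            | (statsmodels; the column names are invented):
            | 
            |   import statsmodels.formula.api as smf
            | 
            |   # Random intercept per level of a high-cardinality
            |   # factor; the fitted group variance shrinks per-group
            |   # effects toward the grand mean (partial pooling).
            |   # df with columns y, x, shop_id is assumed to exist.
            |   model = smf.mixedlm("y ~ x", df, groups=df["shop_id"])
            |   result = model.fit()
            |   print(result.summary())  # includes variance component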
        
       | whatshisface wrote:
        | If you want to use NNs on tabular data, look up the work being
        | done on point clouds. The two share the same major symmetry:
        | invariance under permutations.
        
         | eachro wrote:
         | Oh that's interesting. Can you say more? Haven't seen much
         | relating the two topics before.
        
           | whatshisface wrote:
            | A row in a table of data is a point in R^n. I'm not sure
            | how much there is to write about it other than to say:
            | that's a point cloud.
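            | 
            | If it helps, a DeepSets-style sketch of the shared symmetry
            | (PyTorch; sizes invented): permuting the rows cannot change
            | the output, because they are only ever summed.
            | 
            |   import torch
            |   import torch.nn as nn
            | 
            |   class DeepSet(nn.Module):
            |       """f(X) = rho(sum_i phi(x_i)), invariant to any
            |       permutation of the rows of X."""
            |       def __init__(self, d=8, h=32):
            |           super().__init__()
            |           self.phi = nn.Sequential(nn.Linear(d, h),
            |                                    nn.ReLU())
            |           self.rho = nn.Linear(h, 1)
            | 
            |       def forward(self, X):  # X: (num_rows, d)
            |           return self.rho(self.phi(X).sum(dim=0))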
        
       ___________________________________________________________________
       (page generated 2021-06-22 23:00 UTC)