[HN Gopher] Why do tree-based models still outperform deep learn...
       ___________________________________________________________________
        
       Why do tree-based models still outperform deep learning on tabular
       data? (2022)
        
       Author : tosh
       Score  : 174 points
       Date   : 2024-03-05 10:44 UTC (12 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | frgtpsswrdlame wrote:
       | I still don't get the impetus or desire to make NNs work better
       | for tabular data. Regression works pretty well and is easy to
       | interpret/diagnose/work with. GBMs work really well (given a few
       | considerations) and are trickier to work with but nothing crazy.
       | When I see all the fancy hijinks people get up to when applying
       | NNs to audio/text/pictures I think it's really cool but also not
       | something I'd want to have to do if I didn't absolutely need to
       | when working with data out of a relational db. And anyways, how
       | much of a benefit could it actually bring? GBMs are already
       | capable of fitting and dramatically overfitting most datasets.
        
         | mrfox321 wrote:
         | When you need the best possible model, full stop.
         | 
         | E.g. finance
         | 
         | In a sufficiently competitive space, good enough doesn't cut
         | it.
        
           | dchftcs wrote:
           | Do you know of any shop that is running deep learning
           | profitably?
        
             | mjhay wrote:
             | Plenty of places use DL models, even if it's just a
             | component of their stack. I would guess that gradient-
             | boosted trees are more common in applications, though.
        
               | hackerlight wrote:
               | Do you know what kinds of strategies it's being used in?
        
               | foobar20k wrote:
               | Real-time parsing of incoming news events and live
               | scanning of internet news sites - coupled with sentiment
               | analysis. Latency is an interesting challenge in that
               | space.
        
               | mjhay wrote:
               | Still mostly NLP and image stuff. Most actual data in the
               | wild is tabular, where GBTs are usually some combination
               | of better and easier. In some circumstances, NNs can still
               | work well on tabular problems with the right feature
               | engineering or model stacking.
               | 
               | They are also more attractive for streaming data. Tree-
               | based models can't learn incrementally. They have to be
               | retrained from scratch each time.
        
               | dist-epoch wrote:
               | ML is very good at figuring out stuff like: every day at
               | 22:00 this asset goes up if this other asset is not at
               | a daily maximum and the volatility of the market is low.
               | 
               | You might call this overfitting/noise/.... but if you do
               | it carefully it's profitable.
        
             | TimPC wrote:
             | Multiple parts of the iPhone stack run DL models locally on
             | your phone. They even added hardware acceleration to the
             | camera because most of the picture quality upgrades are
             | software rather than hardware.
        
           | Thrymr wrote:
           | There is no such thing as "best possible model, full stop".
           | Models are always context dependent, have implicit or
           | explicit assumptions about what is signal and what is noise,
           | have different performance characteristics in training or
           | execution. Choosing the "best" model for your task is a form
           | of hyperparameter optimization in itself.
        
             | naijaboiler wrote:
             | I can't upvote this enough. Whether in life, or with
             | models, some people really do believe in the myth of
             | absolute meritocracy.
        
           | asdff wrote:
           | These models usually have poorer fit though
        
         | dawnofdusk wrote:
         | The paper offers a reason why NNs working for tabular data
         | would be good:
         | 
         | >Creating tabular-specific deep learning architectures is a
         | very active area of research (see section 2) given that tree-
         | based models are not differentiable, and thus cannot be easily
         | composed and jointly trained with other deep learning blocks.
         | 
         | Here is a second reason, from the paper
         | 
         | >Impressed by the superiority of tree-based models on tabular
         | data, we strive to understand which inductive biases make them
         | well-suited for these data.
         | 
         | which is a great reason, because understanding the inductive
         | biases of different learning/regression techniques gets us
         | closer to a more general understanding of how to encode
         | inductive biases in a generic learning algorithm.
        
           | hackerlight wrote:
           | My hypothesis is decision trees are more robust to
           | nonstationary distributions. If the variance and means of the
           | features shift dramatically, the model isn't going to blow
           | up, because it's not additive.
           | 
           | In the domains where NNs work well (image processing and
           | language), you're dealing with a predictable and stable
           | distribution of values. Elephants might look a bit different
           | in the train and test set, but you're not randomly getting
           | 100x the variance of the input data. The decision tree just
           | isn't going to care as much, because splits around the mean
           | will lead to the same outcome.
           | 
           | Another hypothesis is that zooming into bivariable
           | relationships is more important in tabular data. Neural nets
           | are better at local and global context. But they struggle if
           | all that matters is the relationship between two columns of
           | data because of the additive nature. Large networks _can_
           | figure it out due to model capacity, but then you'll run
           | into overfitting.
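
A minimal sketch of the narrower, well-known property behind the hypothesis above: a tree split depends only on the ordering of feature values, so any strictly increasing rescaling of a feature leaves the fitted partition unchanged, and a tree's output is bounded by its leaf values however extreme the input. This does not by itself show robustness to train/test shift. The data and the helper names `best_stump` and `predict` are made up for illustration.

```python
import math

def best_stump(xs, ys):
    """Fit a one-split regression stump by minimising squared error.

    Returns (threshold, left_mean, right_mean). The threshold is the
    midpoint between two neighbouring sorted feature values.
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        left = [ys[i] for i in order[:k]]
        right = [ys[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            thr = (xs[order[k - 1]] + xs[order[k]]) / 2
            best = (sse, thr, lm, rm)
    return best[1:]

def predict(stump, x):
    thr, lm, rm = stump
    return lm if x <= thr else rm

# Toy data (illustrative only): the target jumps between x=2 and x=4.
xs = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
ys = [0.0, 0.1, 0.0, 1.0, 0.9, 1.1]

raw = best_stump(xs, ys)
scaled = best_stump([100.0 * x for x in xs], ys)     # 100x the scale
logged = best_stump([math.log(x) for x in xs], ys)   # monotone warp

# Same rows land in the same leaves, so training predictions agree.
preds_raw = [predict(raw, x) for x in xs]
preds_scaled = [predict(scaled, 100.0 * x) for x in xs]
preds_logged = [predict(logged, math.log(x)) for x in xs]

# An extreme input still yields one of the two leaf values.
extreme = predict(raw, 1e9)
```

Only the threshold moves with the transformation (3.0 on the raw scale, 300.0 after multiplying by 100); the leaf assignments and leaf values stay the same.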
        
             | hansvm wrote:
             | In case anyone's sufficiently motivated (no promises, but I
             | might test it out eventually), a couple deep architectures
             | that might address those concerns are:
             | 
             | 1. Something like a deep support vector machine. Instead of
             | (linear) -> (any activation), you want to create a bunch of
             | features that look like testing the vector against a
             | splitting hyperplane. One option is (bias) -> (matmul) ->
             | (1-bit sigmoid). Applying a bias term _for each row_ lets
             | you choose the branch location; the matmul's result will be
             | positive or negative at each output feature depending on
             | which side of the hyperplane normal to the vector described
             | by the corresponding row you happen to fall on. Then just
             | bring that down to -1 or 1 so you can't sneak much
             | nonstationary drift variance into the output (perhaps train
             | with a normal sigmoid annealed to behave more like this
             | one, and a suitable regularizing term to keep the network
             | from sneaking in values near 0 to thwart your annealing).
             | 
             | 2. Use an attention-like mechanism, but across features
             | (this would likely require an additional tensor channel, so
             | that each "feature" carries information in a high enough
             | dimensional space for this to do something meaningful). You
             | apply the inductive bias that sparse feature interactions
             | are important and need to be discovered.
             | 
             | Those two ideas also compose easily.
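
One possible reading of idea 1, as a rough sketch rather than a tested architecture: each output feature shifts the input by its own bias vector, takes a dot product with one weight row, and keeps only the sign. The `tanh` with a temperature is an assumed stand-in for the "normal sigmoid annealed" toward a hard step; the weights, biases, and function names are hypothetical.

```python
import math

def hyperplane_features(x, biases, weights):
    """(bias) -> (matmul) -> (1-bit output) for a single input vector.

    biases and weights each hold one row per output feature. Feature i
    reports which side of the hyperplane with normal weights[i],
    shifted by biases[i], the input falls on.
    """
    out = []
    for b_row, w_row in zip(biases, weights):
        s = sum(w * (xi + bi) for w, xi, bi in zip(w_row, x, b_row))
        out.append(1 if s >= 0 else -1)
    return out

def soft_hyperplane_features(x, biases, weights, temperature):
    """Differentiable relaxation (assumption: tanh as the sigmoid).

    Approaches the hard +/-1 output as temperature grows, which is
    what an annealing schedule during training would exploit.
    """
    out = []
    for b_row, w_row in zip(biases, weights):
        s = sum(w * (xi + bi) for w, xi, bi in zip(w_row, x, b_row))
        out.append(math.tanh(temperature * s))
    return out

# Hypothetical parameters: three axis-aligned or diagonal "splits".
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [[-2.0, 0.0], [0.0, -3.0], [-1.0, -1.0]]

near = hyperplane_features([2.5, 1.0], biases, weights)
far = hyperplane_features([2500.0, 1000.0], biases, weights)
soft = soft_hyperplane_features([2.5, 1.0], biases, weights, temperature=50.0)
```

However large the input gets, each feature stays at -1 or 1, which is the point about not letting drift in the input variance leak into the output.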
        
               | hackerlight wrote:
               | > this would likely require an additional tensor channel,
               | so that each "feature" carries information in a high
               | enough dimensional space
               | 
               | Suppose input data is [batch_size, num_features]. Then
               | you do x.unsqueeze(1) giving you [batch_size,
               | num_features, 1]. Then what?
        
               | hansvm wrote:
               | You probably want something equivalent to (however you
               | make it fast in your chosen framework):
               | 
               | einsum('bf,fc->bfc', batched_inputs, channel_embedding)
               | 
               | Then carry that info through the network and project it
               | down at the end. It's roughly equivalent to the token
               | embedding step in an LLM.
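
To spell out what the einsum above computes, here is a dependency-free sketch: every scalar feature is multiplied by that feature's own embedding row, turning a [batch, features] input into [batch, features, channels]. The function name and the toy numbers are illustrative; in an array framework the same result comes from broadcasting the inputs against the embedding table.

```python
def embed_features(batched_inputs, channel_embedding):
    """Pure-Python equivalent of einsum('bf,fc->bfc', inputs, embedding).

    batched_inputs:    batch of rows, each with one scalar per feature
    channel_embedding: one embedding row (length c) per feature
    Returns out[b][f][c] = batched_inputs[b][f] * channel_embedding[f][c].
    """
    return [
        [[x * e for e in channel_embedding[f]] for f, x in enumerate(row)]
        for row in batched_inputs
    ]

# Toy shapes: batch of 2 rows, 3 features, 2 channels per feature.
batched_inputs = [[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]]
channel_embedding = [[1.0, 0.0],
                     [0.5, 0.5],
                     [0.0, 2.0]]

out = embed_features(batched_inputs, channel_embedding)
shape = (len(out), len(out[0]), len(out[0][0]))
```

Each feature now occupies its own small vector, so an attention-like layer has something to compare across features.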
        
         | melondonkey wrote:
         | At this point I wish every junior DS could read this paper and
         | not come in to every problem with the new bright idea that
         | they're going to beat XGBoost with their DL architecture. Free
         | promotion if they never say the words "latent subspace"
        
           | barrenko wrote:
           | One of those juniors is going to do it once!
        
         | bbstats wrote:
         | because smooth is better than jagged :)
        
       | mikkom wrote:
       | It's very important to note that this is from 2022. I'm not
       | saying it's not true today but neural models have gotten much
       | better in 2 years.
       | 
       | (I'm personally using NN models for predicting certain values for
       | tabularly structured data and at least for my case, the NN works
       | better than state-of-the-art tree models.)
        
         | mjhay wrote:
         | Do you have any intuition you could share of why NNs work
         | better in this case?
        
           | queuebert wrote:
           | Not the parent, but NNs typically work better when you can't
           | linearize your data. For classification, that means a space
           | in which hyperplanes separate classes, and for regression a
           | space in which a linear approximation is good.
           | 
           | For example, take the circle dataset here:
           | https://playground.tensorflow.org
           | 
           | That doesn't look immediately linearly separable, but since
           | it is 2D we have the insight that parameterizing by radius
           | would do the trick. Now try doing that in 1000 dimensions.
           | Sometimes you can, sometimes you can't or don't want to
           | bother.
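
The 2D case can be sketched in a few lines: two noisy rings cannot be split by a threshold on x alone, but a single threshold on the hand-made radius feature separates them. The ring radii, noise level, and threshold are made-up values, not taken from the playground.

```python
import math
import random

random.seed(0)

def ring(radius, n, noise=0.1):
    """Sample n points near a circle of the given radius."""
    pts = []
    for _ in range(n):
        theta = random.uniform(0.0, 2.0 * math.pi)
        r = radius + random.uniform(-noise, noise)
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts

inner = ring(1.0, 100)   # class 0
outer = ring(3.0, 100)   # class 1

# A threshold on the raw x coordinate would need the two classes to
# occupy disjoint x ranges, which they do not.
inner_x = [p[0] for p in inner]
outer_x = [p[0] for p in outer]
x_separable = max(inner_x) < min(outer_x) or max(outer_x) < min(inner_x)

# Re-parameterise by radius: one engineered feature, one threshold.
def radius(p):
    return math.hypot(p[0], p[1])

threshold = 2.0
inner_ok = all(radius(p) < threshold for p in inner)
outer_ok = all(radius(p) > threshold for p in outer)
```

In 2D the right feature is obvious by eye; the difficulty described above is that in 1000 dimensions nobody hands you `radius`.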
        
             | mjhay wrote:
             | That's an advantage over linear models, but GBTs handle
             | non-linearly-separable data just fine. Each individual tree
             | can represent an arbitrary piecewise-constant function given
             | enough depth, and then each tree in turn tries to minimize
             | the loss on the residual of the previous trees. As such,
             | they're effectively like a neural network with two hidden
             | layers in terms of expressiveness.
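
A toy version of the residual-fitting loop described above, assuming squared-error loss, depth-1 trees (stumps), and a fixed learning rate. It is a sketch of the general idea, not how XGBoost or any other library implements it; all names and constants are illustrative.

```python
def fit_stump(xs, targets):
    """Best single split on a 1-D feature under squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def boost(xs, ys, n_trees=50, lr=0.5):
    """Each new stump is fit to the residual left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_stump(xs, residuals)
        stumps.append((thr, lm, rm))
        preds = [p + lr * (lm if x <= thr else rm) for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# A clearly non-linear target: a parabola sampled on a grid.
xs = [i / 10 for i in range(20)]
ys = [(x - 1.0) ** 2 for x in xs]

base, stumps, preds = boost(xs, ys)
mse_start = mse(ys, [base] * len(xs))   # error of the constant model
mse_end = mse(ys, preds)                # error after boosting
```

The sum of many piecewise-constant stumps traces the curve even though no single stump can, which is the sense in which the ensemble handles non-linear structure.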
        
             | melondonkey wrote:
             | This explanation doesn't make sense to me. What do you mean
             | by "linearize your data"--tree methods assume no linear
             | form and are not even monotonically constrained.
             | Classification is not done by plane-drawing but by
             | probability estimation + cost function
        
               | dist-epoch wrote:
               | A tree split can be considered plane-drawing.
        
             | CuriouslyC wrote:
             | Note that if linear separability is the only issue you can
             | just use kernel methods. In fact, Gaussian processes are
             | equivalent to a single-hidden-layer neural network with
             | infinitely many hidden units.
             | 
             | The magic of deep neural networks comes from modeling
             | complicated conditional probability distributions, which
             | lets you do generative magic but isn't going to give you
             | significantly better results than ensemble kNN when you're
             | discriminating and the conditional distribution is low
             | variance. Ensemble methods are like a form of
             | regularization and they also act as a weak bootstrap to
             | better model population variance, so it's no surprise that
             | when they're capable of modeling the domain, they perform
             | better than unregularized, un-bootstrapped neural network
             | models. There are still tons of situations where ensemble
             | methods can't model the domain, and if you incorporated
             | regularization and bootstrapping into a discriminative NN
             | model it would probably perform equivalently to the
             | ensemble model.
        
           | mikkom wrote:
           | I assume it's because there are some very complex
           | relationships and patterns that cannot be captured by
           | decision trees. Tree models work better on simpler data; at
           | least that is my gut feeling based on previous experiments
           | with similar data.
        
             | mjhay wrote:
             | Interesting. Usually I have better luck with xgboost for
             | tabular data, even when the relationships are complex
             | (which usually means deeper trees). It does fall flat a lot
             | of the time for very high dimensions, though. All data is
             | different, I guess.
        
           | lerchmo wrote:
           | There is some work with zero shot (decoder only) time series
           | predictions by google and an open source variant. Curious to
           | see how these approaches stack up as they are explored.
        
         | redox99 wrote:
         | In what way have models gotten better for tabular data? Can't
         | think of any new technique since 2022.
        
           | jeffreyrogers wrote:
           | There has been some work on training on lots of different
           | data sets and then specializing on the one you care about.
           | But I think people were trying that approach pre-2022 as
           | well.
        
             | frituur wrote:
             | Do you have some good scientific references for that? I'd
             | love to incorporate them in my phd thesis!
        
               | jeffreyrogers wrote:
               | Sorry, I don't have references off the top of my head. I
               | just recall coming across it while I was working on
               | something related to timeseries forecasting.
        
             | asdff wrote:
             | This has to be done with great care. Most datasets are of
             | poor quality.
        
           | _pastel wrote:
           | Tooling around embeddings has improved. Creating and fine-
           | tuning custom embeddings for your tabular data should be
           | easier and more powerful these days.
        
         | __mharrison__ wrote:
         | Pretty sure it is still true today. Catboost rules the roost!
        
       | dawnofdusk wrote:
       | Paper seems interesting but I don't like the question title. I
       | think the answer to the question would just be that tabular data
       | is not fully in the "big data" regime yet so there is no reason a
       | priori to expect deep NNs to do better. Factor in computational
       | simplicity of tree-based models and I think the deck is stacked
       | against deep learning from the start.
        
         | math_dandy wrote:
         | Do you know of _any_ (families of) examples of tabular datasets
         | of any size (you can choose what "big" means) where deep
         | learning convincingly outperforms traditional methods? I would
         | love some quality examples of this nature to use in my
         | teaching.
        
           | Scene_Cast2 wrote:
           | Recommendation engines: search, feeds (tiktok / youtube
           | shorts / etc), ads, netflix suggestions, doordash
           | suggestions, etc etc. Also happens to be my specialty.
        
             | usgroup wrote:
             | I'm not sure that is true. I think inference speed is often
             | the bottleneck for the use cases stated, as is the need for
             | frequent re-training. As a result algorithms like catboost
             | are very popular in those domains. I think catboost was
             | actually invented by Yandex.
             | 
             | PS: It's weird that you are being down-voted. I think your
             | opinion is reasonable.
        
               | Scene_Cast2 wrote:
               | Inference speed: more sophisticated stacks use multiple
               | stages. Early stage might be a sublinear vector search,
               | and the heavy hitting neural nets only rerank the
               | remainder. Bytedance has a paper on their fairly fancy
               | sublinear approach.
               | 
               | Retraining - online training solves this for the most
               | part.
               | 
               | Frameworks - the only battle-tested batteries-included
               | one I've seen is Vespa. No one else publishes any of the
               | interesting bits. KDD is the most relevant conference if
               | you're interested in the field. IIRC Xiaohongshu has some
               | papers that can only really be done with NNs.
        
             | math_dandy wrote:
             | Wonderful! Any public datasets you could point me to?
        
               | Scene_Cast2 wrote:
               | Unfortunately, none that I know of. Maybe the Netflix
               | movie recommendations challenge from ages ago? I haven't
               | looked at it personally.
        
             | Jensson wrote:
             | I worked with search and ads models at Google; for most
             | things tree models were better. What evidence do you have
             | that neural nets are better there? I worked with large
             | parts of Google search ranking so I know what I'm talking
             | about. For some parts you want a neural net, but most of the
             | work is done by tree models and similar; they both perform
             | better and run faster.
        
         | Scene_Cast2 wrote:
         | I've worked on models trained on ultra-large tabular data. It
         | still took substantial effort to beat tree models (custom
         | architecture specifically for this particular domain, something
         | I haven't seen elsewhere out in the open).
         | 
         | When tabular data is mentioned, one of the unspoken
         | applications is finance. There, my guess is that one of the
         | issues is that data is not very IID and thus latent "events"
         | are fairly sparse. Combine that with the humongous amount of
         | raw data, and you get models that overfit.
        
           | TimPC wrote:
           | I think there are certain types of tabular data that lend
           | themselves naturally to tree models. But when you're talking
           | about tabular data for finance I guarantee you very few hedge
           | funds are running tree models for trading strategies. When
           | your scale of data is the past X quarters of all stock prices
           | and trade volumes you have enough data that you can fit an NN
           | and there are a number of techniques you can use to reduce
           | overfitting (large amount of data, good regularization,
           | dropout, etc.)
        
             | Jensson wrote:
             | > But when you're talking about tabular data for finance I
             | guarantee you very few hedge funds are running tree models
             | for trading strategies
             | 
             | What do you base this on? Having only neural nets on
             | tabular data is mostly done due to laziness of the creator
             | since neural nets are much easier to use, not because
             | neural nets perform better even with large amounts of data.
             | In general you want both since they are good at finding
             | different kinds of patterns.
        
         | Jensson wrote:
         | The tabular data I had at Google was exabytes, tree models
         | still performed the best so I guess exabytes is small data
         | then?
        
       | Scene_Cast2 wrote:
       | The team behind Yggdrasil tree library at Google was doing some
       | interesting research into tree differentiability (and thus
       | unlocking SGD & end-to-end learning for hybrid architectures).
        
       | doubtfuluser wrote:
       | Since this is from 2022, I'm wondering how "tabular foundation
       | models" could change this. The incredible success of DL we see at
       | the moment comes partially from foundation models learning an
       | "understanding" of the behavior from a lot of "semi-related" data.
       | Something similar has been explored in tabular data as well iirc.
       | 
       | So I would be curious to see the latest DL results. On the other
       | hand, in most cases where DL based on foundation models is used,
       | specific heavily tuned models outperform the generalist models.
       | And for tabular data there is a lot of experience in how to get
       | great results with tree-based models.
        
         | dweinus wrote:
         | What would these tabular foundation models look like? LLMs work
         | as foundation models because the input is fixed in format (a
         | sequence of text). Would the model be for a specific fixed
         | tabular format?
        
           | scottyak wrote:
           | One promising approach is to encode each feature key and
           | feature value as embedding vectors, concatenate them into
           | "feature tokens", then feed them into a Transformer (without
           | positional encodings). This takes advantage of column-order
           | invariance. See:
           | 
           | https://arxiv.org/abs/2403.01841 (ICLR 2024 spotlight)
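
A toy illustration of the column-order invariance mentioned above, under heavy assumptions: random untrained embeddings, numeric values embedded by scaling one shared vector, a single attention head with identity projections, and mean pooling. It is not the linked paper's architecture; the column names and sizes are invented.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical width of each half of a feature token

def rand_vec(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

columns = ["age", "income", "tenure"]
key_emb = {c: rand_vec(D) for c in columns}  # one vector per column name
val_dir = rand_vec(D)                        # scaled by the numeric value

def tokens(row):
    """One token per feature: key embedding concatenated with value embedding."""
    return [key_emb[c] + [v * w for w in val_dir] for c, v in row.items()]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_and_pool(toks):
    """Self-attention with no positional encodings, then mean pooling."""
    scale = math.sqrt(len(toks[0]))
    outs = []
    for q in toks:
        scores = [dot(q, k) / scale for k in toks]
        m = max(scores)                              # stabilise the softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        outs.append([sum(w * t[j] for w, t in zip(weights, toks)) / z
                     for j in range(len(q))])
    n = len(outs)
    return [sum(o[j] for o in outs) / n for j in range(len(outs[0]))]

row = {"age": 0.3, "income": 1.2, "tenure": -0.7}
shuffled = {c: row[c] for c in ["tenure", "age", "income"]}  # same data, new order

a = attend_and_pool(tokens(row))
b = attend_and_pool(tokens(shuffled))
same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))
```

Because the column identity travels inside each token rather than in its position, reordering the columns changes nothing except floating-point summation order.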
        
       | moelf wrote:
       | There seem to be differentiable tree models now that perform
       | somewhat better than e.g. XGBoost
       | https://github.com/Evovest/MLBenchmarks.jl?tab=readme-ov-fil...
        
       | candiodari wrote:
       | TLDR: Because tree-based models don't just outperform deep
       | learning, they totally outclass deep learning on simple data.
       | 
       | But they don't scale with larger and more complex data. You
       | cannot (realistically) make an LLM with XGBoost.
       | 
       | Kind of surprised how well Resnet and FT Transformer do though.
        
         | nickpsecurity wrote:
         | I want to see more work combining them. Here's an example I saw
         | in one of the links in this thread:
         | 
         | https://arxiv.org/abs/1806.06988
         | 
         | It combines NNs with decision trees.
        
           | huac wrote:
           | the famous FB ads paper (from 10 years ago!) combines
           | decision trees with a logistic regression and shows a
           | significant improvement:
           | https://research.facebook.com/publications/practical-
           | lessons...
           | 
           | feel free to extend logistic regression to an MLP :)
        
       | MAXPOOL wrote:
       | > deep learning architectures have been crafted to create
       | inductive biases matching invariances and spatial dependencies of
       | the data. Finding corresponding invariances is hard in tabular
       | data, made of heterogeneous features, small sample sizes, extreme
       | values
       | 
       | Transformers without positional encodings are invariant to the
       | input order. CNNs have translation invariance and can have a
       | little rotational invariance.
       | 
       | It's harder to find similar invariances in tabular data. Maybe
       | applying methods from GNNs would help?
        
       | queuebert wrote:
       | This is tangential, but that paper has some amazingly good plots
       | for an ML paper.
        
       | usgroup wrote:
       | I'm not sure this is surprising. Say you were to glue together 10
       | datasets with the same 10 explanatory features and 1 response
       | feature, but distributed very differently from each other. This
       | would be no problem for a tree-based model because it'll
       | conditionalise indefinitely to get a good fit. If the number of
       | records is relatively small (say 10k) the data will be much too
       | scarce for an NN to learn these discontinuities -- it's like it
       | has 1000 records per segment.
       | 
       | Similarly, tabular data is often of this nature. It's not i.i.d.;
       | it tends to cluster.
        
         | 3abiton wrote:
         | I wonder if that would be the case for graph based models too
        
       | skadamat wrote:
       | When working with tabular data, there are very few situations
       | where absolute model performance is the only criterion that
       | matters. In practice, the following are equally important:
       | 
       | - Explainability / debug-ability of models
       | 
       | - Effort to train, deploy, and manage NN models in production
       | 
       | - Capturing, collating, and organizing new & better datasets
       | 
       | - Local developer experience and human-model-iteration time
       | 
       | Building all of your software in C or Assembly will make it
       | faster and more performant. But at what cost and with what
       | tradeoffs? Building a website has a different set of tradeoffs
       | than building a program for the Mars rover.
        
         | derefr wrote:
         | It's funny; as a regular non-ML programmer, the optimum for
         | every one of those factors for "tabular data" would seem to
         | _me_ to be to "throw the tabular data into a relational data
         | warehouse, and ask your questions in the form of SQL queries."
         | 
         | Or, if the "tabular data" is heavily relationship-based, then
         | possibly replace "relational data warehouse" with "graph
         | database", and "SQL queries" with whatever querying language
         | that graph DB is natively / most expressively queried in.
         | 
         | Of course, this is the most important implicit "equally
         | important" factor, one that an ML dev would think goes without
         | mentioning: the generality or "power" of the model in what
         | questions it can answer. You can only make these trade-offs in
         | the context of knowing what kinds of questions you want your
         | model to solve for! If all your questions are quantitative
         | ones, maybe the right "model" for you is an RDBMS!
         | 
         | ---
         | 
         | Though, that being said... why can't a deep-learning model
         | _emulate_ the thing that an RDBMS does, "at runtime", as part
         | of its "mental toolkit" for approaching problems? That would be
         | the best of both worlds, no?
         | 
         | I know that LLMs in particular have been observed to have
         | "emergent numeracy" above a certain training-set size. There is
         | a step function in how they approach such problems, going from
         | their only being able to answer arithmetic questions on numbers
         | of bounded size, and sometimes getting the answers wrong
         | (probably this is due to a memorization-based approach); to
         | being able to answer arbitrary arithmetic questions on operands
         | of unbounded size, and always getting the answer correct.
         | 
         | I would _guess_ that what's happening is that they are
         | developing a functional component of their network that works
         | akin to an Arithmetic Logic Unit, operating not on tokens, but
         | on tokens _transformed_ into a "numeric register"
         | representation that is amenable to having math done to it with
         | stable, quantized, position-independent results. (Just like the
         | functional component that human brains develop after seeing
         | enough math problems... probably.)
         | 
         | Do you, as an ML dev, think it would ever be possible for any
         | of the model architectures we're familiar with today, to be
         | trained such that they would develop an analogous emergent
         | functional component for handling tabular-data questions, by
         | _transforming its internal working state into relational-DB/
         | graph-DB data structures_ -- e.g. page-heaps of binary-packed
         | row-tuples; B-tree indices; etc -- and then manipulating the
         | working state in that form, using learned algorithms applicable
         | to that type of data?
         | 
         | It seems to me (possibly just because I don't know any better)
         | that just as with numeracy, "being able to put the data into a
         | different and better internal representation" is what would be
         | needed for deep-learning models to become truly _good_ at
         | dealing with tabular-data problems.
         | 
         | But, unlike with numeracy, "thinking as if you were a
         | relational database" is _not_ something a single human would
         | ever intuit how to do without being taught. Relational algebra
         | -- and the data-structures and algorithms to make it practical
         | to have a Turing machine do said relational algebra -- wasn't
         | even a single intuition, but a conscious effort, of _multiple_
         | humans, working together over years. I strongly doubt that
         | there's any number of "tabular-data problems" that you could
         | show a human being, that would result in them developing an
         | _intuitional ability_ to do what a relational database does
         | with its memory to efficiently answer queries.
         | 
         | (I suppose we could _give_ an ML model an RDBMS, and hardwire
         | it to interact with it. I know there are hybrid ML + formal-
         | logic systems. Are there hybrid ML + data-warehouse systems?
         | Not where the model queries an external DB -- while that can be
         | done, it'd be only in the same "stop and do this" way that
         | ChatGPT runs Python code, which wouldn't make it a _thinking
         | tool_ the way that the formal-logic proof engines are for
         | hybrid ML systems. Rather, I mean that some data-warehouse
         | execution engine could be embedded into the ML execution
         | framework itself, deployed as part of the GPU shader-program to
         | each tensor core, such that data-warehouse operations can be
         | done as a native part of the network's per-node instruction-
         | set. Anyone ever tried this?)
        
           | dist-epoch wrote:
           | > ask your questions in the form of SQL queries
           | 
           | How do you know which questions to ask? This is what ML is
           | good at, finding the right questions which classify the data.
        
             | derefr wrote:
             | You already made a faulty assumption -- that we're
             | interested in "classifying the data" in the first place.
             | 
             | Maybe we already know everything about the dataset. For
             | example, if it's line-of-business customer data gradually
             | built up by a sales team, then the _brains of the
             | salespeople_ have likely already done all the "implicit
             | classification" needed to generate good questions about the
             | dataset.
             | 
             | And this is, by far, the _usual_ scenario for Business
             | Intelligence questions: someone with "business-domain
             | knowledge", e.g. an executive, has formed an _intuitional
             | hypothesis_ about the data based on their personal
             | experience; and so they ask someone with "data-domain
             | knowledge", e.g. a business analyst or data scientist, to
             | test that hypothesis.
             | 
             | It's actually rare, in my experience, to have a tabular-
             | data dataset that someone is motivated to understand, that
             | doesn't also "come with" a set of people who can already
             | act as (good!) models trained on that dataset, to aid them
             | in that understanding. (Sometimes these people can't _find_
             | each other -- but they do usually _exist_.)
             | 
             | AFAIK, having reams of _entirely opaque and ill-understood_
             | tabular data, such that you need classification/clustering
             | to get started on asking questions, only really happens in
             | the sciences: sensor-network climate data; longitudinal-
             | study medical-outcome data; census data; housing-market
             | data; etc. In other words, it's almost always _universities
             | and governments_ -- not businesses -- that care about
             | analyzing opaque tabular data.
             | 
             | And that's a key to understanding the constraints in play
             | for choosing models! Because business-driven analyses are
             | usually time-constrained in some way (potentially even
             | needing post-training question-answers to be generated in
             | soft-realtime); while institutional analyses usually
             | aren't. Big difference!
        
               | letsdothisagain wrote:
               | I'm really not clear on why you're arguing against this.
               | A proper data warehouse tackles the known unknowns, i.e.
               | supervised learning. But you can glean new insights using
               | unsupervised learning, like the textbook example of
               | Target knowing a woman is pregnant based on sales data.
               | 
               | https://www.forbes.com/sites/kashmirhill/2012/02/16/how-
               | targ...
        
               | lemmsjid wrote:
               | I might be misunderstanding your point, but there are use
               | cases that have come up for me repeatedly across multiple
               | businesses. Some examples, without getting too specific:
               | 
               | - identify latent features of customers via their
               | behavioral data, to be used for profiling customers or
               | recommending products to them
               | 
               | - within a large amount of customer behavioral data,
               | identify potentially fraudulent behavior
               | 
               | - identify causes of seasonality (e.g. temporal patterns)
               | in the data in order to improve forecasting (sales,
               | traffic, whatever)
               | 
               | In those cases part of the investigation is to initially
               | take a hands-off (unsupervised) approach, so that we can
               | compare our initial top-down hypotheses with actual
               | patterns in the data.
               | 
               | In both of those cases there's considerable (and
               | sometimes adversarial) noise in the data.
        
               | itsoktocry wrote:
               | > _You already made a faulty assumption -- that we're
               | interested in "classifying the data" in the first place._
               | 
               | It's not clear what your point is. If you're not
               | interested in the predictions that tree-based models
               | provide, do not use tree-based models on your tabular
               | data. A predictive model and a SQL query are not the same
               | thing.
        
             | jonathankoren wrote:
             | What? No! That's not how it works. That's not how anything
             | -- _including unsupervised techniques_ -- works!
        
           | jonathankoren wrote:
           | > It's funny; as a regular non-ML programmer, the optimum for
           | every one of those factors for "tabular data" would seem to
           | me, to be to "throw the tabular data into a relational data
           | warehouse, and ask your questions in the form of SQL
           | queries."
           | 
           | It's doubly funny: as someone who comes from an ML
           | background, and has developed and maintained multiple ML
           | systems at multiple orgs, I also think the answer very
           | often is, "throw the tabular data into a relational data
           | warehouse, and ask your questions in the form of SQL
           | queries."
        
             | dartos wrote:
             | Most problems don't need complex solutions.
        
           | closeparen wrote:
           | >"throw the tabular data into a relational data warehouse,
           | and ask your questions in the form of SQL queries."
           | 
           | You can ask SQL descriptive questions. Can you ask it for
           | predictions? How?
        
             | asdff wrote:
             | This is called extrapolation and can be done with simple
             | linear regression in some cases
        
               | itsoktocry wrote:
               | > _simple linear regression in some cases_
               | 
               | You're correct, but "in some cases" is doing a lot of
               | work here.
               | 
               | With the tooling where it's at, how much harder is it to
               | apply XGBoost vs a linear model?
        
             | taway_6PplYu5 wrote:
             | https://www.red-gate.com/simple-talk/blogs/statistics-sql-
             | si...
             | 
             | One of several examples of implementing linear regression
             | in SQL.
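
As a rough illustration of the idea (not the code from the linked article), one-variable least squares needs nothing beyond SQL aggregates. The sketch below uses Python's built-in sqlite3 module so it can be run as-is; the `points` table, its columns, the toy data and the `predict` helper are all made up for the example.

```python
import sqlite3

# Hypothetical toy data lying exactly on y = 2x + 1, so the fit is easy to check.
rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)", rows)

# Ordinary least squares for a single predictor, written with aggregates only:
#   slope     = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
#   intercept = (Sy - slope*Sx) / n
FIT_SQL = """
WITH s AS (
    SELECT COUNT(*)   AS n,
           SUM(x)     AS sx,
           SUM(y)     AS sy,
           SUM(x * y) AS sxy,
           SUM(x * x) AS sxx
    FROM points
),
fit AS (
    SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope, n, sx, sy
    FROM s
)
SELECT slope, (sy - slope * sx) / n AS intercept
FROM fit
"""
slope, intercept = conn.execute(FIT_SQL).fetchone()


def predict(x_new):
    """Extrapolate with the fitted line (hypothetical helper)."""
    return slope * x_new + intercept


print(slope, intercept, predict(10.0))
```

This only covers the "in some cases" situation discussed in the thread: a single numeric predictor with a roughly linear relationship. Anything with interactions or non-linearities is where a tree-based model would normally take over.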
        
       | martindbp wrote:
       | Deep learning really shines when the input is raw and at a very
       | low abstraction level: pixels, byte pair encodings etc. Using
       | deep learning for classification on tabular data is just
       | needless complexity, as the variables are often at a very high
       | abstraction level already. Also, with tabular data there are
       | generally not many spatial or temporal relationships between the
       | variables, and exploiting those is what CNNs and transformers
       | excel at.
        
         | jobigoud wrote:
         | We have things that can describe and explain why an image they
         | have never seen is funny. That's pretty high level.
        
           | martindbp wrote:
           | Yes, what I meant was deep learning is great at deriving
           | those higher level abstractions from low level raw data.
           | Words can be seen as something in between: bag of words can
           | be fairly effective at simpler tasks, but LLMs embed words
           | into higher and higher abstractions.
        
         | andy99 wrote:
         | Also images and text have tons of recurring patterns that can
         | be exploited to train big models with lots of data. There is an
         | internet's worth of each modality that, at least generally, can
         | all contribute to helping a model build up a better overall
         | understanding.
         | 
         | There is no analog for tabular data; it's all different.
        
       | sgt101 wrote:
       | Isn't this just that trees are a natural compression of tables?
       | 
       | Something something inductive bias?
        
       | hashta wrote:
       | I have a lot of experience working with both families of models.
       | If you use an ensemble of 10 NNs, they outperform well-optimized
       | tree-based models such as XGBoost & RFs.
        
         | padthai wrote:
         | Which kind of ensemble? Because it cannot be as easy as a
         | voting meta-model of NNs with the same
         | architecture/hyperparameters, right?
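
For readers unfamiliar with the term, a "voting" ensemble in its simplest form just combines the per-class outputs of several independently trained models. The thread does not say which scheme hashta used, so the sketch below is only a generic illustration: the probability vectors are invented stand-ins for the outputs of three networks on one row, and `soft_vote` / `hard_vote` are hypothetical helper names.

```python
from collections import Counter


def soft_vote(member_probs):
    """Average class probabilities across members and pick the largest."""
    n_classes = len(member_probs[0])
    avg = [
        sum(p[k] for p in member_probs) / len(member_probs)
        for k in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg


def hard_vote(member_probs):
    """Each member votes for its own top class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in member_probs]
    return Counter(votes).most_common(1)[0][0]


# Invented outputs of three networks for a single two-class example.
member_probs = [
    [0.55, 0.45],  # weakly prefers class 0
    [0.60, 0.40],  # weakly prefers class 0
    [0.10, 0.90],  # strongly prefers class 1
]

print(hard_vote(member_probs))     # majority of top-1 votes
print(soft_vote(member_probs)[0])  # class with the highest averaged probability
```

The two schemes can disagree, as they do on this example: averaging probabilities lets one confident member outweigh two lukewarm ones, whereas majority voting ignores confidence.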
        
       ___________________________________________________________________
       (page generated 2024-03-05 23:00 UTC)