[HN Gopher] XLSTMTime: Long-Term Time Series Forecasting with xLSTM
       ___________________________________________________________________
        
       XLSTMTime: Long-Term Time Series Forecasting with xLSTM
        
       Author : beefman
       Score  : 217 points
       Date   : 2024-07-16 17:14 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | carbocation wrote:
       | > In recent years, transformer-based models have gained
       | prominence in multivariate long-term time series forecasting
       | 
       | Prominence, yes. But are they generally better than non-deep
       | learning models? My understanding was that this is not the case,
       | but I don't follow this field closely.
        
         | Pandabob wrote:
         | While I don't have firsthand experience with these models, I
         | recently discussed this topic with a friend who has used tree-
         | based models like XGBoost for time series analysis. They noted
         | that transformer-based architectures tend to yield decent
         | performance on time series tasks with relatively little effort
         | compared to tree models.
         | 
         | From what I understood, tree-based models can usually
         | outperform transformers when given sufficient parameter tuning.
         | However, models like TimeGPT offer decent performance without
         | extensive tuning, making them an attractive option for quicker
         | implementations.
        
         | techwizrd wrote:
         | In my aviation safety work, deep learning outperforms
         | traditional non-DL models for multivariate time-series
         | forecasting. Among deep learning models, though, I've seen
         | wide variance in performance across transformers, Bi-LSTMs,
         | regular MLPs, VAEs, and so on.
        
           | theLiminator wrote:
           | What's your go-to model that generally performs well with
           | little tuning?
        
             | techwizrd wrote:
             | If you have short time series with low variance, little
             | noise, few outliers, strong prior knowledge, or limited
             | resources to train and maintain a model, I would stick
             | with simpler traditional models.
             | 
             | If DL is a good fit for your use-case, then I tend to like
             | transformers or combining CNNs with recurrent models (e.g.,
             | BiGRU, GRU, BiLSTM, LSTM) and optional attention.
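             | 
             | To make that concrete, here's a minimal sketch of the
             | kind of hybrid I mean (PyTorch; the layer sizes are
             | placeholders, and a real model would add normalization,
             | attention, etc.):
             | 
             |     import torch
             |     import torch.nn as nn
             | 
             |     class CNNBiLSTM(nn.Module):
             |         # Conv front-end for local patterns,
             |         # BiLSTM for longer-range context.
             |         def __init__(self, n_feat, horizon, hidden=64):
             |             super().__init__()
             |             self.conv = nn.Conv1d(n_feat, hidden, 3,
             |                                   padding=1)
             |             self.lstm = nn.LSTM(hidden, hidden,
             |                                 batch_first=True,
             |                                 bidirectional=True)
             |             self.head = nn.Linear(2 * hidden, horizon)
             | 
             |         def forward(self, x):  # (batch, time, features)
             |             z = torch.relu(self.conv(x.transpose(1, 2)))
             |             out, _ = self.lstm(z.transpose(1, 2))
             |             # forecast the horizon from the last step
             |             return self.head(out[:, -1])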
        
           | montereynack wrote:
           | Seconding the other question, would be curious to know
        
           | ramon156 wrote:
           | Now take into account that it has to be lightweight, and
           | DL falls short.
        
           | nerdponx wrote:
           | What are you doing in aviation safety that requires time
           | series modeling? Weather?
        
             | all2 wrote:
             | My best guess would be accident occurrence prediction.
        
         | dongobread wrote:
         | From experience in payments/spending forecasting, I've found
         | that deep learning generally underperforms gradient-boosted tree
         | models. Deep learning models tend to be good at learning
         | seasonality but do not handle complex trends or shocks very
         | well. Economic/financial data tends to have straightforward
         | seasonality with complex trends, so deep learning tends to do
         | quite poorly.
         | 
         | I do agree with this paper - all of the good deep learning time
         | series architectures I've tried are simple extensions of MLPs
         | or RNNs (e.g. DeepAR or N-BEATS). The transformer-based
         | architectures I've used have been absolutely awful, especially
         | the endless stream of transformer-based "foundational models"
         | that are coming out these days.
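         | 
         | For a sense of how simple these are, here's a rough sketch
         | of an N-BEATS-style doubly-residual MLP block (PyTorch; not
         | the paper's exact architecture, and the sizes are
         | placeholders):
         | 
         |     import torch.nn as nn
         | 
         |     class NBeatsBlock(nn.Module):
         |         # Each block emits a backcast (subtracted from its
         |         # input) and a forecast (summed across blocks).
         |         def __init__(self, lookback, horizon, hidden=256):
         |             super().__init__()
         |             self.mlp = nn.Sequential(
         |                 nn.Linear(lookback, hidden), nn.ReLU(),
         |                 nn.Linear(hidden, hidden), nn.ReLU())
         |             self.backcast = nn.Linear(hidden, lookback)
         |             self.forecast = nn.Linear(hidden, horizon)
         | 
         |         def forward(self, x):  # x: (batch, lookback)
         |             h = self.mlp(x)
         |             return x - self.backcast(h), self.forecast(h)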
        
           | sigmoid10 wrote:
           | Transformers are just MLPs with extra steps. So in theory
           | they should be just as powerful. The problem with
           | transformers is simultaneously their big advantage: They
           | scale extremely well with larger networks and more training
           | data, better than any other architecture out there. So if
           | you had enormous datasets and unlimited compute budget, you
           | could probably do amazing things in this regard as well. But
           | if you're just a mortal data scientist without extra funding,
           | you will be better off with more traditional approaches.
        
             | dongobread wrote:
             | I think what you say is true when comparing transformers to
             | CNNs/RNNs, but not to MLPs.
             | 
             | Transformers, RNNs, and CNNs are all techniques to reduce
             | parameter count compared to a pure-MLP model. If you took a
             | transformer model and replaced each self-attention layer
             | with a linear layer+activation function, you'd have a pure
             | MLP model that can model every relationship the transformer
             | does, and can model more possible relationships as well
             | (at the cost of far more parameters). MLPs are more
             | powerful/scalable but transformers are more efficient.
             | 
             | Compared to MLPs, transformers save on parameter count by
             | skimping on the number of parameters devoted to modeling
             | the relationship between tokens. This works in language
             | modeling, where relationships between tokens aren't _that_
             | important - you can jumble up the words in this sentence
             | and it still mostly makes sense. This doesn't work in time
             | series, where relationships between tokens (timesteps) are
             | the most important thing of all. The LTSF paper linked in
             | the OP paper also mentions this same problem:
             | https://arxiv.org/pdf/2205.13504 (see section 1)
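             | 
             | Concretely, the swap I have in mind looks roughly like
             | this (a PyTorch sketch; the sizes are made up):
             | 
             |     import torch.nn as nn
             | 
             |     seq_len, d_model = 96, 128
             | 
             |     # Attention mixes tokens with O(d_model^2) weights;
             |     # relations between positions are computed on the
             |     # fly rather than stored per position pair.
             |     attn = nn.MultiheadAttention(d_model, num_heads=8,
             |                                  batch_first=True)
             | 
             |     # A pure-MLP mixer gives every pair of timesteps its
             |     # own weight, so ordering matters -- at a very large
             |     # parameter cost.
             |     mlp = nn.Sequential(
             |         nn.Flatten(),
             |         nn.Linear(seq_len * d_model, seq_len * d_model),
             |         nn.ReLU(),
             |         nn.Unflatten(1, (seq_len, d_model)))
             | 
             |     count = lambda m: sum(p.numel()
             |                           for p in m.parameters())
             |     print(count(attn), count(mlp))  # ~66K vs ~151M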
        
               | immibis wrote:
               | Transformers reduce the number of relationships between
               | tokens that must be learned, too. An MLP has to
               | separately learn all possible relationships between token
               | 1 and 2, and 2 and 3, and 3 and 4. A transformer can
               | learn relationships between specific values regardless of
               | position.
        
               | newrotik wrote:
               | Though I agree with the idea that MLPs are theoretically
               | more "capable" than transformers, I think seeing them
               | just as a parameter reduction technique is also
               | excessively reductive.
               | 
               | Many have tried to build deep and large MLPs for a long
               | time, but at some point adding more parameters wouldn't
               | increase models' performance.
               | 
               | In contrast, transformers became so popular because their
               | modelling power just kept scaling with more and more data
               | and more and more parameters. It seems like the
               | 'restriction' imposed on transformers (the attention
               | structure) is a very good functional form for modelling
               | language (and, more and more, some tasks in vision and
               | audio).
               | 
               | They did not become popular because they were modest with
               | respect to the parameters used.
        
               | sigmoid10 wrote:
               | >Compared to MLPs, transformers save on parameter count
               | by skimping on the number of parameters
               | 
               | That is only correct if you look at models with equal
               | parameter count from a purely theoretical perspective. In
               | practice, it is possible to train transformers to orders
               | of magnitude bigger scales than MLPs because they are so
               | much more efficient. That's why I said a modern
               | transformer will easily beat these puny modern MLPs, but
               | only in cases where data and compute budgets allow it.
               | That is not even a question. If you look at recent time
               | series forecasting leaderboard entries, you'll almost
               | always see transformers at or near the top:
               | https://github.com/thuml/Time-Series-Library
        
         | rjurney wrote:
         | They aren't so hot, but recent efforts at transfer learning
         | were promising.
        
         | svnt wrote:
         | The paper says this in the next paragraph. xLSTMTime is not
         | transformer-based either.
        
       | Dowwie wrote:
       | It's marketed as a forecasting tool, so is it not applicable
       | to event classification in time series?
        
         | RamblingCTO wrote:
         | I'd say that's kind of a different task. I'm not a pro in this,
         | but you could maybe treat it as a multi-variate forecast
         | problem where the targets are probabilities per event if n is
         | really small?
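         | 
         | Roughly this kind of thing, trained with a per-event BCE
         | loss (just a sketch; the names and sizes are made up):
         | 
         |     import torch.nn as nn
         | 
         |     class EventProbs(nn.Module):
         |         # Forecast a probability per event type
         |         # for the next timestep.
         |         def __init__(self, n_feat, n_events, hidden=64):
         |             super().__init__()
         |             self.rnn = nn.GRU(n_feat, hidden,
         |                               batch_first=True)
         |             self.head = nn.Linear(hidden, n_events)
         | 
         |         def forward(self, x):  # (batch, time, features)
         |             out, _ = self.rnn(x)
         |             return self.head(out[:, -1]).sigmoid()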
        
         | jimmySixDOF wrote:
         | Yes, I would be interested where this (and any Transformer/LLM
         | based approach) is improving anomaly detection for example.
        
           | spmurrayzzz wrote:
           | I can't speak for all use cases, but I've done a great deal
           | of work in the space of using deep learning approaches for
           | anomaly detection in network device telemetry. In particular
           | with high resolution univariate time series of latency
           | measurements, we saw success using convolutional autoencoders
           | and GANs. These methods lean on reconstruction loss rather
           | than forecasting, but are still effective.
           | 
           | There is some prior art for this that we leaned on [1][2].
           | 
           | RE: transformers -- I did some early experimentation with
           | Temporal Fusion Transformers [3] which worked pretty well for
           | forecasting compared to other deep learning methods, but
           | rarely did I see it outperform standard baselines (like
           | ARIMA) in our datasets.
           | 
           | [1] https://www.mdpi.com/2076-3417/12/23/12472
           | 
           | [2] https://arxiv.org/abs/2009.07769
           | 
           | [3] https://arxiv.org/abs/1912.09363
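           | 
           | The core of the reconstruction-loss approach fits in a few
           | lines (a sketch assuming fixed-length univariate windows
           | whose length is divisible by 4; picking the threshold is
           | the real work):
           | 
           |     import torch
           |     import torch.nn as nn
           | 
           |     class ConvAE(nn.Module):
           |         # Compress and reconstruct a window; a large
           |         # reconstruction error flags an anomaly.
           |         def __init__(self):
           |             super().__init__()
           |             self.enc = nn.Sequential(
           |                 nn.Conv1d(1, 16, 5, 2, 2), nn.ReLU(),
           |                 nn.Conv1d(16, 8, 5, 2, 2), nn.ReLU())
           |             self.dec = nn.Sequential(
           |                 nn.ConvTranspose1d(8, 16, 5, 2, 2,
           |                                    output_padding=1),
           |                 nn.ReLU(),
           |                 nn.ConvTranspose1d(16, 1, 5, 2, 2,
           |                                    output_padding=1))
           | 
           |         def forward(self, x):  # x: (batch, 1, window)
           |             return self.dec(self.enc(x))
           | 
           |     def scores(model, x):
           |         # Per-window MSE; fit the alert threshold on
           |         # known-normal data.
           |         with torch.no_grad():
           |             return ((x - model(x)) ** 2).mean(dim=(1, 2))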
        
       | greatpostman wrote:
       | The best deep learning time series models are closed source
       | inside hedge funds.
        
         | 3abiton wrote:
         | I think hedge funds, at least the advanced ones, definitely
         | don't use time series modelling anymore. That's quite
         | outdated nowadays.
        
           | max_ wrote:
           | What do you suspect they are using?
        
             | meowkit wrote:
             | They pull data from all kinds of things now.
             | 
             | For example, satellite imagery of trucking activity
             | correlated to specific companies or industries.
             | 
             | It's all signal processing at some level, but directly
             | modeling the time series of price or other asset metrics
             | doesn't have the alpha it may have had decades ago.
        
               | greatpostman wrote:
               | Alternative data is passed into time series models. They
               | are features.
               | 
               | You don't know as much about this as you think
        
               | myhf wrote:
               | ☝️
        
             | nextos wrote:
             | Some funds that tried to recruit me were really interested
             | in classical generative models (ARMA, GARCH, HMMs with
             | heavy-tailed emissions, etc.) extended with deep components
             | to make them more flexible. Pyro and Kevin Murphy's ProbML
             | vol II are a good starting point to learn more about these
             | topics.
             | 
             | The key is to understand that in some of these problems,
             | data is relatively scarce, and it is really important to
             | quantify uncertainty.
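             | 
             | On the uncertainty point: even a plain ARIMA gives you
             | forecast intervals for free, which is the baseline any
             | deep components have to beat (statsmodels; the order is
             | a placeholder):
             | 
             |     import numpy as np
             |     from statsmodels.tsa.arima.model import ARIMA
             | 
             |     rng = np.random.default_rng(0)
             |     y = np.cumsum(rng.normal(size=500))  # toy series
             | 
             |     res = ARIMA(y, order=(1, 1, 1)).fit()
             |     fc = res.get_forecast(steps=20)
             | 
             |     mean = fc.predicted_mean            # point forecast
             |     lo, hi = fc.conf_int(alpha=0.05).T  # 95% interval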
        
           | rjurney wrote:
           | There are many ways of approaching quantitative trading and
           | many people do employ time series analysis, especially for
           | high frequency trading.
        
         | fermisea wrote:
         | Most of the hard work is actually feature construction rather
         | than monolithic models. And afaik gradient boosting still rules
         | the world
        
         | energy123 wrote:
         | There is no such thing as a generally best model due to the no
         | free lunch theorem. What works in hedge funds will do poorly
         | in other areas, which have different amounts and kinds of
         | data and so need different inductive biases.
        
       | thedudeabides5 wrote:
       | Can't wait for someone to lose all their money trying to
       | predict stocks with this thing.
        
       | nyanpasu64 wrote:
       | I misread this as XSLT :')
        
         | selimnairb wrote:
         | Same. Am I old?
        
           | ThomasBHickey wrote:
           | Me too (and yes, I'm old)
        
         | mikepurvis wrote:
         | 100% clicked thinking I was getting into an article about XML
         | and wondering how interesting that was in 2024. Simultaneously
         | disappointed and pleased.
        
         | antod wrote:
         | Yup. And it's about transforms too.
        
       | optimalsolver wrote:
       | Reminder: If someone's time series forecasting method worked,
       | they wouldn't be publishing it.
        
         | dongobread wrote:
         | They definitely would, and do; the vast majority of time
         | series work is not about asset prices or beating the stock
         | market.
        
         | musleh2 wrote:
         | The Transformer, despite becoming one of the most successful
         | models in AI history, was still published.
        
           | logicchains wrote:
           | It's a sequence model, not a time-series model. All time
           | series are sequences but not all sequences are time series.
        
       | dlojudice wrote:
       | Is this somehow related to the Google weather prediction model
       | using AI [1]?
       | 
       | https://deepmind.google/discover/blog/graphcast-ai-model-for...
        
         | scellus wrote:
         | No, Graphcast is a graph transformer trained on ERA5 weather
         | reconstructions of the atmosphere, not a general time series
         | prediction model. It, by the way, outperforms all traditional
         | global point forecasts (non-ensembles), at least at
         | predicting large-scale global patterns (Z500 and such, at
         | lead times of 3-10 days or so). ECMWF has AIFS, a derivative
         | of Graphcast; they'll probably get it or something similar
         | into production in a couple of years.
        
           | wafngar wrote:
           | AIFS is transformer-based (Graphcast is a pure GNN), so a
           | different architecture, and it is already running
           | operationally; see:
           | 
           | https://www.ecmwf.int/en/about/media-centre/aifs-
           | blog/2024/i...
        
       | brcmthrowaway wrote:
       | Wow, is there a way to apply this to financial trading?
        
         | musleh2 wrote:
         | If you have a financial dataset, I can try it for you.
        
       | localfirst wrote:
       | Time series forecasting works best in deterministic domains.
       | None of the published LLM/AI/deep/machine learning techniques
       | do well in the stock market. Absolutely none. We've tried them
       | all.
        
       | dkga wrote:
       | A part of my work is literally building nowcasting and other
       | types of prediction models in economics (inflation, GDP etc) and
       | finance (market liquidity, etc). I haven't yet had a chance to
       | read the paper, but overall the tone of "transformers are
       | great for what they do, but LSTM-type models are still very
       | valuable" completely resonates with me.
        
         | uoaei wrote:
         | Have you had the chance to apply Mamba to your work at all?
         | Thoughts?
        
       | _0ffh wrote:
       | Too bad the dataset link in the paper isn't working. I hope
       | that'll get amended.
        
       ___________________________________________________________________
       (page generated 2024-07-17 23:09 UTC)