[HN Gopher] Zero-Shot Forecasting: Our Search for a Time-Series Foundation Model
       ___________________________________________________________________
        
       Zero-Shot Forecasting: Our Search for a Time-Series Foundation
       Model
        
       Author : tiwarinitish86
       Score  : 64 points
       Date   : 2025-06-13 05:04 UTC (17 hours ago)
        
 (HTM) web link (www.parseable.com)
 (TXT) w3m dump (www.parseable.com)
        
       | nikhil4usinha wrote:
        | Interesting. What are the use cases you're using the models for?
        | I'd like to hear more about that, e.g. anomaly detection.
        
         | parmesant wrote:
          | That's actually one of the use cases we set out to explore
          | with these models. We'll release a head-to-head comparison
          | soon!
        
           | CubsFan1060 wrote:
            | Of all of these, that's the one I'm most interested in.
            | Super curious to see what you find out.
            | 
            | Did you publish, or do you plan to publish, any of your code
            | or datasets from this?
        
             | Debanitrkl wrote:
              | Author here. We're just getting started with these
              | experiments and plan to apply them to more features on our
              | roadmap. Future posts will be more detailed, based on the
              | feedback we've received here. Once we finish implementing
              | these features, we'll be happy to share the code and
              | dataset.
        
       | dragon195346 wrote:
       | Great read! Really interesting to see how these foundation models
       | like Chronos and Toto are starting to perform well on real-world
       | observability data.
        
       | wenc wrote:
        | I wonder how this would perform on the M4 Makridakis competition
        | (a time-series forecasting competition):
       | 
       | https://github.com/Mcompetitions/M4-methods
       | 
       | https://en.wikipedia.org/wiki/Makridakis_Competitions
       | 
       | Makridakis' conclusion remained true for many years:
       | "statistically sophisticated and complex methods do not
       | necessarily provide more accurate forecasts than simpler ones."
       | 
       | Maybe things have changed?
       | 
        | (Side note: Nixtla showed a simple ensemble outperforming
        | Chronos, and the Chronos team responded; there's some back and
        | forth in the comments:
        | https://www.linkedin.com/pulse/extended-comparison-chronos-a...)
        
         | parmesant wrote:
          | This looks like a great benchmark! We've been thinking of
          | doing a better, more detailed follow-up, and this seems like
          | the perfect dataset to do it with. Thanks!
        
       | mvATM99 wrote:
        | Look, I'm optimistic about time-series foundation models too,
        | but this post is hard to take seriously when the test is so
        | flawed:
        | 
        | - Forward-filling short periods of missing values. Why keep
        | those periods in when you explicitly mention this is not normal?
        | Either remove them entirely or don't impute anything.
        | 
        | - Claiming superiority over classic models while not including
        | any of them in the results table.
        | 
        | - And let's not forget the cardinal sin: using MAPE as an
        | evaluation metric (see the sketch below).
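        | 
        | To make that last point concrete, a minimal Python sketch (mine,
        | not from the article): MAPE divides by the actuals, so zeroes
        | blow it up, and (for non-negative forecasts) its penalty caps at
        | 100% for under-forecasts while growing without bound for over-
        | forecasts, which biases models toward under-forecasting.
        | 
        |     import numpy as np
        | 
        |     def mape(actual, forecast):
        |         a = np.asarray(actual, dtype=float)
        |         f = np.asarray(forecast, dtype=float)
        |         return np.mean(np.abs((a - f) / a)) * 100
        | 
        |     y = np.array([100.0])
        |     print(mape(y, [0.0]))    # worst under-forecast: capped at 100.0
        |     print(mape(y, [300.0]))  # over-forecast: 200.0, unbounded
        |     print(mape([100.0, 0.0], [100.0, 1.0]))  # zero actual -> inf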
        
         | parmesant wrote:
          | Author here. We're trying these models out for the first time
          | for our use cases, so these are great points for us to improve
          | on!
        
           | mvATM99 wrote:
            | Good to see feedback received so positively! Sorry if my
            | message came across as condescending; that was not the
            | intent. I recommend reading this piece on metrics:
            | https://openforecast.org/wp-content/uploads/2024/07/Svetunko...
            | It's easy to grasp, yet it contains great tips.
        
         | stevenae wrote:
          | To clarify, you'd prefer RMSLE?
        
           | mvATM99 wrote:
            | Short answer: I use multiple metrics and never rely on just
            | one.
            | 
            | Long answer: Is the metric for people with subject-matter
            | knowledge? Then (weighted) RMSSE, or the MASE alternative
            | for a median forecast. WRMSSE is very nice: it can deal with
            | zeroes, is scale-invariant, and is symmetrical in penalizing
            | under- and over-forecasting.
            | 
            | The above metrics are completely uninterpretable to people
            | outside the forecasting sphere, though. For those cases I
            | tend to stick with raw errors; if a percentage metric is
            | really necessary, then a weighted MAPE/RMSE. The weighting
            | is still graspable for most people, and it doesn't explode
            | with zeroes.
            | 
            | I've also been exploring FVA (Forecast Value Added),
            | computed against a second decent forecast. FVA is very
            | intuitive, provided your baseline measures are reliable.
            | Aside from that, I always look at forecast plots. It's
            | tedious, but they often tell you a lot that gets lost in the
            | numbers.
            | 
            | RMSLE I haven't used much. From what I've read it looks
            | interesting, though more for very specific scenarios (many
            | outliers, high variance, nonlinear data?).
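            | 
            | For concreteness, a minimal sketch of MASE and weighted MAPE
            | under their usual definitions (my own code, not from the
            | post; m is the season length of the naive benchmark):
            | 
            |     import numpy as np
            | 
            |     def mase(train, actual, forecast, m=1):
            |         # Scale by the in-sample seasonal-naive error.
            |         t = np.asarray(train, dtype=float)
            |         scale = np.mean(np.abs(t[m:] - t[:-m]))
            |         e = np.abs(np.asarray(actual) - np.asarray(forecast))
            |         return np.mean(e) / scale
            | 
            |     def wmape(actual, forecast):
            |         # Total absolute error over total actuals.
            |         a = np.asarray(actual, dtype=float)
            |         f = np.asarray(forecast, dtype=float)
            |         return np.sum(np.abs(a - f)) / np.sum(np.abs(a))
            | 
            | A MASE below 1 means you beat the naive benchmark, and WMAPE
            | stays finite as long as the actuals aren't all zero.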
        
             | stevenae wrote:
             | Thanks for the reply! I am outside the forecasting sphere.
             | 
              | RMSLE gives proportional error (so it's scale-invariant)
              | without MAPE's systematic under-prediction bias. It does
              | require all-positive values for the logarithm step.
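              | 
              | A minimal sketch (mine; I use the log1p form, which also
              | tolerates exact zeros):
              | 
              |     import numpy as np
              | 
              |     def rmsle(actual, forecast):
              |         a = np.asarray(actual, dtype=float)
              |         f = np.asarray(forecast, dtype=float)
              |         d = np.log1p(f) - np.log1p(a)
              |         return np.sqrt(np.mean(d ** 2))
              | 
              | One caveat: the asymmetry runs the other way from MAPE, as
              | RMSLE penalizes under-forecasts more heavily than over-
              | forecasts of the same absolute size.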
        
             | ted_dunning wrote:
              | MAPE can also be a problem when rare excursions are what
              | you want to predict and the cost of missing an event is
              | much higher than the cost of predicting a non-event. A
              | model that just predicts "no change" would have a very low
              | MAPE because most of the time nothing happens. When an
              | event does happen, however, the error of predicting the
              | status quo ante is much worse than small baseline errors.
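              | 
              | A toy illustration of that failure mode (not from the
              | article):
              | 
              |     import numpy as np
              | 
              |     rng = np.random.default_rng(0)
              |     y = 100 + rng.normal(0, 1, 1000)  # quiet baseline
              |     y[500] = 500.0  # one rare spike worth catching
              |     flat = np.full_like(y, 100.0)  # "no change" model
              |     err = np.abs((y - flat) / y) * 100
              |     print(np.mean(err))  # ~0.9%: the missed spike
              |                          # barely moves the metric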
        
       | sheepscreek wrote:
       | > Our dataset consisted of Kubernetes pod metrics collected from
       | a production retail checkout application.
       | 
        | That sums it up, and it's no surprise that Datadog's Toto model
        | performed exceptionally well.
        | 
        | The results would have been much more useful had they opted for
        | a heterogeneous mix of datasets. I'm thinking of census data and
        | statistics, financial forecasting (GDP, interest rates),
        | clinical trial drop-out rates, etc. There are so many
        | interesting problems out there.
        
         | bitshiftfaced wrote:
         | The GIFT Eval benchmark would be a good place to start:
         | https://huggingface.co/spaces/Salesforce/GIFT-Eval
        
       | fumeux_fume wrote:
        | I'm a bit confused by the results table. Were these models
        | tested against the same dataset? A visualization of the test
        | data and the forecasts would also be helpful.
        
       | Fripplebubby wrote:
        | I think the concept of a "foundation model" for time series, as
        | presented in this blog post, is actually a bit flawed. A
        | foundation model is interesting because it is capable of many
        | tasks _beyond the target tasks_ it was trained on, whereas what
        | the author is looking for is a time-series model that can make
        | out-of-distribution predictions without re-training. In my
        | opinion, that problem is already pretty well solved by existing
        | ARIMA and (especially) Prophet models. Yes, you have to re-fit
        | the model on your distribution, but that is not at all akin to
        | training or fine-tuning an LLM; it's something you can do in
        | seconds on a modern CPU (sketched below). And yes, there are
        | certain hyperparameters that may need to be selected, but they
        | are actually fairly minimal.
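        | 
        | To put a number on "seconds on a modern CPU", a minimal
        | statsmodels sketch (mine, on a toy series):
        | 
        |     import numpy as np
        |     from statsmodels.tsa.arima.model import ARIMA
        | 
        |     y = np.cumsum(np.random.randn(1000))  # toy random walk
        |     fit = ARIMA(y, order=(1, 1, 1)).fit() # well under a second
        |     print(fit.forecast(steps=24))         # next 24 points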
       | 
        | But making out-of-distribution predictions does not make a model
        | a foundation model for time series; that is just the basic task
        | all time-series forecasting models perform. A more interesting
        | question is: does an LLM architecture improve univariate or
        | multivariate time-series prediction? I don't think the answer is
        | yes. Depending on your domain, being able to feed language
        | inputs to your model may have a positive impact, and the best
        | way to incorporate language inputs is certainly a transformer
        | architecture, but that isn't what this post addresses.
        
         | th0ma5 wrote:
          | A lot of people hedge this kind of sober insight against their
          | personal economic goals, making all manner of unfalsifiable
          | claims that these models are adequate in some context. It is
          | refreshing to see the issues dealt with separately; I think a
          | lot of people miss how these models fall short of traditional
          | methods in every case I've heard of so far.
        
           | cyanydeez wrote:
            | AI slop
        
       | spmurrayzzz wrote:
        | I'd be curious what the results would be with AutoGluon's
        | automated fit/evals. Given the results here, I suspect a
        | weighted-average model would likely win out.
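        | 
        | For anyone who wants to try it, a rough sketch using AutoGluon's
        | TimeSeriesPredictor (the toy data and column names here are my
        | own assumptions, not from the post):
        | 
        |     import numpy as np
        |     import pandas as pd
        |     from autogluon.timeseries import (
        |         TimeSeriesDataFrame, TimeSeriesPredictor)
        | 
        |     # Toy series in AutoGluon's expected long format.
        |     df = pd.DataFrame({
        |         "item_id": "pod_cpu",
        |         "timestamp": pd.date_range("2025-01-01",
        |                                    periods=500, freq="h"),
        |         "target": np.random.rand(500),
        |     })
        |     train = TimeSeriesDataFrame.from_data_frame(
        |         df, id_column="item_id", timestamp_column="timestamp")
        |     predictor = TimeSeriesPredictor(prediction_length=48)
        |     predictor.fit(train, presets="medium_quality")
        |     # The leaderboard usually shows WeightedEnsemble on top.
        |     print(predictor.leaderboard())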
        
       ___________________________________________________________________
       (page generated 2025-06-13 23:01 UTC)