[HN Gopher] Forecasts need to have error bars
___________________________________________________________________
Forecasts need to have error bars
Author : apwheele
Score : 203 points
Date : 2023-12-04 16:28 UTC (6 hours ago)
(HTM) web link (andrewpwheeler.com)
(TXT) w3m dump (andrewpwheeler.com)
| datadrivenangel wrote:
| The interesting example in this article is nowcasting! The art of
| forecasting the present or past while you're waiting for data to
| come in.
|
| It's sloppy science / statistics to not have error ranges.
| RandomLensman wrote:
| It's not always easy to say what the benefit is: presenting
| in-model uncertainty from a stochastic model might still say
| nothing about the estimation error versus the actual process.
| For a forecast to show actual uncertainty you need to be in the
| quite luxurious position of knowing the data-generating
| process. You could try to fudge it with a lot of historical
| data where available - but still...
| doubled112 wrote:
| I really thought that this was going to be about the weather.
| nullindividual wrote:
| Same, but in a human context: are forecasts of mundane
| atmospheric events so far off today that error bars would have
| any practical value, or would they just introduce confusion?
| NegativeLatency wrote:
| For this reason I really enjoy reading the text products and
| area forecast discussion for interesting weather:
| https://forecast.weather.gov/product.php?site=NWS&issuedby=p...
| doubled112 wrote:
| Anybody happen to know if there's anything more detailed
| from Environment Canada than their forecast pages?
|
| https://weather.gc.ca/city/pages/on-143_metric_e.html
|
| I really like that discussion type forecast.
| yakubin wrote:
| Absolutely. 15 years ago I could reasonably trust forecasts
| regarding whether it's going to rain in a given location 2
| days in advance. Today I can't trust forecasts about whether
| it's raining _currently_.
| Smoosh wrote:
| It seems unlikely that the modelling and forecasting has
| become worse, so I guess there is some sort of change
| happening to the climate making it more unstable and less
| predictable?
| lispisok wrote:
| >I guess there is some sort of change happening to the
| climate making it more unstable and less predictable?
|
| I've been seeing this question come up a lot lately. The
| answer is no, weather forecasting continues to improve.
| The rate is about 1 day improvement every 10 years so a 5
| day forecast today is as good as a 4 day forecast 10
| years ago.
| LexGray wrote:
| I think that is a change in definition. 15 years ago it was
| only rain if you were sure to get drenched. Now rain means
| 1mm of water hit the ground in your general vicinity. I
| blame an abundance of data combined with people who refuse to
| get damp and need an umbrella if there is any chance at
| all.
| kqr wrote:
| Sure -- just a few days out the forecast is not much better
| than the climatological average -- see e.g.
| https://charts.ecmwf.int/products/opencharts_meteogram?base_...
|
| Up until that point, error bars increase. At least to me,
| there's a big difference between "1 mm rain guaranteed" and
| "90 % chance of no rain but 10 % chance of 10 mm rain" but
| both have the same average.
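|
| A quick sketch of that arithmetic, with made-up samples standing
| in for the two forecasts:
|
|     certain = [1.0]              # "1 mm rain guaranteed"
|     lumpy = [0.0] * 9 + [10.0]   # 90% no rain, 10% a 10 mm downpour
|     assert sum(certain) / len(certain) == sum(lumpy) / len(lumpy) == 1.0
|
| Same mean, very different risk.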
| dguest wrote:
| Me too, and I was looking forward to the thread that talks
| about error bars in weather models, which is totally a thing!
|
| It turns out the ECMWF _does_ do an ensemble model where they
| run 51 concurrent models, presumably with slightly different
| initial conditions, or they vary the model parameters within
| some envelope. From these 51 models you can get a decent
| confidence interval.
|
| But this is a lower resolution model, run less frequently. I
| assume they don't do this with their "HRES" model (which has
| twice the spatial resolution) in an ensemble because, well,
| it's really expensive.
|
| [1]:
| https://en.wikipedia.org/wiki/Integrated_Forecast_System#Var...
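|
| A rough sketch of what you do with those members (made-up
| numbers, not real ECMWF output): stack the 51 runs and take
| percentiles per lead time.
|
|     import numpy as np
|
|     rng = np.random.default_rng(1)
|     # 51 hypothetical members x 10 lead times
|     members = rng.gamma(shape=2.0, scale=1.5, size=(51, 10))
|
|     p10, p50, p90 = np.percentile(members, [10, 50, 90], axis=0)
|     # plot p50 as the central forecast, shade p10..p90 as the band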
| lispisok wrote:
| A lot of weather agencies across the world run ensembles,
| including the US, Canada, and the UK. Ensembles are the future
| of weather forecasting, but weather models are so
| computationally heavy that there is a resolution/forecast-length
| tradeoff, which gets even bigger when trying to run 20-50
| ensemble members. You can have a high-resolution model that runs
| out to 2 days or so, or a longer-range model at much coarser
| resolution.
|
| ECMWF recently upgraded their ensemble to run at the same
| resolution as the HRES. The HRES is basically the ensemble
| control member at this point [1]
|
| [1] https://www.ecmwf.int/en/about/media-centre/news/2023/model-...
| iruoy wrote:
| I've been using meteoblue for a while now and they tell you how
| sure they are of their predictions. Right now I can see that
| they rate their predictability as medium for tomorrow, but high
| for the day after.
|
| https://content.meteoblue.com/en/research-education/specific...
| kqr wrote:
| I'll give you one better. The ECMWF publishes their
| probabilistic ensemble forecasts with boxplots for numeric
| probabilities:
| https://charts.ecmwf.int/products/opencharts_meteogram?base_...
|
| They also have one for precipitation type distribution:
| https://charts.ecmwf.int/products/opencharts_ptype_meteogram...
| amichal wrote:
| I have, in my life as a web developer, had multiple "academics"
| urgently demand that I remove error bands, bars, notes about
| outliers, confidence intervals, etc. from graphics at the last
| minute so people are not "confused".
|
| It's depressing.
| Maxion wrote:
| The depressing part is that many people actually need them
| removed in order to not be confused.
| nonethewiser wrote:
| But aren't they still confused without the error bars? Or
| confidently incorrect? And who could blame them, when that's
| the information they're given?
|
| It seems like the options are:
|
| - no error bars which mislead everyone
|
| - error bars which confuse some people and accurately inform
| others
| alistairSH wrote:
| Yep.
|
| See also: Complaints about poll results in the last few
| rounds of elections in the US. "The polls said Hillary
| would win!!!" (no, they didn't).
|
| It's not just error margins, it's an absence of statistics
| of any sort in secondary school (for a large number of
| students).
| marcosdumay wrote:
| Yeah, when people remove that kind of information to not
| confuse people, they are aiming at making them confidently
| incorrect.
| ta8645 wrote:
| That is baldly justifying a feeling of superiority and
| authority over others. It's not your job to trick other
| people "for their own good". Present honest information, as
| accurately as possible, and let the chips fall where they
| may. Anything else is a road to disaster.
| echelon wrote:
| Some people won't understand error bars. Given that we evolved
| from apes and that there's a distribution of intelligences,
| skill sets, and interests across all walks of society, I don't
| place blame on anyone. We're just messy as a species. It'll be
| okay. Everything is mostly working out.
| ethbr1 wrote:
| > _We're just messy as a species. It'll be okay. Everything
| is mostly working out._
|
| {Confidence interval we won't cook the planet}
| esafak wrote:
| Statistically illiterate people should not be making decisions.
| I'd take that as a signal to leave.
| sonicanatidae wrote:
| Statistically speaking, you're in the minority. ;)
| knicholes wrote:
| Maybe not in the minority for taking it as a signal to
| leave, but in the minority for actually acting on that
| signal.
| sonicanatidae wrote:
| That's fair. :)
| RandomLensman wrote:
| It really depends what it is for. If the assessment is that the
| data is solid enough for certain decisions you might indeed
| only show a narrow result in order not to waste time and
| attention. If it is for a scientific discussion then it is
| different, of course.
| strangattractor wrote:
| Sometimes they do this because the data doesn't entirely
| support their conclusions. Error bars, notes about data
| outliers, etc. often make this glaringly apparent.
| cycomanic wrote:
| Can you be more specific (maybe point to a website)? I am
| trying to imagine a scenario where a web developer works with
| academics and does the data processing for the presentation. In
| the few scenarios I can think of where an academic works
| directly with a web developer, they would almost always provide
| the full figures.
| aftoprokrustes wrote:
| I obviously cannot assess the validity of the requests you got,
| but as a former researcher turned product developer, I several
| times had to take the decision _not_ to display confidence
| intervals in products, and to keep them as an internal feature
| for quality evaluation.
|
| Why, I hear you ask? Because, for the kind of system of models
| I use (detailed stochastic simulations of human behavior),
| there is no good definition of a confidence interval that can
| be computed in a reasonable amount of computing time. One can
| design confidence measures that can be computed without too
| much overhead, but they can be misleading if you do not have a
| very good understanding of what they represent and do not
| represent.
|
| To simplify, the error bars I was able to compute were mostly a
| measure of precision, but I had no way to assess accuracy,
| which is what most people assume error bars mean. So showing
| the error bars would have actually given a false sense of
| quality, which I did not feel confident to give. So not
| displaying those measures was actually done as a service to the
| user.
|
| Now, one might make the argument that if we had no way to
| assess accuracy, the type of models we used was just rubbish
| and not much more useful than a wild guess... Which is a much
| wider topic, and there are good arguments for and against this
| statement.
| mrguyorama wrote:
| If you are forecasting both "Crime" and "Economy", it's VERY
| likely you have domain expertise for neither.
| bo1024 wrote:
| Two things I think are interesting here, one discussed by the
| author and one not. (1) As mentioned at the bottom, forecasting
| usually should lead to decisionmaking, and when it gets
| disconnected, it can be unclear what the value is. It sounds like
| Rosenfield is trying to use forecasting to give added weight to
| his statistical conclusions about past data, which I agree sounds
| suspect.
|
| (2) it's not clear what the "error bars" should mean. One is a
| confidence interval[1] (e.g. model gives 95% chance the output
| will be within these bounds). Another is a standard deviation
| (i.e. you are pretty much predicting the squared difference
| between your own point forecast and the outcome).
|
| [1] acknowledged: not the correct term
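|
| To make the two meanings concrete, here's a minimal sketch with
| made-up, roughly normal data -- an interval for the *mean*
| shrinks with more data, an interval for the *next observation*
| does not:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x = rng.normal(loc=100, scale=15, size=40)  # 40 past observations
|     m, s, n = x.mean(), x.std(ddof=1), len(x)
|
|     # ~95% interval for the mean (narrow, shrinks as n grows)
|     ci = (m - 1.96 * s / np.sqrt(n), m + 1.96 * s / np.sqrt(n))
|     # ~95% interval for the next observation (stays wide)
|     pi = (m - 1.96 * s * np.sqrt(1 + 1 / n),
|           m + 1.96 * s * np.sqrt(1 + 1 / n))
|     print(ci, pi)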
| m-murphy wrote:
| That's not what a confidence interval is. A confidence interval
| is a random variable that covers the true value 95% of the time
| (assuming the model is correctly specified).
| bo1024 wrote:
| Ok, the 'reverse' of a confidence interval then -- I haven't
| seen a term for the object I described other than misuse of
| CI in the way I did. ("Double quantile"?)
| m-murphy wrote:
| You're probably thinking of a predictive interval
| borroka wrote:
| It is a very common misconception and one of my technical
| crusades. I keep fighting, but I think I have lost. Not
| knowing what the "uncertainty interval" represents (is
| it, loosely speaking, an expectation about a mean/true
| value or about the distribution of unobserved values?)
| could be even more dangerous, in theory, than using no
| uncertainty interval at all.
|
| I say in theory because, in my experience in the tech
| industry, with the usual exceptions, uncertainty
| intervals, for example on a graph, are interpreted by
| those making decisions as aesthetic components of the
| graph ("the gray bands look good here") and not as
| anything even marginally related to a prediction.
| m-murphy wrote:
| Agreed! I also think it's extremely important as
| practitioners to know what we're even trying to estimate.
| Expected value (i.e. least squares regression) is the usual
| first thing to go for, but does that even matter? We're
| probably actually interested in something like an upper
| quantile for planning purposes. And then there's the model
| component of it: the interval being simultaneously estimated is
| model-driven, and if the model is wrong, the interval is
| meaningless. There's a lot of space for super
| interesting and impactful work in this area IMO, once you
| (the practitioner) think more critically about the
| objective. And then don't even get me started on
| interventions and causal inference...
| bo1024 wrote:
| > is it, loosely speaking, an expectation about a
| mean/true value or about the distribution of unobserved
| values
|
| If you don't mind typing it out, what do you mean
| formally here?
| bo1024 wrote:
| Yes, that term captures what I'm talking about.
| cubefox wrote:
| "Credible interval":
|
| https://en.wikipedia.org/wiki/Credible_interval
| bo1024 wrote:
| No, predictive interval is more precise, since we are
| dealing with predicting an observation rather than
| forming a belief about a parameter.
| ramblenode wrote:
| > Another is a standard deviation (i.e. you are pretty much
| predicting the squared difference between your own point
| forecast and the outcome).
|
| What you probably want is the standard error, because you are
| not interested in how much your data differ from each other but
| in how much your data differ from the true population.
| bo1024 wrote:
| I don't see how standard error applies here. You are only
| going to get one data point, e.g. "violent crime rate in
| 2023". What I mean is a prediction, not only of what you
| think the number is, but also of how wrong you think your
| prediction will be.
| nonameiguess wrote:
| Standard error is exactly what the statsmodels
| ARIMA.PredictionResults object actually gives you and the
| confidence interval in this chart is constructed from a
| formula that uses the standard error.
|
| ARIMA is based on a few assumptions. One, there exists some
| "true" mean value for the parameter you're trying to
| estimate, in this case violent crime rate. Two, the value
| you measure in any given period will be this true mean plus
| some random error term. Three, the value you measure in
| successive periods will regress back toward the mean. The
| "true mean" and error terms are both random variables, not
| a single value but a distribution of values, and when you
| add them up to get the predicted measurement for future
| periods, that is also a random variable with a distribution
| of values, and it has a standard error and confidence
| intervals and these are exactly what the article is saying
| should be included in any graphical report of the model
| output.
|
| This is a characteristic _of the model_. What you're
| asking for, "how wrong do you think the model is," is a
| reasonable thing to ask for, but different and much harder
| to quantify.
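|
| For concreteness, a minimal statsmodels sketch (Python, made-up
| series -- not the article's actual data or model order):
|
|     import numpy as np
|     from statsmodels.tsa.arima.model import ARIMA
|
|     rng = np.random.default_rng(0)
|     y = 400 + np.cumsum(rng.normal(scale=10, size=30))  # fake annual rates
|
|     res = ARIMA(y, order=(1, 1, 0)).fit()
|     fc = res.get_forecast(steps=5)
|     print(fc.predicted_mean)        # point forecast
|     print(fc.se_mean)               # standard error of the forecast
|     print(fc.conf_int(alpha=0.05))  # the bands a plot would shade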
| bo1024 wrote:
| Thanks for explaining how it works - I don't use R (I
| assume this is R). This does not seem like a good way to
| produce "error bars" around a forecast like the one in
| this case study. It seems more like a note about how much
| volatility there has been in the past.
| hgomersall wrote:
| Error bars in forecasts can only mean uncertainty your _model_
| has. Without error bars over models, you can say nothing about
| how good your model is. Even with them, your hypermodel may be
| inadequate.
| bo1024 wrote:
| To me, this comes back to the question of skin in the game.
| If you have skin in the game, then you produce the best
| uncertainty estimates you can (by any means). If you don't,
| you just sit back and say "well these are the error bars my
| model came up with".
| hgomersall wrote:
| It's worse than that. Oftentimes the skin in the game
| provides a motivation to mislead. C.f. most of the
| economics profession.
| nequo wrote:
| This is a pretty sweeping generalization, but if you have
| concrete examples to offer that support your claim, I'd
| be curious.
| PeterisP wrote:
| There are ways of scoring forecasts that reward accurate-
| and-certain forecasts in a manner where it's provably
| optimal to provide the most accurate estimate of your
| (un)certainty that you can.
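|
| A tiny numeric illustration with the log score (one of those
| "proper" scoring rules): if your true belief is p, your
| expected score is maximized by reporting exactly p.
|
|     import numpy as np
|
|     p = 0.7                                  # your true belief
|     for q in (0.5, 0.6, 0.7, 0.8, 0.9):      # what you report
|         expected = p * np.log(q) + (1 - p) * np.log(1 - q)
|         print(q, round(expected, 4))         # peaks at q == 0.7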
| bo1024 wrote:
| Yes, of course. I don't see that as very related to my
| point. For example, consider how 538 or The Economist
| predict elections. They might claim they'll use squared
| error or log score, but when it comes down to a big
| mistake, they'll blame it on factors outside their
| models.
| pacbard wrote:
| As far as error bars are concerned, you could report some%
| credible intervals calculated by taking the corresponding
| percentiles of your results. It's somewhat Bayesian thinking,
| but it will work better than confidence intervals.
|
| The intuition would be that some% of your forecasts fall between
| the bounds of the credible interval.
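|
| Roughly, if the forecast comes out of simulation or posterior
| draws, the interval is just percentiles of those draws (made-up
| numbers here):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     draws = rng.lognormal(mean=0.0, sigma=0.3, size=5000)
|
|     lo, mid, hi = np.percentile(draws, [5, 50, 95])
|     print(f"90% interval: [{lo:.2f}, {hi:.2f}], median {mid:.2f}")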
| mnky9800n wrote:
| Recently someone on hacker news described statistics as trying
| to measure how surprised you should be when you are wrong. Big
| fat error bars would give you the idea that you should expect
| to be wrong. Skinny ones would highlight that it might be
| somewhat upsetting to find out you are wrong. I don't think
| this is an exhaustive description of statistics but I do find
| it useful when thinking about forecasts.
| esafak wrote:
| Uncertainty quantification is a neglected aspect of data science
| and especially machine learning. Practitioners do not always have
| the statistical background, and the ML crowd generally has a
| "predict first and asks questions later" mindset that precludes
| such niceties.
|
| I always demand error bars.
| figassis wrote:
| So is it really science? These are concepts from stats 101, and
| the reasons they are needed, and the risks of not having them,
| are very clear. But you have millions being put into models
| without these prerequisites, being sold to people as solutions,
| and waved away with "if people buy it, it's because it has
| value". People also pay fraudsters.
| nradov wrote:
| Mostly not. Very few data "scientists" working in industry
| actually follow the scientific method. Instead they just mess
| around with various statistical techniques (including AI/ML)
| until they get a result that management likes.
| marcinzm wrote:
| Most decent companies, especially in tech, do A/B testing for
| everything, including having people whose only job is to make
| sure those test results are statistically valid.
| borroka wrote:
| But even in academia, where "true science" is supposedly, if
| not done, at least pursued, uncertainty intervals are rarely
| understood and used relative to how often they would be
| needed.
|
| When I used to publish stats- and math-heavy papers in the
| biological sciences--in intermediate and up journals--the
| reviewers very rarely paid any attention to the quality of the
| predictions, beyond a casual look at the R2 or R2 equivalents
| and mean absolute errors.
| macrolocal wrote:
| Also, error bars qua statistics can indicate problems with the
| underlying data and model, e.g. if they're unrealistically
| narrow, symmetric, etc.
| gh02t wrote:
| You can demand error bars but they aren't always possible or
| meaningful. You can more or less "fudge" some sort of normally
| distributed IID error estimate onto any method, but that
| doesn't necessarily mean anything. Generating error bars (or
| generally error distributions) that actually describe the
| common sense idea of uncertainty can be quite theoretically and
| computationally demanding for a general nonlinear model even in
| the ideal cases. There are some good practical methods backed
| by theory like Monte Carlo Dropout, but the error bars
| generated for that aren't necessarily always the error you want
| either (MC DO estimates the uncertainty due to model weights
| but not, say, due to poor training data). I'm a huge advocate
| for methods that natively incorporate uncertainty, but there
| are lots of model types that empirically produce very useful
| results but where it's not obvious how to produce/interpret
| useful estimates of uncertainty in any sort of efficient
| manner.
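|
| For reference, the MC Dropout recipe is roughly this (toy,
| untrained PyTorch model, just to show the mechanics):
|
|     import torch
|     import torch.nn as nn
|
|     model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
|                           nn.Dropout(p=0.2), nn.Linear(64, 1))
|     model.train()   # keep dropout active at prediction time
|
|     x = torch.randn(1, 8)
|     with torch.no_grad():
|         samples = torch.stack([model(x) for _ in range(100)])
|     mean, std = samples.mean(dim=0), samples.std(dim=0)
|     # std reflects weight uncertainty, not e.g. bad training data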
|
| Another, separate, issue that is often neglected is the idea of
| calibrated model outputs, but that's its own rabbit hole.
| kqr wrote:
| I'm going to sound incredibly subjectivist now, but... the
| human running the model can just add error bars manually.
| They will probably be wide, but that's better than none at
| all.
|
| Sure, you'll ideally want a calibrated
| estimator/superforecaster to do it, but they exist and they
| aren't _that_ rare. Any decently sized organisation is bound
| to have at least one. They just need to care about finding
| them.
| rented_mule wrote:
| Yes, please! I was part of an org that ran thousands of online
| experiments over the course of several years. Having some sort of
| error bars when comparing the benefit of a new treatment gave a
| much better understanding.
|
| Some thought it clouded the issue. For example, when a new
| treatment caused a 1% "improvement", but the confidence interval
| extended from -10% to 10%, it was clear that the experiment
| didn't tell us how that metric was affected. This makes the
| decision feel more arbitrary. But that is exactly the point - the
| decision _is_ arbitrary in that case, and the confidence interval
| tells us that, allowing us to focus on other trade-offs involved.
| If the confidence interval is 0.9% to 1.1%, we know that we can
| be much more confident in the effect.
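|
| A crude sketch of the kind of calculation involved (normal
| approximation, and it ignores uncertainty in the control rate
| itself; numbers are made up):
|
|     import numpy as np
|
|     def lift_ci(conv_a, n_a, conv_b, n_b, z=1.96):
|         p_a, p_b = conv_a / n_a, conv_b / n_b
|         se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
|         diff = p_b - p_a
|         return (diff - z * se) / p_a, (diff + z * se) / p_a
|
|     # ~1% "improvement", but the interval spans roughly -11% to +13%
|     print(lift_ci(500, 10_000, 505, 10_000))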
|
| A big problem with this is that meaningful error bars can be
| extremely difficult to come by in some cases. For example,
| imagine having something like that for every prediction made by
| an ML model. I would _love_ to have that, but I 'm not aware of
| any reasonable way to achieve it for most types of models. The
| same goes for online experiments where a complicated experiment
| design is required because there isn't a way to do random
| allocation that results in sufficiently independent cohorts.
|
| On a similar note, regularly look at histograms (i.e.,
| statistical distributions) for all important metrics. In one
| case, we were having speed issues in calls to a large web
| service. Many calls were completing in < 50 ms, but too many were
| tripping our 500 ms timeout. At the same time, we had noticed the
| emergence of two clear peaks in the speed histogram (i.e., it was
| a multimodal distribution). That caused us to dig a bit deeper
| and see that the two peaks represented logged-out and logged-in
| users. That knowledge allowed us to ignore wide swaths of code
| and spot the speed issues in some recently pushed personalization
| code that we might not have suspected otherwise.
| kqr wrote:
| > This makes the decision feel more arbitrary.
|
| This is something I've started noticing more and more with
| experience: people really hate arbitrary decisions.
|
| People go to surprising lengths to add legitimacy to arbitrary
| decisions. Sometimes it takes the shape of statistical models
| that produce noise that is then paraded as signal. Often it
| comes from pseudo-experts who don't really have the methods and
| feedback loops to know what they are doing but they have a
| socially cultivated air of expertise so they can lend decisions
| legitimacy. (They used to be called witch-doctors, priests or
| astrologers, now they are management consultants and
| macroeconomists.)
|
| Me? I prefer to be explicit about what's going on and literally
| toss a coin. That is not the strategy to get big piles of shiny
| rocks though.
| kqr wrote:
| > That caused us to dig a bit deeper and see that the two peaks
| represented logged-out and logged-in users.
|
| This is extremely common and one of the core ideas of
| statistical process control[1].
|
| Sometimes you have just the one process generating values that
| are sort of similarly distributed. That's a nice situation
| because it lets you use all sorts of statistical tools for
| planning, inferences, etc.
|
| Then frequently what you have is really two or more interleaved
| processes masquerading as one. These distributions generate
| values that within each are sort of similarly distributed, but
| any analysis you do on the aggregate is going to be confused.
| Knowing the major components of the pretend-single process
| you're looking at puts you ahead of your competition -- always.
|
| [1]: https://two-wrongs.com/statistical-process-control-a-practit...
| clircle wrote:
| Every estimate/prediction/forecast/interpolation/extrapolation
| should have a confidence, prediction, or tolerance interval
| (application dependent) that incorporates the assumptions that
| the team is putting into the problem.
| mightybyte wrote:
| Completely agree with this idea. And I would add a
| corollary: date estimates (i.e. deadlines) should also have
| error bars. After all, a date is a forecast. If a stakeholder
| asks for a date, they should also specify what kind of error bars
| they're looking for. A raw date with no estimate of uncertainty
| is meaningless. And correspondingly, if an engineer is giving a
| date to some other stakeholder, they should include some kind of
| uncertainty estimate with it. There's a huge difference between
| saying that something will be done by X date with 90% confidence
| versus three nines confidence.
| niebeendend wrote:
| A deadline implies the upper limit of the error bar cannot
| exceed it. That means you need to buffer appropriately to hit
| the deadline.
| kqr wrote:
| So much this. I've written about it before, but one of the big
| bonuses you get from doing it this way is that it enables you
| to learn from your mistakes.
|
| A date estimate with no error bars cannot be proven wrong.
| But! If you say "there's a 50 % chance it's done before this
| date" then you can look back at your 20 most recent such
| estimations and around 10 of them better have been on time.
| Otherwise your estimations are not calibrated. But at least
| then you know, right? Which you wouldn't without the error
| bars.
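|
| The bookkeeping for that check is trivial -- something like
| (hypothetical records):
|
|     from collections import defaultdict
|
|     # (claimed probability of hitting the date, whether it was hit)
|     history = [(0.5, True), (0.5, False), (0.5, True), (0.5, False),
|                (0.9, True), (0.9, True), (0.9, True), (0.9, False)]
|
|     buckets = defaultdict(list)
|     for claimed, hit in history:
|         buckets[claimed].append(hit)
|     for claimed, hits in sorted(buckets.items()):
|         print(f"claimed {claimed:.0%}: "
|               f"hit {sum(hits) / len(hits):.0%} of {len(hits)}")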
| Animats wrote:
| Looking at the graph, changes in this decade are noise. But what
| happened back in 1990?
| netsharc wrote:
| Probably no simple answer, but here's a long paper I just
| found:
| https://pubs.aeaweb.org/doi/pdf/10.1257/089533004773563485
|
| Another famous hypothesis is the phasing out of lead fuel:
| https://en.wikipedia.org/wiki/Lead%E2%80%93crime_hypothesis
| predict_addict wrote:
| Let me suggest a solution:
| https://github.com/valeman/awesome-conformal-prediction
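|
| The core of split conformal prediction fits in a few lines -- a
| rough sketch with made-up data and sklearn (the real recipe
| adjusts the quantile level slightly for finite samples):
|
|     import numpy as np
|     from sklearn.linear_model import LinearRegression
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(300, 3))
|     y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=300)
|     X_fit, X_cal, y_fit, y_cal = X[:200], X[200:], y[:200], y[200:]
|
|     model = LinearRegression().fit(X_fit, y_fit)
|     scores = np.abs(y_cal - model.predict(X_cal))  # nonconformity
|     q = np.quantile(scores, 0.9)                   # ~90% coverage
|
|     pred = model.predict(rng.normal(size=(1, 3)))[0]
|     print(pred - q, pred + q)                      # prediction interval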
| xchip wrote:
| And claims that say "x improves y" should also include the std
| and avg in the title.
| CalChris wrote:
| I'm reminded of Walter Lewin's analogous point about
| measurements from his 8.01 lectures:
|
|     any measurement that you make without any knowledge
|     of the uncertainty is meaningless
|
| https://youtu.be/6htJHmPq0Os
|
| You could say that forecasts are measurements you make about the
| future.
| lnwlebjel wrote:
| To that point, similarly:
|
| "Being able to quantify uncertainty, and incorporate it into
| models, is what makes science quantitative, rather than
| qualitative. " - Lawrence M. Krauss
|
| From https://www.edge.org/response-detail/10459
| _hyttioaoa_ wrote:
| Forecasts can also be useful without error bars. Sometimes all
| one needs is a point prediction to inform actions. But sometimes
| full knowledge of the predictive distribution is helpful or
| needed to make good decisions.
|
| "Point forecasts will always be wrong" - true that for continuous
| data but if you can predict that some stock will go to 2.01x it's
| value instead of 2x that's still helpful.
| lagrange77 wrote:
| This is a great advantage of Gaussian Process Regression,
| a.k.a. Kriging.
|
| https://en.wikipedia.org/wiki/Gaussian_process#Gaussian_proc...
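|
| A small sketch with scikit-learn's GP regressor (made-up 1-D
| data): the predictive standard deviation comes back alongside
| the mean, so the error bars are essentially free.
|
|     import numpy as np
|     from sklearn.gaussian_process import GaussianProcessRegressor
|     from sklearn.gaussian_process.kernels import RBF, WhiteKernel
|
|     rng = np.random.default_rng(0)
|     X = rng.uniform(0, 10, size=(30, 1))
|     y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)
|
|     gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)
|     X_new = np.linspace(0, 10, 5).reshape(-1, 1)
|     mean, std = gpr.predict(X_new, return_std=True)
|     print(mean - 1.96 * std, mean + 1.96 * std)  # ~95% bands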
___________________________________________________________________
(page generated 2023-12-04 23:00 UTC)