[HN Gopher] Downsampling Time Series for Visual Representation [pdf]
___________________________________________________________________
Downsampling Time Series for Visual Representation [pdf]
Author : todsacerdoti
Score : 53 points
Date : 2021-03-09 18:32 UTC (4 hours ago)
(HTM) web link (skemman.is)
(TXT) w3m dump (skemman.is)
| elsherbini wrote:
| This looks cool, I especially like the "intuitive" longest line
| algorithm.
|
| ~6 years ago I was trying to build a dashboard that displayed
| real-time measurements from a beehive. The sensors would take
| temperature, weight, humidity, etc. Back then I used simplify.js
| [0] which uses two simplification algorithms in two passes. The
| more compute-intensive one is the Ramer-Douglas-Peucker algorithm
| [1]. One issue I had with streaming data is that new data points
| could change the past line, at least with my naive
| implementation. I'd love to see a real-time / streaming time
| series simplification algorithm where the past points don't
| appear to change wildly despite continuing to downsample.
|
| [0] https://github.com/mourner/simplify-js
|
| [1]
| https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93...
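|
| For reference, the RDP pass is short. A rough Python sketch of
| the classic recursive form (from the Wikipedia description, not
| simplify.js's code; points is an (n, 2) numpy array):
|
|   import numpy as np
|
|   def rdp(points, epsilon):
|       # Keep the endpoints; if the farthest interior point lies
|       # within epsilon of the chord, drop everything between,
|       # otherwise recurse on both halves.
|       start, end = points[0], points[-1]
|       dx, dy = end - start
|       norm = np.hypot(dx, dy) or 1.0
|       # Perpendicular distance of every point to the chord.
|       dists = np.abs(dx * (points[:, 1] - start[1])
|                      - dy * (points[:, 0] - start[0])) / norm
|       i = int(np.argmax(dists))
|       if dists[i] <= epsilon:
|           return np.array([start, end])
|       return np.vstack([rdp(points[:i + 1], epsilon)[:-1],
|                         rdp(points[i:], epsilon)])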
| tda wrote:
| Some (often desirable) properties of RDP for time series are
| that it preserves details at extreme events and does not shift
| peaks. Very useful for e.g. seismic data. I have compressed
| tons of data with 1% subsampling, with hardly any information
| loss.
| the8472 wrote:
| Why reduce it to a single line when you can paint hulls around
| your curve? For coarser intervals you can also use candlestick
| charts.
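|
| A sketch of the hull idea in Python: per pixel column, keep the
| min and max and fill between them, so no extreme is hidden
| (assuming for simplicity a numpy array whose length divides
| evenly into pixel columns):
|
|   import numpy as np
|
|   def minmax_envelope(y, width):
|       # One row per pixel column; each row holds that column's
|       # consecutive samples.
|       cols = y[:len(y) // width * width].reshape(width, -1)
|       return cols.min(axis=1), cols.max(axis=1)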
| mattiasfestin wrote:
| This is a form of compression designed to preserve the chart's
| visual appearance, if I understand it right. There is no
| analysis of how the downsampling biases the data.
|
| My instinct is that the algorithm is too specialized for this
| one task (which is sometimes good, if used for the right task).
|
| As a thought, could you not apply an FFT to move to the
| frequency domain, remove the high-frequency components, and
| then transform back to the time domain?
|
| Fourier transforms are used all over the place, and FFT libs are
| usually well optimized and can even be hardware accelerated.
|
| Without checking it, the cutoff frequency would be determined
| by the Nyquist-Shannon sampling theorem, and should depend on
| the width of the graph in pixels.
|
| So if the graph is resized, recompute the downsampled time
| series from the Nyquist-Shannon sampling limit of the new
| width.
|
| A sliding DFT can also be used for real-time streaming data.
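|
| Sketched in Python with numpy (the cutoff just follows the
| Nyquist reasoning above; the function name is mine):
|
|   import numpy as np
|
|   def fft_downsample(values, out_len):
|       spectrum = np.fft.rfft(values)
|       # out_len pixels can represent at most out_len // 2 cycles,
|       # so zero everything above that frequency.
|       spectrum[out_len // 2:] = 0
|       smoothed = np.fft.irfft(spectrum, n=len(values))
|       # Sample the smoothed curve at out_len evenly spaced points.
|       idx = np.linspace(0, len(values) - 1, out_len).astype(int)
|       return smoothed[idx]
|
| Note the returned values are synthesized rather than original
| samples, which is exactly the objection skykooler raises below.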
| skykooler wrote:
| Specifically, this is trying to downsample a chart while only
| using points from the original time series. Something like an
| FFT would generate a new series of points which would most
| likely not be in the original dataset, though they might
| visually represent them well.
| FredFS456 wrote:
| Why not just use a low-pass linear-phase FIR filter instead of
| FFT/iFFT? Same result, but achieved in the time domain and
| somewhat more performant.
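|
| Roughly, with scipy (a sketch, not a tuned filter design):
|
|   import numpy as np
|   from scipy.signal import firwin
|
|   def fir_downsample(values, out_len):
|       decim = max(len(values) // out_len, 1)
|       # Symmetric (hence linear-phase) windowed-sinc low-pass,
|       # cut off just under the post-decimation Nyquist rate.
|       taps = firwin(numtaps=2 * decim + 1, cutoff=0.9 / decim)
|       # mode="same" centres the symmetric kernel, cancelling
|       # the filter's group delay.
|       smoothed = np.convolve(values, taps, mode="same")
|       return smoothed[::decim]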
| MasterScrat wrote:
| Interesting to see this pop up here! I was looking for a solution
| to plot terabytes of time series during my MSc Thesis at CERN and
| ended up adding support for this to a Grafana fork. Fun times.
|
| Details here:
| https://masterscrat.github.io/documents/Florian_Laurent_CERN...
| jarenmf wrote:
| A demo for "Largest-Triangle-Three-Buckets" algorithm
| https://www.base.is/flot/
|
| And the plugin used in the demo:
| https://github.com/sveinn-steinarsson/flot-downsample
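|
| The bucket logic is compact. A minimal Python sketch of the
| algorithm (not the plugin's code; x and y are numpy arrays,
| threshold is the number of output points):
|
|   import numpy as np
|
|   def lttb(x, y, threshold):
|       n = len(x)
|       if threshold >= n or threshold < 3:
|           return x, y
|       # Bucket boundaries over the interior points.
|       edges = np.linspace(1, n - 1, threshold - 1).astype(int)
|       out_x, out_y = [x[0]], [y[0]]
|       for i in range(threshold - 2):
|           lo, hi = edges[i], edges[i + 1]
|           # Mean of the next bucket (or the final point).
|           if i < threshold - 3:
|               mx = x[edges[i + 1]:edges[i + 2]].mean()
|               my = y[edges[i + 1]:edges[i + 2]].mean()
|           else:
|               mx, my = x[-1], y[-1]
|           ax, ay = out_x[-1], out_y[-1]
|           # Twice the triangle area for each candidate point.
|           areas = np.abs((ax - mx) * (y[lo:hi] - ay)
|                          - (ax - x[lo:hi]) * (my - ay))
|           best = lo + int(np.argmax(areas))
|           out_x.append(x[best])
|           out_y.append(y[best])
|       out_x.append(x[-1])
|       out_y.append(y[-1])
|       return np.array(out_x), np.array(out_y)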
| dcl wrote:
| Neat. I've noticed that a few stock market apps/sites definitely
| use different methods to do this, and with dramatically different
| results.
| dabreegster wrote:
| Does anybody know an approach for downsampling time series when
| you need to compare two downsampled series? Let's say:
|
| 1) You measure millions of (time, count) events, covering 24hrs.
|
| 2) You downsample this somehow to save space.
|
| 3) You start measuring (time, count) events again, covering 7
| hours. You expect the data to match up with the previous
| measurements, and you'd like to plot differences.
|
| Is there a way to trim the downsampled 24hr set so that it
| matches the downsampled 7hr set?
|
| Motivation: https://github.com/a-b-street/abstreet/issues/85
| 1996 wrote:
| Do you have as many events in both samples?
|
| If not, even if you measure the same thing, the variance may
| differ. You can start applying the law of large numbers since
| you mentioned "millions" and just consider a normal
| distribution, but for rare events you need something different
| (Poisson).
|
| Are the series static? If not (ex: temperature) you will have
| to model on time. As the link mentions people crossing an
| intersection, I assume more people will cross at 9am than at
| 2am (7 hours earlier). Likewise, if one series is a weekday
| and the other a weekend, you may have surprises!
|
| You have to consider other things (ex: day of the week).
|
| I would recommend avoiding the problem altogether: fit
| something (KDE, ARIMA...) on your timeseries, and plot the fit
| with the 95% CI from X=0 to X=24
|
| Do the same for the other series, from X=0 to X=7 and see if
| they are at least in the same CI
|
| Then compute the correlation of the 2 fits (just to have a
| reference number) and use that as a metric.
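|
| A very rough stand-in for such a fit, using Gaussian kernel
| regression with a normal-approximation band instead of a proper
| KDE/ARIMA model (t and y are numpy arrays; all names and
| defaults here are made up):
|
|   import numpy as np
|
|   def kernel_fit_ci(t, y, bandwidth=0.5, grid=200):
|       xs = np.linspace(t.min(), t.max(), grid)
|       # Gaussian weight of every observation at every grid point.
|       w = np.exp(-0.5 * ((t[None, :] - xs[:, None]) / bandwidth) ** 2)
|       w /= w.sum(axis=1, keepdims=True)
|       mean = w @ y
|       var = (w * (y[None, :] - mean[:, None]) ** 2).sum(axis=1)
|       n_eff = 1.0 / (w ** 2).sum(axis=1)  # effective sample size
|       half = 1.96 * np.sqrt(var / n_eff)  # ~95% band
|       return xs, mean, mean - half, mean + half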
| dabreegster wrote:
| To be more specific, I'm measuring throughput along road
| segments in a deterministic traffic simulation. Differences
| would occur when the user modifies the road network. The
| modifications often have no effect on most roads, but some
| wind up with more or less traffic than usual at a certain
| time.
|
| I'm interested in showing differences like "during morning
| rush hour, there was more traffic here, but it was about the
| same the rest of the day", so I'm not sure a single
| correlation as a metric would be best. Ideally I'd plot two
| line plots and call out differences over time.
| 1996 wrote:
| There is going to be complex behavior (adapting to traffic
| by routing around it) that a simulation will predict poorly.
|
| Also, you will have correlation between adjacent segments
| of the network, and the idea of extracting "rush hours"
| brings its own set of problems.
|
| Consult with a statistician familiar with geographical
| models. I'm sorry I can't help much more than that. It's
| very different from finance.
| dabreegster wrote:
| Surprises from emergent behavior are some of the most
| satisfying things about working on this.
|
| I'm not trying to extract rush hour patterns or anything
| quantitatively, just show two timeseries plots to the
| user and have them figure out what's going on.
|
| Thanks for the ideas! Lots of new things to research.
| Stats isn't my background at all.
| TrackerFF wrote:
| Do you have some criteria for the error?
| dabreegster wrote:
| Not particularly. The data comes from a deterministic traffic
| simulation, and I just want to show people roughly how their
| changes to a road network affect throughput along road
| segments. I'm interested in showing differences like "during
| morning rush hour, there was more traffic here, but it was
| about the same the rest of the day". "About the same" is
| fuzzy.
| contravariant wrote:
| You're basically calculating a histogram anyway, so the most
| sensible thing to do would be to calculate both histograms
| with the same buckets (simplest is just the sum every 5 minutes
| or so).
|
| At any rate you want to do the same thing to both datasets, and
| if it's counts of some random event then the only thing you can
| really compare is the sum total over the same period.
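|
| For example (a sketch; the bucket size is arbitrary):
|
|   from collections import Counter
|
|   def bucket_counts(events, bucket_secs=300):
|       # Sum (time, count) events into fixed 5-minute buckets;
|       # applying the same bucketing to both datasets makes the
|       # totals directly comparable.
|       buckets = Counter()
|       for t, count in events:
|           buckets[int(t // bucket_secs)] += count
|       return buckets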
| dabreegster wrote:
| This is basically the approach I'm using now, with the
| granularity being 1 hour.
| the middle of an hour, the bucket for the live sim only
| accounts for half of the hour, but the stored bucket from the
| full day's data represents the entire hour. I could assume
| the bucket accumulates its count linearly through the hour,
| and that'd be much better than my current approach, but it
| still feels a bit weak.
|
| 5 minute buckets would make this comparison problem easier,
| but it's too much data to store.
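|
| The linear assumption above would look something like this
| (a sketch):
|
|   def expected_so_far(stored_hour_count, minutes_into_hour):
|       # Pretend the stored hourly count accumulated evenly and
|       # pro-rate it to the elapsed fraction of the hour before
|       # comparing it with the live, partial bucket.
|       return stored_hour_count * (minutes_into_hour / 60.0)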
| dabreegster wrote:
| https://github.com/a-b-street/abstreet/issues/85#issuecommen...
|
| Here's a video of the problem with the current approach,
| which is storing a count per hour. No sliding windows;
| 7:59am and 8:01am are different buckets. Green is "less
| than usual", red is "more than usual." Every time the hour
| resets, lots of green appears, and slowly fades.
| benlivengood wrote:
| I've seen this called window alignment. You can pick a default
| and stick to it everywhere, e.g. truncate to the nearest minute
| or Nth minute; this only works cleanly when N divides an hour,
| so that it's easy to align different sources at any starting
| point.
|
| Resampling is pretty effective; linear interpolation is usually
| good enough because the data is already downsampled, presumably
| with an averaging or aggregating step that can be interpolated
| over. The mean-value theorem is your friend if you're looking
| at derivatives of time series data.
|
| The type of metric also matters; gauges can be simply
| interpolated. Counters make more sense to re-bucket into the
| chosen sampling period and offset, and for your specific
| question what you can do is switch from downsampled data to
| "realtime" data exactly at the bucket boundary so that you
| avoid the half-full bucket problem.
|
| A good way to treat counter timeseries is as bundles of updates
| (deltas) that arrive at a certain cadence. That makes the
| transition from different cadences smooth by resampling to
| avoid overlapping updates, e.g. if you have a downsampled
| bucket counting events from 1:00 to 1:10 then resample other
| timeseries so that buckets begin exactly at 1:10 or end exactly
| at 1:00 instead of trying to let them overlap.
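|
| A sketch of that delta re-bucketing (names are made up; each
| bundle's count is split linearly across the buckets it
| overlaps):
|
|   def rebucket(deltas, boundaries):
|       # deltas: (start, end, count) update bundles.
|       # boundaries: sorted target bucket edges.
|       out = [0.0] * (len(boundaries) - 1)
|       for start, end, count in deltas:
|           for i, (lo, hi) in enumerate(zip(boundaries,
|                                            boundaries[1:])):
|               overlap = max(0.0, min(end, hi) - max(start, lo))
|               if overlap > 0:
|                   # Guard against zero-length bundles.
|                   out[i] += count * overlap / (end - start or 1.0)
|       return out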
| dabreegster wrote:
| I haven't had nearly enough caffeine today to process this
| properly, but thank you -- it partly makes sense, and it's
| plenty to go off of later!
___________________________________________________________________
(page generated 2021-03-09 23:00 UTC)