[HN Gopher] Show HN: Gribstream.com - Historical Weather Forecast API
       ___________________________________________________________________
        
       Show HN: Gribstream.com - Historical Weather Forecast API
        
        Hello! I'd like to share my side project: https://gribstream.com
        It is an API to extract weather forecasting data from the
        National Blend of Models (NBM) https://vlab.noaa.gov/web/mdl/nbm
        and the Global Forecast System (GFS)
        https://www.ncei.noaa.gov/products/weather-climate-models/gl... .
        The data is freely available from AWS S3 in grib2 format, which
        can be great but is also really hard (and resource intensive) to
        work with, especially if you want to extract timeseries over
        long periods of time for a few coordinates. Being able to query
        and extract only what you want out of terabytes of data in just
        an http request is really nice.
         
        What is cool about this dataset is that it has hourly data with
        full forecast history, so you can use it to train models that
        forecast other parameters, with proper backtesting, because you
        can see the weather "as of" points in time in the past.
         
        There is a long list of upcoming features I intend to implement,
        and I would very much appreciate feedback both on what is
        currently available and on what features you would be most
        interested in seeing. For example, I'm not sure whether it would
        be better to support a few other datasets or to focus on
        supporting aggregations.
         
        Features include:
         
        - A free tier to help you get started and play with it
        - Full history of weather forecasts
        - Timeseries extraction for thousands of coordinates, for months
          at a time, at hourly resolution, in a single http request
          taking only seconds
        - As-of/time-travel queries, indispensable for proper
          backtesting of derivative models
        - Automatic gap filling of any missing data with the next best
          (most recent) forecast
         
        Please try it out and let me know what you think :)
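         
        To give a flavor of the intended usage, here is a rough sketch
        in Go of calling such an endpoint. The URL path and JSON field
        names below are illustrative placeholders, not the documented
        schema; see the docs on gribstream.com for the real request
        format.
         
            // Illustrative only: the endpoint path and field names
            // are placeholders, not the documented GribStream schema.
            package main
            
            import (
                "bytes"
                "encoding/json"
                "fmt"
                "io"
                "net/http"
            )
            
            func main() {
                query := map[string]any{
                    // "as of" restricts results to forecasts that
                    // were already published at that moment
                    "asOf":      "2024-11-01T00:00:00Z",
                    "fromTime":  "2024-11-01T00:00:00Z",
                    "untilTime": "2024-12-01T00:00:00Z",
                    "coordinates": []map[string]float64{
                        {"lat": 40.75, "lon": -73.98},
                    },
                    "variables": []string{"TMP"},
                }
                body, _ := json.Marshal(query)
                // placeholder URL; consult the API docs
                resp, err := http.Post(
                    "https://gribstream.com/api/PLACEHOLDER",
                    "application/json", bytes.NewReader(body))
                if err != nil {
                    panic(err)
                }
                defer resp.Body.Close()
                // the response is a small csv timeseries
                csv, _ := io.ReadAll(resp.Body)
                fmt.Println(string(csv))
            }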
        
       Author : ElPeque
       Score  : 62 points
       Date   : 2024-12-20 01:27 UTC (21 hours ago)
        
 (HTM) web link (gribstream.com)
 (TXT) w3m dump (gribstream.com)
        
       | antasvara wrote:
        | Just a heads up that there is a similar, free Python package
        | for this called Herbie [1], though this one's syntax looks a
        | little easier to use.
       | 
       | [1]: https://herbie.readthedocs.io/en/stable/
        
         | ElPeque wrote:
         | Interesting!
         | 
         | This is a little different though.
         | 
          | This dataset is part of the AWS Open Data program, so it is
          | freely available from S3. By running the API within AWS, you
          | get a massive latency and bandwidth advantage.
         | 
         | So with GribStream you are pushing the computation closer to
         | where the big data is and only downloading the small data. And
         | GribStream uses a custom grib2 parser that allows it to extract
         | the data in a streaming fashion, using very little memory.
         | 
         | It makes a huge difference if you need to extract timeseries of
         | a handful of coordinates for months at a time.
         | 
         | Cheers!
        
           | cbarrick wrote:
           | > And GribStream uses a custom grib2 parser that allows it to
           | extract the data in a streaming fashion, using very little
           | memory.
           | 
           | How does this compare to using Xarray on a netCDF dataset?
        
             | ElPeque wrote:
             | Grib2 files are much more compact than netCDF, just less
             | convenient to use. But GribStream takes care of that and
             | just returns you the timeseries for the coordinates you
             | need.
             | 
              | Besides using the usual index files to make http range
              | requests only for the weather parameters of interest,
              | GribStream also avoids allocating big memory buffers to
              | decode/decompress the whole grid. It does the decoding
              | in a streaming fashion and only accumulates the values
              | being looked for, so it can do so very efficiently. It
              | doesn't even finish downloading the partial grib file;
              | it aborts early. It also skips over many headers and
              | parts of the grib2 format that are not really required
              | or that can be assumed constant across the whole
              | dataset. In other words, it cuts all possible corners,
              | and the parser is (currently) optimized specifically
              | for the NBM and GFS datasets.
             | 
              | I do intend to support several other datasets, though,
              | like the Rapid Refresh (RAP) model.
              | 
              | And because this process runs close to the data (in
              | AWS), it can do all this way faster than you could run
              | it anywhere else.
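              | 
              | To make the range-request part concrete, here is a
              | rough sketch (not the actual GribStream code) of using
              | a grib2 .idx sidecar file, in the standard wgrib2-style
              | format, to compute the byte range for a single message:
              | 
              |     // Sketch: use a .idx sidecar to range-request
              |     // one grib2 message. Idx lines look like:
              |     //   N:offset:d=YYYYMMDDHH:VAR:level:forecast:
              |     package main
              |     
              |     import (
              |         "fmt"
              |         "net/http"
              |         "strconv"
              |         "strings"
              |     )
              |     
              |     // byteRange returns [start,end) offsets of the
              |     // first message whose variable matches want;
              |     // end == -1 means "until end of file".
              |     func byteRange(idx, want string) (int64, int64) {
              |         lines := strings.Split(idx, "\n")
              |         for i, l := range lines {
              |             f := strings.Split(l, ":")
              |             if len(f) < 4 || f[3] != want {
              |                 continue
              |             }
              |             start, _ := strconv.ParseInt(f[1], 10, 64)
              |             end := int64(-1)
              |             if i+1 < len(lines) {
              |                 n := strings.Split(lines[i+1], ":")
              |                 if len(n) > 1 {
              |                     end, _ = strconv.ParseInt(n[1], 10, 64)
              |                 }
              |             }
              |             return start, end
              |         }
              |         return -1, -1
              |     }
              |     
              |     func main() {
              |         idx := "1:0:d=2024122001:TMP:2 m:anl:\n" +
              |             "2:123456:d=2024122001:DPT:2 m:anl:\n"
              |         start, end := byteRange(idx, "TMP")
              |         req, _ := http.NewRequest("GET",
              |             "https://example.com/a.grib2", nil) // placeholder
              |         if end >= 0 {
              |             req.Header.Set("Range", fmt.Sprintf(
              |                 "bytes=%d-%d", start, end-1))
              |         } else {
              |             req.Header.Set("Range", fmt.Sprintf(
              |                 "bytes=%d-", start))
              |         }
              |         fmt.Println(req.Header.Get("Range")) // bytes=0-123455
              |         // the body can then be decoded in a streaming
              |         // fashion and the connection dropped once the
              |         // wanted values are out
              |     }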
        
         | westurner wrote:
         | blaylockbk/Herbie: https://github.com/blaylockbk/Herbie :
         | 
         | > _Download numerical weather prediction datasets (HRRR, RAP,
         | GFS, IFS, etc.) from NOMADS, NODD partners (Amazon, Google,
         | Microsoft), ECMWF open data, and the Pando Archive System_
         | 
         | The Herbie docs mention a "GFS GraphCast" but not yet GenCast?
         | https://herbie.readthedocs.io/en/stable/gallery/noaa_models/...
         | 
         | "GenCast predicts weather and the risks of extreme conditions
         | with state-of-the-art accuracy" (2024)
         | https://deepmind.google/discover/blog/gencast-predicts-weath...
         | 
         | "Probabilistic weather forecasting with machine learning"
         | (2024) ; GenCast paper
         | https://www.nature.com/articles/s41586-024-08252-9
         | 
          | google-deepmind/graphcast:
          | https://github.com/google-deepmind/graphcast
         | 
         | Are there error and cost benchmarks for these predictive
         | models?
        
       | alexose wrote:
       | This is great! I've always wanted to backtest forecast models,
       | but I was stuck on where to find good historical data.
        
         | ElPeque wrote:
         | Glad you like it! I hope it will suit your work.
        
       | Upitor wrote:
        | In my experience, NWP data is big. Like, really big! Data over
        | HTTP calls seems to limit the use case a bit; have you
        | considered making it possible to mount the storage directly
        | (fsspec), and using e.g. the zarr format? That way, querying
        | with xarray would be much more flexible.
        
         | ElPeque wrote:
          | The point of GribStream is that you don't need to download
          | the gridded data (because, as you say, it is huge! and it
          | grows by many GB every single hour, for years).
         | 
         | This is an API that will do streaming extraction and only have
         | you download what you actually need.
         | 
          | When you make the http request to the API, the API will be
          | processing up to terabytes of data only to respond to you
          | with maybe a few KB of csv.
        
         | westurner wrote:
         | Why is weather data stored in netcdf instead of tensors or
         | sparse tensors?
         | 
         | Also, SQLite supports _virtual tables_ that can be backed by
         | Content Range requests; https://www.sqlite.org/vtab.html
         | 
         | sqlite-wasm-http, sql.js-httpvfs; _HTTP VFS_ :
         | https://www.npmjs.com/package/sqlite-wasm-http
         | 
          | sqlite-parquet-vtable:
          | https://github.com/cldellow/sqlite-parquet-vtable
         | 
         | Could there be a sqlite-netcdf-vtable or a sqlite-gribs-vtable,
         | or is the dimensionality too much for SQLite?
         | 
         | From https://news.ycombinator.com/item?id=31824578 :
         | 
         | > _It looks like e.g. sqlite-parquet-vtable implements shadow
         | tables to memoize row group filters. How does JOIN performance
         | vary amongst sqlite virtual table implementations?_
         | 
         | https://news.ycombinator.com/item?id=42264274
         | 
          | SpatiaLite does _geo_ vector search with SQLite.
         | 
         | datasette can JOIN across multiple SQLite databases.
         | 
          | Perhaps datasette and datasette-lite could support xarray
          | and thus NetCDF-style multidimensional arrays in WASM in the
          | browser, with HTTP Content Range requests to fetch and cache
          | just the data requested.
         | 
         | "The NetCDF header":
         | https://climateestimate.net/content/netcdfs-and-basic-coding...
         | :
         | 
         | > _The header can also be used to verify the order of
         | dimensions that a variable is saved in (which you will have to
         | know to use, unless you're using a tool like xarray that lets
         | you refer to dimensions by name) - for a 3-dimensional
         | variable, `lon,lat,time` is common, but some files will have
         | the `time` variable first._
         | 
         | "Loading a subset of a NetCDF file":
         | https://climateestimate.net/content/netcdfs-and-basic-coding...
         | 
         | From https://news.ycombinator.com/item?id=42260094 :
         | 
          | > _xeus-sqlite-kernel > "Loading SQLite databases from a
          | remote URL"_
          | https://github.com/jupyterlite/xeus-sqlite-kernel/issues/6#i...
          | 
          |     %FETCH <url> <filename>
        
           | ElPeque wrote:
           | In theory it could be done. It is sort of analogous to what
           | GribStream is doing already.
           | 
            | The grib2 files _are_ the storage. They are sorted by time
            | in the path, so the path is used like a primary index. And
            | then grib2 is just a binary format to decode to extract
            | what you want.
            | 
            | I originally was going to write this as a plugin for
            | ClickHouse, but in the end I made it a Golang API because
            | then I'm less constrained in other ways. For example, I'd
            | like to create an endpoint to live-encode the grib files
            | into MP4 so the data can be served as video. Then, with
            | any video player, you would be able to play it back, jump
            | to times, etc.
            | 
            | I might still write a ClickHouse integration though,
            | because it would be amazing to join and combine with other
            | datasets on the fly.
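            | 
            | To illustrate the path-as-index idea, here is a sketch of
            | deriving an object key from a forecast cycle and lead
            | time. The key layout is assumed from the NBM open data
            | bucket and may differ in detail:
            | 
            |     // Assumed NBM key layout on the AWS open data
            |     // bucket; treat the exact pattern as illustrative.
            |     package main
            |     
            |     import (
            |         "fmt"
            |         "time"
            |     )
            |     
            |     func nbmKey(cycle time.Time, lead int) string {
            |         return fmt.Sprintf(
            |             "blend.%s/%02d/core/blend.t%02dz.core.f%03d.co.grib2",
            |             cycle.Format("20060102"), cycle.Hour(),
            |             cycle.Hour(), lead)
            |     }
            |     
            |     func main() {
            |         c := time.Date(2024, 12, 20, 1, 0, 0, 0, time.UTC)
            |         // keys sort lexicographically by time, so a time
            |         // range maps to a contiguous range of keys
            |         fmt.Println(nbmKey(c, 3))
            |         // blend.20241220/01/core/blend.t01z.core.f003.co.grib2
            |     }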
        
             | tomnicholas1 wrote:
             | > It is sort of analogous to what GribStream is doing
             | already.
             | 
             | The difference is presumably that you are doing some large
             | rechunking operation on your server to hide from the user
             | the fact that the data is actually in multiple files?
             | 
             | Cool project btw, would love to hear a little more about
             | how it works underneath :)
        
               | ElPeque wrote:
               | Yeah, exactly.
               | 
               | I basically scrape all the grib index files to know all
               | the offsets into all variables for all time. I store that
               | in clickhouse.
               | 
               | When the API gets a request for a time range, set of
               | coordinates and a set of weather parameters, first I pre-
               | compute the mapping of (lat,lon) into the 1 dimensional
               | index in the gridded data. That is a constant across the
               | whole dataset. Then I query the clickhouse table to find
               | out all the files+offset that need to be processed and
               | all of them are queued into a multi-processing pool. And
               | then processing each parameter implies parsing a grib
               | file. I wrote a grib2 parser from scratch in golang so as
               | to extract the data in a streaming fashion. As in... I
               | don't extract the whole grid only to lookup the
               | coordinates in it. I already pre-computed the index, so I
               | can just decode every value in the grid in order and when
               | I hit and index that I'm looking for, I copy it to a
               | fixed size buffer with the extracted data. When You have
               | all the pre-computed indexes then you don't even need to
               | finish downloading the file, I just drop the connection
               | immediately.
               | 
               | It is pretty cool. It is running in very humble hardware
               | so I'm hoping I'll get some traction so I can throw more
               | money into it. It should scale pretty linearly.
               | 
               | I've tested doing multi-year requests and the golang
               | program never goes over 80Mb of memory usage. The CPUs
               | get pegged so that is the limiting factor.
               | 
               | Grib2 complex packing (what the NBM dataset uses) implies
               | lots of bit-packing. So there is a ton more to optimize
               | using SIMD instructions. I've been toying with it a bit
               | but I don't want to mission creep into that yet
               | (fascinating though!).
               | 
               | I'm tempted to port this https://github.com/fast-
               | pack/simdcomp to native go ASM.
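                | 
                | As a minimal sketch (not the real parser): map the
                | wanted coordinates to 1-D grid indexes once, then
                | take decoded values in order, keeping only the hits
                | and stopping at the last one. (The real NBM grid is
                | projected, so the index math below, which assumes an
                | equally spaced grid, is a simplification.)
                | 
                |     package main
                |     
                |     import "fmt"
                |     
                |     // gridIndex maps a point to its row-major
                |     // 1-D index on an equally spaced grid.
                |     func gridIndex(lat, lon, lat0, lon0,
                |         step float64, nCols int) int {
                |         row := int((lat - lat0) / step)
                |         col := int((lon - lon0) / step)
                |         return row*nCols + col
                |     }
                |     
                |     // scan pulls one decoded value at a time
                |     // and stops once the last sorted target is
                |     // filled, so the rest of the file is never
                |     // downloaded.
                |     func scan(dec func() (float64, bool),
                |         targets []int, out []float64) {
                |         next := 0
                |         for i := 0; next < len(targets); i++ {
                |             v, ok := dec()
                |             if !ok {
                |                 return
                |             }
                |             if i == targets[next] {
                |                 out[next] = v
                |                 next++
                |             }
                |         }
                |     }
                |     
                |     func main() {
                |         // toy 3x3 grid, decoded value by value
                |         grid := []float64{1, 2, 3, 4, 5,
                |             6, 7, 8, 9}
                |         i := 0
                |         dec := func() (float64, bool) {
                |             if i == len(grid) {
                |                 return 0, false
                |             }
                |             i++
                |             return grid[i-1], true
                |         }
                |         t := []int{gridIndex(
                |             0.5, 0.25, 0, 0, 0.25, 3)}
                |         out := make([]float64, len(t))
                |         scan(dec, t, out)
                |         fmt.Println(out) // [8]
                |     }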
        
               | tomnicholas1 wrote:
               | That's pretty cool! Quite specific to this file
               | format/workload, but this is an important enough problem
               | that people might well be interested in a tailored
               | solution like this :)
        
           | tomnicholas1 wrote:
           | > Why is weather data stored in netcdf instead of tensors or
           | sparse tensors?
           | 
           | NetCDF is a "tensor", at least in the sense of being a self-
           | describing multi-dimensional array format. The bigger problem
           | is that it's not a Cloud-Optimized format, which is why Zarr
           | has become popular.
           | 
           | > Also, SQLite supports virtual tables that can be backed by
           | Content Range requests
           | 
            | The multi-dimensional equivalent of this is "virtual
            | Zarr". I made this library to create virtual Zarr stores
            | pointing at archival data (e.g. netCDF and GRIB):
           | 
           | https://github.com/zarr-developers/VirtualiZarr
           | 
           | > xarray and thus NetCDF-style multidimensional arrays in
           | WASM in the browser with HTTP Content Range requests to fetch
           | and cache just the data requested
           | 
           | Pretty sure you can do this today already using Xarray and
           | fsspec.
        
       | greggsy wrote:
        | There used to be a really nice website that let you scroll
        | and scale years' worth of historical weather for any location.
        | It also let you prepare average annual graphs.
       | 
       | It fell into disrepair after Flash was killed off, and the
       | maintainer wasn't able to commit time to porting it over to a new
       | platform.
       | 
       | I figure that the commodification of weather data is the real
       | reason why it hasn't been replaced with a viable alternative.
        
         | CalRobert wrote:
         | Weatherspark! It was amazing.
        
           | gloflo wrote:
           | Was or is?
           | 
           | https://weatherspark.com/ is awesome.
        
             | ElPeque wrote:
             | Oh wow. Those are really cool visualizations. I can't
             | compete :P
        
             | victorbjorklund wrote:
             | Wow. This site is just amazing. Nice and useful (!) graphs
             | especially for comparing cities between each other. And the
             | small things like how they store all the state in the url
             | so I could just share a comparison with a friend by just
             | copying the url. Perfect.
        
             | CalRobert wrote:
              | Still is! But it used to have much more.
        
         | ElPeque wrote:
         | Interesting!
         | 
         | Can you describe how it worked? Who knows, maybe I could make
         | it happen.
         | 
         | I've been thinking of creating a more involved "demo" to entice
         | traffic/links/SEO. Maybe something like this could be it.
        
       | TripleChecker wrote:
       | Small typo on homepage/demo - ice 'probabilty'
        
         | ElPeque wrote:
         | True!
         | 
         | Fixed!
         | 
         | Thank you!
        
           | TripleChecker wrote:
           | You got it!
        
       | drusenko wrote:
        | I was initially very excited, because this data is not nearly
        | as accessible as it should be, especially historical
        | forecasts. However, your pricing model seems to seriously
        | limit the potential uses.
       | 
       | I would imagine that most people who have a serious interest in
       | weather forecasting and would be target users of this service
       | don't think in terms of number of points but rather in lat/lon
       | bounds, resolution, and number of hours & days for the
       | predictions. I imagine they would also like to download a GRIB
       | and not a CSV.
       | 
        | Your pricing for any area large enough to be useful presumably
        | gets somewhat prohibitive, e.g. covering the North Pacific
        | (useful for West Coast modeling) at 0.25 deg resolution might
        | be ~300k data points per hour, if I am doing my math right?
        
         | ElPeque wrote:
         | This is really great feedback. The truth is that the pricing
         | model is being figured out, so if you have a specific type of
         | use in mind maybe we could figure out what works best and it
         | might become the actual default pricing model.
         | 
         | I tried to tie the pricing with the amount of processing that
         | the API needs to do, which is closely related to the number of
         | grib2 files that the API needs to download and process in order
         | to create the response. And it doesn't change as much wether I
         | extract 1 point or 1000 points. But I thought I had to draw the
         | line somewhere or nobody would ever pay for anything because
         | the freetier is enough.
         | 
         | But I might make it same price for maybe chunks of 5000 or more
         | points.
         | 
         | From the line of business I come from the main usage is
         | actually to extract scattered coordinates (think weather where
         | specific assets are, like hotels or solar panels or wind farms)
         | and not whole boundaries at full resolution but it makes a lot
         | of sense that for other types of usage that is not the case.
         | 
         | It is definitely in the roadmap to be able to select based on
         | lat/lon bounds and even shapes. Also to return data not as
         | timeseries but the gridded data itself, either as grib2 or
         | netCDF or parquet or a plain matrix of floats or png or even
         | mp4 video.
        
         | ElPeque wrote:
          | Shoot me an email and I'll reach out when I can implement
          | selecting based on bounds.
          | 
          | Off the top of my head, re-slicing into grib files for the
          | response is probably a big lift, but some of the other
          | formats, like maybe netCDF or geoTIFF or just a compressed
          | array of floats, might make a nice MVP.
         | 
         | info@gribstream.com
        
       ___________________________________________________________________
       (page generated 2024-12-20 23:02 UTC)