[HN Gopher] Loading a trillion rows of weather data into Timesca...
___________________________________________________________________
Loading a trillion rows of weather data into TimescaleDB
Author : PolarizedPoutin
Score : 292 points
Date : 2024-04-16 12:39 UTC (10 hours ago)
(HTM) web link (aliramadhan.me)
(TXT) w3m dump (aliramadhan.me)
| semiquaver wrote:
| Any idea why hypertable insert rates were slower? I thought
| hypertables were supposed to _increase_ insert rates?
| leros wrote:
| Isn't data inserted into basically a normal Postgres table with
| hypertable extensions? I don't know the details of Timescale
| but that sounds like it would incur a cost of a normal Postgres
| insert, plus potentially extra work at insert time, plus extra
| work in the background to manage the hypertable.
| rahkiin wrote:
| Not entirely. A hypertable is a postgres table chunked over
| time. There is the assumption that most data and queries are
| time-relevant, but also that older data is less relevant than
| new data.
|
| Indexes are per chunk. So if the query analyzer understands
| you only touch 2023 it can omit looking at any chunk that is
| from other years and keep those out of memory. Same with the
| indexes.
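|
| A rough sketch of what that looks like (table and column names
| here are made up):
|
|   -- chunks are created per time interval
|   SELECT create_hypertable('weather', 'time',
|     chunk_time_interval => INTERVAL '1 month');
|
|   -- the planner can exclude every chunk outside 2023
|   EXPLAIN SELECT avg(temperature) FROM weather
|   WHERE time >= '2023-01-01' AND time < '2024-01-01';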
| PolarizedPoutin wrote:
| Yeah I'm curious about this too. Been meaning to ask on the
| Timescale forums. My only guess is that there's some small
| extra overhead due to hypertable chunking.
|
| I know Timescale has a blog post from 2017 claiming a 20x
| higher insert rate but that's for inserting into a table with
| an index. The general wisdom for loading huge amounts of data
| seems to be that you should insert into a table with no indexes
| then build them later though. So with no index, inserting into
| a hypertable seems a bit slower.
|
| Timescale blog post: https://medium.com/timescale/timescaledb-
| vs-6a696248104e
| h4kor wrote:
| Timescale hypertables automatically have an index on the
| timestamp. To make this more comparable you could create the
| same index on the normal table and test the ingestion rate.
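|
| For example, something along these lines (assuming a plain table
| called weather_plain with the same time column):
|
|   -- hypertables get an index like this by default
|   CREATE INDEX ON weather_plain (time DESC);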
| perrygeo wrote:
| Hypertable insert rates are faster and more predictable _over
| time_. Each individual insert might incur a small bit of extra
| overhead, but they scale forever since each temporal chunk is
| indexed separately vs a regular table where the entire index
| needs to fit in memory. This is a case where you can't make
| meaningful inferences from micro-benchmarks (they tested 20k
| rows, you probably need 200M to start seeing the diff)
| PolarizedPoutin wrote:
| Thanks for the insight! It is true that I started with a
| micro-benchmark of 20k rows for slower inserts, but I also
| did some longer benchmarks with ~772 million rows.
| hyperman1 wrote:
| Two remarks with postgres and lots of data:
|
| 1) I always wonder if there is a better way than COPY. I tend to
| quickly get 100% CPU without saturating I/O
|
| 2) The row overhead seems big. A row has 23 bytes overhead, this
| table has 48 bytes data per row, so even without page overhead,
| we lose ~1/3 of our storage. This is pure data storage, without
| any index.
| PolarizedPoutin wrote:
| 1. Yeah to me it seems very hard to saturate I/O with Postgres
| unless maybe you insert into an unlogged table. I guess there's
| quite a bit of overhead to get all the nice
| stability/consistency and crash-resistance.
|
| 2. That is a good point. I'm hoping TimescaleDB's compression
| helps here but yeah I don't think you'll ever get the database
| size below the data's original footprint.
| feike wrote:
| Timescaler here, if you configure the
| timescaledb.compress_segmentby well, and the data suits the
| compression, you can achieve 20x or more compression.
|
| (On some metrics data internally, I have 98% reduction in
| size of the data).
|
| One of the reasons this works is due to only having to pay
| the per-tuple overhead once per grouped row, which could cover
| as many as 1000 rows.
|
| The other is the compression algorithm, which can be
| TimescaleDB's own or plain PostgreSQL TOAST:
|
| https://www.timescale.com/blog/time-series-compression-
| algor... https://www.postgresql.org/docs/current/storage-
| toast.html
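|
| As a rough sketch (table and column names are hypothetical):
|
|   ALTER TABLE weather SET (
|     timescaledb.compress,
|     timescaledb.compress_segmentby = 'station_id',
|     timescaledb.compress_orderby = 'time DESC'
|   );
|   -- compress chunks once they are a week old
|   SELECT add_compression_policy('weather', INTERVAL '7 days');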
| hyperman1 wrote:
| If I look into perf, it seems mostly parsing overhead. I can
| saturate I/O with a CREATE TABLE newtable AS SELECT FROM
| oldtable. Unfortunately, CSV still seems to be the lingua franca
| for transport between DBs. Maybe some day a more binary-oriented
| transport format will appear (e.g. Parquet?)
| feike wrote:
| Many libraries for python, Rust, golang support COPY
| BINARY.
|
| The times I've tested it, the improvement is very small
| compared to plain COPY or COPY with CSV, whereas it does
| require more work and thought upfront to ensure the binary
| actually works correctly.
|
| https://www.postgresql.org/docs/current/sql-copy.html
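|
| For reference, the server-side form is just a COPY option (the
| path, table, and columns here are placeholders):
|
|   COPY weather (time, latitude, longitude, temperature)
|   FROM '/tmp/weather.bin' WITH (FORMAT binary);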
| lawn wrote:
| What an interesting post!
|
| > At a sustained ~462k inserts per second, we're waiting ~20 days
| for our ~754 billion rows, which is not bad I guess. It's less
| time than it took me to write this post.
|
| Hah, as I've been gravitating more to writing larger and more
| in-depth blog posts, I can relate to the surprising effort it can
| require.
| PolarizedPoutin wrote:
| Thank you! Yeah haha some of the benchmarks took several hours
| (and a few re-runs) and there was a lot of learning done along
| the way.
| ZeroCool2u wrote:
| I've done a good amount of geospatial analysis for work.
|
| One thing you quickly realize with geospatial data is that it's
| incredibly nuanced. You have to be quite careful about
| understanding which coordinate reference system (CRS) and, for
| visualization, which projection is being used. The CRS is somewhat
| paranoia-inducing if you don't have great infrastructure set up
| with the right tools to carry that metadata with your geospatial
| data.
|
| I've tested everything AWS has to offer, Postgres/PostGIS,
| Spark/DB, Snowflake, Trino, and ArcGIS.
|
| I'm convinced the best tool for large scale geospatial work is
| Google BigQuery and it's not even close. A query that took
| multiple expensive hours running on PostGIS deployed on an
| enormous m6a EC2 instance ran in less than 5 seconds in the
| BigQuery free tier. It does make sense if you think about it:
| Google was early
| with Maps, but it is just stunning how much better they are in
| this specific niche domain.
|
| This was using publicly available FEMA data that Snowflake and
| AWS services would just choke on, because the geometry column
| exceeded their maximum byte size. Spark doesn't even have
| geospatial data types and the open source packages that add
| support leave a lot to be desired.
|
| This guy is running on-prem, so maybe this made sense, but I just
| would never bother. The storage for BQ would probably be less
| than $100/month for 20 TB.
| Cthulhu_ wrote:
| I hear so many good things about BigTable / BigQuery, it's a
| shame I've had no opportunity to use them yet.
| ingenieroariel wrote:
| I went through a similar phase with a pipeline that started
| from global OSM and Whosonfirst data. Google costs kept going
| up ($7k a month with Airflow + BigQuery) and I was able to
| replace it with a one-time $7k hardware purchase. We were able
| to do it since the process was using H3 indices early on and
| the resulting intermediate datasets all fit in RAM.
|
| The system is a Mac Studio with 128GB + Asahi Linux + mmapped
| Parquet files and DuckDB. It also runs Airflow for us, and with
| Nix it can be used to accelerate developer builds and run the
| Airflow tasks for the data team.
|
| GCP is nice when it is free/cheap but they keep tabs on what
| you are doing and may surprise you at any point in time with
| ever higher bills without higher usage.
| jfim wrote:
| BigQuery is nice but it's definitely a major foot-gun in
| terms of cost. It's surprisingly easy to rack up high costs
| with say a misconfigured dashboard or a developer just
| testing stuff.
| dekhn wrote:
| Frankly I think this is just a sign that it's a power tool
| for power users.
| carlhjerpe wrote:
| Sadly my colleagues aren't always "power users"
| brailsafe wrote:
| Nobody starts as a power user
| ZeroCool2u wrote:
| That is a very cool setup!
|
| My org would never allow that as we're in a highly regulated
| and security conscious space.
|
| Totally agree about the BQ costs. The free tier is great and
| I think pretty generous, but if you're not very careful with
| enforcing table creation only with partitioning and
| clustering as much as possible, and don't enforce some
| training for devs on how to deal with columnar DB's if
| they're not familiar, the bills can get pretty crazy quickly.
| hawk_ wrote:
| > and may surprise you at any point in time with ever higher
| bills without higher usage.
|
| What? really? Do they change your pricing plan? How can they
| charge more for the same usage?
| PolarizedPoutin wrote:
| Thank you for the insights! Yeah I'm still not sure how
| Postgres/PostGIS will scale for me, but good to know that
| BigQuery does this nicely.
|
| This is not something I'm productionizing (at least not yet?)
| and I'm giving myself zero budget since it's a side project,
| but if/when I do I'll definitely look into BigQuery!
| ZeroCool2u wrote:
| Good luck! They have some really great tutorials on how to
| get started with BQ and geospatial data. One other nuance of
| BigQuery that doesn't seem to apply to many other tools in
| this space is that you can enable partitioning on your tables
| in addition to clustering on the geometry (Geography in BQ)
| column.
|
| https://cloud.google.com/bigquery/docs/geospatial-
| data#parti...
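|
| A minimal sketch of that pattern (dataset, table, and column
| names are made up):
|
|   CREATE TABLE my_dataset.flood_zones (
|     updated_at TIMESTAMP,
|     zone_id    STRING,
|     geom       GEOGRAPHY
|   )
|   PARTITION BY DATE(updated_at)
|   CLUSTER BY geom;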
| xyzzy_plugh wrote:
| Do you mind linking the specific dataset? I agree very wide
| columns break a lot of tools but other columnar postgres forks
| should support this no problem. It sounds like you didn't use
| Redshift, which I find surprising as it directly competes with
| BQ. Redshift has "super" columns that can be very large, even
| larger than the maximum supported by BigQuery.
|
| I constantly see folks finding out the hard way that PostGIS is
| _really_ hard to beat. The fact that Trino/Presto and Spark
| have languished here is particularly telling.
| ZeroCool2u wrote:
| It's FEMA's NFHL. I can't recall the specific layer of the
| GDB file, but you could probably figure it out. Try loading
| up Iowa into Redshift, and if that works for you I'd be quite
| surprised.
|
| My org has a very large AWS spend and we got to have a chat
| with some of their SWE's that work on the geospatial
| processing features for Redshift and Athena. We described
| what we needed and they said our only option was to aggregate
| the data first or drop the offending rows. Obviously we're
| not interested in compromising our work just to use a
| specific tool, so we opted for better tools.
|
| The crux of the issue was that the large problem column was
| the geometry itself. Specifically, MultiPolygon. You need to
| use the geometry datatype for this[1]. However, our
| MultiPolygon column was 10's to 100's of MB's. Well outside
| the max size for the Super datatype from what I can tell as
| it looks like that's 16 MB.
|
| [1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryT
| ype-...
| beeboobaa3 wrote:
| Just split it out into multiple polygons, one per row. If
| you're using a relational database, do as relational
| databases do.
| winrid wrote:
| M6a is not even remotely enormous. Also were you using EBS?
| ZeroCool2u wrote:
| m6a.48xlarge: 192 vCPU & 768 GiB
|
| If that's not a large node for you, well you're living in a
| different world from me. Yes to using EBS.
| winrid wrote:
| Thanks for the info. The issue is using EBS. If you used an
| instance with NVMe drives it would probably have been
| faster than BQ (and you aren't billed per-query...). I
| would suggest an R or I4 class instance for this; m6 is not
| good for the money here.
|
| You would just have to set up replication for backups, but
| this could just be rsync to EBS or some other replication
| solution depending on your database.
| sgarland wrote:
| Unless you're CPU-bound (unlikely), the r-family is usually
| a better fit for RDBMS, IME. 8x RAM:vCPU ratio, compared to
| 4x for m-family.
|
| Then there's the x-family, which can go up to bonkers
| levels, like 4 TiB of RAM (and local NVMe).
|
| As a sibling comment mentioned, though, if you can fit the
| data into local storage, that's the way to do it. EBS
| latency, even with io2.blockexpress, simply cannot compete.
|
| That said, if I did the math correctly based on the 71
| GB/31 days footnote, you're looking at about 2.15 PB to
| load the entire thing, so, uh, good luck.
| ayewo wrote:
| In your experience, is the r-family better than the
| c-family for running an RDBMS?
| aPoCoMiLogin wrote:
| Things have been changing recently in AWS, but a few years
| ago R-family instances had one of the fastest uplinks to EBS.
| For example, only the largest M4 instance (m4.16xlarge) has a
| 10 Gbps uplink, versus R5b, where it starts from 10 Gbps (for
| the lowest tier) and goes up to 60 Gbps @ 260k IOPS. You can
| very easily choke EBS with a DB.
|
| EDIT: only newer C instances have a comparable uplink to
| EBS; C5 and C4 (and some C6) start from ~4.7 Gbps. Just
| compare the EBS bandwidth column in
| https://aws.amazon.com/ec2/instance-types/
| sgarland wrote:
| The other fun thing about AWS instances is that the
| network uplink speed isn't always what's advertised.
| There is of course the "Up to X Gbps" levels (30 minutes
| of rated speed guaranteed every 24 hours), but there are
| also other limits, like cross-region speed being capped
| at 50% of rated capacity.
|
| This rarely matters until it does, like if you're setting
| up an active-active DB across regions. Then it's a fun
| surprise.
| sgarland wrote:
| Yes. Most RDBMS are memory-bound (indexes, connection
| overhead, buffer pool...), so the more, the better.
|
| At my last job, I switched a large-ish (100K QPS at peak)
| m5 MySQL instance to an r6i that was cheaper, despite
| having the same amount of RAM and being a generation newer.
| That, combined with careful tuning and testing, resulted
| in queries speeding up 20-40%, AND we then had plenty of
| room for vertical scaling again if necessary.
| detourdog wrote:
| I'm glad to hear this firsthand experience. I'm pretty sure
| that I want to build and refine my own geospatial data hoard.
|
| I wonder if you think a long-term project is better off rolling
| its own. The second priority is that I expect it all to be
| locally hosted.
|
| Thanks for any considerations.
| ZeroCool2u wrote:
| Frankly I think for long term hoarding BQ is hard to beat.
| The storage costs are pretty reasonable and you never pay for
| compute until you actually run a query, so if you're mostly
| just hoarding, well, you're probably going to save a lot of
| time, money, and effort in the long run.
| rtkwe wrote:
| Even more fun I bet would be if you're getting data with
| different reference spheroids.
| johnymontana wrote:
| > Spark doesn't even have geospatial data types and the open
| source packages that add support leave a lot to be desired.
|
| Could you say more about this? I'm curious if you've compared
| Apache Sedona [0] and what specifically you found lacking? I
| currently work at Wherobots [1], founded by the creators of
| Apache Sedona and would love to hear any feedback.
|
| [0] https://sedona.apache.org/latest/
|
| [1] https://wherobots.com/
| rkwasny wrote:
| Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of
| NOAA climate data on my desktop that I query when I want to go on
| holidays ;-)
| anentropic wrote:
| Do you say that because it's quicker to insert large batches of
| rows into Clickhouse, or because it's better in other ways?
|
| (I'm currently inserting large batches of rows into MySQL and
| curious about Clickhouse...)
| rkwasny wrote:
| It's better in insert speed, query speed and used disk
| storage.
| PolarizedPoutin wrote:
| Haha very cool use! Yeah reading up on TimescaleDB vs.
| Clickhouse it seems like columnar storage and Clickhouse will
| be faster and better compress the time series data. For now I'm
| sticking to TimescaleDB to learn Postgres and PostGIS, but
| might make a TimescaleDB vs. Clickhouse comparison when I
| switch!
| rkwasny wrote:
| I can replicate your benchmark when I get a moment. As for the
| data, is it free to share if I wanted to make an open browser?
|
| I have a feeling the general public has very limited access to
| weather data, and graphs in the news that state "are we getting
| hotter or colder" are all sensational and hard to verify.
| cevian wrote:
| Please note that TimescaleDB also uses columnar storage for
| its compressed data.
|
| Disclosure: I am a TimescaleDB engineer.
| dangoodmanUT wrote:
| Agreed, clickhouse is faster and has better features for this
| RyanHamilton wrote:
| I agree. Clickhouse is awesomely powerful and fast. I
| maintain a list of benchmarks, if you know of any speed
| comparisons please let me know and I will add them to the
| list: https://www.timestored.com/data/time-series-database-
| benchma...
| lyapunova wrote:
| Thanks for maintaining benchmarks here. Is there a github
| repo that might accompany the benchmarks that I could take
| a look at / reproduce?
| gonzo41 wrote:
| So ClickHouse is a columnar DB. Any thoughts on whether the
| performance would be a wash if you just pivoted the Timescale
| hypertable and indexed the time + column on Timescale?
| mfreed wrote:
| Our experience is that Clickhouse and Timescale are designed
| for different workloads, and that Timescale is optimized for
| many of the time-series workloads people use in production:
|
| - https://www.timescale.com/blog/what-is-clickhouse-how-
| does-i...
|
| Sidenote: Timescale _does_ provide columnar storage. I don't
| believe that the blog author focused on this as part of insert
| benchmarks:
|
| - Timescale columnar storage:
| https://www.timescale.com/blog/building-columnar-compression...
|
| - Timescale query vectorization:
| https://www.timescale.com/blog/teaching-postgres-new-tricks-...
| rkwasny wrote:
| Well, as a Co-founder and CTO of Timescale, would you say
| TimescaleDB is a good fit for storing weather data as OP
| does?
| mfreed wrote:
| TimescaleDB primarily serves operational use cases:
| Developers building products on top of live data, where you
| are regularly streaming in fresh data, and you often know
| what many queries look like a priori, because those are
| powering your live APIs, dashboards, and product
| experience.
|
| That's different from a data warehouse or many traditional
| "OLAP" use cases, where you might dump a big dataset
| statically, and then people will occasionally do ad-hoc
| queries against it. This is the big weather dataset file
| sitting on your desktop that you occasionally query while
| on holidays.
|
| So it's less about "can you store weather data", but what
| does that use case look like? How are the queries shaped?
| Are you saving a single dataset for ad-hoc queries across
| the entire dataset, or continuously streaming in new data,
| and aging out or de-prioritizing old data?
|
| In most of the products we serve, customers are often
| interested in recent data in a very granular format
| ("shallow and wide"), or longer historical queries along a
| well defined axis ("deep and narrow").
|
| For example, this is where the benefits of TimescaleDB's
| segmented columnar compression emerges. It optimizes for
| those queries which are very common in your application,
| e.g., an IoT application that groups by or selects by
| deviceID, crypto/fintech analysis based on the ticker
| symbol, product analytics based on tenantID, etc.
|
| If you look at ClickBench, what most of the queries do is
| scan ALL the data in your database and GROUP BY one
| of the 100 columns in the web analytics logs.
|
| - https://github.com/ClickHouse/ClickBench/blob/main/clickh
| ous...
|
| There are almost no time-predicates in the benchmark that
| Clickhouse created, but perhaps that is not surprising
| given it was designed for ad-hoc weblog analytics at
| Yandex.
|
| So yes, Timescale serves many products today that use
| weather data, but has made different choices than
| Clickhouse (or things like DuckDB, pg_analytics, etc) to
| serve those more operational use cases.
| carderne wrote:
| Hey OP (assuming you're the author), you might be interested in
| this similar experiment I did about four years ago, same dataset,
| same target, similar goal!
|
| https://rdrn.me/optimising-sql/
|
| Similar sequence of investigations, but using regular Postgres
| rather than Timescale. With my setup I got another ~3x speedup
| over COPY by copying binary data directly (assuming your data is
| already in memory).
| PolarizedPoutin wrote:
| Wish I saw this before I started haha! I left a footnote about
| why I didn't try binary copy (basically someone else found its
| performance disappointing) but it sounds like I should give it
| a try.
|
| footnote: https://aliramadhan.me/2024/03/31/trillion-
| rows.html#fn:copy...
| carderne wrote:
| Yeah I imagine it depends where the data is coming from and
| what exactly it looks like (num fields, dtypes...?). What I
| did was source data -> Numpy Structured Array [0] -> Postgres
| binary [1]. Bit of a pain getting it into the required shape,
| but if you follow the links the code should get you going
| (sorry no type hints!).
|
| [0] https://rdrn.me/optimising-sampling/#round-10-off-the-
| deep-e... [1] In the original blog I linked.
| anentropic wrote:
| I'd love to hear from anyone who's done the same in MySQL
| dunefox wrote:
| OT: Does anyone know if DuckDB would be of use here?
| wiredfool wrote:
| Not likely, unless it's a set of parquet files already.
| Clickhouse would be a better bet.
| RyanHamilton wrote:
| If you want to plot time-series charts or many other charts
| directly from sql queries, qStudio is a free SQL IDE and works
| with everything including TimescaleDB:
| https://www.timestored.com/qstudio/database/timescale Disclaimer,
| I am the author.
| ayewo wrote:
| What's the process for adding support for other databases to
| your tool qStudio?
|
| I'm thinking perhaps you could add support for Timeplus [1]?
| Timeplus is a streaming-first database built on ClickHouse. The
| core DB engine Timeplus Proton is open source [2].
|
| It seems that qStudio is open source [3] and written in Java
| and will need a JDBC driver to add support for a new RDBMS? If
| yes, Timeplus Proton has an open source JDBC driver [4] based
| on ClickHouse's driver but with modifications added for
| streaming use cases.
|
| 1: https://www.timeplus.com/
|
| 2: https://github.com/timeplus-io/proton
|
| 3: https://github.com/timeseries/qstudio
|
| 4: https://github.com/timeplus-io/proton-java-driver
| rjmunro wrote:
| This annoys me:
|
| > You can use parameterized queries to protect against SQL
| injection,
|
| No, that's not what parameterized queries are for. That's just a
| side benefit.
|
| Parameterized queries are so that the database doesn't have to
| parse the data from SQL to do the query. It's stupid to turn all
| the data to text, send the text over the wire, escaped and quoted
| as appropriate, then parse it back to whatever form it was in
| originally when you can just send it binary straight to the
| database.
|
| And if you are doing many similar queries with different
| parameters, e.g. inserts as here, or maybe querying the user
| table by user id on every page load, the database doesn't have
| to parse any SQL each
| time and can even reuse the query plan.
|
| This may be why psycopg3 performed better than pandas df.to_sql()
| function in the single row insert case.
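|
| For illustration, the server-side equivalent of a parameterized,
| reusable insert (table and columns are made up):
|
|   PREPARE ins (timestamptz, real) AS
|     INSERT INTO weather (time, temperature) VALUES ($1, $2);
|
|   -- parsed and planned once, then executed many times
|   EXECUTE ins ('2024-01-01 00:00+00', 3.7);
|   EXECUTE ins ('2024-01-01 01:00+00', 3.5);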
| PolarizedPoutin wrote:
| Thank you for reading through it thoroughly and pointing this
| out! I'm still new and learning Postgres so this is good to
| know. I will update the post.
| munk-a wrote:
| I would note that I think the above is a rather minority
| opinion - while parameterized queries are great for reuse
| simply using them to guard against SQL injection is still an
| excellent use of them. If your query is reusable then go for
| it, but most complex queries tend not to be reusable if they
| involve optional inputs.
| eddd-ddde wrote:
| Can it actually reuse the query plan? Couldn't that lead to a
| worse performing plan eventually? Say after inserting lots of
| data such that a different plan becomes a better option.
| munk-a wrote:
| Usually the TTLs you'll set on these plans[1] are pretty
| short. It's helpful if you need to do a burst of a thousand
| queries of the same form over a short interval.
|
| 1. At least effectively - this is usually managed via
| persisting handles or plans themselves.
|
| 2. Which is usually a bad code smell from ORM overuse. Every
| time you dispatch a query you're paying latency overhead - so
| in an ideal world your report delivery will involve a single
| query against the database per report request (whether this
| is a website and those requests are page loads or you're
| running some internal system for business reporting).
| rjmunro wrote:
| I think it's going to be very rare that a database changes
| enough for a plan to be significantly worse than the optimal
| plan during the life of a prepared query.
|
| A database engine could easily mitigate against it by re-
| planning queries every 1000 runs of the query or have the
| query plan expire after 60 seconds or something. It might be
| worth trying to re-prepare the query from time to time as
| part of this kind of bulk insertion benchmarking.
|
| The other thing that would be good to try is closing the
| transaction every 1000 inserts or some other number. Doing
| several inserts in a single transaction is certainly better
| than 1 insert per transaction, but there may come a point
| where the transaction being too big starts to slow things
| down.
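|
| As a sketch, with generate_series standing in for real data and
| the table/columns made up:
|
|   BEGIN;
|   INSERT INTO weather (time, temperature)
|     SELECT now() - (i || ' hours')::interval, random() * 40
|     FROM generate_series(1, 1000) AS i;
|   COMMIT;
|   -- repeat per batch: one transaction per ~1000 rows rather
|   -- than one per row, or one giant transaction for everything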
| dboreham wrote:
| Quick note to observe that all of the above, while true, becomes
| less of a practical issue as CPUs become faster vs i/o, which
| they have done and probably will keep doing.
| forrestthewoods wrote:
| Wait what no stop that what are you talking about.
|
| Transforming data and then transforming it back will always
| be stupid and costly - in more ways than one.
| relaxing wrote:
| Not if you save transit time that can be more quickly made
| up in processing time.
| pphysch wrote:
| Feels like you're a couple decades behind. Single core speeds
| haven't gone up much in the last decade, especially for
| "server" CPU models.
| simiones wrote:
| Isn't the opposite mostly happening, with CPUs' single-core
| performance mostly flat for the last decade, while I/O speeds
| have been improving every year (especially in networking, not
| as sure about disk I/O)?
| davedx wrote:
| I disagree that the security is a side benefit. It's why most
| people choose to use parameterized queries.
| sigmonsays wrote:
| where can I download the weather data?
|
| Is it free or available if I sign up?
| roter wrote:
| If I read correctly, they are using the ERA5 dataset [0]. It is
| freely available and can be downloaded through a Python API
| called cdsapi.
|
| [0]
| https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
| jhoechtl wrote:
| For the German speakers among us: that should be
| 1 000 000 000 000 rows
|
| Million - million
|
| Milliarde - billion
|
| Billion - trillion
| aftbit wrote:
| Yes, this article is actually loading (order of magnitude)
| 10^12 rows.
|
| I was very surprised to learn about this the first time. In USA
| English:
|
| 10^3 -> thousand
|
| 10^6 -> million
|
| 10^9 -> billion
|
| 10^12 -> trillion
|
| 10^15 -> quadrillion
|
| 10^18 -> quintillion
|
| But there's another scale too! See this Wiki article for more
|
| https://en.wikipedia.org/wiki/Names_of_large_numbers
| samatman wrote:
| Unless you're reading historical documents, these words have
| only one meaning in the English language. It is in no sense
| American English specific.
|
| One advantage of the short scale is that its cadence is the
| same as that of SI units: million is mega, billion is giga,
| trillion is tera, and so on. The long scale of course has the
| -ard names, so this isn't exactly a problem, any more than
| French saying "four twenties and nineteen" for ninety-nine is
| a problem.
|
| The difference is one of the major reasons the SI scale names
| exist in the first place, in fact.
|
| It also matches the decimal separators used by everyone but
| Indians, which strikes me as more logical.
| aftbit wrote:
| Sorry, I wasn't intending to imply that they were American
| English specific, just that I only have experience with
| English as spoken in the USA.
|
| What's the advantage of the long scale? Just historical?
| hendiatris wrote:
| If you're going to work with weather data use a columnar
| database, like BigQuery. If you set things up right your
| performance will generally be a few seconds for aggregation
| queries. I set up a data platform like this at my previous
| company and we were able to vastly outperform our competitors
| at a much lower cost.
|
| The great thing about this data is it is generally append-only,
| unless errors are found in earlier data sets. But that's
| something that usually happens only once a year, if at all.
| sammy2255 wrote:
| Clickhouse will eat this for breakfast. And has built-in
| compression even at the column level
| tmiku wrote:
| > I think it would be cool to have historical weather data from
| around the world to analyze for signals of climate change we've
| already had rather than think about potential future change.
|
| This is a very good instinct! A pretty major portion of modern
| climate science is paleoclimatology, with a goal of reaching far
| beyond reliable human measurements. A lot of earth's previous
| climate states were wildly different from the range of conditions
| we have experienced in the past 10,000 years, and a better
| climate record is essential to predicting the effects of massive
| carbon emission.
|
| Ice cores from Antarctica/Greenland are the most famous instance
| of this, but there's a lot of other cool ones - there are
| chemical records of climate change in cave stalactites, ocean
| floor sediments, etc.
| randrus wrote:
| Somewhat relevant (which I posted a while back):
|
| https://climate.metoffice.cloud/
| roter wrote:
| I too use the ERA5 reanalysis data and I too need quick time
| series. As the data comes in [lat, lon] grids, stacked by
| whatever period you've chosen, e.g. [month of hourly data, lat,
| lon], it becomes a massive matrix transpose problem if you want
| 20+ years.
|
| What I do is download each netCDF file, transpose, and insert
| into a massive 3D HDF file organized as [lat, lon, hour]. On my
| workstation it takes about 30 minutes to create one year for one
| variable (no parallel I/O or processes) but then takes
| milliseconds to pull a single (lat, lon) location. Initial pain
| for long-term gain. Simplistic, but I'm just a climatologist not
| a database guru.
| tonymet wrote:
| I encourage people to look into the ERA5 dataset provenance
| especially when you approach the observations made toward the
| "pre-industrial" date of 1850.
|
| Remember that modern global surface temperatures are collected by
| satellites, and the dataset is commingled with recordings observed
| visually & made by hand using buckets by sailors who were not
| primarily academic researchers. Segments of high-resolution, low-
| noise data (satellites) are mixed with low-resolution, low-
| coverage, high-noise records (hand records made on boats around
| the United Kingdom).
|
| My point is to be in awe of the technical aspects of this effort
| but also keep in mind that we are only making copies of low
| resolution, noisy manuscripts from sailors 170 years ago.
| relaxing wrote:
| Ok? And what's your point in pointing out that?
| jamesgresql wrote:
| This is super cool! I run DevRel @ Timescale, and I love seeing
| our community create well written posts like this!
|
| My initial reaction is that I think one of the reasons you're
| seeing a hypertable being slower is almost certainly that it
| creates an index on the timestamp column by default. You don't
| have an index on your standard table which lets it go faster.
|
| You can use create_hypertable with create_default_indexes=>false
| to skip creating the index, or you can just drop the index before
| you ingest data. You'll eventually want that index - but it's
| best created after ingestion in a one-shot load like this.
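|
| Roughly (table and column names here are placeholders):
|
|   SELECT create_hypertable('weather', 'time',
|     create_default_indexes => false);
|   -- ... bulk load ...
|   CREATE INDEX ON weather (time DESC);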
|
| I'd also be interested in how the HDD you're reading data from is
| holding up in some of the highly parallel setups?
| PolarizedPoutin wrote:
| Thank you for reading and for your kind words!
|
| Ah I did not know about the `create_default_indexes=>false` and
| that a time index is created by default for hypertables. I'll
| add a note to explain this! Also curious to benchmark inserting
| without the time index then creating it manually.
|
| Even with 32 workers I think the HDD was fine. I did monitor
| disk usage through btop and the SSD that Postgres lived on
| seemed to be more of a bottleneck than the HDD. So my
| conclusion was that a faster SSD for Postgres would be a better
| investment than moving the data from HDD to SSD.
| smellybigbelly wrote:
| Can anyone give some advice on how they run TimeScale in
| Kubernetes? I'm seeing they dropped support for their Helm chart.
| jamesgresql wrote:
| Hi, we updated our docs with the best options.
|
| https://docs.timescale.com/self-hosted/latest/install/instal...
|
| I'd personally recommend StackGres, it's a great project.
| counters wrote:
| Why?
|
| Most weather and climate datasets - including ERA5 - are highly
| structured on regular latitude-longitude grids. Even if you were
| solely doing timeseries analyses for specific locations plucked
| from this grid, the strength of this sort of dataset is its
| intrinsic spatiotemporal structure and context, and it makes very
| little sense to completely destroy the dataset's structure unless
| you were solely and exclusively to extract point timeseries. And
| even then, you'd probably want to decimate the data pretty
| dramatically, since there is very little use case for, say, a
| point timeseries of surface temperature in the middle of the
| ocean!
|
| The vast majority of research and operational applications of
| datasets like ERA5 are probably better suited by leveraging
| cloud-optimized replicas of the original dataset, such as ARCO-
| ERA5 published on the Google Public Datasets program [1]. These
| versions of the dataset preserve the original structure, and
| chunk it in ways that are amenable to massively parallel access
| via cloud storage. In almost any case I've encountered in my
| career, a generically chunked Zarr-based archive of a dataset
| like this will be more than performant enough for the majority of
| use cases that one might care about.
|
| [1]: https://cloud.google.com/storage/docs/public-datasets/era5
| rabernat wrote:
| True, but in fact, the Google ERA5 public data suffers from the
| exact chunking problem described in the post: it's optimized
| for spatial queries, not timeseries queries. I just ran a
| benchmark, and it took me 20 minutes to pull a timeseries of a
| single variable at a single point!
|
| This highlights the needs for timeseries-optimized chunking if
| that is your anticipated usage pattern.
| orhmeh09 wrote:
| That might be nice if someone would do it and teach others to
| use it. Some labs have an RDBMS-based pipeline with published
| algorithms and data that nobody wants to try to reimplement
| (and which nobody would be paid to do). About the best
| improvement we could get was moving from an ancient version of
| MySQL to Postgres + PostGIS. I think Timescale would have
| helped. There were other reasons also to run locally due to
| privacy, cluster access, funds etc.
| PolarizedPoutin wrote:
| The main reason why was that it's a personal project and I
| wanted to do everything on my home server so that I wouldn't
| have to pay for cloud resources, and so that I could learn
| Postgres, TimescaleDB, and eventually PostGIS.
|
| But as rabernat pointed out in his comment, pulling out a long
| time series from the cloud replica is also slow. And I know I
| eventually want to perform complex spatio-temporal queries,
| e.g. computing the 99th percentile of summer temperatures in
| Chile from 1940-1980.
|
| I don't doubt that a cloud replica can be faster, but it's at
| odds with my budget of $0 haha.
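|
| For reference, the kind of query I have in mind looks roughly
| like this (the table/column names and the Chile geometry lookup
| are placeholders):
|
|   SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY temperature)
|   FROM weather
|   WHERE time >= '1940-01-01' AND time < '1981-01-01'
|     AND extract(month FROM time) IN (12, 1, 2)  -- austral summer
|     AND ST_Within(geom, (SELECT geom FROM countries
|                          WHERE name = 'Chile'));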
| roter wrote:
| I too need to do percentiles. One option is to loop through the
| grids and bin/histogram it. You'll get a really good 99th
| percentile from a 1 Kelvin bin width.
|
| Also, I've found the diurnal profile from ERA5 analysis can
| be abysmal in some locations. ERA5-Land is much better, high
| resolution, though only available over... er... land.
|
| To your point about not relying on cloud. Noted in the Google
| option [1] link above:
|
| > Update Frequency: The ERA5 dataset is currently not
| refreshed in the Google Cloud Public Dataset Program. The
| program provides ERA5 data spanning from 1940 to May 2023.
|
| Another alternative, Amazon [2], also deprecated:
|
| > The provider of this dataset will no longer maintain this
| dataset. We are open to talking with anyone else who might be
| willing to provide this dataset to the community.
|
| [2] https://registry.opendata.aws/ecmwf-era5/
| rabernat wrote:
| Great post! Hi Ali!
|
| I think what's missing here is an analysis of what is gained by
| moving the weather data into a RDBMS. The motivation is to speed
| up queries. But what's the baseline?
|
| As someone very familiar with this tech landscape (maintainer of
| Xarray and Zarr, founder of https://earthmover.io/), I know that
| serverless solutions + object storage can deliver very low
| latency performance (sub second) for timeseries queries on
| weather data--much faster than the 30 minutes cited here--_if_
| the data are chunked appropriately in Zarr. Given the difficulty
| of data ingestion described in this post, it's worth seriously
| evaluating those solutions before going down the RDBMS path.
| ohmahjong wrote:
| This is a bit off-topic but I'm interested in the same space
| you are in.
|
| There seems to be an inherent pull between large chunks (great
| for visualising large extents and larger queries) vs smaller
| chunks for point-based or timeseries queries. It's possible but
| not very cost-effective to maintain separately-chunked versions
| of these large geospatial datasets. I have heard of "kerchunk"
| being used to try and get the best of both, but then I _think_
| you lose out on the option of compressing the data and it
| introduces quite a lot of complexity.
|
| What are your thoughts on how to strike that balance between
| use cases?
| rabernat wrote:
| > It's possible but not very cost-effective to maintain
| separately-chunked versions of these large geospatial
| datasets.
|
| Like all things in tech, it's about tradeoffs. S3 storage
| costs about $275 per TB per year. Typical weather datasets are
| ~10 TB. If you're running a business that uses weather data in
| operations to make money, you could easily afford to make 2-3
| copies that are optimized for different query patterns. We
| see many teams doing this today in production. That's still
| much cheaper (and more flexible) than putting the same volume
| of data in a RDBMS, given the relative cost of S3 vs.
| persistent disks.
|
| The real hidden costs of all of these solutions is the
| developer time operating the data pipelines for the
| transformation.
| PolarizedPoutin wrote:
| Hey Ryan and thank you for the feedback!
|
| I agree that storing the data in appropriately chunked Zarr
| files is almost surely going to be faster, simpler to set up,
| and take up less space. Could even put up an API in front of it
| to get "queries".
|
| I also agree that I haven't motivated the RDBMS approach much.
| This is mainly because I took this approach with Postgres +
| Timescale since I wanted to learn to work with them, and
| playing around with ERA5 data seemed like the most fun way.
| Maybe it's the allure of weather data being big enough to pose
| a challenge here.
|
| I don't have anything to back this up but I wonder if the RDBMS
| approach, with properly tuned and indexed TimescaleDB + PostGIS
| (non-trivial to set up), can speed up complex spatio-temporal
| queries, e.g. computing the 99th percentile of summer
| temperatures in Chile from 1940-1980, in case many different
| Zarr chunks have to be read to find this data. I like the idea
| of setting up different tables to cache these kinds of
| statistics, but it's not that hard to do with Zarr either.
|
| I'm benchmarking queries and indexes next so I might know more
| then!
| koliber wrote:
| Curious if he could squeeze more performance by using a different
| structure to store the same data. Some of these float4 cols could
| probably be stored as int2. Depending on how many decimal places
| are needed, you can divide the int to get the resulting
| floating-point value.
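|
| For instance (a hypothetical schema storing hundredths of a
| degree):
|
|   CREATE TABLE weather_compact (
|     time       timestamptz NOT NULL,
|     temp_centi smallint    NOT NULL  -- float4 value * 100
|   );
|
|   -- divide on the way out to recover a float
|   SELECT time, temp_centi / 100.0 AS temperature
|   FROM weather_compact;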
| postgresperf wrote:
| Contributor to the PG bulk loading docs you referenced here. Good
| survey of the techniques here. I've done a good bit of this
| trying to speed up loading the Open Street Map database.
| Presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for
| my last public update. Since then the advance of hardware, GIS
| improvements in PG15, and osm2pgsql adopting their middle-way-
| node-index-id-shift technique (makes the largest but rarely used
| index 1/32 the size) have gotten my times to load the planet set
| below 4 hours.
|
| One suggestion aimed at the author here: some of your experiments
| are taking out WAL writing in a sort of indirect way, using
| pg_bulkload and COPY. There's one thing you could try that wasn't
| documented yet when my buddy Craig Ringer wrote the SO post you
| linked to: you can just turn off the WAL in the configuration.
| Yes, you will lose the tables in progress if there's a crash, and
| when things run for weeks those happen. With time scale data,
| it's not hard to structure the loading so you'll only lose the
| last chunk of work when that happens. WAL data isn't really
| necessary for bulk loading. Crash, clean up the right edge of the
| loaded data, start back up.
|
| Here's the full set of postgresql.conf settings I run to disable
| the WAL and other overhead:
|
|   wal_level = minimal
|   max_wal_senders = 0
|   synchronous_commit = off
|   fsync = off
|   full_page_writes = off
|   autovacuum = off
|   checkpoint_timeout = 60min
|
| Finally, when loading in big chunks, to keep the vacuum work down
| I'd normally turn off autovac as above then issue periodic VACUUM
| FREEZE commands running behind the currently loading date
| partition. (Talking normal PG here) That skips some work of the
| intermediate step the database normally frets about where new
| transactions are written but not visible to everyone yet.
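|
| For the VACUUM FREEZE part, that's just periodic commands like
| this against the previously loaded chunk (table name is a
| placeholder):
|
|   VACUUM (FREEZE, VERBOSE) weather_2023_12;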
___________________________________________________________________
(page generated 2024-04-16 23:00 UTC)