[HN Gopher] Loading a trillion rows of weather data into Timesca...
___________________________________________________________________
Loading a trillion rows of weather data into TimescaleDB
Author : PolarizedPoutin
Score : 292 points
Date : 2024-04-16 12:39 UTC (10 hours ago)
(HTM) web link (aliramadhan.me)
(TXT) w3m dump (aliramadhan.me)
| semiquaver wrote:
| Any idea why hypertable insert rates were slower? I thought
| hypertables were supposed to _increase_ insert rates?
| leros wrote:
| Isn't data inserted into basically a normal Postgres table with
| hypertable extensions? I don't know the details of Timescale
| but that sounds like it would incur a cost of a normal Postgres
| insert, plus potentially extra work at insert time, plus extra
| work in the background to manage the hypertable.
| rahkiin wrote:
| Not entirely. A hypertable is a postgres table chunked over
| time. There is the assumption that most data and queries are
| time-relevant, but also that older data is less relevant than
| new data.
|
| Indexes are per chunk. So if the query analyzer understands
| you only touch 2023 it can omit looking at any chunk that is
| from other years and keep those out of memory. Same with the
| indexes.
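|
| A rough sketch of what that looks like (table and column names
| here are made up):
|
|   -- chunks are created per time interval
|   SELECT create_hypertable('weather', 'time',
|     chunk_time_interval => INTERVAL '1 month');
|
|   -- the planner can exclude every chunk outside 2023
|   EXPLAIN SELECT avg(temperature) FROM weather
|   WHERE time >= '2023-01-01' AND time < '2024-01-01';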
| PolarizedPoutin wrote:
| Yeah I'm curious about this too. Been meaning to ask on the
| Timescale forums. My only guess is that there's some small
| extra overhead due to hypertable chunking.
|
| I know Timescale has a blog post from 2017 claiming a 20x
| higher insert rate but that's for inserting into a table with
| an index. The general wisdom for loading huge amounts of data
| seems to be that you should insert into a table with no indexes
| then build them later though. So with no index, inserting into
| a hypertable seems a bit slower.
|
| Timescale blog post: https://medium.com/timescale/timescaledb-
| vs-6a696248104e
| h4kor wrote:
| Timescale hypertables automatically have an index on the
| timestamp. To make this more comparable you could create the
| same index on the normal table and test the ingestion rate.
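|
| For example, something along these lines (assuming a plain table
| called weather_plain with the same time column):
|
|   -- hypertables get an index like this by default
|   CREATE INDEX ON weather_plain (time DESC);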
| perrygeo wrote:
| Hypertable insert rates are faster and more predictable _over
| time_. Each individual insert might incur a small bit of extra
| overhead, but they scale forever since each temporal chunk is
| indexed separately vs a regular table where the entire index
| needs to fit in memory. This is a case where you can't make
| meaningful inferences from micro-benchmarks (they tested 20k
| rows, you probably need 200M to start seeing the diff)
| PolarizedPoutin wrote:
| Thanks for the insight! It is true that I started with a
| micro-benchmark of 20k rows for slower inserts, but I also
| did some longer benchmarks with ~772 million rows.
| hyperman1 wrote:
| Two remarks with postgres and lots of data:
|
| 1) I always wonder if there is a better way than COPY. I tend to
| quickly get 100% CPU without saturating I/O
|
| 2) The row overhead seems big. A row has 23 bytes overhead, this
| table has 48 bytes data per row, so even without page overhead,
| we lose ~1/3 of our storage. This is pure data storage, without
| any index.
| PolarizedPoutin wrote:
| 1. Yeah to me it seems very hard to saturate I/O with Postgres
| unless maybe you insert into an unlogged table. I guess there's
| quite a bit of overhead to get all the nice
| stability/consistency and crash-resistance.
|
| 2. That is a good point. I'm hoping TimescaleDB's compression
| helps here but yeah I don't think you'll ever get the database
| size below the data's original footprint.
| feike wrote:
| Timescaler here, if you configure the
| timescaledb.compress_segmentby well, and the data suits the
| compression, you can achieve 20x or more compression.
|
| (On some metrics data internally, I have 98% reduction in
| size of the data).
|
| One of the reasons this works is due to only having to pay
| the per-tuple overhead once per grouped row, which could cover
| as many as 1000 rows.
|
| The other is the compression algorithm, which can be
| TimescaleDB's own or plain PostgreSQL TOAST:
|
| https://www.timescale.com/blog/time-series-compression-
| algor... https://www.postgresql.org/docs/current/storage-
| toast.html
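|
| As a rough sketch (table and column names are hypothetical):
|
|   ALTER TABLE weather SET (
|     timescaledb.compress,
|     timescaledb.compress_segmentby = 'station_id',
|     timescaledb.compress_orderby = 'time DESC'
|   );
|   -- compress chunks once they are a week old
|   SELECT add_compression_policy('weather', INTERVAL '7 days');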
| hyperman1 wrote:
| If I look into perf, it seems mostly parsing overhead. I can
| saturate I/O with a CREATE TABLE newtable AS SELECT FROM
| oldtable. Unfortunately, CSV still seems to be the lingua franca
| for transport between DBs. Maybe some day a more binary-oriented
| transport format will appear (e.g. Parquet?)
| feike wrote:
| Many libraries for python, Rust, golang support COPY
| BINARY.
|
| The times I've tested it, the improvement is very small
| compared to plain COPY or COPY with CSV, whereas it does
| require more work and thought upfront to ensure the binary
| actually works correctly.
|
| https://www.postgresql.org/docs/current/sql-copy.html
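|
| For reference, the server-side form is just a COPY option (the
| path, table, and columns here are placeholders):
|
|   COPY weather (time, latitude, longitude, temperature)
|   FROM '/tmp/weather.bin' WITH (FORMAT binary);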
| lawn wrote:
| What an interesting post!
|
| > At a sustained ~462k inserts per second, we're waiting ~20 days
| for our ~754 billion rows, which is not bad I guess. It's less
| time than it took me to write this post.
|
| Hah, as I've been gravitating more to writing larger and more
| in-depth blog posts, I can relate to the surprising effort it can
| require.
| PolarizedPoutin wrote:
| Thank you! Yeah haha some of the benchmarks took several hours
| (and a few re-runs) and there was a lot of learning done along
| the way.
| ZeroCool2u wrote:
| I've done a good amount of geospatial analysis for work.
|
| One thing you quickly realize with geospatial data is that it's
| incredibly nuanced. You have to be quite careful about
| understanding which coordinate reference system (CRS) and, for
| visualization, which projection is being used. The CRS is somewhat
| paranoia-inducing if you don't have great infrastructure set up
| with the right tools to carry that metadata with your geospatial
| data.
|
| I've tested everything AWS has to offer, Postgres/PostGIS,
| Spark/DB, Snowflake, Trino, and ArcGIS.
|
| I'm convinced the best tool for large scale geospatial work is
| Google BigQuery and it's not even close. A query that took
| multiple expensive hours running on PostGIS deployed on an
| enormous m6a EC2 instance ran in less than 5 seconds in the
| BigQuery free tier. It does make sense if you think about it:
| Google was early
| with Maps, but it is just stunning how much better they are in
| this specific niche domain.
|
| This was using publicly available FEMA data that Snowflake and
| AWS services would just choke on, because the geometry column
| exceeded their maximum byte size. Spark doesn't even have
| geospatial data types and the open source packages that add
| support leave a lot to be desired.
|
| This guy is running on-prem, so maybe this made sense, but I just
| would never bother. The storage for BQ would probably be less
| than $100/month for 20 TB.
| Cthulhu_ wrote:
| I hear so many good things about BigTable / BigQuery, it's a
| shame I've had no opportunity to use them yet.
| ingenieroariel wrote:
| I went through a similar phase with a pipeline that started
| from global OSM and Whosonfirst data. Google costs kept going
| up ($7k a month with Airflow + BigQuery) and I was able to
| replace it with a one-time $7k hardware purchase. We were able
| to do it since the process was using H3 indices early on and
| the resulting intermediate datasets all fit in RAM.
|
| The system is a Mac Studio with 128GB + Asahi Linux + mmapped
| Parquet files and DuckDB. It also runs Airflow for us, and with
| Nix it can be used to accelerate developer builds and run the
| Airflow tasks for the data team.
|
| GCP is nice when it is free/cheap but they keep tabs on what
| you are doing and may surprise you at any point in time with
| ever higher bills without higher usage.
| jfim wrote:
| BigQuery is nice but it's definitely a major foot-gun in
| terms of cost. It's surprisingly easy to rack up high costs
| with say a misconfigured dashboard or a developer just
| testing stuff.
| dekhn wrote:
| Frankly I think this is just a sign that it's a power tool
| for power users.
| carlhjerpe wrote:
| Sadly my colleagues aren't always "power users"
| brailsafe wrote:
| Nobody starts as a power user
| ZeroCool2u wrote:
| That is a very cool setup!
|
| My org would never allow that as we're in a highly regulated
| and security conscious space.
|
| Totally agree about the BQ costs. The free tier is great and
| I think pretty generous, but if you're not very careful with
| enforcing table creation only with partitioning and
| clustering as much as possible, and don't enforce some
| training for devs on how to deal with columnar DB's if
| they're not familiar, the bills can get pretty crazy quickly.
| hawk_ wrote:
| > and may surprise you at any point in time with ever higher
| bills without higher usage.
|
| What? really? Do they change your pricing plan? How can they
| charge more for the same usage?
| PolarizedPoutin wrote:
| Thank you for the insights! Yeah I'm still not sure how
| Postgres/PostGIS will scale for me, but good to know that
| BigQuery does this nicely.
|
| This is not something I'm productionizing (at least not yet?)
| and I'm giving myself zero budget since it's a side project,
| but if/when I do I'll definitely look into BigQuery!
| ZeroCool2u wrote:
| Good luck! They have some really great tutorials on how to
| get started with BQ and geospatial data. One other nuance of
| BigQuery that doesn't seem to apply to many other tools in
| this space is that you can enable partitioning on your tables
| in addition to clustering on the geometry (Geography in BQ)
| column.
|
| https://cloud.google.com/bigquery/docs/geospatial-
| data#parti...
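|
| A minimal sketch of that pattern (dataset, table, and column
| names are made up):
|
|   CREATE TABLE my_dataset.flood_zones (
|     updated_at TIMESTAMP,
|     zone_id    STRING,
|     geom       GEOGRAPHY
|   )
|   PARTITION BY DATE(updated_at)
|   CLUSTER BY geom;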
| xyzzy_plugh wrote:
| Do you mind linking the specific dataset? I agree very wide
| columns break a lot of tools but other columnar postgres forks
| should support this no problem. It sounds like you didn't use
| Redshift, which I find surprising as it directly competes with
| BQ. Redshift has "super" columns that can be very large, even
| larger than the maximum supported by BigQuery.
|
| I constantly see folks finding out the hard way that PostGIS is
| _really_ hard to beat. The fact that Trino/Presto and Spark
| have languished here is particularly telling.
| ZeroCool2u wrote:
| It's FEMA's NFHL. I can't recall the specific layer of the
| GDB file, but you could probably figure it out. Try loading
| up Iowa into Redshift, and if that works for you I'd be quite
| surprised.
|
| My org has a very large AWS spend and we got to have a chat
| with some of their SWE's that work on the geospatial
| processing features for Redshift and Athena. We described
| what we needed and they said our only option was to aggregate
| the data first or drop the offending rows. Obviously we're
| not interested in compromising our work just to use a
| specific tool, so we opted for better tools.
|
| The crux of the issue was that the large problem column was
| the geometry itself. Specifically, MultiPolygon. You need to
| use the geometry datatype for this[1]. However, our
| MultiPolygon column was 10's to 100's of MB's. Well outside
| the max size for the Super datatype from what I can tell as
| it looks like that's 16 MB.
|
| [1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryT
| ype-...
| beeboobaa3 wrote:
| Just split it out into multiple polygons, one per row. If
| you're using a relational database, do as relational
| databases do.
| winrid wrote:
| M6a is not even remotely enormous. Also were you using EBS?
| ZeroCool2u wrote:
| m6a.48xlarge: 192 vCPU & 768 GiB
|
| If that's not a large node for you, well you're living in a
| different world from me. Yes to using EBS.
| winrid wrote:
| Thanks for the info. The issue is using EBS. If you used an
| instance with NVMe drives it would probably have been
| faster than BQ (and you aren't billed per-query...). I
| would suggest an R or I4 class instance for this; m6 is not
| good for the money here.
|
| You would just have to set up replication for backups, but
| this could just be rsync to EBS or some other replication
| solution depending on your database.
| sgarland wrote:
| Unless you're CPU-bound (unlikely), the r-family is usually
| a better fit for RDBMS, IME. 8x RAM:vCPU ratio, compared to
| 4x for m-family.
|
| Then there's the x-family, which can go up to bonkers
| levels, like 4 TiB of RAM (and local NVMe).
|
| As a sibling comment mentioned, though, if you can fit the
| data into local storage, that's the way to do it. EBS
| latency, even with io2.blockexpress, simply cannot compete.
|
| That said, if I did the math correctly based on the 71
| GB/31 days footnote, you're looking at about 2.15 PB to
| load the entire thing, so, uh, good luck.
| ayewo wrote:
| In your experience, is the r-family better than the
| c-family for running an RDBMS?
| aPoCoMiLogin wrote:
| Things have been changing recently in AWS, but a few years
| ago R-family instances had one of the fastest uplinks to EBS.
| For example, only the largest M4 instance (m4.16xlarge) has a
| 10 Gbps uplink, versus R5b, where it starts from 10 Gbps (for
| the lowest tier) and goes up to 60 Gbps @ 260k IOPS. You can
| very easily choke EBS with a DB.
|
| EDIT: only newer C instances have a comparable uplink to
| EBS; C5 and C4 (and some C6) start from ~4.7 Gbps. Just
| compare the EBS bandwidth column in
| https://aws.amazon.com/ec2/instance-types/
| sgarland wrote:
| The other fun thing about AWS instances is that the
| network uplink speed isn't always what's advertised.
| There is of course the "Up to X Gbps" levels (30 minutes
| of rated speed guaranteed every 24 hours), but there are
| also other limits, like cross-region speed being capped
| at 50% of rated capacity.
|
| This rarely matters until it does, like if you're setting
| up an active-active DB across regions. Then it's a fun
| surprise.
| sgarland wrote:
| Yes. Most RDBMS are memory-bound (indexes, connection
| overhead, buffer pool...), so the more, the better.
|
| At my last job, I switched a large-ish (100K QPS at peak)
| m5 MySQL instance to an r6i that was cheaper, despite
| having the same amount of RAM and being a generation newer.
| That, combined with careful tuning and testing, resulted
| in queries speeding up 20-40%, AND we then had plenty of
| room for vertical scaling again if necessary.
| detourdog wrote:
| I'm glad to hear this firsthand experience. I'm pretty sure
| that I want to build and refine my own geospatial data hoard.
|
| I wonder if you think a long-term project is better off rolling
| its own. The second priority is that I expect it all to be
| locally hosted.
|
| Thanks for any considerations.
| ZeroCool2u wrote:
| Frankly I think for long term hoarding BQ is hard to beat.
| The storage costs are pretty reasonable and you never pay for
| compute until you actually run a query, so if you're mostly
| just hoarding, well, you're probably going to save a lot of
| time, money, and effort in the long run.
| rtkwe wrote:
| Even more fun I bet would be if you're getting data with
| different reference spheroids.
| johnymontana wrote:
| > Spark doesn't even have geospatial data types and the open
| source packages that add support leave a lot to be desired.
|
| Could you say more about this? I'm curious if you've compared
| Apache Sedona [0] and what specifically you found lacking? I
| currently work at Wherobots [1], founded by the creators of
| Apache Sedona and would love to hear any feedback.
|
| [0] https://sedona.apache.org/latest/
|
| [1] https://wherobots.com/
| rkwasny wrote:
| Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of
| NOAA climate data on my desktop that I query when I want to go on
| holidays ;-)
| anentropic wrote:
| Do you say that because it's quicker to insert large batches of
| rows into Clickhouse, or because it's better in other ways?
|
| (I'm currently inserting large batches of rows into MySQL and
| curious about Clickhouse...)
| rkwasny wrote:
| It's better in insert speed, query speed and used disk
| storage.
| PolarizedPoutin wrote:
| Haha very cool use! Yeah reading up on TimescaleDB vs.
| Clickhouse it seems like columnar storage and Clickhouse will
| be faster and better compress the time series data. For now I'm
| sticking to TimescaleDB to learn Postgres and PostGIS, but
| might make a TimescaleDB vs. Clickhouse comparison when I
| switch!
| rkwasny wrote:
| I can replicate your benchmark when I get a moment. As for the
| data, is it free to share if I wanted to make an open browser?
|
| I have a feeling the general public has very limited access to
| weather data, and graphs in the news that state "are we getting
| hotter or colder" are all sensational and hard to verify.
| cevian wrote:
| Please note that TimescaleDB also uses columnar storage for
| its compressed data.
|
| Disclosure: I am a TimescaleDB engineer.
| dangoodmanUT wrote:
| Agreed, clickhouse is faster and has better features for this
| RyanHamilton wrote:
| I agree. Clickhouse is awesomely powerful and fast. I
| maintain a list of benchmarks, if you know of any speed
| comparisons please let me know and I will add them to the
| list: https://www.timestored.com/data/time-series-database-
| benchma...
| lyapunova wrote:
| Thanks for maintaining benchmarks here. Is there a github
| repo that might accompany the benchmarks that I could take
| a look at / reproduce?
| gonzo41 wrote:
| So ClickHouse is a columnar DB. Any thoughts on whether the
| performance would be a wash if you just pivoted the Timescale
| hypertable and indexed the time + column on Timescale?
| mfreed wrote:
| Our experience is that Clickhouse and Timescale are designed
| for different workloads, and that Timescale is optimized for
| many of the time-series workloads people use in production:
|
| - https://www.timescale.com/blog/what-is-clickhouse-how-
| does-i...
|
| Sidenote: Timescale _does_ provide columnar storage. I don't
| believe that the blog author focused on this as part of insert
| benchmarks:
|
| - Timescale columnar storage:
| https://www.timescale.com/blog/building-columnar-compression...
|
| - Timescale query vectorization:
| https://www.timescale.com/blog/teaching-postgres-new-tricks-...
| rkwasny wrote:
| Well, as a Co-founder and CTO of Timescale, would you say
| TimescaleDB is a good fit for storing weather data as OP
| does?
| mfreed wrote:
| TimescaleDB primarily serves operational use cases:
| Developers building products on top of live data, where you
| are regularly streaming in fresh data, and you often know
| what many queries look like a priori, because those are
| powering your live APIs, dashboards, and product
| experience.
|
| That's different from a data warehouse or many traditional
| "OLAP" use cases, where you might dump a big dataset
| statically, and then people will occasionally do ad-hoc
| queries against it. This is the big weather dataset file
| sitting on your desktop that you occasionally query while
| on holidays.
|
| So it's less about "can you store weather data", but what
| does that use case look like? How are the queries shaped?
| Are you saving a single dataset for ad-hoc queries across
| the entire dataset, or continuously streaming in new data,
| and aging out or de-prioritizing old data?
|
| In most of the products we serve, customers are often
| interested in recent data in a very granular format
| ("shallow and wide"), or longer historical queries along a
| well defined axis ("deep and narrow").
|
| For example, this is where the benefits of TimescaleDB's
| segmented columnar compression emerges. It optimizes for
| those queries which are very common in your application,
| e.g., an IoT application that groups by or selects by
| deviceID, crypto/fintech analysis based on the ticker
| symbol, product analytics based on tenantID, etc.
|
| If you look at ClickBench, what most of the queries do is
| scan ALL the data in your database and GROUP BY one
| of the 100 columns in the web analytics logs.
|
| - https://github.com/ClickHouse/ClickBench/blob/main/clickh
| ous...
|
| There are almost no time-predicates in the benchmark that
| Clickhouse created, but perhaps that is not surprising
| given it was designed for ad-hoc weblog analytics at
| Yandex.
|
| So yes, Timescale serves many products today that use
| weather data, but has made different choices than
| Clickhouse (or things like DuckDB, pg_analytics, etc) to
| serve those more operational use cases.
| carderne wrote:
| Hey OP (assuming you're the author), you might be interested in
| this similar experiment I did about four years ago, same dataset,
| same target, similar goal!
|
| https://rdrn.me/optimising-sql/
|
| Similar sequence of investigations, but using regular Postgres
| rather than Timescale. With my setup I got another ~3x speedup
| over COPY by copying binary data directly (assuming your data is
| already in memory).
| PolarizedPoutin wrote:
| Wish I saw this before I started haha! I left a footnote about
| why I didn't try binary copy (basically someone else found its
| performance disappointing) but it sounds like I should give it
| a try.
|
| footnote: https://aliramadhan.me/2024/03/31/trillion-
| rows.html#fn:copy...
| carderne wrote:
| Yeah I imagine it depends where the data is coming from and
| what exactly it looks like (num fields, dtypes...?). What I
| did was source data -> Numpy Structured Array [0] -> Postgres
| binary [1]. Bit of a pain getting it into the required shape,
| but if you follow the links the code should get you going
| (sorry no type hints!).
|
| [0] https://rdrn.me/optimising-sampling/#round-10-off-the-
| deep-e... [1] In the original blog I linked.
| anentropic wrote:
| I'd love to hear from anyone who's done the same in MySQL
| dunefox wrote:
| OT: Does anyone know if DuckDB would be of use here?
| wiredfool wrote:
| Not likely, unless it's a set of parquet files already.
| Clickhouse would be a better bet.
| RyanHamilton wrote:
| If you want to plot time-series charts or many other charts
| directly from sql queries, qStudio is a free SQL IDE and works
| with everything including TimescaleDB:
| https://www.timestored.com/qstudio/database/timescale Disclaimer,
| I am the author.
| ayewo wrote:
| What's the process for adding support for other databases to
| your tool qStudio?
|
| I'm thinking perhaps you could add support for Timeplus [1]?
| Timeplus is a streaming-first database built on ClickHouse. The
| core DB engine Timeplus Proton is open source [2].
|
| It seems that qStudio is open source [3] and written in Java
| and will need a JDBC driver to add support for a new RDBMS? If
| yes, Timeplus Proton has an open source JDBC driver [4] based
| on ClickHouse's driver but with modifications added for
| streaming use cases.
|
| 1: https://www.timeplus.com/
|
| 2: https://github.com/timeplus-io/proton
|
| 3: https://github.com/timeseries/qstudio
|
| 4: https://github.com/timeplus-io/proton-java-driver
| rjmunro wrote:
| This annoys me:
|
| > You can use parameterized queries to protect against SQL
| injection,
|
| No, that's not what parameterized queries are for. That's just a
| side benefit.
|
| Parameterized queries are so that the database doesn't have to
| parse the data from SQL to do the query. It's stupid to turn all
| the data to text, send the text over the wire, escaped and quoted
| as appropriate, then parse it back to whatever form it was in
| originally when you can just send it binary straight to the
| database.
|
| And if you are doing many similar queries with different
| parameters, e.g. inserts as here, or maybe querying the user
| table by user id on every page load, the database doesn't have
| to parse any SQL each
| time and can even reuse the query plan.
|
| This may be why psycopg3 performed better than pandas df.to_sql()
| function in the single row insert case.
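|
| For illustration, the server-side equivalent of a parameterized,
| reusable insert (table and columns are made up):
|
|   PREPARE ins (timestamptz, real) AS
|     INSERT INTO weather (time, temperature) VALUES ($1, $2);
|
|   -- parsed and planned once, then executed many times
|   EXECUTE ins ('2024-01-01 00:00+00', 3.7);
|   EXECUTE ins ('2024-01-01 01:00+00', 3.5);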
| PolarizedPoutin wrote:
| Thank you for reading through it thoroughly and pointing this
| out! I'm still new and learning Postgres so this is good to
| know. I will update the post.
| munk-a wrote:
| I would note that I think the above is a rather minority
| opinion - while parameterized queries are great for reuse
| simply using them to guard against SQL injection is still an
| excellent use of them. If your query is reusable then go for
| it, but most complex queries tend not to be reusable if they
| involve optional inputs.
| eddd-ddde wrote:
| Can it actually reuse the query plan? Couldn't that lead to a
| worse performing plan eventually? Say after inserting lots of
| data such that a different plan becomes a better option.
| munk-a wrote:
| Usually the TTLs you'll set on these plans[1] are pretty
| short. It's helpful if you need to do a burst of a thousand
| queries of the same form over a short interval.
|
| 1. At least effectively - this is usually managed via
| persisting handles or plans themselves.
|
| 2. Which is usually a bad code smell from ORM overuse. Every
| time you dispatch a query you're paying latency overhead - so
| in an ideal world your report delivery will involve a single
| query against the database per report request (whether this
| is a website and those requests are page loads or you're
| running some internal system for business reporting).
| rjmunro wrote:
| I think it's going to be very rare that a database changes
| enough for a plan to be significantly worse than the optimal
| plan during the life of a prepared query.
|
| A database engine could easily mitigate against it by re-
| planning queries every 1000 runs of the query or have the
| query plan expire after 60 seconds or something. It might be
| worth trying to re-prepare the query from time to time as
| part of this kind of bulk insertion benchmarking.
|
| The other thing that would be good to try is closing the
| transaction every 1000 inserts or some other number. Doing
| several inserts in a single transaction is certainly better
| than 1 insert per transaction, but there may come a point
| where the transaction being too big starts to slow things
| down.
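|
| As a sketch, with generate_series standing in for real data and
| the table/columns made up:
|
|   BEGIN;
|   INSERT INTO weather (time, temperature)
|     SELECT now() - (i || ' hours')::interval, random() * 40
|     FROM generate_series(1, 1000) AS i;
|   COMMIT;
|   -- repeat per batch: one transaction per ~1000 rows rather
|   -- than one per row, or one giant transaction for everything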
| dboreham wrote:
| Quick note to observe that all of the above, while true, becomes
| less of a practical issue as CPUs become faster vs i/o, which
| they have done and probably will keep doing.
| forrestthewoods wrote:
| Wait what no stop that what are you talking about.
|
| Transforming data and then transforming it back will always
| be stupid and costly - in more ways than one.
| relaxing wrote:
| Not if you save transit time that can be more quickly made
| up in processing time.
| pphysch wrote:
| Feels like you're a couple decades behind. Single core speeds
| haven't gone up much in the last decade, especially for
| "server" CPU models.
| simiones wrote:
| Isn't the opposite mostly happening, with CPUs' single-core
| performance mostly flat for the last decade, while I/O speeds
| have been improving every year (especially in networking, not
| as sure about disk I/O)?
| davedx wrote:
| I disagree that the security is a side benefit. It's why most
| people choose to use parameterized queries.
| sigmonsays wrote:
| where can I download the weather data?
|
| Is it free or available if I sign up?
| roter wrote:
| If I read correctly, they are using the ERA5 dataset [0]. It is
| freely available and can be downloaded through a Python API
| called cdsapi.
|
| [0]
| https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysi...
| jhoechtl wrote:
| For the German speakers among us: that should be
| 1 000 000 000 000 rows
|
| Million - million
|
| Milliarde - billion
|
| Billion - trillion
| aftbit wrote:
| Yes, this article is actually loading (order of magnitude)
| 10^12 rows.
|
| I was very surprised to learn about this the first time. In USA
| English:
|
| 10^3 -> thousand
|
| 10^6 -> million
|
| 10^9 -> billion
|
| 10^12 -> trillion
|
| 10^15 -> quadrillion
|
| 10^18 -> quintillion
|
| But there's another scale too! See this Wiki article for more
|
| https://en.wikipedia.org/wiki/Names_of_large_numbers
| samatman wrote:
| Unless you're reading historical documents, these words have
| only one meaning in the English language. It is in no sense
| American English specific.
|
| One advantage of the short scale is that its cadence is the
| same as that of SI units: million is mega, billion is giga,
| trillion is tera, and so on. The long scale of course has the
| -ard names, so this isn't exactly a problem, any more than
| French saying "four twenties and nineteen" for ninety-nine is
| a problem.
|
| The difference is one of the major reasons the SI scale names
| exist in the first place, in fact.
|
| It also matches the decimal separators used by everyone but
| Indians, which strikes me as more logical.
| aftbit wrote:
| Sorry, I wasn't intending to imply that they were American
| English specific, just that I only have experience with
| English as spoken in the USA.
|
| What's the advantage of the long scale? Just historical?
| hendiatris wrote:
| If you're going to work with weather data use a columnar
| database, like BigQuery. If you set things up right your
| performance will generally be a few seconds for aggregation
| queries. I set up a data platform like this at my previous
| company and we were able to vastly outperform our competitors
| at a much lower cost.
|
| The great thing about this data is it is generally append-only,
| unless errors are found in earlier data sets. But that's
| something that usually happens only once a year, if at all.
| sammy2255 wrote:
| Clickhouse will eat this for breakfast. And has built-in
| compression even at the column level
| tmiku wrote:
| > I think it would be cool to have historical weather data from
| around the world to analyze for signals of climate change we've
| already had rather than think about potential future change.
|
| This is a very good instinct! A pretty major portion of modern
| climate science is paleoclimatology, with a goal of reaching far
| beyond reliable human measurements. A lot of earth's previous
| climate states were wildly different from the range of conditions
| we have experienced in the past 10,000 years, and a better
| climate record is essential to predicting the effects of massive
| carbon emission.
|
| Ice cores from Antarctica/Greenland are the most famous instance
| of this, but there's a lot of other cool ones - there are
| chemical records of climate change in cave stalactites, ocean
| floor sediments, etc.
| randrus wrote:
| Somewhat relevant (which I posted a while back):
|
| https://climate.metoffice.cloud/
| roter wrote:
| I too use the ERA5 reanalysis data and I too need quick time
| series. As the data comes in [lat, lon] grids, stacked by
| whatever period you've chosen, e.g. [month of hourly data, lat,
| lon], it becomes a massive matrix transpose problem if you want
| 20+ years.
|
| What I do is download each netCDF file, transpose, and insert
| into a massive 3D HDF file organized as [lat, lon, hour]. On my
| workstation it takes about 30 minutes to create one year for one
| variable (no parallel I/O or processes) but then takes
| milliseconds to pull a single (lat, lon) location. Initial pain
| for long-term gain. Simplistic, but I'm just a climatologist not
| a database guru.
| tonymet wrote:
| I encourage people to look into the ERA5 dataset provenance
| especially when you approach the observations made toward the
| "pre-industrial" date of 1850.
|
| Remember that modern global surface temperatures are collected by
| satellites, and the dataset is commingled with recordings observed
| visually & made by hand using buckets by sailors who were not
| primarily academic researchers. Segments of high-resolution, low-
| noise data (satellites) are mixed with low-resolution, low-
| coverage, high-noise records (hand records made on boats around
| the United Kingdom).
|
| My point is to be in awe of the technical aspects of this effort
| but also keep in mind that we are only making copies of low
| resolution, noisy manuscripts from sailors 170 years ago.
| relaxing wrote:
| Ok? And what's your point in pointing out that?
| jamesgresql wrote:
| This is super cool! I run DevRel @ Timescale, and I love seeing
| our community create well written posts like this!
|
| My initial reaction is that I think one of the reasons you're
| seeing a hypertable being slower is almost certainly that it
| creates an index on the timestamp column by default. You don't
| have an index on your standard table which lets it go faster.
|
| You can use create_hypertable with create_default_indexes=>false
| to skip creating the index, or you can just drop the index before
| you ingest data. You'll eventually want that index - but it's
| best created after ingestion in a one-shot load like this.
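|
| Roughly (table and column names here are placeholders):
|
|   SELECT create_hypertable('weather', 'time',
|     create_default_indexes => false);
|   -- ... bulk load ...
|   CREATE INDEX ON weather (time DESC);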
|
| I'd also be interested in how the HDD you're reading data from is
| holding up in some of the highly parallel setups?
| PolarizedPoutin wrote:
| Thank you for reading and for your kind words!
|
| Ah I did not know about the `create_default_indexes=>false` and
| that a time index is created by default for hypertables. I'll
| add a note to explain this! Also curious to benchmark inserting
| without the time index then creating it manually.
|
| Even with 32 workers I think the HDD was fine. I did monitor
| disk usage through btop and the SSD that Postgres lived on
| seemed to be more of a bottleneck than the HDD. So my
| conclusion was that a faster SSD for Postgres would be a better
| investment than moving the data from HDD to SSD.
| smellybigbelly wrote:
| Can anyone give some advice on how they run TimeScale in
| Kubernetes? I'm seeing they dropped support for their Helm chart.
| jamesgresql wrote:
| Hi, we updated our docs with the best options.
|
| https://docs.timescale.com/self-hosted/latest/install/instal...
|
| I'd personally recommend StackGres, it's a great project.
| counters wrote:
| Why?
|
| Most weather and climate datasets - including ERA5 - are highly
| structured on regular latitude-longitude grids. Even if you were
| solely doing timeseries analyses for specific locations plucked
| from this grid, the strength of this sort of dataset is its
| intrinsic spatiotemporal structure and context, and it makes very
| little sense to completely destroy the dataset's structure unless
| you were solely and exclusively to extract point timeseries. And
| even then, you'd probably want to decimate the data pretty
| dramatically, since there is very little use case for, say, a
| point timeseries of surface temperature in the middle of the
| ocean!
|
| The vast majority of research and operational applications of
| datasets like ERA5 are probably better suited by leveraging
| cloud-optimized replicas of the original dataset, such as ARCO-
| ERA5 published on the Google Public Datasets program [1]. These
| versions of the dataset preserve the original structure, and
| chunk it in ways that are amenable to massively parallel access
| via cloud storage. In almost any case I've encountered in my
| career, a generically chunked Zarr-based archive of a dataset
| like this will be more than performant enough for the majority of
| use cases that one might care about.
|
| [1]: https://cloud.google.com/storage/docs/public-datasets/era5
| rabernat wrote:
| True, but in fact, the Google ERA5 public data suffers from the
| exact chunking problem described in the post: it's optimized
| for spatial queries, not timeseries queries. I just ran a
| benchmark, and it took me 20 minutes to pull a timeseries of a
| single variable at a single point!
|
| This highlights the needs for timeseries-optimized chunking if
| that is your anticipated usage pattern.
| orhmeh09 wrote:
| That might be nice if someone would do it and teach others to
| use it. Some labs have an RDBMS-based pipeline with published
| algorithms and data that nobody wants to try to reimplement
| (and which nobody would be paid to do). About the best
| improvement we could get was moving from an ancient version of
| MySQL to Postgres + PostGIS. I think Timescale would have
| helped. There were other reasons also to run locally due to
| privacy, cluster access, funds etc.
| PolarizedPoutin wrote:
| The main reason why was that it's a personal project and I
| wanted to do everything on my home server so that I wouldn't
| have to pay for cloud resources, and so that I could learn
| Postgres, TimescaleDB, and eventually PostGIS.
|
| But as rabernat pointed out in his comment, pulling out a long
| time series from the cloud replica is also slow. And I know I
| eventually want to perform complex spatio-temporal queries,
| e.g. computing the 99th percentile of summer temperatures in
| Chile from 1940-1980.
|
| I don't doubt that a cloud replica can be faster, but it's at
| odds with my budget of $0 haha.
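|
| For reference, the kind of query I have in mind looks roughly
| like this (the table/column names and the Chile geometry lookup
| are placeholders):
|
|   SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY temperature)
|   FROM weather
|   WHERE time >= '1940-01-01' AND time < '1981-01-01'
|     AND extract(month FROM time) IN (12, 1, 2)  -- austral summer
|     AND ST_Within(geom, (SELECT geom FROM countries
|                          WHERE name = 'Chile'));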
| roter wrote:
| I too need to do percentiles. One option is to loop through the
| grids and bin/histogram it. You'll get a really good 99th
| percentile from a 1 Kelvin bin width.
|
| Also, I've found the diurnal profile from ERA5 analysis can
| be abysmal in some locations. ERA5-Land is much better, high
| resolution, though only available over... er... land.
|
| To your point about not relying on cloud. Noted in the Google
| option [1] link above:
|
| > Update Frequency: The ERA5 dataset is currently not
| refreshed in the Google Cloud Public Dataset Program. The
| program provides ERA5 data spanning from 1940 to May 2023.
|
| Another alternative, Amazon [2], also deprecated:
|
| > The provider of this dataset will no longer maintain this
| dataset. We are open to talking with anyone else who might be
| willing to provide this dataset to the community.
|
| [2] https://registry.opendata.aws/ecmwf-era5/
| rabernat wrote:
| Great post! Hi Ali!
|
| I think what's missing here is an analysis of what is gained by
| moving the weather data into a RDBMS. The motivation is to speed
| up queries. But what's the baseline?
|
| As someone very familiar with this tech landscape (maintainer of
| Xarray and Zarr, founder of https://earthmover.io/), I know that
| serverless solutions + object storage can deliver very low
| latency performance (sub second) for timeseries queries on
| weather data--much faster than the 30 minutes cited here--_if_
| the data are chunked appropriately in Zarr. Given the difficulty
| of data ingestion described in this post, it's worth seriously
| evaluating those solutions before going down the RDBMS path.
| ohmahjong wrote:
| This is a bit off-topic but I'm interested in the same space
| you are in.
|
| There seems to be an inherent pull between large chunks (great
| for visualising large extents and larger queries) vs smaller
| chunks for point-based or timeseries queries. It's possible but
| not very cost-effective to maintain separately-chunked versions
| of these large geospatial datasets. I have heard of "kerchunk"
| being used to try and get the best of both, but then I _think_
| you lose out on the option of compressing the data and it
| introduces quite a lot of complexity.
|
| What are your thoughts on how to strike that balance between
| use cases?
| rabernat wrote:
| > It's possible but not very cost-effective to maintain
| separately-chunked versions of these large geospatial
| datasets.
|
| Like all things in tech, it's about tradeoffs. S3 storage
| costs about $275 per TB per year. Typical weather datasets are
| ~10 TB. If you're running a business that uses weather data in
| operations to make money, you could easily afford to make 2-3
| copies that are optimized for different query patterns. We
| see many teams doing this today in production. That's still
| much cheaper (and more flexible) than putting the same volume
| of data in a RDBMS, given the relative cost of S3 vs.
| persistent disks.
|
| The real hidden costs of all of these solutions is the
| developer time operating the data pipelines for the
| transformation.
| PolarizedPoutin wrote:
| Hey Ryan and thank you for the feedback!
|
| I agree that storing the data in appropriately chunked Zarr
| files is almost surely going to be faster, simpler to set up,
| and take up less space. Could even put up an API in front of it
| to get "queries".
|
| I also agree that I haven't motivated the RDBMS approach much.
| This is mainly because I took this approach with Postgres +
| Timescale since I wanted to learn to work with them, and
| playing around with ERA5 data seemed like the most fun way.
| Maybe it's the allure of weather data being big enough to pose
| a challenge here.
|
| I don't have anything to back this up but I wonder if the RDBMS
| approach, with properly tuned and indexed TimescaleDB + PostGIS
| (non-trivial to set up), can speed up complex spatio-temporal
| queries, e.g. computing the 99th percentile of summer
| temperatures in Chile from 1940-1980, in case many different
| Zarr chunks have to be read to find this data. I like the idea
| of setting up different tables to cache these kinds of
| statistics, but it's not that hard to do with Zarr either.
|
| I'm benchmarking queries and indexes next so I might know more
| then!
| koliber wrote:
| Curious if he could squeeze more performance by using a different
| structure to store the same data. Some of these float4 cols could
| probably be stored as int2. Depending on how many decimal places
| are needed, you can divide the int to get the resulting
| floating-point value.
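|
| For instance (a hypothetical schema storing hundredths of a
| degree):
|
|   CREATE TABLE weather_compact (
|     time       timestamptz NOT NULL,
|     temp_centi smallint    NOT NULL  -- float4 value * 100
|   );
|
|   -- divide on the way out to recover a float
|   SELECT time, temp_centi / 100.0 AS temperature
|   FROM weather_compact;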
| postgresperf wrote:
| Contributor to the PG bulk loading docs you referenced here. Good
| survey of the techniques here. I've done a good bit of this
| trying to speed up loading the Open Street Map database.
| Presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for
| my last public update. Since then the advance of hardware, GIS
| improvements in PG15, and osm2pgsql adopting their middle-way-
| node-index-id-shift technique (makes the largest but rarely used
| index 1/32 the size) have gotten my times to load the planet set
| below 4 hours.
|
| One suggestion aimed at the author here: some of your experiments
| are taking out WAL writing in a sort of indirect way, using
| pg_bulkload and COPY. There's one thing you could try that wasn't
| documented yet when my buddy Craig Ringer wrote the SO post you
| linked to: you can just turn off the WAL in the configuration.
| Yes, you will lose the tables in progress if there's a crash, and
| when things run for weeks those happen. With time scale data,
| it's not hard to structure the loading so you'll only lose the
| last chunk of work when that happens. WAL data isn't really
| necessary for bulk loading. Crash, clean up the right edge of the
| loaded data, start back up.
|
| Here's the full set of postgresql.conf settings I run to disable
| the WAL and other overhead:
|
|   wal_level = minimal
|   max_wal_senders = 0
|   synchronous_commit = off
|   fsync = off
|   full_page_writes = off
|   autovacuum = off
|   checkpoint_timeout = 60min
|
| Finally, when loading in big chunks, to keep the vacuum work down
| I'd normally turn off autovac as above then issue periodic VACUUM
| FREEZE commands running behind the currently loading date
| partition. (Talking normal PG here) That skips some work of the
| intermediate step the database normally frets about where new
| transactions are written but not visible to everyone yet.
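|
| For the VACUUM FREEZE part, that's just periodic commands like
| this against the previously loaded chunk (table name is a
| placeholder):
|
|   VACUUM (FREEZE, VERBOSE) weather_2023_12;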
___________________________________________________________________
(page generated 2024-04-16 23:00 UTC)