[HN Gopher] Citus 10 brings columnar compression to Postgres
___________________________________________________________________
Citus 10 brings columnar compression to Postgres
Author : whitepoplar
Score : 133 points
Date : 2021-03-06 18:16 UTC (4 hours ago)
(HTM) web link (www.citusdata.com)
(TXT) w3m dump (www.citusdata.com)
| nwatson wrote:
| Sensage/Addamark was too early to the columnar storage game in
| 2001-2003 ... https://en.wikipedia.org/wiki/Sensage.
| atombender wrote:
| Sybase/IQ (now SAP IQ, originally acquired from Expressway) was
| launched in 1995.
|
| There are other old column databases, such as kdb, but I
| believe Sybase/Expressway was the first commercial product.
| jacques_chester wrote:
| I think the claim here is that Citus added it to PostgreSQL.
|
| Depending on how one defines that claim, Greenplum may have
| been first.
|
| Disclosure: I work for VMware, which sponsors Greenplum
| development.
| clairegiordano wrote:
| Definitely not intending to take credit away from any other
| teams. There are so many good Postgres extensions and forks
| in this ecosystem.
|
| What we've done is add a Columnar storage feature into the
| Citus open source extension to Postgres, as part of Citus 10.
|
| One way to think of it: Citus 10 now gives Postgres users
| (those who are also using the Citus extension) a columnar
| compression option, for use on a single node and/or a
| distributed Citus cluster.
| jacques_chester wrote:
| Thanks. Regardless of who did what, columnar stores are
| non-trivial engineering and anyone who produces a
| production-ready one is worthy of admiration.
| asah wrote:
| Greenplum was founded in 2003, A/S was 2001.
| jacques_chester wrote:
| Again, my qualifier is "to PostgreSQL". I couldn't see such
| a connection on a brief skim of the A/S wikipedia entry.
| gigatexal wrote:
| Great start! Keep in mind the limitations:
|
| What are the Limitations?
|
| These limitations are not set in stone, and we look forward to
| working on them in the future:
|
| No UPDATE or DELETE support
| No index support
| No logical replication or logical decoding support
| See more limitations in the columnar README
| tpetry wrote:
| cstore_fdw had the same limitations, so even if they say these
| limitations may not persist forever, I am not very hopeful they
| really want to solve this problem.
|
| But they could solve it with just a little bit of work:
|
| * Keep a hidden bitmap that records whether each "row" of the
| columnar table is still valid
|
| * When updating/deleting values, only set the bitmap entry to
| zero to indicate the value is no longer valid
|
| * Append every updated "row" to a special chunk at the end of
| the table in uncompressed format
|
| * When vacuuming the table, delete the old rows with bitmap=0
| and merge in the values from the special chunk
|
| So you would have the same performance if update/delete is
| never done, as the complete bitmap index is filled with 1s. And
| every update/delete would make it just a little bit slower;
| since most often only old data is stored in compressed format,
| not many uncompressed values will be stored. And a VACUUM FULL
| would optimize these old tables again.
| [deleted]
| jeff-davis wrote:
| Thank you for the suggestions! We are interested in
| UPDATE/DELETE.
|
| Can you describe your use case for columnar update/delete in a
| little more detail? Is it a few random updates, or bulk
| updates, or something else?
| BenoitP wrote:
| So are they using Parquet, ORC or Arrow under the hood; or do
| they have a custom format?
| mslot wrote:
| It is a custom format that was originally derived from ORC, but
| is very different at this point. For instance, all the metadata
| is kept in PostgreSQL catalog tables to make changes
| transactional.
| jfbaro wrote:
| I am glad to see Postgresql ecosystem growing. It gets better by
| the day. Thanks to all people involved!
| znpy wrote:
| since we're talking postgresql:
|
| i recently started diving into postgresql and it seems to me that
| there is a patchwork of HA solutions (with patroni being the most
| feature-full) but no real multi-master solution for postgresql
| released under an open source license.
|
| There's BDR (bi-directional replication), but apparently
| 2ndQuadrant pulled it back under a proprietary license, am I
| right?
|
| What's the current status of PostgreSQL multi-master HA?
| merb wrote:
| > i recently started diving into postgresql and it seems to me
| that there is a patchwork of HA solutions (with patroni being
| the most feature-full) but no real multi-master solution for
| postgresql released under an open source license.
|
| True multi-master is barely needed. But there is Citus, which
| uses the GPL and can be run multi-master. And yes, Patroni is
| really, really awesome.
| znpy wrote:
| Thanks, I'll look into that!
| whitepoplar wrote:
| Not multi-master, but pg_auto_failover looks to be a very nice
| solution for HA, and one that is operationally simple to
| manage: https://github.com/citusdata/pg_auto_failover
| teej wrote:
| One of the gotchas of columnar storage (coming from Redshift) is
| that you lose all of the compression benefits if you have just
| one column that's fat or hard to compress.
|
| In Redshift columns are stored in blocks. You want to fit roughly
| the same number of column values per block across all your
| columns. But if you have one column where a small number of
| values can fit in a block, the rest of the columns end up leaving
| most of the block space unused. The result is wasted disk space
| and poor query performance.
|
| This Postgres extension has similar-sounding storage ideas with
| stripes, but it's not clear to me if it suffers from the same
| issue.
|
| My first test to vet this would be a table with 50 columns of
| ints and one column of md5 hashes stored as varchar.
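For intuition on why that md5 column is the hard case: hex digests are near-random over 16 symbols, while small repetitive ints deflate almost entirely. A quick back-of-the-envelope check with Python's zlib (a stand-in for whatever codec the engine actually uses):

```python
import hashlib
import zlib

# A column of small, repetitive ints compresses very well...
int_col = b"".join((i % 100).to_bytes(4, "little") for i in range(10_000))
# ...while a column of md5 hex digests is close to incompressible.
md5_col = b"".join(
    hashlib.md5(str(i).encode()).hexdigest().encode() for i in range(10_000)
)

int_ratio = len(zlib.compress(int_col)) / len(int_col)   # tiny
md5_ratio = len(zlib.compress(md5_col)) / len(md5_col)   # roughly half, at best
```

The hex column can't shrink much below 4 bits per character no matter the codec, which is exactly the kind of skew the block-alignment problem above is about.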
| mrits wrote:
| Several different compression algorithms would do very well for
| this, e.g. dictionary compression for low-cardinality wide
| columns. I'd recommend familiarizing yourself with a handful of
| different encodings for columnar data stores before you give up
| on the idea.
| georgewfraser wrote:
| Do you have a source for this, or a code sample that can
| demonstrate it? This would be an extremely naive implementation
| of columnar storage. There are some truly hard cases around
| long variable-length strings, but any halfway decent columnar
| storage engine should be able to handle columns with different
| widths.
| zten wrote:
| https://github.com/awslabs/amazon-redshift-
| utils/blob/master...
|
| The compression actually works as advertised, but that
| document outlines why you don't want to have the sortkey skew
| from other columns. The disk I/O balloons as it goes and
| fetches a larger number of blocks from the other columns that
| match what you're filtering in the sortkey.
| teej wrote:
| They call this issue out in the docs specifically when the
| fat column is a sort/distkey. rsbb in the Intermix slack has
| studied this in the most detail I know of outside AWS but
| they haven't published their book yet.
| jeff-davis wrote:
| In Citus columnar, stripes/chunks are variable in size. If you
| have one large column, it doesn't bloat the other columns.
| teej wrote:
| Cool! So if I select two columns with a predicate in the
| first, does it scan all stripes for the second column?
| jeff-davis wrote:
| Stripes aren't really "scanned". They are more of a logical
| concept that tracks where the physical data for each column
| is, and only fetches what it needs.
|
| If I understand what you are asking, let me restate: "can
| you apply a predicate first on column A before reading
| column B, so that you can avoid reading column B if the
| predicate on A doesn't match?".
|
| The answer is: "sometimes". Predicates match some rows and
| not others, so matching rows may be mixed in with non-
| matching rows, so it's not always possible to avoid the IO
| to read column B. _However_, if the matching and non-
| matching rows happen to be separated (e.g. if your table is
| naturally in time order and you have a time-based
| predicate), then it's able to do this optimization. Please
| see the section on "Chunk Group Filtering".
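A minimal model of that behavior, assuming per-chunk min/max metadata as described (illustrative only, not the actual Citus implementation):

```python
# Each chunk group keeps min/max metadata per column; a range
# predicate can skip whole groups whose [min, max] cannot match.
def chunk_groups(column, size):
    for i in range(0, len(column), size):
        chunk = column[i:i + size]
        yield {"min": min(chunk), "max": max(chunk), "rows": chunk}

def scan_ge(groups, threshold):
    """Return rows >= threshold, counting groups skipped via metadata."""
    out, skipped = [], 0
    for g in groups:
        if g["max"] < threshold:   # no row in this group can match
            skipped += 1           # skip the I/O for it entirely
            continue
        out.extend(r for r in g["rows"] if r >= threshold)
    return out, skipped

# A naturally time-ordered column: early groups are skipped entirely.
col = list(range(100))
rows, skipped = scan_ge(chunk_groups(col, 10), 75)
```

On a naturally ordered column, a range predicate lets most chunk groups be skipped without their data ever being read; on a shuffled column, matches appear in every group and nothing can be skipped.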
| teej wrote:
| Thanks for the in depth explanation! I look forward to
| exploring it more.
| brightball wrote:
| Awesome news! After seeing the benefits with the cstore
| extension, I assumed this was only a matter of time.
| georgewfraser wrote:
| Beware that simply adding a column-oriented storage engine to a
| row store like Postgres is not going to get you anywhere near the
| performance of a ground-up columnar system like Redshift or
| Snowflake. This paper explains why [1]. Short version: most of
| the benefits are in the column-oriented execution engine, which
| differs in every aspect of its implementation from a row-oriented
| execution engine.
|
| [1]
| https://stratos.seas.harvard.edu/files/stratos/files/columns...
| mslot wrote:
| Correct, though it depends on whether you are CPU-bound or
| I/O-bound. We see the latter a lot more often for large data
| sets.
|
| Columnar storage for PostgreSQL is especially relevant in cloud
| environments. Most database servers in the cloud use managed,
| network-attached disks because of durability, availability, and
| encryption-at-rest requirements. However, those do come with a
| performance penalty compared to local SSDs. The VMs also have
| IOPS and bandwidth limits, partly to manage capacity within the
| IaaS platform.
|
| If you can reduce the data size by 10x, then you are
| effectively increasing your disk bandwidth by that much as
| well. Moreover, you can keep more data in memory, so you will
| read much less data from disk, plus you'll only read the
| columns used by the query. Hence, you're likely to see speed
| ups of more than 10x for some queries, even without column-
| oriented execution.
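To make the arithmetic concrete, with hypothetical numbers (a 200 MB/s disk cap, a 30 GB table, 10x compression; none of these figures are from the article):

```python
# Hypothetical numbers: a network-attached disk capped at 200 MB/s.
disk_mb_s = 200          # provisioned disk bandwidth
logical_gb = 30          # uncompressed table size
compression = 10         # 10x size reduction from columnar storage

physical_gb = logical_gb / compression             # bytes actually read
scan_seconds = physical_gb * 1024 / disk_mb_s      # time for a full scan
effective_mb_s = logical_gb * 1024 / scan_seconds  # logical throughput
```

A full scan reads 3 GB instead of 30 GB in roughly 15 s: an effective ~2 GB/s of logical bandwidth from a 200 MB/s disk, before even counting the benefit of reading only the queried columns.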
| glogla wrote:
| That's exactly what we've seen. I don't know how Citus
| exactly stores data, but we found difference of 30x between
| gzipped Parquet and "generic Oracle table".
|
| There's a huge difference whether your analytical query is
| scanning full 30 GB or 1 GB (or maybe even half that or less
| if you need to scan just some columns).
| dang wrote:
| Some past threads on that paper:
|
| _The design and implementation of modern column-oriented
| database systems_ -
| https://news.ycombinator.com/item?id=18076547 - Sept 2018 (42
| comments)
|
| _Design and Implementation of Modern Column-Oriented Databases
| (2012) [pdf]_ - https://news.ycombinator.com/item?id=11803299 -
| May 2016 (9 comments)
| GordonS wrote:
| I agree that it's never going to be as effective as something
| built from the ground up as a column store. But the performance
| and storage improvements can still be immense over standard
| Postgres. And if you're already using Postgres, it means one
| less service to configure and maintain, and you can easily
| combine standard relational data with columnar data. For a lot
| of folks, that's a huge win.
| wyck wrote:
| Amazon really missed the boat on Citus. A few more great
| acquisitions like this and Azure is going to look great (yes, I
| know it can still be self-installed on AWS).
| biggerfisch wrote:
| I'm curious to see how this compares in real life to TimescaleDB
| hypertables with compression, which to me reads as much the
| same thing. I'm wondering if Citus is bringing a lower-level
| implementation of the idea, possibly?
| LogicX wrote:
| Came here to say this - I was looking to see how compression
| compared to timescale's stated 91% compression.
|
| https://docs.timescale.com/latest/using-timescaledb/compress...
| atanasovskib wrote:
| There are a lot of differences that need to be taken into
| account before making a comparison.
|
| 1. TimescaleDB implements the compression at a higher level;
| the underlying table storage/access method remains the same
|
| 2. TimescaleDB doesn't compress the latest data, allowing you
| to keep fast writes and edits for recent data, while still
| letting you benefit from compression on row-based data
|
| 3. Although not currently available, it is possible to have a
| TimescaleDB hypertable with a column based access method
|
| 4. Comparing would have to take into account the data model,
| access methods (types of queries), ingestion vs query
| comparison (batch vs real time), backfilling and editing, etc
|
| I agree that this (Columnar) would be closer to Parquet.
| mslot wrote:
| It always depends on the data, but we've seen 92.5% and more:
| https://twitter.com/JeffMealo/status/1368030569557286915
| akulkarni wrote:
| (TimescaleDB person)
|
| TimescaleDB users have seen 98% (i.e. over 50x) compression
| rates in some real-world cases (e.g., for some IT
| monitoring datasets), but compression ratio will definitely
| vary by dataset. (For example, a dataset of just 0s will
| compress even better! But that's probably not a realistic
| dataset :-) )
|
| The reality is that Citus and TimescaleDB [0][1] take very
| different approaches to columnar compression, which result
| in different usability and performance trade-offs. Ultimately,
| one should choose the right tool for their workload.
|
| (As an aside, if you have time-series data, no one has
| spent more time developing an awesome time-series
| experience on Postgres than the TimescaleDB team has :-) )
|
| Kudos to the Citus team for this launch! I love seeing how
| different members of the Postgres community keep pushing
| the state-of-the art.
|
| [0] Building columnar compression in a row-oriented
| database (https://blog.timescale.com/blog/building-
| columnar-compressio...)
|
| [1] Time-series compression algorithms, explained
| (https://blog.timescale.com/blog/time-series-compression-
| algo...)
| mslot wrote:
| The access method approach followed in Citus is indeed lower
| level and more generic, which means it can be used on both time
| series data and other types of data.
|
| For time series data, you can use built-in partitioning in
| PostgreSQL. It's not as easy to use as TimescaleDB, but
| pg_partman goes a long way, see:
| https://docs.citusdata.com/en/latest/use_cases/timeseries.ht...
|
| You can then use the columnar access method to compress old
| partitions (see the end of the doc), and use distributed tables
| to shard and parallelize queries and DML.
| bradleyjg wrote:
| This reads to me more like Parquet.
| simonw wrote:
| Reassuring to see big new features like this coming out after the
| Microsoft acquisition, a healthy sign that this open source
| product continues to see serious ongoing investment.
| ablekh wrote:
| Technically, the Citus _product_ is open core, not open source.
| Though open-sourcing the shard rebalancer is a recent welcome
| step.
| merb wrote:
| Well, because they are the only cloud vendor with an HA
| master-master PostgreSQL database (managed Citus), it is
| probably also a selling point.
| pm90 wrote:
| Wait... they have 2 read/write masters? And you can
| read/write from either master at the same time?
| merb wrote:
| Yes, but it's more like Vitess, which is basically sharded
| masters; most of the time this works way better than something
| like Galera or BDR.
___________________________________________________________________
(page generated 2021-03-06 23:00 UTC)