[HN Gopher] Using PostgreSQL as a Data Warehouse
___________________________________________________________________
Using PostgreSQL as a Data Warehouse
Author : cedricd
Score : 148 points
Date : 2021-05-10 19:39 UTC (3 hours ago)
(HTM) web link (www.narrator.ai)
(TXT) w3m dump (www.narrator.ai)
| u678u wrote:
| What would be great is some way to codify all this advice, e.g. a
| way to run PG in a "data warehouse mode". Databases are too big and
| too configurable, and without specialist DBAs anymore most people
| are just guessing what works best.
| cedricd wrote:
| That's a great point.
|
| This isn't really what you're saying, but citus [1] ships a
| distributed Postgres. A lot of the things they improve would
| help massively with analytical workloads actually.
|
| 1: https://www.citusdata.com/
| riku_iki wrote:
| You misspelled cit<r>us; the same mistake is in the blog post.
| cedricd wrote:
| Ahh! So sorry. Fixed it.
| cedricd wrote:
| Also, what's your take? Do people use citus for analytical
| workloads as well as production at scale?
|
| I'd assume yes, but I haven't personally used you guys. I'm
| just aware of you and broadly how you scale Postgres.
| riku_iki wrote:
| Sorry, I am not familiar with citus.
| [deleted]
| richwater wrote:
| > vacuum analyze after bulk insertion
|
| Ugh, this is a nightmare. I wish they would come up with a better
| system than forcing this on users.
| cedricd wrote:
| Luckily Postgres' autovacuum works really well for normal
| workloads. If there's an even mix of inserts spread over time then
| it's probably best to just rely on it.
|
| For data warehouses inserts can happen in bulk on a regular
| cadence. In that case it can help to vacuum right after. I'm
| not sure if it has a huge impact in practice.
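|
| If you do want to vacuum right after a bulk load, a minimal sketch
| (the table name and file path here are just examples):
|
|   COPY events FROM '/tmp/events.csv' WITH (FORMAT csv);
|   VACUUM ANALYZE events;  -- reclaim dead tuples, refresh stats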
| efxhoy wrote:
| Dr. Martin Loetzsch did a great video, ETL Patterns with
| Postgres. He covers some really good topics:
|
| - Instead of updating tables, build their replacements under a
| different name, then rename them. This makes updating heavy-to-
| compute tables instant. Works even for schemas: rebuild a schema as
| schemaname_next, rename the current one to schemaname_old, then
| rename schemaname_next to schemaname (see the sketch below).
|
| - Keep all the source data raw and disable WAL, you don't need it
| for ETL.
|
| - Set memory limits high.
|
| And lots of other good tips for doing ETL/DW in postgres. It's
| here: https://www.youtube.com/watch?v=whwNi21jAm4
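|
| A rough sketch of the schema-swap trick (the "analytics" schema
| name is just an example):
|
|   BEGIN;
|   ALTER SCHEMA analytics RENAME TO analytics_old;
|   ALTER SCHEMA analytics_next RENAME TO analytics;
|   COMMIT;
|   DROP SCHEMA analytics_old CASCADE;
|
| And "disable WAL" in Postgres terms usually means UNLOGGED tables
| (not crash-safe, which is fine for re-loadable raw ETL inputs):
|
|   CREATE UNLOGGED TABLE raw_events (id bigint, payload jsonb);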
|
| I really appreciate having data in postgres. It's often easy to
| think that a specialised DW tool will solve all your problems,
| but that often fails to consider things like:
|
| - Developer experience. Postgres runs very easily on a local
| machine; more specialized solutions often don't, or are tricky to
| set up.
|
| - Learning another tool costs time. A developer can learn
| postgres really well in the time it takes them to figure out how
| to use several more specialised tools. And many devs already know
| postgres because it's pretty much the default DB nowadays.
|
| - Analytics queries often don't need to run at warp speed.
| Bigquery might give you the answer in a second but if postgres
| does it in a minute and it's a weekly report, who cares?
|
| - Postgres is boring and has been around for many years now, it
| will probably still be here in 10 years so time spent learning it
| is time well spent. More niche systems will probably be
| superseded by fancier, faster replacements.
|
| I would go so far as to say you don't necessarily need to split out
| your DW from your prod DB in every case. As soon as you start
| splitting out a DW to a separate server you need some way to keep
| it in sync, so you'll probably end up duplicating some business
| logic for a report, maintaining some ingestion app, or shuffling
| data around S3 or whatever. Keeping your analytics in your prod
| DB (or just a snapshot of yesterday's DB) is often good enough and
| means you're more likely to avoid gnarly business rules going out
| of sync between your app and your DW.
| dgudkov wrote:
| From a quick look I didn't notice any mention of columnar
| storage. I would be very skeptical about any claim that a DB
| without columnar storage "can work extremely well as a data
| warehouse".
| dreyfan wrote:
| Take a look at Kimball/Star-Schema. It's worked extremely well
| as a data warehouse technique for decades. That said, I think
| modern offerings (e.g. Clickhouse) are superior in most use
| cases, but it's definitely not impossible on a traditional row-
| oriented RDBMS.
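|
| A bare-bones star schema sketch, for anyone unfamiliar (all names
| invented): a central fact table keyed to small dimension tables:
|
|   CREATE TABLE dim_customer (customer_id serial PRIMARY KEY,
|                              name text);
|   CREATE TABLE dim_date     (date_id int PRIMARY KEY, day date);
|   CREATE TABLE fact_sales (
|     customer_id int REFERENCES dim_customer,
|     date_id     int REFERENCES dim_date,
|     amount      numeric
|   );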
| kthejoker2 wrote:
| No mention of Greenplum? Literally a columnar DW built on top of
| pg.
|
| Where are my DuckDB people? (Think SQLite for OLAP workloads.)
|
| https://duckdb.org/
| mattashii wrote:
| > Avoid Common Table Expressions (or upgrade to 12)
|
| Upgrading to 12 (or 13) seems like the better option here,
| whenever you're able to do so. The improvements are very much
| worth it.
| truculent wrote:
| Yes, it is strange that "Avoid CTEs" made it into the TL;DR
| instead.
| cedricd wrote:
| I updated the blog post :)
| cedricd wrote:
| Yes, that's a really great point. I should emphasize that more
| clearly in the blog :).
| mattashii wrote:
| A few more items I've encountered:
|
| - Citus data, not Citrus data.
|
| - In tables, column order matters. Put fixed-width non-null
| columns before all other columns (ideally ordered so that there
| is no unnecessary alignment padding). Then maybe some fixed-width
| columns with a low null fraction, then the rest ordered by query
| access frequency. A lot of time can be spent extracting tuple
| values, and you can save a lot of it via the attribute offset
| caches by correctly tetris-ing your columns. Note that dropped
| columns are set to null, so a table rewrite (by redefining your
| table) may be in order for 100% performance.
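|
| An illustrative example of the padding point (not from the post):
| a bigint needs 8-byte alignment, so a 1-byte boolean in front of
| it wastes 7 bytes per row:
|
|   -- 7 bytes of padding land between "flag" and "id":
|   CREATE TABLE t_bad  (flag boolean, id bigint NOT NULL, note text);
|   -- widest fixed-width NOT NULL first, variable-width last:
|   CREATE TABLE t_good (id bigint NOT NULL, flag boolean, note text);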
| galkk wrote:
| Isn't #2 something that the database engine should handle by
| itself? Const-width, non-null is written in the DDL, so the
| engine should be able to handle it.
| arcticfox wrote:
| Something painful I just learned today is that even Postgres >=
| 12 isn't that great at planning CTEs. Querying the same CTE
| twice seems to force it into materializing the CTE, just like
| Postgres <12 used to do.
|
| Fortunately there's a workaround: the `with .. as not
| materialized ()` hint sped up my query 100x.
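|
| For reference, a sketch of that hint (table and filter invented;
| a CTE referenced more than once is materialized by default in 12+):
|
|   WITH recent AS NOT MATERIALIZED (
|     SELECT user_id, amount FROM orders
|     WHERE created_at > now() - interval '7 days'
|   )
|   SELECT * FROM recent
|   UNION ALL
|   SELECT * FROM recent;  -- second reference triggers the default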
| wwweston wrote:
| What situations would make people unable to upgrade to 12+?
| aargh_aargh wrote:
| One case where I decided against it was when I needed to access
| the new Postgres server (which I initially planned to be v13)
| from an old machine (a legacy distro with v9.6). v13 defaults to
| the SCRAM-SHA-256 password hashing method, and only libpq10 and
| newer supports it. For some reason I couldn't (or didn't want to)
| either rehash the passwords or upgrade the client, so I remained
| on a lower version. Certainly not unsolvable, but I didn't have
| the time to spend on a proper fix.
| goatinaboat wrote:
| _What situations would make people unable to upgrade to 12+?_
|
| In my personal experience: the organisation doesn't invest in
| a storage solution that offers snapshots/COW (e.g. ZFS or a SAN
| or whatever). Then they wait to upgrade until their disks reach a
| usage level where the upgrade has to be done in place. Then
| they become like rabbits in the headlights and never upgrade.
| cedricd wrote:
| Maybe it doesn't get prioritized until it's important. I know
| PG upgrades are pretty straightforward, but sometimes people
| don't want to touch something running well.
|
| That said, given the performance implications, if someone
| wants to use PG as a warehouse upgrading to 12 is a no-
| brainer.
| cedricd wrote:
| We support multiple data warehouses on our platform. We recently
| had to do a bit of work to get Postgres running, so we wrote a
| high-level post about things to consider when running analytical
| workloads on PG instead of normal production workloads.
| rad_gruchalski wrote:
| Is anybody aware of any serious research comparing Yugabyte vs
| Postgres?
|
| Yugabyte appears to an application as essentially Postgres 11.2
| with all psql features (even the row/column-level security
| mechanisms) but, apparently, handles replication and sharding
| automagically (is it a DHT, similar to Cassandra?).
|
| Are there any serious technical comparisons with some
| conclusions, without marketing BS?
| zinclozenge wrote:
| Yugabyte is multi-Raft, I believe.
| oncethere wrote:
| It's interesting to see how PG can be configured, but why not
| just use a real warehouse like Snowflake?
|
| Also, do you have any numbers on how PG performs once it's
| configured?
| cedricd wrote:
| Yeah, I think using Snowflake or BigQuery or something is
| ultimately the better move. But sometimes folks use what they
| know (what they're comfortable managing, tuning, deploying,
| whatever).
|
| In my own testing PG performed very similarly to a 'real'
| warehouse. It's hard to measure because I didn't have the same
| datasets across several warehouses. Maybe in the future I'll
| try running something against a few to see.
| fiddlerwoaroof wrote:
| I really wanted to migrate an analytics project I was working
| on from Elasticsearch to Postgres: however, when we sat down
| and ran production-scale proofs of concept for the change,
| ClickHouse handily outclassed all the Postgres-based
| solutions we tried. (A Real DBA might have been able to solve
| this for us: I did some tuning, but I'm not an expert).
| ClickHouse, however, worked near-optimally out of the box.
| willvarfar wrote:
| Snowflake and BigQuery are cloud solutions. Some companies
| need self-hosted databases etc.
|
| And often having a homogenous database stack is a plus. If
| your production systems are all MySQL, then trying to get
| away with using MySQL for analytics too is a smart move etc.
|
| I've seen so many tiddly data warehouses. Most companies
| don't need web scale, and they overbuild and overcomplicate
| when they could be running on a simpler homogenous stack etc.
| acidbaseextract wrote:
| I'm glad I'm not in the situation of needing to make the
| judgement call, but Postgres' ecosystem might be part of the
| answer. For example, Snowflake has basic geospatial support,
| but PG has insane and useful stuff in PostGIS like
| ST_ClusterDBSCAN:
| https://postgis.net/docs/ST_ClusterDBSCAN.html
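|
| E.g. a one-query DBSCAN clustering sketch (table/column names
| invented; eps is in the units of the geometry's SRID):
|
|   SELECT id,
|          ST_ClusterDBSCAN(geom, 1000, 5) OVER () AS cluster_id
|   FROM store_locations;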
|
| Foreign data wrappers are another thing that might be
| compelling -- dunno if Snowflake has an equivalent.
|
| I don't have any numbers but PG has served me fine for basic
| pseudo-warehousing. Relative to real solutions, it's pretty bad
| at truly columnar workloads: scans across a small number of
| columns in wide tables. The "Use Fewer Columns" advice FTA is
| solid. This hasn't been a deal breaker though. Analytical query
| time has been fine if not great up to low tens of GB table
| size, beyond that it gets rough.
| simonw wrote:
| If your project has less than a few million rows, chances are
| PostgreSQL will run any analytical query you like in just a few
| seconds. So you can keep things simpler by avoiding adding any
| extra tools.
| adwww wrote:
| > why not just use a real warehouse like Snowflake
|
| Customers who demand / are legally obliged to ensure their data
| does not leave their territory, is one big reason.
|
| Snowflake gives you some options, but if you get a big contract
| with a customer in another region your entire analytics
| platform is unavailable to that customer.
| peterthehacker wrote:
| > why not just use a real warehouse like Snowflake?
|
| Cost is the most common reason I've seen. An RDS instance is
| about 1/2 the cost of Redshift per CPU and then Snowflake is
| slightly more expensive than Redshift (often worth the extra
| $$).
|
| Also, if you're dealing with less than 10GB of data the
| difference in performance will be barely noticeable, so at
| modest scale the most cost effective solution is RDS.
| hn2017 wrote:
| No columnstore option like SQL Server?
| mattashii wrote:
| None baked in (yet). Maybe in 15 or 16; Zedstore (a project for
| a PG columnstore table access method) is slowly getting more
| and more feature-complete, and might be committed to the main
| branch in one of the next few major releases.
| arcticfox wrote:
| I'm curious what people think of Swarm64 and/or TimescaleDB on
| that front
| BenoitP wrote:
| Citus, cited at the end, is a column store (single node or
| distributed)
|
| https://github.com/citusdata/citus
| CRConrad wrote:
| Slightly peculiar juxtaposition in subheading, "Differences
| Between Data Warehouses and Relational Databases". Most data
| warehouses _are_ relational databases (like, e.g., PostgreSQL). I
| think you might want to use something like "Differences Between
| Data Warehouses and Operational Databases" instead?
|
| Also, under "Reasons not to use indexes", #3 says: "Indexes add
| additional cost on every insert / update". Yes, but then data
| warehouses usually aren't continually updated during the workday.
| It's a one-time hit sometime during the night, during your ETL
| run. (Not a Pg DBA, but presumably you can shut off index updating
| during data load, and then run it separately afterwards for
| higher performance?)
| SigmundA wrote:
| Usually you drop indexes before load, then recreate after.
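|
| E.g. (index/table names made up):
|
|   DROP INDEX IF EXISTS idx_events_user_id;
|   COPY events FROM '/tmp/events.csv' WITH (FORMAT csv);
|   CREATE INDEX idx_events_user_id ON events (user_id);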
| gervwyk wrote:
| Probably going to get downvoted for this, but I feel like MongoDB
| should get more love in these subs.
|
| We use it all the time and its aggregations can get really
| advanced and perform well, to the level where we run most
| analytics on demand. Sure, we're not pushing to "big data"
| levels, max a few 100k records, but in reality I believe that's
| the average size of the majority of business datasets (just an
| estimate, I have nothing to back this up).
|
| Been building with it for 5 years now and it's been a breeze,
| especially with Atlas. I think our team has not spent more than
| 3 days in total on DB dev ops. And with Atlas you get Lucene text
| search and data lakes, querying data from S3. What's not to love?
| jmchuster wrote:
| Yes, if you're only going to have 100k records, then basically
| any type of store will work, and you should choose purely on
| ergonomics. At that size, you can get away with not even having
| indexes on a traditional SQL database.
| ahmedelsama wrote:
| Yeah, this blog post was about optimizing a Postgres table with
| 100m records, so 1000x more, and thus all these issues came
| to be.
| neximo64 wrote:
| I have used Mongo and it is a total nightmare; some of the
| reasons are in your text. It's supposedly only good with Atlas,
| which is a managed service.
|
| I haven't tried Atlas myself (and why would I? I try to avoid
| lock-in), but Postgres supports JSON column types, which have
| been my go-to instead of Mongo, and it has been an absolute
| breeze. Especially since they can be indexed and scanned with
| regular Postgres SQL (see the sketch below).
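|
| A sketch of what I mean (names made up; jsonb is the binary JSON
| type you usually want):
|
|   CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
|   CREATE INDEX ON events USING gin (payload);
|   SELECT * FROM events WHERE payload @> '{"type": "signup"}';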
| adwww wrote:
| Mongo is nice to work with for applications, but for analytics
| it's a bit of a nightmare, mostly due to incompatibility with
| external tools.
| arcticfox wrote:
| > Sure we're not pushing to "big data" levels, max a few 100k
| records, but in reality I believe that's the average size of
| the majority of business datasets (Just an estimate, I have
| nothing to back this up)
|
| Excel on a laptop is also a viable option at that scale.
| paulrbr wrote:
| Combine these tips with those:
| https://tech.fretlink.com/build-your-own-data-lake-for-repor...
| (I'm the writer of that linked article) and you get a really
| powerful, fully open-source, extensive data-warehouse solution
| for many real-life use cases (when your data doesn't exceed the
| 10^9 order of magnitude in number of rows).
|
| Thanks Cedric for sharing your experience with using PG for data-
| warehousing <3
| willvarfar wrote:
| A new Postgres-based darling is TimescaleDB. It's a drop-in
| extension for Postgres.
|
| It is a hybrid row/column store with excellent compression and
| performance.
|
| It would be interesting to see how it compares if narrator would
| try it out. Benchmarks would be cool.
|
| One very neat feature I am enamored by is "continuous
| aggregates". These are materialized views that auto-update as you
| change the fact table.
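|
| For a flavour of the syntax (assuming a hypertable "events" with
| a "created_at" time column; all names invented):
|
|   CREATE MATERIALIZED VIEW daily_events
|   WITH (timescaledb.continuous) AS
|   SELECT time_bucket('1 day', created_at) AS day, count(*) AS n
|   FROM events
|   GROUP BY day;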
|
| Continuous aggregates are a great idea. InfluxDB had "continuous
| queries" (though Influx's implementation generally is not so
| neat), and Firebolt has "aggregate indexes", which are much the
| same thing.
|
| I think all OLAP DBs will eventually have them as a staple, and
| that they will trickle down into OLTP too.
| esturk wrote:
| This recommendation reads backwards. You suggested a substitute
| that's "comparable", yet without any data, and then went on to
| nudge the author to try it. How do we know you didn't just bring
| something up to have the author test it for you?
| cedricd wrote:
| Would TimescaleDB be much faster for analytical queries that
| aren't necessarily segmented or filtered by time?
|
| My uninformed assumption is that if I do a group by over all
| rows in a table, it may not perform better.
|
| I'll look into their continuous aggregates -- that could be one
| way to get around the cost of aggregating everything if it's
| done incrementally.
| willvarfar wrote:
| I'd guess any roughly sequentially keyed table ought to get
| good insert performance. Think how common auto-increment is.
|
| And being HTAP, Timescale ought to do better than a classic
| pure column store on upserts and non-appending inserts too.
|
| Of course, if your table is big and the keys are unordered,
| you might still get excellent performance if your access
| pattern is read-heavy.
|
| Of course you can still mix in classic Postgres row-based
| tables etc. Timescale just gives you a column store choice
| for each table.
| gshulegaard wrote:
| I would have to dig more into specifics of your use case but
| my gut reaction is yes, it would be better.
|
| I do not have specific experience with TimescaleDB, but I
| have some experience scaling PostgreSQL directly and with
| Citus (which is similar, but not the same). But depending on
| the nuances of your use case, I can envision a number of
| scaling strategies in vanilla Postgres to handle it. A lot of
| what Timescale and Citus do is abstract some of those
| strategies and extend them. Which is just a vague way of me
| saying: I think I could probably come up with a scheme in
| vanilla Postgres to support your use case, and since
| Timescale/Citus make those strategies even easier/better, I am
| fairly confident they would handle it too.
|
| As an example, I currently have a table in my Citus schema
| that is sharded by column hash (e.g. a "type" enumerator) and
| further partitioned by time. The first part (hash-based
| sharding) seems possibly sufficient for your use case; there's
| a sketch at the end of this comment.
|
| Beyond the most simple applications in that domain though,
| there are more exotic options available to both Timescale and
| Citus that could come into play. For example, I know Citus
| recently incorporated some of their work on the cstore_fdw
| into Citus "natively" to allow columnar storage tables
| directly:
|
| https://www.citusdata.com/blog/2021/03/05/citus-10-release-o...
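|
| The hash-sharding part, roughly (Citus's
| create_distributed_table; table and column names invented):
|
|   SELECT create_distributed_table('events', 'type');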
| gnfargbl wrote:
| I love postgres and my business relies on it. However, at the
| scale the author is talking about (~100m rows), throwing all the
| data into BigQuery is very likely to be a better option for many
| people.
|
| If rows are around 1kB, then a full-dataset query over 100m rows
| scans ~100 GB; at BQ's ~$5/TB on-demand pricing that's less than
| $0.5 each -- less if only a few columns have to be scanned.
| Storage costs will be negligible and, unlike pg, setup time will
| be pretty much nil.
| cedricd wrote:
| Yep. Fully agree. The point of the post wasn't to say that you
| _should_ use PG as a data warehouse, just that if it's what
| you've got available (for various reasons) you _can_.
| predictmktegirl wrote:
| $0.5 queries?
| Keyframe wrote:
| Full scan. Still absurd what we've come to think of as
| tolerable.
| phpnode wrote:
| Reading 100GB from disk costs $0.50? This is absolutely
| incredible to me. How has this become acceptable to the
| industry?
| wodenokoto wrote:
| You don't pay for having a server running. You only pay for
| storage and data read during queries.
|
| No server fees and no fees when autoscaling up for heavy
| computations.
| phibz wrote:
| Maybe I missed it, but there's no mention of denormalizing
| strategies or data cubes, fact tables and dimension tables.
| Structuring your data closer to how it's going to be analyzed is
| vital for performance. It also gives you the opportunity to
| cleanse, standardize, and conform your data.
|
| I ran a pg data warehouse in the 8.x and 9.x days with about 20TB
| of data and it performed great.
| cedricd wrote:
| I think you're right, but it's a bit out of scope. It's hard to
| give generalizable advice on this, I think.
|
| What we personally do in practice is put everything into a
| single time-series table with 11 columns. [1]
|
| 1: https://www.activityschema.com/
___________________________________________________________________
(page generated 2021-05-10 23:00 UTC)