[HN Gopher] Postgres Unlogged Tables
___________________________________________________________________
Postgres Unlogged Tables
Author : plaur782
Score : 61 points
Date : 2023-02-18 12:21 UTC (10 hours ago)
(HTM) web link (www.crunchydata.com)
(TXT) w3m dump (www.crunchydata.com)
| lazyant wrote:
| can we have the "best of both worlds" for inserting a lot of
| data quickly: an unlogged table for performance, plus a trigger
| that copies the data to another, regular table for permanence?
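|
| Something like the following is what I have in mind - a rough
| sketch with made-up names, untested:
|
|     CREATE UNLOGGED TABLE events_fast (
|         id      bigint GENERATED ALWAYS AS IDENTITY,
|         payload jsonb
|     );
|     CREATE TABLE events_durable (LIKE events_fast);
|
|     CREATE FUNCTION copy_event() RETURNS trigger AS $$
|     BEGIN
|         -- copy each new row into the regular (WAL-logged) table
|         INSERT INTO events_durable VALUES (NEW.*);
|         RETURN NEW;
|     END;
|     $$ LANGUAGE plpgsql;
|
|     CREATE TRIGGER events_copy AFTER INSERT ON events_fast
|         FOR EACH ROW EXECUTE FUNCTION copy_event();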
| BeefySwain wrote:
| A very well-written and detailed article, with the caveat that
| they never mention a use case they consider legitimate. Does
| anyone here have any uses? I could imagine some sort of
| transient ETL-type task making sense. Thoughts?
| jpalomaki wrote:
| I've been using these for some larger analytical queries that
| run in batch fashion (typically something that's done once or a
| couple of times per day).
|
| I find it hard to get good performance from large queries with
| CTEs. Often it's much easier (and uglier) to just split them
| into multiple steps, create an intermediate table for each
| step, and add the necessary indexes.
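|
| Roughly this shape, with made-up table and column names:
|
|     CREATE UNLOGGED TABLE step1 AS
|         SELECT customer_id, sum(amount) AS total
|         FROM orders
|         GROUP BY customer_id;
|     CREATE INDEX ON step1 (customer_id);
|     ANALYZE step1;
|
|     CREATE UNLOGGED TABLE step2 AS
|         SELECT c.region, s.total
|         FROM step1 s JOIN customers c USING (customer_id);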
|
| Temporary tables are of course even better, since usually you
| don't want these left around afterwards. And temporary tables
| are also unlogged.
| avianlyric wrote:
| Which versions of Postgres have you been running your large
| CTE-based queries on?
|
| Up until Postgres 12, all CTEs were materialised before their
| results were used, effectively making every CTE a temporary
| table. From Postgres 12 onwards, for a side-effect-free CTE
| that is referenced only once, Postgres will take constraints
| from the parent query and push them down into the CTE, reducing
| the amount of data it needs to process and allowing better use
| of indexes.
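|
| Postgres 12 also lets you pick the behaviour explicitly
| (hypothetical query):
|
|     -- old behaviour: compute the CTE in full first
|     WITH w AS MATERIALIZED (SELECT * FROM big_table)
|     SELECT * FROM w WHERE key = 123;
|
|     -- new behaviour: the planner may push "key = 123" into
|     -- the CTE and use an index on big_table
|     WITH w AS NOT MATERIALIZED (SELECT * FROM big_table)
|     SELECT * FROM w WHERE key = 123;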
|
| OOI, have you tried converting your CTEs into sub-queries as an
| experiment to see if they're faster? Even from Postgres 12
| onwards, sub-queries and CTEs aren't always treated the same by
| the query planner, and you can get some surprising differences
| in performance.
| RandomBK wrote:
| I can't speak to the parent's situation, but temp tables are
| common practice in OLAP workloads across many database engines
| (personal experience with Postgres, Presto, Spark, Netezza).
|
| At a certain level of query complexity, no query planner is
| able to accurately predict the characteristics of sub-queries
| or CTEs, resulting in query plans that are ill-suited to the
| problem. More often than not, materializing the CTEs inside a
| large query into 2-3 temporary tables gives order-of-magnitude
| performance gains, as the database now knows exactly how many
| rows it's dealing with, has stats on the number of nulls, etc.
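|
| In Postgres that looks something like this (made-up tables):
|
|     CREATE TEMP TABLE daily_counts AS
|         SELECT user_id, count(*) AS n_events
|         FROM events
|         WHERE ts >= now() - interval '1 day'
|         GROUP BY user_id;
|
|     -- give the planner real row counts, null fractions, etc.
|     ANALYZE daily_counts;
|
|     SELECT u.name, d.n_events
|     FROM daily_counts d
|     JOIN users u USING (user_id);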
| candiddevmike wrote:
| I use them for caching in lieu of Redis. They work well, and by
| using triggers I can ensure they're never stale.
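|
| Schematically (made-up names; the trigger lives on the source
| table and keeps the unlogged cache fresh):
|
|     CREATE UNLOGGED TABLE user_cache (
|         user_id bigint PRIMARY KEY,
|         profile jsonb
|     );
|
|     CREATE FUNCTION refresh_user_cache() RETURNS trigger AS $$
|     BEGIN
|         INSERT INTO user_cache VALUES (NEW.id, to_jsonb(NEW))
|         ON CONFLICT (user_id)
|         DO UPDATE SET profile = EXCLUDED.profile;
|         RETURN NEW;
|     END;
|     $$ LANGUAGE plpgsql;
|
|     CREATE TRIGGER users_cache_refresh
|         AFTER INSERT OR UPDATE ON users
|         FOR EACH ROW EXECUTE FUNCTION refresh_user_cache();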
| anarazel wrote:
| Session state is a very common one, where the journaling
| overhead can also be particularly pronounced (it's much easier
| to hide in batch workloads).
| hyperman1 wrote:
| I use them for importing from files into tables: run some
| cleanup and validation, then move the data to its final
| destination. If pg crashes halfway, I still have the original
| file and can restart from zero.
|
| A second use is importing httpd combined logs and doing some
| analysis to find out which chains of HTTP calls caused certain
| kinds of behaviour. When done, the tables get deleted. It
| allows easy ad hoc queries, correlating with monitoring tables,
| indexing... I used to write Python scripts for this, but pg did
| better than expected, and this way of working somehow stuck.
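|
| The first pattern is roughly this (hypothetical file and table
| names):
|
|     CREATE UNLOGGED TABLE staging_log (line text);
|     COPY staging_log FROM '/var/log/httpd/access.log';
|
|     -- cleanup/validation here, then move to the final table
|     INSERT INTO access_log (line)
|     SELECT line FROM staging_log WHERE line <> '';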
| chuckledog wrote:
| Seems like bulk loading might be a good use case here.
| avianlyric wrote:
| Probably not, unless you don't want to keep the data long term.
|
| The initial load might be faster, but you'll end up repaying
| that time saving when you turn logging back on, because
| Postgres is then forced to read the contents of the table and
| push it into the WAL anyway - throwing away any saving you
| originally made by avoiding the WAL with an unlogged table.
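|
| i.e. this is the step where the cost comes back (hypothetical
| table name):
|
|     -- rewrites the whole table into the WAL
|     ALTER TABLE bulk_loaded SET LOGGED;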
| ndriscoll wrote:
| It can be useful to bulk load data into a temporary table, do
| validations there (which may involve joins to your permanent
| tables, so it's convenient to be able to do them in the
| database rather than in the application), and then, once
| everything is validated, copy into your real tables from the
| temporary table. If you have a process like that, it could make
| sense to have that staging table be unlogged (assuming the copy
| to the real tables is still safe if you do that).
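|
| A sketch of that flow, with made-up tables:
|
|     CREATE UNLOGGED TABLE staging (LIKE orders);
|     COPY staging FROM '/tmp/orders.csv' WITH (FORMAT csv);
|
|     -- validation that needs the permanent tables, e.g. rows
|     -- that would violate a foreign-key-style reference
|     SELECT s.*
|     FROM staging s
|     LEFT JOIN customers c ON c.id = s.customer_id
|     WHERE c.id IS NULL;
|
|     -- once everything checks out
|     INSERT INTO orders SELECT * FROM staging;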
| antupis wrote:
| We used it for ETL processing.
| tremon wrote:
| _should someone literally pull the plug on your Postgres database
| server, your data is 100% safe, except for unlogged tables, in
| which your data is 100% gone - by design!_
|
| This omits a crucial fact: your data is not gone until PostgreSQL
| has gone through its startup recovery phase. If you really need
| to recover unlogged table data (to which every proper database
| administrator will rightly say "then WhyTH did you make it
| unlogged"), you should capture the table files _before_ starting
| the database server again. And then pay through the nose for a
| data recovery specialist.
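|
| For reference, those files live under the server's data
| directory; on a live server you can locate them with
| (hypothetical table name):
|
|     -- returns a path relative to the data directory,
|     -- e.g. base/16384/16385
|     SELECT pg_relation_filepath('my_unlogged_table');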
|
| _However, a crash will truncate it._
|
| So this isn't exactly true. A crash _recovery_ will truncate it.
| _a_a_a_ wrote:
| 'Safe' here obviously means including transactional integrity,
| so I feel you're criticising the wrong thing.
|
| > A crash _recovery_ will truncate it.
|
| That's so nitpicky it has lesser nitpicks living upon it.
| Dylan16807 wrote:
| > 'Safe' here obviously means including transactional
| integrity, so I feel you're criticising the wrong thing.
|
| Good news, they're not criticizing the word "safe", they're
| criticizing the word "gone".
|
| > That's so nitpicky it has lesser nitpicks living upon it.
|
| It says "100% gone" when you pull the plug.
|
| Pointing out that it is merely _queued_ for deletion is a big
| deal, not a nitpick.
| _a_a_a_ wrote:
| > Pointing out that it is merely queued for deletion...
|
| OK, so before deletion occurs: 1) how are you going to get it
| back, and 2) how do you know how much you have or haven't lost
| (that transactional integrity I mentioned)?
| apnew wrote:
| Interesting, I thought unlogged tables meant they don't get
| WAL-ed or stored on the FS at all.
|
| Any further reading you can suggest?
| anarazel wrote:
| They do have backing files - after all, unlogged tables can end
| up many times the size of memory, so we need to reserve space
| for them, and there's a very high likelihood we could have to
| evict the buffers or shut down. But they're only written out
| due to buffer pressure or shutdown checkpoints (plus things
| like altering the tablespace).
| gavinray wrote:
| At a high level, there are several components that coordinate
| how transactions are WAL-logged:
|
| 1. The Buffer Pool (memory for pages in the database)
|
| 2. The Log Manager
|
| 3. The Transaction Manager
|
| 4. The Storage Manager (read/write to disk)
|
| 5. Accessor Methods (interpret page bytes as e.g. Heap or
| BTree)
|
| This is abstract and not particular to Postgres, which doesn't
| use exactly these names for all of the above.
|
| Normally, when the Transaction Manager creates transactions
| that modify records/tuples (using the Accessor Methods), those
| actions need to be persisted to the WAL via the Log Manager.
|
| The pages of memory backing these records come from the
| Buffer Pool, and the Buffer Pool must also log certain
| actions.
|
| Before the Buffer Pool can flush any modified page to disk,
| the changes up to that page must have been persisted to the
| WAL via the Storage Manager as well.
|
| When you create unlogged tables, none of this happens, and
| when you modify records there's no trail.
|
| That's an attempt at tl;dr'ing the WAL.
|
| I am not an expert (Anarazel is).
| _a_a_a_ wrote:
| I thought the article was pretty clear.
| tremon wrote:
| A quick scan of the PostgreSQL manual turns up nothing. I
| would have expected the steps performed during startup
| recovery to be documented there. Nor is the manual explicit
| about data file usage for an unlogged table; there are just
| these two snippets:
|
| from https://www.postgresql.org/docs/15/sql-
| createtable.html#SQL-...:
|
| > Data written to unlogged tables is not written to the
| write-ahead log (see Chapter 30), which makes them
| considerably faster than ordinary tables
|
| and from
| https://www.postgresql.org/docs/15/glossary.html#GLOSSARY-
| UN...:
|
| > The property of certain relations that the changes to them
| are not reflected in the WAL. This disables replication and
| crash recovery for these relations.
|
| Both say nothing about the normal data files underlying
| unlogged tables, so none of what I wrote can be found in the
| official docs (or maybe it can, just not by me ;)
|
| However, there is also this page from the postgres developer
| wiki: https://wiki.postgresql.org/wiki/Future_of_storage
| which does say that pure in-memory tables are not supported
| by Postgres:
|
| > it would be nice to have in-memory tables if they would
| perform faster. Somebody could object that PostgreSQL is on-
| disk database which shouldn't utilize in-memory storage. But
| users would be interested in such storage engine if it would
| give serious performance advantages.
| anarazel wrote:
| Unlogged table data commonly won't even have been written to
| disk, and what is on disk is completely inconsistent after a
| crash. So really, the data is gone once Postgres has crashed
| (on its own, due to hw issues, kill -9, ...).
| jonstewart wrote:
| Does COPY FROM sidestep the WAL? My (perhaps incorrect)
| understanding is that pg writes such data to a table file and
| then uses a small transaction to make it live.
| lfittl wrote:
| It does not, except if you're COPYing into an unlogged table -
| for regular tables, data that gets COPYed in is WAL-logged and
| crash-safe.
|
| In my understanding, compared to a multi-row insert, a COPY
| statement is (1) a lot more efficient to parse and process,
| and (2) uses a special bulk write strategy that avoids
| churning shared buffers with a large number of dirty pages.
|
| Compared to multiple single-row inserts, another benefit is
| that you're writing a single transaction ID instead of a lot
| of individual transaction IDs (meaning less freezing work for
| autovacuum later, etc.)
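|
| For example (hypothetical table), this consumes one
| transaction ID and uses the bulk write path:
|
|     COPY items (id, name) FROM '/tmp/items.csv'
|         WITH (FORMAT csv);
|
| whereas this consumes one transaction ID per statement:
|
|     INSERT INTO items (id, name) VALUES (1, 'one');
|     INSERT INTO items (id, name) VALUES (2, 'two');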
| anarazel wrote:
| We also batch the insertion into the table and the WAL
| logging, if possible. There are cases where that optimization
| has to be disabled, though (the presence of some types of
| triggers, for example).
___________________________________________________________________
(page generated 2023-02-18 23:01 UTC)