[HN Gopher] How Postgres stores data on disk - this one's a page turner
___________________________________________________________________
How Postgres stores data on disk - this one's a page turner
Author : drewsberry
Score : 394 points
Date : 2024-08-05 08:28 UTC (14 hours ago)
(HTM) web link (drew.silcock.dev)
(TXT) w3m dump (drew.silcock.dev)
| HPsquared wrote:
| Title missing the leading "How"
| theblazehen wrote:
| I believe HN auto-strips such "filler" words; it needs to be
| added back by a mod
| andersa wrote:
| Perhaps it should not? Why would such a silly feature be
| explained over and over rather than _removed_?
| codr7 wrote:
| Yeah, especially when it changes the meaning in silly ways.
|
| My brain spent quite some time trying to figure out what
| the headline was trying to say.
| codetrotter wrote:
| Because there are a lot of articles in the news that add
| this word for no reason.
| nandhinianand wrote:
| Ahhh... wouldn't it be better to use a threshold based on the
| percentage of filler words out of total words in the title? I'd
| argue that a feature which simply strips filler words from
| titles isn't always useful, and is sometimes actively harmful.
| shakna wrote:
| Humans work better. HN is small scale enough that a
| moderator can come along, collapse the off topic comments
| and fix the title, and it's not an issue.
| candiddevmike wrote:
| HN does a lot of title editorializing. Like the one a few
| days ago that turned the acronym RATs into Rats and
| completely changed the meaning.
| drewsberry wrote:
| I did add it into the title originally, must've been
| stripped. Thanks for the heads up, didn't know HN had this
| "useful feature".
| eatonphil wrote:
| The submitter can edit the title after it has been auto-
| modified. It doesn't take a mod to do it unless some amount
| of time has passed.
| drewsberry wrote:
| Can't see any way of modifying the title, unless I'm
| missing something - maybe I missed the edit window.
| eatonphil wrote:
| Yeah, after an hour, you've probably missed it. (Source:
| I frequently edit titles of posts I submit to HN because
| HN changed it to nonsense. :) )
| drewsberry wrote:
| Thanks for letting me know, I'll make sure to edit it
| quickly next time :-)
| eitland wrote:
| IIRC, if the original submitter edits the title once it has
| been posted, the edited version sticks, i.e. the filter only
| works the first time and you can override it if you notice it.
| nraynaud wrote:
| A bit of curiosity: how did Postgres choose 8k pages? Shouldn't
| it be the FS page size, to help with atomicity?
| silvestrov wrote:
| Depends very much on how the SSDs are designed internally. I
| think these days we have to settle for "can never be sure" of
| the real page size for atomicity. Pages can also become corrupt
| in other ways.
|
| It is weird that "--data-checksums" isn't the default for new
| databases, even though it costs a bit of performance. Integrity
| should be more important than performance.
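|
| (To check an existing cluster from code - a minimal sketch,
| assuming psycopg v3 and a reachable server; data_checksums is a
| read-only setting reported by the server:)
|
|     import psycopg
|
|     with psycopg.connect("dbname=postgres") as conn:
|         # "on" only if the cluster was created with
|         # initdb --data-checksums (or converted with pg_checksums)
|         print(conn.execute("SHOW data_checksums").fetchone()[0])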
| isosphere wrote:
| Was thinking the same thing when I saw those zeros in the
| checksum field. Perhaps the consequences are significant.
|
| Here's a benchmarking exercise I found:
| https://www-staging.commandprompt.com/uploads/images/Command...
|
| With a tidy summary:
|
| > Any application with a high shared buffers hit ratio: little
| difference.
|
| > Any application with a high ratio of reads/writes: little
| difference.
|
| > Data logging application with a low ratio of reads/inserts,
| and few updates and deletes: little difference.
|
| > Application with an equal ratio of reads/inserts, or many
| updates or deletes, and a low shared buffers hit ratio (for
| example, an ETL workload), especially where the rows are
| scattered among disk pages: expect double or greater CPU and
| disk I/O use.
|
| > Run pg_dump on a database where all rows have already been
| previously selected by applications: little difference.
|
| > Run pg_dump on a database with large quantities of rows
| inserted to insert-only tables: expect roughly double CPU and
| disk I/O use.
| silvestrov wrote:
| On my M1 mac "dd ... | cksum" takes 3 seconds while "dd |
| shasum" (sha1) takes 2 seconds. So cksum might not be the
| best tool for performance checking.
|
| There is CPU specific code in the PG source in
| src/include/storage/checksum_impl.h
|
| It is written as a plain nested loop in C. So performance
| is fully dependent on the compiler being able to
| parallelize or vectorize the code.
|
| I would not be surprised if manually written SIMD code
| would be faster.
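|
| (For a rough comparison in the same spirit - zlib.crc32 vs
| hashlib.sha1 over an in-memory buffer; numbers vary by machine
| and say nothing about Postgres's own FNV-derived page checksum:)
|
|     import hashlib, time, zlib
|
|     buf = b"\x00" * (256 * 1024 * 1024)  # 256 MiB of zeros
|
|     t0 = time.perf_counter()
|     zlib.crc32(buf)
|     t1 = time.perf_counter()
|     hashlib.sha1(buf).digest()
|     t2 = time.perf_counter()
|
|     print(f"crc32 {t1 - t0:.2f}s, sha1 {t2 - t1:.2f}s")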
| Joe_Cool wrote:
| Interesting. I guess M1 doesn't have the 'crc32'
| "acceleration" that is included in SSE4.2.
| loeg wrote:
| M1 has CRC32 acceleration intrinsics.
|
| https://dougallj.wordpress.com/2022/05/22/faster-crc32-on-th...
|
| https://github.com/corsix/fast-crc32?tab=readme-ov-file#appl...
| silvestrov wrote:
| So when using these intrinsics an Intel Core i7 can do 30 GB/s,
| but the performance check linked above (by isosphere) says only
| 300 MB/s, i.e. 1%.
|
| Something is amiss here.
|
| If a CPU can do 30 GB/s then a CRC check should not have
| any real performance impact.
| anarazel wrote:
| The bottleneck isn't the checksum computation itself at all.
| It's that to keep checksums valid we need to protect against
| the potential of torn pages even in cases where it doesn't
| matter without checksums (i.e. where just individual bits are
| flipped). That in turn means we need to WAL-log changes we
| wouldn't need to without checksums - which can be painful.
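|
| (The related knobs can be inspected at runtime - a minimal
| sketch, assuming psycopg v3:)
|
|     import psycopg
|
|     with psycopg.connect("dbname=postgres") as conn:
|         # full_page_writes guards against torn pages after a
|         # crash; wal_log_hints forces the extra WAL volume
|         # described above even without data checksums
|         for name, setting in conn.execute(
|             "SELECT name, setting FROM pg_settings WHERE name "
|             "IN ('full_page_writes', 'wal_log_hints')"
|         ):
|             print(name, setting)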
| speed_spread wrote:
| If physical integrity is already provided by the backing
| filesystem, such as ZFS, wouldn't --data-checksums be
| redundant?
| magicalhippo wrote:
| If data is served from the ARC, its primary cache, ZFS does
| not perform a checksum check before handing it to you, as
| the data was checked when it got read into the ARC.
|
| If you use ECC you're quite safe, but typical ECC can only
| correct single-bit and detect double-bit errors, not wider
| multi-bit errors.
|
| So if you care much about your integrity, you might want
| Postgres to do its checksum check as well.
| ikiris wrote:
| If you somehow have 3+ bit errors coming out of ram on an
| ECC board, you have _much_ bigger problems than trying to
| verify your postgres data via checksum.
| magicalhippo wrote:
| AFAIK RowHammer attacks can cause multiple bits to flip
| in a row[1], no?
|
| But sure, it's not for the vast majority of folks.
|
| [1]: https://www.vusec.net/projects/eccploit/
| ikiris wrote:
| Rowhammer would qualify as a bigger problem, yes.
| tremon wrote:
| I'm not sure how much bearing internal storage organization
| should have on Postgres' page size. Since pg explicitly
| chooses not to implement its own storage organization
| layer, there's always a filesystem between a pg database and
| the underlying storage.
| berkes wrote:
| > Integrity should be more important than performance.
|
| Most often it is. But not always. There certainly are cases
| where speed is far more important than integrity in databases.
| I cannot think of a case where this would be true for an RDBMS
| or even a document DB (though MongoDB had different opinions on
| this...).
|
| But e.g. Redis as a caching server, or memcached, or even the
| non-normalized data that I keep in a PG instance and can easily
| reproduce from other sources in case of corruption or
| staleness: it's fine to trade integrity for speed there.
| loeg wrote:
| The 8k size long predates SSDs or 4k sectors.
| da_chicken wrote:
| 8k is a very common page size, but 4k isn't unheard of.
| Oracle's default is 4k.
|
| The issue is that page size caps row size (for on-row storage).
| Also, if you have a smart clustering index, larger pages can
| make more efficient use of index addressing. So it's a
| trade-off.
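|
| (Back-of-envelope, using the layout constants from the Postgres
| page layout docs - these can drift between versions:)
|
|     PAGE_SIZE = 8192   # default block size
|     PAGE_HEADER = 24   # PageHeaderData
|     LINE_POINTER = 4   # ItemIdData, one per tuple
|     TUPLE_HEADER = 23  # HeapTupleHeaderData, before padding
|
|     # rough ceiling for one tuple that must fit in a single
|     # page (in practice TOAST moves large values out of line
|     # well before this)
|     print(PAGE_SIZE - PAGE_HEADER - LINE_POINTER - TUPLE_HEADER)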
| hans_castorp wrote:
| Oracle defaults to 8k as well:
|
| https://docs.oracle.com/en/database/oracle/oracle-database/1...
|
| > Default value 8192
| anarazel wrote:
| Historically there was no atomicity at 4k boundaries, just at
| 512 byte boundaries (sectors). That'd have been too limiting.
| Lowering the page size now would prove problematic due to the
| smaller row sizes / lower number of columns it would allow.
| dfox wrote:
| > This process of retrieving the expected database state from the
| WAL is called logical decoding and Postgres stores files related
| to this process in here.
|
| While logical decoding is about WAL, it is not related to the
| recovery process. Logical decoding is a mechanism to convert the
| WAL entries back into the high-level operations that caused the
| WAL entries, for example for replication or audit.
| drewsberry wrote:
| Very good point, I've rephrased this.
| dfox wrote:
| You still refer to logical decoding as part of recovery in
| the last section. The main point is that WAL by itself is
| designed only for data recovery and only contains information
| about what is going to be written where in terms of
| essentially raw disk accesses (notably, the writes do not
| have to be page aligned).
|
| Logical decoding (which needs wal_level=logical, extending the
| WAL format with additional metadata) is about parsing the WAL
| for purposes other than performing recovery (or physical
| replication, which is essentially the same thing as recovery,
| but performed on another instance of the same cluster). The
| name "logical decoding" is certainly intended to emphasize that
| it has uses other than logical replication, but on this level
| those are not that different from logical replication (get a
| stream of changed tuples in tables).
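|
| (The low-level interface is directly queryable - a minimal
| sketch using the built-in test_decoding plugin; assumes
| wal_level=logical and psycopg v3:)
|
|     import psycopg
|
|     with psycopg.connect("dbname=postgres", autocommit=True) as conn:
|         conn.execute("SELECT pg_create_logical_replication_slot"
|                      "('demo_slot', 'test_decoding')")
|         conn.execute("CREATE TABLE t (id int PRIMARY KEY)")
|         conn.execute("INSERT INTO t VALUES (1)")
|         # each row is the decoded, "logical" form of WAL records
|         for (data,) in conn.execute(
|             "SELECT data FROM pg_logical_slot_get_changes"
|             "('demo_slot', NULL, NULL)"
|         ):
|             print(data)  # e.g. table public.t: INSERT: id[integer]:1
|         conn.execute("SELECT pg_drop_replication_slot('demo_slot')")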
| hinkley wrote:
| But in a power loss the WAL is also read to restore any
| transactions that were in flight, right?
| llimllib wrote:
| @drewsberry: I wish you had an RSS feed! I tried to subscribe to
| your blog but if there is one it's not linked.
|
| (Enjoyed the post)
| anotherevan wrote:
| Found it: https://drew.silcock.dev/rss.xml
| drewsberry wrote:
| Thanks for the feedback, I really appreciate it :-) I've added
| the RSS feed to my home page now, as the other poster noted the
| URL is https://drew.silcock.dev/rss.xml.
| mharig wrote:
| > Can't we just store some data on disk and read / write from it
| when we need to? (Spoiler: no.)
|
| I disagree. SQLite does a good job in uniting the 2 worlds:
| complex SQL queries with excellent data consistency and simple
| file(s). Although SQLite is for sure not the one size fits all
| solution.
| berkes wrote:
| > SQLite is for sure not the one size fits all solution
|
| Nor is Postgres. PG is surprisingly versatile. E.g. with some
| extensions can be used as key-value storage (hashtable),
| document database, time-series db and so on. And it works quite
| well. Beyond "good enough" for many use cases. Added benefit,
| aside from having to run only one db-server, is that you can
| mix it: part relational, part document, etc.
|
| But the PG versions hardly ever get as good as focused,
| dedicated solutions get. Which makes sense if you think about
| it: a team developing a dedicated key-value storage that does
| that and only that, for years, will always produce a better
| key-value storage than one bolted onto a generic RDBMS.
|
| A practical example was where we used the ltree extension to
| store ever-growing hierarchies. We needed access control over
| subtrees (so that the X report for John only includes the
| entities of John's division and lower). While it worked in PG,
| it turned out that "simply replacing" it with OpenLDAP, which
| had all this built in, made it faster, simpler, and above all
| easier to maintain.
| ozim wrote:
| But SQLite doesn't do concurrency on writes: you lock the whole
| file. Other db engines deal with row/table locks, concurrent
| connections, etc.
| indoordin0saur wrote:
| This URL is blocked by my company's network because of a certain
| substring in the URL lol
| crngefest wrote:
| What, why? Would people browse bigcocks.net unless it's
| explicitly blocked? What about cox?
|
| Is "tube" on a blocklist as well?
| davidmurdoch wrote:
| Because a sales person at some "security" company convinced a
| CTO that it was a good idea.
| actionfromafar wrote:
| The security company just follows "best practices" :)
| indoordin0saur wrote:
| It is idiotic, yes. This feature is certainly outsourced to
| our mediocre IT contractor
| MattJ100 wrote:
| A classic case of the Scunthorpe problem:
| https://en.wikipedia.org/wiki/Scunthorpe_problem
|
| In this case the substring is part of the author's name. Such
| names are not at all uncommon.
| asciii wrote:
| I just saw something similar with a user on here dangsux...
|
| Apparently it might be "Dang's UX" and not against the mod.
| ¯\_(ツ)_/¯
| hinkley wrote:
| Alistair Cockburn is one of the signatories of the Agile
| Manifesto.
|
| You may have heard of him more recently with the Hexagonal
| Architecture approach.
| drewsberry wrote:
| There are dozens of us!
| fullstop wrote:
| I was on a mailing list once where messages were blocked
| because of m _sexchange_ being in the headers.
| Izkata wrote:
| Any relation to ExpertsExchange?
| fullstop wrote:
| Ah, this was before ExpertsExchange existed. I think that
| it was Corel that had the mail filter, and it was related
| to Corel Linux.
|
| https://en.wikipedia.org/wiki/Corel_Linux
| hinkley wrote:
| You mean ExpertSexChange? If you're gonna capitalize it, do
| it right.
| forinti wrote:
| I once worked for a company that blocked Cuban sites because of
| .cu (which is the Portuguese word for the end of your digestive
| system), but did not block porn sites (or so I was told ;-).
| hinkley wrote:
| That seems like the sort of thing you check on your last day
| as you're going out the door.
|
| "The rumors are true!"
|
| (Although less amusing, you could also just ask the IT guys
| and gals)
| CodesInChaos wrote:
| Are you sure it wasn't because of the US embargo on Cuba?
| Companies outside the US often participate(d) as well,
| because they want to do business with the US and US
| companies.
| forinti wrote:
| The proxy replied with a message stating that the site was
| blocked because it was pornographic.
| LVB wrote:
| I remember us once giving a supplier access to our internal bug
| tracker for a collaborative project. They were unable to get to
| the ".../openissue" endpoint.
| twic wrote:
| My company also blocks it, but as phishing.
| drewsberry wrote:
| Mine blocked it when I first created it; you can usually flag
| an incorrectly classified site and they'll correct it (I assure
| you that I'm not doing any phishing).
| drewsberry wrote:
| Well if it isn't my arch-nemesis - my legally designated name.
| Maybe I should've gone for something with just my first name
| like drewsexpertblog.dev
| Brian_K_White wrote:
| Shut off like a petcock.
| shreddit wrote:
| Why does closing the table of contents open the nav menu on
| mobile?
| ZoomerCretin wrote:
| On mobile, it appears to open regardless of where you tap. This
| appears to be the culprit:
|
| ```
| const n = document.getElementById('nav-header');
| document.addEventListener('click', s => {
|   u.hidden || s.target === null || n === null ||
|     n.contains(s.target) || r();
| });
| ```
|
| Above, in the same function, there is also
| `e.addEventListener('click', r);`, which is likely closer to
| what the author intended. The document-level listener above
| fires on any click anywhere on the page, which opens the nav
| menu when it shouldn't.
| sgarland wrote:
| If anyone is interested in contrasting this with InnoDB (MySQL's
| default engine), Jeremy Cole has an outstanding blog series [0]
| going into incredible detail.
|
| [0]: https://blog.jcole.us/innodb/
| westurner wrote:
| Apache Arrow Columnar Format:
| https://arrow.apache.org/docs/format/Columnar.html :
|
| > _The Arrow columnar format includes a language-agnostic in-
| memory data structure specification, metadata serialization,
| and a protocol for serialization and generic data transport._
|
| > _This document is intended to provide adequate detail to
| create a new implementation of the columnar format without the
| aid of an existing implementation. We utilize Google's
| Flatbuffers project for metadata serialization, so it will be
| necessary to refer to the project's Flatbuffers protocol
| definition files while reading this document._
|
| > _The columnar format has some key features:_
|
| > _Data adjacency for sequential access (scans)_
|
| > _O(1) (constant-time) random access_
|
| > _SIMD and vectorization-friendly_
|
| > _Relocatable without "pointer swizzling", allowing for true
| zero-copy access in shared memory_
|
| Are the major SQL file formats already SIMD optimized and zero-
| copy across TCP/IP?
|
| Arrow doesn't do full or partial indexes.
|
| Apache Arrow supports Feather and Parquet on-disk file formats.
| Feather is on-disk Arrow IPC, now with default LZ4 compression
| or optionally ZSTD.
|
| Some databases support Parquet as the database flat file format
| (that a DBMS process like PostgreSQL or MySQL provides a
| logged, permissioned, and cached query interface with query
| planning to).
|
| IIUC with Parquet it's possible both to use normal tools to
| offline query data tables as files on disk and also to online
| query tables with a persistent process with tunable parameters
| and optionally also centrally enforce schema and referential
| integrity.
|
| From
| https://stackoverflow.com/questions/48083405/what-are-the-di...
| :
|
| > _Parquet format is designed for long-term storage, where
| Arrow is more intended for short term or ephemeral storage_
|
| > _Parquet is more expensive to write than Feather as it
| features more layers of encoding and compression. Feather is
| unmodified raw columnar Arrow memory. We will probably add
| simple compression to Feather in the future._
|
| > _Due to dictionary encoding, RLE encoding, and data page
| compression, Parquet files will often be much smaller than
| Feather files_
|
| > _Parquet is a standard storage format for analytics that's
| supported by many different systems:_ Spark, Hive, Impala,
| various AWS services, in future by BigQuery, etc. _So if you
| are doing analytics, Parquet is a good option as a reference
| storage format for query by multiple systems_
|
| Those systems index Parquet. Can they also index Feather IPC,
| which an application might already have to journal and/or log,
| and checkpoint?
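|
| (For concreteness, a minimal pyarrow sketch of the two on-disk
| formats - assumes "pip install pyarrow"; file names arbitrary:)
|
|     import pyarrow as pa
|     import pyarrow.feather as feather
|     import pyarrow.parquet as pq
|
|     table = pa.table({"country": ["Equatorial Guinea", "Iceland"],
|                       "population": [1_700_000, 380_000]})
|
|     feather.write_feather(table, "t.feather")  # Arrow IPC, LZ4
|     pq.write_table(table, "t.parquet")  # heavier encodings
|
|     print(feather.read_table("t.feather")
|           .equals(pq.read_table("t.parquet")))  # expect True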
|
| Edit: What are some of the DLT solutions for indexing given a
| consensus-controlled message spec designed for synchronization?
|
| - cosmos/iavl: a Merkleized AVL+ tree (a balanced search tree
| with Merkle hashes and snapshots to prevent tampering and
| enable synchronization)
| https://github.com/cosmos/iavl/blob/master/docs/overview.md
|
| - google/trillian has Merkle hashed edges between rows in order
| in the table, but is centralized
|
| - "EVM Query Language: SQL-Like Language for Ethereum" (2024)
| https://news.ycombinator.com/item?id=41124567 : [...]
| dfox wrote:
| > Then, there's a random byte like 0x25 or 0x07 followed by the
| column data - the rest of the columns are string types so they're
| all stored in UTF-8. If you know what these inter-column bytes
| mean, leave a comment below! I can't figure it out.
|
| Next paragraph mentions TOAST and this byte is related to that.
| On little-endian platforms, the low-order bits of the first
| byte determine how the value is stored: a low bit of 0 means a
| 4-byte header whose first 4 bytes are the total length (low two
| bits 00 for plain inline data, 10 for compressed), a low bit of
| 1 means a 1-byte header for values up to 127 bytes total (the
| length is first byte >> 1), and the single byte 0x01 marks a
| pointer to a value stored out of line in the TOAST table. So
| for 0x25 the low bit is 1, and the length is 0x25 >> 1 = 18:
| the header byte itself followed by the 17 bytes of "Equatorial
| Guinea".
|
| Edit: the reason why endianness matters is that the same
| representation is also used in memory, where the whole first
| word is interpreted as one length value. The tag bits have to
| be in the first byte, which on big endian is most easily done
| by placing them in the two highest-order bits of that word -
| which is why they end up in the two highest bits of the byte
| there.
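|
| (A minimal decoder for that first byte, little-endian case,
| following the rules in src/include/postgres.h:)
|
|     def classify_varlena(b: int) -> str:
|         if b == 0x01:
|             return "out-of-line TOAST pointer"
|         if b & 0x01:  # 1-byte header; length includes the byte
|             return f"short inline value, {(b >> 1) & 0x7F} bytes"
|         if b & 0x03 == 0x02:
|             return "4-byte header, compressed inline value"
|         return "4-byte header, uncompressed inline value"
|
|     print(classify_varlena(0x25))
|     # short inline value, 18 bytes
|     # (the byte itself + the 17 bytes of "Equatorial Guinea")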
| jayde2767 wrote:
| Just curious if anyone else encountered this same error from the
| initial "docker run" command:
|
| docker: Error response from daemon: create ./pg-data: "./pg-data"
| includes invalid characters for a local volume name, only
| "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass
| a host directory, use absolute path.
| topherjaynes wrote:
| Honestly, I couldn't click this link quick enough and it
| delivered. Looking forward to the "oversized values" follow-up!
| nemothekid wrote:
| When I started my dev career, NoSQL was the rage and I remember
| reading about BigTable, Cassandra, Dynamo, and most importantly
| LSMs. They made a big deal about how the data on stored on disk
| was sorted. I never knew why this was a big deal but always kept
| it in mind, but I never bothered to understand how it was done
| previously.
|
| > _Something really important about tables which isn't obvious at
| first is that, even though they might have sequential primary
| keys, tables are not ordered._
|
| This was very surprising to read.
| Izkata wrote:
| I'mma pop up again with this, since it's not mentioned -
| there's a CLUSTER command that lets you reorder the table data
| to match an index. It's a one-off so you'll need to run it
| regularly from a crontab or something, but it's important to be
| aware of because postgres keeps a "correlation" statistic
| between the indexes and the order on disk. It affects the query
| planner, biasing it against random disk access and towards
| sequential disk access. It's one of the possible reasons
| postgres might not use an index that otherwise makes sense -
| the random disk access penalty is too high.
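|
| (A minimal sketch of both halves of that - assumes psycopg v3
| and a hypothetical table "t" with index "t_pkey"; note CLUSTER
| takes an exclusive lock while it rewrites the table:)
|
|     import psycopg
|
|     with psycopg.connect("dbname=postgres", autocommit=True) as conn:
|         # one-off rewrite of the heap in index order
|         conn.execute("CLUSTER t USING t_pkey")
|         conn.execute("ANALYZE t")  # refresh statistics
|         # the planner's view of on-disk ordering;
|         # 1.0 = perfectly sequential
|         print(conn.execute(
|             "SELECT attname, correlation FROM pg_stats "
|             "WHERE tablename = 't'"
|         ).fetchall())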
___________________________________________________________________
(page generated 2024-08-05 23:00 UTC)