[HN Gopher] How Postgres stores data on disk - this one's a page...
       ___________________________________________________________________
        
       How Postgres stores data on disk - this one's a page turner
        
       Author : drewsberry
       Score  : 394 points
       Date   : 2024-08-05 08:28 UTC (14 hours ago)
        
 (HTM) web link (drew.silcock.dev)
 (TXT) w3m dump (drew.silcock.dev)
        
       | HPsquared wrote:
       | Title missing the leading "How"
        
         | theblazehen wrote:
          | I believe HN auto-strips such "filler" words; it needs to
          | be added back by a mod.
        
           | andersa wrote:
           | Perhaps it should not? Why would such a silly feature be
           | explained over and over rather than _removed_?
        
             | codr7 wrote:
             | Yeah, especially when it changes the meaning in silly ways.
             | 
             | My brain spent quite some time trying to figure out what
             | the headline was trying to say.
        
             | codetrotter wrote:
             | Because there are a lot of articles in the news that add
             | this word for no reason.
        
               | nandhinianand wrote:
                | Ahhh... wouldn't it be better to use a threshold
                | based on the percentage of filler words out of
                | total words in the title? A feature that simply
                | strips filler words from titles isn't always
                | useful, and I would argue it's rather harmful.
        
               | shakna wrote:
               | Humans work better. HN is small scale enough that a
               | moderator can come along, collapse the off topic comments
               | and fix the title, and it's not an issue.
        
             | candiddevmike wrote:
             | HN does a lot of title editorializing. Like the one a few
             | days ago that turned the acronym RATs into Rats and
             | completely changed the meaning.
        
           | drewsberry wrote:
           | I did add it into the title originally, must've been
           | stripped. Thanks for the heads up, didn't know HN had this
           | "useful feature".
        
           | eatonphil wrote:
           | The submitter can edit the title after it has been auto-
           | modified. It doesn't take a mod to do it unless some amount
           | of time has passed.
        
             | drewsberry wrote:
             | Can't see any way of modifying the title, unless I'm
             | missing something - maybe I missed the edit window.
        
               | eatonphil wrote:
               | Yeah, after an hour, you've probably missed it. (Source:
               | I frequently edit titles of posts I submit to HN because
               | HN changed it to nonsense. :) )
        
               | drewsberry wrote:
               | Thanks for letting me know, I'll make sure to edit it
               | quickly next time :-)
        
         | eitland wrote:
         | IIRC, if the original submitter edits the title once it has
         | been posted, the edited version sticks, i.e. the filter only
         | works the first time and you can override it if you notice it.
        
       | nraynaud wrote:
        | A bit of curiosity: how did Postgres settle on 8k pages?
        | Shouldn't it be the FS page size, to help with atomicity?
        
         | silvestrov wrote:
         | Depends very much on how the SSDs are designed internally. I
         | think these days we have to settle for "can never be sure" of
         | the real page size for atomicity. Pages can also become corrupt
         | in other ways.
         | 
          | It is weird that "--data-checksums" isn't the default for
          | new databases, even if it costs a bit of performance.
          | Integrity should be more important than performance.
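          | 
          | (For new clusters that's `initdb --data-checksums`; and
          | if I remember right, since v12 an existing, cleanly
          | shut-down cluster can be converted in place with
          | `pg_checksums --enable`.)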
        
           | isosphere wrote:
           | Was thinking the same thing when I saw those zeros in the
           | checksum field. Perhaps the consequences are significant.
           | 
           | Here's a benchmarking exercise I found: https://www-
           | staging.commandprompt.com/uploads/images/Command...
           | 
           | With a tidy summary:
           | 
            | > Any application with a high shared buffers hit
            | ratio: little difference.
            | 
            | > Any application with a high ratio of reads/writes:
            | little difference.
            | 
            | > Data logging application with a low ratio of
            | reads/inserts, and few updates and deletes: little
            | difference.
            | 
            | > Application with an equal ratio of reads/inserts, or
            | many updates or deletes, and a low shared buffers hit
            | ratio (for example, an ETL workload), especially where
            | the rows are scattered among disk pages: expect double
            | or greater CPU and disk I/O use.
            | 
            | > Run pg_dump on a database where all rows have
            | already been previously selected by applications:
            | little difference.
            | 
            | > Run pg_dump on a database with large quantities of
            | rows inserted to insert-only tables: expect roughly
            | double CPU and disk I/O use.
        
             | silvestrov wrote:
             | On my M1 mac "dd ... | cksum" takes 3 seconds while "dd |
             | shasum" (sha1) takes 2 seconds. So cksum might not be the
             | best tool for performance checking.
             | 
             | There is CPU specific code in the PG source in
             | src/include/storage/checksum_impl.h
             | 
             | It is written as a plain nested loop in C. So performance
             | is fully dependent on the compiler being able to
             | parallelize or vectorize the code.
             | 
             | I would not be surprised if manually written SIMD code
             | would be faster.
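              | 
              | Roughly what that nested loop does, transcribed to
              | Python (my paraphrase of checksum_impl.h from
              | memory, not the exact code - the 32 independent
              | lanes are what make it vectorizable):
              | 
              |   FNV_PRIME = 16777619
              |   N_SUMS = 32          # independent lanes
              |   M = 0xFFFFFFFF       # wrap to uint32
              |   
              |   def mix(acc, val):
              |       tmp = acc ^ val
              |       return ((tmp * FNV_PRIME) & M) ^ (tmp >> 17)
              |   
              |   def checksum_page(words, seeds):
              |       # words: the 8 kB page as 2048 uint32s
              |       # seeds: 32 per-lane starting values
              |       sums = list(seeds)
              |       for i in range(0, len(words), N_SUMS):
              |           for j in range(N_SUMS):
              |               sums[j] = mix(sums[j], words[i + j])
              |       # two zero rounds for extra mixing, then
              |       # xor-fold the lanes into one value
              |       for _ in range(2):
              |           sums = [mix(s, 0) for s in sums]
              |       result = 0
              |       for s in sums:
              |           result ^= s
              |       return result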
        
               | Joe_Cool wrote:
               | Interesting. I guess M1 doesn't have the 'crc32'
               | "acceleration" that is included in SSE4.2.
        
               | loeg wrote:
               | M1 has CRC32 acceleration intrinsics.
               | 
               | https://dougallj.wordpress.com/2022/05/22/faster-
               | crc32-on-th...
               | 
               | https://github.com/corsix/fast-crc32?tab=readme-ov-
               | file#appl...
        
               | silvestrov wrote:
               | So when using these intrinsics an Intel Core i7 can do 30
               | GB/s but the performance check linked above (by isosphere
               | ) says only 300 MB/s, i.e. 1%
               | 
               | Something is amiss here.
               | 
               | If a CPU can do 30 GB/s then a CRC check should not have
               | any real performance impact.
        
               | anarazel wrote:
                | The bottleneck isn't the checksum computation
                | itself at all. It's that to keep checksums valid we
                | need to protect against the potential of torn pages
                | even in cases where it doesn't matter without
                | checksums (i.e. where just individual bits are
                | flipped). That in turn means we need to WAL-log
                | changes we wouldn't need to without checksums -
                | which can be painful.
        
           | speed_spread wrote:
           | If physical integrity is already provided by the backing
           | filesystem, such as ZFS, wouldn't --data-checksums be
           | redundant?
        
             | magicalhippo wrote:
                | If data is served from the ARC, its primary cache,
                | ZFS does not perform a checksum check before
                | handing it to you, as the data was checked when it
                | was read into the ARC.
             | 
             | If you use ECC you're quite safe, but ECC can't detect
             | multi-bit errors, just single and double bit errors.
             | 
             | So if you care much about your integrity, you might want
             | Postgres to do its checksum check as well.
        
               | ikiris wrote:
               | If you somehow have 3+ bit errors coming out of ram on an
               | ECC board, you have _much_ bigger problems than trying to
               | verify your postgres data via checksum.
        
               | magicalhippo wrote:
               | AFAIK RowHammer attacks can cause multiple bits to flip
               | in a row[1], no?
               | 
               | But sure, it's not for the vast majority of folks.
               | 
               | [1]: https://www.vusec.net/projects/eccploit/
        
               | ikiris wrote:
               | Rowhammer would qualify as a bigger problem, yes.
        
           | tremon wrote:
            | I'm not sure how much bearing internal storage
            | organization should have on Postgres' page size. Since
            | pg explicitly chooses not to implement its own storage
            | organization layer, there's always a filesystem between
            | a pg database and the underlying storage.
        
           | berkes wrote:
           | > Integrity should be more important than performance.
           | 
            | Most often it is. But not always. There certainly are
            | cases where speed is far more important than integrity
            | in databases. I cannot think of a case where this would
            | be true for an RDBMS or even a document DB (though
            | MongoDB had different opinions on this...).
            | 
            | But e.g. redis as a caching server, or memcached, or
            | even the non-normalized data that I have in PG that can
            | easily be reproduced from other sources in case of
            | corruption or staleness: it's fine to trade integrity
            | for speed there.
        
           | loeg wrote:
           | The 8k size long predates SSDs or 4k sectors.
        
         | da_chicken wrote:
         | 8k is a very common page size, but 4k isn't unheard of.
         | Oracle's default is 4k.
         | 
          | The issue is that page size caps row size (for on-row
          | storage). Also, if you have a smart clustering index,
          | larger pages can make more efficient use of index
          | addressing. So it's a trade-off.
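          | 
          | (Concretely: with the default 8 kB BLCKSZ a heap tuple
          | has to fit inside a single page, which is why Postgres
          | pushes anything bigger out of line via TOAST.)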
        
           | hans_castorp wrote:
           | Oracle defaults to 8k as well:
           | 
           | https://docs.oracle.com/en/database/oracle/oracle-
           | database/1...
           | 
           | > Default value 8192
        
         | anarazel wrote:
        | Historically there was no atomicity at 4k boundaries, just
        | at 512 byte boundaries (sectors). That'd have been too
        | limiting. Lowering the limit now would prove problematic
        | due to the smaller row sizes / lower number of columns.
        
       | dfox wrote:
       | > This process of retrieving the expected database state from the
       | WAL is called logical decoding and Postgres stores files related
       | to this process in here.
       | 
       | While logical decoding is about WAL, it is not related to the
       | recovery process. Logical decoding is a mechanism to convert the
       | WAL entries back into the high-level operations that caused the
       | WAL entries, for example for replication or audit.
        
         | drewsberry wrote:
         | Very good point, I've rephrased this.
        
           | dfox wrote:
           | You still refer to logical decoding as part of recovery in
           | the last section. The main point is that WAL by itself is
           | designed only for data recovery and only contains information
           | about what is going to be written where in terms of
           | essentially raw disk accesses (notably, the writes do not
           | have to be page aligned).
           | 
           | Logical decoding (which needs wal_level=logical which extends
           | the WAL format with additional metadata) is about parsing the
           | WAL for other purposes than performing the recovery (or
           | physical replication, which is essentially the same thing as
           | recovery, but performed on another instance of the same
           | cluster). The name "logical decoding" is certainly intended
           | to emphasize that there are other uses for that than logical
           | replication, but these are not that different from logical
           | replication on this level (get a stream of changed tuples in
           | tables).
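            | 
            | (A quick way to see that side of it, assuming
            | wal_level=logical: create a slot with SELECT
            | pg_create_logical_replication_slot('s1',
            | 'test_decoding'); and then SELECT * FROM
            | pg_logical_slot_get_changes('s1', NULL, NULL); prints
            | each decoded change as plain text.)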
        
         | hinkley wrote:
          | But in a power loss the WAL is also read back to restore
          | any transactions that were in flight, right?
        
       | llimllib wrote:
       | @drewsberry: I wish you had an RSS feed! I tried to subscribe to
       | your blog but if there is one it's not linked.
       | 
       | (Enjoyed the post)
        
         | anotherevan wrote:
         | Found it: https://drew.silcock.dev/rss.xml
        
         | drewsberry wrote:
         | Thanks for the feedback, I really appreciate it :-) I've added
         | the RSS feed to my home page now, as the other poster noted the
         | URL is https://drew.silcock.dev/rss.xml.
        
       | mharig wrote:
       | > Can't we just store some data on disk and read / write from it
       | when we need to? (Spoiler: no.)
       | 
        | I disagree. SQLite does a good job of uniting the two
        | worlds: complex SQL queries with excellent data
        | consistency, and simple file(s). Although SQLite is for
        | sure not a one-size-fits-all solution.
        
         | berkes wrote:
         | > SQLite is for sure not the one size fits all solution
         | 
          | Nor is Postgres. PG is surprisingly versatile. E.g. with
          | some extensions it can be used as a key-value store
          | (hashtable), document database, time-series db and so on.
          | And it works quite well - beyond "good enough" for many
          | use cases. An added benefit, aside from having to run
          | only one db-server, is that you can mix it: part
          | relational, part document, etc.
         | 
          | But the PG versions hardly ever get as good as focused,
          | dedicated solutions do. Which makes sense if you think
          | about it: a team developing a dedicated key-value store
          | that does that and only that, for years, will always
          | produce a better key-value store than one bolted onto a
          | generic RDBMS.
         | 
          | A practical example was where we used the ltree extension
          | to store ever-growing hierarchies. We needed access
          | control over subtrees (so that the X report for John only
          | includes the entities of John's division and below).
          | While it worked in PG, it turned out that "simply
          | replacing" it with OpenLDAP, which had all of this built
          | in, made it faster, simpler, and above all easier to
          | maintain.
        
         | ozim wrote:
          | But SQLite doesn't do concurrent writes: you lock the
          | whole file. Other db engines deal with row/table locks,
          | concurrent connections, etc.
        
       | indoordin0saur wrote:
       | This URL is blocked by my company's network because of a certain
       | substring in the URL lol
        
         | crngefest wrote:
          | What, why? Would people browse bigcocks.net unless it's
          | explicitly blocked? What about cox?
          | 
          | Is "tube" on a blocklist as well?
        
           | davidmurdoch wrote:
            | Because a sales person at some "security" company
            | convinced a CTO that it was a good idea.
        
             | actionfromafar wrote:
             | The security company just follows "best practices" :)
        
           | indoordin0saur wrote:
            | It is idiotic, yes. This feature is certainly
            | outsourced to our mediocre IT contractor.
        
         | MattJ100 wrote:
         | A classic case of the Scunthorpe problem:
         | https://en.wikipedia.org/wiki/Scunthorpe_problem
         | 
         | In this case the substring is part of the author's name. Such
         | names are not at all uncommon.
        
           | asciii wrote:
            | I just saw something similar with a user on here,
            | dangsux...
            | 
            | Apparently it might be about Dang's UX and not against
            | the mod. ¯\_(ツ)_/¯
        
           | hinkley wrote:
           | Alistair Cockburn is one of the signatories of the Agile
           | Manifesto.
           | 
           | You may have heard of him more recently with the Hexagonal
           | Architecture approach.
        
           | drewsberry wrote:
           | There are dozens of us!
        
         | fullstop wrote:
          | I was on a mailing list once where messages were blocked
          | because of m_sexchange_ being in the headers.
        
           | Izkata wrote:
           | Any relation to ExpertsExchange?
        
             | fullstop wrote:
             | Ah, this was before ExpertsExchange existed. I think that
             | it was Corel that had the mail filter, and it was related
             | to Corel Linux.
             | 
             | https://en.wikipedia.org/wiki/Corel_Linux
        
             | hinkley wrote:
             | You mean ExpertSexChange? If you're gonna capitalize it, do
             | it right.
        
         | forinti wrote:
         | I once worked for a company that blocked Cuban sites because of
         | .cu (which is the Portuguese word for the end of your digestive
         | system), but did not block porn sites (or so I was told ;-).
        
           | hinkley wrote:
           | That seems like the sort of thing you check on your last day
           | as you're going out the door.
           | 
           | "The rumors are true!"
           | 
           | (Although less amusing, you could also just ask the IT guys
           | and gals)
        
           | CodesInChaos wrote:
           | Are you sure it wasn't because of the US embargo on Cuba?
           | Companies outside the US often participate(d) as well,
           | because they want to do business with the US and US
           | companies.
        
             | forinti wrote:
             | The proxy replied with a message stating that the site was
             | blocked because it was pornographic.
        
         | LVB wrote:
         | I remember us once giving a supplier access to our internal bug
         | tracker for a collaborative project. They were unable to get to
         | the ".../openissue" endpoint.
        
         | twic wrote:
         | My company also blocks it, but as phishing.
        
           | drewsberry wrote:
            | Mine blocked it when I first created it. You can
            | usually flag an incorrectly classified site and they'll
            | correct it (I assure you that I'm not doing any
            | phishing).
        
         | drewsberry wrote:
         | Well if it isn't my arch-nemesis - my legally designated name.
         | Maybe I should've gone for something with just my first name
         | like drewsexpertblog.dev
        
         | Brian_K_White wrote:
         | Shut off like a petcock.
        
       | shreddit wrote:
       | Why does closing the table of contents open the nav menu on
       | mobile?
        
         | ZoomerCretin wrote:
         | On mobile, it appears to open regardless of where you tap. This
         | appears to be the culprit:
         | 
          | ```
          | const n = document.getElementById('nav-header');
          | document.addEventListener('click', s => {
          |   u.hidden ||
          |     s.target === null ||
          |     n === null ||
          |     n.contains(s.target) ||
          |     r()
          | });
          | ```
         | 
          | Above, in the same function, there is also
          | `e.addEventListener('click', r);`, which is likely closer
          | to what the author intended. The document-level listener
          | shown here fires on any click anywhere on the page, which
          | opens the nav menu when it shouldn't.
        
       | sgarland wrote:
       | If anyone is interested in contrasting this with InnoDB (MySQL's
       | default engine), Jeremy Cole has an outstanding blog series [0]
       | going into incredible detail.
       | 
       | [0]: https://blog.jcole.us/innodb/
        
         | westurner wrote:
         | Apache Arrow Columnar Format:
         | https://arrow.apache.org/docs/format/Columnar.html :
         | 
         | > _The Arrow columnar format includes a language-agnostic in-
         | memory data structure specification, metadata serialization,
         | and a protocol for serialization and generic data transport._
         | _This document is intended to provide adequate detail to create
         | a new implementation of the columnar format without the aid of
         | an existing implementation. We utilize Google's Flatbuffers
         | project for metadata serialization, so it will be necessary to
         | refer to the project's Flatbuffers protocol definition files
         | while reading this document._ _The columnar format has some key
         | features:_
         | 
         | > _Data adjacency for sequential access (scans)_
         | 
         | > _O(1) (constant-time) random access_
         | 
         | > _SIMD and vectorization-friendly_
         | 
         | > _Relocatable without "pointer swizzling", allowing for true
         | zero-copy access in shared memory_
         | 
         | Are the major SQL file formats already SIMD optimized and zero-
         | copy across TCP/IP?
         | 
         | Arrow doesn't do full or partial indexes.
         | 
         | Apache Arrow supports Feather and Parquet on-disk file formats.
         | Feather is on-disk Arrow IPC, now with default LZ4 compression
         | or optionally ZSTD.
         | 
          | Some databases support Parquet as the database flat file
          | format, i.e. the format that a DBMS process (like
          | PostgreSQL or MySQL) then provides a logged, permissioned,
          | and cached query interface with query planning on top of.
         | 
         | IIUC with Parquet it's possible both to use normal tools to
         | offline query data tables as files on disk and also to online
         | query tables with a persistent process with tunable parameters
         | and optionally also centrally enforce schema and referential
         | integrity.
         | 
         | From https://stackoverflow.com/questions/48083405/what-are-the-
         | di... :
         | 
         | > _Parquet format is designed for long-term storage, where
         | Arrow is more intended for short term or ephemeral storage_
         | 
         | > _Parquet is more expensive to write than Feather as it
         | features more layers of encoding and compression. Feather is
         | unmodified raw columnar Arrow memory. We will probably add
         | simple compression to Feather in the future._
         | 
         | > _Due to dictionary encoding, RLE encoding, and data page
         | compression, Parquet files will often be much smaller than
         | Feather files_
         | 
          | > _Parquet is a standard storage format for analytics
          | that's supported by many different systems:_ Spark, Hive,
          | Impala, various AWS services, in future by BigQuery, etc.
          | _So if you are doing analytics, Parquet is a good option
          | as a reference storage format for query by multiple
          | systems_
         | 
         | Those systems index Parquet. Can they also index Feather IPC,
         | which an application might already have to journal and/or log,
         | and checkpoint?
         | 
         | Edit: What are some of the DLT solutions for indexing given a
         | consensus-controlled message spec designed for synchronization?
         | 
         | - cosmos/iavl: a Merkleized AVL+ tree (a balanced search tree
         | with Merkle hashes and snapshots to prevent tampering and
         | enable synchronization)
         | https://github.com/cosmos/iavl/blob/master/docs/overview.md
         | 
          | - google/trillian has Merkle hashed edges between rows in
          | order in the table but is centralized
         | 
         | - "EVM Query Language: SQL-Like Language for Ethereum" (2024)
         | https://news.ycombinator.com/item?id=41124567 : [...]
        
       | dfox wrote:
       | > Then, there's a random byte like 0x25 or 0x07 followed by the
       | column data - the rest of the columns are string types so they're
       | all stored in UTF-8. If you know what these inter-column bytes
       | mean, leave a comment below! I can't figure it out.
       | 
       | Next paragraph mentions TOAST and this byte is related to that.
       | The low order bits (on little endian platforms) determine whether
       | the value is stored inline (00, first 4 bytes are total length),
       | is stored in TOAST table (11) or is shorter than 127 bytes (01
       | for even length, 10 for odd length, the total length is first
       | byte >> 1). So for 0x25 you get 01, so length is 0x25 >> 1 = 18,
       | which is that byte followed by "Equatorial Guinea".
       | 
        | Edit: the reason why endianness matters is that the same
        | representation is also used in memory, and the whole first
        | word is interpreted as one length value. The toast tag bits
        | have to be in the first byte, which on big endian is most
        | easily done as the two highest-order bits of that word.
        | That means they are placed in the two highest bits of the
        | byte.
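        | 
        | For anyone who wants to poke at this, here's a little
        | Python sketch of the little-endian case as I understand it
        | from the VARATT_* macros in the Postgres headers (my own
        | toy helper, not Postgres code):
        | 
        |   def varlena_header(buf):
        |       # buf: raw column bytes on a little-endian platform
        |       b = buf[0]
        |       if b == 0x01:
        |           return ("toast pointer", None)
        |       if b & 0x01:
        |           # short inline value: 7-bit total length,
        |           # including the header byte itself
        |           total = b >> 1
        |           return ("short inline", buf[1:total])
        |       # 4-byte header: length in the upper 30 bits
        |       # (bit 1 would flag inline compression)
        |       word = int.from_bytes(buf[:4], "little")
        |       return ("long inline", buf[4:word >> 2])
        | 
        |   print(varlena_header(b"\x25Equatorial Guinea"))
        |   # ('short inline', b'Equatorial Guinea')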
        
       | jayde2767 wrote:
       | Just curious if anyone else encountered this same error from the
       | initial "docker run" command:
       | 
       | docker: Error response from daemon: create ./pg-data: "./pg-data"
       | includes invalid characters for a local volume name, only
       | "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass
       | a host directory, use absolute path.
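        | 
        | (The error text itself points at the fix: Docker parses a
        | relative path like ./pg-data as a named volume, so passing
        | an absolute host path instead - e.g. -v
        | "$PWD/pg-data:/var/lib/postgresql/data", adjusting the
        | container path to whatever the article mounts - should get
        | past it.)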
        
       | topherjaynes wrote:
       | Honestly, I couldn't click this link quick enough and it
       | delivered. Looking forward to the "oversized values" follow-up!
        
       | nemothekid wrote:
       | When I started my dev career, NoSQL was the rage and I remember
       | reading about BigTable, Cassandra, Dynamo, and most importantly
        | LSMs. They made a big deal about how the data stored on
        | disk was sorted. I never knew why this was a big deal but
        | always kept it in mind; I never bothered to understand how
        | it was done previously.
       | 
       | > _Something really important about tables which isn't obvious at
       | first is that, even though they might have sequential primary
       | keys, tables are not ordered._
       | 
       | This was very surprising to read.
        
         | Izkata wrote:
         | I'mma pop up again with this, since it's not mentioned -
         | there's a CLUSTER command that lets you reorder the table data
         | to match an index. It's a one-off so you'll need to run it
         | regularly from a crontab or something, but it's important to be
         | aware of because postgres keeps a "correlation" statistic
         | between the indexes and the order on disk. It affects the query
         | planner, biasing it against random disk access and towards
         | sequential disk access. It's one of the possible reasons
         | postgres might not use an index that otherwise makes sense -
         | the random disk access penalty is too high.
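          | 
          | (For example, with hypothetical names: CLUSTER countries
          | USING countries_pkey; rewrites the table in index order,
          | and a follow-up ANALYZE countries; refreshes the
          | correlation statistic the planner looks at - you can see
          | it in pg_stats.correlation.)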
        
       ___________________________________________________________________
       (page generated 2024-08-05 23:00 UTC)