[HN Gopher] Optimizing Postgres table layout for maximum efficiency
___________________________________________________________________
Optimizing Postgres table layout for maximum efficiency
Author : napsterbr
Score : 191 points
Date : 2024-10-06 15:50 UTC (2 days ago)
(HTM) web link (r.ena.to)
(TXT) w3m dump (r.ena.to)
| remus wrote:
| Very interesting, but I think the author overstates the
| importance of alignment a little. Unless your data/indexes are
| already of a challenging size for your hardware (or you expect
| them to be imminently), fiddling with byte alignment details
| feels like a fairly premature optimisation.
|
| Disk is cheap, memory is plentiful, your time is expensive etc.
| koolba wrote:
| > Disk is cheap, memory is plentiful, your time is expensive
| etc.
|
| Taking the time to know the in-memory sizes of your data
| types is well worth it. Taking the time to think about which
| types to use, and sorting columns by size, is also minimal
| effort and well worth it.
|
| It may make sense for the system to do it automatically for
| newly created tables. But maybe not, as it's possible you'd want
| the data layout to match some existing structure.
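|
| For reference, you can read size and alignment for common types
| straight from the catalog (a quick sketch):
|
|     SELECT typname, typlen, typalign
|     FROM pg_type
|     WHERE typname IN ('bool', 'int2', 'int4', 'int8',
|                       'float8', 'timestamp', 'text');
|
| Here typlen is the size in bytes (-1 means variable length) and
| typalign is the required alignment (c = 1 byte, s = 2, i = 4,
| d = 8).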
| egnehots wrote:
| As mentioned in the article, it's a good idea to consider this
| when creating a new table, since it's essentially a free
| optimization. However, it's probably not worth the hassle of
| reordering a production table for that.
|
| > Disk is cheap, memory is plentiful, but your time is
| expensive, etc.
|
| One thing to keep in mind, though, is that while you often have
| plenty of disk space, RAM is still relatively expensive. It's
| also divided into many smaller buffers, such as working memory
| and shared buffers, which are not that large. These
| optimizations help to fit more data into cache.
|
| However, what the article said about alignment being important
| for indexes is somewhat misleading. Reordering an index field
| is not the same as reordering columns in a table. Besides having
| to rewrite queries, it also changes the access pattern and the
| time required to access the data, which is often much more
| significant than the space saved. Indexes are, by nature, a
| tradeoff where you give up space to gain time, so this mindset
| doesn't really apply there.
| napsterbr wrote:
| Hey, author here.
|
| > Indexes are, by nature, a tradeoff where you give up space
| to gain time, so this mindset doesn't really apply there.
|
| I agree that (re)aligning indexes is a different beast
| entirely, but (as mentioned in my recommendation) ideally the
| developer should keep this in mind when creating the index
| initially.
|
| Factors like cardinality and even readability should take
| precedence over perfect alignment, but all else being equal,
| aligning your indexes from the very moment they are
| introduced in the codebase is the ideal scenario IMO.
| sgarland wrote:
| > Disk is cheap, memory is plentiful, your time is expensive
| etc.
|
| Spend 30 minutes one day playing around with Postgres, trying
| different column combinations out. Boom, you now know how best
| to order columns. This doesn't seem like a big ask.
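|
| A minimal sketch of that kind of experiment (table names are
| made up; exact numbers will vary):
|
|     CREATE TABLE padded (a boolean, b bigint, c boolean, d bigint);
|     CREATE TABLE packed (b bigint, d bigint, a boolean, c boolean);
|     INSERT INTO padded SELECT true, 1, true, 1
|       FROM generate_series(1, 1000000);
|     INSERT INTO packed SELECT 1, 1, true, true
|       FROM generate_series(1, 1000000);
|     SELECT pg_size_pretty(pg_relation_size('padded')) AS padded,
|            pg_size_pretty(pg_relation_size('packed')) AS packed;
|
| Same data, but the packed table avoids the 7 bytes of padding
| inserted before each bigint that follows a boolean.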
|
| The flip side is that changing data at scale is HARD, so if you
| put things like this off, when you do finally need to squeeze
| bytes, it's painful.
|
| Also, memory is absolutely not plentiful. That's generally the
| biggest bottleneck (or rather, the lack of it then makes IO the
| bottleneck) for an RDBMS, assuming you have connection pooling
| and aren't saturating the CPU with overhead.
| epgui wrote:
| > Disk is cheap, memory is plentiful, your time is expensive
| etc.
|
| Index size is not solely a storage concern. I also don't really
| care about how much disk space I pay for, but sometimes I care
| a lot about how long it takes to vacuum a table.
| koolba wrote:
| The content itself is fine but the unnecessary image above the
| label " _You have the right to remain aligned_ " with the usual
| AI-generated garbage text cheapens the entire article.
| loloquwowndueo wrote:
| Are you kidding, that image is hilarious.
|
| (Articles that intersperse annoying memes every 3 paragraphs
| with obnoxious gif animations are much much worse).
| wccrawford wrote:
| Interesting. But if this is important, why doesn't Postgres do it
| invisibly, automatically? Surely there's a most-correct sequence,
| with the longest values first, and the shortest values last, and
| then (probably) the variable-length values?
|
| It could then happily report back the data in the order that the
| DB Admin/developer expects.
| j45 wrote:
| Reordering columns in MySQL is pretty trivial, and I've
| wondered why it's not the same in Postgres, even at the GUI
| level.
|
| Ordering columns for performance might be a different order of
| recommendation.
| jeltz wrote:
| Because nobody has implemented it yet. A patch would be very
| welcome by many but nobody has written one.
| johnthuss wrote:
| This is a super useful thing to know and I'm thankful for this
| article highlighting this aspect of Postgres.
|
| But I would disagree with the takeaway to focus on optimizing
| your indexes and not your tables. The reason is that the order of
| columns in a multi-column index is highly meaningful and
| intentional, in order to support matching on a range of values
| for the last column in the index. The way databases work, you
| can only utilize a multi-column index on (customer_id int4,
| date timestamp) if you have an equality match on customer_id,
| like "WHERE customer_id = 1 AND date BETWEEN '2024-01-01' AND
| '2025-01-01'". If
| you reorder these columns in the index to put the larger date
| column first, then, sure, you save space in the index, but you
| also make it worthless - it will never be used by the query
| above. As such, optimizing a multi-column index is only useful
| when all the columns are queried for equality rather than a
| range.
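|
| Concretely, with a hypothetical orders table:
|
|     CREATE INDEX ON orders (customer_id, date);
|
|     -- Can use the index: equality on the leading column,
|     -- range on the second.
|     SELECT * FROM orders
|     WHERE customer_id = 1
|       AND date BETWEEN '2024-01-01' AND '2025-01-01';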
|
| In contrast, when you are creating a NEW table you might not
| think hard about the order of the columns in the table, and
| especially not about the data-sizes of each column and their
| alignment. But doing so at the time you create the table can be
| tremendously beneficial if it is going to be very large. It is
| important to note that you not only save space on-disk, but in
| precious RAM when the tuples are loaded.
| sgarland wrote:
| This (index ordering resulting in useless indices) is not true,
| at least not in newer versions of Postgres (I just tried with
| 15).
|
| While the query will take much longer (for me, it was about 47
| msec vs 0.6 msec for 2,000,000 rows), it still uses the index.
|
| Similarly, while normally you wouldn't expect a query using
| predicate Y to use an index defined on columns (X, Y, Z) if
| predicate X is also not referenced, Postgres may choose to do
| so, depending on table stats, and the relative difference
| between random_page_cost and seq_page_cost. I've seen it happen
| before.
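|
| Easy enough to verify on your own data with something like this
| (hypothetical table; leading column deliberately omitted from
| the predicate):
|
|     EXPLAIN (ANALYZE, BUFFERS)
|     SELECT * FROM orders
|     WHERE date BETWEEN '2024-01-01' AND '2025-01-01';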
| ghusbands wrote:
| > This (index ordering resulting in useless indices) is not
| true, at least not in newer versions of Postgres (I just
| tried with 15).
|
| > While the query will take much longer (for me, it was about
| 47 msec vs 0.6 msec for 2,000,000 rows), it still uses the
| index.
|
| I'd argue that something being 78x slower can make it pretty
| useless, but it is indeed at least used, in some cases.
| sgarland wrote:
| Certainly I wouldn't encourage this, no, but it _is_
| possible, and is still usually faster than a sequential
| scan.
| jeltz wrote:
| This is a limitation which is currently being worked on. The
| order will still matter of course but it will allow PostgreSQL
| to make some use of indexes even when the order of columns does
| not match.
|
| https://www.postgresql.org/message-id/CAH2-Wzmn1YsLzOGgjAQZd...
| johnthuss wrote:
| >>> This is feasible in cases where the total number of
| distinct values in the column 'a' is reasonably small (think
| tens or hundreds, perhaps even thousands for very large
| composite indexes).
|
| It's great this is improving, but this is a fairly narrow
| improvement. Personally, the multi-column indexes I use would
| not be improved by this change since column 'a' does not
| store a "reasonably small" distribution of values.
| rand_r wrote:
| This is sort of an aside, but a very interesting thing about
| Postgres is that it can efficiently combine independent column
| indexes together, so there is much less of a need, compared to
| older databases, to even create multi-column indexes. It's a
| feature from 8.1 called "Bitmap Scan". Basically, if you create
| an index on column X and an index on column Y, it can use them
| to do queries involving either or both columns pretty
| efficiently (for any number of columns).
|
| It's not as fast as a multi-column index, but the savings of
| not having to worry about all the combinations of columns that
| can be queried together could well be worth it.
|
| - https://www.postgresql.org/docs/release/8.1.0/
|
| - https://www.postgresql.org/docs/current/indexes-bitmap-
| scans...
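|
| A rough sketch of the idea (hypothetical table):
|
|     CREATE INDEX ON events (x);
|     CREATE INDEX ON events (y);
|
|     -- The planner can BitmapAnd (or BitmapOr) the two indexes:
|     EXPLAIN SELECT * FROM events WHERE x = 1 AND y = 2;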
| Pamar wrote:
| Bitmap Scan sounds a lot like Rushmore technology in
| Foxpro[1]. Are they the same?
|
| 1) https://en.wikipedia.org/wiki/FoxPro
|
| It is difficult to find a complete explanation for Rushmore
| nowadays. From what I remember, it would create a bitmap
| where each bit represented the nth record of the table you
| wanted to search, then with a single, fast sequential scan it
| would set the nth bit to 1 if the record satisfied all
| clauses of your search, 0 otherwise.
|
| Try to see if this makes any sense to you:
| http://www.foxpert.com/docs/howfoxproworks.en.htm
| williamdclt wrote:
| It's very cool, but at high throughput you really see the
| difference. This bitmap scan can take huge amounts of CPU,
| which drops to nothing (and gets much faster) once you set up
| a proper multi-column index.
|
| On small/medium tables and lowish throughput though, yeah
| it's often good enough and avoids having many indexes for
| specific use cases (which is a cost in itself, in
| memory/cpu/storage)
| jamespo wrote:
| Do you have any benchmarks? Be interesting to compare.
| qazxcvbnm wrote:
| Is there any potential that indexes could be created over
| foreign key joins in the future? I know that as of today, no
| multi-table indices or statistics exist for Postgres, which
| has led me to do some further denormalisations.
| rtuin wrote:
| It's really one of those things where you don't know how it
| applies to your Postgres DB until you run into a situation where
| you do.
|
| The author explains this very well; it's a good read! I learned
| about this padding a little over a year ago, while I was
| designing a data-intensive application with a colleague. I was
| skeptical about the advantage at first, but for our specific
| design, where we have 100 to 480+ columns in one table, it makes
| a huge difference in table store size. Not so much on the
| indexes, though.
| sgarland wrote:
| > where we have 100 to 480+ columns in one table
|
| I'm sorry, what? Why?
| rtuin wrote:
| Glad you asked!
|
| This system contains measurements and state of physical
| devices (time series). It's designed for both heavy write and
| read, with slight emphasis on write. Each table is one type
| of device and contains 1 to 5 different measurements/states.
| But here's the trick: because data is queried with a minimum
| bucket size of 15 minutes, I figured we could just create a
| column for each measurement + quarter of the day (e.g.
| measure0000, measure0015), so that's 100 columns for each
| measurement (96 quarters + 4 for DST), include the date in the
| key, et voila: excellent write performance (because it's
| mainly UPDATE queries) and good read performance.
|
| Okay, the queries to make sense of the data aren't pretty,
| but can be generated.
|
| I find it really cool how effective this is for time-series
| data without Postgres extensions (we're on RDS).
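|
| A rough sketch of the layout, with hypothetical names:
|
|     CREATE TABLE meter_readings (
|       device_id   bigint NOT NULL,
|       day         date   NOT NULL,
|       measure0000 double precision,  -- 00:00-00:15 bucket
|       measure0015 double precision,  -- 00:15-00:30 bucket
|       -- ...96 quarters + 4 extra slots for DST days...
|       PRIMARY KEY (device_id, day)
|     );
|
|     -- Writes are mostly single-column updates of one row per day:
|     UPDATE meter_readings SET measure0015 = 42.0
|     WHERE device_id = 1 AND day = '2024-10-06';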
| napsterbr wrote:
| One interesting thing your team may want to look into (if
| you haven't already) is compression.
|
| Of course there are a multitude of variables we don't have
| access to from the outside, but Postgres only compresses data
| that is TOASTed, and based on your description of the
| table, the data is not being TOASTed (and therefore not
| being compressed).
|
| Instead, if you could somehow pack your timeseries entries
| into an array, you would get the benefits of compression
| automatically.
|
| Given your write performance requirements, using an array
| may be out of the question (and you may get too much overhead
| from dead tuples) -- but who knows? Always a good idea to
| benchmark.
|
| I actually considered mentioning this in the post but
| figured it was too long already and could be the material
| for a future one :)
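|
| Something like this is what I have in mind (untested sketch;
| note TOAST compression only kicks in once a row passes the
| ~2 kB threshold):
|
|     CREATE TABLE meter_readings_packed (
|       device_id bigint NOT NULL,
|       day       date   NOT NULL,
|       measures  double precision[],  -- 96 + 4 quarter-hour slots
|       PRIMARY KEY (device_id, day)
|     );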
| rtuin wrote:
| This is interesting! Turning the values into arrays is
| not suitable for our base-tables, but might be for the
| several materialization tables we run.
| aidos wrote:
| The updates wouldn't be in place anyway because of the
| Postgres mvcc approach, right?
|
| I can see the advantage in terms of just needing a single
| tuple for reads. So a timestamp + value model would likely
| take twice as much heap space as your approach?
|
| Given that you're probably always just inserting new data
| you could use a brin index to get fast reads on the date
| ranges. Would be interesting to see it in action and play
| around to see the tradeoffs. The model you've settled on
| sounds like it would be a pain to query.
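|
| For the BRIN idea, it would be as simple as (assuming a
| hypothetical measurements table with a date column called day):
|
|     CREATE INDEX ON meter_readings USING brin (day);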
| sgarland wrote:
| > The updates wouldn't be in place anyway because of the
| Postgres mvcc approach, right?
|
| They might be if the columns being updated weren't
| indexed [0], but since it sounds like at least one is,
| no, not in-place.
|
| Though interestingly, your comment on BRIN indexes is
| quite relevant, as that's the one type of index that HOT
| can still work with.
|
| [0]: https://www.postgresql.org/docs/current/storage-
| hot.html
| rtuin wrote:
| Good observation. The updates are as much HOT-updates as
| possible. I wasn't familiar with BRIN-indexes before, so
| I have to look into that.
|
| At first glance our solution follows a similar approach,
| let me elaborate:
|
| - no index columns are updated ever, only inserted
|
| - all tables are partitioned based on date (partition
| range is 1 month)
|
| - for some tables there is another layer of partitioning
| (3 sub-partitions, based on one specific column)
|
| - finding an appropriate fillfactor is important to
| improve the speed of UPDATE statements (HOT-updates)
|
| - standard vacuum / auto vacuum settings work great for
| us so far.
|
| - to improve ANALYZE performance, set column statistics
| of value-only columns to 0.
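|
| Roughly what those knobs look like (names hypothetical):
|
|     ALTER TABLE meter_readings_2024_10 SET (fillfactor = 70);
|     ALTER TABLE meter_readings_2024_10
|       ALTER COLUMN measure0000 SET STATISTICS 0;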
| sgarland wrote:
| I assume you looked at the various Postgres functions, like
| width_bucket()?
|
| I've never run a PG-based TSDB, so I'm reluctant to pass
| judgment one way or the other, but as I hope you can
| understand, "we have hundreds of columns in a table" is
| alarming at first glance.
| rtuin wrote:
| The schema is organized by how we need to query it which
| is mostly by fetching all measurements for an entire day
| or month. Querying statistics on those metrics is not
| often needed, maybe a few times per week or day, so we do
| that in Snowflake if necessary. Or create a specialized
| streaming data flow when the stats are queried more often.
| sevensor wrote:
| Fascinating. How reliable is the measurement apparatus? Or
| to put it another way, how tolerant of missed measurements
| is this scheme? I've been in a similar situation in a past
| life and we took a different approach. I was never
| satisfied with how we ended up doing it though.
| rtuin wrote:
| I see your point. Some of our measurements are counters,
| and we interpolate missing data via certain rules. We store
| these interpolations in a separate table with exactly the
| same format and join the two tables directly when
| querying the data. Since the real value always takes
| precedence over the interpolated value this is just a
| COALESCE(realmeasure0000, interpolation0000) statement.
|
| This is super fast when taking advantage of postgres'
| partition-wise joins.
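|
| The query shape is roughly (hypothetical names):
|
|     SELECT m.device_id, m.day,
|            COALESCE(m.measure0000, i.measure0000) AS measure0000
|     FROM meter_readings m
|     LEFT JOIN meter_interpolations i USING (device_id, day);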
| sevensor wrote:
| That's really neat, thanks for explaining!
| epgui wrote:
| I don't know what they're doing, but highly-denormalized
| tables are very common, and sometimes even the norm,
| depending on how things are set up, especially in OLAP
| contexts.
| sgarland wrote:
| Highly denormalized tables are often the norm simply
| because the tables weren't properly normalized to begin
| with, and the data model wasn't properly done, such that
| reasonable joins are overly difficult.
|
| OLAP is of course its own problem, and most of the best
| practices for OLTP do not apply.
| magicalhippo wrote:
| Not OP but in similar situation. Main reason is that the
| primary tables represent various official declarations, and
| they have a _lot_ of fields. In addition, our customers are
| required by law to keep the data readily available for 5-10
| years. Also the law states our customers are responsible for
| the data they send through our systems, so they want to see
| what they're sending.
|
| So just to give an example of how field count "blows up",
| there can be up to 8 different parties related to a single
| declaration, and for each we want to have for display/print
| purposes our internal id, the id in the customer's system,
| organization number, name, address, postal code, location,
| country code, contact person, contact person phone and email.
|
| So we're talking about almost 100 fields just to store
| parties.
|
| We want to store all that separately from, say, our "companies"
| table, as a company might change name, location etc, and we're
| required to show/print the details from today 5+ years down
| the line.
|
| We could put that in a separate table, but that means 8 joins
| to fetch data for a single record just there. And that's just
| one piece of the declarations, and at least 3-4 of these
| parties our customers want to see in overview grids and be
| able to filter on there.
|
| Now tack on 20+ years of doing this, with various official
| systems coming and going with different technical
| requirements, and it adds up.
| mkesper wrote:
| Sounds like it's ripe for converting to JSON fields, maybe?
| magicalhippo wrote:
| If we could limit ourselves to only PostgreSQL we could
| perhaps do that for new modules. Sadly we have some customers
| which want to use their own MSSQL database.
|
| Though we'll see what the future holds. PostgreSQL is
| interesting due to cost and features, and many of the
| large customers are shifting their POV on how they want
| to run things, almost doing 180's in some cases.
| sgarland wrote:
| You do not want to do this, I assure you. The TOAST / DE-
| TOAST overhead alone would wreck performance, plus OP
| said it was update-heavy; Postgres can't update a JSON
| object on disk, it just rewrites the entire blob.
| syncsynchalt wrote:
| > Postgres can't update a JSON object on disk, it just
| rewrites the entire blob.
|
| Postgres does this for _all_ row updates; it must, for
| transaction isolation to work.
| hashhar wrote:
| I'm not sure, there's also HOT-updates (heap-only
| tuples). It's an optimization where the new row version is
| written within the same heap page and indexes don't need
| updating, if none of the modified columns are part of an
| index, and maybe other conditions, but I don't remember it
| all too well.
| pphysch wrote:
| I strongly disagree. This is an excellent application for
| JSONB. The user outlined that the records are for
| historical compliance purposes and therefore aren't
| frequently updated after the initial phase.
|
| A schema of `party1 JSONB, party2 JSONB, ...`, or even
| `parties JSONB` would likely be far more maintainable and
| sensible than the "table-as-spreadsheet" approach.
| OliverJones wrote:
| Good material!
|
| It has to be said, the order of columns in correctly designed
| multicolumn BTREE indexes is governed by the shape of the queries
| the indexes support.
|
| So don't arbitrarily reorder columns in your indexes to handle
| alignment.
| delduca wrote:
| Could the Django ORM deal with this?
| sgarland wrote:
| It could, but it doesn't, as far as I'm aware. You're certainly
| able to override the classes yourself, though. Make a lookup
| table of column type / length and then reorder based on that.
| jbkkd wrote:
| It can't right now, but this is a doable extension.
| SoftTalker wrote:
| This is the sort of thing that good DBAs used to know about and
| manage, but nowadays that isn't a fashionable job.
| sgarland wrote:
| It's coming back in the form of DBRE (hi, it's me), but
| unfortunately actual core RDBMS knowledge is often lacking.
| RDBMS administration is legitimately difficult, as they have a
| million knobs to turn, and a million gotchas that can tank
| performance if you aren't careful. I should mention here that
| this difficulty really only occurs at scale. If your DB has a
| few million rows, as long as your queries are indexed it's
| unlikely that you'll experience problems. You probably won't
| have optimal performance, but it's likely to be good enough.
|
| Personally, I love it, but also find it frustrating. The number
| of times I've presented "here's your problem, here's how to fix
| it, sorry it's gonna suck" only to be told "nah, we'll just
| upsize the instance" is far too many.
| didgetmaster wrote:
| It's been a while since I dug into the bowels of PG, but the
| overall row size with respect to page size used to be important.
| If you had a table where every row took up 4097 bytes and the
| page size was 8K, then your disk footprint for that table was
| doubled. Only one row fit in a page and the other (nearly) half
| of each page was wasted.
|
| If that is still true, then alignment issues could push you over
| the edge if you are close to it.
| gnfargbl wrote:
| The ever-useful postgres_dba tool
| (https://github.com/NikolayS/postgres_dba) can help with this.
| Install it, start psql, run :dba and choose option p1 to see how
| much table space you would save by reordering your columns.
| samokhvalov wrote:
| Thanks for mentioning!
| gnfargbl wrote:
| Thanks for the tool! It has helped me out of the mire more
| than once.
| fforflo wrote:
| Looked nice so I automatically did git clone && cd
| postgres_dba && make all install installcheck... and failed
| :D
|
| Creatures of habit... any plans to make this an extension?
| tiffanyh wrote:
| How much of this article is rooted in latency assumptions for
| physically spinning mechanical hard drives, vs SSD/NVMe?
| pornel wrote:
| None at all. The alignment is for cheap and easy access of data
| once it is in memory. It's probably rooted in PostgreSQL being
| written in C where aligned access is trivial, and dereference
| of an unaligned pointer is Undefined Behavior and requires more
| clunky code instead.
| nzach wrote:
| As the sibling said: "not at all".
|
| But I see why this could be confusing. In the article he wrote:
|
| > Postgres will happily add padding to the underlying data in
| order to make sure it is properly aligned at the physical
| layer. Having the data aligned ensures faster access time when
| retrieving pages from disk.
|
| And this is correct. The problem is that "physical layer"
| refers to the physical memory layout and how things are loaded
| into the CPU, and not how they are stored on disk (mostly).
|
| I'm no expert in this subject, but as far as I understand, the
| main factor for this kind of behavior is the way a CPU reads
| data from its cache, i.e. one line (64 bytes) at a time. And
| this is why we always pad to factors of 64 (2, 4, 8, 16, 32).
|
| This is the first time I read about this in the context of PG,
| but I've already encountered the same issue in C and Go. So for
| me this is just a new manifestation of the same underlying
| problem.
|
| https://mecha-mind.medium.com/demystifying-cpu-caches-with-e...
| branko_d wrote:
| From the article:
|
| > Having the data aligned ensures faster access time when
| retrieving pages from disk.
|
| Byte-level alignment cannot possibly have anything to do with
| retrieving pages from disk, simply because the unit of retrieval
| is the whole page. From the hardware/OS perspective, a page is
| just an opaque blob of bytes (composed of one or more blocks
| on the physical drive).
|
| Only after these bytes have reached RAM does the byte-level
| alignment play a role, because CPU works slower on misaligned
| data.
|
| The article itself then goes on to illustrate the above (and
| seemingly contradict itself):
|
| > SQLite does not pad or align columns within a row. Everything
| is tightly packed together using minimal space. Two consequences
| of this design: SQLite has to work harder (use more CPU cycles)
| to access data within a row once it has that row in memory.
| SQLite uses fewer bytes on disk, less memory, and spends less
| time moving content around because there are fewer bytes to
| move.
| winternewt wrote:
| I think the idea is that padding results in lower storage
| efficiency, which means fewer rows per page and hence lower I/O
| throughput. By changing the column order you can reduce the
| amount of padding required.
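|
| You can eyeball the effect per table with something like this
| (sketch; needs a recent ANALYZE for accurate stats):
|
|     SELECT relname,
|            (reltuples / NULLIF(relpages, 0))::int AS rows_per_page
|     FROM pg_class
|     WHERE relname = 'my_table';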
| branko_d wrote:
| Sure, having less padding increases I/O efficiency. I was
| just commenting on the author's apparent confusion as to why
| the padding is there in the first place.
|
| Here is the full(er) quote:
|
| > Postgres will happily add padding to the underlying data in
| order to make sure it is properly aligned at the physical
| layer. Having the data aligned ensures faster access time when
| retrieving pages from disk.
|
| This might be misunderstood as "Postgres adds padding to
| speed-up disk I/O", which is the opposite of what actually
| happens. Padding slows down I/O but speeds up the CPU
| processing afterwards.
|
| SQLite made the opposite tradeoff.
| napsterbr wrote:
| You are absolutely correct, the current wording causes
| confusion as to where the speed-up happens. Over the
| weekend I'll add a note and link to this thread, thanks for
| pointing that out.
| dspillett wrote:
| The wording implies to me that Postgres is doing the padding
| for alignment to reduce IO costs which, as branko_d suggests,
| would do the opposite. You are reading it as the intervention
| of the DBA reordering columns to remove the padding will
| improve IO efficiency by fitting more rows into each page,
| which you are right would be beneficial in that way.
|
| Postgres will be performing padding for alignment to improve
| processing speed once data is in local memory - CPUs are
| usually much faster at reading & writing aligned data [0]. This
| is trading off memory use and IO efficiency for CPU gain,
| which is the right optimisation _if_ you assume that your
| core working set fits nicely into RAM and that your CPU(s)
| have large enough cache that you don't create the same
| problem there [1]. Other DBs don't do this padding at all either
| because they didn't think of it or, more likely in the case
| of the big ones, because they are optimising more for IO than
| being concerned about CPU bottlenecks, or perhaps they
| natively rearrange the fields where it makes a difference
| instead of being beholden to the column ordering given by the
| user [2].
|
| ----
|
| [0] in fact some architectures don't directly support
| unaligned access at all, though probably not any
| architectures Postgres supports
|
| [1] causing extra cache evictions if cache segment width
| aligns badly with the padding such that less data fits in the
| available cache
|
| [2] if the user needs to care about physical ordering like
| this, you have a leaky abstraction
| silvestrov wrote:
| Wild guess: this is a left-over from the old days of Sun SPARC
| processors which could only access doubles on 8-byte aligned
| memory addresses.
|
| For Intel processors you just paid a performance penalty for
| unaligned accesses, but SPARC processors would generate a TRAP
| for unaligned accesses.
|
| I'd really like to see some performance testing of alignment for
| modern Intel/ARM CPUs. My guess is that tight packing would be
| better, as more data would fit into caches and you would have
| less I/O.
|
| Modern CPUs are very different from back in the 90's when
| PostgreSQL was designed. PostgreSQL has traditionally been very
| slow to stop supporting old OSes that are no longer used.
___________________________________________________________________
(page generated 2024-10-08 23:02 UTC)