[HN Gopher] Optimizing Postgres table layout for maximum efficiency
___________________________________________________________________
Optimizing Postgres table layout for maximum efficiency
Author : napsterbr
Score : 191 points
Date : 2024-10-06 15:50 UTC (2 days ago)
(HTM) web link (r.ena.to)
(TXT) w3m dump (r.ena.to)
| remus wrote:
| Very interesting, but I think the author overstates the
| importance of alignment a little. Unless your data/indexes are
| already of a challenging size for your hardware (or you expect
| them to be imminently), fiddling with byte alignment details
| feels like a fairly premature optimisation.
|
| Disk is cheap, memory is plentiful, your time is expensive etc.
| koolba wrote:
| > Disk is cheap, memory is plentiful, your time is expensive
| etc.
|
| Taking the time to know the in-memory sizes of your data
| types is well worth it. Taking the time to think about which
| types to use, and sorting columns by size, is also minimal
| effort and well worth it.
|
| It may make sense for the system to do it automatically for
| newly created tables. But maybe not, as it's possible you'd want
| the data layout to match some existing structure.
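|
| For reference, you can read size and alignment for common types
| straight from the catalog (a quick sketch):
|
|     SELECT typname, typlen, typalign
|     FROM pg_type
|     WHERE typname IN ('bool', 'int2', 'int4', 'int8',
|                       'float8', 'timestamp', 'text');
|
| Here typlen is the size in bytes (-1 means variable length) and
| typalign is the required alignment (c = 1 byte, s = 2, i = 4,
| d = 8).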
| egnehots wrote:
| As mentioned in the article, it's a good idea to consider this
| when creating a new table, since it's essentially a free
| optimization. However, it's probably not worth the hassle of
| reordering a production table for that.
|
| > Disk is cheap, memory is plentiful, but your time is
| expensive, etc.
|
| One thing to keep in mind, though, is that while you often have
| plenty of disk space, RAM is still relatively expensive. It's
| also divided into many smaller buffers, such as working memory
| and shared buffers, which are not that large. These
| optimizations help to fit more data into cache.
|
| However, what the article said about alignment being important
| for indexes is somewhat misleading. Reordering an index field
| is not the same as reordering columns in a table. Besides having
| to rewrite queries, it also changes the access pattern and the
| time required to access the data, which is often much more
| significant than the space saved. Indexes are, by nature, a
| tradeoff where you give up space to gain time, so this mindset
| doesn't really apply there.
| napsterbr wrote:
| Hey, author here.
|
| > Indexes are, by nature, a tradeoff where you give up space
| to gain time, so this mindset doesn't really apply there.
|
| I agree that (re)aligning indexes is a different beast
| entirely, but (as mentioned in my recommendation) ideally the
| developer should keep this in mind when creating the index
| initially.
|
| Factors like cardinality and even readability should take
| precedence over perfect alignment, but all else being equal,
| aligning your indexes from the very moment they are
| introduced in the codebase is the ideal scenario IMO.
| sgarland wrote:
| > Disk is cheap, memory is plentiful, your time is expensive
| etc.
|
| Spend 30 minutes one day playing around with Postgres, trying
| different column combinations out. Boom, you now know how best
| to order columns. This doesn't seem like a big ask.
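|
| A minimal sketch of that kind of experiment (table names are
| made up; exact numbers will vary):
|
|     CREATE TABLE padded (a boolean, b bigint, c boolean, d bigint);
|     CREATE TABLE packed (b bigint, d bigint, a boolean, c boolean);
|     INSERT INTO padded SELECT true, 1, true, 1
|       FROM generate_series(1, 1000000);
|     INSERT INTO packed SELECT 1, 1, true, true
|       FROM generate_series(1, 1000000);
|     SELECT pg_size_pretty(pg_relation_size('padded')) AS padded,
|            pg_size_pretty(pg_relation_size('packed')) AS packed;
|
| Same data, but the packed table avoids the 7 bytes of padding
| inserted before each bigint that follows a boolean.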
|
| The flip side is that changing data at scale is HARD, so if you
| put things like this off, when you do finally need to squeeze
| bytes, it's painful.
|
| Also, memory is absolutely not plentiful. That's generally the
| biggest bottleneck (or rather, the lack of it then makes IO the
| bottleneck) for an RDBMS, assuming you have connection pooling
| and aren't saturating the CPU with overhead.
| epgui wrote:
| > Disk is cheap, memory is plentiful, your time is expensive
| etc.
|
| Index size is not solely a storage concern. I also don't really
| care about how much disk space I pay for, but sometimes I care
| a lot about how long it takes to vacuum a table.
| koolba wrote:
| The content itself is fine but the unnecessary image above the
| label " _You have the right to remain aligned_ " with the usual
| AI-generated garbage text cheapens the entire article.
| loloquwowndueo wrote:
| Are you kidding, that image is hilarious.
|
| (Articles that intersperse annoying memes every 3 paragraphs
| with obnoxious gif animations are much much worse).
| wccrawford wrote:
| Interesting. But if this is important, why doesn't Postgres do it
| invisibly, automatically? Surely there's a most-correct sequence,
| with the longest values first, and the shortest values last, and
| then (probably) the variable-length values?
|
| It could then happily report back the data in the order that the
| DB Admin/developer expects.
| j45 wrote:
| Reordering columns in MySQL is pretty trivial, and I've
| wondered why it's not the same in Postgres, even at the GUI
| level.
|
| Ordering columns for performance might be a different order of
| recommendation.
| jeltz wrote:
| Because nobody has implemented it yet. A patch would be very
| welcome by many but nobody has written one.
| johnthuss wrote:
| This is a super useful thing to know and I'm thankful for this
| article highlighting this aspect of Postgres.
|
| But I would disagree with the takeaway to focus on optimizing
| your indexes and not your tables. The reason is that the order of
| columns in a multi-column index is highly meaningful and
| intentional, in order to support matching on a range of values
| for the last column in the index. The way databases work, you
| can only utilize a multi-column index on (customer_id int4,
| date timestamp) if you have an equality match on customer_id,
| like "WHERE customer_id = 1 AND date BETWEEN '2024-01-01' AND
| '2025-01-01'". If
| you reorder these columns in the index to put the larger date
| column first, then, sure, you save space in the index, but you
| also make it worthless - it will never be used by the query
| above. As such, optimizing a multi-column index is only useful
| when all the columns are queried for equality rather than a
| range.
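|
| Concretely, with a hypothetical orders table:
|
|     CREATE INDEX ON orders (customer_id, date);
|
|     -- Can use the index: equality on the leading column,
|     -- range on the second.
|     SELECT * FROM orders
|     WHERE customer_id = 1
|       AND date BETWEEN '2024-01-01' AND '2025-01-01';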
|
| In contrast, when you are creating a NEW table you might not
| think hard about the order of the columns in the table, and
| especially not about the data-sizes of each column and their
| alignment. But doing so at the time you create the table can be
| tremendously beneficial if it is going to be very large. It is
| important to note that you not only save space on-disk, but in
| precious RAM when the tuples are loaded.
| sgarland wrote:
| This (index ordering resulting in useless indices) is not true,
| at least not in newer versions of Postgres (I just tried with
| 15).
|
| While the query will take much longer (for me, it was about 47
| msec vs 0.6 msec for 2,000,000 rows), it still uses the index.
|
| Similarly, while normally you wouldn't expect a query using
| predicate Y to use an index defined on columns (X, Y, Z) if
| predicate X is also not referenced, Postgres may choose to do
| so, depending on table stats, and the relative difference
| between random_page_cost and seq_page_cost. I've seen it happen
| before.
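|
| Easy enough to verify on your own data with something like this
| (hypothetical table; leading column deliberately omitted from
| the predicate):
|
|     EXPLAIN (ANALYZE, BUFFERS)
|     SELECT * FROM orders
|     WHERE date BETWEEN '2024-01-01' AND '2025-01-01';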
| ghusbands wrote:
| > This (index ordering resulting in useless indices) is not
| true, at least not in newer versions of Postgres (I just
| tried with 15).
|
| > While the query will take much longer (for me, it was about
| 47 msec vs 0.6 msec for 2,000,000 rows), it still uses the
| index.
|
| I'd argue that something being 78x slower can make it pretty
| useless, but it is indeed at least used, in some cases.
| sgarland wrote:
| Certainly I wouldn't encourage this, no, but it _is_
| possible, and is still usually faster than a sequential
| scan.
| jeltz wrote:
| This is a limitation which is currently being worked on. The
| order will still matter of course but it will allow PostgreSQL
| to make some use of indexes even when the order of columns does
| not match.
|
| https://www.postgresql.org/message-id/CAH2-Wzmn1YsLzOGgjAQZd...
| johnthuss wrote:
| >>> This is feasible in cases where the total number of
| distinct values in the column 'a' is reasonably small (think
| tens or hundreds, perhaps even thousands for very large
| composite indexes).
|
| It's great this is improving, but this is a fairly narrow
| improvement. Personally, the multi-column indexes I use would
| not be improved by this change since column 'a' does not
| store a "reasonably small" distribution of values.
| rand_r wrote:
| This is sort of an aside, but a very interesting thing about
| Postgres is that it can efficiently combine independent column
| indexes together, so there is much less of a need, compared to
| older databases, to even create multi-column indexes. It's a
| feature from 8.1 called "Bitmap Scan". Basically, if you create
| an index on column X and an index on column Y, it can use them
| to do queries involving either or both columns pretty
| efficiently (for any number of columns).
|
| It's not as fast as a multi-column index, but the savings of
| not having to worry about all the combinations of columns that
| can be queried together could well be worth it.
|
| - https://www.postgresql.org/docs/release/8.1.0/
|
| - https://www.postgresql.org/docs/current/indexes-bitmap-
| scans...
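|
| A rough sketch of the idea (hypothetical table):
|
|     CREATE INDEX ON events (x);
|     CREATE INDEX ON events (y);
|
|     -- The planner can BitmapAnd (or BitmapOr) the two indexes:
|     EXPLAIN SELECT * FROM events WHERE x = 1 AND y = 2;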
| Pamar wrote:
| Bitmap Scan sounds a lot like Rushmore technology in
| Foxpro[1]. Are they the same?
|
| 1) https://en.wikipedia.org/wiki/FoxPro
|
| It is difficult to find a complete explanation for Rushmore
| nowadays. From what I remember, it would create a bitmap
| where each bit represented the nth record of the table you
| wanted to search, then with a single, fast sequential scan it
| would set the nth bit to 1 if the record satisfied all
| clauses of your search, 0 otherwise.
|
| Try to see if this makes any sense to you:
| http://www.foxpert.com/docs/howfoxproworks.en.htm
| williamdclt wrote:
| It's very cool, but at high throughput you really see the
| difference. This bitmap scan can take huge amounts of CPU,
| which drops to nothing (and gets much faster) once you set up
| a proper multi-column index.
|
| On small/medium tables and lowish throughput though, yeah
| it's often good enough and avoids having many indexes for
| specific use cases (which is a cost in itself, in
| memory/cpu/storage)
| jamespo wrote:
| Do you have any benchmarks? Be interesting to compare.
| qazxcvbnm wrote:
| Is there any potential that indexes could be created over
| foreign key joins in the future? I know that as of today, no
| multi-table indices or statistics exist for Postgres, which
| has led me to do some further denormalisations.
| rtuin wrote:
| It's really one of those things where you don't know how it
| applies to your Postgres DB until you run into a situation where
| you do.
|
| The author explains this very well; it's a good read! I learned
| about this padding a little over a year ago, while I was
| designing a data-intensive application with a colleague. I was
| skeptical about the advantage at first, but for our specific
| design, where we have 100 to 480+ columns in one table, it makes
| a huge difference in table store size. Not so much on the
| indexes, though.
| sgarland wrote:
| > where we have 100 to 480+ columns in one table
|
| I'm sorry, what? Why?
| rtuin wrote:
| Glad you asked!
|
| This system contains measurements and state of physical
| devices (time series). It's designed for both heavy write and
| read, with slight emphasis on write. Each table is one type
| of device and contains 1 to 5 different measurements/states.
| But here's the trick: because data is queried with a minimum
| bucket size of 15 minutes, I figured we could just create a
| column for each measurement + quarter of the day (e.g.
| measure0000, measure0015), so that's 100 columns for each
| measurement (96 quarters + 4 for DST), include the date in the
| key, et voila: excellent write performance (because it's
| mainly UPDATE queries) and good read performance.
|
| Okay, the queries to make sense of the data aren't pretty,
| but can be generated.
|
| I find it really cool how effective this is for time-series
| data without Postgres extensions (we're on RDS).
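|
| A rough sketch of the layout, with hypothetical names:
|
|     CREATE TABLE meter_readings (
|       device_id   bigint NOT NULL,
|       day         date   NOT NULL,
|       measure0000 double precision,  -- 00:00-00:15 bucket
|       measure0015 double precision,  -- 00:15-00:30 bucket
|       -- ...96 quarters + 4 extra slots for DST days...
|       PRIMARY KEY (device_id, day)
|     );
|
|     -- Writes are mostly single-column updates of one row per day:
|     UPDATE meter_readings SET measure0015 = 42.0
|     WHERE device_id = 1 AND day = '2024-10-06';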
| napsterbr wrote:
| One interesting thing your team may want to look into (if
| you haven't already) is compression.
|
| Of course there are a multitude of variables we don't have
| access to from the outside, but Postgres only compresses data
| that is TOASTed, and based on your description of the
| table, the data is not being TOASTed (and therefore not
| being compressed).
|
| Instead, if you could somehow pack your timeseries entries
| into an array, you would get the benefits of compression
| automatically.
|
| Given your write performance requirements, using an array
| may be out of the question (and you may get too much overhead
| from dead tuples) -- but who knows? Always a good idea to
| benchmark.
|
| I actually considered mentioning this in the post but
| figured it was too long already and could be the material
| for a future one :)
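|
| Something like this is what I have in mind (untested sketch;
| note TOAST compression only kicks in once a row passes the
| ~2 kB threshold):
|
|     CREATE TABLE meter_readings_packed (
|       device_id bigint NOT NULL,
|       day       date   NOT NULL,
|       measures  double precision[],  -- 96 + 4 quarter-hour slots
|       PRIMARY KEY (device_id, day)
|     );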
| rtuin wrote:
| This is interesting! Turning the values into arrays is
| not suitable for our base-tables, but might be for the
| several materialization tables we run.
| aidos wrote:
| The updates wouldn't be in place anyway because of the
| Postgres mvcc approach, right?
|
| I can see the advantage in terms of just needing a single
| tuple for reads. So a timestamp + value model would likely
| take twice as much heap space as your approach?
|
| Given that you're probably always just inserting new data
| you could use a brin index to get fast reads on the date
| ranges. Would be interesting to see it in action and play
| around to see the tradeoffs. The model you've settled on
| sounds like it would be a pain to query.
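|
| For the BRIN idea, it would be as simple as (assuming a
| hypothetical measurements table with a date column called day):
|
|     CREATE INDEX ON meter_readings USING brin (day);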
| sgarland wrote:
| > The updates wouldn't be in place anyway because of the
| Postgres mvcc approach, right?
|
| They might be if the columns being updated weren't
| indexed [0], but since it sounds like at least one is,
| no, not in-place.
|
| Though interestingly, your comment on BRIN indexes is
| quite relevant, as that's the one type of index that HOT
| can still work with.
|
| [0]: https://www.postgresql.org/docs/current/storage-
| hot.html
| rtuin wrote:
| Good observation. The updates are as much HOT-updates as
| possible. I wasn't familiar with BRIN-indexes before, so
| I have to look into that.
|
| At first glance our solution follows a similar approach,
| let me elaborate:
|
| - no index columns are updated ever, only inserted
|
| - all tables are partitioned based on date (partition
| range is 1 month)
|
| - for some tables there is another layer of partitioning
| (3 sub-partitions, based on one specific column)
|
| - finding an appropriate fillfactor is important to
| improve the speed of UPDATE statements (HOT-updates)
|
| - standard vacuum / auto vacuum settings work great for
| us so far.
|
| - to improve ANALYZE performance, set column statistics
| of value-only columns to 0.
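|
| Roughly what those knobs look like (names hypothetical):
|
|     ALTER TABLE meter_readings_2024_10 SET (fillfactor = 70);
|     ALTER TABLE meter_readings_2024_10
|       ALTER COLUMN measure0000 SET STATISTICS 0;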
| sgarland wrote:
| I assume you looked at the various Postgres functions, like
| width_bucket()?
|
| I've never run a PG-based TSDB, so I'm reluctant to pass
| judgment one way or the other, but as I hope you can
| understand, "we have hundreds of columns in a table" is
| alarming at first glance.
| rtuin wrote:
| The schema is organized by how we need to query it which
| is mostly by fetching all measurements for an entire day
| or month. Querying statistics on those metrics is not
| often needed, maybe a few times per week or day, so we do
| that in Snowflake if necessary. Or create a specialized
| streaming data flow when the stats are queried more often.
| sevensor wrote:
| Fascinating. How reliable is the measurement apparatus? Or
| to put it another way, how tolerant of missed measurements
| is this scheme? I've been in a similar situation in a past
| life and we took a different approach. I was never
| satisfied with how we ended up doing it though.
| rtuin wrote:
| I see your point. Some of our measurements are counters,
| and we interpolate missing data via certain rules. We store
| these interpolations in a separate table with exactly the
| same format and join the two tables directly when
| querying the data. Since the real value always takes
| precedence over the interpolated value this is just a
| COALESCE(realmeasure0000, interpolation0000) statement.
|
| This is super fast when taking advantage of postgres'
| partition-wise joins.
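|
| The query shape is roughly (hypothetical names):
|
|     SELECT m.device_id, m.day,
|            COALESCE(m.measure0000, i.measure0000) AS measure0000
|     FROM meter_readings m
|     LEFT JOIN meter_interpolations i USING (device_id, day);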
| sevensor wrote:
| That's really neat, thanks for explaining!
| epgui wrote:
| I don't know what they're doing, but highly-denormalized
| tables are very common, and sometimes even the norm,
| depending on how things are set up, especially in OLAP
| contexts.
| sgarland wrote:
| Highly denormalized tables are often the norm simply
| because the tables weren't properly normalized to begin
| with, and the data model wasn't properly done, such that
| reasonable joins are overly difficult.
|
| OLAP is of course its own problem, and most of the best
| practices for OLTP do not apply.
| magicalhippo wrote:
| Not OP but in similar situation. Main reason is that the
| primary tables represent various official declarations, and
| they have a _lot_ of fields. In addition, our customers are
| required by law to keep the data readily available for 5-10
| years. Also the law states our customers are responsible for
| the data they send through our systems, so they want to see
| what they're sending.
|
| So just to give an example of how field count "blows up",
| there can be up to 8 different parties related to a single
| declaration, and for each we want to have for display/print
| purposes our internal id, the id in the customer's system,
| organization number, name, address, postal code, location,
| country code, contact person, contact person phone and email.
|
| So we're talking about almost 100 fields just to store
| parties.
|
| We want to store all that separately from, say, our "companies"
| table, as a company might change name, location etc, and we're
| required to show/print the details from today 5+ years down
| the line.
|
| We could put that in a separate table, but that means 8 joins
| to fetch data for a single record just there. And that's just
| one piece of the declarations, and at least 3-4 of these
| parties our customers want to see in overview grids and be
| able to filter on there.
|
| Now tack on 20+ years of doing this, with various official
| systems coming and going with different technical
| requirements, and it adds up.
| mkesper wrote:
| Sounds like it's ripe for converting to JSON fields, maybe?
| magicalhippo wrote:
| If we could limit ourselves to only PostgreSQL we could
| perhaps do that for new modules. Sadly we have some customers
| which want to use their own MSSQL database.
|
| Though we'll see what the future holds. PostgreSQL is
| interesting due to cost and features, and many of the
| large customers are shifting their POV on how they want
| to run things, almost doing 180's in some cases.
| sgarland wrote:
| You do not want to do this, I assure you. The TOAST / DE-
| TOAST overhead alone would wreck performance, plus OP
| said it was update-heavy; Postgres can't update a JSON
| object on disk, it just rewrites the entire blob.
| syncsynchalt wrote:
| > Postgres can't update a JSON object on disk, it just
| rewrites the entire blob.
|
| Postgres does this for _all_ row updates; it must, for
| transaction isolation to work.
| hashhar wrote:
| I'm not sure, there's also HOT-updates (heap-only
| tuples). It's an optimization where the new row version is
| written within the same heap page and indexes don't need
| updating, if none of the modified columns are part of an
| index, and maybe other conditions, but I don't remember it
| all too well.
| pphysch wrote:
| I strongly disagree. This is an excellent application for
| JSONB. The user outlined that the records are for
| historical compliance purposes and therefore aren't
| frequently updated after the initial phase.
|
| A schema of `party1 JSONB, party2 JSONB, ...`, or even
| `parties JSONB` would likely be far more maintainable and
| sensible than the "table-as-spreadsheet" approach.
| OliverJones wrote:
| Good material!
|
| It has to be said, the order of columns in correctly designed
| multicolumn BTREE indexes is governed by the shape of the queries
| the indexes support.
|
| So don't arbitrarily reorder columns in your indexes to handle
| alignment.
| delduca wrote:
| Could the Django ORM deal with this?
| sgarland wrote:
| It could, but it doesn't, as far as I'm aware. You're certainly
| able to override the classes yourself, though. Make a lookup
| table of column type / length and then reorder based on that.
| jbkkd wrote:
| It can't right now, but this is a doable extension.
| SoftTalker wrote:
| This is the sort of thing that good DBAs used to know about and
| manage, but nowadays that isn't a fashionable job.
| sgarland wrote:
| It's coming back in the form of DBRE (hi, it's me), but
| unfortunately actual core RDBMS knowledge is often lacking.
| RDBMS administration is legitimately difficult, as they have a
| million knobs to turn, and a million gotchas that can tank
| performance if you aren't careful. I should mention here that
| this difficulty really only occurs at scale. If your DB has a
| few million rows, as long as your queries are indexed it's
| unlikely that you'll experience problems. You probably won't
| have optimal performance, but it's likely to be good enough.
|
| Personally, I love it, but also find it frustrating. The number
| of times I've presented "here's your problem, here's how to fix
| it, sorry it's gonna suck" only to be told "nah, we'll just
| upsize the instance" is far too many.
| didgetmaster wrote:
| It's been a while since I dug into the bowels of PG, but the
| overall row size with respect to page size used to be important.
| If you had a table where every row took up 4097 bytes and the
| page size was 8K, then your disk footprint for that table was
| doubled. Only one row fit in a page and the other (nearly) half
| of each page was wasted.
|
| If that is still true, then alignment issues could push you over
| the edge if you are close to it.
| gnfargbl wrote:
| The ever-useful postgres_dba tool
| (https://github.com/NikolayS/postgres_dba) can help with this.
| Install it, start psql, run :dba and choose option p1 to see how
| much table space you would save by reordering your columns.
| samokhvalov wrote:
| Thanks for mentioning!
| gnfargbl wrote:
| Thanks for the tool! It has helped me out of the mire more
| than once.
| fforflo wrote:
| Looked nice so I automatically did git clone && cd
| postgres_dba && make all install installcheck... and failed
| :D
|
| Creatures of habit... any plans to make this an extension?
| tiffanyh wrote:
| How much of this article is rooted in latency assumptions for
| physically spinning mechanical hard drives, vs SSD/NVMe?
| pornel wrote:
| None at all. The alignment is for cheap and easy access of data
| once it is in memory. It's probably rooted in PostgreSQL being
| written in C where aligned access is trivial, and dereference
| of an unaligned pointer is Undefined Behavior and requires more
| clunky code instead.
| nzach wrote:
| As the sibling said: "not at all".
|
| But I see why this could be confusing. In the article he wrote:
|
| > Postgres will happily add padding to the underlying data in
| order to make sure it is properly aligned at the physical
| layer. Having the data aligned ensures faster access time when
| retrieving pages from disk.
|
| And this is correct. The problem is that "physical layer"
| refers to the physical memory layout and how things are loaded
| into the CPU, and not how they are stored on disk (mostly).
|
| I'm no expert in this subject, but as far as I understand, the
| main factor for this kind of behavior is the way a CPU reads
| data from its cache, i.e. one line (64 bytes) at a time. And
| this is why we always pad to factors of 64 (2, 4, 8, 16, 32).
|
| This is the first time I read about this in the context of PG,
| but I've already encountered the same issue in C and Go. So for
| me this is just a new manifestation of the same underlying
| problem.
|
| https://mecha-mind.medium.com/demystifying-cpu-caches-with-e...
| branko_d wrote:
| From the article:
|
| > Having the data aligned ensures faster access time when
| retrieving pages from disk.
|
| Byte-level alignment cannot possibly have anything to do with
| retrieving pages from disk, simply because the unit of retrieval
| is the whole page. From the hardware/OS perspective, a page is
| just an opaque blob of bytes (composed of one or more blocks
| on the physical drive).
|
| Only after these bytes have reached RAM does the byte-level
| alignment play a role, because CPU works slower on misaligned
| data.
|
| The article itself then goes on to illustrate the above (and
| seemingly contradict itself):
|
| > SQLite does not pad or align columns within a row. Everything
| is tightly packed together using minimal space. Two consequences
| of this design: SQLite has to work harder (use more CPU cycles)
| to access data within a row once it has that row in memory.
| SQLite uses fewer bytes on disk, less memory, and spends less
| time moving content around because there are fewer bytes to
| move.
| winternewt wrote:
| I think the idea is that padding results in lower storage
| efficiency, which means fewer rows per page and hence lower I/O
| throughput. By changing the column order you can reduce the
| amount of padding required.
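|
| You can eyeball the effect per table with something like this
| (sketch; needs a recent ANALYZE for accurate stats):
|
|     SELECT relname,
|            (reltuples / NULLIF(relpages, 0))::int AS rows_per_page
|     FROM pg_class
|     WHERE relname = 'my_table';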
| branko_d wrote:
| Sure, having less padding increases I/O efficiency. I was
| just commenting on the author's apparent confusion as to why
| the padding is there in the first place.
|
| Here is the full(er) quote:
|
| > Postgres will happily add padding to the underlying data in
| order to make sure it is properly aligned at the physical
| layer. Having the data aligned ensures faster access time when
| retrieving pages from disk.
|
| This might be misunderstood as "Postgres adds padding to
| speed-up disk I/O", which is the opposite of what actually
| happens. Padding slows down I/O but speeds up the CPU
| processing afterwards.
|
| SQLite made the opposite tradeoff.
| napsterbr wrote:
| You are absolutely correct, the current wording causes
| confusion as to where the speed-up happens. Over the
| weekend I'll add a note and link to this thread, thanks for
| pointing that out.
| dspillett wrote:
| The wording implies to me that Postgres is doing the padding
| for alignment to reduce IO costs which, as branko_d suggests,
| would do the opposite. You are reading it as the intervention
| of the DBA reordering columns to remove the padding will
| improve IO efficiency by fitting more rows into each page,
| which you are right would be beneficial in that way.
|
| Postgres will be performing padding for alignment to improve
| processing speed once data is in local memory - CPUs are
| usually much faster at reading & writing aligned data [0]. This
| is trading off memory use and IO efficiency for CPU gain,
| which is the right optimisation _if_ you assume that your
| core working set fits nicely into RAM and that your CPU(s)
| have large enough cache that you don't create the same
| problem there [1]. Other DBs don't do this padding at all either
| because they didn't think of it or, more likely in the case
| of the big ones, because they are optimising more for IO than
| being concerned about CPU bottlenecks, or perhaps they
| natively rearrange the fields where it makes a difference
| instead of being beholden to the column ordering given by the
| user [2].
|
| ----
|
| [0] in fact some architectures don't directly support
| unaligned access at all, though probably not any
| architectures Postgres supports
|
| [1] causing extra cache evictions if cache segment width
| aligns badly with the padding such that less data fits in the
| available cache
|
| [2] if the user needs to care about physical ordering like
| this, you have a leaky abstraction
| silvestrov wrote:
| Wild guess: this is a left-over from the old days of Sun SPARC
| processors which could only access doubles on 8-byte aligned
| memory addresses.
|
| For Intel processors you just paid a performance penalty for
| unaligned accesses, but SPARC processors would generate a TRAP
| for unaligned accesses.
|
| I'd really like to see some performance testing of alignment for
| modern Intel/ARM CPUs. My guess is that tight packing would be
| better, as more data would fit into caches and you would have
| less I/O.
|
| Modern CPUs are very different from back in the 90's when
| PostgreSQL was designed. PostgreSQL has traditionally been very
| slow to stop supporting old OSes that are no longer used.
___________________________________________________________________
(page generated 2024-10-08 23:02 UTC)