[HN Gopher] An unexpected find that freed 20GB of unused index s...
___________________________________________________________________
An unexpected find that freed 20GB of unused index space in
PostgreSQL
Author : haki
Score : 285 points
Date : 2021-02-01 14:17 UTC (8 hours ago)
(HTM) web link (hakibenita.com)
(TXT) w3m dump (hakibenita.com)
| tantalor wrote:
| Graphing "free storage" is meaningless and confusing; it should
| be "used storage".
|
| Available storage depends on usage and capacity.
|
| Edit: I meant for this article; of course I believe it is useful
| to track this in practice.
| AnotherGoodName wrote:
| Used makes sense for getting a feeling for pure performance
| (the smaller the better, as it's likely to be in memory).
|
| Available makes sense for knowing when things will just plain
| break (reaching 0 = write failure for a DB).
|
| >Every few months we get an alert from our database monitoring
| to warn us that we are about to run out of space.
|
| In this case they were avoiding their DB server breaking. They
| didn't do this for performance reasons.
| zufallsheld wrote:
| If you pay for a fixed amount of storage and only use it
| partially, you should also monitor free storage so you know
| when you waste it.
| fanf2 wrote:
| Free storage is what matters because it makes it very obvious
| when you are getting close to a disk-full outage.
| brianberns wrote:
| > REINDEX INDEX CONCURRENTLY index_name;
|
| > If for some reason you had to stop the rebuild in the middle,
| the new index will not be dropped. Instead, it will be left in an
| invalid state and consume space.
|
| Well, that sure sounds like a bug in PostgreSQL to me.
| striking wrote:
| Well, you can't just _delete_ it. It is an object that was
| created by some user and there's no good reason for the
| database to get rid of it automatically. The database keeps a
| record of the invalid thing, even though it is invalid.
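|
| For illustration, a sketch of inspecting and cleaning up that
| record (standard catalogs; the leftover index name is whatever
| the interrupted build left behind, typically suffixed _ccnew):
|
|     SELECT c.relname
|     FROM pg_index i
|     JOIN pg_class c ON c.oid = i.indexrelid
|     WHERE NOT i.indisvalid;
|
|     DROP INDEX IF EXISTS some_index_ccnew;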
| brianberns wrote:
| The good reason to get rid of it automatically: It takes up
| space.
|
| Is there any good reason to keep it? (The fact that it was
| "created by some user" doesn't seem like much of a reason.)
|
| IMHO, creating an index should be atomic: Either you end up
| with a valid index, or you end up with nothing.
| voganmother42 wrote:
| pretty well-documented behavior as far as CONCURRENTLY goes:
| https://www.postgresql.org/docs/current/sql-createindex.html...
| ivoras wrote:
| Is the partial index technique to avoid indexed NULL data as
| effective for PostgreSQL 13+?
|
| It looks like in v13+ PostgreSQL could create a single leaf for
| NULL data and just store row pointers within it, which should
| reduce data sizes at least a bit.
| chrismeller wrote:
| He actually mentioned index de-duplication earlier:
| https://hakibenita.com/postgresql-unused-index-size#activati...
|
| If I had to guess, I would say that it doesn't accomplish
| anything (or as much as you'd think) for null values simply
| because there is no real data to store in either approach, you
| just have a bunch of pointers either way.
| petergeoghegan wrote:
| NULL values are not special as far as deduplication is
| concerned. They use approximately as much disk space as a
| non-NULL integer column without deduplication, and compress
| just as well with deduplication. Deduplication is effective
| because it eliminates per-tuple overhead, so you see most of
| the benefits even with index tuples that naturally happen to
| have physically small keys. You'll still get up to a 3x
| decrease in storage overhead for the index provided there is
| low cardinality data (and not necessarily that low
| cardinality, ~10 or so tuples per distinct value will get you
| there).
|
| The NULL issue is documented directly -- see the "Note" box
| here:
|
| https://www.postgresql.org/docs/devel/btree-
| implementation.h...
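|
| For reference, a sketch of toggling deduplication per index on
| PG 13+ (it defaults to on there; names are placeholders):
|
|     CREATE INDEX foo_bar_ix ON foo (bar)
|         WITH (deduplicate_items = on);
|
|     -- or for an existing index, followed by a rebuild:
|     ALTER INDEX foo_bar_ix SET (deduplicate_items = on);
|     REINDEX INDEX CONCURRENTLY foo_bar_ix;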
| mattashii wrote:
| Not per se _as effective_, but it will still help a lot. NULL
| tuples pre-pg13 take ~14 bytes each, and 18 bytes when
| aligned (= 2 (ItemId, location on page) + 6 (TID) + 2 (t_info)
| + 4 (NULL bitmap) + 4 bytes alignment). When deduplication is
| enabled for your index, your expected tuple size becomes just
| a bit more than 6 bytes: with ~50 TIDs* in one posting tuple,
| (2 (ItemId) + 6 (alt TID) + 2 (t_info) + 4 (NULL bitmap) +
| 50 * 6 (heap TIDs)) / 50 => ~6.28 bytes/tuple.
|
| So, deduplication saves some 65% in index size for NULL-only
| index tuples, and the remaining 35% can be saved by using a
| partial index (so, in this case, deduplication could have
| saved 13GB).
|
| *note: last time I checked, REINDEX with deduplication enabled
| packs 50 duplicates in one compressed index tuple. This varies
| for naturally grown indexes, and changes with column types and
| update access patterns.
| [deleted]
| mulander wrote:
| Partial indexes are amazing but you have to keep in mind some
| peculiarities.
|
| If your query doesn't contain a proper match with the WHERE
| clause of the index - the index will not be used. It is easy to
| forget about it or to get it wrong in subtle ways. Here is an
| example from work.
|
| There was an event tracing structure which contained the event
| severity_id. Id values 0-6 inclusive are user-facing events.
| Severity 7 and up are debug events. In practice all debug
| events were 7 and there were no other values above 7. This
| table had a partial index with WHERE severity_id < 7. I
| tracked down a performance regression when an ORM (due to
| programmer error) generated WHERE severity_id != 7. The
| database is obviously not able to tell that there will never
| be any values above 7, so the index was not used, slowing down
| event handling. Changing the query to match < 7 fixed the
| problem. The database might also not be able to infer that the
| index can indeed be used, for example when prepared statements
| are involved: WHERE severity_id < ?. The database cannot tell
| that all bindings of ? will satisfy < 7, so it will not use
| the index (unless you are running PG 12; then that might
| depend on the setting of plan_cache_mode[1], but I have not
| tested that yet).
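|
| A sketch of the shape of it (DDL invented to match the
| example):
|
|     CREATE INDEX events_user_facing_ix ON events (severity_id)
|         WHERE severity_id < 7;
|
|     -- can use the index: the planner proves "< 3" implies "< 7"
|     EXPLAIN SELECT * FROM events WHERE severity_id < 3;
|
|     -- cannot use the index: "!= 7" does not imply "< 7"
|     EXPLAIN SELECT * FROM events WHERE severity_id != 7;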
|
| Another thing is that HOT updates in PostgreSQL can't be
| performed if the updated field is indexed, and that also
| includes being part of a WHERE clause of a partial index. So
| you could have a site like HN and think that it would be nice
| to index stories WHERE vote > 100 to quickly find more popular
| stories. That index however would nullify the possibility of a
| HOT update whenever the vote tally was updated. Again, not
| necessarily a problem, but you need to know the possible
| drawbacks.
|
| That said, they are great when used for the right purpose. Kudos
| to the author for a nice article!
|
| [1] - https://postgresqlco.nf/doc/en/param/plan_cache_mode/
| GordonS wrote:
| > The database is obviously not able to tell that there will
| never be any values above 7
|
| You say "obviously", but with updated statistics this is
| exactly the kind of thing you might expect the planner to
| know and use to aid index decisions.
|
| I'm a huge fan of Postgres, coming to it around 5 years ago
| from at least 10 previous years with SQL Server, but I have hit
| a few things like this in that time. IME the planner is much
| more fickle about how you specify your predicates than SQL
| Server is.
| londons_explore wrote:
| All statistics in Postgres are considered best-effort
| guidance. Even if the statistics are wrong, they can never
| impact the correctness of the results.
| mulander wrote:
| No, I don't think statistics can let you get away with this.
| Databases are concurrent; you can't guarantee that a
| different session will not insert a record that invalidates
| your current statistics.
|
| You could argue that it should be able to use the index if
| the table has a check constraint preventing severity_id above
| 7 from ever being inserted. That is something that could be
| done; I don't know if PostgreSQL does it (I doubt it) or how
| feasible it would be.
|
| Is SQL Server able to make an assumption like that purely
| based on statistics? Genuine question.
| GordonS wrote:
| > No, I don't think statistics can let you get away with
| this. Databases are concurrent, you can't guarantee that a
| different session will not insert a record that invalidates
| your current statistics.
|
| Of course you can't make a guarantee like that, but why
| would you need to? Statistics are there to guide planner
| choices, not make cast-iron predictions.
| cmeacham98 wrote:
| Because the database has to decide whether or not to use
| the index. If it decides to use the index, and there
| _are_ values above 7, then it will misbehave (the query
| will miss those results). Now of course the database
| could then scan the rows for values above 7 it missed,
| but at that point there's no point in using the index and
| you might as well have scanned the rows for the original
| query.
|
| As a result, the database has to be 100% sure that there
| are no values _at all_ above 7 to safely and efficiently
| use the index, e.g. when there's a constraint.
| mulander wrote:
| Let us say that in our example table we have 100 000
| records with severity_id < 7, 200 000 with severity_id = 7
| and 3 records with severity_id = 8.
|
| Statistics claim 100k id < 7, 200k id = 7 and 0 with id >
| 7. The last 3 inserts could have happened right before
| our query; the statistics haven't been updated yet.
|
| Let us assume that we blindly trust the statistics and
| they currently state that there are absolutely no values
| with severity_id > 7 and you have a query WHERE
| severity_id != 7 and a partial index on severity_id < 7.
|
| If you trust the statistics and actually use the index,
| the rows containing severity_id = 8 will never be
| returned by the query even though they exist. So by using
| the index you only scan 100k rows and never touch the
| remaining ~200k. However, this query can't be answered
| correctly without scanning all ~300k records. This means
| that on the same database you would get two different
| results for the exact same query if you decided to drop
| the index after the first run. The database can't fall
| back and change the plan during execution.
|
| Perhaps I misunderstood you originally. I thought you
| suggested that the database should be able to know that
| it can still use the index because currently the
| statistics claim that there are no records that would
| make the result incorrect. You are of course correct
| that the statistics are there to guide the planner
| choices and that is how they are used within PostgreSQL -
| however, some plans will give different results if your
| assumptions about the data are wrong.
| mattb314 wrote:
| I doubt it? At least the number of times the "last updated"
| column appears in SQL Server stats [1] leads me to believe
| it collects stats asynchronously with updates to the table.
|
| The only system I've heard of that relies on up-to-date
| statistics for correctness is Snowflake (long but
| interesting talk here [2]), where having accurate max/mins
| for each micro-partition is really helpful for cutting down
| the amount of data in the large range-scan queries common
| in BI. I'd guess that being a BI system, Snowflake can get
| away with higher row-update latency too.
|
| [1] https://www.sqlshack.com/sql-server-statistics-and-how-
| to-pe...
|
| [2] https://www.youtube.com/watch?v=CPWn1SZUZqE
| lucian1900 wrote:
| Partial indexes can be useful in any case where one value is
| much more common than the others.
|
| Indexing boolean columns is often only useful if one of the
| values is uncommon and the index is partial to only include those
| uncommon rows.
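|
| A sketch of that second case (names invented):
|
|     -- "failed" is a boolean that is rarely true, so the
|     -- index stays tiny and stays in memory
|     CREATE INDEX jobs_failed_ix ON jobs (id) WHERE failed;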
| mnw21cam wrote:
| Agreed. To explain why this is the case, consider the table in
| the story that had 99% NULL values. If you were to run
| "SELECT * FROM table WHERE column IS NULL", then PostgreSQL
| wouldn't use the index anyway, because it would be faster to
| just read sequentially through the entire table and filter out
| the 1% that don't match.
| maweki wrote:
| That would highly depend on what you select. If the query
| could be answered by the index alone, like COUNT(*), it would
| probably use the index. You are right if you want to query
| any data from those rows that's not in the index.
| mnw21cam wrote:
| I might be out of touch a little with Postgres (I last used
| it in 2010), but my impression was that COUNT(*) still
| needed to scan the actual table in order to exclude rows
| that had been deleted in a transaction, due to the way
| multi-version concurrency worked. Is this something that
| has been improved since then?
| zeroimpl wrote:
| They support index-only scans now, so there is some sort
| of optimization which bypasses the table lookup, at least
| in certain cases.
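|
| A sketch of how to check (hypothetical names; the plan
| depends on the visibility map being current, hence the
| VACUUM):
|
|     VACUUM my_table;
|     EXPLAIN SELECT COUNT(*) FROM my_table
|         WHERE my_col IS NULL;
|     -- may show: Index Only Scan using my_col_ix on my_table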
| mjw1007 wrote:
| Summary: if you have an index on a column which is mostly NULL,
| consider using a partial index covering only the records where
| it's non-NULL.
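|
| In SQL terms (a sketch; table and column names invented):
|
|     CREATE INDEX CONCURRENTLY orders_cancellation_ix
|         ON orders (cancellation_date)
|         WHERE cancellation_date IS NOT NULL;
|
| Queries then need a matching "cancellation_date IS NOT NULL"
| predicate (or one that implies it) for the index to be
| usable.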
| latch wrote:
| Another benefit of partial indexes is to limit a unique
| constraint:
|
| create unique index users_email on users(email)
| where status != 'delete';
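|
| A sketch of the effect, assuming a soft-deleting users table:
|
|     INSERT INTO users (email, status)
|         VALUES ('a@b.com', 'active');
|     -- fails: an active row with this email already exists
|     INSERT INTO users (email, status)
|         VALUES ('a@b.com', 'active');
|     -- after a soft delete, the address can be reused
|     UPDATE users SET status = 'delete'
|         WHERE email = 'a@b.com';
|     INSERT INTO users (email, status)
|         VALUES ('a@b.com', 'active');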
| SomeHacker44 wrote:
| Be very careful, then, as the optimizer will (usually?) not
| use the index if the condition is not part of the query.
| jackTheMan wrote:
| You can create views for that; then it will always be part
| of the query.
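|
| Something like (a sketch):
|
|     CREATE VIEW active_users AS
|         SELECT * FROM users WHERE status != 'delete';
|
| Queries against active_users then always carry the predicate,
| so the partial index qualifies.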
| malinens wrote:
| Too bad MySQL does not have partial indexes.
|
| We have one huge table I want to add some indexes to for
| specific cases (at most 1% of records), but the server will
| not have enough memory for it if I add those indexes for all
| records :/
| fraktl wrote:
| MySQL has pluggable storage engines. TokuDB does what you're
| after (adds indexes on the fly, and alters tables on the fly
| without overloading the server).
| wolf550e wrote:
| Altering tables online without using pt-online-schema-change
| doesn't help if they want an index that covers only some of
| the keys but not others.
| hu3 wrote:
| This page about TokuDB reads:
|
| > TokuDB has been deprecated by its upstream maintainer. It
| is disabled from MariaDB 10.5 and has been removed in
| MariaDB 10.6 - MDEV-19780. We recommend MyRocks as a
| long-term migration path.
|
| https://mariadb.com/kb/en/tokudb/
|
| Is MyRocks comparable?
| crazygringo wrote:
| As long as you've got primary keys on the huge table, there's a
| hacky solution -- create a second table containing just the
| first table's primary key and the columns you're indexing, with
| your desired index on it, and ensure you always
| write/update/delete both tables simultaneously using
| transactions. Then, when needed, use the index on the second
| table and join it to your first with the primary key.
|
| Annoying, but I'd expect it to work for most queries without
| too much extra SQL.
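|
| A sketch in MySQL-flavoured SQL (all names invented; writing
| only the rows that meet the condition into the side table is
| what emulates the partial index):
|
|     CREATE TABLE big_table_hot (
|         id BIGINT PRIMARY KEY,   -- same PK as big_table
|         rare_col INT NOT NULL,
|         KEY rare_col_ix (rare_col)
|     );
|
|     START TRANSACTION;
|     INSERT INTO big_table (id, rare_col, payload)
|         VALUES (1, 42, '...');
|     INSERT INTO big_table_hot (id, rare_col) VALUES (1, 42);
|     COMMIT;
|
|     -- find rows via the small table, join back for the rest
|     SELECT b.* FROM big_table_hot h
|     JOIN big_table b ON b.id = h.id
|     WHERE h.rare_col = 42;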
|
| I've definitely "rolled my own indexing" like this in the past,
| though it's more often been duplicating strings into a custom
| "collation" or other transformations.
|
| Another solution is simply to split your table in two, with the
| same columns in both, and the index only on one of the tables.
| But of course that really depends on your business logic --
| queries that need to retrieve data from both tables together
| can get pretty hairy/slow, and if you've got auto-incrementing
| PKEY's then avoiding collisions between the two tables can be
| tricky on its own. So this is definitely the less general
| solution.
|
| Of course, it certainly would be nicer if MySQL supported
| partial indexes. It seems so useful, I'm surprised it didn't
| happen long ago.
| lucian1900 wrote:
| The first approach is one of the steps towards normalising a
| database.
| crazygringo wrote:
| Actually it's the opposite of database normalization.
|
| Normalizing removes data redundancy. This adds data
| redundancy.
|
| When I design a database structure, it's common to start
| with the most normalized representation possible. And then
| to denormalize the minimum necessary for performance
| reasons -- duplicating rows and/or columns just like here
| so certain data can be retrieved more quickly, whenever
| indexes aren't powerful or featured enough.
| fauigerzigerk wrote:
| I think what lucian1900 may be thinking is that instead of
|
|     create table purchase_order (
|         id int primary key,
|         ordered_on timestamptz not null,
|         customer_id int not null references customer,
|         canceled_on timestamptz
|     );
|
| you could have
|
|     create table purchase_order (
|         id int primary key,
|         ordered_on timestamptz not null,
|         customer_id int not null references customer
|     );
|
|     create table order_cancelation (
|         order_id int primary key references purchase_order,
|         canceled_on timestamptz not null
|     );
|
| This is indeed a better normalised schema and it allows
| you to index order_cancelation.canceled_on without
| worrying about nulls.
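|
| E.g. (a sketch):
|
|     create index order_cancelation_canceled_on_ix
|         on order_cancelation (canceled_on);
|
| Every row in order_cancelation is a real cancelation, so the
| index stays dense without any partial-index predicate.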
| lucian1900 wrote:
| Exactly, that's what I thought was being described.
| crazygringo wrote:
| Oh then, absolutely. I was assuming a constraint that
| columns couldn't be removed from the original table. But
| if you can, then your example is an even better solution.
| abfan1127 wrote:
| Could you create a temporary high-memory slave MySQL server,
| sync it from the master, add the index, sync back to the
| master, and decommission the temporary server? I don't know
| enough about master/slave operations to know if it would work.
| wolf550e wrote:
| 'malinens doesn't have the storage space to store an index
| over a column if all values in the column are indexed. They
| want to index only some values, but not others. This feature
| does not exist in MySQL.
| matsemann wrote:
| > _Clear bloat in tables_
|
| Ohh, we've had issues with this. We have a table that's mostly
| ephemeral data, so rows are constantly inserted and then
| deleted after a certain amount of time. Due to a bug the
| deletion didn't work for a while and the db grew very large.
| We fixed the deletion, but no amount of vacuuming actually
| allows us to fully reclaim that space so we don't have to pay
| for it.
|
| At the same time, the extra cost is probably negligible
| compared to spending more energy fixing it.
| hinkley wrote:
| The problem we always ran into with deletes is them triggering
| full table scans because our indexes weren't set up correctly
| to test foreign key constraints properly. Constant game of
| whack-a-mole that everyone quickly grew tired of. Also more
| indexes increases the slope of the line for insert operations
| as data size grows.
|
| Another solution is tombstoning data so you never actually do a
| DELETE, and partial indexes go a long way to making that scale.
| It removes the logn cost of all of the dead data on every
| subsequent insert.
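|
| E.g. (a sketch, with invented names):
|
|     -- tombstoned rows drop out of the index entirely
|     CREATE INDEX items_live_ix ON items (created_at)
|         WHERE deleted_at IS NULL;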
| mulander wrote:
| > The problem we always ran into with deletes is that they
| trigger full table scans because our indexes weren't set
| up correctly to test foreign key constraints properly.
|
| This is a classic case where partitioning shines. Let's say
| those are logs. You partition the table monthly and want to
| retain 3 months of data:
|
|     - M1
|     - M2
|     - M3
|
| When M4 arrives you drop partition M1. This is a very fast
| operation and the space is returned to the OS. You also don't
| need to vacuum after dropping it. When you arrive at M5 you
| repeat the process by dropping M2.
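|
| A sketch with declarative partitioning (PG 10+, names
| invented):
|
|     CREATE TABLE logs (
|         logged_at timestamptz NOT NULL,
|         message   text
|     ) PARTITION BY RANGE (logged_at);
|
|     CREATE TABLE logs_2021_01 PARTITION OF logs
|         FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');
|
|     -- retention: instant, and no vacuum needed afterwards
|     DROP TABLE logs_2021_01;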
|
| > Another solution is tombstoning data so you never actually
| do a DELETE, and partial indexes go a long way to making
| that scale. It removes the log(n) cost of all of the dead
| data on every subsequent insert.
|
| If you are referring to PostgreSQL then this would actually
| be worse than outright doing a DELETE. PostgreSQL is copy-on-
| write, so an UPDATE to an is_deleted column will create a new
| copy of the record and a new entry in all its indexes. The
| old one would still need to be vacuumed. You will accumulate
| bloat faster and vacuums will have more work to do.
| Additionally, since is_deleted would be part of partial
| indexes like you said, a deleted record would also incur a
| copy in all indexes present on the table.
|
| Compare that to just doing the DELETE, which only stores the
| ID of the deleting transaction in the row's xmax, and a
| subsequent vacuum would be able to mark the row as reusable
| by further inserts.
| alexfromapex wrote:
| It seems like this is an optimization that Postgres should handle
| internally, doesn't it?
| boomer918 wrote:
| Partial indexes can flip query plans if the covered part becomes
| so small that it won't be represented when sampled by the stats
| collector. The planner could then decide that the index scan
| isn't worth it and could try an alternative less efficient index
| if one exists.
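|
| One mitigation (not from the article, just a sketch reusing
| the severity_id example from upthread) is to enlarge the
| sample for the column so rare values still show up:
|
|     ALTER TABLE events ALTER COLUMN severity_id
|         SET STATISTICS 1000;
|     ANALYZE events;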
| tbrock wrote:
| Yeah, and sadly using the index in those scenarios could be
| even more worthwhile due to its high selectivity.
|
| Is PG smart enough to avoid that if the query patterns are
| frequently or exclusively covered by the index?
| gangstead wrote:
| The included query for finding which indexes in your database
| could benefit from a partial index is amazing. Thanks for putting
| the extra effort into this post.
| nieve wrote:
| The article includes a couple of useful queries unrelated to the
| "find", and it led me to these useful bloat-detection resources:
| https://wiki.postgresql.org/wiki/Show_database_bloat
| https://github.com/ioguix/pgsql-bloat-estimation
| pierrebai wrote:
| The chart seems to show an uptick of 2GB, not 20GB. Am I missing
| something?
| pottertheotter wrote:
| This has nothing to do with the content, but the design of this
| page really stuck out to me. It's very easy to read and doesn't
| have fluff. But it still feels modern (in the good way). It's
| perfectly balanced.
| paxys wrote:
| Agreed! It's a perfect example of how you can make a website
| with "modern" features (responsive design, accessible, mobile
| friendly, dark mode, static pages, embeds) without it being a
| bloated mess.
| de6u99er wrote:
| When I did my Oracle DBA training 15 years ago, I learnt about
| database reorgs.
|
| It basically means exporting your database (or tables) and
| importing it again. Deleted data doesn't necessarily free up
| space (Oracle sometimes reuses the freed-up space), but it
| doesn't get exported, so a reorg compacts things.
|
| https://www.iri.com/blog/vldb-operations/database-reorgs-why...
|
| https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUEST...
| brianwawok wrote:
| A vacuum full basically does this for a table, copying the data
| from location A to location B, cleaning up junk. I think index
| rebuilding may take a separate command?
| Tostino wrote:
| VACUUM FULL does an index rebuild automatically. Since a
| VACUUM FULL builds an entirely new heap table, the old
| indexes would all point to incorrect locations for the
| tuples, so it has no choice but to rebuild them.
| brianwawok wrote:
| Excellent, there you go.
|
| Is there a way to do just the index rebuild part, short of
| dropping an index and adding it back?
| jvolkman wrote:
| I believe pg_repack can do that.
| Jedd wrote:
| This is described about a quarter of the way into TFA.
|
| > REINDEX INDEX CONCURRENTLY index_name;
|
| (A vanilla REINDEX will lock the table, preventing
| writes, while it runs. The CONCURRENTLY variant creates
| a new index, replicates any updates to the original
| while it does so, and then does an atomic switcheroo at
| the end.)
| jeffbee wrote:
| Dumping and reloading databases used to be mandatory for major
| PostgreSQL upgrades, which is one of the reasons PostgreSQL
| wasn't suitable for production workloads until recently, and
| also why it resisted fixing bugs in vacuum, index, and
| compaction behavior for many years.
| AbacusAvenger wrote:
| Whoah, that's news to me.
|
| I used PostgreSQL fairly recently (a year or so ago?) and
| ended up abandoning it after I was forced to do the
| export/import dance through a few version upgrades.
|
| When did that requirement go away?
| dialamac wrote:
| It never did.
|
| The difference is that since version 10 you can use logical
| replication to prevent downtime during an upgrade.
|
| Which, if you were using it a year ago, you could have done.
| jskrablin wrote:
| Ever since pg_upgrade became available, around 8.4 or so -
| https://www.percona.com/blog/2019/04/12/fast-upgrade-of-
| lega...
|
| Dump and reload is OK if you have a small database or can
| afford hours of downtime... if not, use pg_upgrade.
| Uberphallus wrote:
| Since 9 there's pg_upgrade. Personally I never had an issue
| and it was very fast, so the downtime is on the order of a
| few minutes, which is OK for my use case. YMMV.
| avereveard wrote:
| "pg_upgrade does its best to make sure the old and new
| clusters are binary-compatible, e.g., by checking for
| compatible compile-time settings, including 32/64-bit
| binaries. It is important that any external modules are
| also binary compatible, though this cannot be checked by
| pg_upgrade."
|
| This makes me very nervous though; I have at least two
| extensions (trigrams and GiST). Maybe they work, maybe not. I
| just prefer the peace of mind of an old-fashioned dump.
___________________________________________________________________
(page generated 2021-02-01 23:01 UTC)