[HN Gopher] Show HN: PostgreSQL index advisor
___________________________________________________________________
Show HN: PostgreSQL index advisor
This is a Postgres extension that can determine if a query should
have an index. For example, for this table:
create table book( id int primary key, title
text not null ); You can run `index_advisor()` to see
if there should be an index on a select statement:
select * from index_advisor('select book.id from book where
title = $1'); And it will return (summarized):
{"CREATE INDEX ON public.book USING btree (title)"} It works
particularly well with pg_stat_statements[0] which tracks execution
statistics of all SQL statements executed on your Postgres
database. It leans heavily on HypoPG[1], an excellent extension to
determine if PostgreSQL will use a given index without spending
resources to create them. [0] pg_stat_statements:
https://www.postgresql.org/docs/current/pgstatstatements.htm...
[1] https://github.com/HypoPG/hypopg
Author : kiwicopple
Score : 324 points
Date : 2024-04-14 02:14 UTC (20 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jitl wrote:
| I've been needing this and also HypoPG, which I didn't realize
| had been available on RDS for a year plus!
|
| The issue I'm struggling with these days is that I have an index
| that I want Postgres to use for one of the relations in a 3+ wide
| join, but unless I put a limit on a CTE of one of the tables,
| Postgres tries to run each join in parallel and ends up trying to
| join a bazillion rows like an asshole.
|
| These days wrangling the query planner has me on the verge of
| breaking up with pg :-(
| kiwicopple wrote:
| i don't know how far it was developed, but I believe
| pg_plan_guarantee[0] was created to solve this sort of issue
|
| > _This extension provides a mechanism, using which you can set
| your execution plan in stone; that is, Postgres will execute
| the plan you give it, or it will throw an error, but it will
| never try to guess a plan for you._
|
| [0] https://github.com/DrPostgres/pg_plan_guarantee
| jitl wrote:
| Ugh I would love LOVE to have something like this or even
| better a stable API where I can manually specify my own
| physical query plan. As is, I'm on RDS and this thing is
| experimental so I have no way to use it.
| jaggederest wrote:
| For what it's worth, depending on how many people and/or
| budget you have available to man on call rotations, running
| your own postgresql server with replication is not that
| hard. The RDS premium is like a whole engineer's salary by
| the time you get to Large Databases (tm), and if you're
| running production, staging, and maybe a data warehouse, it
| starts to rapidly be able to pay for a whole team.
| jaggederest wrote:
| You can set max_parallel_workers_per_gather to zero in the
| session where you're running the problem query, if that's
| helpful. That will disable the parallel query behavior
| entirely. You can just reset it back to what it was once the
| query is complete. I've run into this issue before and that was
| my go-to fix.
| krembo wrote:
| I'd strongly oppose setting the max parallel workers on
| session level just to bypass an execution plab. I stead
| understand the logic why pg behave the way it is, and change
| the query accordingly
| jaggederest wrote:
| Yeah you can file a patch with postgres and get it into the
| next point release or something but in the mean time you
| really want to keep that query from OOMing your database :)
| justinclift wrote:
| Guessing you've seen this already and it's not doing what you
| want?
|
| https://github.com/ossc-db/pg_hint_plan
| polishdude20 wrote:
| Have you tried changing the order of the joins? I know it
| shouldn't matter but I've found sometimes it tricks the query
| planner into doing the right things!
| arronax wrote:
| While pinning a single plan or disabling parallel operations is
| a valid quick fix, I'd recommend trying to actually understand
| why the optimizer is doing what it's doing. It's all "just" a
| simple math of execution node costs. The problem with plan
| hints, or disabling parallel, is that unless you have a perfect
| understanding of where the data will be in, say, a year, you
| may just have created another problem, just delayed. Some
| column will go from 10 to 10,000 unique values and your good
| plan will become way off. And at the end, perhaps you just need
| to tell PG that an index access is not 4 times as expensive as
| a plain heap scan.
|
| ps fighting the optimizer is the worst
| itsgrimetime wrote:
| Does something like this for MongoDB? Searching hasn't turned up
| anything useful.
| winrid wrote:
| Ops Manager (paid tool) kinda supports this but in my
| experience it misses a lot of obvious suggestions.
| pkiv wrote:
| The supabase team always delivers. Excited to give this a try!
| sbstp wrote:
| I've often thought that a database that could automatically
| detect slow queries and create the necessary indexes would be
| neat. You run a load test on your application, which in turns
| calls the database and you collect all the queries it makes. Then
| the database automatically adjusts itself.
| ComodoHacker wrote:
| Big Guys do this. For big bucks, of course.
| tuwtuwtuwtuw wrote:
| > big bucks
|
| You get that feature in Azure SQL Database for $5/month.
| tuwtuwtuwtuw wrote:
| That exists in Microsoft SQL Server. It can create new indexes,
| drop unused indexes, change query plans when it detect
| degradation and so on.
| BrentOzar wrote:
| Source? I've been working with SQL Server for a couple of
| decades and I don't believe it will automatically create or
| drop indexes under any circumstances. You might be thinking
| of Azure SQL DB.
| Ciantic wrote:
| "Automatic tuning, introduced in SQL Server 2017 (14.x),
| notifies you whenever a potential performance issue is
| detected and lets you apply corrective actions, or lets the
| Database Engine automatically fix performance problems."
| [1]
|
| I have used this in Azure SQL too, but according to that it
| should be in SQL Server.
|
| https://learn.microsoft.com/en-us/sql/relational-
| databases/a...
| couchand wrote:
| Good link!
|
| > Automatic index management identifies indexes that
| should be added in your database, and indexes that should
| be removed. Applies to: Azure SQL Database
| BrentOzar wrote:
| Read that link carefully: only automatic plan regression
| is available in SQL Server, not the automatic index
| tuning portion. The index tuning portion only applies to
| Azure SQL DB.
| tuwtuwtuwtuw wrote:
| What's the point of asking for a source when you would find
| it on Google in one minute? Odd way of learning. Not like I
| brought up some debated viewpoint.
| hobs wrote:
| Probably because most of the stuff you'd find in the top
| search results would include the GP's name.
|
| Just a few sentences later "Automatic tuning in Azure SQL
| Database also creates necessary indexes and drops unused
| indexes" - that's not in on-prem SQL Server.
| simplyinfinity wrote:
| Google the name of the person you're replying to :)
| radicalbyte wrote:
| They've had a non-automatic "query advisor" in there
| forever, it operated on profiling data and was highly
| effective.
| taspeotis wrote:
| That's an Azure SQL thing, not MSSQL.
| elric wrote:
| I'm sure the database could, but it doesn't mean the database
| should. Indexes come at the cost of extra disk space, slower
| inserts, and slower updates. In some cases, some slower queries
| might be an acceptable tradeoff. In other cases, maybe not. It
| depends.
| kiwicopple wrote:
| this is our posture for this extension on the supabase
| platform. we could automate the creation of the indexes using
| the Index Advisor, but we feel it would be better to expose
| the possible indexes to the user and let them choose
| gneray wrote:
| this is the way ^^
| dmurray wrote:
| You could tell it "you have a budget of X GB for disk
| space, choose the indexes that best optimize the queries
| given the budget cap."
|
| Not perfect, because some queries may be more time-critical
| than others.
|
| You could even annotate every query (INSERT and UPDATE as
| well as SELECT) with the dollar amount per execution you're
| willing to pay to make it 100ms faster, or accept to make
| it 100ms slower. Then let it know the marginal dollar cost
| of adding index storage, throw this all into a constraint
| solver and add the indexes which are compatible with your
| pricing.
| d0100 wrote:
| Are the trade-offs measurable? If they are the database could
| just undo the index...
|
| Not just indexing, but table partitions, materialized views,
| keeping things in-memory...
| remus wrote:
| > Are the trade-offs measurable?
|
| Yes, but you need the context about what is the correct
| tradeoff for your use case. If you've got a service that
| depends on fast writes then adding latency via extra
| indices for improved read speed may not be an acceptable
| trade off. It depends on your application though.
| masklinn wrote:
| Because indexes have costs you need a much more complicated
| system which can feed back into itself and downgrade
| probationary indexes back to unindexed.
| fulafel wrote:
| Several databases index everything, needed or not. (And
| sometimes have mechanisms to force it off for some specific
| data)
| freedomben wrote:
| Even this isn't sufficient, because some problems with over-
| indexing don't become apparent until the size of a table gets
| much larger, which only happens a drop at a time. I suppose
| if it was always probationary and continually being
| evaluated, at some point it could recognize that for example
| INSERTs are now taking 1000x longer than they were 2 years
| ago. But that feels like a never-ending battle against corner
| cases, and any automatic actions it takes add significant
| complexity to the person debugging later.
| arronax wrote:
| Oracle DB is, or was, very close to that with its query
| profiles, baselines, and query patches. It wasn't automatic
| back in 2014 when I last worked on it, but all the tools were
| there. Heck, it was possible to completely rewrite a bad query
| on the fly and execute a re-written variant. I suppose it all
| stems from the fact that Oracle is regularly used under massive
| black boxes, including the EBS.
|
| Also, the problem with automatic indexing is that it only gets
| you so far, and any index can, in theory, mess up another query
| that is perfectly fine. Optimizers aren't omniscient. In
| addition, there are other knobs in the database, which affect
| performance. I suppose, a wider approach than just looking at
| indexes would be more successful. Like Ottertune, for example.
| Tostino wrote:
| The problem of new indexes messing up otherwise good queries
| is something I've battled on and off for the past decade with
| Postgres. Definitely annoying.
| rand_r wrote:
| How would an index mess up another query? AFAIK indexes
| would only hurt write performance marginally per index, but
| most slow queries are read-only. I've tended to just add
| indexes as I go without thinking about it and haven't run
| into issues, so genuinely curious.
| dmurray wrote:
| > Heck, it was possible to completely rewrite a bad query on
| the fly and execute a re-written variant.
|
| Is there really such a thing as a bad query that can be
| rewritten to give the same results but faster? For me, that's
| already the query optimizer's job.
|
| Of course there are "bad queries" where you query for things
| you don't need, join on the wrong columns, etc. And yeah the
| optimizer isn't perfect. But a query that you expect the
| query optimizer to "rewrite" and execute in an optimal way is
| a good query.
| Scene_Cast2 wrote:
| One example is that in Presto, joins assume that left
| (IIRC) table is the big one you stream, and the right one
| is the small one you hash. One of the newer features was
| that in some cases, the optimizer is able to ignore the SQL
| order and correctly pick which table to hash and which one
| to stream.
| legulere wrote:
| Isn't that just basic table/index statistics to know
| which table is the smallest?
| b3lm0nt wrote:
| Andrew Kane built dexter, which is an automatic indexer for
| Postgres.
|
| https://github.com/ankane/dexter
|
| https://ankane.org/introducing-dexter
| ed_balls wrote:
| Default DB for App Engine (NDB) has this feature. Implicit
| indexes are tad annoying.
| GordonS wrote:
| I might be misremembering, but IIRC RavenDB does this (it's a
| commercial document DB, written in C#).
| ComodoHacker wrote:
| This doesn't seem to consider or give insight on any tradeoffs
| involved.
|
| Also, the underlying extension, HypoPG, doesn't seem to collect
| any statistics on data to influence query planner.
| CAP_NET_ADMIN wrote:
| If you have some PostgreSQL performance issues, I'd recommend
| checking out PGAnalyze - they've offered a much more advanced
| index advisor for some time now.
|
| My company is a paid customer since around 2020 and we are very
| satisfied, easily beats the Datadog's (which we use for the rest
| of our infra and apps) observability offering for PostgreSQL.
| allan_s wrote:
| Re: performance issue
|
| I used to think that performance issue in relational database
| was always a matter of :
|
| * missing indexes * non-used indexes due to query order (where
| A, B instead of B, A)
|
| But we had the case recently where we optimized a query in
| postgresql which was taking 100% of cpu during 1s (enough to
| trigger our alerting) by simply splitting a OR in two separate
| query.
|
| So if you are looking for optimisation it may be good to know
| about "OR is bad". The two queries run in some ms both.
| xvinci wrote:
| I'm sorry but you just cannot say that "OR is bad" - it being
| a key part of SQL . It's most likely your use that is bad
| (e.g. your intermediate result exceeding some cache size).
|
| But "bad performance always due to indexes" gives a hint that
| you are somewhat new: No, bad performance in my experience
| was almost always due to developers either not understanding
| their ORM framework, or writing too expensive queries with or
| without index. Just adding indexes seldom solved the problem
| (maybe 1/5 of the time).
| mrklol wrote:
| OR is indeed not bad, but you have to think about when to
| use it. It can easily make queries slower compared to other
| operators. That's exactly why we have the terms "ugly OR" /
| "bad OR".
| thom wrote:
| It's worth having a mental model of _why_ OR can be
| suboptimal. Often it's because you're only hitting an
| index on half the conditional, or forcing PG into a
| bitmap scan, or worse turning and index lookup into a
| sequential scan. Not to bang on about indexes too much
| but a partial index on the OR condition works if you're
| lazy, although splitting into two queries is often a
| great solution as it gives you two fast queries to
| combine instead of one slow one (although sometimes that
| implies PG's statistics are incorrect because it might
| have been able to know which side of the conditional cuts
| out more data).
| dz08dl wrote:
| It's complicated; that's why there isn't a one-size-fits-
| all solution. In the end, you want to have a good execution
| plan, and there's usually not just one and the same action
| to achieve that.
| magicalhippo wrote:
| > Just adding indexes seldom solved the problem
|
| We write all our queries by hand. We've got decades of
| experience and I'd say we're pretty proficient.
|
| For us adding an index is almost always the solution,
| assuming the statistics are fine.
|
| Either we plain forgot, or a customer required new
| functionality we didn't predict so no index on the fields
| required.
|
| Sure sometimes a poorly constructed query slips out or the
| optimizer needs some help by reorganizing the query, but
| it's rare.
| magicalhippo wrote:
| We are transitioning from SQLAnywhere to MSSQL, and saw the
| same for a key query.
|
| SQLAnywhere handled the single OR fine, but we had to split
| the query into two using UNION ALL for MSSQL not to be slow
| as a snail burning tons of CPU.
|
| No idea why the MSSQL optimizer doesn't do that itself, it's
| essentially what SQLAnywhere does.
| Bishonen88 wrote:
| How does one install this on a aws rds instance?
| kiwicopple wrote:
| hypopg is now supported in RDS so the steps would be:
|
| 1. Enable hypopg create extension if not
| exists hypopg;
|
| 2. Copy/paste the plpgsql file:
|
| https://github.com/supabase/index_advisor/blob/main/index_ad...
|
| We are also developing the Trusted Language Extension with the
| RDS team, so at some point it should be easier to do this
| through database.dev:
|
| https://database.dev/olirice/index_advisor
| crorella wrote:
| Interesting! We did something similar for distributed query
| engines like presto and Spark, instead of indexes we use
| partitions and buckets, the benefits are less compute, time and
| money.
| jensenbox wrote:
| dexter uses a similar premise with hypog.
|
| https://github.com/ankane/dexter
| PikachuEXE wrote:
| I am using https://github.com/ankane/pghero/ and this is one of
| its features with GUI
| kiwicopple wrote:
| Andrew builds some amazing tools for the postgres ecosystem (he
| is also the creator of pgvector)
| dveeden2 wrote:
| This is somewhat similar to TiAdvisor for TiDB, which also uses a
| hypothetical method.
|
| https://www.pingcap.com/blog/introducing-tiadvisor-automated...
| fforflo wrote:
| The convenient thing about this is that it's written in vanilla
| Pl/PgSQL. It can be tempting to copy the
| `index_advisor(text)`function in a session and start hard-coding
| stuff and heuristics :D .
|
| Most meaningful extensions need to be compiled, installed,
| created dropped.
| victorbjorklund wrote:
| Oh that is indeed very nice
| pokipoke wrote:
| You still need to install hypopg which is C extension
| l5870uoo9y wrote:
| This can be optimized both easier and further[0]. How is case-
| sensitivity optimally handled? Should the query only match exact
| titles?
|
| [0]: https://www.sqlai.ai/snippets/cluzdmi8w006d53gt82mguaga
| gregw2 wrote:
| Is this aware of parent and child inherited tables?
| Ozzie_osman wrote:
| My team has been using Pganalyze (which has its own index
| advisor), and honestly, it's one of my favorite technical tools
| ever. I don't know how you could run a large-scale postgres setup
| without it. In addition to index advice, it also shows top
| queries, vacuums, etc, and monitors for a whole host of other
| issues out of the box.
|
| They also have a ton of great content on their blog (5mins of
| postgres) where the founder will find blog posts by different
| authors/companies and analyze them in depth.
| lfittl wrote:
| Thanks for the kind words!
|
| For anyone interested in how pganalyze's approach compares to
| this extension (and other alternatives like dexter, or using
| HypoPG directly), I gave a talk with my colleague Philippe last
| year at PgCon that describes how we use constraint programming
| and CP-SAT for dealing with the trade-off between index write
| overhead and read performance improvement, across multiple
| queries on a table:
|
| https://www.pgcon.org/events/pgcon_2023/schedule/session/422...
| polivier wrote:
| If anyone has some background in optimization and is
| interested in the more technical side of things, I gave a
| talk at JOPT 2023, comparing MIP and CP approaches for this
| problem. The slides can be found here:
| https://github.com/PhilippeOlivier/jopt2023
| Rapzid wrote:
| This is awesome! Was the talk recording by chance?
| lfittl wrote:
| Yep, there is a "link to video" link on the talk page -
| here is the direct link:
|
| https://www.youtube.com/watch?v=pGN_pORKtSQ
|
| We also did a more recent webinar that has some slight
| revisions on top of that talk, recording available in our
| docs: https://pganalyze.com/docs/indexing-engine/cp-model
| Rapzid wrote:
| Cheers! I skimmed right over that.
| brightball wrote:
| Would be great if there was something you could run on tables to
| recommend more space efficient data types based on what's
| actually stored in the table.
| klysm wrote:
| I can't imagine a concrete example of this
| jpgvm wrote:
| Examples are easy, usefulness... harder to imagine.
|
| Simple example, bigint column where all values would fit in
| smallint or if only 0/1 are present then boolean.
|
| For a more complex idea if a large number of boolean columns
| are present in a table suggest packing them into
| integer/bigint as appropriate or bit(n) if individual
| querying/indexing via bit operators is needed.
|
| There are many ways to claw back bytes on disk from
| PostgreSQL if you really need to and a lot of them could be
| suggested automatically.
|
| The reason I say usefulness is harder to imagine is I don't
| know of anyone that would want to do this but wouldn't know
| how or where to look for these strategies. It's as if
| awareness of the need is commensurate with ability to resolve
| it.
| vvern wrote:
| CockroachDB has a similar feature built in. It takes existing
| queries which are slow and then analyzes hypothetical indexes for
| better query plans and suggests them. You can add them in the
| console ui with one click. It's pretty useful.
| danolivo wrote:
| The term 'slow' is too relational and not strong. I guess, we
| should look up for queries, which can be potentially faster -
| see into estimation errors or number of data pages involved
| into the query.
| icang888 wrote:
| Twslive
| danolivo wrote:
| Having creation advice, the extension obviously must provide
| candidates to delete and, less obvious, candidates to merge some
| indexes.
| cpursley wrote:
| Good stuff, just added this and some others mentioned in this
| thread to my Postgres Is Enough listicle:
| https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...
| asah wrote:
| Nice. This is 404ing: https://levelup.gitconnected.com/how-
| cloudflare-achieved-55-...
___________________________________________________________________
(page generated 2024-04-14 23:01 UTC)