[HN Gopher] Show HN: PostgreSQL index advisor
       ___________________________________________________________________
        
       Show HN: PostgreSQL index advisor
        
       This is a Postgres extension that can determine if a query should
       have an index. For example, for this table:
       create table book(           id int primary key,           title
       text not null         );       You can run `index_advisor()` to see
       if there should be an index on a select statement:
       select *         from index_advisor('select book.id from book where
       title = $1');       And it will return (summarized):
       {"CREATE INDEX ON public.book USING btree (title)"}       It works
       particularly well with pg_stat_statements[0] which tracks execution
       statistics of all SQL statements executed on your Postgres
       database.  It leans heavily on HypoPG[1], an excellent extension to
       determine if PostgreSQL will use a given index without spending
       resources to create them.  [0] pg_stat_statements:
       https://www.postgresql.org/docs/current/pgstatstatements.htm...
       [1] https://github.com/HypoPG/hypopg
        
       Author : kiwicopple
       Score  : 324 points
       Date   : 2024-04-14 02:14 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jitl wrote:
       | I've been needing this and also HypoPG, which I didn't realize
       | had been available on RDS for a year plus!
       | 
       | The issue I'm struggling with these days is that I have an index
       | that I want Postgres to use for one of the relations in a 3+ wide
       | join, but unless I put a limit on a CTE of one of the tables,
       | Postgres tries to run each join in parallel and ends up trying to
       | join a bazillion rows like an asshole.
       | 
       | These days wrangling the query planner has me on the verge of
       | breaking up with pg :-(
        
         | kiwicopple wrote:
         | i don't know how far it was developed, but I believe
         | pg_plan_guarantee[0] was created to solve this sort of issue
         | 
         | > _This extension provides a mechanism, using which you can set
         | your execution plan in stone; that is, Postgres will execute
         | the plan you give it, or it will throw an error, but it will
         | never try to guess a plan for you._
         | 
         | [0] https://github.com/DrPostgres/pg_plan_guarantee
        
           | jitl wrote:
           | Ugh I would love LOVE to have something like this or even
           | better a stable API where I can manually specify my own
           | physical query plan. As is, I'm on RDS and this thing is
           | experimental so I have no way to use it.
        
             | jaggederest wrote:
             | For what it's worth, depending on how many people and/or
             | budget you have available to man on call rotations, running
             | your own postgresql server with replication is not that
             | hard. The RDS premium is like a whole engineer's salary by
             | the time you get to Large Databases (tm), and if you're
             | running production, staging, and maybe a data warehouse, it
             | starts to rapidly be able to pay for a whole team.
        
         | jaggederest wrote:
         | You can set max_parallel_workers_per_gather to zero in the
         | session where you're running the problem query, if that's
         | helpful. That will disable the parallel query behavior
         | entirely. You can just reset it back to what it was once the
         | query is complete. I've run into this issue before and that was
         | my go-to fix.
        
           | krembo wrote:
           | I'd strongly oppose setting the max parallel workers on
           | session level just to bypass an execution plab. I stead
           | understand the logic why pg behave the way it is, and change
           | the query accordingly
        
             | jaggederest wrote:
             | Yeah you can file a patch with postgres and get it into the
             | next point release or something but in the mean time you
             | really want to keep that query from OOMing your database :)
        
         | justinclift wrote:
         | Guessing you've seen this already and it's not doing what you
         | want?
         | 
         | https://github.com/ossc-db/pg_hint_plan
        
         | polishdude20 wrote:
         | Have you tried changing the order of the joins? I know it
         | shouldn't matter but I've found sometimes it tricks the query
         | planner into doing the right things!
        
         | arronax wrote:
         | While pinning a single plan or disabling parallel operations is
         | a valid quick fix, I'd recommend trying to actually understand
         | why the optimizer is doing what it's doing. It's all "just" a
         | simple math of execution node costs. The problem with plan
         | hints, or disabling parallel, is that unless you have a perfect
         | understanding of where the data will be in, say, a year, you
         | may just have created another problem, just delayed. Some
         | column will go from 10 to 10,000 unique values and your good
         | plan will become way off. And at the end, perhaps you just need
         | to tell PG that an index access is not 4 times as expensive as
         | a plain heap scan.
         | 
         | ps fighting the optimizer is the worst
        
       | itsgrimetime wrote:
       | Does something like this for MongoDB? Searching hasn't turned up
       | anything useful.
        
         | winrid wrote:
         | Ops Manager (paid tool) kinda supports this but in my
         | experience it misses a lot of obvious suggestions.
        
       | pkiv wrote:
       | The supabase team always delivers. Excited to give this a try!
        
       | sbstp wrote:
       | I've often thought that a database that could automatically
       | detect slow queries and create the necessary indexes would be
       | neat. You run a load test on your application, which in turns
       | calls the database and you collect all the queries it makes. Then
       | the database automatically adjusts itself.
        
         | ComodoHacker wrote:
         | Big Guys do this. For big bucks, of course.
        
           | tuwtuwtuwtuw wrote:
           | > big bucks
           | 
           | You get that feature in Azure SQL Database for $5/month.
        
         | tuwtuwtuwtuw wrote:
         | That exists in Microsoft SQL Server. It can create new indexes,
         | drop unused indexes, change query plans when it detect
         | degradation and so on.
        
           | BrentOzar wrote:
           | Source? I've been working with SQL Server for a couple of
           | decades and I don't believe it will automatically create or
           | drop indexes under any circumstances. You might be thinking
           | of Azure SQL DB.
        
             | Ciantic wrote:
             | "Automatic tuning, introduced in SQL Server 2017 (14.x),
             | notifies you whenever a potential performance issue is
             | detected and lets you apply corrective actions, or lets the
             | Database Engine automatically fix performance problems."
             | [1]
             | 
             | I have used this in Azure SQL too, but according to that it
             | should be in SQL Server.
             | 
             | https://learn.microsoft.com/en-us/sql/relational-
             | databases/a...
        
               | couchand wrote:
               | Good link!
               | 
               | > Automatic index management identifies indexes that
               | should be added in your database, and indexes that should
               | be removed. Applies to: Azure SQL Database
        
               | BrentOzar wrote:
               | Read that link carefully: only automatic plan regression
               | is available in SQL Server, not the automatic index
               | tuning portion. The index tuning portion only applies to
               | Azure SQL DB.
        
             | tuwtuwtuwtuw wrote:
             | What's the point of asking for a source when you would find
             | it on Google in one minute? Odd way of learning. Not like I
             | brought up some debated viewpoint.
        
               | hobs wrote:
               | Probably because most of the stuff you'd find in the top
               | search results would include the GP's name.
               | 
               | Just a few sentences later "Automatic tuning in Azure SQL
               | Database also creates necessary indexes and drops unused
               | indexes" - that's not in on-prem SQL Server.
        
               | simplyinfinity wrote:
               | Google the name of the person you're replying to :)
        
             | radicalbyte wrote:
             | They've had a non-automatic "query advisor" in there
             | forever, it operated on profiling data and was highly
             | effective.
        
           | taspeotis wrote:
           | That's an Azure SQL thing, not MSSQL.
        
         | elric wrote:
         | I'm sure the database could, but it doesn't mean the database
         | should. Indexes come at the cost of extra disk space, slower
         | inserts, and slower updates. In some cases, some slower queries
         | might be an acceptable tradeoff. In other cases, maybe not. It
         | depends.
        
           | kiwicopple wrote:
           | this is our posture for this extension on the supabase
           | platform. we could automate the creation of the indexes using
           | the Index Advisor, but we feel it would be better to expose
           | the possible indexes to the user and let them choose
        
             | gneray wrote:
             | this is the way ^^
        
             | dmurray wrote:
             | You could tell it "you have a budget of X GB for disk
             | space, choose the indexes that best optimize the queries
             | given the budget cap."
             | 
             | Not perfect, because some queries may be more time-critical
             | than others.
             | 
             | You could even annotate every query (INSERT and UPDATE as
             | well as SELECT) with the dollar amount per execution you're
             | willing to pay to make it 100ms faster, or accept to make
             | it 100ms slower. Then let it know the marginal dollar cost
             | of adding index storage, throw this all into a constraint
             | solver and add the indexes which are compatible with your
             | pricing.
        
           | d0100 wrote:
           | Are the trade-offs measurable? If they are the database could
           | just undo the index...
           | 
           | Not just indexing, but table partitions, materialized views,
           | keeping things in-memory...
        
             | remus wrote:
             | > Are the trade-offs measurable?
             | 
             | Yes, but you need the context about what is the correct
             | tradeoff for your use case. If you've got a service that
             | depends on fast writes then adding latency via extra
             | indices for improved read speed may not be an acceptable
             | trade off. It depends on your application though.
        
         | masklinn wrote:
         | Because indexes have costs you need a much more complicated
         | system which can feed back into itself and downgrade
         | probationary indexes back to unindexed.
        
           | fulafel wrote:
           | Several databases index everything, needed or not. (And
           | sometimes have mechanisms to force it off for some specific
           | data)
        
           | freedomben wrote:
           | Even this isn't sufficient, because some problems with over-
           | indexing don't become apparent until the size of a table gets
           | much larger, which only happens a drop at a time. I suppose
           | if it was always probationary and continually being
           | evaluated, at some point it could recognize that for example
           | INSERTs are now taking 1000x longer than they were 2 years
           | ago. But that feels like a never-ending battle against corner
           | cases, and any automatic actions it takes add significant
           | complexity to the person debugging later.
        
         | arronax wrote:
         | Oracle DB is, or was, very close to that with its query
         | profiles, baselines, and query patches. It wasn't automatic
         | back in 2014 when I last worked on it, but all the tools were
         | there. Heck, it was possible to completely rewrite a bad query
         | on the fly and execute a re-written variant. I suppose it all
         | stems from the fact that Oracle is regularly used under massive
         | black boxes, including the EBS.
         | 
         | Also, the problem with automatic indexing is that it only gets
         | you so far, and any index can, in theory, mess up another query
         | that is perfectly fine. Optimizers aren't omniscient. In
         | addition, there are other knobs in the database, which affect
         | performance. I suppose, a wider approach than just looking at
         | indexes would be more successful. Like Ottertune, for example.
        
           | Tostino wrote:
           | The problem of new indexes messing up otherwise good queries
           | is something I've battled on and off for the past decade with
           | Postgres. Definitely annoying.
        
             | rand_r wrote:
             | How would an index mess up another query? AFAIK indexes
             | would only hurt write performance marginally per index, but
             | most slow queries are read-only. I've tended to just add
             | indexes as I go without thinking about it and haven't run
             | into issues, so genuinely curious.
        
           | dmurray wrote:
           | > Heck, it was possible to completely rewrite a bad query on
           | the fly and execute a re-written variant.
           | 
           | Is there really such a thing as a bad query that can be
           | rewritten to give the same results but faster? For me, that's
           | already the query optimizer's job.
           | 
           | Of course there are "bad queries" where you query for things
           | you don't need, join on the wrong columns, etc. And yeah the
           | optimizer isn't perfect. But a query that you expect the
           | query optimizer to "rewrite" and execute in an optimal way is
           | a good query.
        
             | Scene_Cast2 wrote:
             | One example is that in Presto, joins assume that left
             | (IIRC) table is the big one you stream, and the right one
             | is the small one you hash. One of the newer features was
             | that in some cases, the optimizer is able to ignore the SQL
             | order and correctly pick which table to hash and which one
             | to stream.
        
               | legulere wrote:
               | Isn't that just basic table/index statistics to know
               | which table is the smallest?
        
         | b3lm0nt wrote:
         | Andrew Kane built dexter, which is an automatic indexer for
         | Postgres.
         | 
         | https://github.com/ankane/dexter
         | 
         | https://ankane.org/introducing-dexter
        
         | ed_balls wrote:
         | Default DB for App Engine (NDB) has this feature. Implicit
         | indexes are tad annoying.
        
         | GordonS wrote:
         | I might be misremembering, but IIRC RavenDB does this (it's a
         | commercial document DB, written in C#).
        
       | ComodoHacker wrote:
       | This doesn't seem to consider or give insight on any tradeoffs
       | involved.
       | 
       | Also, the underlying extension, HypoPG, doesn't seem to collect
       | any statistics on data to influence query planner.
        
       | CAP_NET_ADMIN wrote:
       | If you have some PostgreSQL performance issues, I'd recommend
       | checking out PGAnalyze - they've offered a much more advanced
       | index advisor for some time now.
       | 
       | My company is a paid customer since around 2020 and we are very
       | satisfied, easily beats the Datadog's (which we use for the rest
       | of our infra and apps) observability offering for PostgreSQL.
        
         | allan_s wrote:
         | Re: performance issue
         | 
         | I used to think that performance issue in relational database
         | was always a matter of :
         | 
         | * missing indexes * non-used indexes due to query order (where
         | A, B instead of B, A)
         | 
         | But we had the case recently where we optimized a query in
         | postgresql which was taking 100% of cpu during 1s (enough to
         | trigger our alerting) by simply splitting a OR in two separate
         | query.
         | 
         | So if you are looking for optimisation it may be good to know
         | about "OR is bad". The two queries run in some ms both.
        
           | xvinci wrote:
           | I'm sorry but you just cannot say that "OR is bad" - it being
           | a key part of SQL . It's most likely your use that is bad
           | (e.g. your intermediate result exceeding some cache size).
           | 
           | But "bad performance always due to indexes" gives a hint that
           | you are somewhat new: No, bad performance in my experience
           | was almost always due to developers either not understanding
           | their ORM framework, or writing too expensive queries with or
           | without index. Just adding indexes seldom solved the problem
           | (maybe 1/5 of the time).
        
             | mrklol wrote:
             | OR is indeed not bad, but you have to think about when to
             | use it. It can easily make queries slower compared to other
             | operators. That's exactly why we have the terms "ugly OR" /
             | "bad OR".
        
               | thom wrote:
               | It's worth having a mental model of _why_ OR can be
               | suboptimal. Often it's because you're only hitting an
               | index on half the conditional, or forcing PG into a
               | bitmap scan, or worse turning and index lookup into a
               | sequential scan. Not to bang on about indexes too much
               | but a partial index on the OR condition works if you're
               | lazy, although splitting into two queries is often a
               | great solution as it gives you two fast queries to
               | combine instead of one slow one (although sometimes that
               | implies PG's statistics are incorrect because it might
               | have been able to know which side of the conditional cuts
               | out more data).
        
             | dz08dl wrote:
             | It's complicated; that's why there isn't a one-size-fits-
             | all solution. In the end, you want to have a good execution
             | plan, and there's usually not just one and the same action
             | to achieve that.
        
             | magicalhippo wrote:
             | > Just adding indexes seldom solved the problem
             | 
             | We write all our queries by hand. We've got decades of
             | experience and I'd say we're pretty proficient.
             | 
             | For us adding an index is almost always the solution,
             | assuming the statistics are fine.
             | 
             | Either we plain forgot, or a customer required new
             | functionality we didn't predict so no index on the fields
             | required.
             | 
             | Sure sometimes a poorly constructed query slips out or the
             | optimizer needs some help by reorganizing the query, but
             | it's rare.
        
           | magicalhippo wrote:
           | We are transitioning from SQLAnywhere to MSSQL, and saw the
           | same for a key query.
           | 
           | SQLAnywhere handled the single OR fine, but we had to split
           | the query into two using UNION ALL for MSSQL not to be slow
           | as a snail burning tons of CPU.
           | 
           | No idea why the MSSQL optimizer doesn't do that itself, it's
           | essentially what SQLAnywhere does.
        
       | Bishonen88 wrote:
       | How does one install this on a aws rds instance?
        
         | kiwicopple wrote:
         | hypopg is now supported in RDS so the steps would be:
         | 
         | 1. Enable hypopg                   create extension if not
         | exists hypopg;
         | 
         | 2. Copy/paste the plpgsql file:
         | 
         | https://github.com/supabase/index_advisor/blob/main/index_ad...
         | 
         | We are also developing the Trusted Language Extension with the
         | RDS team, so at some point it should be easier to do this
         | through database.dev:
         | 
         | https://database.dev/olirice/index_advisor
        
       | crorella wrote:
       | Interesting! We did something similar for distributed query
       | engines like presto and Spark, instead of indexes we use
       | partitions and buckets, the benefits are less compute, time and
       | money.
        
       | jensenbox wrote:
       | dexter uses a similar premise with hypog.
       | 
       | https://github.com/ankane/dexter
        
       | PikachuEXE wrote:
       | I am using https://github.com/ankane/pghero/ and this is one of
       | its features with GUI
        
         | kiwicopple wrote:
         | Andrew builds some amazing tools for the postgres ecosystem (he
         | is also the creator of pgvector)
        
       | dveeden2 wrote:
       | This is somewhat similar to TiAdvisor for TiDB, which also uses a
       | hypothetical method.
       | 
       | https://www.pingcap.com/blog/introducing-tiadvisor-automated...
        
       | fforflo wrote:
       | The convenient thing about this is that it's written in vanilla
       | Pl/PgSQL. It can be tempting to copy the
       | `index_advisor(text)`function in a session and start hard-coding
       | stuff and heuristics :D .
       | 
       | Most meaningful extensions need to be compiled, installed,
       | created dropped.
        
         | victorbjorklund wrote:
         | Oh that is indeed very nice
        
         | pokipoke wrote:
         | You still need to install hypopg which is C extension
        
       | l5870uoo9y wrote:
       | This can be optimized both easier and further[0]. How is case-
       | sensitivity optimally handled? Should the query only match exact
       | titles?
       | 
       | [0]: https://www.sqlai.ai/snippets/cluzdmi8w006d53gt82mguaga
        
       | gregw2 wrote:
       | Is this aware of parent and child inherited tables?
        
       | Ozzie_osman wrote:
       | My team has been using Pganalyze (which has its own index
       | advisor), and honestly, it's one of my favorite technical tools
       | ever. I don't know how you could run a large-scale postgres setup
       | without it. In addition to index advice, it also shows top
       | queries, vacuums, etc, and monitors for a whole host of other
       | issues out of the box.
       | 
       | They also have a ton of great content on their blog (5mins of
       | postgres) where the founder will find blog posts by different
       | authors/companies and analyze them in depth.
        
         | lfittl wrote:
         | Thanks for the kind words!
         | 
         | For anyone interested in how pganalyze's approach compares to
         | this extension (and other alternatives like dexter, or using
         | HypoPG directly), I gave a talk with my colleague Philippe last
         | year at PgCon that describes how we use constraint programming
         | and CP-SAT for dealing with the trade-off between index write
         | overhead and read performance improvement, across multiple
         | queries on a table:
         | 
         | https://www.pgcon.org/events/pgcon_2023/schedule/session/422...
        
           | polivier wrote:
           | If anyone has some background in optimization and is
           | interested in the more technical side of things, I gave a
           | talk at JOPT 2023, comparing MIP and CP approaches for this
           | problem. The slides can be found here:
           | https://github.com/PhilippeOlivier/jopt2023
        
           | Rapzid wrote:
           | This is awesome! Was the talk recording by chance?
        
             | lfittl wrote:
             | Yep, there is a "link to video" link on the talk page -
             | here is the direct link:
             | 
             | https://www.youtube.com/watch?v=pGN_pORKtSQ
             | 
             | We also did a more recent webinar that has some slight
             | revisions on top of that talk, recording available in our
             | docs: https://pganalyze.com/docs/indexing-engine/cp-model
        
               | Rapzid wrote:
               | Cheers! I skimmed right over that.
        
       | brightball wrote:
       | Would be great if there was something you could run on tables to
       | recommend more space efficient data types based on what's
       | actually stored in the table.
        
         | klysm wrote:
         | I can't imagine a concrete example of this
        
           | jpgvm wrote:
           | Examples are easy, usefulness... harder to imagine.
           | 
           | Simple example, bigint column where all values would fit in
           | smallint or if only 0/1 are present then boolean.
           | 
           | For a more complex idea if a large number of boolean columns
           | are present in a table suggest packing them into
           | integer/bigint as appropriate or bit(n) if individual
           | querying/indexing via bit operators is needed.
           | 
           | There are many ways to claw back bytes on disk from
           | PostgreSQL if you really need to and a lot of them could be
           | suggested automatically.
           | 
           | The reason I say usefulness is harder to imagine is I don't
           | know of anyone that would want to do this but wouldn't know
           | how or where to look for these strategies. It's as if
           | awareness of the need is commensurate with ability to resolve
           | it.
        
       | vvern wrote:
       | CockroachDB has a similar feature built in. It takes existing
       | queries which are slow and then analyzes hypothetical indexes for
       | better query plans and suggests them. You can add them in the
       | console ui with one click. It's pretty useful.
        
         | danolivo wrote:
         | The term 'slow' is too relational and not strong. I guess, we
         | should look up for queries, which can be potentially faster -
         | see into estimation errors or number of data pages involved
         | into the query.
        
       | icang888 wrote:
       | Twslive
        
       | danolivo wrote:
       | Having creation advice, the extension obviously must provide
       | candidates to delete and, less obvious, candidates to merge some
       | indexes.
        
       | cpursley wrote:
       | Good stuff, just added this and some others mentioned in this
       | thread to my Postgres Is Enough listicle:
       | https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...
        
         | asah wrote:
         | Nice. This is 404ing: https://levelup.gitconnected.com/how-
         | cloudflare-achieved-55-...
        
       ___________________________________________________________________
       (page generated 2024-04-14 23:01 UTC)