[HN Gopher] Making a Postgres query 1k times faster
       ___________________________________________________________________
        
       Making a Postgres query 1k times faster
        
       Author : d0mine
       Score  : 68 points
       Date   : 2024-05-15 21:00 UTC (2 hours ago)
        
 (HTM) web link (mattermost.com)
 (TXT) w3m dump (mattermost.com)
        
       | olliej wrote:
       | Based on the last time I did anything involving databases, I
       | think the easy way to do this is "find a query written by someone
       | like olliej, and then fix the clear and obviously stupid choices"
       | :D
        
       | paulddraper wrote:
       | I'm gonna save you 15 minutes.
       | 
       | The author filtered with:                   CreateAt > $1 OR
       | (CreateAt = $1 AND Id > $2)
       | 
       | That can't use one index for both conditions. Instead, make it
       | faster and simpler:                  (CreateAt, Id) > ($1, $2)
       | 
       | End of post.
        
         | damidekronik wrote:
         | I think a nicer takeaway is what the author summarized with:
         | 
         | 1. Always use BUFFERS when running an EXPLAIN. It gives some
         | data that may be crucial for the investigation.
         | 
         | 2. Always, always try to get an Index Cond (called Index range
         | scan in MySQL) instead of a Filter.
         | 
         | 3. Always, always, always assume PostgreSQL and MySQL will
         | behave differently. Because they do.
        
         | mewpmewp2 wrote:
         | That is a bit wrong and I'm pointing it out because your
         | summary logically didn't make sense to me. So I did have to
         | read the blog post after reading your comment.
         | 
         | The query should be CreateAt > $1 OR (CreateAt = $1 (NOT $2
         | like in your sample) AND Id > $2)
         | 
         | So the idea is here about paginating through posts that might
         | be constantly changing so you can't use a simple offset, as
         | that would give you duplications along the way. So you try to
         | use CreateAt, but it could be possible that CreateAt is equal
         | to another one so you fallback to ID.
         | 
         | But here I stopped reading the blog post, because I now think
         | why not use Id in the first place since it also seems to be
         | auto increment since otherwise you couldn't really rely on it
         | to be a fallback like this? I don't have time to investigate it
         | further, but tldr; that still left me confused - why not use ID
         | in the first place.
        
           | selecsosi wrote:
           | They could be using something like application generated time
           | sortable UUID7s or some other sortable key
           | 
           | https://uuid7.com/
           | 
           | [edit] CreatedAt timestamp could be something from the client
           | when the post is submitted or tagged from the ingest server
           | and not when they actually are processed and hit the database
        
             | mewpmewp2 wrote:
             | For sure, but still why not use that in the first place?
             | 
             | And I agree I shouldn't have said "auto increment".
        
               | selecsosi wrote:
               | It depends on the resolution of the timestamp. With 1000s
               | of posts a second coming in, the fallback would be that
               | it is at the same "timestamp" and then we fallback to a
               | greater identifier. My guess is that the resolution of
               | the timestamp allows for more efficient postgres index
               | usage then the id index which my guess is larger and a
               | string so not as efficiently searchable as the underlying
               | identifier
               | 
               | (my guess is time based index offer faster search
               | performance or lower overhead than a string based search
               | index which doesn't understand it is representing encoded
               | time data)
        
               | mewpmewp2 wrote:
               | I'm only speculating here, but it doesn't seem right to
               | me at all to end up with a solution like that. I don't
               | think there would be a reason where this ID would perform
               | so bad as an index that it would be worth what I would
               | consider odd reasoning and tech debt if I'm understanding
               | things correctly. And clearly the query is tech debt
               | because now it did cause such an issue. But also I think
               | current solution seems like tech debt to me, having high
               | odds of causing issues in the future, if the DB driver
               | was to be changed or something about the way DB resolves
               | indexes was to change, etc.
               | 
               | And with fallback they would end up reusing that index
               | anyway.
        
           | OJFord wrote:
           | Ha, that's a good point. Devil's advocate though, and because
           | perhaps maybe they changed it slightly for the example in the
           | blog post, it could be that 'CreateAt' is more like
           | 'PublishedAt', i.e. doesn't include a possible draft period,
           | but then id (which does correspond to actual record initial
           | creation time, obviously) is as good as anything to
           | disambiguate them - because it's actually arbitrary, only
           | needs to be consistent at that point.
        
             | mewpmewp2 wrote:
             | This is supposed to be a solution to a rare edge case
             | though, and it seems to me then that there would still be
             | some rare edge cases where the timestamp would mismatch
             | with the order of the ID and so still cause the very edge
             | case it was supposed to avoid?
             | 
             | Because it's ">" it might be missing that one record during
             | what I think is pagination.
        
           | enragedcacti wrote:
           | It seems like the code is used both for indexing from scratch
           | and for maintaining an existing index. In the case of
           | maintaining an existing index you need to know where to start
           | back up and Postgres doesn't guarantee that SERIAL primary
           | keys are always in chronological order when there are
           | multiple simultaneous sessions.
           | 
           | https://dba.stackexchange.com/questions/266405/does-
           | ordering...
        
             | mewpmewp2 wrote:
             | But if you are doing indexing, why do you necessarily care
             | that it's specifically chronological?
             | 
             | Or besides that, if there are odds of CreateAt collision,
             | and you are fallbacking to ID, you are still possibly not
             | getting it chronologically?
             | 
             | And also if CreateAt does happen to equal to another record
             | that is exactly the case where Postgres might most likely
             | not have the auto incr chronological.
             | 
             | So still it seems like the edge case it tries to prevent it
             | would still happen at least at similar magnitude of odds.
        
           | paulddraper wrote:
           | Thanks, I fixed that typo.
        
         | OJFord wrote:
         | Agree it's well-known, in a slightly snobbish sort of way, but
         | it could do with being more widely known for sure; it is a bit
         | of a gotcha, and IMO TFA is quite a nice tutorial on how they
         | went about it, investigating the issue and determining the
         | problem etc., perhaps especially if the answer as you wrote
         | isn't already somewhat familiar, the explain analyze -err-
         | _explainer_ is quite nice. Not reference material, but as an
         | intro, hey this is how you might be able to debug a similar
         | problem you encounter.
        
         | aidos wrote:
         | Thanks. Too late though, I already read it.
         | 
         | The article felt like it was fumbling around when the initial
         | explain pointed right at the issue. I didn't know this specific
         | trick, so I did learn something I guess.
        
       | MuffinFlavored wrote:
       | > WHERE (Posts.CreateAt, Posts.Id) > (?1, ?2)
       | 
       | Did not know you could do "tuple" filtering like this
        
       | hansonkd wrote:
       | OR Queries are performance killers because they often force a
       | full table scan. Another alternative to the Or is actually a
       | UNION.
       | 
       | A UNION allows you to use two separate indexes and can speed up
       | queries.
        
         | hun3 wrote:
         | [delayed]
        
         | Tostino wrote:
         | If at all possible, use a union all rather than a plain union
         | to avoid an extra sort / unique node.
         | 
         | I've used that OR > UNION ALL trick a number of times to vastly
         | improve performance on specific queries. It's crazy how much of
         | an effect it can have.
         | 
         | I wish Postgres would implement a planner optimization to
         | automatically run queries with an OR more efficiently (e.g. use
         | the same plan as with a UNION/ALL where possible).
        
       | brunooliv wrote:
       | Great post thanks for sharing this "from the trenches" it's
       | always a joy to read these deep dives! Thanks
        
       | mewpmewp2 wrote:
       | My question after a brief look - why not just use ID instead of
       | CreateAt?
        
       ___________________________________________________________________
       (page generated 2024-05-15 23:00 UTC)