[HN Gopher] Making a Postgres query 1k times faster
___________________________________________________________________
Making a Postgres query 1k times faster
Author : d0mine
Score : 68 points
Date : 2024-05-15 21:00 UTC (2 hours ago)
(HTM) web link (mattermost.com)
(TXT) w3m dump (mattermost.com)
| olliej wrote:
| Based on the last time I did anything involving databases, I
| think the easy way to do this is "find a query written by someone
| like olliej, and then fix the clear and obviously stupid choices"
| :D
| paulddraper wrote:
| I'm gonna save you 15 minutes.
|
| The author filtered with: CreateAt > $1 OR
| (CreateAt = $1 AND Id > $2)
|
| That can't use one index for both conditions. Instead, make it
| faster and simpler: (CreateAt, Id) > ($1, $2)
|
| End of post.
| damidekronik wrote:
| I think a nicer takeaway is what the author summarized with:
|
| 1. Always use BUFFERS when running an EXPLAIN. It gives some
| data that may be crucial for the investigation.
|
| 2. Always, always try to get an Index Cond (called Index range
| scan in MySQL) instead of a Filter.
|
| 3. Always, always, always assume PostgreSQL and MySQL will
| behave differently. Because they do.
| mewpmewp2 wrote:
| That is a bit wrong and I'm pointing it out because your
| summary logically didn't make sense to me. So I did have to
| read the blog post after reading your comment.
|
| The query should be CreateAt > $1 OR (CreateAt = $1 (NOT $2
| like in your sample) AND Id > $2)
|
| So the idea is here about paginating through posts that might
| be constantly changing so you can't use a simple offset, as
| that would give you duplications along the way. So you try to
| use CreateAt, but it could be possible that CreateAt is equal
| to another one so you fallback to ID.
|
| But here I stopped reading the blog post, because I now think
| why not use Id in the first place since it also seems to be
| auto increment since otherwise you couldn't really rely on it
| to be a fallback like this? I don't have time to investigate it
| further, but tldr; that still left me confused - why not use ID
| in the first place.
| selecsosi wrote:
| They could be using something like application generated time
| sortable UUID7s or some other sortable key
|
| https://uuid7.com/
|
| [edit] CreatedAt timestamp could be something from the client
| when the post is submitted or tagged from the ingest server
| and not when they actually are processed and hit the database
| mewpmewp2 wrote:
| For sure, but still why not use that in the first place?
|
| And I agree I shouldn't have said "auto increment".
| selecsosi wrote:
| It depends on the resolution of the timestamp. With 1000s
| of posts a second coming in, the fallback would be that
| it is at the same "timestamp" and then we fallback to a
| greater identifier. My guess is that the resolution of
| the timestamp allows for more efficient postgres index
| usage then the id index which my guess is larger and a
| string so not as efficiently searchable as the underlying
| identifier
|
| (my guess is time based index offer faster search
| performance or lower overhead than a string based search
| index which doesn't understand it is representing encoded
| time data)
| mewpmewp2 wrote:
| I'm only speculating here, but it doesn't seem right to
| me at all to end up with a solution like that. I don't
| think there would be a reason where this ID would perform
| so bad as an index that it would be worth what I would
| consider odd reasoning and tech debt if I'm understanding
| things correctly. And clearly the query is tech debt
| because now it did cause such an issue. But also I think
| current solution seems like tech debt to me, having high
| odds of causing issues in the future, if the DB driver
| was to be changed or something about the way DB resolves
| indexes was to change, etc.
|
| And with fallback they would end up reusing that index
| anyway.
| OJFord wrote:
| Ha, that's a good point. Devil's advocate though, and because
| perhaps maybe they changed it slightly for the example in the
| blog post, it could be that 'CreateAt' is more like
| 'PublishedAt', i.e. doesn't include a possible draft period,
| but then id (which does correspond to actual record initial
| creation time, obviously) is as good as anything to
| disambiguate them - because it's actually arbitrary, only
| needs to be consistent at that point.
| mewpmewp2 wrote:
| This is supposed to be a solution to a rare edge case
| though, and it seems to me then that there would still be
| some rare edge cases where the timestamp would mismatch
| with the order of the ID and so still cause the very edge
| case it was supposed to avoid?
|
| Because it's ">" it might be missing that one record during
| what I think is pagination.
| enragedcacti wrote:
| It seems like the code is used both for indexing from scratch
| and for maintaining an existing index. In the case of
| maintaining an existing index you need to know where to start
| back up and Postgres doesn't guarantee that SERIAL primary
| keys are always in chronological order when there are
| multiple simultaneous sessions.
|
| https://dba.stackexchange.com/questions/266405/does-
| ordering...
| mewpmewp2 wrote:
| But if you are doing indexing, why do you necessarily care
| that it's specifically chronological?
|
| Or besides that, if there are odds of CreateAt collision,
| and you are fallbacking to ID, you are still possibly not
| getting it chronologically?
|
| And also if CreateAt does happen to equal to another record
| that is exactly the case where Postgres might most likely
| not have the auto incr chronological.
|
| So still it seems like the edge case it tries to prevent it
| would still happen at least at similar magnitude of odds.
| paulddraper wrote:
| Thanks, I fixed that typo.
| OJFord wrote:
| Agree it's well-known, in a slightly snobbish sort of way, but
| it could do with being more widely known for sure; it is a bit
| of a gotcha, and IMO TFA is quite a nice tutorial on how they
| went about it, investigating the issue and determining the
| problem etc., perhaps especially if the answer as you wrote
| isn't already somewhat familiar, the explain analyze -err-
| _explainer_ is quite nice. Not reference material, but as an
| intro, hey this is how you might be able to debug a similar
| problem you encounter.
| aidos wrote:
| Thanks. Too late though, I already read it.
|
| The article felt like it was fumbling around when the initial
| explain pointed right at the issue. I didn't know this specific
| trick, so I did learn something I guess.
| MuffinFlavored wrote:
| > WHERE (Posts.CreateAt, Posts.Id) > (?1, ?2)
|
| Did not know you could do "tuple" filtering like this
| hansonkd wrote:
| OR Queries are performance killers because they often force a
| full table scan. Another alternative to the Or is actually a
| UNION.
|
| A UNION allows you to use two separate indexes and can speed up
| queries.
| hun3 wrote:
| [delayed]
| Tostino wrote:
| If at all possible, use a union all rather than a plain union
| to avoid an extra sort / unique node.
|
| I've used that OR > UNION ALL trick a number of times to vastly
| improve performance on specific queries. It's crazy how much of
| an effect it can have.
|
| I wish Postgres would implement a planner optimization to
| automatically run queries with an OR more efficiently (e.g. use
| the same plan as with a UNION/ALL where possible).
| brunooliv wrote:
| Great post thanks for sharing this "from the trenches" it's
| always a joy to read these deep dives! Thanks
| mewpmewp2 wrote:
| My question after a brief look - why not just use ID instead of
| CreateAt?
___________________________________________________________________
(page generated 2024-05-15 23:00 UTC)