[HN Gopher] Faster offset pagination for Rails apps
___________________________________________________________________
Faster offset pagination for Rails apps
Author : samlambert
Score : 83 points
Date : 2022-08-16 16:10 UTC (6 hours ago)
(HTM) web link (planetscale.com)
(TXT) w3m dump (planetscale.com)
| latchkey wrote:
| tl;dr: deferred join is faster in MySQL
| aarondf wrote:
| For those interested in the performance of this deferred join
| technique, here are some real results from the Laravel package:
|
| - https://twitter.com/mdavis1982/status/1482429071288066054
|
| - https://twitter.com/joecampo/status/1483550610028957701
|
| - https://twitter.com/max_eckel/status/1483764319372333057?s=2...
|
| - https://twitter.com/max_eckel/status/1483852300414337032?s=2...
|
| - https://twitter.com/1ralphmorris/status/1484242437618941957?...
|
| - https://twitter.com/julioelpoeta/status/1549524738980077568?...
| danpalmer wrote:
| Anyone else notice the example query times are very high? Might
| just be that the example code was written on a slow machine, but
| I'd normally expect queries like this to take ~10ms in
| production, not ~500-1500ms.
| boesboes wrote:
| Feels like something the query planner should do to me... I
| wonder if this works on PG too, recently had some performance
| issues exactly due to large offsets :)
| Sesse__ wrote:
| In general, you cannot push a LIMIT and OFFSET down through a
| join, since the result is not guaranteed to be the same. (I
| have no idea what this gem does about that, but maybe it has
| some sort of 1:1 guarantee through foreign key information, or
 | maybe it just doesn't care about getting too few rows if there
 | are no matching rows in a join.)
|
| The linked blog post seems to indicate that you can satisfy an
| OFFSET purely using an index, which generally isn't the case.
| If you have ORDER BY id LIMIT 10 OFFSET 100000, the planner has
| to fetch 100010 rows in most databases that I know of.
| thomasmg wrote:
| Yes, using OFFSET for pagination is often problematic. The
| best alternative IMHO is "keyset pagination", but it requires
| a bit of work. See https://use-the-index-luke.com/no-offset
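The keyset ("seek") approach mentioned above can be sketched as follows. This is a minimal illustration using SQLite via Python's sqlite3 as a stand-in for a production database; the `events` table and its columns are made up for the example:

```python
import sqlite3

# In-memory database with a hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO events (name) VALUES (?)",
                 [(f"event {i}",) for i in range(1, 101)])

def fetch_page(conn, after_id=0, page_size=10):
    """Keyset pagination: seek past the last-seen id instead of
    using OFFSET, so each page is an indexed range scan."""
    return conn.execute(
        "SELECT id, name FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

page1 = fetch_page(conn)
page2 = fetch_page(conn, after_id=page1[-1][0])
print(page2[0])  # (11, 'event 11')
```

The trade-off ("a bit of work") is that the client has to carry the last-seen key between requests, and you can't jump to an arbitrary page number.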
| jsmith99 wrote:
| There's a bit more explanation here
| https://dba.stackexchange.com/a/59490/167040 about why the
| trick works. MySQL can use an index for OFFSET but normally
| doesn't due to using early row lookups.
| Sesse__ wrote:
| To be clear: MySQL cannot use an index to directly seek
| past the first OFFSET rows; you still have O(n) behavior
| for OFFSET n. But you can use a covering index scan for the
| rows that you skip, which especially for InnoDB is somewhat
| cheaper than getting the full row. I guess this is what the
| technique is about.
|
| Edit: Indeed, this technique is seemingly for converting a
| regular query to a self-join, ie. a LIMIT m OFFSET n => (a
| LIMIT m OFFSET n) JOIN a, which is allowed as long as you
| have a unique constraint on the column you're joining on. I
| had assumed it actually wanted to convert something that
| was already a join.
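The rewrite described above (LIMIT m OFFSET n turned into a self-join on a unique column) can be sketched like this, with SQLite via Python's sqlite3 standing in for MySQL and an illustrative `tweets` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO tweets (body) VALUES (?)",
                 [(f"tweet {i}",) for i in range(1, 1001)])

# Naive offset pagination: the engine walks (and on MySQL/InnoDB,
# materializes full rows for) all 500 skipped rows before discarding them.
naive = conn.execute(
    "SELECT id, body FROM tweets ORDER BY id LIMIT 10 OFFSET 500"
).fetchall()

# Deferred join: the inner query touches only the primary-key index
# (a covering scan); full rows are fetched just for the 10 surviving
# ids via the self-join. Legal because id is unique.
deferred = conn.execute("""
    SELECT t.id, t.body
    FROM tweets t
    JOIN (SELECT id FROM tweets ORDER BY id LIMIT 10 OFFSET 500) page
      ON t.id = page.id
    ORDER BY t.id
""").fetchall()

assert naive == deferred
print(deferred[0])  # (501, 'tweet 501')
```

SQLite won't show the same cost difference as InnoDB, but the query shape and the equivalence of results are the point of the sketch.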
| oconnore wrote:
| It seems like the query planner could perform this
| optimization, but also it seems like the index could maintain
| cardinality data to avoid requiring a scan/offset of ids.
|
| If the index had cardinality metadata, it could jump directly
| to the leaf node containing the first row in the result set.
| Sesse__ wrote:
| > If the index had cardinality metadata, it could jump
| directly to the leaf node containing the first row in the
| result set.
|
| It's difficult to maintain cardinality metadata in the face
| of MVCC, even for the table as a whole. Page-level or similar
| on an index would be even harder.
| ricardobeat wrote:
| I'm surprised this kind of optimization is not done by the query
| planner, seems like low hanging fruit. Maybe there are
| pathological cases that are more complex to solve?
| samlambert wrote:
| > Maybe there are pathological cases that are more complex to
| solve?
|
 | This. Building query planners is incredibly hard, and doing
 | too much magic can really trip up users.
| Sesse__ wrote:
| It's not low-hanging fruit at all:
|
| * The gain depends on low-level details of your storage, can
| easily be negative, and is not necessarily easy to calculate
| precisely
|
| * It requires adding an extra join, which can be expensive
| (depending on a bazillion factors) and generally slows down
| your planner
|
| * The classic System R join optimization framework generally
| does not support adding or removing joins as part of cost-based
| optimization (although a pure Cascades-based optimizer probably
| would)
|
| * It's for something you shouldn't do in the first place (large
| OFFSET)
|
| Of course, the self-join is a hack. You could envision a
| cleaner approach where you separate out the index-only scan
| from a "fetch the real row" node, which can then be pulled up
| in certain cases (it does require you to track column sets, but
| isn't impossible), even past joins. However, see the last
| point. Also, MySQL's storage interface doesn't support it (you
| can push a WHERE clause down to be done before the full row
| fetch, and you can ask for the full row fetch to never be done
| at all, but you cannot ask for an explicit row fetch from an
| index lookup; that functionality is not exposed).
|
| It's fruit, but it definitely isn't low-hanging.
| fareesh wrote:
| Is this limitation exclusive to MySQL or does it affect Postgres
| as well?
| e12e wrote:
| Some discussion on (some) options for pagination with postgres:
| https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...
|
| Somewhat related I also came across this:
| https://www.datadoghq.com/blog/100x-faster-postgres-performa...
|
 | I wonder a bit about the actual query times in the
 | submitted article - seconds for a few thousand rows of
 | limit..offset seems terribly slow? (In this context -
 | displaying some data to a human - I'd generally call any
 | query over ~50ms slow.)
| aledalgrande wrote:
 | If you order by created_at you will probably have an index on
 | that col, so why not just use a cursor on created_at? Is this
 | more performant? Or is it just for DX?
| ezekg wrote:
| This is pretty slick. I may check it out in the future. Offset
| pagination is a pain, especially with Postgres. I wrote a proof
| of concept for cursor pagination [0] for Rails apps that use UUID
| primary keys, but have yet to implement it or something like it.
|
| [0]:
| https://gist.github.com/ezekg/5ff5e965410885501198235d6a779c...
| syntheticcdo wrote:
| I wonder what the performance improvement would be to simply add
| an index to the created_at column, or alternatively sort on the
| id column (under the assumption that the id column is indexed and
| essentially correlated 1:1 to created_at).
| aarondf wrote:
| The PlanetScale team puts out incredible engineering content.
| Congrats on the release!
|
 | If you're using Laravel, there's a package that you can use to
 | achieve the same effect:
 | https://github.com/hammerstonedev/fast-paginate
| ipaddr wrote:
| Is it only for simple pagination?
| aarondf wrote:
 | Not sure what you mean by simple pagination. If you mean
 | simple as in "only shows prev/next links" then no, it also
 | knows the count of pages.
 |
 | It supports both length-aware and simple pagination.
| ajsharp wrote:
| This is a smart solution to the problem, if you must use OFFSET.
| That said, if you can avoid OFFSET, do, as it's typically
| considerably slower than a simple indexed range query.
|
 | If you're OFFSET-ing, say, 1000 records, I believe the database
 | needs to load those 1000 records in some way, to exclude them
 | from the result set (it's possible mysql does this differently
 | than postgres). A ranged query with a cursor (e.g. select *
 | from tweets where created_at > CURSOR LIMIT 20) is, generally,
 | more efficient. But cursored range queries don't make sense in a
 | lot of cases and often come with some additional complexity in
 | the code.
| samwillis wrote:
 | It looks like "fast_page" is performing two completely separate
 | queries, the first returning just the ids, the second returning
 | the full row for each id. Then _doing the join in Ruby_ , having
 | to do two full round trips to the db.
|
| The referenced "Deferred join technique" however does a _single_
| query using an inner join to achieve the same result.
|
 | So, if I'm reading this right, "fast_page" could be made even
 | faster by doing an inner join instead, removing the additional
 | query. I'm not sure why they wouldn't have done that? It would
 | also ensure both queries were run in the same transaction so you
 | don't experience race conditions. I'm not a Rails dev so maybe
 | that's a given within a request/response process.
|
| As someone else said, I'm surprised the query planner doesn't
| account for this already.
|
| These "deferred joins" where you join in your application code
| are useful for n+1 type issues. The Django orm has a
| .prefetch_related method [0] enabling you to optimise down to a
| single additional query where you would have had a secondary
| query for each row.
|
| 0:
| https://docs.djangoproject.com/en/4.1/ref/models/querysets/#...
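The two-round-trip pattern described above (id-only query, then a row fetch by id) can be sketched like this. SQLite via Python's sqlite3 stands in for MySQL; the `tweets` table is illustrative, not fast_page's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO tweets (body) VALUES (?)",
                 [(f"tweet {i}",) for i in range(1, 1001)])

# Round trip 1: fetch only the ids for the requested page; this can be
# served from the primary-key index alone.
ids = [r[0] for r in conn.execute(
    "SELECT id FROM tweets ORDER BY id LIMIT 10 OFFSET 500")]

# Round trip 2: fetch full rows for just those ids.
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT id, body FROM tweets WHERE id IN ({placeholders}) ORDER BY id",
    ids,
).fetchall()

print(rows[0])  # (501, 'tweet 501')
```

Note that between the two queries another transaction can insert or delete rows, which is the race-condition concern raised above.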
| aarondf wrote:
 | MySQL doesn't support LIMIT in IN/ALL/ANY/SOME subqueries,
 | hence the roundtrip.
|
| > ERROR 1235 (42000): This version of MySQL doesn't yet support
| 'LIMIT & IN/ALL/ANY/SOME subquery'
|
| See https://dev.mysql.com/doc/refman/5.7/en/subquery-
| restriction...
| samwillis wrote:
 | Ah ok, confusingly the OP (and the blog post you link to)
 | both show example SQL with a subquery using limit and imply
 | it's from a book on MySQL optimisations.
| aarondf wrote:
| Yeah that's a bit confusing. The deferred join in the book
| does indeed use an inner join but my package (and
| PlanetScale's) both use a "deconstructed subquery."
|
| I made that choice for mine because I didn't want to
| pollute the `select` part of the developer's query to
| exclude the joined data. If we were writing the whole query
| from scratch we could, but not in the scenario where the
| package is actually used, unfortunately.
| gavinray wrote:
| You can use LIMIT in correlated subqueries in MySQL +8 using
| LATERAL
|
| https://dev.mysql.com/doc/refman/8.0/en/lateral-derived-
| tabl...
|
 | Prior to this you can emulate it with a ROW_NUMBER() window
 | function in a subquery, and this is what LATERAL gets turned
 | into in many DB engines, I believe
|
| https://stackoverflow.com/questions/73127487/aws-
| athena-v2-p...
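The ROW_NUMBER() emulation mentioned above can be sketched like this (SQLite via Python's sqlite3, which also supports window functions; the `tweets` table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO tweets (body) VALUES (?)",
                 [(f"tweet {i}",) for i in range(1, 1001)])

# Number every row by id inside a derived table, then keep rows
# 501..510 - equivalent to LIMIT 10 OFFSET 500, but expressible where
# a plain LIMIT in a subquery would be rejected.
rows = conn.execute("""
    SELECT id, body FROM (
        SELECT id, body, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM tweets
    ) AS ranked
    WHERE rn BETWEEN 501 AND 510
""").fetchall()

print(rows[0])  # (501, 'tweet 501')
```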
| simonw wrote:
 | Might be possible to use a CTE instead, something like this:
 |
 |     with page_numbers as (
 |       select id from docs order by id limit 25 offset 50
 |     )
 |     select * from docs
 |     where id in ( select id from page_numbers )
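The CTE above runs as written on engines that support WITH (MySQL 8+, Postgres, SQLite). A runnable sketch against SQLite, with a made-up `docs` table and an outer ORDER BY added so the page order is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO docs (title) VALUES (?)",
                 [(f"doc {i}",) for i in range(1, 201)])

# CTE computes the page of ids; the outer query fetches full rows.
page = conn.execute("""
    WITH page_numbers AS (
        SELECT id FROM docs ORDER BY id LIMIT 25 OFFSET 50
    )
    SELECT * FROM docs
    WHERE id IN (SELECT id FROM page_numbers)
    ORDER BY id
""").fetchall()

print(page[0])  # (51, 'doc 51')
```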
| aledalgrande wrote:
| I don't know if it's the same for MySQL, but in Postgres a
| CTE would possibly force the planner to make less efficient
| choices [1]. Would be interesting if someone who knows more
| than me could clarify on this
|
 | [1] https://hakibenita.com/be-careful-with-cte-in-postgre-sql
| Lukas_Skywalker wrote:
| This changed in version 12. See the disclaimer:
|
| This article is intended for PostgreSQL versions 11 and
| prior. Starting at version 12, PostgreSQL changed the way
| it treats CTE to prevent the issues described in this
| article.
| aarondf wrote:
| Mmm yeah, that's clever! I like it.
|
| I think for my package (the Laravel one) I'll have to hold
| off on that because CTEs only work with MySQL 8.0 and up.
| Also the Laravel query builder doesn't have methods to add
| CTEs, unfortunately.
|
| I'm gonna noodle on this though, there's something nice
| about it.
| [deleted]
___________________________________________________________________
(page generated 2022-08-16 23:01 UTC)