[HN Gopher] Faster offset pagination for Rails apps
       ___________________________________________________________________
        
       Faster offset pagination for Rails apps
        
       Author : samlambert
       Score  : 83 points
       Date   : 2022-08-16 16:10 UTC (6 hours ago)
        
 (HTM) web link (planetscale.com)
 (TXT) w3m dump (planetscale.com)
        
       | latchkey wrote:
       | tl;dr: deferred join is faster in MySQL
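The deferred-join rewrite can be sketched in a self-contained way. The example below uses Python's sqlite3 with a made-up `posts` table; SQLite won't show MySQL's covering-index win, but the query shape and the result equivalence are the same:

```python
import sqlite3

# Hypothetical posts table standing in for a large production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts (title, body) VALUES (?, ?)",
    [(f"post {i}", "x" * 100) for i in range(1, 1001)],
)

# Plain offset pagination: the engine walks (and discards) every
# skipped row before returning the page.
plain = conn.execute(
    "SELECT id, title FROM posts ORDER BY id LIMIT 10 OFFSET 500"
).fetchall()

# Deferred join: find the page of ids first (index-only work in
# MySQL/InnoDB), then join back for the full rows of just that page.
deferred = conn.execute(
    """
    SELECT p.id, p.title
    FROM posts AS p
    INNER JOIN (
        SELECT id FROM posts ORDER BY id LIMIT 10 OFFSET 500
    ) AS page ON p.id = page.id
    ORDER BY p.id
    """
).fetchall()
```

The inner subquery only needs the id column, so it can be satisfied from the primary-key index alone; the join then fetches full rows for just the ten ids kept.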
        
       | aarondf wrote:
       | For those interested in the performance of this deferred join
       | technique, here are some real results from the Laravel package:
       | 
       | - https://twitter.com/mdavis1982/status/1482429071288066054
       | 
       | - https://twitter.com/joecampo/status/1483550610028957701
       | 
       | - https://twitter.com/max_eckel/status/1483764319372333057?s=2...
       | 
       | - https://twitter.com/max_eckel/status/1483852300414337032?s=2...
       | 
       | - https://twitter.com/1ralphmorris/status/1484242437618941957?...
       | 
       | - https://twitter.com/julioelpoeta/status/1549524738980077568?...
        
       | danpalmer wrote:
       | Anyone else notice the example query times are very high? Might
       | just be that the example code was written on a slow machine, but
       | I'd normally expect queries like this to take ~10ms in
       | production, not ~500-1500ms.
        
       | boesboes wrote:
       | Feels like something the query planner should do to me... I
       | wonder if this works on PG too, recently had some performance
       | issues exactly due to large offsets :)
        
         | Sesse__ wrote:
         | In general, you cannot push a LIMIT and OFFSET down through a
         | join, since the result is not guaranteed to be the same. (I
         | have no idea what this gem does about that, but maybe it has
         | some sort of 1:1 guarantee through foreign key information, or
         | maybe it just doesn't care about getting too few rows if there
          | are no matching rows in a join.)
         | 
         | The linked blog post seems to indicate that you can satisfy an
         | OFFSET purely using an index, which generally isn't the case.
         | If you have ORDER BY id LIMIT 10 OFFSET 100000, the planner has
         | to fetch 100010 rows in most databases that I know of.
        
           | thomasmg wrote:
           | Yes, using OFFSET for pagination is often problematic. The
           | best alternative IMHO is "keyset pagination", but it requires
           | a bit of work. See https://use-the-index-luke.com/no-offset
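A minimal sketch of keyset pagination, in Python with sqlite3 (hypothetical `tweets` table; on MySQL or Postgres the `WHERE id > ?` seek uses the index directly instead of counting and discarding skipped rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO tweets (body) VALUES (?)",
    [(f"tweet {i}",) for i in range(1, 101)],
)

def next_page(after_id, size=10):
    # Keyset pagination: seek past the last-seen key instead of
    # skipping OFFSET rows; cost is independent of page depth.
    return conn.execute(
        "SELECT id, body FROM tweets WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size),
    ).fetchall()

page1 = next_page(0)
page2 = next_page(page1[-1][0])  # resume from the last id of page 1
```

The trade-off, as the comment notes, is a bit of work: the client has to carry the last-seen key between requests, and you lose "jump to page N".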
        
           | jsmith99 wrote:
           | There's a bit more explanation here
           | https://dba.stackexchange.com/a/59490/167040 about why the
           | trick works. MySQL can use an index for OFFSET but normally
           | doesn't due to using early row lookups.
        
             | Sesse__ wrote:
             | To be clear: MySQL cannot use an index to directly seek
             | past the first OFFSET rows; you still have O(n) behavior
             | for OFFSET n. But you can use a covering index scan for the
             | rows that you skip, which especially for InnoDB is somewhat
             | cheaper than getting the full row. I guess this is what the
             | technique is about.
             | 
             | Edit: Indeed, this technique is seemingly for converting a
             | regular query to a self-join, ie. a LIMIT m OFFSET n => (a
             | LIMIT m OFFSET n) JOIN a, which is allowed as long as you
             | have a unique constraint on the column you're joining on. I
             | had assumed it actually wanted to convert something that
             | was already a join.
        
         | oconnore wrote:
         | It seems like the query planner could perform this
         | optimization, but also it seems like the index could maintain
         | cardinality data to avoid requiring a scan/offset of ids.
         | 
         | If the index had cardinality metadata, it could jump directly
         | to the leaf node containing the first row in the result set.
        
           | Sesse__ wrote:
           | > If the index had cardinality metadata, it could jump
           | directly to the leaf node containing the first row in the
           | result set.
           | 
           | It's difficult to maintain cardinality metadata in the face
           | of MVCC, even for the table as a whole. Page-level or similar
           | on an index would be even harder.
        
       | ricardobeat wrote:
       | I'm surprised this kind of optimization is not done by the query
       | planner, seems like low hanging fruit. Maybe there are
       | pathological cases that are more complex to solve?
        
         | samlambert wrote:
         | > Maybe there are pathological cases that are more complex to
         | solve?
         | 
          | This. Building query planners is incredibly hard, and doing
          | too much magic can really trip up users.
        
         | Sesse__ wrote:
         | It's not low-hanging fruit at all:
         | 
         | * The gain depends on low-level details of your storage, can
         | easily be negative, and is not necessarily easy to calculate
         | precisely
         | 
         | * It requires adding an extra join, which can be expensive
         | (depending on a bazillion factors) and generally slows down
         | your planner
         | 
         | * The classic System R join optimization framework generally
         | does not support adding or removing joins as part of cost-based
         | optimization (although a pure Cascades-based optimizer probably
         | would)
         | 
         | * It's for something you shouldn't do in the first place (large
         | OFFSET)
         | 
         | Of course, the self-join is a hack. You could envision a
         | cleaner approach where you separate out the index-only scan
         | from a "fetch the real row" node, which can then be pulled up
         | in certain cases (it does require you to track column sets, but
         | isn't impossible), even past joins. However, see the last
         | point. Also, MySQL's storage interface doesn't support it (you
         | can push a WHERE clause down to be done before the full row
         | fetch, and you can ask for the full row fetch to never be done
         | at all, but you cannot ask for an explicit row fetch from an
         | index lookup; that functionality is not exposed).
         | 
         | It's fruit, but it definitely isn't low-hanging.
        
       | fareesh wrote:
       | Is this limitation exclusive to MySQL or does it affect Postgres
       | as well?
        
         | e12e wrote:
         | Some discussion on (some) options for pagination with postgres:
         | https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...
         | 
         | Somewhat related I also came across this:
         | https://www.datadoghq.com/blog/100x-faster-postgres-performa...
         | 
          | I wonder a bit about the actual query times in the
          | submitted article - seconds for a few thousand rows of
          | limit..offset seems terribly slow? In this context -
          | displaying some data to a human - I'd generally call any
          | query over ~50ms slow.
        
       | aledalgrande wrote:
       | If you order by created_at you will probably have an index on
        | that col, so why not just use a cursor on created_at? Is this
       | more performant? Or is it just for DX?
        
       | ezekg wrote:
       | This is pretty slick. I may check it out in the future. Offset
       | pagination is a pain, especially with Postgres. I wrote a proof
       | of concept for cursor pagination [0] for Rails apps that use UUID
       | primary keys, but have yet to implement it or something like it.
       | 
       | [0]:
       | https://gist.github.com/ezekg/5ff5e965410885501198235d6a779c...
        
       | syntheticcdo wrote:
       | I wonder what the performance improvement would be to simply add
       | an index to the created_at column, or alternatively sort on the
       | id column (under the assumption that the id column is indexed and
       | essentially correlated 1:1 to created_at).
        
       | aarondf wrote:
       | The PlanetScale team puts out incredible engineering content.
       | Congrats on the release!
       | 
       | If you're using Laravel, there's a package that you can use to
        | achieve the same effect:
        | https://github.com/hammerstonedev/fast-paginate
        
         | ipaddr wrote:
         | Is it only for simple pagination?
        
           | aarondf wrote:
            | Not sure what you mean by simple pagination. If you mean
            | simple as in "only shows prev/next links" then no, it also
            | knows the count of pages.
            | 
            | It supports both length-aware and simple pagination.
        
       | ajsharp wrote:
       | This is a smart solution to the problem, if you must use OFFSET.
       | That said, if you can avoid OFFSET, do, as it's typically
       | considerably slower than a simple indexed range query.
       | 
       | If you're OFFSET-ing, say, 1000 records, I believe the database
       | needs to load those 1000 records in some way, to exclude them
        | from the result set (it's possible MySQL does this differently
        | than Postgres). A ranged query with a cursor (e.g. select *
        | from tweets where created_at > CURSOR LIMIT 20) is, generally,
        | more efficient. But cursored range queries don't make sense in a
        | lot of cases and often come with some additional complexity in
        | the code.
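One caveat with a cursor on created_at alone is that duplicate timestamps can skip or repeat rows at page boundaries; a common fix is to break ties with the unique id and compare the pair. A sketch in Python with sqlite3 (hypothetical table; SQLite's row-value comparison needs 3.15+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, created_at TEXT)")
# Duplicate timestamps on purpose: a cursor on created_at alone
# would lose or repeat rows at page boundaries.
conn.executemany(
    "INSERT INTO tweets (created_at) VALUES (?)",
    [(f"2022-08-{(i % 5) + 10:02d}",) for i in range(1, 101)],
)

def page_after(cursor, size=20):
    # cursor is a (created_at, id) pair; the unique id breaks ties
    # so the ordering (and therefore the cursor) is total.
    return conn.execute(
        """
        SELECT id, created_at FROM tweets
        WHERE (created_at, id) > (?, ?)
        ORDER BY created_at, id LIMIT ?
        """,
        (*cursor, size),
    ).fetchall()

first = page_after(("", 0))                       # first page
second = page_after((first[-1][1], first[-1][0])) # resume after it
```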
        
       | samwillis wrote:
        | It looks like "fast_page" is performing two completely separate
        | queries, the first just returning the id, the second returning
        | the full row for each id. Then _doing the join with Ruby_ ,
        | having to do two full round trips to the db.
       | 
       | The referenced "Deferred join technique" however does a _single_
       | query using an inner join to achieve the same result.
       | 
       | So, if I'm reading this right, "fast_page" could be made even
       | faster by doing an inner join instead, removing the additional
       | query. I'm not sure why they wouldn't have done that? It would
       | also ensure both queries were run in the same transaction so you
        | don't experience race conditions. I'm not a Rails dev so maybe
       | that's a given within a request/response process.
       | 
       | As someone else said, I'm surprised the query planner doesn't
       | account for this already.
       | 
       | These "deferred joins" where you join in your application code
        | are useful for n+1 type issues. The Django ORM has a
        | .prefetch_related method [0] enabling you to optimise down to a
       | single additional query where you would have had a secondary
       | query for each row.
       | 
       | 0:
       | https://docs.djangoproject.com/en/4.1/ref/models/querysets/#...
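The two-round-trip variant described above can be sketched in Python with sqlite3 (table name is made up; in a Rails app the second query is the familiar `WHERE id IN (...)` shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    [(f"post {i}",) for i in range(1, 201)],
)

# Round trip 1: fetch only the page of ids, work the index can cover.
ids = [
    r[0]
    for r in conn.execute("SELECT id FROM posts ORDER BY id LIMIT 10 OFFSET 100")
]

# Round trip 2: fetch full rows for exactly those ids, "joining"
# the two result sets in application code.
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT id, title FROM posts WHERE id IN ({placeholders}) ORDER BY id",
    ids,
).fetchall()
```

As the thread goes on to explain, the second round trip exists because MySQL rejects LIMIT inside IN subqueries, so the ids have to be materialized client-side.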
        
         | aarondf wrote:
          | MySQL doesn't support LIMIT in IN/ALL/ANY/SOME subqueries,
          | hence the roundtrip.
         | 
         | > ERROR 1235 (42000): This version of MySQL doesn't yet support
         | 'LIMIT & IN/ALL/ANY/SOME subquery'
         | 
         | See https://dev.mysql.com/doc/refman/5.7/en/subquery-
         | restriction...
        
           | samwillis wrote:
            | Ah ok, confusingly the OP (and your blog post they link to)
            | both show example SQL with a subquery using LIMIT and imply
            | it's from a book on MySQL optimisations.
        
             | aarondf wrote:
             | Yeah that's a bit confusing. The deferred join in the book
             | does indeed use an inner join but my package (and
             | PlanetScale's) both use a "deconstructed subquery."
             | 
             | I made that choice for mine because I didn't want to
             | pollute the `select` part of the developer's query to
             | exclude the joined data. If we were writing the whole query
             | from scratch we could, but not in the scenario where the
             | package is actually used, unfortunately.
        
           | gavinray wrote:
            | You can use LIMIT in correlated subqueries in MySQL 8+
            | using LATERAL
           | 
           | https://dev.mysql.com/doc/refman/8.0/en/lateral-derived-
           | tabl...
           | 
            | Prior to this you can emulate it with a ROW_NUMBER()
            | window function in a subquery, which is what LATERAL gets
            | turned into in many DB engines, I believe
           | 
           | https://stackoverflow.com/questions/73127487/aws-
           | athena-v2-p...
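The ROW_NUMBER() emulation mentioned above, sketched in Python with sqlite3 (hypothetical `docs` table; window functions need SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25+
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [(f"doc {i}",) for i in range(1, 101)],
)

# Emulate LIMIT 10 OFFSET 50 inside a derived table with
# ROW_NUMBER(), for engines that reject LIMIT in subqueries.
rows = conn.execute(
    """
    SELECT id, title FROM (
        SELECT id, title, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM docs
    ) WHERE rn BETWEEN 51 AND 60
    ORDER BY id
    """
).fetchall()
```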
        
           | simonw wrote:
            | Might be possible to use a CTE instead, something like this:
            | 
            |     with page_numbers as (
            |       select id from docs order by id limit 25 offset 50
            |     )
            |     select * from docs
            |     where id in (
            |       select id from page_numbers
            |     )
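The CTE shape above can be checked end to end; a sketch in Python with sqlite3 (which supports CTEs; as noted below, MySQL only gained them in 8.0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [(f"doc {i}",) for i in range(1, 101)],
)

# CTE variant: materialize the page of ids, then fetch full rows.
cte = conn.execute(
    """
    WITH page_numbers AS (
        SELECT id FROM docs ORDER BY id LIMIT 25 OFFSET 50
    )
    SELECT * FROM docs
    WHERE id IN (SELECT id FROM page_numbers)
    ORDER BY id
    """
).fetchall()

# Direct LIMIT/OFFSET query for comparison.
direct = conn.execute(
    "SELECT * FROM docs ORDER BY id LIMIT 25 OFFSET 50"
).fetchall()
```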
        
             | aledalgrande wrote:
             | I don't know if it's the same for MySQL, but in Postgres a
             | CTE would possibly force the planner to make less efficient
             | choices [1]. Would be interesting if someone who knows more
             | than me could clarify on this
             | 
                | [1] https://hakibenita.com/be-careful-with-cte-in-postgresql
        
               | Lukas_Skywalker wrote:
               | This changed in version 12. See the disclaimer:
               | 
               | This article is intended for PostgreSQL versions 11 and
               | prior. Starting at version 12, PostgreSQL changed the way
               | it treats CTE to prevent the issues described in this
               | article.
        
             | aarondf wrote:
             | Mmm yeah, that's clever! I like it.
             | 
             | I think for my package (the Laravel one) I'll have to hold
             | off on that because CTEs only work with MySQL 8.0 and up.
             | Also the Laravel query builder doesn't have methods to add
             | CTEs, unfortunately.
             | 
             | I'm gonna noodle on this though, there's something nice
             | about it.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-08-16 23:01 UTC)