[HN Gopher] Paginating Requests in APIs (2020)
___________________________________________________________________
Paginating Requests in APIs (2020)
Author : exemel_hn
Score : 181 points
Date : 2022-05-28 15:31 UTC (7 hours ago)
(HTM) web link (ignaciochiazzo.medium.com)
(TXT) w3m dump (ignaciochiazzo.medium.com)
| a-dub wrote:
| cursors result in expensive server-side state. if ordering is
| less important, an alternative is to make cardinality queries
| cheaper and then allow the client to request sections, making
| the scheme stateless from the server's point of view.
|
| so one query returns an approximate cardinality of the result set
| or a partition count based on approximate size of the result set
| and a fixed request response size.
|
| the second through nth queries are GETs, but each GET includes
| the partition count and the partition requested. the server
| uses a hashing scheme either to reduce the input data used to
| construct the response or to filter the response itself.
|
| use of numerical partition ids and hashing allows for the
| response to maintain consistency over time as well as pushing of
| the sharding/selection scheme as deep as necessary into the data
| layer to avoid scalability issues.
|
| requests that are too big are cancelled and given abuse points.
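|
| a rough sketch of this scheme, assuming a SQLite-style
| connection and a hypothetical items table (names made up):
|
|     import sqlite3
|
|     PARTITION_SIZE = 100  # fixed response size agreed with clients
|
|     def partition_count(db):
|         # query 1: cardinality of the result set -> partition count
|         (n,) = db.execute("SELECT COUNT(*) FROM items").fetchone()
|         return max(1, -(-n // PARTITION_SIZE))  # ceiling division
|
|     def get_partition(db, partitions, requested):
|         # queries 2..n: stateless GETs carrying (partitions,
|         # requested); the hash (here simply id % partitions)
|         # pushes selection down into the data layer
|         return db.execute(
|             "SELECT * FROM items WHERE id % ? = ? ORDER BY id",
|             (partitions, requested),
|         ).fetchall()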
| therusskiy wrote:
| How do you determine on the backend whether a page is the last
| one? The trick I use to avoid extra SQL queries: if a user asks
| for 10 records, I load up to 11 and then return 10. If there
| were 11, I return a cursor for the next page; otherwise the
| next page is null. Is there a better way?
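|
| The trick as a sketch (SQLite-style API, hypothetical items
| table):
|
|     def fetch_page(db, limit=10, after_id=0):
|         # ask for one more row than requested to learn whether
|         # a next page exists without a separate COUNT query
|         rows = db.execute(
|             "SELECT id, title FROM items"
|             " WHERE id > ? ORDER BY id LIMIT ?",
|             (after_id, limit + 1),
|         ).fetchall()
|         has_more = len(rows) > limit
|         rows = rows[:limit]
|         next_cursor = rows[-1][0] if has_more else None
|         return rows, next_cursor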
| yashap wrote:
| Just let people query for an empty page every now and then, no
| big deal.
| whenlambo wrote:
| Moreover, you can query, like, 3 extra, and if the extra is
| fewer than 3, return those extra items (the short tail) too.
| This way, you won't have a last page with only 1-2 items.
| omegalulw wrote:
| How do you deal with cases where results must be ordered by
| some field and the data changes between page 1 request and
| page 2 request?
| NonNefarious wrote:
| You could select into a temporary table and return pages of
| results from that.
|
| You'd need some mechanism to delete the temporary table
| after a time, of course, but I imagine this is not
| uncommon.
|
| Another method would be having a last-modified timestamp on
| every row in the DB; then you could select records at or
| before the initial query time. Seems like overkill just for
| this one purpose, but I imagine it might be generally
| useful information to have in the DB.
| spockz wrote:
| If the ordering changed because the data changed, then just
| show the new situation on that page. If the results' order
| is important, there are (at least) several things you can do:
|
| 1. Save the order under a token and iterate relative to that
| saved order, only showing data for the ids that were
| originally included.
|
| 2. Instead of continuing from a page count, continue from the
| last value seen. Any mutation before that point will be
| invisible, without having to store the original result list
| (only ids).
| noneeeed wrote:
| That's what we do, I'm not sure there is another way other than
| making a request for the next page.
| drbojingle wrote:
| You could ask for a count of records and divide by the page
| size.
| neoglow wrote:
| Counts become incredibly slow on some databases like
| Postgres. It needs to do a sequential scan, which results in
| slow(er) performance even on tens of thousands of rows. When
| operating at scale, this is very, very slow.
| kadoban wrote:
| This seems unlikely to depend on which database software.
| It could easily depend on what indexes exist for the query.
| fabian2k wrote:
| From what I understand it's not slow if it can use an
| index, but the criteria for that are rather hard to
| understand as a normal user. If you vacuum often enough and
| the visibility map is current, it doesn't need a full table
| scan.
| spurgu wrote:
| This is what I do, since counts should cause minimal overhead
| (I've never had to deal with huge databases). But neither
| does the N+1 records trick which OP describes, that's clever.
| :)
| SonOfLilit wrote:
| Counts can cause significant overhead on even millions of
| rows.
| spurgu wrote:
| Thanks, that's good to know (I'll have to run some tests at
| some point). I do have a few tables (namely logs) that will
| eventually have millions of rows. But I've also been planning
| on rotating them whenever that becomes a problem, so I'm not
| overly concerned. Or I'll experiment with OP's solution.
| MereInterest wrote:
| If I were to ask the number of even numbers greater than
| zero and less than 10 billion, you could very quickly
| give an exact answer. If I were to ask the exact number
| of prime numbers on that same range, it would take much
| longer to find the correct answer.
|
| In the same way, asking for counts may or may not be a
| quick answer, depending on both the table structure and
| the query being performed. Query the number of customers
| who started this year, on a table indexed by start date,
| and it's incredibly cheap. If that same table is only
| indexed by last name, then counts become more expensive.
| I'd say that the advice to return the counts has an
| implicit addendum to make sure that your database is
| appropriately indexed such that your typical queries are
| cheap to perform.
| inferiorhuman wrote:
| So I'm coming at this from a Postgres POV; IIRC MySQL can
| take some shortcuts for COUNT(*). Constraining the query
| will help, but counting is expensive as a sequential table
| scan is always required.
|
| Meanwhile I can only think of one use case for pagination
| with exact row counts - a user facing application. If
| you're preallocating storage you can get by with an
| estimate. If you're consuming an API and want something
| from near the end you can flip the sort order.
| nolevels wrote:
| Not necessarily in postgres. If an index is available it
| may use that, though it will still have to refer to the
| visibility map to see if any row is available to the
| current transaction (cheap check), and if it can't
| determine from that, then and only then does it need to
| also refer to the heap / actual table (slower check).
| inferiorhuman wrote:
| Is that a recent change? Back in 2016 the Citus guys
| claimed that:
|
|     There is no single universal row count that the
|     database could cache, so it must scan through all
|     rows counting how many are visible. Performance
|     for an exact count grows linearly with table size.
|
| https://www.citusdata.com/blog/2016/10/12/count-performance/
| nolevels wrote:
| Apologies, I should've clarified. I just meant postgres
| doesn't necessarily have to scan the actual table to get
| the count in some cases if an index is available, not
| that it caches the actual count.
| inferiorhuman wrote:
| So that's not the greatest quote, but the article shows
| postgres having to fall back on a sequential scan for a
| naive COUNT(*). The author goes on to mention that as of
| 9.2 Postgres can do an index scan (in the context of
| SELECT DISTINCT):
|
|     But now we come to a quirk, SELECT COUNT(DISTINCT n)
|     FROM items will not use the index even though SELECT
|     DISTINCT n does. As many blog posts mention ("one
|     weird trick to make postgres 50x faster!") you can
|     guide the planner by rewriting count distinct as the
|     count of a subquery
|
| To me it seems like you can trick the query planner into
| doing an index scan sometimes, but that it's a bit
| brittle. Has this improved much since?
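|
| For reference, the rewrite the post describes (items/n are
| the post's example table and column), held here as query
| strings:
|
|     # naive form: won't use the index on n, per the post
|     count_distinct = "SELECT COUNT(DISTINCT n) FROM items"
|
|     # count of a subquery: can use the index
|     count_subquery = (
|         "SELECT COUNT(*) FROM (SELECT DISTINCT n FROM items) t"
|     )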
| inferiorhuman wrote:
| Sure, but at that point you've really got to start thinking
| about architecture.
|
| Do you really need to paginate millions of rows _and_
| count them? Can you distribute the database and
| parallelize counting? Can you save counts somewhere else?
| Can you use estimates?
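|
| On estimates: a sketch for Postgres, assuming a
| psycopg2-style connection:
|
|     def estimated_count(conn, table):
|         # reltuples is maintained by VACUUM/ANALYZE, so this
|         # is an estimate rather than an exact count, but it
|         # reads one catalog row instead of scanning the table
|         with conn.cursor() as cur:
|             cur.execute(
|                 "SELECT reltuples::bigint FROM pg_class"
|                 " WHERE relname = %s",
|                 (table,),
|             )
|             return cur.fetchone()[0]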
| [deleted]
| [deleted]
| SPBS wrote:
| So cursor pagination is just keyset pagination, except instead of
| using a column value directly you use an opaque string that gets
| translated back to a column value. So you can modify how the
| cursor gets translated back to a column value anytime without
| breaking API compatibility.
|
| That said, I'm not sure everyone agrees on that definition;
| from what I know, people consider keyset pagination and cursor
| pagination to mean basically the same thing.
| richbell wrote:
| > So cursor pagination is just keyset pagination, except
| instead of using a column value directly you use an opaque
| string that gets translated back to a column value.
|
| HashIds is a popular solution if those columns are numerical or
| can be represented numerically (e.g. timestamp).
|
| https://hashids.org/
| noneeeed wrote:
| I think that's what the article is saying. I found the
| descriptions didn't really explain the differences very well.
|
| We use a cursor that combines the sort key and normally the ID
| as a secondary sort and pointer value, otherwise you can't
| paginate through data where the sort column has duplicates. We
| don't really do much to obfuscate these, but we don't document
| the structure and people who try to tweak the values won't get
| far.
| giaour wrote:
| One advantage of opaque cursors is that they can include
| information that's not part of your public API. For instance,
| if you allow soft deletes and do not return deleted items in
| the list, you can end up with a massive stretch of deleted
| items separating two live results. You can leave a note in the
| cursor indicating what record was last scanned, and that's not
| possible in keyset pagination.
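|
| A sketch of what such an opaque cursor might carry (field
| name hypothetical):
|
|     import base64
|     import json
|
|     def make_cursor(last_scanned_id):
|         # record the last row *scanned* (including soft-deleted
|         # rows), not just the last row returned, so the next
|         # request resumes past a long stretch of deleted items
|         payload = {"last_scanned_id": last_scanned_id}
|         blob = json.dumps(payload).encode()
|         return base64.urlsafe_b64encode(blob).decode()
|
|     def parse_cursor(cursor):
|         return json.loads(base64.urlsafe_b64decode(cursor))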
| blenderdt wrote:
| A REST API can't use page-based pagination because it's not
| stateless. It is also not cacheable.
| ok123456 wrote:
| If the page, limit, and offset are parameters, it is.
| blenderdt wrote:
| If you query the first page with a limit of 10, then insert
| 10 items, do you get the same items with the same parameters?
| isbvhodnvemrwvn wrote:
| Using that logic you can only have restful APIs if your
| application is useless and data never changes (which is
| fine for me, I hate debates about restfulness)
| ok123456 wrote:
| If you really want to cache the results, use ETags based on
| a hash set on write, or just include that hash as part of
| your request as well.
| skrebbel wrote:
| It's exactly as RESTful and cacheable and stateless as common
| websites, eg the HN frontpage or a newspaper homepage. I think
| your bar for when something is RESTful or not is too high to be
| useful.
| blenderdt wrote:
| Edit: my bad, I was looking at the wrong 'more' button.
| joemi wrote:
| How is https://news.ycombinator.com/news?p=2 not
| pagination?
| croes wrote:
| Why? The server doesn't need to know if I already requested any
| previous page when I request page 9. And if the data doesn't
| change it can easily be cached.
| hinkley wrote:
| "Rest in Practice" (if I'm remembering the right book) would
| disagree strenuously with you.
|
| Once you're doing REST instead of just "somewhat sane JSON
| responses" then the response has self-descriptive elements,
| including links to related resources. And importantly, URLs for
| older resources become static very, very quickly. Not unlike
| the internals of subversion or git.
|
| The trick there is to realize that O(1) = O(2). If you're
| trying to show 25 items per page, you _do not need a query that
| returns the most recent 25 items_. That creates a situation
| where every single interaction is distinct from every other
| interaction.
|
| Asking for 25 results and everything newer than that is barely
| more complicated than pagination already is.
| branko_d wrote:
| I don't like the term "cursor-based" pagination. Database cursors
| are different from this, have been around for decades, and can
| actually be used to implement pagination (albeit requiring the
| transaction to be kept alive between pages, which can be...
| problematic).
|
| Perhaps, "token-based" pagination would be a better term?
| abraxas wrote:
| It is highly aggravating and confusing when hipsters co-opt
| terms that have had a defined meaning for decades. The ones
| that grate on me are "retina display", "real time",
| "x-nanometer node".
| [deleted]
| blurker wrote:
| I agree. I was unsure if the author meant db cursors which
| would be keeping transactions alive. That has obvious scaling
| issues. But reading the comments here has me convinced they are
| talking about application generated cursors. I think this is a
| pretty reasonable confusion and it would benefit from added
| clarity, like naming them differently.
| kortex wrote:
| > Perhaps, "token-based" pagination would be a better term?
|
| Let's do it. When I was first learning about it, I wondered how
| the concept related to database cursors, if at all.
|
| At its simplest it's just a pointer to the next item, but
| really it can be any opaque blob. Maybe you want it to be a
| JWT/encrypted blob with a request counter so you can track
| request usage by specific accounts (along with the pointer). jk,
| that's probably a terrible idea (too entangled). Idk, just came
| up with it. Point is, you _could_ do something like that; it's
| just a blob opaque to the client.
|
| So I like "token-based pagination".
| [deleted]
| drdec wrote:
| Thank you, now I don't need to write this comment
| dwaite wrote:
| I tried an experiment a while back to handle this using custom
| units for the "Range" HTTP Header.
|
| The biggest drawback is that the expectation is still that data
| would be valid concatenated, so (for example) multiple results
| would be expressed as multiple JSON values concatenated together,
| rather than entries in a single array.
| aaaaaaaaaaab wrote:
| theteapot wrote:
| If I take the keyset approach then obfuscate my sorting
| key+value in some token, does it become a "cursor"? Is the only
| point in doing this to stop abuse of large offsets?
|
| Also, the article states that a con of keyset is you can't use
| an offset. Technically you can, the same as paging (WHERE ID >
| X ORDER BY ID LIMIT Y OFFSET Z)?
| sa46 wrote:
| The API improvement proposals (AIP, yes it's a terrible acronym)
| are a rich source of API design wisdom. They assume protobufs,
| but the majority of the proposals would work for any API.
|
| The relevant one is https://google.aip.dev/158, which recommends
| cursor-based pagination and covers different pagination use
| cases, including:
|
| - support for different sort orders - mentioned in article as
| weakness of keyset pagination
|
| - support for skipping results - mentioned as a weakness of
| cursor-based pagination
|
| - why to use opaque page tokens
| grogers wrote:
| This is a good reference, but IMO page tokens need to expire,
| even if you aren't storing any data for them. Expiring tokens
| make it possible to change pagination formats over time (you
| only need to wait the expiry time while supporting both
| formats), eg to improve efficiency.
|
| Also I'll add that if you support a pagination API that accepts
| args (e.g. to filter on different dimensions) you should
| include a hash of the args in the opaque next page token, and
| fail the call if the args change. This prevents some nonsense
| scenarios where you started paging by scanning index A but the
| user switched filters such that you'd instead scan index B, but
| you don't have the key for that column.
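|
| A sketch of that validation, assuming JSON page tokens
| (helper names hypothetical):
|
|     import base64
|     import hashlib
|     import json
|
|     def args_digest(filter_args):
|         canonical = json.dumps(filter_args, sort_keys=True)
|         return hashlib.sha256(canonical.encode()).hexdigest()
|
|     def make_token(position, filter_args):
|         payload = {"pos": position, "args": args_digest(filter_args)}
|         blob = json.dumps(payload).encode()
|         return base64.urlsafe_b64encode(blob).decode()
|
|     def parse_token(token, filter_args):
|         payload = json.loads(base64.urlsafe_b64decode(token))
|         if payload["args"] != args_digest(filter_args):
|             raise ValueError("page token doesn't match request args")
|         return payload["pos"]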
| fabian2k wrote:
| Are these actually server-side cursors or keyset pagination? For
| keyset pagination you need essentially the last value of the
| columns you used to sort by, but this seems to have some random
| or encoded cursor value. I didn't think server-side cursors
| would be used at that scale, so this has to be keyset
| pagination, which means I'm missing something.
|
| And while I really like keyset pagination, it is annoying to
| implement in some cases. If your sorting column is not unique you
| need to sort by multiple columns. And that is really annoying and
| less efficient if you don't have row value comparisons. And not
| all databases and especially not all ORMs support those. I'm
| still waiting on EF Core support here, it looks a bit like this
| might happen but it's not certain yet.
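|
| For reference, a two-column keyset query using a row value
| comparison might look like this (hypothetical articles table;
| placeholder style varies by driver):
|
|     # one index-friendly row value predicate instead of the
|     # expanded (a < x OR (a = x AND b < y)) form
|     keyset_sql = (
|         "SELECT * FROM articles"
|         " WHERE (created_at, id) < (%s, %s)"
|         " ORDER BY created_at DESC, id DESC"
|         " LIMIT 10"
|     )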
| xdfgh1112 wrote:
| Surely a keyset, even if it's multi-key, is more performant
| than an offset query. Offset queries usually require the
| previous rows to be computed first, so large offsets are super
| slow.
| Jweb_Guru wrote:
| It is, but it's not a cursor in the traditional sense.
| ko27 wrote:
| EF Core now supports it with PostgreSQL
|
| https://github.com/npgsql/efcore.pg/pull/2350
| oconnor663 wrote:
| By server-side cursors do you mean like open connections to
| MySQL reading a stream of query results, or something like
| that? I don't think that's feasible for most websites. The
| biggest issue is that you'd have to guarantee that when someone
| clicks "next page", their browser ends up talking to the exact
| same machine that it got the previous page from. But websites
| running on big fleets of machines don't usually work like that;
| instead your queries are usually somewhat randomly distributed
| across a whole tier. You might work around that by trying to
| keep a persistent connection open from the browser to the
| server, but you'd lose that connection when you refreshed the
| page or when the server restarted, or if you lost WiFi for a
| minute. Making hiccups like that visible to the user is pretty
| frustrating. It's a lot more robust to use some sort of cursor
| design that doesn't rely on any persistent connections.
|
| Some other problems that crop up:
|
| - A lot of websites expect to serve most results from cache,
| and to only hit the DB a minority of the time, but to get your
| hands on a real DB cursor you'd have to hit the DB every time.
|
| - Not all DBs support moving cursors backwards. Maybe most of
| them don't? I'm not sure.
|
| - Open DB connections are usually a somewhat limited resource.
| gruez wrote:
| >but this seems to have some random or encoded cursor value
|
| Maybe the actual key value is stored server side for
| obfuscation/optimization reasons?
| terandle wrote:
| Wish this article would have gone into details about how cursor
| based pagination is actually implemented. Like they did for the
| other options by showing SQL. Is this something the database
| engine has to provide support for?
| andrewmackrodt wrote:
| You can do something like the following in application code.
|
| Imagine a table with the following schema/data:
|
|     id=1, created_at='2022-05-28T18:00Z'
|     id=2, created_at='2022-05-28T18:00Z'
|     id=3, created_at='2022-05-28T18:00Z'
|
| To retrieve articles in order of descending creation time, our
| sort order would be: created_at DESC, id DESC.
|
| The last item in the ORDER BY clause should be a unique key to
| ensure we have consistent ordering across multiple queries. We
| must also ensure all supported sort columns have an INDEX for
| performance reasons.
|
| Assume our application returns one article at a time and we
| have already retrieved the first article in the result set.
| Our cursor must include that last record's ORDER BY values,
| e.g. serialize('2022-05-28T18:00Z,3'). For example purposes I
| will use base64, so our cursor is MjAyMi0wNS0yOFQxODowMFosMw==.
|
| When the user requests the next set of results, they will
| supply the cursor and it will be used to construct a WHERE
| expression:
|
|     created_at < '2022-05-28T18:00Z'
|     OR (created_at = '2022-05-28T18:00Z' AND id < 3)
|
| So our query for the next set of results would be:
|
|     SELECT * FROM articles
|     WHERE created_at < '2022-05-28T18:00Z'
|        OR (created_at = '2022-05-28T18:00Z' AND id < 3)
|     ORDER BY created_at DESC, id DESC LIMIT 1
|
| For queries with more than 2 sort columns the WHERE expression
| will slightly increase in complexity but it need only be
| implemented once. If the sort order direction is flipped, make
| sure to adjust the comparator appropriately, e.g. for 'DESC'
| use '<' and for 'ASC' use '>'.
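|
| A minimal sketch of the above in Python, assuming a
| SQLite-style connection and the articles table described:
|
|     import base64
|
|     def encode_cursor(created_at, article_id):
|         raw = f"{created_at},{article_id}".encode()
|         return base64.b64encode(raw).decode()
|
|     def decode_cursor(cursor):
|         raw = base64.b64decode(cursor).decode()
|         created_at, article_id = raw.rsplit(",", 1)
|         return created_at, int(article_id)
|
|     def next_page(db, cursor, limit=1):
|         created_at, article_id = decode_cursor(cursor)
|         rows = db.execute(
|             "SELECT id, created_at FROM articles"
|             " WHERE created_at < ? OR (created_at = ? AND id < ?)"
|             " ORDER BY created_at DESC, id DESC LIMIT ?",
|             (created_at, created_at, article_id, limit),
|         ).fetchall()
|         # the cursor for the following page comes from the last
|         # row returned
|         if rows:
|             return rows, encode_cursor(rows[-1][1], rows[-1][0])
|         return rows, None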
| delusional wrote:
| Isn't that basically just a trivial extension of what the
| article calls "keyset based pagination"?
|
| I'm not complaining, since this is also what I usually do; I'm
| just wondering if there's more to it.
| cortesoft wrote:
| Oftentimes under the hood, cursor based paging IS keyset
| paging. The difference is that the keyset is obfuscated,
| which can allow the implementation to change without
| changing the API call.
| TravelPiglet wrote:
| How do you handle removal?
| andrewmackrodt wrote:
| I haven't had a need to implement this but something like
| the following should work to provide the same* records
| regardless of insertion/deletion.
|
| * data of the record may have been updated, but if there
| are 1,000 records total, paginating in either direction
| will always return those same 1,000 records regardless of
| insertion/deletion.
|
| - Filter by created_at <= time the endpoint first returned
| a result.
|
| - Utilize soft deletes, where deleted_at IS NULL OR
| deleted_at < time the endpoint first returned a result.
|
| Any keys allowed to be used in the ORDER BY query should
| also be immutable.
| catlifeonmars wrote:
| Makes total sense. It also seems trivially automatable on the
| surface. Echoing GP: Is this an existing DBMS feature? If
| not, why not?
| qsort wrote:
| No need for anything special, you would just query with
| something like "SELECT whatever FROM Table WHERE cursor>value
| ORDER BY cursor LIMIT n". Obviously there needs to be an index
| over the columns of the cursor, but any reasonable engine
| should allow you to add that effortlessly.
| mbStavola wrote:
| The complication comes from having a cursor over results that
| are sorted on multiple columns. It's not incredibly difficult
| to implement, but it can certainly be annoying to do,
| especially if you want to support before+after cursor as
| well.
| gabetax wrote:
| Check https://use-the-index-luke.com/no-offset as a better
| reference.
|
| But in most SQL databases, cursors are something you implement
| and parse at the application layer, and translate to a WHERE
| clause on the (hopefully indexed) column you're ordering on.
| That turns the O(n) OFFSET scan into an O(log n) index seek.
| nafey wrote:
| How is a cursor different from key set in that case?
| cdash wrote:
| It's not in that case, but cursors can be implemented however
| you want. It's an abstraction that prevents you from having to
| change the schema of the API when you change the
| implementation.
| lazide wrote:
| Many databases have actual cursor implementations which are
| transactional. That means you'll get a consistent view at the
| point in time you created the cursor.
|
| That said, they tend to live only as long as the db
| connection (with some exceptions), so yeah you need some
| application work to make it sane.
| sgtfrankieboy wrote:
| Some database engines will support cursors like this, but from
| what I can gather, cursor-based navigation is a wrapper around
| keyset or page navigation with base64-encoded parameters.
|
| This makes the implementation details of pagination unimportant
| to the API layer the user uses.
| pragmatic wrote:
| "Continuation token" is a better term than cursor, as it avoids
| confusion with DB cursors.
|
| https://phauer.com/2018/web-api-pagination-timestamp-id-cont...
| rwieruch wrote:
| Kinda related to this discussion about pagination APIs:
|
| One year ago I had to design an API [0] for paginated nodes which
| also allows nested nodes. Overall there would be more than
| 100,000 nodes in the database.
|
| On the UI side there would be a tree table to display the nodes.
| First you would fetch page 0 with 100 nodes and from there you
| could either perform a paginated fetch for the next page (page 1)
| or a nested fetch for page 0 for one of the initially fetched
| nodes. So essentially the API allows paginated and nested fetches
| where each nested fetch would be a paginated fetch too.
|
| We used cursor based pagination, because there would be added and
| deleted nodes eventually in a collaborative environment. It was
| definitely a fun challenge!
|
| [0] https://www.robinwieruch.de/react-tree-list/
| zikohh wrote:
| "Cursor-based pagination is the most efficient method of paging
| and should always be used when possible." Facebook API Docs
|
| Why is it more efficient than key set pagination? According to
| this article
| https://archive.ph/2021.04.30-125536/https://medium.com/swlh...
| the only difference is that the value of the key is encoded to
| allow the implementation to change
| wespiser_2018 wrote:
| Both approaches use the underlying index in the database, but
| cursor approaches need to look at far fewer results in order to
| return the page. If you are deep into the result set, your
| offset could be high, like 1000 or so. The database would have
| to start at the beginning and scan through 1000 results,
| especially if filters are in play, in order to get to the N
| items in the page.
| nesarkvechnep wrote:
| Can't we do both offset and keyset-based pagination? If a user
| wants a specific page, let them specify it, return them links to
| the next and previous pages, using filter + limit.
| mmastrac wrote:
| You should never design an API that uses offsets for pagination
| unless you're dealing with small amounts of data (<10000 rows).
| Cursors give you far more flexibility and if you want to be lazy,
| you can just hide an offset in an opaque cursor blob and upgrade
| later on.
|
| I don't think I've used offsets in APIs for at least 10 years.
| Lightly-obfuscated cursor tokens are one of the first things I
| build in web projects and that's usually less than an hour's
| work.
|
| If you _really_ need the ability to drop the needle in your
| dataset with pagination, design your system to use pseudo-
| pagination where you approximate page-to-record mappings and
| generate cursors to continue forward or backward from that point.
| xdfgh1112 wrote:
| IIRC the DigitalOcean API uses offsets. We had to write special
| code to handle the possibility of the list changing while it
| was being queried.
| JamesSwift wrote:
| Do any of these avoid that scenario? Unless you are
| paginating on `updated_at ASC` there is an edge case of
| missing new data on previously requested pages.
| singron wrote:
| Missing data that began existing after you started querying
| is usually OK. If you had requested the data 1 second
| earlier, it wouldn't have been returned anyway. With offset
| pagination, you can miss data that always existed. This can
| make you believe the item that previously existed had been
| deleted.
|
| Be very careful with timestamp or auto-increment id
| pagination too. These don't necessarily become visible in
| the same order since the id or timestamp is generated
| before the transaction commits unless your database has
| some specific way of ensuring otherwise.
| xdfgh1112 wrote:
| Exactly.
| giaour wrote:
| When I worked for a news site, we moved all listing endpoints
| from offset to timestamp based pagination because offset
| based pagination assumes the data being queried will remain
| stable between requests. This was, of course, very rarely
| true during business hours.
|
| Duplicates or missed entries in the result set are the most
| likely outcome of offset-based pagination.
| gopher_space wrote:
| Something always feels off about repopulating lists that
| people are working on. Like that concept needs an entirely
| different display method.
| lazide wrote:
| It's what happens with stateless protocols and offset based
| pagination 100% of the time, but most people don't notice
| it.
| MAGZine wrote:
| I consider offset-based paged pagination to be a mark of the
| novice.
| steve_adams_86 wrote:
| edit: I just saw another comment of yours, so I see this was
| meant more contextually than I realized.
|
| Can you explain why offsets would never be a suitable
| solution? Is there a clear explanation as to why?
|
| I understand how cursors are superior in some cases. But what
| if you have a site which paginates static data? Offsets would
| allow you to cache the results easily, overcoming any
| performance concerns (which would be irrelevant if the
| dataset was small anyway), providing a better user experience
| (and developer experience due to simpler implementation
| details) overall.
|
| I can see that it would be a novice move if you've got
| staggering amounts of records that take a while to query. But
| that's actually pretty rare in my experience.
| MAGZine wrote:
| It doesn't even have to be a cursor; you can technically
| page off of any field you can index (I've written a lot of
| sync APIs so there are gotchas, but the point stands).
|
| Limit/offset is usually (though not always, as you point
| out) egregious for anything more than hobbies or small
| projects. But, I am biased as when I build APIs, I
| definitely expect some amount of traffic/volume/tuple
| count, and offset/limit will not do.
| mcphage wrote:
| Or the mark of someone who asked the users what they wanted.
| MAGZine wrote:
| I don't disagree, but everyone would prefer to have an API
| that works over one that doesn't, or one that causes
| service outages due to database latency. (Did anyone ask
| the users if they wanted the API to work?)
|
| If you're dealing with very small datasets, it's fine. I'm
| an average person using average APIs, which means that when
| I see offset-based pagination, it's usually on a service
| deployed and used by a lot of people.
|
| Unsurprisingly, the offset based APIs often include some
| other arbitrary limit like "offset limited to 10k" or
| something silly but understandable if you've built an API
| used by thousands of people before, or understand how
| databases work.
|
| They're also often superseded by better APIs that actually
| allow you to page the entire result set. Then you have a
| deprecated API that you either support forever or annoy
| users by turning it off.
|
| So yes, if you are building something non-internal/pet
| project, limit/offset is probably the mark of the novice.
| xthrowawayxx wrote:
| how do you jump to arbitrary pages with cursor pagination?
| singron wrote:
| You don't, but if your cursor is just a position in some
| well-defined order, you can sometimes craft a cursor out of a
| known position. Continuing the book metaphor, instead of
| skipping to page 100, which has no semantic meaning, you
| could skip to the beginning of the "M" section of the
| dictionary, or see the first 10 words after "merchant".
| lazide wrote:
| Depending on what you mean by 'page', you probably want a
| query? (Show me all results with values between 1 and 10)
|
| If you want to jump ahead in the results, that is
| fundamentally unstable though.
| tshaddox wrote:
| You generally don't, unless you want to linearly scan through
| n pages. If your API is using offset or page numbers and the
| underlying collection is also having items added or removed,
| the behavior is undefined or straight up wrong. I think it's
| okay to use offsets or page numbers in cases where the
| collection is static or where it's acceptable to occasionally
| skip over an item or see a duplicate item while paginating.
| rhacker wrote:
| Or... search, not page. Which is typically way more useful in
| reality.
| orf wrote:
| Cursor based pagination makes sense. At a certain scale every
| part of your API becomes something you need to support, and that
| includes pagination. This makes it super difficult to change the
| storage you use, as suddenly you need to support integer based
| offsets rather than opaque tokens.
| duskwuff wrote:
| Cursors also mean you get more predictable behavior when
| interacting with data that's changing under you. Imagine
| scrolling back through an active chat, for example -- you
| wouldn't expect each page to contain some of the same chat
| messages that are already on screen.
| pornel wrote:
| Another downside of offset-based pagination is less consistent
| handling of data changes during pagination.
|
| If any row is inserted anywhere before the page you requested,
| you'll see some other item twice (once at the end of the previous
| page, then again at the top of the next page). Similarly if any
| item was removed, you'll skip over some _other_ item when
| paginating due to every page shifting by one.
| catlifeonmars wrote:
| I'd take it further: offset based pagination failure modes look
| different based on the implementation. If you have API clients,
| you're exposing that implementation detail to them.
| cortesoft wrote:
| That was mentioned as a con for offset based pagination.
| verst wrote:
| Related: Do you think SDKs for these APIs should themselves
| require you to interact with the cursors?
|
| If I ask an SDK to get me 100 items but internally it returns
| only 50 before needing to make another call with a marker or
| cursor, it would be nice if the SDK had an option to handle
| this for me (while possibly still exposing cursors for those
| who really want them).
| jahewson wrote:
| What if it fails after 50?
| verst wrote:
| Maybe return the items it found so far and provide the cursor
| it failed on? Have a way to optionally provide the cursor? I
| can think of some ways I'd implement this in an SDK. For
| simplicity I'd probably have some sort of JSON status field
| indicating whether the request (traversing all cursors)
| completed. If it's not complete, I'd still look for the next
| cursor it returned and do the usual thing manually.
| progval wrote:
| SDKs should typically return a lazy iterator in languages which
| support it, so callers do not even need to care about
| pagination and just iterate until they don't want any more
| data, while the SDK fetches new pages as needed.
|
| In pseudo-Python:
|
|     import requests
|
|     def get_stuff():
|         url = "https://example.org/api.php"
|         while url:
|             response = requests.get(url)
|             yield from response.json()
|             # parse_link: helper that reads the Link response
|             # header and extracts the next-page URL
|             url = parse_link(response.headers["link"]).next
|
| and you call it with:
|
|     for item in get_stuff():
|         print(item)
| [deleted]
| scarface74 wrote:
| The AWS Python SDK uses "paginators" to let you iterate over
| results with a single for loop while it handles the pagination
| in the background. The underlying API - which is also exposed -
| uses an opaque "NextToken".
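|
| For example, roughly (bucket name hypothetical):
|
|     import boto3
|
|     s3 = boto3.client("s3")
|     paginator = s3.get_paginator("list_objects_v2")
|
|     # the paginator follows the continuation tokens behind the
|     # scenes; we just consume pages
|     for page in paginator.paginate(Bucket="my-bucket"):
|         for obj in page.get("Contents", []):
|             print(obj["Key"])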
| nthypes wrote:
| Cursor based pagination is a good idea for APIs because it allows
| for more flexibility in how data is displayed to the user. It
| also helps to prevent large amounts of data from being returned
| all at once, which can cause performance issues.
| robertlagrant wrote:
| All pagination can help prevent large amounts of data being
| returned all at once.
| skeeter2020 wrote:
| > It also helps to prevent large amounts of data from being
| returned all at once, which can cause performance issues.
|
| Isn't that the point of all pagination, regardless of how it's
| implemented?
| cortesoft wrote:
| > It also helps to prevent large amounts of data from being
| returned all at once, which can cause performance issues
|
| You can also limit page size and accomplish the same thing.
| gregw2 wrote:
| For large datasets, consider avoiding pagination complexity by
| having your API return an (async) job ID, and dumping the whole
| result set to a file which can be fetched (over HTTPS) via the
| job ID.
|
| I have done this with both Postgres/S3 and Redshift/S3 backends
| and presigned URLs, and it looks like I could do it with
| Snowflake too.
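|
| The client side of that flow might look roughly like this
| (endpoint paths and field names hypothetical):
|
|     import time
|
|     import requests
|
|     BASE = "https://api.example.com"
|
|     # kick off the export and get a job ID back
|     job = requests.post(f"{BASE}/exports", json={"q": "orders"}).json()
|
|     # poll until the backend has dumped the result set to a file
|     while True:
|         status = requests.get(f"{BASE}/exports/{job['id']}").json()
|         if status["state"] == "done":
|             break
|         time.sleep(5)
|
|     # fetch the whole result set in one go via the presigned URL
|     data = requests.get(status["download_url"]).content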
| wespiser_2018 wrote:
| Nice write up!
|
| I've really struggled with Haskell's lack of support for
| pagination, and worked on at least two "inner-source" libraries
| that offer what the author calls "Key Set" (or limit/offset)
| and "Cursor Based" pagination as Servant combinators.
| What I've found is that "Key Set" is much more ergonomic for
| library authors to implement if they have to write the
| database/SQL calls themselves, although "Cursor Based" is
| technically more efficient.
|
| It's my opinion that the standard, out-of-the-box behavior
| should be to paginate every single endpoint that returns a
| list of results, but in order to do that easily you'll need
| some framework support. More mature frameworks have this right
| now, and IMO this is one of the weak points of using a less
| supported language like Haskell. You'll either have to roll your
| own solution, or rely on spotty library support that does the
| pagination in application code rather than in the database.
| Cthulhu_ wrote:
| The limitation of cursor-based pagination - where you only know
| the previous, current, and next page's cursor identifier - sounds
| ideal for infinite scrolling applications, where the user does
| not get the option to pick a page.
|
| Another solution or optimization I've seen is in forum software,
| where if a user executes a search, the IDs of the search results
| are copied into a "search results" table with the equivalent of a
| session ID; that way, the data remains stable between paging, and
| the data set over which pagination has to be done (e.g. via
| offset/limit) is very small; joins can be used to link to the
| items in question in the search results (e.g. posts, threads), or
| all the items that need to be shown in the paginated search
| results can be copied over to the ephemeral search results table.
|
| And of course, since it's ephemeral and memory is cheap these
| days, the whole search result can be cached in server-side
| memory.
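|
| A sketch of that ephemeral search-results table (schema
| hypothetical, queries held as strings):
|
|     # materialize the matching ids once under a session id,
|     # then page over that small, stable snapshot instead of
|     # re-running the search
|     snapshot_sql = (
|         "INSERT INTO search_results (session_id, post_id)"
|         " SELECT ?, id FROM posts WHERE title LIKE ?"
|     )
|     page_sql = (
|         "SELECT p.* FROM search_results r"
|         " JOIN posts p ON p.id = r.post_id"
|         " WHERE r.session_id = ?"
|         " ORDER BY r.post_id LIMIT ? OFFSET ?"
|     )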
| buro9 wrote:
| Moving forward and back in a cursor (pages) is easy... but how
| about random pages?
|
| i.e. this forum, HN. If there were ten pages of comments, going
| from page 1 to page 2, or vice versa is trivial with a "after" or
| "before" query. But what about jumping directly to page 3 without
| previously visiting pages 2 or 4?
|
| So far I've implemented a hybrid... pages still exist, but next
| and previous page (the most common actions) are cursor-based.
|
| Is there a better way?
| jack_squat wrote:
| IMO if a solution requires users to address data by page, it's
| a sign that the search functionality is not good enough.
|
| In general a user doesn't want to find "page 3", they are
| looking for a record that has certain characteristics. They
| shouldn't need to think about pages, just search terms.
| kortex wrote:
| If you use ULIDs for IDs (sortable by time, millisecond
| resolution) and a tombstone field (nullable deleted_at is a
| good one) then you have a very stable collection that
| guarantees no new objects will be inserted/deleted before the
| write head - it is append only in effect.
|
| You can then do some cool things, especially if your objects
| are evenly distributed and especially if you don't need exact
| page sizes or can over-query and hide them as needed.
|
| If you then know the approximate frequency of objects, you can
| then map some linear scaling to an approximate range of ULIDs.
| Basically German tank problem in reverse.
|
| https://en.m.wikipedia.org/wiki/German_tank_problem
| higeorge13 wrote:
| I have seen a variation of the cursor based approach in Intercom
| API [1] but with a fixed cursor (while the example shows a
| different cursor on every request), which also does not allow you
| to create another one within 1 minute. It feels almost like a
| database server cursor; does anyone want to guess how this is
| implemented?
|
| [1] https://developers.intercom.com/intercom-api-
| reference/refer...
| jmull wrote:
| > Bad performance for large OFFSET in SQL. When doing OFFSET N
| in SQL, the database needs to scan and count N rows.
|
| For user-facing apps, you've failed pretty hard already if users
| can't find what they need on the first page or so.
|
| If your offsets are big enough for this to matter, I'd rather
| spend time on why users are so many pages in to the data than
| optimizing the performance of later pages.
|
| (Processing clients, on the other hand, should query for what
| they need. Processing in batches may make sense, so there
| cursor- or what they call keyset-based "pagination" makes good
| sense. Though in the case of processing clients, I wouldn't
| call it "pagination"... it's more like batches. I've mainly
| used "keyset-based" pagination for processing events, which can
| alleviate some of the "cons".)
| giaour wrote:
| Users do fuzzy searches across data all the time. They may
| remember approximately when they last saw a record but not its
| exact name, or their recollection may be close but imperfect.
| z3t4 wrote:
| Pagination in APIs is a PITA, as you have to make many requests
| to get all the data. 99% of the time you want all the data, not
| just the first page. For the 1% use case you could use HTTP
| Range or paging. Pagination is often implemented as a premature
| optimization.
| indymike wrote:
| 8 minutes of wire time is bad and pagination is usually part of
| server side frameworks...
| russellendicott wrote:
| In my experience the only reason you would want all the data is
| if the API lacks good filter queries. IMO, making it a PITA to
| download all data is kind of a feature.
| cortesoft wrote:
| > 99% of the time you want all data, not just the first page.
|
| This certainly doesn't seem to be the case. Take viewing this
| very website, for example... are you suggesting that most
| people want to see all posts of all time, rather than just the
| first page or two? Paging is used extensively for feeds, and
| feeds are crazy common these days, and people rarely want to
| see all posts ever on a feed.
| crescentfresh wrote:
| > Pagination is often implemented as an premature optimization.
|
| or for forwards-compatibility.
| stevebmark wrote:
| I'd also point out that the GraphQL/Relay spec for pagination
| enforces cursor based pagination:
| https://relay.dev/graphql/connections.htm. It's also a really
| nice pagination spec overall, edges/node/pageInfo are well
| thought out and flexible.
|
| In my professional experience, lots of devs will implement
| fetching APIs and not consider pagination, and then when the
| responses start growing (testing data, use, whatever), user
| experience sucks. Pagination should be a default API design for
| anything returning an unbounded list of items.
| Ambolia wrote:
| How do you deal with this in GraphQL when you have a query
| hitting multiple nodes, which may return multiple cursors for
| the different nodes? It seems like a nightmare if you have to
| check and retry manually for each node; maybe there are
| libraries that take care of it transparently?
| stevebmark wrote:
| How is this different than hitting multiple REST endpoints
| for different resources with their own individual pagination?
| If the client needs to glue together paginated resources from
| two places, it's a hairy problem no matter what the protocol.
| A backend for frontend that does this under the hood is one
| solution to both, or a specialized endpoint that does it
| under the hood.
| Ambolia wrote:
| In the case of hitting multiple REST endpoints for
| different resources, on each query it's clear what
| pagination I'm using and combining the data is just
| appending it to the final list. In each case the partial
| results are flat and linear.
|
| But in the case of GraphQL, like I may get an incomplete
| list, or I may get a complete list that will have some
| fields with missing data, or some combination of both,
| going down deep depending on the structure of the query. I
| don't see a clear way to request the missing data without
| duplicating part of what I already have, and I don't see a
| clear way to combine the results.
|
| But I may be missing something about GraphQL or something
| about the tools that makes this simple.
|
| For example something like this:
|
|     {
|       users(first: 10000, after: "cursor123") {
|         edges {
|           cursor
|           node {
|             id
|             name
|             friends(first: 10000, after: "cursor235") {
|               edges {
|                 cursor
|                 node {
|                   id
|                   name
|                 }
|               }
|             }
|           }
|         }
|       }
|     }
| hamandcheese wrote:
| In the relay JS library you can avoid some duplication
| with fragments. Fragments are pieces of a graphql query
| that you can include in multiple queries. If you are
| using flow, the type system will also ensure that you
| can't render a component unless you have used the correct
| fragment in your query.
|
| That said, paginating multiple resources on the same page
| is still going to be complicated.
| stevebmark wrote:
| Oh. I think you'd handle it the same as REST: make a new
| individual field query for that friend by ID and ask for the
| pagination there. I think that would be manual work for GraphQL
| or REST; I'm not aware of anything that does it automatically.
| It's also not a case I've personally run into, so I'm not sure.
| grahamm wrote:
| Agreed. So many times I've raised paging with an internal API
| provider only to be told they can retrofit it later if needed.
| Then you start to use production data and bam, the one call
| doesn't work.
| User23 wrote:
| I disagree. I think pagination is a clumsy error prone kludge.
| That it performs better than the awful naive approach by
| additionally burdening the caller doesn't remotely make it
| good. The correct approach is a buffered stream with proper
| flow control. Incidentally that's what _Transmission Control
| Protocol_ was invented for.
|
| Probably the nicest way to expose this that fits contemporary
| sensibilities would be as a lazy sequence that does sensible
| prefetching and asks the server for more data as it's iterated
| through.
| stevebmark wrote:
| When dealing with the most common case (I think) for
| pagination, clients (web/native) requesting data, are you
| suggesting keeping a long-running websocket only to fetch data
| in response to new page requests? What benefit would that
| afford, given that the requests are pretty far apart in time?
|
| Or are you focusing more on the backend machine to machine
| case? In that case, some kind of automatic prefetching, lazy
| loading/yield setup sounds pretty nice if that's abstracted
| away when you want to iterate over a large dataset from a
| separate resource server. It's not a pattern I've used
| before, mainly because it's not often that one server wants
| to know _everything_ that another server has.
| giaour wrote:
| > a lazy sequence that does sensible prefetching and asks the
| server for more data as it's iterated through
|
| Isn't this an accurate description of cursor-based
| pagination? You can have the server hold the cursor and
| associate a request for the next page based on the client ID
| (like in the PostgreSQL wire protocol), or you can make the
| flow stateless (at the network level) by handing the client a
| ticket for the next page.
| jayd16 wrote:
| Pagination implies many requests whereas a streaming API would
| keep to a single request, no? No worry about multiple servers
| or juggling state.
| giaour wrote:
| It's still multiple requests on a streaming API if you're
| asking the client to notify you when it's ready for the
| next page. Most DB engines use TCP connection affinity to
| identify which cursor is in play, but the client still
| needs to send a packet asking for the next page. Both
| client and server need to keep the TCP connection open,
| which is state for both parties.
|
| I see you mentioned gRPC streams in a sibling comment,
| which are a great alternative to REST-based pagination,
| but they are highly stateful! Streams are built into the
| protocol, so a generic gRPC client will handle most of
| the state juggling that would have been your
| responsibility with a REST client.
| jayd16 wrote:
| This is a bit inaccurate I think. You can make a RESTful
| api over gRPC. The protocol is irrelevant. An open
| connection is not a violation of statelessness. If
| anything, a streaming API where you can close the cursor
| upon completion and treat new requests as fully new is a
| lot less stateful.
|
| We're also talking about APIs in general, not just http
| REST.
|
| >It's still multiple requests on a streaming API
|
| Perhaps, but the requests can be pipelined. You don't
| need to wait for the response to complete before asking
| for more.
| giaour wrote:
| The original article is talking about pagination over
| stateless (at the protocol level) HTTP requests.
| (Referring to APIs that are offered over stateless HTTP
| as "REST APIs" is technically incorrect but reflects
| common usage and is a convenient shorthand.)
|
| gRPC is able to offer powerful streaming abstractions
| because it utilizes HTTP/2 streams, which are cursor
| based rather than strictly connection oriented. The state
| is still there; it's just a protocol-level abstraction
| rather than an application-level abstraction.
|
| > Perhaps, but the requests can be pipelined. You don't
| need to wait for the response to complete before asking
| for more.
|
| That sort of defeats the purpose of grpc flow control,
| doesn't it?
| jayd16 wrote:
| >That sort of defeats the purpose of grpc flow control,
| doesn't it?
|
| Why do you say that? I don't know the full implementation
| details themselves but generally there's no reason you
| can't safely ask for even more after having asked and
| consumed some. If you have a buffer of M bytes and ask
| for M, then consume N, you could immediately ask for N
| more without waiting to receive all of the original M.
|
| Although, I wasn't speaking about gRPC in that case. I'm
| not sure how exactly gRPC achieves back pressure. I was
| only speaking abstractly about a pipelined call vs many
| pages. You seemed to claim that multiple requests were a
| requirement.
|
| But fine, ignoring gRPC, you could possibly tune your
| stack such that normal HTTP calls achieve back pressure
| from network stacks and packet loss. HTTP is built on top
| of TCP streams, after all. That doesn't make it
| inherently stateful, does it?
|
| Going all the way back, I still think it's fair to say
| that a streaming response with backpressure can be
| stateless and without multiple requests. If you want to
| argue that multiple packets or TCP signals are needed
| then perhaps so, but I think that's a far cry from the
| many separate requests a paginated call requires and I
| don't think it's accurate enough to conflate them.
| giaour wrote:
| > I don't know the full implementation details themselves
| but generally there's no reason you can't safely ask for
| even more after having asked and consumed some
|
| I think we're saying the same thing but using different
| formulations. If you send an HTTP request for a list with
| 20 items, then get back a response with 10 items and a
| link for the next page, that is essentially the same as
| consuming 20 items over a stream with flow control. The
| point of returning a cursor in the response and having
| the client send a new request for the next page is to
| support a stateful stream over a stateless protocol. In
| neither case are you waiting for the response to be
| complete before processing items, since your message
| indicates that "complete" here means the full result set
| has been sent to the client.
|
| > But fine, ignoring gRPC, you could possibly tune your
| stack such that normal http calls achieve back pressure
| from network stacks and packet loss. Http is built on top
| of TCP streams, after all. That doesn't make it
| inherently stateful does it?
|
| That's pretty much how streaming responses are
| implemented in TCP-based protocols (like the SQL query
| APIs exposed by Postgres or MySQL). TCP connections can
| be terminated for unrelated reasons, which is why you
| don't see this pattern very often in recent protocols.
| When a TCP connection is dropped and restarted due to,
| say, network congestion, you have to restart the
| paginated operation from the head of the list. H2 (and,
| vicariously, gRPC) streams are built to be more resilient
| to network noise.
|
| But to answer your question, yes, that pattern is
| inherently stateful. Pagination has to be, since the
| server has to know what the client has seen in order to
| prepare the next batch of results. You can manage this
| state at the protocol level (with streams), or you can
| push it to the application level (with cursor tokens
| embedded in the messages exchanged). The streaming
| approach requires a stateful server and client, whereas
| the application-level approach only requires a stateful
| client.
| jayd16 wrote:
| I think my opinion on close enough also differs from
| yours (and that's ok!) but you're also forgetting that a
| streamed/connection based approach only involves a single
| client and server and state during the connection.
|
| A paged response needs to sync state across many
| machines. As you have said, you need a clientID, a cursor
| reference, or a sticky session. That's not nothing.
|
| I think they're inherently different approaches.
| freedomben wrote:
| I find pagination really annoying on the client, but the
| majority of the time it does seem like the client doesn't
| actually need the whole thing so it's a good (and often very
| necessary) performance optimization. For endpoints where that
| isn't the case, it's (usually) not too hard to allow a
| sentinel value for page size that means infinity, so the
| client really can get it all in one swoop.
| jayd16 wrote:
| >Probably the nicest way to expose this that fits
| contemporary sensibilities would be as a lazy sequence that
| does sensible prefetching and asks the server for more data
| as it's iterated through.
|
| gRPC's streaming API is a good example. It has backpressure
| built in and you just need to configure the buffer sizes.
| Clients just consume items off the stream and async writes
| will suspend as needed.
| robertlagrant wrote:
| What are you saying that would be different in practice? What
| advantages would it have?
| jayd16 wrote:
| Not sure I agree with the parent but thinking it through,
| you have a single connection and a cleaner life cycle.
| Streaming writes and reads could mean back-pressure is
| possible. A client can consume items as needed as opposed
| to mapping usage to a page size.
|
| There's no worry about wasting a full request on the last
| page with a single item. In fact there's no worry about
| round trips at all, it's just a matter of buffer back
| pressure that communicates the need for more or less.
|
| Hmm, when you think about it, it's a pretty good way to go.
| User23 wrote:
| That's consistent with what I have in mind. I too think
| it's pretty much best of both worlds. You get better
| efficiency than pagination with the convenience of an API
| that's no different from fetching all the data.
| nerdponx wrote:
| I worked with an API like this once, which streamed
| NDJSON one line at a time. I loved the amount of control
| that it gave me on the client side!
| amelius wrote:
| > There is no way to skip pages. If the user wants page X, it
| needs to request pages from 1 to X.
|
| Huh? The client can ask the server to generate a new cursor, any
| desired amount of items ahead in the list.
| mypalmike wrote:
| With opaque cursors, how does a client specify "a desired
| amount"?
| amelius wrote:
| You just say "this cursor + 5 pages", for example.
| hombre_fatal wrote:
| You send `GET /items[?cursor=<cursor>]` to my server.
|
| I respond:
|
|     {
|       "items": [
|         {...}, {...}, {...}, ...
|       ],
|       "next_cursor": "77963b7a"
|     }
|
| How do you request "this cursor + 5 pages"?
| amelius wrote:
| GET /items?cursor=77963b7a&add_to_cursor=5pages
| dingleberry420 wrote:
| If the server supports this, which they won't, because they
| don't want to support pagination.
| travisgriggs wrote:
| In 32 years of software development, it fascinates me how much
| more time I spend just moving/translating data between domains.
| I used to do more modeling and more human interaction. Now the
| dominant task in any of the products I work on is just shifting
| data from one place/format/representation/node to another.
| itisit wrote:
| Ah, yes...the GRIMM stack: General Repetition of Integrations,
| Migrations, and Misery
| GiorgioG wrote:
| "Modern day best practices"...
| RandyRanderson wrote:
| Let's rewrite the microservices wiki a bit:
|
| "A microdata architecture - a variant of the data-oriented
| architecture structural style - arranges data as a collection
| of loosely-coupled data. In a microdata architecture, data are
| fine-grained and the relations are lightweight."
|
| Likely most experienced data architects are now laughing at
| what a dumb idea the above is. That's the same reaction I had
| when I first heard of microservices - "that's obviously a bad
| idea; ppl aren't that dumb", I thought.
| travisgriggs wrote:
| But big monoliths/spaghetti balls of code aren't great
| either, you'd agree with that?
|
| It seems like we don't have the ability to do "all things in
| moderation" as an industry very well. We're in constant
| search of The Silver Bullet, that if we just apply it,
| will... buy the world a coke and live in perfect harmonies or
| something. And we go through the cycle again and again and
| again.
| [deleted]
| [deleted]
| [deleted]
| Lightbody wrote:
| The Google Calendar API leaves a lot to be desired
| (documentation, etc), but over time what I've found is that it's
| really well-designed once you understand it.
|
| In their case, they went with a "pageToken" variable which is
| basically a cursor.
|
| The API has been around for years.
___________________________________________________________________
(page generated 2022-05-28 23:00 UTC)