[HN Gopher] What's good about offset pagination; designing paral...
___________________________________________________________________
What's good about offset pagination; designing parallel cursor-
based web APIs
Author : clra
Score : 54 points
Date : 2021-01-03 17:18 UTC (5 hours ago)
(HTM) web link (brandur.org)
(TXT) w3m dump (brandur.org)
| draw_down wrote:
| > it uses offsets for pagination... understood to be bad practice
| by today's standards. Although convenient to use, offsets are
| difficult to keep performant in the backend
|
| This is funny. Using offsets is known to be bad practice
| because.... it's hard to do.
|
| Look I'm just a UI guy so what do I know. But this argument gets
| old because I'm sorry, but people want a paginated list and to
| know how many pages are in the list. Clicking "next page" 10
| times instead of clicking to page 10 is bullshit, and users know
| it.
| alexchamberlain wrote:
| I can't find one right now, but I feel like there must be an
| algorithm that can identify the key that can be used for each
| page, yet cheaper than a full sort (ie cheaper than offset
| pagination in generality). Of course, a skip list would work
| for a secondary index.
| Rafert wrote:
| How do people even figure out they need to be on page 10?
| Sounds like a filter of some kind (e.g. before a date) would be
| better, except they trained themselves to use x amount of pages
| to approximate.
|
| Either way, 10 pages isn't so bad but tens of thousands can
| become troublesome as explained on
| https://shopify.engineering/pagination-relative-cursors
| PurplePanda wrote:
| Maybe they've been there before and happen to remember "page
| 10".
|
| Maybe someone else has been there before and told them "Go to
| page ten".
|
| Maybe you know that there are 20 pages, and are looking to
| find the median, or are just informally playing around,
| looking at the distribution.
|
| Same as you'd do with an interesting page of a book. I don't
| think I'd stop using page numbers in dead tree books if they
| somehow came with filters.
| ppeetteerr wrote:
| Pagination of an immutable collection is one thing and can be
| parallelized. Pagination of a mutable collection (e.g. a database
| table), on the other hand, is risky since two requests might
| return intersecting data if new data was added between the
| requests being executed.
|
| True result sets require relative page tokens and a
| synchronization mechanism if the software demands it.
| simonw wrote:
| Intersecting data is fine provided there's a unique ID for each
| result that can be used to de-duplicate them.
|
| Ideally I'd want a system that guarantees at-least-once
| delivery of every item. I can handle duplicates just fine, what
| I want to avoid is an item being missed out entirely due to the
| way I break up the data.
| ppeetteerr wrote:
| It's more than just de-duplicating, tho. Imagine you query a
| dataset and get something like a page count and a chunk size.
| That page count cannot be trusted if the dataset is mutable.
| If an item is inserted at the beginning of the set, you're
| going to miss the last item.
|
| Pagination is hard
| the_arun wrote:
| For dynamic usecase, DynamoDB has implemented pagination
| with something called _lastEvaluatedKey_ - https://docs.aws
| .amazon.com/amazondynamodb/latest/developerg...
|
| This is different from LIMIT in RDBMS
|
| Wouldn't this pattern solve the complexity you are talking
| about?
| jasonhansel wrote:
| It's important here that "created" is an _immutable_ attribute.
| Otherwise you could get issues where the same item appears on
| multiple lists (or doesn 't appear at all) because its attributes
| changed during the scanning process.
| adontz wrote:
| I believe data export and/or backup should be a separate API,
| which is low priority and ensures consistency.
|
| Here we just see regular APIs are being abused for data export.
| I'm rather surprised the author did not face rate limiting.
| gampleman wrote:
| To point out the obvious: generally API providers don't
| particularly want you to pararelize your request (they even
| implement rate limiting to make it harder on purpose). If they
| wanted to make it easy to get all the results, they would allow
| you to access the data without pagination - just download all the
| data in one go.
| sb8244 wrote:
| > If they wanted to make it easy to get all the results
|
| Speaking from experience...we want to make it easy but also
| want to keep it performant. Getting the data all in one go is
| generally not performant and is easy to abuse as an API
| consumer. For example, always asking for all of the data rather
| than maintaining a cursor and secondary index (which is so much
| more performant for everyone involved).
| alexchamberlain wrote:
| We provide (internal) access to data where we provide
| interactive access via GraphQL-based APIs and bulk access via
| CSV or RDF dumps - I feel like dump files are grossly
| undervalued these days.
| sb8244 wrote:
| I agree. I am going to reflect on this and see if there's a
| way to support dump files long term in our app. We sorta
| support it today but it's ad hoc implementation since an
| export can range from a few hundred of a thing to tens of
| millions of a thing.
|
| Is there any good literature or patterns on supporting
| dumps in the tens of millions or larger?
|
| I wrote a sheets plug-in that uses our cursor API to
| provide a full dump into a spreadsheet. Our professional
| services team is in love with it, so I bet they'd love
| generic data export capability.
| tshaddox wrote:
| That's the point. Running multiple paginated queries in
| parallel is essentially circumventing the API provider's
| intent to limit the number of items requested at one time.
| eyelidlessness wrote:
| A certain level of parallelism is generally within the realm of
| good API citizenship. Even naive rate limiting schemes tend to
| permit a certain number of concurrent requests (as they well
| should, since even browsers may perform concurrent requests
| without any developer intervention).
|
| Rate limiting and pagination aren't (necessarily) about making
| full data consumption more difficult. They're more often about
| optimizing common use cases and general quality of service.
|
| Edit to add: in certain circles (eg those of us who take REST
| and HATEOAS as baseline HTTP API principles), parallelism is
| often not just expected but often encouraged. A service can
| provide efficient, limited subsets of a full representation and
| allow clients to retrieve as little or as much of the full
| representation as they see fit.
| corty wrote:
| One thing that frequently bugs me is APIs limiting number of
| items per page for reasons of efficiency. I can perfectly
| understand low limits for other reasons, like not helping
| people scrape your data.
|
| But limiting for efficiency is usually done in a way that I
| would call a cargo cult: First, the number of items per
| "page" is usually a number one would pick per displayed page,
| in the range of 10 to 20. This is inefficient for the general
| case, the amount of data transmitted is usually just the same
| size as the request plus response headers. So if the API
| isn't strictly for display purposes, pick a number of items
| per page that gives a useful balance between not transmitting
| too much useless data, but keeping query and response
| overhead low. Paginate in chunks of 100kB or more.
|
| In terms of computation and backend load, pagination can be
| as expensive for a 1-page-query as for a full query. Usually
| this occurs when the query doesn't directly hit an index or
| similar data structure where a full sweep over all the data
| cannot be avoided. So think and benchmark before you
| paginate, and maybe add an index here and there.
| eyelidlessness wrote:
| I think keeping temporal history and restricting paginated
| results to the data at the point in time where the first page was
| retrieved would be a pretty decent way to solve offset based
| _interfaces_ (regardless of the complexity of making the query
| implementation efficient). Data with a lot of churn could churn
| on, but clients would see a consistent view until they return to
| the point of entry.
|
| Obviously this has some potential caveats if that churn is also
| likely to quickly invalidate data, or revoke sensitive
| information. Time limits for historical data retrieval can be
| imposed to help mitigate this. And individual records can be
| revised (eg with bitemporal modeling) without altering the set of
| referenced records.
| ako wrote:
| I think for most use cases, as a user i'd rather see the newest
| items in a list, then consistency of pagination. If i forget to
| manually refresh, i might miss out on important new items.
|
| Why do you think it is important for users to have temporal
| consistency?
| eyelidlessness wrote:
| Well I'll use a recent example I encountered that was
| actually very frustrating. I was looking for a font to use
| for a logo for a personal project. The site I was using
| (won't name and shame, and I can't recall the site now
| anyway) had no sorting options, items were ordered by
| whatever "popularity" formula they use. As I paginated, many
| of the fonts I'd previously viewed would appear on subsequent
| pages, often in a different order. It was frustrating not
| just because I could tell that I was probably missing fonts
| that were being bumped up to previous pages, but also because
| it made me doubt my mental model of my own browsing history:
| "Did I navigate back too far? Did I forget a tangential click
| and end up on a different search path?"
|
| It's not a great UX. And in some ways I suspect that _my own
| views_ were at least partially causing it, which made me more
| hesitant to even click on anything unless I was sure it was
| worth the disruption.
| ako wrote:
| This doesn't sound like a "temporal consistency" problem,
| rather an inconsistent and untransparent ordering issue.
| eyelidlessness wrote:
| Huh? It's absolutely a temporal consistency problem. The
| ordering was absolutely clear and consistent (most->least
| popular at time of request). But the popularity scores
| were changing so rapidly that "at time of request" makes
| the ordering _unstable_. If the ordering was determined
| by popularity at the time I accessed page 1, the ordering
| would have been stable.
|
| Sure, that popularity score would be stale. But who
| cares?
|
| Think of it this way. Suppose you're viewing your Twitter
| timeline in recent order, and suppose the pagination
| (lazy loaded on scroll) worked this way, and suppose you
| have new Tweets arriving at the same rate you scroll.
| What you would see is the same page of Tweets repeat
| forever (well, until you hit the pagination cap).
|
| This is why people come up with solutions like cursors.
| But what I was suggesting is that you can instead
| continue to use offsets (for the benefits discussed in
| the article like parallelism and predictability) if you
| paginate on the state of the data _at the time you began_
| (edit: or on the state of your sorting criteria at the
| time you began, which allows for the mitigations I
| described upthread).
|
| That's not to suggest that once you begin a pagination,
| you'll forever access stale data. It's to suggest that a
| set of pagination requests can be treated as a session
| accessing a stable snapshot.
|
| This can also be totally transparent to the client, and
| entirely optional (eg pagination is performed with an
| offset and an optional validAt token).
| ako wrote:
| Ok, thanks for the explanation. I hadn't expected the
| popularity score to be so unstable. Means that a lot of
| users are concurrently scoring the fonts, and that the
| averages are continuously being recalculated. Unexpected.
|
| But you're right, if the dataset is continuously changing
| at high frequency, pagination makes no sense.
| arcbyte wrote:
| I think you could accomplish something similar with token
| pagination by requesting a number of items that will result in
| multiple "pages" for your user interface. Then as the user
| iterates through you can request additional items. This isn't
| parallelizing, but provides the same low-latency user experience.
| felixhuttmann wrote:
| A few thoughts:
|
| 1) AWS dynamodb has a parallel scanning functionality for this
| exact use case.
| https://docs.aws.amazon.com/amazondynamodb/latest/developerg...
|
| 2) A typical database already internally maintains an
| approximately balanced b-tree for every index. Therefore, it
| should in principal be cheap for the database to return a list of
| keys that approximately divide the keyrange into N similarly
| large ranges, even if the key distribution is very uneven. Is
| somebody aware of a way where this information could be obtained
| in a query in e.g. postgres?
|
| 3) The term 'cursor pagination' is sometimes used for different
| things, either referring to an in-database concept of cursor, or
| sometimes as an opaque pagination token. Therefore, for the
| concept described in the article, I have come to prefer the term
| keyset pagination, as described in
| https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin....
| The term keyset pagination makes it clear that we are paginating
| using conditions on a set of columns that form a unique key for
| the table.
| gigatexal wrote:
| From the code sample in the article I didn't know you could
| append to a slice from within a go func
| mssundaram wrote:
| As long as you use the mutex locks
___________________________________________________________________
(page generated 2021-01-03 23:01 UTC)