Pagination of Database Query Results

This article is highly inspired by the blog post We need tool support for keyset pagination.
Please consider reading the original first and then my interpretation and additional thoughts about this idea.

We have a typical database base backed web application. It can be a rich client. It can be a NoSQL database. Whatever, but for the moment maybe a transactional SQL database and a web application will be used as the most common and best understood example.

We kind of construct a query, of which the result set is so large, that we do not want to transfer, render and display it all at once. Think of Google, where you might have millions of hits and only see the first 10 or 100 or so. Now you can navigate through the result set, usually by going to the next pages successively.

Nice web applications tell you how many records there are exactly. And then calculate how to split this up into a large number of pages and allowing you to go to an arbitrary page, or at least to some of the pages (last, next 10 pages) immediately.

There are several problems with this. First of all: the set you are basing this on changes with time. This could be fixed by creating some temporary snapshot in memory, in the database or in some kind of db-cache and keeping that around for some time. This is expensive and usually not worth the price, so we live with some misbehavior.

The second problem is the count. Count actually needs to do a full scan of the result set, there is no decent short cut. Now it would usually be good enough to estimate the size of the result set and that is exactly what google does. Sometimes the estimation is ridiculously bad, but we have gotten used to it. That is much better than using the real count, because it really speeds up the application by a great factor and improves user experience overall. Of course it is a bit more challenging to program an estimation and only worth it for big result sets. We could still jump to any of the next 10 pages by just increasing the size of the current query by 11 and showing the required segment. Since we sort the result by something, we could just get the first N results sorted backwards and then wrap that in a query that sorts them forward. So dropping the count can help for performance without too much loss in functionality.

Now the next problem is, that having navigated through a lot of pages, the usual way of doing a query and then constraining it with some mechanism like „rownum“, „row_number“, „limit/offset“ to the subset we want to display always requires obtaining the resultset from the beginning to our page. Usually that does not matter, because we navigate through the application manually and that means not too many pages from the beginning. But it can become an issue, if the application is heavily used.

Now we sort by something. Usually not the id, but some business logic relevant keys or a combination. They do not have to be unique, but we can always add the ID (or more generally the primary key) as the last sort key to make it unique. Then the next page can be found by just adding a condition based on our sort key. Assuming we want to show N records. The query might be
SELECT * FROM TABX   WHERE cond1 AND cond2   ORDER BY A, B, C ASC   LIMIT n;
Now the last record of the result has A=a1, B=b1, C=c1.
Then the next page can be obtained by
SELECT * FROM TABX   WHERE cond1 AND cond2    AND (A>a1 OR A=a1 AND B>b1 OR A=a1 AND B=b1 AND C>c1)   ORDER BY A, B, C ASC   LIMIT n;
To skip three pages and get the fourth of the following pages is done like this:
SELECT * FROM TABX   WHERE cond1 AND cond2    AND (A>a1 OR A=a1 AND B>b1 OR A=a1 AND B=b1 AND C>c1)   ORDER BY A, B, C ASC   LIMIT n OFFSET 3*n;
We do use (or simulate) OFFSET here, but in a more intelligent way, because we do not force it to work harder than necessary by rebasing our query on what we already know.

We get a bit more consistency, because new records being inserted in the beginning will not be seen, but they will at least not disturb the succession of our result set. Maybe that is worth more than the performance gain, because we get it with a reasonable effort. caches, snapshots or even keeping it in „session memory“ are not really serious approaches to this issue, they will just sooner or later blowup our application in production, at the worst possible moment.

Schreibe einen Kommentar

Antworten abbrechen