Pagination quirk with MongoDB
Depok,We use MongoDB as our database at Cermati. A few days ago when I was building a new product line, I needed to implement pagination with MongoDB using its aggregate pipeline query and the pattern we have been using at Cermati, $sort + $skip + $limit. $sort is used to sort (obviously) the products based on their interest rate. $skip is like an offset; it tells MongoDB how many documents to skip before returning the results. $limit tells how many documents to return. So, a query string like
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$skip': 10 },
{ '$limit': 10 }
] );
will sort the collection ascending by interest rate, discard the first 10 records, and return the next 10 records. Why aggregate query? Because actually, the documents in the collections have rateTable property as an array of objects. Those objects are the ones having interestRate property. So, to sort them all based on interest rate, I need to unwind rateTable first. Unwinding can only be done in aggregate query.
To implement pagination, I used queries like these
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$skip': 0 },
{ '$limit': 10 }
] );
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$skip': 10 },
{ '$limit': 10 }
] );
I expect the first and second query to return the first 10 documents and the next 10 documents with the lowest interest rate. However, I found that the two queries had overlapping results! What happened?
Reading MongoDB documentation, I found that MongoDB has some optimization tricks when doing aggregate query. What particularly relevant to my case are $skip + $limit optimization and $sort + $limit coalescence. These two things, plus the fact that some of my documents have equal interest rate, are the cause of my problem.
Let me first explain the two optimizations first. When MongoDB sees $skip immediately followed by $limit, it will swap the two and increase the limit value by the skip value. So, the query
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$skip': 10 },
{ '$limit': 10 }
] );
will actually be transformed to
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$limit': 20 },
{ '$skip': 10 }
] );
Next, when MongoDB sees $sort immediately followed by $limit, it will only keep n documents when sorting where n is the limit value. In other words, instead of sorting the whole collection and only return the first n documents, MongoDB will keep the first n documents sorted in memory and update them if necessary, i.e. if it sees a document having lower value than the largest value of all n documents it keeps so far, as it scans through all the documents. I don’t know the actual implementation but I’d like to think that MongoDB is using heap for this purpose.
How did these two cause problem to my query? I am not really sure but I have an idea. Let me give an example. Suppose we want to display only two documents per page and we have these four documents stored in MongoDB:
[
{ name: 'A', rateTable: { interestRate: 0.2 } },
{ name: 'B', rateTable: { interestRate: 0.1 } },
{ name: 'C', rateTable: { interestRate: 0.2 } },
{ name: 'D', rateTable: { interestRate: 0.3 } }
]
To get the two documents for the first page, my query will be like
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$skip': 0 },
{ '$limit': 2 }
] );
which is transformed to
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$limit': 2 },
{ '$skip': 0 }
] );
so MongoDB will keep only two documents in memory, that is A and B, and it will not update it by adding C or D since none of them has lower interest rate than A, which is the document with the largest interest rate in the memory. So, the result of this query are documents A and B.
To get the next two documents, my query will be like
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$skip': 2 },
{ '$limit': 2 }
] );
which is transformed to
db.collections.aggregate( [
{ '$sort': { 'rateTable.interestRate': 1 } },
{ '$limit': 4 },
{ '$skip': 2 }
] );
so this time MongoDB will keep all four documents in memory. When it processes the $skip operator to skip the first two documents with the lowest interest rate, it skips documents B and C and returns documents A and D. This is possible since we don’t actually know how MongoDB handles a tie when sorting. For all we know, MongoDB considers documents A and C as equal since they have equal interest rate. So there we have the two queries have an overlapping result that is document A.
How did I solve this? Easy enough: by including another field to sort which is known to be unique. This serves as the tie-breaker when sorting. In our example that might be the name field. So next time one wants to sort something in MongoDB, remember to include some field to sort as the tie-breaker. Hope this helps.