Tech Focus Fri 29th May, 2015
NoSQL, MongoDB, and the importance of Data Modelling
It’s a brave new NoSQL powered world out there, and companies are increasingly moving to new technologies for data storage and reporting. And these solutions are indeed powerful — removing database features that people don’t always need (stored procedures, schema enforcement by the db, etc.), in exchange for features that we do want (really easy replication and failover, easy horizontal scalability, better concurrency).
But these systems aren’t a magic bullet, as evidenced by the number of people that have trouble using them, and the number of people that actively complain about them.
This is probably true for almost all NoSQL solutions, but in my experience it seems particularly true with MongoDB, probably for a few reasons:
- It’s the NoSQL solution I have the most experience with, so it’s the one I have the most opportunity to complain about.
- It’s one of the most (if not the most) popular NoSQL DBs, so it’s the solution most people try, often the first solution people try, and one of the solutions most likely to be used by relatively inexperienced developers.
- It is a document database, which offers the widest array of modelling options — for better and for worse.
Data modelling is important no matter what persistence solution you use (yes, even in the SQL world, where there are endless courses, books, and articles available about basic modelling choices, like how to best represent a tree or graph¹). But document databases provide maximum flexibility in how you persist a given bunch of data…so with good data modelling techniques you have the maximum ability to “model your way out of trouble” with performance and concurrency issues, but with bad modelling you have a lot of rope to hang yourself with.
In my professional experience, I’ve run into a number of situations where a poor developer experience with MongoDB was largely based on the way it was being used, and suboptimal data modelling.
Let’s look at two of those cases.
Example 1: A reservation system
This is one of the more recent cases I’ve dealt with, a leading online restaurant reservation system developed by our friends at bookatable.com. Bookatable provides real-time reservation and marketing services for 13,000 restaurants across 12 countries, and they are very committed to delivering up-to-date table availability information to users of their consumer web and native apps. Their working data set is the set of available table reservations for each restaurant, and it covers many days in the immediate past and near future, with many different tables for each restaurant.
Originally, the data was modelled with one document per possible reservation, with a structure something like this:
{
    _id:            // a unique MongoDB ObjectId
    restaurantId:   // int or long
    date:           // a date
    timeOfDay:      // a time
    ...             // other details of the reservation
                    // (table no, size, etc)
}
The developers at Bookatable are very sharp; they had done a great job with general performance tuning of their app and were following most of the published MongoDB guidelines. Even so, they were experiencing significant performance problems, to the extent that there was concern from management that MongoDB might not be the correct tool for the job. In fact, MongoDB is quite suitable for this kind of workload, and some relatively minor tweaks to the data model led to some dramatic improvements.
Tiny documents that are too finely grained:
MongoDB documentation regularly mentions the maximum document size (as of 2015 it’s 16MB), but there’s very little discussion of the minimum effective document size. It varies from setup to setup, and is best measured empirically, but in a discussion I had in a London pub with Dwight Merriman once, his opinion was that the effective minimum size was about 4KB.
This makes sense when you think about how MongoDB loads documents from disk. MongoDB files are organised into 4KB pages, and the DB always reads an entire page into memory at once². Some of the data in that page will be the document you’re looking for. However, MongoDB documents are not stored in any particular order (including _id order), and a page contains random documents³. So it’s usually best to work under the assumption that the rest of the page contains data completely unrelated to your query, and that having been loaded into memory, it will then be ignored.
You don’t need to worry about a small number of your documents being slightly smaller than the limit, but if you’ve got a lot of documents that are drastically smaller than the page size then you are very likely throwing away a lot of the data your system spends precious IO retrieving.
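If you want a rough sense of where your own collections sit relative to that page size, the collection statistics include the average document size. A quick sketch in the mongo shell, using an illustrative collection name of reservations (an assumption, not the client's actual schema):

// Collection statistics include the average document size in bytes.
// "reservations" is an assumed collection name for illustration.
var stats = db.reservations.stats();
print("documents: " + stats.count);
print("average document size: " + stats.avgObjSize + " bytes");
// If avgObjSize is a small fraction of the 4KB page size, most of the data
// pulled from disk on each document fetch is likely being thrown away.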
Documents larger than page size are no problem, as MongoDB does guarantee that a single document is always stored contiguously on disk, which makes it easy to return. So between 4KB and 16MB, smaller is better, but below 4KB smaller can be really bad.
A document per individual reservation meant that almost all of the documents in the database were under the minimum effective document size…often by an order of magnitude. The client was definitely experiencing the effects of all of that wasted IO.
Retrieving a large number of documents to satisfy common queries:
Related to the previous point, I generally use “the number of documents that need to be loaded” as an easy-to-calculate approximation of “how expensive is this query”. It’s far from exact, but it’s usually pretty easy to figure out, and it can go a long way with a little bit of napkin math.
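You don’t have to guess at that number either: the query planner will report how many documents a query actually examined. A minimal sketch, assuming MongoDB 3.0 or later and the same illustrative reservations collection:

// Ask the planner for execution statistics instead of running the query normally.
var plan = db.reservations.find({ restaurantId: 42 }).explain("executionStats");
print("documents examined: " + plan.executionStats.totalDocsExamined);
print("documents returned: " + plan.executionStats.nReturned);
// A big gap between "examined" and "returned" usually points to a missing index;
// a big "returned" count is a modelling smell of the kind discussed here.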
For this client, a common query is “for a given restaurant, what are all of the times I can get a free table for X people in the next two days?”. Because of how the data was organised (one document per potential reservation), one document had to be retrieved for each potential table of the right size, times the number of timeslots. In a large restaurant (say 50 tables) with 20 timeslots each day (open 10 hours a day, with 2 possible timeslots per hour per table), up to 2,000 documents need to be returned for a 2 day span, and that means (give or take) 2,000 potential random IO operations. Consider the same operation across a set of restaurants in a city, and a very large amount of IO is required to answer queries.
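To make that concrete, under the one-document-per-reservation model that common query looks something like the sketch below. The field names partySize and available are my assumptions rather than Bookatable’s actual schema; the important point is that every matching timeslot comes back as a separate document:

// One document per potential reservation, so every free slot in the next two
// days for every suitable table is a separate document to locate and load.
var from = new Date("2015-05-29");
var to   = new Date("2015-05-31");
db.reservations.find({
    restaurantId: 42,                    // the restaurant in question
    date:      { $gte: from, $lt: to },
    partySize: { $gte: 4 },              // assumed field: the table must seat the party
    available: true                      // assumed field: the slot is still free
});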
How data modelling fixed the problem:
Changing the data model so that a single document contained all available reservations per restaurant per day solves almost all of these performance issues.
The documents got larger, but because they were previously well under the 4KB page size, this didn’t negatively impact the IO requirements of the system at all. There are also far fewer documents – now just 1 document per restaurant per day. Satisfying common queries also requires retrieving far fewer documents: the common query “for a given restaurant, what are all of the times I can get a free table for X people in the next two days?” now needs exactly 2 documents to be loaded.
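As a rough sketch of what the remodelled data might look like (the collection name availability and the field names inside slots are my own, not Bookatable’s actual schema):

// One document per restaurant per day, holding all of that day's availability.
var exampleDay = {
    restaurantId: 42,
    date: ISODate("2015-05-29"),
    slots: [
        { time: "18:00", tableNo: 7, size: 4, available: true },
        { time: "18:30", tableNo: 7, size: 4, available: false }
        // ...one entry per table per timeslot
    ]
};

// The "next two days" query now touches exactly two documents; filtering by
// party size can happen in the application, or via a projection on slots.
db.availability.find({
    restaurantId: 42,
    date: { $gte: ISODate("2015-05-29"), $lt: ISODate("2015-05-31") }
});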
This also has an effect on the number of writes that are required: Updating a restaurant’s inventory for a day (or adding data for a new day) requires just one document write.
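In shell terms that daily write might look something like this (again using my assumed availability collection), with an upsert covering both “update an existing day” and “add a new day”:

// Replace or create the single availability document for one restaurant and one day.
var newSlotsForTheDay = [ /* the day's full inventory, built by the application */ ];
db.availability.update(
    { restaurantId: 42, date: ISODate("2015-05-30") },   // which document
    { $set: { slots: newSlotsForTheDay } },              // the whole day's slots in one go
    { upsert: true }                                     // create the document if it's a new day
);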
Example 2: Time series data & calculating catchments
This is a slightly older example from a project Equal Experts worked on with Telefónica Digital in 2012-2013, with a very interesting use case: Calculating catchment areas over time.
Briefly, a catchment area is the area where the people who visit a certain location live. As traffic varies a lot over the course of a year, the requirement was to calculate a catchment area over time: “Over a 6 month period, where did the people who visited location X come from, and in what concentrations?”
The United Kingdom was divided into about 120,000 areas (“locations”) and for each day, traffic movement was described as “X people who live in location Y visited location Z.” The challenge was to be able to retrieve and aggregate that information in a timely fashion to generate a report.
The product designer even provided a worst-case “nightmare query”: calculate the catchment for a collection of 80 destinations over a period of 6 months.
Another consultancy⁴ took the data, and modelled it in the most straightforward way they could think of, where each document was a tuple:
{sourceLocationId, destinationLocationId, date, visitors}
This didn’t work. In fact, it failed so badly that by the time we got involved in tuning this particular use case, the other consultancy had determined that MongoDB was completely unsuitable for the task at hand, and shouldn’t be used.
The problem is that each of these tuples is dramatically smaller than 4KB: only about 128 bits of data (plus key names) per tuple. And with 120,000 possible sources and 120,000 possible destinations, up to 120k * 120k documents had to be stored for each day. That’s 14.4 billion documents per day.
Trying to insert that many documents every day completely overloaded the MongoDB cluster. The team involved tried everything they could think of to performance tune the cluster (more shards, more IO, more mongos instances).
Worse, satisfying the “nightmare query” involved loading documents for 80 destinations * 120,000 potential sources * 180 days, or 1.7 billion documents, out of a potential 2.59 trillion documents in the database for the time period.
It was at this point that Equal Experts first became involved in this particular feature. Our very first suggestion was to rethink the data model, and as it happened, changing the data model completely solved this performance problem.
Our first modelling change was to use larger documents, where each document contained all of the traffic for 1 destination over 1 day:
{
    _id:                     // an ObjectId
    destinationLocationId:   // int
    date:                    // date
    sources: {
        sourceLocationId: visitors    // both ints
        ....                          // up to 120,000 of these
    }
}
This resulted in much larger documents, but still comfortably under the MongoDB document size limit.
This immediately reduced the number of documents required each day from 14.4 billion to 120,000. And the number of documents required to satisfy the nightmare query was 80 destinations * 180 days = 14,400 documents.
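Against that model the nightmare query becomes a simple range scan; a sketch, with dailyTraffic as my assumed collection name:

// Fetch the per-destination, per-day documents for 80 destinations over 6 months:
// at most 80 * ~180 = 14,400 documents.
var destinationIds = [ 101, 102, 103 /* ...the 80 destinations chosen by the user */ ];
db.dailyTraffic.find({
    destinationLocationId: { $in: destinationIds },
    date: { $gte: ISODate("2012-01-01"), $lt: ISODate("2012-07-01") }
});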
This was an improvement, but we were still unhappy with loading over 14 thousand documents to satisfy a query, so we further optimised the model.
We stayed with a document per destination per day, but instead of just storing the visits for that day, we stored the visits for that day and the total visits so far that year:
{
    _id:                     // an ObjectId
    destinationLocationId:   // int
    date:                    // date
    sources: {
        sourceLocationId: {
            dailyVisits:     // int
            ytd:             // int
        },
        ....                 // up to 120,000 of these
    }
}
At this point the documents were large enough that we were heavily optimising key names to keep documents under control. I have omitted that here to keep things more readable.
Now, to return the catchment for a date range, you load the document for the end of the range and the document for the beginning of the range, subtract one set of running totals from the other, and presto: you’ve got the number of visits during the range.
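A minimal sketch of that subtraction for a single destination, using the same assumed collection and field names as above:

// Catchment for one destination between rangeStart and rangeEnd:
// load the two boundary documents and subtract the running totals.
var rangeStart = ISODate("2012-01-01");
var rangeEnd   = ISODate("2012-06-30");
var startDoc = db.dailyTraffic.findOne({ destinationLocationId: 101, date: rangeStart });
var endDoc   = db.dailyTraffic.findOne({ destinationLocationId: 101, date: rangeEnd });

var catchment = {};
for (var sourceId in endDoc.sources) {
    var before = startDoc.sources[sourceId] ? startDoc.sources[sourceId].ytd : 0;
    catchment[sourceId] = endDoc.sources[sourceId].ytd - before;
    // add startDoc.sources[sourceId].dailyVisits back in here if the start day
    // itself should be counted as part of the range
}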
Inserting new data is still pretty easy: just load yesterday’s document, add today’s numbers to it, and you’ve got the new year-to-date totals.
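A sketch of that daily load for one destination, assuming the day’s raw counts arrive as a simple map of sourceLocationId to visitor count (todaysCounts below is a placeholder for whatever the ingest pipeline produces):

var yesterday = ISODate("2012-06-29");
var today     = ISODate("2012-06-30");
var todaysCounts = { "345": 17, "678": 3 };   // placeholder: today's raw visits per source

var yesterdayDoc = db.dailyTraffic.findOne({ destinationLocationId: 101, date: yesterday });

var sources = {};
// Carry every known source's running total forward, even if it had no visits today...
if (yesterdayDoc) {
    for (var id in yesterdayDoc.sources) {
        sources[id] = { dailyVisits: 0, ytd: yesterdayDoc.sources[id].ytd };
    }
}
// ...then fold in today's counts.
for (var id in todaysCounts) {
    if (!sources[id]) sources[id] = { dailyVisits: 0, ytd: 0 };
    sources[id].dailyVisits = todaysCounts[id];
    sources[id].ytd += todaysCounts[id];
}

db.dailyTraffic.insert({ destinationLocationId: 101, date: today, sources: sources });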
The number of documents that need to be loaded to satisfy the “nightmare query” is now 80 destinations * 2 days = 160 documents.
That’s a total of 10.8 million times fewer documents that theoretically would need to be retrieved⁵. It’s highly doubtful that any amount of physical tuning could ever provide a similar performance increase.
Conclusions
So, to recap:
- The number of documents that need to be read (or written) by a particular operation is often a useful approximation for how much IO needs to be performed.
- Modelling is important. Changes to a data model can impact performance as much as, or even more than, regular performance tuning, physical tuning, or configuration changes.
- Document databases give you maximum flexibility for modelling, and therefore a lot of potential power, but also a lot of rope with which to hang yourself.
- A good data model design at the beginning of a project can be worth its weight in gold.
…and…
If you want some advice about data modelling, (especially with NoSQL and MongoDB), we’re available.
¹ when you can’t just use neo4j 😉
² this IO strategy is also used by virtually every SQL database on the market.
³ a single page will always contain documents from the same collection, but there are no other guarantees at all about which documents they will be.
⁴ in this case, names omitted to protect the very, very guilty — as will shortly become clear
⁵ “theoretically” mostly because the other company’s implementation was never able to successfully answer a query