MongoDB Best Practices
Tip 1: Normalize if you need to future-proof data
Normalization “future-proofs” your data: you should be able to use normalized data for different applications that will query the data in different ways in the future. This assumes that you have some data set that application after application, for years and years, will have to use. There are data sets like this, but most people’s data is constantly evolving, and old data is either updated or drops by the wayside. Most people want their database performing as fast as possible on the queries they’re doing now, and if they change those queries in the future, they’ll optimize their database for the new queries.
Also, if an application is successful, its data set often becomes very application-specific. That isn’t to say it couldn’t be used for more than one application; often you’ll at least want to do meta-analysis on it. But this is hardly the same as “future-proofing” it to stand up to whatever queries people want to run in 10 years.
Tip 2: Embed dependent fields
When considering whether to embed or reference a document, ask yourself if you’ll be querying for the information in this field by itself, or only in the framework of the larger document. For example, you might want to query on a tag, but only to link back to the posts with that tag, not for the tag on its own. Similarly with comments, you might have a list of recent comments, but people are interested in going to the post that inspired the comment (unless comments are first-class citizens in your application).
If you have been using a relational database and are migrating an existing schema to MongoDB, join tables are excellent candidates for embedding. Tables that are basically a key and a value—such as tags, permissions, or addresses—almost always work better embedded in MongoDB. Finally, if only one document cares about certain information, embed the information in that document.
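For example, a blog post that embeds its tags and comments might look like this (a sketch, using a hypothetical posts collection):

> db.posts.insert({
... "title" : "Hello, world",
... "tags" : ["mongodb", "schema"],
... "comments" : [
... {"author" : "joe", "text" : "Nice post!"},
... {"author" : "sam", "text" : "Me too!"}
... ]
... })

You can still query on a tag, but what comes back is the post, which is usually what you wanted anyway:

> db.posts.find({"tags" : "mongodb"})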
Tip 3: Use the correct types
Storing data using the correct types will make your life easier. Data type affects how data can be queried, the order in which MongoDB will sort it, and how many bytes of storage it takes up.
Any field you’ll be using as a number should be saved as a number: if you want to increment the value or sort it in numeric order, it must be stored as a numeric type.
Sorting compares all numeric types equally: if you had a 32-bit integer, a 64-bit integer, and a double with values 2, 1, and 1.5, they would end up sorted in the correct order. However, certain operations demand certain types: bit operations (AND and OR) only work on integer fields (not doubles). The database will automatically turn 32-bit integers into 64-bit integers if they are going to overflow (due to an $inc, say), so you don’t have to worry about that.
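For example, mixed numeric types sort together correctly (a throwaway sketch; NumberInt and NumberLong are the shell’s wrappers for 32-bit and 64-bit integers):

> db.nums.insert({"n" : NumberInt(2)})
> db.nums.insert({"n" : NumberLong(1)})
> db.nums.insert({"n" : 1.5})
> db.nums.find({}, {"_id" : 0, "n" : 1}).sort({"n" : 1})

This returns n in the order 1, 1.5, 2, regardless of each value’s type.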
Similarly to numbers, exact dates should be saved using the date type. However, dates such as birthdays are not exact; who knows their birth time down to the millisecond? For dates such as these, it often works just as well to use ISO-format dates: a string of the form yyyy-mm-dd. This will sort birthdays correctly and match them more flexibly than if you used dates, which force you to match birthdays to the millisecond.
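For instance, a string range query matches everyone born in the 1980s (a sketch, assuming a hypothetical users collection):

> db.users.insert({"name" : "joe", "birthday" : "1985-07-04"})
> db.users.find({"birthday" : {"$gte" : "1980-01-01", "$lt" : "1990-01-01"}})

Because strings compare lexicographically and yyyy-mm-dd puts the most significant part first, this behaves just like a numeric range.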
All strings in MongoDB must be UTF-8 encoded, so strings in other encodings must be either converted to UTF-8 or saved as binary data.
Always save ObjectIds as ObjectIds, not as strings. This is important for several reasons. First, queryability: strings do not match ObjectIds and ObjectIds do not match strings. Second, ObjectIds are useful: most drivers have methods that can automatically extract the date a document was created from its ObjectId. Finally, the string representation of an ObjectId is more than twice the size, on disk, of an ObjectId.
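For example, in the shell you can extract the creation time from any ObjectId:

> ObjectId().getTimestamp()

This returns the date embedded in the ObjectId’s first four bytes, which only works if the _id actually is an ObjectId, not a string.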
Tip 4: Avoid using a document for _id
You should almost never use a document as your _id value, although it may be unavoidable in certain situations (such as the output of a MapReduce). The problem with using a document as _id is that indexing a whole document is very different from indexing the fields within a document. So, if you aren’t planning to query for the whole subdocument every time, you may end up with multiple indexes on _id, _id.foo, _id.bar, etc., anyway.
You also cannot change _id without overwriting the entire document, so it’s impractical to use it if fields of the subdocument might change.
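To illustrate, suppose a MapReduce emits a compound key, so _id is a subdocument (a sketch with a hypothetical mr_out collection):

> db.mr_out.findOne()
{ "_id" : { "user" : 123, "day" : "2011-05-04" }, "value" : 17 }

The default _id index is only used for exact matches on the whole subdocument (with the fields in the same order); to query efficiently on an inner field, you need another index:

> db.mr_out.ensureIndex({"_id.user" : 1})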
Tip 5: Create indexes that cover your queries
If we only want certain fields returned and can include all of these fields in the index, MongoDB can do a covered index query, where it never has to follow the pointers to documents and just returns the index’s data to the client. So, for example, suppose we have an index on some set of fields:
> db.foo.ensureIndex({"x" : 1, "y" : 1, "z" : 1})
Then if we query on the indexed fields and only request the indexed fields returned, there’s no reason for MongoDB to load the full document:
> db.foo.find({"x" : criteria, "y" : criteria},
... {"x" : 1, "y" : 1, "z" : 1, "_id" : 0})
Now this query only touches the data in the index; it never has to touch the collection proper. Notice that we include a clause “_id” : 0 in the fields-to-return argument. The _id is always returned by default, but it’s not part of our index, so MongoDB would have to go to the document to fetch the _id. Removing it from the fields-to-return means that MongoDB can return the values straight from the index.
If some queries only return a few fields, consider throwing these fields into your index so that you can do covered index queries, even if they aren’t going to be searched on. For example, z is not used in the query above, but it is a field in the fields-to-return and, thus, the index.
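You can check whether a query is covered by running explain() on it; in this era of MongoDB, the explain output of a covered query includes "indexOnly" : true (a sketch with example values standing in for the criteria above):

> db.foo.find({"x" : 1, "y" : 2},
... {"x" : 1, "y" : 1, "z" : 1, "_id" : 0}).explain()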
Tip 6: Always use safe writes in development
In development, you want to make sure that your application is behaving as you expect and safe writes can help you with that. What sort of things could go wrong with a write? A write could try to push something onto a non-array field, cause a duplicate key exception (trying to store two documents with the same value in a uniquely indexed field), remove an _id field, or a million other user errors. You’ll want to know that the write isn’t valid before you deploy.
One insidious error is running out of disk space: all of a sudden queries are mysteriously returning less data. This one is tricky if you are not using safe writes, as free disk space isn’t something that you usually check. I’ve often accidentally set --dbpath to the wrong partition, causing MongoDB to run out of space much sooner than planned.
During development, there are lots of reasons that a write might not go through due to developer error, and you’ll want to know about them.
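For example, a duplicate key error is silent without safe writes; in older shells, where writes were fire-and-forget, you can see the error by checking getLastError after the write (a minimal sketch, output will look something like the last line):

> db.foo.insert({"_id" : 1})
> db.foo.insert({"_id" : 1})
> db.getLastError()
E11000 duplicate key error index: test.foo.$_id_ dup key: { : 1.0 }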
Tip 7: Start up normally after a crash
If you were running with journaling and your system crashes in a recoverable way (i.e., your disk isn’t destroyed, the machine isn’t underwater, etc.), you can restart the database normally. Make sure you’re using all of your normal options, especially --dbpath (so it can find the journal files) and --journal, of course. MongoDB will take care of fixing up your data automatically before it starts accepting connections.
This can take a few minutes for large data sets (probably five minutes or so), but it should be nowhere near the times that people who have run repair on large data sets are familiar with. Journal files are stored in the journal directory. Do not delete these files.
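For example, if you normally run with journaling enabled and a custom data directory, restart with exactly the same flags (the path here is just a placeholder):

$ mongod --dbpath /data/db --journal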
Tip 8: Manually clean up your chunks collections
GridFS keeps file contents in a collection of chunks, called fs.chunks by default. Each document in the files collection points to one or more documents in the chunks collection. It’s good to check every once in a while and make sure that there are no “orphan” chunks—chunks floating around with no link to a file. This could occur if the database was shut down in the middle of saving a file (the fs.files document is written after the chunks).
To check over your chunks collection, choose a time when there’s little traffic (as you’ll be loading a lot of data into memory) and run something like:
> var cursor = db.fs.chunks.find({}, {"_id" : 1, "files_id" : 1});
> while (cursor.hasNext()) {
... var chunk = cursor.next();
... if (db.fs.files.findOne({"_id" : chunk.files_id}) == null) {
... print("orphaned chunk: " + chunk._id);
... }
... }
This will print out the _ids for all orphaned chunks. Now, before you go through and delete all of the orphaned chunks, make sure that they are not parts of files that are currently being written! You should check db.currentOp() and the fs.files collection for recent uploadDates.
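Once you are sure a chunk is really orphaned (there is no matching fs.files document and no upload in progress), it can be removed by _id; a minimal sketch, reusing the chunk variable from the loop above:

> db.fs.chunks.remove({"_id" : chunk._id})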