Wednesday, November 10, 2010 at 8:00AM | Derek Stainer
In a recent O'Reilly webcast Kyle Banker, of 10Gen, discussed indexing in MongoDB. In the webcast, Banker discusses how indexes can be designed to optimize the performance of MongoDB. In addition, Banker discusses tips for diagnosing and fixing performance issues should they arise. We have the video and slides to the webcast.
So yesterday we spoke about an extension to Cassandra that provides asynchronous triggers. Now we will see a use case in action, managing a secondary index with triggers. In this post, Maxim Grinev and Martin Hentschel are at it again, they describe the use case here:
Cassandra does not support secondary indexes at first, but storing redundant data (in a different layout) will give you the same effect. The main drawback is that your application (the code that writes to the DB) needs to take care of managing the index. Every time you write to the DB, you also need to maintain your index.
So by using the asynchronous triggers you can maintain the secondary index without the performance impact.
Follow Up: Cassandra is planning native support for secondary indexes in Cassandra. Here is the JIRA (CASSANDRA-749) full of the discussion among the committers.
Whether you are working with a traditional relational database like MySQL or a NoSQL database like MongoDB, indexing plays a big part performing efficient lookups. Alberto Lerner, from 10gen, discusses how MongoDB indexes documents for fast retrieval.
MongoDB Indexing Basics:
MongoDB can use separate tree structures to index a collection
MongoDB is going to try an use an index to avoid going through a collection
Our job as developers is to provide MongoDB with indices on important queries. What makes an important query? Frequently used queries or queries where response time is critical. MongoDB provides us with an automatic index on _id. Others can be created with 'ensureIndex'.
When searching in MongoDB the search criteria is either the index key or the prefix of an index key. Searching can yield both exact matches or several results. There are a couple of special cases with indexing. For instance, null fields. In the case of null fields a value of null is stored in the index. In the case of fields that contain arrays, there will be one entry per element in that array.
Whenever, MongoDB evaluates search criteria only one index will be used. What if you have several indices, how does MongoDB decide? It uses previous executions, so it's a performance based selection process.
In the end, almost everything we do related to building applications revolves around trade-offs, and MongoDB is no different. Lerner reminds us that there is a trade-off that exists between search criteria performance and the insertion/updating/deletion of keys. So Lerner suggests as a concluding point to be "mindful of number of indexes and choice of keys".
This presentation, by Benjamin Black, discusses indexing with Cassandra. Specifically, Black starts out discussing the differences with indexes and query creation between traditional relational data stores and column stores.
Relational stores are schema driven. Start with your schema and work forward
Column stores are query driven. Start from your queries and work backwards
In addition to these differences he discusses how things change when you are "at scale".
Denormalization is the norm
Everything depends on the indices
Black, proceeds to discuss two types of indices, super column and composite key, and provides some very good examples.
From a deployment perspective, there are two common strategies, that I know of, to deal with your indices. First and most simple is simply deploying them along with your application data in the same keyspace. A second strategy is to deploy a separate physical cluster of machines that handle the indices. Complicates the deployment and the code, but I'm sure there are some good reasons to do so.