Last year we had a post where we broke down Kenny Gorman's presentation from MongoSV. I was fortunate to get some of Kenny's time for a quick question and answer session about his presentation at MongoSV, NoSQL in general and of course MongoDB.
NoSQL DB: Tell us about yourself, what's your role at Shutterfly, what do you do there, etc.
KG: My name is Kenny Gorman, I am a Data Architect at Shutterfly Inc. I am responsible for the design and architecture of our persistent data stores. I have been using Oracle for well over a decade and also ran one of the largest sharded PostgreSQL implementations in the world. A huge amount of my real knowledge about Oracle came from my years at PayPal/eBay, where we essentially ran some of the largest and hardest-working Oracle databases anywhere. I really learned a lot from those years in terms of how to do stuff right. It was a total 'dream team' there. I came to Shutterfly to help them move to the next level with some of the data problems they were seeing with the existing Oracle implementation.
NoSQL DB: What was your first exposure to NoSQL in general?
KG: I was aware of the 'movement' so to speak for a long time, but it wasn't until we did a ground up analysis at Shutterfly that I really dove in. I am a firm believer that one needs to choose the right databases for the job. Not all problems are relational problems, so it's great to have more mature options these days vs having to always write everything yourself.
NoSQL DB: What are your overall thoughts of the "NoSQL movement"?
KG: I kinda hate the term 'NoSQL'. I much prefer the term Not Only SQL. Best is non-relational. But that's not as catchy. I do love the fact that people are exploring different data storage platforms and there is lots of energy in the space. Lots of options out there for people to explore and use. Just be sure to really understand the tool you are using. Learn it inside and out before jumping in.
NoSQL DB: In your presentation you mention that you are currently using your RDBMS as a key/value store. Did you evaluate any NoSQL key/value databases like Redis, Riak or Membase? If so, are you planning on using any of them?
KG: The problem we have is not really a key/value problem. Essentially it's a document problem. We store data in XML documents. There are so many limitations to storing an XML document in a BLOB in Oracle by key that it really drove us to look in new directions. MongoDB is a document database, so it really enriched our data modeling abilities and allowed for new query patterns unavailable before. Not only that, but MongoDB's atomic operators simplify our software stack greatly vs reading/modifying/writing whole XML documents. MongoDB is also flexible enough to allow us to model things like a folder hierarchy as well. So it covers multiple use-cases.
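To make the contrast with read/modify/write concrete, here is a minimal sketch of an in-place update built from MongoDB's `$inc` and `$push` operators. The `albums` collection, the field names, and the `applyUpdate` helper are hypothetical, not Shutterfly's schema or code; the helper only mimics those two operators in plain JavaScript so the sketch runs without a server.

```javascript
// Hypothetical album document; field names are illustrative only.
const albumDoc = { _id: 42, title: "Vacation", photoCount: 2, photos: ["a.jpg", "b.jpg"] };

// Update document using MongoDB's $inc and $push operators.
// In the mongo shell it would be sent roughly as:
//   db.albums.update({ _id: 42 }, addPhotoUpdate)
const addPhotoUpdate = {
  $inc: { photoCount: 1 },      // bump a counter in place
  $push: { photos: "c.jpg" }    // append one element to an array
};

// Stand-in for the server (assumption: handles only $inc and $push).
// Works on a copy so the original document is left untouched.
function applyUpdate(doc, update) {
  const result = JSON.parse(JSON.stringify(doc));
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount;
  }
  for (const [field, value] of Object.entries(update.$push || {})) {
    result[field] = (result[field] || []).concat([value]);
  }
  return result;
}

const updatedAlbum = applyUpdate(albumDoc, addPhotoUpdate);
console.log(updatedAlbum.photoCount, updatedAlbum.photos);
```

The point of the sketch is that the client sends only the small update document; it never has to fetch, parse, and rewrite the whole stored document.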
NoSQL DB: What other NoSQL databases did you evaluate (I believe you mentioned Cassandra as one that you looked at)?
KG: We looked at just about everything we could think of. Lots of options like Cassandra made sense for a very narrow problem set, where MongoDB allowed for flexibility. We wanted that flexibility.
The use case of indexing arrays was key for us. For instance in mongo-speak:
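A minimal sketch of such a case follows. The `photos` collection and its fields are hypothetical, not Shutterfly's schema. The shell calls appear as comments, and the `buildTagIndex` helper is a plain-JavaScript stand-in that mimics how an index over an array field keeps one entry per element, so the sketch runs without a server.

```javascript
// Hypothetical documents; `tags` is an array field.
const photoDocs = [
  { _id: 1, name: "beach.jpg", tags: ["summer", "family"] },
  { _id: 2, name: "cake.jpg", tags: ["birthday", "family"] },
  { _id: 3, name: "snow.jpg", tags: ["winter"] }
];

// In the mongo shell, the index and query would look roughly like:
//   db.photos.ensureIndex({ tags: 1 })   // createIndex in later versions
//   db.photos.find({ tags: "family" })

// Stand-in for the index: one entry per array element,
// mapping each tag to the _id values of the documents that carry it.
function buildTagIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    for (const tag of doc.tags) {
      if (!index.has(tag)) index.set(tag, []);
      index.get(tag).push(doc._id);
    }
  }
  return index;
}

const tagIndex = buildTagIndex(photoDocs);
const familyIds = tagIndex.get("family") || [];
console.log(familyIds);
```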
In that simple use case the ability for a query against tags to be indexed and fast was a very compelling case. It can be done in a relational store sure, but not as easily. We found most use cases got better when modeled in JSON/BSON/MongoDB. The challenge was unlearning a decade or more of RDBMS modeling in order to think in documents. It's a journey really.
Another use case was sorting. We wanted to be able to decide at run time which key to sort by, and not have to store data in a particular pre-sorted order. MongoDB does that well; we can sort and index by any key.
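As a rough illustration of choosing the sort key at run time, the sketch below takes a sort specification in the same `{ field: 1 | -1 }` shape that MongoDB's `sort()` accepts. The documents and the `sortBySpec` helper are hypothetical; the helper imitates the ordering in plain JavaScript so the example runs without a server.

```javascript
// Hypothetical documents, stored in no particular order.
const unsortedDocs = [
  { _id: 1, name: "beach.jpg", created: "2010-07-04", size: 300 },
  { _id: 2, name: "tree.jpg", created: "2010-12-25", size: 450 },
  { _id: 3, name: "card.jpg", created: "2010-02-14", size: 120 }
];

// In the mongo shell the equivalent would be roughly:
//   db.photos.find().sort({ created: -1 })
//   db.photos.ensureIndex({ created: -1 })   // optional index to support the sort

// Stand-in for sort(): 1 means ascending, -1 descending.
// Returns a sorted copy and leaves the input untouched.
function sortBySpec(docs, spec) {
  const keys = Object.entries(spec);
  return docs.slice().sort((a, b) => {
    for (const [field, direction] of keys) {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
    }
    return 0;
  });
}

// The sort key is just data, so it can be picked per request.
const newestFirst = sortBySpec(unsortedDocs, { created: -1 });
const smallestFirst = sortBySpec(unsortedDocs, { size: 1 });
console.log(newestFirst.map(d => d._id), smallestFirst.map(d => d._id));
```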
NoSQL DB: Obviously there are multiple factors that go into product evaluation, what single factor was the most important in your decision?
KG: The document structure was key.
NoSQL DB: What has been the biggest surprise (good or bad) using MongoDB?
KG: In retrospect, we adopted so early on that there were (and in some cases still are) some maturity issues, like a single db-wide lock. We had to engineer around that. On the good side, I have been very pleased with our new modeling options, and just the overall simplicity of MongoDB. An admin can go from zero to hero in a day vs years with Oracle. MongoDB is such a nice change, it's just so simple. MongoDB has so much energy behind it right now that the maturity question is a non-issue. We are very pleased with the overall direction and velocity of the project.
NoSQL DB: Biggest obstacle using MongoDB?
KG: To use MongoDB properly, one has to think differently about modeling and application design. Generally the solution is so much simpler. However, it's not always easy to retro-fit your application to make a perfect match to MongoDB. Many times one makes compromises. I think if we started from scratch we would be able to take more advantage of MongoDB. But as it is, we are seeing huge gains so far, and are excited about the future.
My feeling is that ORMs have been a huge part of the reason people look at non-relational databases. People say RDBMSs can't scale, they are slow, etc, etc. But in reality it's largely a question of modeling. If you just use an ORM and let Hibernate or whatever manage your data model then of course it's going to suck. Then it's slow, and then you need a cache, so you need memcached, and then you are shaving a yak. There are lots of things wrong with the modern RDBMS stack used at most companies. At PayPal we didn't ORM, we de-normalized and had an obsession with performance. We also had top shelf engineers, same with Shutterfly. You really need great engineers who understand that you can't just ORM-away your problems. I am not a fan of the ORM, can you tell?
NoSQL DB: You mentioned the desire to use commodity hardware. Are there specific hardware considerations that people need to be aware of if they are going to be using MongoDB?
KG: That's just it. Just use a Linux box. We use Dells. Let your operations staff negotiate the best deal from a vendor. Don't use anything exotic, keep it simple.
The one exception is that we use battery-backed controllers, because we use SATA drives with cache on them. If a machine lost power abruptly, we could lose data if it wasn't for the battery on the controller. We use getLastError(w=2) so we know the data is in memory on 2 machines before the call returns, and we fsync() every second. That's how we limit our exposure. When single server durability comes out I hope to use it to reduce the latency caused by the getLastError() call.
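For readers unfamiliar with the call, here is a minimal sketch of the command document behind getLastError with `w: 2`. The `getLastErrorCommand` helper and the 5000 ms `wtimeout` value are assumptions made for illustration, not settings taken from Shutterfly; the shell invocation is shown as a comment so the sketch runs without a server.

```javascript
// Builds the command document that asks the server to wait until a write
// has reached `w` replica set members. `wtimeoutMs` is optional and caps
// how long the call may wait.
function getLastErrorCommand(w, wtimeoutMs) {
  const cmd = { getlasterror: 1, w: w };
  if (wtimeoutMs !== undefined) cmd.wtimeout = wtimeoutMs;
  return cmd;
}

// Hypothetical timeout value, chosen only for the example.
const waitForTwoMembers = getLastErrorCommand(2, 5000);

// In the mongo shell, issued on the same connection right after a write:
//   db.runCommand(waitForTwoMembers)
console.log(JSON.stringify(waitForTwoMembers));
```

The trade-off described above is visible here: the caller blocks until the second member acknowledges, which is where the extra latency comes from.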
Figuring out how much memory and storage per shard has been a challenge. You have to balance availability, capacity, cost, performance, concurrency, etc. You have to know your data and the data model to make good decisions. We chose a config, and that's now our 'brick' we build clusters on.
NoSQL DB: What feature, in your opinion, is missing from MongoDB?
KG: Coming from an Oracle and PostgreSQL background, there are tons of things I would like to see in MongoDB. I have created Jira tickets for most of them. But single server durability is key, better profiling details like Oracle's wait interface (10046 trace), tools for managing shards, background index rebuild, compression, stuff like that. Eliot and Dwight have been great about adopting the easy ones, and coding the harder ones into future releases. Mongostat is an example of that, I wrote the original in Python, and Eliot re-wrote it in C++ and included it in the distribution. It's a copy of a tool we used at eBay.
NoSQL DB: Anything else you'd like to tell us about NoSQL or MongoDB that you feel is particularly important for folks to know?
KG: Let's see. Mainly when implementing MongoDB try to think differently about your data model. I see lots of examples on the web of people using a semi-relational model in MongoDB. If you do that you don't get the benefits of Mongo. Try to model in documents. If you can't, then question if MongoDB is the correct solution! Your data has a format it wants to live in, listen carefully to it, it will tell you.
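As a rough illustration of the difference between a semi-relational model and a document model, the sketch below folds child records into their parent. The album and photo shapes and the `toDocuments` helper are hypothetical and not taken from the interview; whether embedding is the right choice depends on the data and on MongoDB's document size limit.

```javascript
// Semi-relational style: two collections joined by albumId in application code.
const relationalAlbums = [{ _id: 1, title: "Vacation" }];
const relationalPhotos = [
  { _id: 10, albumId: 1, name: "beach.jpg" },
  { _id: 11, albumId: 1, name: "pier.jpg" }
];

// Document style: each album carries its photos as an embedded array,
// so one read returns everything needed to render the album.
function toDocuments(albums, photos) {
  return albums.map(album => ({
    _id: album._id,
    title: album.title,
    photos: photos
      .filter(photo => photo.albumId === album._id)
      .map(photo => ({ name: photo.name }))
  }));
}

const albumDocuments = toDocuments(relationalAlbums, relationalPhotos);
console.log(JSON.stringify(albumDocuments[0]));
```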