Building a content repository on top of NoSQL
Thursday, July 29, 2010 at 6:00AM | Derek Stainer
On Tuesday's Links of the Day, we featured a link that discussed Lily, a content repository built on top of HBase and Solr. In today's post we dive deeper and look at how OuterThought came to choose HBase, and how they are using it to solve their problems.
OuterThought was having trouble scaling in three areas of their application:
- Access Control
- Facet Browsing
- Anything that required Random Access
Their previous architecture consisted of MySQL, Lucene and the file system itself. They knew they needed a solution that offered scalability, availability and performance. So how did they try to get there? At first, with traditional approaches: they pushed more logic into the database, scaled out the database and added message queues, among other things. Ultimately, NoSQL entered the picture.
So what were the requirements for the migration from MySQL, and which NoSQL store would they migrate to? They took a phased approach.
Phase 1
- Automatic scaling to large data sets
- Fault tolerance
- Flexible data model for sparse data
- Efficient access to random data
- Open source
- Java (not a hard requirement)
- Commodity hardware
Phase 2
- Integration with Hadoop (nice but not necessary)
- Consistency
- Atomic Updates
What did the selection of HBase provide?
- HDFS, which is well suited to storing large blobs of data
- A data model that was flexible and fit their CMS document model
- Ordered tables, which allow range scans among other things
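To illustrate why ordered tables matter, here is a minimal Python sketch of a range scan over sorted row keys. It is an in-memory model only, not the HBase client API; the `OrderedTable` class, its `put` and `scan` methods, and the row keys are all hypothetical names chosen for illustration.

```python
import bisect


class OrderedTable:
    """Toy in-memory model of a table whose rows are kept sorted by key.

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> {column: value}; absent columns store nothing

    def put(self, key, columns):
        """Insert or update a row, keeping the key list sorted."""
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def scan(self, start, stop):
        """Yield (key, columns) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = OrderedTable()
table.put("doc:2010-07-29:b", {"title": "Second"})
table.put("doc:2010-07-28:a", {"title": "First", "lang": "en"})
table.put("doc:2010-08-01:c", {"title": "Third"})

# All rows whose key falls in July 2010, located by binary search
# on the sorted keys rather than by inspecting every row.
july = [key for key, _ in table.scan("doc:2010-07", "doc:2010-08")]
```

Because the keys are sorted, the scan only touches the rows inside the requested range. The rows also show the sparse model: each one stores only the columns it actually has.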
So what does Lily use HBase for?
- Storage of underlying content
- Storage of forward/backward link index tables
- Storage of various secondary indexes
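As a rough sketch of how forward and backward link index tables can sit on top of an ordered key space, the Python example below stores each link under two composite keys and answers both "what does X link to?" and "what links to X?" with a prefix scan. The `LinkIndex` class, the key layout, and the separator are assumptions made for illustration; Lily's actual table design may differ. The sketch also assumes document ids never contain the separator character.

```python
import bisect


class LinkIndex:
    """Toy forward/backward link index built on sorted composite keys.

    Hypothetical illustration; not Lily's real schema or the HBase API.
    """

    SEP = "\x00"  # assumed separator; sorts before any printable character

    def __init__(self):
        self._forward = []   # sorted keys of the form source + SEP + target
        self._backward = []  # sorted keys of the form target + SEP + source

    def add_link(self, source, target):
        """Record a link in both indexes, ignoring duplicates."""
        fwd = source + self.SEP + target
        i = bisect.bisect_left(self._forward, fwd)
        if i < len(self._forward) and self._forward[i] == fwd:
            return  # link already indexed
        self._forward.insert(i, fwd)
        bisect.insort(self._backward, target + self.SEP + source)

    def _prefix_scan(self, keys, prefix):
        """Return the suffixes of all keys that start with prefix."""
        start = bisect.bisect_left(keys, prefix)
        out = []
        for key in keys[start:]:
            if not key.startswith(prefix):
                break  # sorted order: no later key can match
            out.append(key[len(prefix):])
        return out

    def links_from(self, source):
        return self._prefix_scan(self._forward, source + self.SEP)

    def links_to(self, target):
        return self._prefix_scan(self._backward, target + self.SEP)


index = LinkIndex()
index.add_link("home", "about")
index.add_link("home", "news")
index.add_link("news", "about")
index.add_link("home", "about")  # duplicate, ignored

outgoing = index.links_from("home")
incoming = index.links_to("about")
```

Keeping two tables trades extra writes for cheap reads in both directions, which is the usual motivation for maintaining a backward index alongside the forward one.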
HBase, Lily, OuterThought, Solr

Reader Comments (1)
It’s great to hear others talking about “content repositories” and alternatives to relational databases. Search as a first class citizen is also a critical aspect. With MarkLogic Server, we’ve built search into the DNA of the product from the ground up. This allows us to have full ACID transactions and real-time search—no asynchronous index updates or queuing. With a shared-nothing cluster architecture, it’s possible to build sophisticated search applications over hundreds of TB of content on commodity hardware. We’ve got customers doing this in production today.
Full disclosure: I’m a Product Manager at MarkLogic