At last week's Cassandra Summit, Jonathan Ellis, a Cassandra contributor and co-founder of Riptano, gave a presentation about the current state of Cassandra. If you are using or considering Cassandra, this presentation gives you good insight into where Cassandra stands and where it's going for its 0.7.0 release.
Entries in Cassandra (86)
Josh Owens, from WebPulp.tv, interviews Joe Stump of Digg and now SimpleGeo fame. The interview covers a number of different topics, but a good chunk of it is dedicated to Cassandra. Specifically, Joe discusses the ease with which a Cassandra cluster can be set up, the modifications they've made to make it AWS zone aware, and more.
Check it out:
Yesterday we looked at an extension to Cassandra that provides asynchronous triggers. Now we will see a use case in action: managing a secondary index with triggers. In this post, Maxim Grinev and Martin Hentschel are at it again. They describe the use case here:
Cassandra does not support secondary indexes at first, but storing redundant data (in a different layout) will give you the same effect. The main drawback is that your application (the code that writes to the DB) needs to take care of managing the index. Every time you write to the DB, you also need to maintain your index.
So by using asynchronous triggers, you can move index maintenance out of the write path, and the client no longer waits for the index update before getting a response.
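To make the bookkeeping concrete, here is a minimal sketch of what "managing the index yourself" looks like. It is not Cassandra client code: the two dictionaries are in-memory stand-ins for column families, and the names `users`, `users_by_city`, `write_user` and `find_users_in` are hypothetical, chosen only for illustration.

```python
# In-memory stand-ins for two column families (hypothetical names,
# not a real Cassandra client).
users = {}          # primary data: user_id -> {"name": ..., "city": ...}
users_by_city = {}  # redundant index: city -> {user_id: ""}


def write_user(user_id, name, city):
    """Write the primary row and keep the redundant index in step with it."""
    old = users.get(user_id)
    if old is not None and old["city"] != city:
        # The indexed value changed, so the stale index entry must go.
        users_by_city[old["city"]].pop(user_id, None)
    users[user_id] = {"name": name, "city": city}
    # Column name is the user ID; the column value is unused.
    users_by_city.setdefault(city, {})[user_id] = ""


def find_users_in(city):
    """Answer the 'secondary index' query from the redundant layout."""
    return sorted(users_by_city.get(city, {}))
```

Every write touches two places, which is exactly the work an asynchronous trigger could take over after the client has already been answered.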
Check it out: Managing Indexes in Cassandra using Async Triggers
Follow Up: Native support for secondary indexes is planned for Cassandra. Here is the JIRA ticket (CASSANDRA-749), which is full of discussion among the committers.
Maxim Grinev and Martin Hentschel have written about an extension they built for Cassandra: asynchronous triggers. They define a Cassandra trigger as follows:
Like traditional database triggers, Cassandra Async trigger is a procedure that is automatically executed by the database in response to certain events on a particular database object (e.g. table or view). The distinguishing feature of Async trigger is that the database responds to the client on successful update execution without waiting for triggers to be executed, thus reducing response latency.
What are the attributes of these triggers?
- "After" triggers - A trigger is executed after the update operation that fires the trigger and can see the results of the update
- Trigger procedures are implemented in Java
- Cassandra Async triggers are mutation-level triggers. A trigger is executed for each mutation issued to the column family.
- Cassandra Async triggers are asynchronous. The database acknowledges update execution to the client after the update is executed and the fired triggers are submitted for execution. Actual execution of fired triggers happens after the acknowledgement to the client. It allows saving latency but leads to eventual consistency of data.
- Guarantees triggers to be executed at least once
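The attributes above can be sketched in a few lines. This is a toy model in Python, not the extension itself (the real trigger procedures are written in Java), and `AsyncTriggerStore`, `update` and `wait_for_triggers` are hypothetical names. It assumes a single background worker and retries a failing trigger until it succeeds, which is one simple way to get at-least-once execution.

```python
import queue
import threading


class AsyncTriggerStore:
    """Toy store: acknowledge the update first, run the trigger afterwards."""

    def __init__(self, trigger):
        self.data = {}
        self._trigger = trigger
        self._pending = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def update(self, key, value):
        self.data[key] = value            # the update itself is applied
        self._pending.put((key, value))   # the fired trigger is only submitted
        return "ack"                      # the client is answered right away

    def _run(self):
        while True:
            key, value = self._pending.get()
            while True:
                try:
                    self._trigger(key, value)
                    break
                except Exception:
                    # Retry until it succeeds: at least once, possibly more.
                    continue
            self._pending.task_done()

    def wait_for_triggers(self):
        """Block until every submitted trigger has completed."""
        self._pending.join()
```

Because a trigger that fails partway is simply run again, its side effects can happen more than once, so trigger procedures are best written to tolerate repetition.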
Tomorrow's post will look at what you can do with asynchronous triggers.
Read more: Extending Cassandra with Async Triggers
Ashwin Jayaprakash has a good post about installing and starting Cassandra on Windows 7. Ashwin has some thoughts about the installation and what he thinks is missing from Cassandra.
Read more: Cassandra for First Timers
An idempotent process is one that produces the same result no matter how many times it is repeated. So what does this have to do with Cassandra? Maxim Grinev is going to tell us. Specifically, he points out:
When you develop application for Cassandra you should be aware of the following fact. Even when client observes the failure of an update it is still possible that this update has been executed successfully. The cause of such anomalous behavior is that Cassandra does not support transactional rollback.
This "lack of support" for transactional rollback is actually by design, mainly because rollback of distributed transactions is both expensive and hard to scale. Therefore, as Maxim points out, our applications must deal with these intricacies.
So how do you design a data model to account for the issues described above? Well, Maxim describes a rather clever way of designing your application:
Instead of just storing the mapping of URLs to counters (i.e. column family URL_statistics where each record has an URL as a key and a single column having counter as its value) a solution can be to store the mapping of each URL to the IDs of the tweets which contains the URL (i.e. column family URL_Tweets where each record has an URL as a key, columns representing tweets, column names are the tweet IDs, and column values are not used). URL counters will then be computed on retrieval by counting tweet IDs.
It is a good idea to store tweet IDs as column names so that Cassandra automatically eliminates duplicates – repeated update will be ignored in this case (use this great Cassandra feature to make your updates idempotent!).
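Here is a small sketch of the difference, using plain dictionaries as stand-ins for the two column families named in the quote. The function names (`record_counter`, `record_tweet`, `count`) are hypothetical, and no Cassandra client is involved.

```python
# Dictionaries standing in for the two column families from the quote.
url_statistics = {}  # URL -> counter value (not idempotent)
url_tweets = {}      # URL -> {tweet_id: ""}, column values unused (idempotent)


def record_counter(url):
    """Increment a stored counter: a retried update is counted twice."""
    url_statistics[url] = url_statistics.get(url, 0) + 1


def record_tweet(url, tweet_id):
    """Store the tweet ID as a column name: a retried update changes nothing."""
    url_tweets.setdefault(url, {})[tweet_id] = ""


def count(url):
    """Compute the counter on retrieval by counting tweet IDs."""
    return len(url_tweets.get(url, {}))
```

If the client sees a failure and retries an update that had in fact succeeded, `record_counter` ends up one too high, while calling `record_tweet` again with the same tweet ID leaves `count` unchanged.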
So how early should you design this into your application's data model?
It is common that Cassandra applications are not initially designed for idempotence. At first, small scale deployments do not exhibit these subtle problems and work fine. Only as time passes and their deployments expand the problems manifest and the applications respond to handle them. Do it right from the beginning.
Today we are seeing more instances of polyglot programming, where multiple languages exist in harmony. Well, we are starting to see the same approach when it comes to persistence. Whenever you read posts about what NoSQL is about, one of the common answers is choice: the idea that you choose the right tool for the job. In the end, that is what polyglot persistence is all about: choosing one or more persistence frameworks to solve your problems.
So with that said I turn your attention to a recent post by Todd Stavish concerning the very topic of polyglot persistence. Todd has built a sample application that uses both Cassandra and InfiniteGraph with interesting results:
A feature of the hybrid system that was particularly useful was the ability to to learn from the graph and update the original Cassandra store. For instance, learning a shorter path between friends in InfiniteGraph and then updating the columns in Cassandra. For a friend suggestion application, the shortest might be the critical one. However, InfiniteGraph can hop vertexes and edges quickly. Therefore, calculating the longer paths was possible too. Alternative longer path are useful for certain applications like tracking money laundering or other law enforcement needs. In general, I found that I could easily try multi-path analysis, update Cassandra, then run my original queries in Cassandra over the new data
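The "learn from the graph, then update the store" loop can be sketched as follows. This is not Todd's code and uses neither InfiniteGraph nor Cassandra: the graph is a plain adjacency dictionary, the store is a plain dictionary, and `shortest_path` and `update_store` are hypothetical names. The path search is an ordinary breadth-first search.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    previous = {start: None}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                frontier.append(neighbour)
    return None


def update_store(store, graph, start, goal):
    """Write the learned path back to the key-value stand-in for the store."""
    path = shortest_path(graph, start, goal)
    if path is not None:
        store[(start, goal)] = path
    return path
```

Later queries can then read the stored path directly instead of walking the graph again.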
Check out the post and the code Todd has provided.
In 2006 Google released a whitepaper that was quite influential in the NoSQL ecosystem. The paper was Bigtable: A Distributed Storage System for Structured Data. Here is an excerpt from the paper's abstract.
"Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. ... In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable."
Ultimately, the design and implementation details discussed in the Bigtable paper have found their way into several NoSQL databases, with HBase and Cassandra leading the pack. That leads us to today's presentation. The folks at Gemini Mobile Technologies have summarized the Bigtable paper in their presentation for the NoSQL Summer Tokyo Edition. It's clear and it's concise, so enjoy.
Adin Scannell, of GridCentric, has posted a recipe for creating a Cassandra cluster in 5 minutes. Adin provides a very detailed account, from installing the required packages to starting up the cluster. One of the very nice features of Cassandra is that all nodes are treated equally, which makes clusters rather easy to set up. Adin highlights this point:
Cassandra does not have different classes of nodes, so I had no need to run anything special on the master of the virtual cluster. Cassandra only requires a seed node, so that a freshly started instance can learn about the others. I use the master for this purpose, since we can assume that it's always around.
So if you are looking to create a Cassandra cluster, check it out.
I'd like to thank all of the folks who came out to the first San Diego NoSQL meetup. Special thanks to Charles Glommen for organizing the meeting and to TouchCommerce for sponsoring it.
I've included the slides from last night's presentations.
In case you missed it, Cassandra has been in the news the last couple of days, so I thought this would be a good opportunity to provide an introduction to Cassandra via Gary Dusbabek from Rackspace. This presentation was actually given at the Silicon Valley Cloud Computing Group back in June of this year.
A couple of key points about Cassandra (not from the presentation):
- Initially created by Facebook to power search over users' inbox mail on the site.
- The source code was open sourced and released to the Apache Software Foundation.
- Its design was inspired by both Google's BigTable and Amazon's Dynamo.
- It's considered to be a column data store, similar to Google's Bigtable or Apache HBase.
So why Cassandra at all? As Dusbabek mentions in his presentation, "vertical scaling is hard". As the amount of data we create and analyze increases, our strategies for dealing with that data change. Dusbabek walks us through a number of topics in his discussion, including scaling, the replication model, the data model and practical considerations.
So without any further interruptions...