Riak's Map/Reduce Query and Data Processing System
Friday, November 5, 2010 at 6:00AM |
Derek Stainer Map/Reduce query mechanisms are nothing new in NoSQL. Several NoSQL databases such as CouchDB and MongoDB. This next presentation by Kevin Smith of Basho Technologies describes Riak's use of Map/Reduce as its query and data processing system.
Why use Map/Reduce? Riak's Wiki Page provides us the answer:
Key-value stores like Riak generally have very little functionality beyond just storing and fetching objects. MapReduce adds the capability to perform more powerful queries over the data stored in Riak. It also fits nicely with the functional programming orientation of Riak's core code and the distributed nature of the data storage.
Here are some of the key concepts/points taken from the presentation:
Map/Reduce Facts:
- Map phases execute in parallel w/data locality
- Reduce phases execute in parallel on the node where job was submitted
- Results are not cached or stored
- Phases can be written in Erlang or Javascript
Map Phase
- Inputs must generate bucket/key pairs
- Must return a list
- Parallel results are aggregated into a single list
Reduce Phase
- Performed on the node coordinating the map/reduce job
- Two processes per reduce phase to add minor parallelism
- Must return a list
Riak allows Map/Reduce jobs to be submitted via HTTP
- Riak exposes Map/Reduce through REST API
- Job is described using JSON
- Submitted via Post
There are a couple of problems with mapping. First it beats up the nodes. In addition, it's inefficient for large buckets.
How does Riak solve this? With a real query scheduler. The result of this work is 2X performance. In addition to the performance gains this reduces contention for the javascript VMs as well as improves on cluster replication through replicas.
Another problem that arises is that querying data can become expensive using only Map/Reduce. Riak solves this problem via key filtering. This allows Riak to select objects based off their bucket and key. In addition, it reduces the work necessary to satisfy a query.
What does the future hold for Riak's Map/Reduce framework?
- Distributed Reduce Phases
- External MapReduce processes
Check out the entire presentation below
Basho Technologies,
Kevin Smith,
Riak 

Reader Comments