Tony Tam from Wordnik describes their migration from MySQL to MongoDB. So why migrate from MySQL? Inserts on their MyISAM tables were approaching 10 seconds each. They kept producing workarounds, but this led to an increase in system babysitting. Nothing like a fragile system to make those weeknights and weekends extra fun, right?

What are the results?

Moved 5 Billion rows from MySQL to MongoDB

Sustained 100,000 inserts/second

Migration tool was the bottleneck (CPU Bound)

Wordnik now reads from MongoDB very fast

Read + create Java objects @ 250,000/second

What about the advice of going live with MongoDB?

Choose your use case carefully if migrating incrementally

Scary no matter what

Test your perf monitoring system first!

Use your DAOs from migration

Turn on MongoDB on one server, monitor, tune (rollback, repeat)

Full switch over when comfortable

As a follow-up, Wordnik noted in a later post that they are now hosting 9 billion documents. Read more at B is for Billion.

In his presentation, Marko sets up an experiment to test a graph database, in this case Neo4j, against a relational data store, MySQL. The purpose of the experiment is to traverse the graph five levels deep. The graph in the experiment contains 1 million vertices and 4 million edges.

For the run of the experiment a traverser is placed on a single vertex

For each step, the traverser moves to its adjacent vertices

Repeat each step five times

The results? Neo4j completed the experiment in 14 minutes vs. MySQL not completing the job. The full results of the experiment can be found here.
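
The traversal steps in the experiment can be sketched roughly as follows. This is a minimal in-memory illustration, not Marko's benchmark code: the `traverse` function, the adjacency dictionary, and the toy graph are all assumptions made for the example, and it tracks only the set of vertices reached at each step.

```python
def traverse(adjacency, start, steps=5):
    """Return the set of vertices occupied after `steps` hops from `start`.

    `adjacency` maps a vertex to a list of its adjacent vertices
    (a hypothetical in-memory stand-in for the database).
    """
    frontier = {start}  # the traverser starts on a single vertex
    for _ in range(steps):
        # Move every traverser to all vertices adjacent to its current one.
        frontier = {n for v in frontier for n in adjacency.get(v, ())}
    return frontier


# Toy graph, invented for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
reached = traverse(toy_graph, "a", steps=2)
print(sorted(reached))
```

In a relational store, each hop would typically be expressed as another self-join on the edge table, which is where the cost of a five-level traversal accumulates.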

So why use a graph database? Marko provides us with three potential reasons:

If the solution to your problem can be represented as a local process within a larger global structure

If the solution to your problem can be represented as being with respect to a set of root elements

If the solution to your problem does not require a global analysis of your data

So this is fine and dandy from a theoretical perspective, but what about real-life use cases? Marko provides the following examples:

Local Searches - What is in the neighborhood around A?

Local Recommendations - Given A, what should A include in their neighborhood?

Local Ranks - Given A, how would you rank B relative to A?

Collaborative Filtering - Find all the items that person A likes. Then find all the people who like those same items. Then find which items those people like that are not already liked by person A

Question expert identification - Find all the tags associated with question A. For all those tags, find all answers (for any question) that are tagged with those tags. For those answers, find who created them
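
To make the collaborative filtering example concrete, here is a minimal sketch of the three traversal steps over a tiny "likes" graph. The data, the `likes` and `liked_by` structures, and the `recommend` function are hypothetical, invented purely for illustration. Ranking candidates by the number of paths that reach them is one plausible scoring choice, not something the presentation prescribes.

```python
from collections import Counter, defaultdict

# person -> items they like (toy data, invented for illustration)
likes = {
    "alice": {"dune", "neuromancer"},
    "bob": {"dune", "foundation"},
    "carol": {"neuromancer", "foundation", "hyperion"},
    "dave": {"emma"},
}

# Reverse direction of the same edges: item -> people who like it.
liked_by = defaultdict(set)
for person, items in likes.items():
    for item in items:
        liked_by[item].add(person)


def recommend(person):
    """Score items reachable via person -> item -> other person -> new item."""
    scores = Counter()
    for item in likes[person]:                          # items A likes
        for other in liked_by[item] - {person}:         # people who like them
            for candidate in likes[other] - likes[person]:  # their other items
                scores[candidate] += 1                  # one point per path
    return scores


recs = recommend("alice")
print(recs.most_common())
```

Note that the traversal only touches the neighborhood around the starting person; "dave" and "emma" are never visited.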

In conclusion, Marko offers these final points:

Graph databases are efficient with respect to local data analysis

Locality is defined by direct referent structures

Frame all solutions to problems as a traversal over local regions of the graph

Installing CouchDB on a VM - Dennis Delimarsky provides a nice tutorial for installing CouchDB on a VM so that you can give CouchDB 1.0 a try

Riptano Packages Cassandra for the Enterprise - Matt Pfeil, co-founder and CEO of Riptano, discusses Cassandra. A little background: Riptano was created as a commercial entity for Cassandra. Pfeil and Jonathan Ellis, who is the project chair for Apache Cassandra, co-founded Riptano back in March of 2010

NoSQL The Dawn of Polyglot Persistence - We discussed Polyglot Persistence a few weeks back. Stephan Schmidt of Code Monkeyism provides some more ideas about the topic.

In last Friday's post we explored a presentation by Marko Rodriguez about the Graph Traversal Programming Pattern. Specifically we explored the various graph structures that exist. In this post we are going to explore graph databases.

Most databases can model a graph. But how do you define a graph database?

A graph database is any storage system that provides index-free adjacency

Every element (i.e. vertex or edge) has a direct pointer to its adjacent element

No O(log2(n)) index lookup required to determine which vertex is adjacent to which other vertex

If the graph is connected, the graph as a whole is treated as an atomic data structure

Marko proceeds to demonstrate the difference between a graph database and a non-graph database with respect to index-free adjacency.

Graph Database

Direct references to its adjacent vertices

Constant time cost to navigate between vertices

Non-Graph Database

Must look at an index to locate adjacent vertices

log2(n) time cost to move between vertices
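
The contrast between the two approaches can be sketched as follows. This is an illustrative toy, not how Neo4j or MySQL are implemented: the sorted `edge_table` with binary search stands in for an index lookup, and the hypothetical `Vertex` class stands in for direct references between adjacent elements.

```python
from bisect import bisect_left, bisect_right

# Non-graph style: one global edge table, sorted by source vertex.
edge_table = sorted([(1, 2), (1, 3), (2, 3), (3, 1)])
sources = [src for src, _ in edge_table]


def neighbors_via_index(vertex):
    """Binary search over the global table: O(log n) in the number of edges."""
    lo = bisect_left(sources, vertex)
    hi = bisect_right(sources, vertex)
    return [dst for _, dst in edge_table[lo:hi]]


# Graph style: each vertex holds direct references to its neighbors.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct pointers to adjacent vertices


# `v` is only used to find a starting vertex; hops never consult it.
v = {name: Vertex(name) for name in (1, 2, 3)}
for src, dst in edge_table:
    v[src].out.append(v[dst])


def neighbors_direct(vertex):
    """Follow stored references: cost depends only on the vertex's degree."""
    return [n.name for n in vertex.out]


print(neighbors_via_index(1), neighbors_direct(v[1]))
```

Both functions return the same neighbors. The difference is that the first consults a structure whose size grows with the whole graph, while the second only reads what is stored on the vertex itself.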

More about index-free adjacency:

While any database can implicitly represent a graph, only a graph database makes the graph structure explicit

In a graph database, each vertex serves as a “mini index” of its adjacent elements

As the graph grows in size, the cost of a local step remains the same

Finally, Marko makes the following points about graph databases and indices:

Graph databases allow you to explicitly model indices endogenous to your domain model. Your indices and domain model are one atomic entity: a graph

This has benefits in designing special-purpose index structures for your data.

Think about all the numerous types of indices in the geo-spatial community

Think about all the indices that you have yet to think about

In the final post of this series we will explore graph traversals with both artificial and real life examples.

Considerations for Hadoop and BI (part 2 of 2) - Part two of a two-part posting that discusses various considerations to evaluate when considering using Hadoop for business intelligence.

Java development 2.0: NoSQL - Provides a basis from which developers can get started with schemaless modeling in Java, specifically Groovy

In his presentation at WindyCityDB, Marko Rodriguez discusses graph traversal patterns. This is the first part of a multi-part series that will discuss this presentation. Specifically, in this posting we are going to discuss the various types of graph structures. We will be discussing graph databases and graph traversals in following posts.

So what types of graph structures are there? It's an interesting question and one I did not know the answer to until this presentation. Before we can begin discussing the various graph structures we need a small primer.

Graph Primer

Dots are vertices

Lines are edges

Dots and Lines make a Graph

Undirected Graph

Vertices

All denote the same type of object

Edges

All edges denote the same type of relationship

All edges denote a symmetric relationship

Examples

Collaborator graph

Road graph

Directed Graph

Vertices

All denote the same type of object

Edges

All edges denote the same type of relationship

All edges denote an asymmetric relationship

Primer

Directed edge is a line with an arrow

Examples

Twitter follow graph

Web href-citation graph
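
The difference between the two comes down to whether an edge implies its reverse. A minimal sketch, assuming a hypothetical `Graph` class backed by adjacency sets (the names and sample data are invented for illustration):

```python
from collections import defaultdict


class Graph:
    """Single-relational graph; `directed` controls edge symmetry."""

    def __init__(self, directed):
        self.directed = directed
        self.adjacent = defaultdict(set)

    def add_edge(self, a, b):
        self.adjacent[a].add(b)
        if not self.directed:
            # Undirected: the relationship is symmetric, so store both ways.
            self.adjacent[b].add(a)

    def has_edge(self, a, b):
        return b in self.adjacent.get(a, set())


collaborators = Graph(directed=False)   # e.g. a collaborator graph
collaborators.add_edge("ann", "bo")

follows = Graph(directed=True)          # e.g. a follow graph
follows.add_edge("ann", "bo")

print(collaborators.has_edge("bo", "ann"), follows.has_edge("bo", "ann"))
```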

Single-Relational Graphs

Both undirected and directed graphs are considered single-relational graphs.

All edges have the same meaning/type

Perhaps the most common type of graph

Limitations of a single-relational graph:

Can only express a single type of vertex

Can only express a single type of edge

In general, these are very limiting graph types

Multi-Relational Graphs

Obviously the opposite of a single-relational graph.

What are the gains with multi-relational graphs?

Allows for explicit typing of edges

Explicit typing allows for

edges to have different meanings

vertices to have different types
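
As a small illustration of explicit edge typing, the sketch below stores each edge as a (tail, label, head) triple. The labels, vertex names, and the `out` helper are hypothetical. The linear scan is only for brevity and is not how a graph database would resolve adjacency.

```python
# Each edge carries an explicit label, so one graph can mix relationships
# (and, implicitly, vertex types such as people and posts).
edges = [
    ("ann", "follows", "bo"),
    ("ann", "authored", "post-1"),
    ("bo", "likes", "post-1"),
]


def out(vertex, label):
    """Heads of the edges leaving `vertex` that carry the given label."""
    return [head for tail, lbl, head in edges if tail == vertex and lbl == label]


print(out("ann", "follows"), out("ann", "authored"))
```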

Property Graph

Specialized graph that extends a multi-relational graph by adding a key/value map to both edges and vertices

Properties useful for expressing non-relational data

Allows further refinement of the meaning of an edge

Property graphs are the basis for other types of graphs
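
A property graph can be sketched with two small record types. The `Vertex` and `Edge` classes, their field names, and the sample properties below are assumptions for illustration, not the API of any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)  # key/value map


@dataclass
class Edge:
    tail: Vertex
    label: str                                      # explicit edge type
    head: Vertex
    properties: dict = field(default_factory=dict)  # key/value map


ann = Vertex("ann", {"type": "person", "age": 34})
bo = Vertex("bo", {"type": "person"})
knows = Edge(ann, "knows", bo, {"since": 2008, "weight": 0.8})

# Properties refine the meaning of an edge, e.g. only "strong" acquaintances.
strong = [
    e for e in [knows]
    if e.label == "knows" and e.properties.get("weight", 0) > 0.5
]
print([(e.tail.id, e.head.id) for e in strong])
```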

The entire presentation is shown below. We'll discuss graph databases on Monday.