Wednesday, March 23, 2011 at 11:00AM | Derek Stainer
Marko Rodriguez, graph database guru and contributor to the open source project TinkerPop, has written a post discussing the current state of the union of the TinkerPop project and its various subprojects. TinkerPop is an open source software development group launched in fall 2009 to focus on technologies in the emerging graph database area.
As of Spring 2011, TinkerPop's most popular projects include:
Blueprints: A property graph model interface
Pipes: A dataflow framework using process graphs to provide atomic operations
We have another presentation from Marko Rodriguez on Graph databases. In this presentation Rodriguez discusses graph database technology, applied and theoretical work with graphs, and finally reviews the TinkerPop graph processing stack (this includes Gremlin which we discussed yesterday).
Thursday, February 10, 2011 at 12:01PM | Derek Stainer
One of the chief complaints about various NoSQL solutions is that they do not have a unified query interface. How you query in MongoDB differs drastically from, say, Cassandra. Fortunately, at least in the graph family of NoSQL databases a series of projects are emerging to change all of that.
We have a presentation about one of those projects. Gremlin. What is Gremlin you might ask?
Thursday, September 23, 2010 at 6:00AM | Derek Stainer
Gremlin a turing-complete graph programming language has just been released. Version 0.5 sports a complete redesign and implementation of the Gremlin compiler. Other improvements include:
Significant performance improvements
Numerous changes to the Gremlin function library
Additions to the Gremlin type system
Official announcement is here Download the release here Release Notes are here
In this presentation, Marko Rodriguez, discusses Gremlin, a graph programming language.
What is Gremlin?
Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation.
There are several interesting facts about Gremlin. First, Gremlin is JSR-223 compliant which allows the language itself to be embedded in Java applications, much like Jython, JRuby or Groovy code fragments can. In addition, Gremlin is considered a DSL (domain specific language) and NOT an API. Gremlin is designed to work with several different types of graph databases via Blueprints.
Two important things that Gremlin provides:
Gremlin provides a means to elegantly map single-relational graph analysis algorithms over to the multi-relational graph domain
Gremlin provides an elegant way to do automated reasoning in multi-relational graphs using path expressions
Gremlin works with a type of graph called a property graph. A property graph has the following properties:
Vertices and edges are labeled with unique identifiers
Edges are directed, labeled, and can form loops
Multiple edges of the same label can exist for the same vertex pair
Vertices and edges can have any number of key/value pair properties/attributes
So as an example of a typical Gremlin expression you have the following:
./outE[@label=‘knows’]/inV[matches(@name,‘va.{3}’) and @age > 21]/@name
Here is how this expression would be evaluated:
Get the current object(s)
outE[@label=‘knows’]: Get the outgoing edges of the current object(s), where their labels equal ‘knows’
inV[matches(@name,‘va.{3}’) and @age > 21]: Get the incoming vertices of those ‘knows’ edges, where the names of those vertices are 5 characters long, start with ‘va’, and whose age is greater than 21
@name: get the name of those particular incoming vertices
The language is simple, easy to understand and very concise. Marko dives into several other examples of expressions, as well as, language specifics like types, expressions and variables.
If you are planning on working with graph databases, take a look at Gremlin.
In his presentation Marko sets up an experiment to test a graph database, in this case Neo4j, against a relation data store, MySQL. The purpose of the experiment is to traverse the graph five levels deep. The graph in the experiment contains 1 million vertices and 4 million edges.
For the run of the experiment a traverser is placed on a single vertex
For each step, the traverser moves to it's adjacent vertices
Repeat each step five times
The results? Neo4j completed the experiment in 14 minutes vs. MySQL not completing the job. The full results of the experiment can be found here.
So why use a graph database? Marko provides us with three potential reasons:
If solution to your problem can be represented as a local process within a larger global structure
If solution to your problem can be represented as being with respect to a set of root elements
If solution to your problem does not require a global analysis of your data
So this is great and dandy from a theoretical perspective but what about real life use cases? Marko provides the following examples:
Local Searches - What is in the neighborhood around A?
Local Recommendations - Given A, what should A include in their neighborhood?
Local Ranks - Given A, how would you rank B relative to A?
Collaborative Filtering - Find all the items that the person A likes. Then find all the people that like those same items. Then find which items those people like that are not already the items that are liked by person A
Question expert identification - Find all the tags associated with question A. For all those tag, find all answers (for any questions) that are tagged by those tags. For those answers, find who created those answers
In conclusion, Marko offers these final points:
Graph databases are efficient with respects to local data analysis
Locality is defined by direct referent structures
Frame all solutions to problems as a traversal over local regions of the graph
In last Friday's post we explored a presentation by Marko Rodriguez about the Graph Traversal Programming Pattern. Specifically we explored the various graph structures that exist. In this post we are going to explore graph databases.
Most databases can model a graph. But how do you define a graph database?
A graph database is any storage system that provides index-free adjacency
Every element (i.e. vertex or edge) has a direct pointer to its adjacent element
No O(log2(n)) index lookup required to determine which vertex is adjacent to which other vertex
If the graph is connected, the graph as a whole is treated as an atomic data structure
Marko proceeds to demonstrate the difference between a graph database and a non-graph database with respect to index adjacency.
Graph Database
Direct references to its adjacent vertices
Constant time cost to navigate between vertices
Non-Graph Database
Must look at an index to locate adjacent vertices
log2(n) time cost to move between vertices
More about index adjacency:
While any database can implicitly represent a graph, only a graph database makes the graph structure explicit
In a graph database, each vertex serves as a “mini index” of its adjacent elements
As the graph grows in size, the cost of a local step remains the same
The final point with regard to graph databases and indices Marko has the following points to make:
Graph databases allows you to explicitly model indices endogenous to your domain model. Your indices and domain model are one atomic entity—a graph
This has benefits in designing special-purpose index structures for your data.
Think about all the numerous types of indices in the geo-spatial community
Think about all the indices that you have yet to think about
In the final post of this series we will explore graph traversals with both artificial and real life examples.
In his presentation at WindyCityDB, Marko Rodriguez, discusses graph traversal patterns. This is the first part of a multi-part series that will discuss this presentation. Specifically in this posting we are going to discuss the various types of graph structures. We will be discussing graph databases and graph traversals in a following posts.
So what types of graph structures are there? It's an interesting question and one I did not know the answer to until this presentation. Before we can begin discussing the various graph structures we need a small primer.
Graph Primer
Dots are vertices
Lines are edges
Dots and Lines make a Graph
Undirected Graph
Vertices
All denote the same type of object
Edges
All edges denote the same type of relationship
All edges denote a symmetric relationship
Examples
Collaborator graph
Road graph
Directed Graph
Vertices
All denote the same type of object
Edges
All edges denote the same type of relationship
All edges denote a asymmetric relationship
Primer
Directed edge is a line with an arrow
Examples
Twitter follow graph
Web href-citation graph
Single-Relational Graphs
Both undirected and directed graphs are considered single relational graphs.
All edges have the same meaning/type
Perhaps the most common type of graph type
Limitations of a single relational graph:
Only can express a single type of vertex
Only can express a single type of edge
In general are very limiting graph types
Multi-Relational Graphs
Obviously the opposite of a single-relational graph.
What are the gains with multi-relational graphs?
Allows for explicit typing of edges
Explicit typing allows for
edges to have different meanings
vertices to have different types
Property Graph
Specialized graph which extends a multi-relational graph by adding key/value map to both edges and vertices
Properties useful for expressing non-relational data
Allows further refinement of the meaning of an edge
Property graphs are the basis for other types of graphs
The entire presentation is shown below. We'll discuss graph databases on Monday.