Scala and Graph Databases with Gremlin-Scala

Posted on October 24, 2013 by Stefan Bleibinhaus

In this post, I give an introduction to Gremlin-Scala [1] , which aims to be a thin Scala [2] wrapper for Gremlin [3] . Gremlin is a graph DSL [4] for traversing a number of graph databases [5] . The post includes a short introduction to graph databases in general and to the Gremlin project. The focus is Scala example code [6] , which shows how Gremlin-Scala can be used to access the graph databases Neo4j [7] , OrientDb [8] and Tinkergraph [9] and finding the shortest path [10] in a graph.

1 Introduction

1.1 Graph Databases

A graph database [5] is a database, which allows to store data in the form of graphs and query it using graph traversal. One of the major advantages of graph databases [11] is the natural way in which data can be represented and that queries do not need joins, as the data is connected via edges. They also benefit from the rising popularity of NoSql [12] databases in general. Among the more well known graph databases are Neo4j [7] and OrientDb [8] , which will also be used in the example code of this blog post.

The mathematical background [13] of graph databases is actually very interesting and I recommend reading up some of the standard algorithms on graphs, e.g. Breadth-first search [14] , Depth-first search [15] , Shortest path [10] , Travelling salesman problem [16] , Prim’s algorithm [17] (which finds a Minimum spanning tree [18] ).

For a more developer oriented introduction I recommend this video [19] from the Neo4j guys (I especially found the example helpful, which starts after the first 18 minutes). Another resource, which also targets Neo4j users is O'Reilly's Graph Databases book, which can be downloaded for free here [20] .

1.2 Gremlin and Gremlin-Scala

Gremlin [3] is a domain specific language for traversing property graphs. It has implementations to access most graph databases, especially graph databases supporting the property-graph model [21] . Gremlin-Scala [1] , which is a thin Scala [2] wrapper for Gremlin, aims to make Gremlin available to Scala users. It is now maintained by Michael Pollmeier [22] and the more recent releases aim to make the interface feel more native to Scala developers. The current focus of his efforts is on a more fluent Scala Api and type safety. A good start to Gremlin and Gremlin-Scala are the Gremlin documentation [23] , the Gremlin wiki [24] , Gremlin steps [25] and the Gremlin-Scala [1] page on Github.

Gremlin is developed by Tinkerpop [26] , which offers a stack of frameworks and libraries intended to make the use of graph databases easier and unify the access to the various graph databases available. The bottom layer of those frameworks and libraries is Blueprints [27] . For example, developing in Gremlin-Scala means also making use of the Pipes framework [28] . For real world applications, Rexster [29] is a very interesting choice to abstract the low level database handling and leaving it to the framework. It is a graph server, which allows to use predefined REST [30] queries targeting the underlying graph database. The Gremlin-Scala page on Github has a short paragraph [31] on how to use the project together with Rexster.

2 Code example

In the following, I give an overview of the example code for this blog post. The complete example project [6] is available on Github. It uses Gremlin-scala 2.4.1 together with Scala 2.10.3. To run it, just start up Sbt [32] and run the tests (Warning: By default, it populates /tmp/neo4j).

2.1 Base test class

The base class is an abstract class containing all the tests, which will be described in the following sections. It has an abstract method to obtain the Scalagraph, which is the access to the graph database. This abstract method is then implemented for each of the used graph databases. The base class also makes sure in the after method, that the graph database will be cleaned up and is shut down properly. Code snippet 1 shows the outline of the base class.

Code snippet 1: The code outline of the abstract GraphTestBase class.
2.2 Implementations for Neo4j, OrientDb and Tinkergraph

The implementations of the base class for each of the databases are quite simple, as they only need to override the getGraph method and make sure to run the tests. Thanks to the interface of Gremlin, this is very easy to achieve. Code snippet 2 shows the code for the Neo4j database, which will store the data on the disk (make sure to change the path to one suiting your machine) and Code snippet 3 shows the code for the OrientDB database, which will store the data in memory for this test.

Code snippet 2: The implementation of the test for the Neo4j database.
Code snippet 3: The implementation of the test for the OrientDB database.
2.3 Storing data in a vertex

The first test is very simple and just shows how to store primitives in a vertex. The code can be seen in Code snippet 4.

Code snippet 4: The first test showing how to store primitives in a graph database using Gremlin-Scala.
2.4 Adding edges to the graph

The second test, which has been copied over from the Gremlin-Scala test suite shows how to add an edge between two vertices and query for it afterwards. Code snippet 5 shows the code.

Code snippet 5: The second test shows how to connect vertices with edges in Gremlin-Scala.
2.5 Finding the shortest path in a network

The last test is a bit more complex and gives a glimpse of how useful graph databases can be for certain scenarios. The example is inspired by an upcoming trip of mine, in which I am going to explore the Far North of New Zealand with friends. We will start from Auckland and our final goal is Cape Reinga, which is the most northern point of the northern island. In this example, I have added some cities between Auckland and Cape Reinga as nodes of the graph. The roads between those cities are modelled as weighted edges containing the information of the distance and the time how long the drive takes. The code for the setup can be seen in Code snippet 6.

Code snippet 6: The preparation for the shortest path test.

The goal is now, to find the shortest path [33] between Auckland and Cape Reinga. To do this, the loop function of Gremlin-Scala is being used. In the following code seen in Code snippet 7, a loop function is defined, which takes a startstep and repeats the steps, until it finds the Cape Reinga vertex or has looped for 7 times (first function inside the loop call). It keeps all paths, in which the last step ends on the vertex Cape Reinga (second function inside the loop call). This function is then used to go from the vertex Auckland to the connected vertices and loop over this step. The result are 29 different variants showing on how to get from Auckland to Cape Reinga, jumping from vertex to vertex, with a maximum of 7 steps.

Code snippet 7: Just collecting all the paths to Cape Reinga with a maximum of seven steps.

But that is not enough yet, as the goal was not just to find all the paths, but to find the shortest path. For this, the edges need to be added to the hops between the nodes, as seen in Code snippet 8. After adding the edges and with them the information on the travel time and the distance, the shortest path can be found. To facilitate this, I am parsing the path into a tuple, containing its total time to travel (using the edges in the path) and the visited vertices in the path. Now the shortest path is just the one with the smallest travel time. For visualisation purposes, I have added a println statement of the ordered paths with their travel time, which can be seen in Code snippet 9.

Code snippet 8: Querying for the shortest path to Cape Reinga.
Code snippet 9: The printed result of the test.

3 References

comments powered by Disqus
Admin login