Features Graph Databases Lead image: © Jakub Krechowicz, 123RF.com

Building more efficient database applications with a graph database

Graph Store

A graph database can identify the really important relationships on a social web, find the shortest paths, and optimize visitor flows. We compare some leading open source graph database options. By Tim Schürmann

Peter knows Paula, a freeway links Cologne with Dortmund, and a network cable leads from Marvin to Zaphod. These relationships can be recorded in a visually intuitive way (Figure 1), and the results are referred to by computer scientists as graphs. In combination with the matching algorithms, graphs are perfect for computing routes, analyzing relationships (who knows whom?), identifying bottlenecks on networks, optimizing pipeline systems, or avoiding congestion on a freeway.

Figure 1: A simple graph shows the relationships between persons.

In the real world, application developers often crunch their graph data into relational databases, but using a traditional relational database can eat a lot of space and take a toll on performance, as well as burning lots of programming time. Graph databases are an infinitely preferable solution: they store the mesh of relationships, road maps, and networks in a space-saving way, thus supporting fast queries; some graph databases even offer efficient analysis algorithms.

Although graph databases have existed for many years, they have remained in the background until fairly recently. New developments such as NoSQL, the semantic web, and geodetic information systems such as OpenStreetMap (Figure 2) have brought a new popularity to this alternative database form. The next time you or someone in your organization rolls out a custom database, you might want to consider whether a graph database would be better than a traditional relational database for your intended use. Search engines, for instance, sometimes use graph databases for language and text analysis, as well as to establish knowledge bases.

Figure 2: From a mathematical point of view, road maps are nothing but graphs.

In this article, I introduce some of the leading open source graph database alternatives, including Neo4j, GraphDB, InfoGrid, HyperGraphDB, and VertexDB. You'll find a quick comparison in Table 1.

Tabelle 1: Five Graph Databases

Name	Neo4j 1.6	Sones Graph DB 2.0	InfoGrid 2.9.5	HyperGraphDB 1.1	VertexDB
Homepage	http://neo4j.org	http://www.sones.de/static-en/	http://www.infogrid.org	http://www.hypergraphdb.org	http://www.dekorte.com/projects/opensource/vertexdb/
License	GPLv3 (Community edition), AGPLv3 (Advanced and Enterprise editions), commercial	AGPLv3 or commercial	AGPLv3 or commercial	LGPL	Revised BSD
Interfaces	Java, Ruby, Python, C#, JRuby, Jython, Clojure, PHP, JavaScript, Scala	C#, Java, JavaScript, PHP	Java Enterprise Edition	Java	HTTP
Query Language	Cypher, Gremlin	GQL (Graph Query Language)	API only	API only	Proprietary
HTTP/REST	Yes/Yes	Yes/Yes (based on GQL)	Yes/Yes	No/No	Yes/No
Replication	Master/Slave	No	Peer-to-peer	Peer-to-peer	No
Transactions	Yes (ACID)	Yes	Yes	Yes (ACID and MVCC)	Restricted
Persistence/In-Memory Operation	Own file format/No	Commercial edition only (own graph filesystem)/Yes	Yes (different possibilities)/Yes	Berkeley DB/No	Tokyo Cabinet/No
Version Control	No	No	Only creation and last modification dates	No	No
Distribution of Graphs	No (only cache proxys)	No	Yes	Yes	No

Nodes, Edges, and Labels

The basics are identical for all the graph database candidates. If you want to map the relation between Peter and Paula, you draw a box or a circle for each person and connect them with a line. Computer scientists refer to the circles as nodes; the lines are known as edges.

Edges come in two flavors: directional edges are one-way streets and are represented in the graphic by an arrow. For example, Peter is Klaus's father, but not vice versa. Non-directional edges apply in both directions. For example, you can drive down the freeway from Dortmund to Cologne, but also from Cologne to Dortmund.

Graph databases typically use two directional edges pointing in opposite directions instead of a non-directional edge (as you can see in Figure 1 between Peter and Paula). The labels on the edges represent the type of relation (e.g., "is married to").

This is the standard terminology in math and computer science, but nearly all graph databases use their own vocabulary, which users first need to learn. For example, labels are often referred to as edge types.

Properties

You can add more information to nodes and edges, such as the name and age of a person. A graph with properties added to it is simply known as a property graph. Properties are typically simple key-value pairs, such as Age: 16. HyperGraphDB always stores complete Java objects.

To improve visibility in very large graphs, the developers can merge nodes to form a new, big hypernode – an edge then connects more than two nodes. For example, you could merge the cities of Dortmund, Essen, Bochum, and Duisburg to an Industrial Area node on the map. In a similar way, you can merge edges to create hyperedges. The hypergraphs created by doing so are the domain of GraphDB and HyperGraphDB.

Often it is permissible to define a type or schema for nodes or edges. This means the database can automatically verify consistency, or simply check whether an edge is permissible between two nodes. Commercial databases also often have a kind of version management system that remembers every change to a node. This means users can log, visualize, or revert manipulations. None of the five free graph databases offers this feature.

Traversal

Does Peter know my buddy Dieter? Can I get from Berlin to Cologne? If you want to answer questions like this, you need to walk the edges of a graph in a process known as traversing. What you will normally be interested in is the shortest path between two nodes; this tells a vacuum cleaner salesman the most efficient route to the next residential area (traveling salesman problem). In our lab, only Neo4j offered a couple of efficient built-in algorithms or queries for this kind of problem; all of the other candidates expected users to design the solution themselves.

Polyglot with Cypher and Gremlin

SQL has established itself as the standard query language for relational databases, but almost every graph database uses its own Graph Query Language – which is naturally incompatible with all the others. The languages are mainly trimmed to match a specific field of application. For example, Cypher in Neo4j is very good at identifying patterns, and GQL from GraphDB focuses on traversing.

The Gremlin language, which is specially designed for graph databases, is the first attempt at standardization [1]. It specializes in traversing, but of all the free graph databases, it is only used by Neo4j. The only feature that has made its way into nearly every graph database is a REST (Representational State Transfer [2]) interface, which lets clients manipulate the database using special HTTP requests.

REST and JSON

Typically, data will cross the wire in the standard JSON format, but it is often permissible to bundle the data into an XML document. Of course, the databases also offer interfaces to one or multiple programming languages; application developers only need to add the supplied (client) library to use the built-in functions and objects that access the actual data in the programming language of their choice.

Apart from the free version of Sones GraphDB, all of the graph databases group multiple individual actions to create a transaction; some, like Neo4j for example, fulfill the ACID (Atomicity, Consistency, Isolation, and Durability [3]) principle.

To supply an answer as quickly as possible, graph databases index their data sets in a style similar to their relational counterparts. Administrators typically need to tell the database which data is worth indexing. In some databases, the index itself is a graph; but many free databases use legacy storage methods, with key-value stores proving very popular (like Berkeley DB).

Replication Based on Peer-to-Peer

To avoid the possibility of bringing the server to its knees with complex requests, experienced administrators will want to replicate the database to additional servers. It is pretty difficult to keep graphs consistent across multiple servers. Only three free databases address this issue: Neo4j relies on the single-master/multiple-slave model, wherein a single master server replicates its data to multiple slave servers. Users only have read access to the copies. Changes are only accepted by the master, which then propagates them to the slaves. InfoGrid and HyperGraphDB prefer a peer-to-peer model, in which each database updates all the others.

Partitioning the database is equally exciting. If a graph becomes too large for a server, the administrator could distribute it over multiple servers. To do so, they would first need to cut it at a certain position – unfortunately, no algorithm can yet automatically find the right point to dissect. At least some fairly usable heuristics pay attention to minimizing the number of cut edges (min-cut). However, the graphs can split up into unequal parts, so that one server has more work than the others. With the exception of InfoGrid and HyperGraphDB, none of the free graph databases offer distributed data storage.

Amazingly, most of the free graph databases are programmed in Java; only Sones uses C# for its GraphDB. VertexDB relies on Ruby. All of them thus run in a virtual machine, which has a significant effect on performance (see the "Benchmarking" box). Despite this, the developers ignore other, potentially faster languages, such as C or C++.

Benchmarking

If you are interested in comparing the database processing speeds in a benchmark, you will inevitably run up against a number of issues: first, the graph databases are designed for very different fields of application. Whereas= Sones GraphDB comes from the social network analysis camp, the roots of HyperGraphDB are in artificial intelligence. On top of this, all of the databases use different query languages. For example, Cypher by Neo4j specializes in pattern recognition, whereas GraphDB focuses on traversing graphs. The bottom line is that this comparison is like comparing apples and oranges.

The data basis is another problem. To achieve reliable results, you need to use graphs that are as large as possible. However, graph databases are very application specific and typically very much under lock and key. The prime example here is road maps, to which you only gain access by paying a fee and signing a restrictive agreement.

At least Sones developed a small benchmark for its own database [4]. However, it is optimized for GraphDB and primarily intended to compare various (server) hardware and/or platforms. The data basis Sones uses is randomly generated graphs, for which the user can freely define the size and run time (Figure 3). If you want comparative figures for other systems, you need to search painstakingly through older whitepapers [5].

Neo4j

Neo4j is one of the oldest graph databases. Created in 2003 as the underpinnings of a content management system, the developers outsourced Neo4j into a separate project in 2007. Today Neo Technologies coordinates ongoing development and raises finances through commercial support. Neo4j can run as a standalone database server, but it can also be embedded in your own Java programs. The Java option is mainly useful for improving processing speed. The Neo4j server offers a web administration tool for the database on http://localhost:7474 (Figure 3). Additionally, you can enhance the server by installing plugins.

The example in Listing 1 shows you how you can access the Neo4j database from your own Java program. The code should be self-explanatory for Java programmers, which makes it fairly easy to get started with Neo4j. You can now find connectors for various other languages, including Ruby, Python, Scala, Clojure, and C#. Neo4j is particularly popular with JRuby developers.Since version 1.4, Neo4j has had its own query language (Cypher), which is mainly designed for facilitating pattern detection, but you can also address Neo4j in Gremlin.

Listing 1: Java Access to Neo4j

01 // Create database in /var/storagepath and define new relation:
02 GraphDatabaseService database = new EmbeddedGraphDatabase("/var/storagepath");
03 private static enum RelTypes implements RelationshipType { reachable }
04 // Start transaction:
05 Transaction tx = database.beginTx();
06 // Define nodes:
07 Node dortmund = database.createNode();
08 Node cologne = database.createNode();
09 // Set properties:
10 dortmund.setProperty("Name", "Dortmund");
11 cologne.setProperty("Name", "Köln");
12 // Create edges:
13 Relationship a1_do_k = dortmund.createRelationshipTo(cologne, RelTypes.reachable);
14 Relationship a1_k_do = cologne.createRelationshipTo(dortmund, RelTypes.reachable);
15 // Save data and quit:
16 tx.success();
17 tx.finish();
18 database.shutdown();

Complex Relationship Patterns

For Neo4j, edges are relations that possess a type (e.g., "is married to") and can accept additional properties. Properties comprise key-value pairs, and the values can only be simple, well-known data types such as strings or integers.

Neo4j stores graphs in files using its own format; Apache Lucene [6] handles the indexes. Neo4j can only index node and edge properties, and the developer needs to manage the indexes, which is more complex, but at least it means you only need to index the parts of the graph you are interested in.

Incidentally, Neo4j always stores its data on the hard disk; you cannot use it as an in-memory database as of this writing.

Programmers can control the process of traversing the graph flexibly. For example, you can define the order in which Neo4j traverses the edges of a node, and which branches it should even follow. In addition to this, Neo4j comes with some useful graph algorithms, such as the shortest path computed with the Dijkstra or the A* algorithms.

Neo4j supports replication on the network. All of the servers independently elect a write master, the only machine to perform write operations; all of the others asynchronously pick up the data and respond to read operations. If the write master fails, the remaining group elects a new master.

The Community edition of Neo4j is released under GPLv3. The Advanced edition includes monitoring capabilities but is released under the AGPLv3. This license also applies to the largest Enterprise edition, which contains replications and high availability. Neoclipse is the "visual graph database tool" provided by the developers, which is based on the Eclipse IDE [7].

Insolvency Management: Sones GraphDB

GraphDB has its origins in modeling and storing social networks. Little wonder the manufacturer, Sones (founded in 2007 in Erfurt, Germany), uses the relationships between the cartoon characters in "The Simpsons" for its examples [8].

Sones GraphDB is written in C# and thus is based on the .Net framework (or on Mono under Linux [9]). According to Sones's own measurements, the database works faster on Mono than with the .Net framework on Windows. Other connectors exist for Java and PHP. Sones stores the data in a proprietary GraphFS filesystem. The database itself supports directional hyperedges. Sones refers to the resulting construct as a property hypergraph.

Nodes or Edges

For GraphDB, both properties and edges are attributes and are distinguished by type: strings or integers are properties, but the node type is an edge.

GraphDB comes with its own query language named GQL, which is obviously modeled on SQL. Just as you first need to create a table in a relational database, in GQL, you first need to define how a node should look and what information it stores. Consider the following example:

CREATE ABSTRACT VERTEX TYPE Person ATTRIBUTES (String Name) MANDATORY (Name)INDICES (Name)

This command tells GraphDB that the graph contains nodes that store a String under the designator Name, whose output is also MANDATORY, and which needs to be indexed (INDICES). The ABSTRACT keyword indicates that you can't directly create person nodes but first need to derive one or multiple additional node types from them:

CREATE VERTEX TYPES Friend EXTENDS Person ATTRIBUTES (Integer Age, SET<Friend> knows)

GraphDB can thus inherit properties between nodes. Now you can create matching friends and connect them:

INSERT INTO Friend VALUES (Name = 'Peter', Age = 52)
INSERT INTO Friend VALUES (Name = 'Paula', Age = 34)
LINK Friend (Name = 'Peter') TO Friend(Name = 'Paula') VIA knows

If you want to access the database from one of the supported programming languages – C#, PHP, Java, or JavaScript – you need to issue GQL commands. Inserting a node in C# looks like this:

var result = GraphDSServer.Query(SecToken, TransactionID, "INSERT INTO Friend VALUES (Name ='Peter', Age = 52)", SonesGQLConstants.GQL);

C# users can also build a request with special features, but this quickly leads to unintelligible spaghetti code, even for simple queries.

The database server provides a small web application named WebShell that lets users directly manipulate the data via GQL in the browser. To do so, you only need to surf to http://localhost:9975/WebShell after launching the application (Figure 4). WebShell returns output in JSON or XML format. The same thing applies if you want to access the database with REST.

Figure 4: In the GraphDB WebShell app, you can manage Sones' database as if you were in a Linux shell.

A free version of Sones GraphDB is licensed under the AGPLv3. In contrast to the commercial version, the free version works entirely in RAM (in-memory database). If somebody switches off the server, all the stored data is gone. Additionally, the open source version lacks transactions and properties on edges. They were introduced in version 2.0, which was released in 2011; this version also has a modular design and can thus be extended by adding new functions or query languages.

Shortly before this issue went to press, Sones went into insolvency. The future of the company and its database is currently unclear. Some in the user community are looking to save at least the open source variant and continue its development.

InfoGrid

The InfoGrid project isn't just working on a simple graph database but on a whole framework that lets developers design complete web applications. The framework contains modules for user management and rights management and even has a template system that prepares the database content and outputs it as, for example, an HTML page or XML document. The programming language is Java; the web applications and the graph database run on a Java EE application server (Enterprise Edition).

InfoGrid is designed to merge information from various sources. The prime example is multiple newsfeeds, which the framework processes and feeds to the web application in the form of a graph. The data remain in their external sources; InfoGrid simply creates a virtual graph for them. This action is hidden from the user, who does not see where the data resides.

An InfoGrid database can communicate with others on the network using the XPRISO protocol [10]. This supports access to the data of the other instances and peer-to-peer replication. The database can either keep the information stored directly in InfoGrid in RAM, write it to the disk as a file, or pass it on to other services, such as Hadoop [11], Amazon S3, or even a relational database.

InfoGrid refers to nodes as mesh objects, and relationships can exist between them. The resulting graph is a mesh base. If you assign a type to your nodes, you can ensure that the nodes only contain specific properties.

To define what is allowed, you can create a model in an XML file. InfoGrid prescribes a couple of properties, including the unique ID, and even an expiry date, after which it will delete the node if so desired.

Add-on Properties

Version 2 of the database lets users add properties to nodes, which the developers claim leads to a faster processing speed. If you need properties on edges, you need to insert additional intermediate nodes – associative nodes. Additionally, edges are always non-directional and always need to connect two nodes. InfoGrid cannot index properties.

Each change in the graph triggers an event for which you can register objects as listeners. InfoGrid then automatically notifies them when the event occurs. This means the user can automatically react to a new message in a newsfeed. Listing 2 show an example of creating an InfoGrid database in Java and feeding a couple of nodes to it. The approach is similar to that of Neo4j.

Listing 2: Java Access to InfoGrid

01 // Create In-Memory Database:
02 MeshBase database = MMeshBase.create();
03 MeshObjectLifecycleManager molm = database.getMeshObjectLifecycleManager();
04 // Define transaction:
05 Transaction tx = database.createTransactionAsap();
06 // Create node:
07 MeshObject dortmund = molm.createMeshObject();
08 MeshObject cologne = molm.createMeshObject();
09 //Set properties:
10 dortmund.setProperty("Name", "Dortmund");
11 cologne.setProperty("Name", "Köln");
12 //connect with edge:
13 dortmund.relate(cologne);
14 // Commit transaction:
15 if(tx!=null) tx.commitTransaction();

On their website, the developers behind InfoGrid provide a sample web application named Mesh World [12]. This sample app gives users friendly insights into graphs and options for manipulating them. The driving force behind InfoGrid is Net Mesh, which also offers commercial support [13].

HyperGraphDB

HyperGraphDB comes from the artificial intelligence camp. As the name suggests, it works with hypergraphs that even let the user connect edges with other edges. In contrast to its competitors, HyperGraphDB is a purely embedded database. If you want to use it, you will always need to write your own application around it.

The only language you can use to do so right now is Java; connectors or converters for other languages do not exist. But the nodes in the graphs you manage can accept arbitrary Java objects; you simply need to drop them into the database and then connect them with edges. Listing 3 shows how easy this can be.

Listing 3: Java Access to HyperGraphDB

01 // Create database:
02 HyperGraph graph = HGEnvironment.get("/var/storagepath");
03 // Trigger transaction:
04 graph.getTransactionManager().beginTransaction();
05 // Insert node, city is a Java class I defined:
06 HGHandle dortmund = graph.add(new city("Dortmund"));
07 HGHandle cologne = graph.add(new city("Köln"));
08 // connect to simply edge:
09 HGHandle kante = graph.add(new HGPlainLink(dortmund, cologne));
10 // commit transaction:
11 graph.getTransactionManager().commit();

The add() methods bundle a Java object into a node and then return a handle. This fingerprint will let you relocate the node in the database later on – or manipulate it with other hypergraph methods. To store the node, HyperGraph first serializes the Java objects with the corresponding Java API and then pushes them into a Berkeley DB database [14]. This approach is very convenient for programmers, but it doesn't offer much in the way of performance.

Edges in HyperGraphDB are also simple Java objects that only implement the corresponding HGLink interface. You can imagine them as something like the intermediate nodes in InfoGrid. Normally, Listing 3 would need to create its own Java class that implements the HGLink interface. To save developers some work, HyperGraphDB includes a couple of prebuilt classes. For example, HGPlainLink in line 9 is a simple, unlabeled edge.

Working with HyperGraphDB is thus a fast experience. If you browse the documentation, you will quickly stumble across the unusual terminology. For example the building blocks of a graph are referred to as atoms. They possess a specific type and can store arbitrary user data. When an atom is connected to one or multiple other atoms, it creates a link (edge); otherwise, it assumes the function of a node.

To access stored atoms, the programmer needs to create a query comprising specially crafted Java objects and methods. If so desired, HyperGraphDB can first analyze a query and then create a query schedule, which it then remembers. The next query then runs more quickly. As in Neo4j, you are given good control over traversing.

Multiple individual HyperGraph instances will cooperate as an agent-based peer-to-peer database. To communicate, all of the instances use either the JXTA protocol [15] or the XMPP Jabber protocol, which is more commonly associated with Instant Messaging.

VertexDB

The latest addition to the field goes by the name of VertexDB, and it is unique in many ways. First, the database accepts commands in HTTP only. It only uses POST and GET requests and is thus not a complete REST interface. The responses take the form of text or JSON documents. Users can create transactions by bundling multiple VertexDB commands in a POST request.

Nodes removed by the administrator do not end up in a black hole; instead, VertexDB has a garbage collector that runs periodically and cleans up the database. Queues contain commands, which the database runs on a schedule.

The work approach, which is oriented on the FUSE interface in Linux, is also unusual: nodes in VertexDB are directories. Vertex uses them to store key-value pairs. Keys and values must be strings. If the key starts with an underscore (_), the value in the node contains a data payload, such as the _Age of a person. If the underscore is missing, VertexDB interprets the value as a reference to another node.

Filesystem operations are modeled on database queries. Users can create a node by, say, creating a new directory with mkdir. The corresponding HTTP request looks like this:

GET /res/paula?action=mkdir

Currently it is only possible to traverse a graph in a convoluted, manual way. The data is stored in a built-in Tokyo Cabinet database [16] in the background.

Various implementations of VertexDB exist. The developers wrote the original version in Ruby, but ports to Lua and JavaScript are now available. These alternative ports require all requests in a special JSON format, in contrast to their older sibling; the administrator then transfers the request using an HTTP POST.

Major Differences in the Test Field

The feature scope and approach used by the tested graph databases differ wildly. The candidates are mainly tailored for specific tasks and applications, and this makes it impossible to recommend a single candidate for universal use.

Neo4j is the most tried-and-trusted of all candidates, and it offers the most interfaces to programming languages. Neo4j is particularly recommended if you need a flexible way to control traversing. The Neo4j database is available in various frameworks, such as Bio4j, which manages proteins [17].

Sones GraphDB feels happiest in the field of information management – that is, when you need to define relationships between objects or when programmers need to link information. The GQL language is proprietary but easily acquired by newcomers or converts from the SQL camp. Unfortunately, the feature scope of the open source version is very restricted and nobody knows right now what the future of Sones GraphDB will be because of the insolvency proceedings.

Java EE or Maybe AI?

If you want to develop your own web applications on the basis of the Java Enterprise Edition, you should look into InfoGrid. The InfoGrid database comes into its own when you need to aggregate various data sources or handle relations. However, InfoGrid does not index properties and only supports non-directional edges.

HyperGraphDB strengths show up in the areas of artificial intelligence, language analysis, and semantic web – that is, precisely where its roots lie. Because the database developers only need to pass in Java objects, programming is very fast. The downside is the cost in terms of performance. Database queries are flexible and powerful but can become cluttered in more complex cases.

VertexDB is relatively young, but lean, lightweight, and easy to understand. You cannot add properties to edges, and complex traversing is not yet supported.

Alternative Hybrids

Many other databases also support graphs, but the some of these solutions are more like hybrids. For example, Orient DB [18] is an object-oriented database, but it will store relations as graphs. Some graph libraries park their graphs in normal (SQL) databases. One example is Apache Hama [19].