Tools NoSQL OrientDB Lead image: Lead Image © Victoria, Fotolia.com
Lead Image © Victoria, Fotolia.com
 

OrientDB document and graph database

Linked Worlds

OrientDB is a NoSQL document and graph database with a flexible data model and an elegant approach to querying. By Carsten Zerbst

Relational databases have been popular for many years, but they force users to squeeze their data models into table designs, which has increasingly proved to be too rigid. For example, tables do not have an elegant solution for saving the relationships of objects one below another. The document and graph data structures found in NoSQL (Not only SQL) databases – of which OrientDB is a representative – have solved some of the storage and retrieval problems of relational databases.

In this article, I demonstrate a few of the OrientDB features that cannot be implemented with classical relational databases. Figure 1 shows a comparison: Relational databases keep all data in tables, with a column for each attribute. The tables impose a rigid schema at run time, and additional attributes either require the existing tables to be adapted or an extra table to be defined. Both solutions involve intervention by the database administrator, and possibly even migration of the entire database.

Three kinds of databases.
Figure 1: Three kinds of databases.

With document databases, the data for each object is available in a document (XML, JSON, etc.). Each document has a unique ID that the database uses to access the record. Object links refer to other documents by ID. Additional attributes can easily be added to the document. This structure makes the document database more flexible than the relational database. One disadvantage, however, is that the database always has to download documents during the search or track object relationships.

Graph databases are based on two basic objects: The node and the connecting edges. Both can store any number of attributes; a special declaration is not usually necessary. This design results in a flexible data model that enables quick tree traversal via the objects.

Typical applications include all kinds of social networks (who with whom, when, and where). The storing of groups, documents, or project structures also benefits from flexible mapping of complex dependencies and cross references.

OrientDB is a document database overlain by a graph database. The document database provides the advantages of one-direction link relationships, key/value pairs, and object-oriented models. The graph database adds vertex and bi-directional edge relationships and speed benefits.

Installation

The software is available as a precompiled tarball from the website [1]. After downloading, unpack the archive at any point. The server can be started without further configuration using the script bin/server.sh to get started. When first launched, the database just prompts for a password for the database root user.

Although this completes the installation, the database administrator (DBA) will want to configure the database via SSL for a production installation, because the data and passwords are otherwise distributed over the network in plain text. A web application and the somewhat terse, but powerful, console are available for direct use. You can start the console and then log in to the database with:

bin/console.sh
connect remote:localhost root <Password>

Some NoSQL databases dissociate themselves from the query language of relational databases, but OrientDB uses SQL wherever possible. Users thus do not need to learn a new language; new commands are only needed for unique features. In the console, the help command provides an overview of available commands; more extensive documentation is available in the OrientDB wiki [2].

Getting Started

The first example is based on characters and books from the Discworld series (Figure 2) written by the brilliant Terry Pratchett, who sadly passed away in early 2015. It supports two types of node, Person and Book, which are connected via the two edges (connections) Relation and Appear. Nodes and edges have different attributes that can be used directly or as a list or map.

This example database details the relationships between the characters of the Discworld series.
Figure 2: This example database details the relationships between the characters of the Discworld series.

Listing 1 contains an excerpt of the console commands needed by the user to create the database and relationships shown in Figure 2. The user first connects the console with the server in the first two lines and then creates a new database called discworld. After creating this, the console automatically connects to the new database.

Listing 1: Create Database (Excerpt)

01 connect remote:localhost root <password>
02 create database remote:localhost/discworld root <password> plocal
03
04 create class Person extends V;
05 create property Person.birthday date;
06
07 create class Book extends V;
08 create property Book.translation embeddedmap;
09
10 create class Relation extends E;
11 create property Relation.from date;
12
13 create class Appear extends E;
14 create property Appear.chapter embeddedlist integer
15
16 insert into Person (last, first, birthday) values ('Vimes', 'Samuel', '1962-04-03');
17 insert into Person (last, first, birthday) values ('Ramkin', 'Sybil', '1969-09-06');
18
19 insert into Book (name, translation) values ('Guards! Guards!', {'de':'Wachen! Wachen!', 'fr' : 'Au Guet!'} );
20
21 create edge Relation from #11:0 to #11:1 set type='Married', from='1993-01-01';
22
23 create edge Appear from #11:0 to #13:0 set chapter={1,2,3,4,5,6};

Graph databases have base types V (node) and E (edge). Lines 4-8 create the classes Person and Book as an extension of the base node and defines their attributes.

The usual primitives such as integer, string, and date are available in OrientDB; lists, sets, and maps can also be as you can see from line 8 on. The translation attribute in the book class contains the map <Language> : <Title>, and the chapter attribute records the chapter in which a character appears. Relational databases would need to define an additional table for this purpose.

The edge classes Relation and Appear are similarly defined as node classes, the difference being that the base class is now E instead of V. Unlike relational databases, you do not need to define edge relationships as 1:1, 1:n, or m:n. Each edge represents a 1:1 relationship between two nodes. However, any number of edges can originate from a node or point to a node.

Node generation relies on the classic insert statement, for which the target class and the attribute values need to be specified. Not specifying the attributes last and first when defining Person does not yield an error message; instead, the database dynamically creates these attributes when inserted.

To populate lists, sets, or maps, OrientDB accepts a JSON-like notation (line 18). OrientDB assigns a unique object ID for each node (e.g., #11:3) which can then be used later to find or create the edges. Edges must be created using the new create edge statement. In addition to the IDs of the start and target objects, edges – like the nodes – can have arbitrary attributes.

Question Time

Once the data model has been defined and the first data added, it is time to start searching in the database. The normal select statement is ideal for querying edges and nodes directly; the user can use the typical comparison operators (e.g., =, >, like), which also works with embedded lists, sets, and maps.

Line 3 in Listing 2 finds the book. The additional value of OrientDB over a relational database is the elegant evaluation of object relationships across one or more edges. The out and in attributes for each object at the nodes and edges are defined automatically to store the object relationships, as you can see from the output in Figure 3.

Listing 2: Example Selects

01 select from Person where last='Vimes';
02 select from Person where birthday > '1965-01-01';
03 select from Book where translation ["de"] ="Ab die Post";
04 select expand( both() ) from Person where first='Sam';
05 select expand( in().out('Appear') ) from #11:2;
06 traverse any() from #11:0;
07 traverse out('Relation') from #11:1 while $depth < 10;
08 traverse any() from 11:0 while
09    ( @class='Relation' and from < '2000-01-01')
10    or from is null;
11 select name from (traverse any() from #11:0) where @class='Residence';
OrientDB automatically defines directional properties.
Figure 3: OrientDB automatically defines directional properties.

If you use the out() function in the select statement, you receive all edges emanating from the node; accordingly, you get the edges pointing to a node with the in() function and edges in both directions with the both() function. In combination with the expand() function, you will receive the nodes connected to the found edges.

Relationship Questions

Typical relationship questions can be easily clarified using questions such as "Who knows this person?" or "What is this activity dependent on?" The statement in line 6 (Listing 2) thus finds all characters and books linked via an edge by #11:0. Moreover, the desired edge types and directions can be specified explicitly and even linked to a path description.

The statement in line 5 tracks all links that point to #11:2 against their direction (in) and then follows the Appear links for the objects it found. In this way, you can find all books featuring characters who know Sam Jr. One real-world application for such a statement is to display items purchased in the past by a customer currently reviewing a product at a mail-order website.

When using in(), out(), and both(), the paths to be considered are part of the request; the inquirer only ever receives a portion of the start object as a result. Recursive execution is similarly important to receive all linked nodes and edges; this is handled by the traverse statement in OrientDB; it requires information about the edges to be tracked and about the conditions on the traversed nodes and edges.

Thanks to the any() keyword, the query in line 8 tracks all edges; it therefore returns the entire graph, including the root object starting at #11:0. A condition that checks node/edge or edge/node at each transition can be stipulated using while to provide more control. In the simplest case, this restricts the search depth. Additionally, you can access the attributes of the currently viewed nodes and edges and use them for classification or for limiting validity.

In the example, the relationships have a type and a start time (from); however, it is also possible to map resource allocations for projects or the allocation of components to specific production batches. Line 7 only identifies the members of a family. To this end, the while condition checks when traversing to see whether the relationship edges have a family relationship or whether they are Person type nodes.

As with a relational database, it is possible to nest statements. Lines 8-10 show this with a combination of select and traverse statements. In doing so, traverse determines the desired objects, and select acts as a filter to output only the names of the books found. Such combinations allow efficient queries, but this elegant query language tempts users to grab too many objects in the beginning and then to limit the objects to the actual search scope later with another select statement.

Querying in this manner can have disastrous consequences for large graphs; for the traverse statement in particular, it makes sense to reduce the number of objects traversed by limiting the edges or tweaking the while condition. In this way you reduce run time by a good third, even for this small example. The benefits would be even greater with larger graphs, of course; otherwise, the usual methods of optimization with OrientDB help (e.g., indexing frequently requested attributes and using the profile command to investigate queries).

As the few examples here demonstrate, OrientDB allows both a compact description of the data model and – with the advanced select and the new traverse statements – elegant querying of object relationships. Relational databases require more effort: A separate table would be needed for the list or map attributes, and traversing is also much more complex: Recursive descents are not available across databases, are unreadable, and are usually slow.

Interfaces

A choice of several interfaces is available for your own OrientDB applications. The first candidate is Java, with which the database itself is implemented. Using the client API, you can address the server from any program. Doing so also reveals the fast performance of OrientDB: Creating a tree structure with 8,000 nodes and edges each in a transaction takes less than five seconds on a small web server (Core I3, 4GB RAM) and recursive execution is done in one-third of a second.

Accessing OrientDB is not a problem beyond the Java world, either; native drivers are available for scripting languages such as Python, PHP, and Perl; the C and C# drivers open up access to these language families. The OrientDB Server also provides an HTTP interface. With GET and POST requests, the server can be used much like the console. Listing 3 contains a sequence of URLs the user can use to connect to the server, request information about the server and the Person class, and then search for objects. The results are returned in JSON format; updates go to the server in the same format.

Listing 3: HTTP Interface

http://localhost:2480/connect/discworld
http://localhost:2480/database/discworld
http://localhost:2480/class/discworld/Person
http://localhost:2480/query/discworld/sql/select from Person where last='Vimes'
http://localhost:2480/disconnect

The HTTP interface enables the use of any programming languages, especially directly from the JavaScript code of a web page. Furthermore, thanks to the JavaScript interpreter in Java, the scripting language can also be used to program database functions, which then run directly in the database server.

Conclusions

In this article I only demonstrate the rudimentary capabilities of OrientDB, but even these examples demonstrate the flexible data model and the elegant approach to querying. The initial hurdles are low because the query language is usually based on the well-known SQL. Other features of OrientDB, such as transactional security, access rights management, distributed databases, and direct data transfer from relational databases are only mentioned here. Databases with billions of records and gigabytes of disk space are in productive use.

Thanks to its liberal Apache license, OrientDB can also be used for commercial applications. The company behind the development, Orient Technologies, offers commercial support when needed. In addition to further development of the database, the money is also used for the freely available documentation. In just two years, OrientDB has moved from being almost totally useless to being quite useful.