Data virtualization using JBoss and Teiid
Uniformity Drive
Agile methods are increasingly used in today's IT environments, particularly for provisioning large amounts of data. Legacy access to data sources is a thing of the past. In this article, I outline the advantages of the underlying methods of JBoss Data Virtualization software, explain how to install it, and conclude with an initial data integration project.
Business Intelligence (BI) applications are responsible for analyzing, evaluating, and presenting data sets. The requirements for these applications have become more varied and complex over time. For example, users need to be able to view and analyze data in real time, rather than relying on historic data, which is already outdated in many scenarios and no longer of any value. Additionally, users of BI applications have varying requirements and therefore sometimes need to create completely different reports on the basis of the same data. The data itself is, of course, distributed across several sources, and each of these sources is a kind of isolated repository. Applications therefore need to be able to query several of these sources and do so using various interfaces. Employee data might exist in one SQL database and the employees' expense reports in Excel files. Processing an employee expense report and storing the results in a database therefore means bringing together different data sources. Depending on the desired report, this could involve a large amount of manual work.
The Limits of Proven BI Methods
In the past, such problems were solved by the extract, transform, and load (ETL) process. The relevant data were extracted from different sources, adapted, and transformed before being imported into a target database. The process, however, is quite complex, because all the data must be replicated and the transformation process is prone to error. Changes to the data models in the source systems can necessitate major adjustments to the transformation mechanisms. Real-time access with this kind of data transformation is not possible either.
The Big Data trend undoubtedly poses a huge challenge for today's applications. The task in these cases is processing large volumes of data that are structured, unstructured, multi-structured, or semi-structured. Of course, you must accomplish all this virtually in real time. Such sets of data are preferably provided via NoSQL databases such as MongoDB. The Apache Hadoop framework plays an important role in processing the data.
Additionally, large amounts of data can come from various sources that are not always local. In the area of social media, for example, almost all data resides on cloud-based systems. Access to this type of data involves very different requirements than access to locally available data, such as how the data can be transferred in a secure manner. Encryption plays a major role here. The duration of the transfer also often poses a problem with large data sets because of potentially high latencies.
Classic BI systems have a hard time implementing all the requirements just mentioned. The architecture of such systems usually comprises a variety of databases. The data travels from one data repository to the next as ETL jobs and, in doing so, passes through a wide variety of staging areas. In the past, batch-based processes for transforming data worked well. Increasing requirements are slowly putting an end to this method, however, and new procedures and processes are needed to process large amounts of data safely and easily with low latency.
More Agile Applications Through Data Virtualization
The problems just described can be solved using new methods and processes. The key here is "real-time data integration" based on data virtualization. With this technology, a kind of virtual data hub is slotted between the various data sources and the applications that need to access the data. This hub provides transparent access to the data – no matter where it comes from or what interface provides it. The applications themselves also access the data via a uniform interface, such as a Java database connectivity (JDBC) technology or a web service. The data can come from a database, a Hadoop cluster, an XML file, or virtually any other source (Figure 1). It no longer matters where the data comes from; data access is abstracted by the virtual data hub.
The principle is similar to the use of metadirectories for authenticating users. Metadirectories also provide a unified view of different authentication sources, such as an LDAP server, an Active Directory, or the like, thus ensuring that the user only sees a single interface for logging in to a system or an application. The virtual data hub assumes the role of a metadirectory for data virtualization.
Data virtualization provides the great advantage that data is integrated when it is needed and without having to first copy it to a target database. This is a completely different approach to the previously described ETL process. A virtual database (VDB) that links the physical data sources with a specific "view" of the data they contain makes this possible. If an application needs a specific set of data, a particular view is used on the original data source. The data remains in situ and does not need to be copied first. This is a huge advantage over the ETL process, especially with big data.
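The difference between a copied result and a view can be illustrated with a minimal Java sketch. It is not Teiid code: the class `ViewVersusCopy`, its method names, and the uppercase "transformation" are assumptions made up for this illustration. The copy is computed once, whereas the view only stores the integration logic and reads the source again on every query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Illustrative only: contrasts an ETL-style copy with an on-demand view.
class ViewVersusCopy {

    // ETL style: extract and transform once, then keep the result separately.
    static List<String> etlCopy(List<String> source) {
        return source.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // Virtualization style: keep only the integration logic; the source is
    // read again each time the returned view is queried with get().
    static Supplier<List<String>> view(List<String> source) {
        return () -> source.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>();
        source.add("Ada");

        List<String> copy = etlCopy(source);
        Supplier<List<String>> liveView = view(source);

        source.add("Linus"); // the source changes after the copy was made

        System.out.println("copy: " + copy);           // still one entry
        System.out.println("view: " + liveView.get()); // reflects both entries
    }
}
```

The copy would have to be rebuilt by another ETL run to pick up the new entry; the view picks it up with the next query.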
Furthermore, for the approach presented here, developers only need to take care of providing the integration logic with the data; they have no need to provision the transformed data in an additional database. This method not only saves time, it also means that fewer infrastructure services need to be provided. The data itself can, of course, be modified because access is to the actual sources and not to the results of a transformation. Data from similar sources can be collated in different ways when developing the integration logic.
For example, views can be produced to combine multiple, complexly interrelated database tables in a single table and then make them available to the application. In this context, it is also important to mention that the integration logic can help unify different data formats. For example, data sets can contain phone numbers in different formats. Using an appropriate model, these can be converted to a uniform format and are then available as part of the virtual database.
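The phone number case can be sketched in a few lines of Java. In a VDB, logic of this kind would live in the model rather than in application code; the class `PhoneNormalizer`, the target format (a plus sign followed by digits), and the handling of a leading trunk zero are simplifying assumptions for this illustration and do not hold for every country.

```java
// Illustrative only: converts differently formatted phone numbers
// to one uniform format ("+" followed by digits).
class PhoneNormalizer {

    // defaultCountryCode is used for numbers written without a country code.
    // Assumption: national numbers carry at most one leading trunk zero.
    static String normalize(String raw, String defaultCountryCode) {
        String trimmed = raw.trim();
        boolean international = trimmed.startsWith("+") || trimmed.startsWith("00");

        // Keep digits only; separators such as spaces, slashes, and brackets go.
        String digits = trimmed.replaceAll("\\D", "");

        if (trimmed.startsWith("00")) {
            digits = digits.substring(2); // "00" is the written-out form of "+"
        }
        if (international) {
            return "+" + digits;
        }
        if (digits.startsWith("0")) {
            digits = digits.substring(1); // drop the national trunk zero
        }
        return "+" + defaultCountryCode + digits;
    }

    public static void main(String[] args) {
        System.out.println(normalize("+49 (89) 123-456", "49"));
        System.out.println(normalize("0049 89 123456", "49"));
        System.out.println(normalize("089/123456", "49"));
    }
}
```

All three inputs in `main` describe the same number and are mapped to the same string.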
Finally, it should be noted that a virtualization hub can control access to the source data in a very granular way. The hub can be viewed in this context as a kind of data firewall and can thus help implement compliance requirements.
Data Integration Using JBoss Data Virtualization
The JBoss Data Virtualization software, formerly known under the name JBoss Enterprise Data Services (EDS), is currently the only open source software on the market that offers such a form of data integration. It works as a virtualization hub upstream of the different data sources and provides applications with a uniform view of the data. To the applications, it looks as if the data comes from a single source. Figure 2 shows the different software components.
The JBoss Data Virtualization Server runs as a process within the JBoss Enterprise Application Platform (EAP) and has several tasks. For one, it is responsible for managing the virtual databases (VDBs). VDBs provide a uniform view of data from different sources. A VDB consists of a source model containing a view of the data source. The model describes the structure and properties of the actual source data and what the data made available to the applications looks like.
The server contains an access layer that determines how the VDBs can be accessed. The software provides JDBC, ODBC, or web services (SOAP/REST) interfaces. A query engine ensures optimal access to the individual data sources based on the existing source and view models. This happens through "Translators" and "Resource Adapters."
The Teiid Designer visual tool [1] allows users to create VDBs. The tool is available as a plugin for the JBoss Developer Studio graphical development environment. Furthermore, a Java API in the form of the Connector Development Kit can be used to adapt the Translators and Resource Adapters to the existing data sources.
As well as these two core components, various administrative tools help manage the environment. For example, AdminShell provides command-line-based access to the JBoss Data Virtualization Framework. Users of the JBoss Enterprise Platform will already be familiar with the management console for managing the application server.
Installation and Setup
After so much pure theory, it is time to put it into practice. The example here shows the steps required to install the software components successfully and how to implement a first small project to create a VDB. All the examples presented are based on a Red Hat Enterprise Linux installation, but they should also work seamlessly on other Linux systems such as CentOS or Fedora.
First, you need to grab the JBoss Data Virtualization software [2]. Then, launch the installer contained in the JAR archive using
java -jar jboss-dv-installer-6.0.0.GA-redhat-4.jar
and run through the graphical installation process. You will find a detailed description of all the configuration options in the very comprehensive installation manual [3]. The installer automatically generates a JBoss EAP instance as part of the setup procedure; you do not need to install this separately.
Once the installation is complete, you will find the data virtualization software within the EAP installation folder ($EAP_HOME). Note that the software is also available as an OpenShift cartridge; therefore, you can install it very quickly in an existing PaaS environment if you have one. You do, however, need to install JBoss Developer Studio [4] for the required design tools. Here, too, you can run the installer using
java -jar jbdevstudio-product-eap-universal-7.1.1.GA-v20140314-2145-B688.jar
and run through the graphical setup process for the development environment. Once the installation is complete, restart Developer Studio and then integrate Teiid Designer. To do this, select Help | Install New Software from the menu and enter the link http://download.jboss.org/jbosstools/updates/stable/kepler/integration-stack/ as an installation repository.
You can then select the JBoss Data Virtualization Development option from the list to install the design tools.
Once the installation is complete, you still need to link the development environment to the JBoss Data Virtualization EAP instance. To do so, go to the Server tab, click on the link labeled "No servers are available. Click this link to create a new server", and select the previously installed JBoss EAP instance.
This step completes the installation of your run-time environment, and you can now start on your project. You will first need to switch to the Teiid Designer perspective in JBoss Developer Studio. To do this, select the Window | Open Perspective | Other option from the menu and click on Teiid Designer.
First Project
As a practical example, begin by making an XML file available as a relational data source. All applications should be able to access this data source via a virtual database with a matching model; access to the actual data source is transparent for the application, because only the virtual database sees it. First, you need to create a model of the data source. You do this by creating a new Teiid project (New | Project) and then importing a new data source.
Figure 3 shows that Teiid Designer enables access to a variety of different data sources at this point. An XML file will be used in this example. A Git repository from Blaine Mincey [5] is helpful here; it supplies a completed file with fictitious customer data. The repository also includes a JDBC client written in Java that can be used to query the virtual database (VDB) you've created. Once you have imported the XML file, you need to create a "relational view" of the data source. In this step, you simply present certain elements of the file's XML structure as a virtual relational table.
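What "presenting XML elements as a relational table" means can be shown with a small Java sketch that uses only the JDK's DOM parser. It is not how Teiid Designer builds the view internally; the class `XmlAsTable` and the element names `customer`, `id`, and `name` are hypothetical and not taken from the sample file in [5].

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative only: maps selected XML elements to rows of a table.
class XmlAsTable {

    // Turns every <customer> element into one row with the columns (id, name).
    // The element names are assumptions made for this sketch.
    static List<String[]> toRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList customers = doc.getElementsByTagName("customer");
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < customers.getLength(); i++) {
                Element customer = (Element) customers.item(i);
                rows.add(new String[] { text(customer, "id"), text(customer, "name") });
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    // Returns the text of the first descendant element with the given tag,
    // or null if there is none.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<customers>"
                + "<customer><id>1</id><name>Ada</name></customer>"
                + "<customer><id>2</id><name>Linus</name></customer>"
                + "</customers>";
        for (String[] row : toRows(xml)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

Elements that are not selected simply do not appear as columns, which mirrors the idea of exposing only certain parts of the XML structure.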
To find out whether the relational view of the XML data source produced in this manner works, you can click on the previously created view and access the Preview function (Modeling | Preview Data). If everything worked, you should now find the previously selected XML elements as SQL tables listed in a new tab SQL Results (Figure 4).
Now you can create a virtual database from this model. To do so, click on New | Teiid VDB in your project and add the previously generated model consisting of an XML data source and a view for the project. This creates the VDB; in the next step, you can adjust the database's properties. Using Data Roles, for example, you can specify whether you want to make a table read-only or also provide write access (Figure 5).
To access the VDB created in this manner, you need to deploy it on the JBoss EAP instance. You can do this very easily from within Developer Studio by clicking on the VDB within your project and then selecting Modeling | Deploy. You should now see in the Developer Studio console window how the VDB is deployed to the Teiid instance of the EAP server. If this works, the VDB is enabled and you can access it.
You can access the VDB using a web service or a simple JDBC query. The JDBC client from the aforementioned Git repository [5] is used for a first test in this example. Once you have imported this into Developer Studio, you can edit the source code (in the Java perspective) to adapt the JDBC_URL to your environment: Enter the name of the project, the EAP server, and the VDB version in the URL. The URL in this example looks like this:
JDBC_URL = "jdbc:teiid:MyFirstProject@mm://localhost:31000;version=1";
Finally, you need to add a JDBC driver to the small Java application. To do this, select the client and click on Build Path | Configure Build Path | Add External JARs and navigate to the folder containing the Teiid instance on your EAP server. The folder contains a JDBC driver, which you can choose at this point.
The small test application is now prepared. If you select it in Developer Studio and click Run As | Java Application, you will finally see the customer data from the XML data source in the console window.
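For orientation, a JDBC client of this kind can be sketched with the standard java.sql API as follows. This is not the client from the repository [5]: the class `VdbClientSketch`, the credentials, and the view name `CustomerView` are placeholders, and the sketch assumes that the Teiid JDBC driver is on the classpath so that DriverManager can locate it. The URL follows the form shown above, with the first segment being the name under which the VDB was deployed (in this example, it matches the project name).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Illustrative only: queries a deployed VDB over JDBC.
class VdbClientSketch {

    // Builds a URL in the form used in this article:
    // jdbc:teiid:<name>@mm://<host>:<port>;version=<version>
    static String jdbcUrl(String vdbName, String host, int port, int version) {
        return "jdbc:teiid:" + vdbName + "@mm://" + host + ":" + port
                + ";version=" + version;
    }

    // Runs a query and prints the first column of every row.
    static void printFirstColumn(String url, String user, String password, String sql)
            throws SQLException {
        try (Connection con = DriverManager.getConnection(url, user, password);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }

    public static void main(String[] args) {
        String url = jdbcUrl("MyFirstProject", "localhost", 31000, 1);
        try {
            // Placeholder credentials and view name; adapt both to your setup.
            printFirstColumn(url, "user", "password", "SELECT * FROM CustomerView");
        } catch (SQLException e) {
            // Reached, for example, if the driver is missing or the server is down.
            System.err.println("Could not query the VDB: " + e.getMessage());
        }
    }
}
```

Because the application only sees a JDBC connection, the same client works unchanged if further models for other data sources are added to the VDB later.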
With this test project, you can access the data in an XML file using a JDBC query to the VDB. If you edit the file, the changes are visible immediately when you make a new request to the VDB. Of course, you can now set up additional models for other data sources at this point and add these to the VDB you just created.
Conclusions
Various data sources can be accessed in a short space of time using a combination of JBoss Developer Studio, the Teiid Designer plugin, and JBoss Data Virtualization. A virtual database is created based on data models, and this database serves as a data source for the actual applications. The underlying sources can be accessed transparently for the applications. Real-time processing of data is worthy of particular mention. The ETL processes can be dropped because the data remains in its original location, and predefined views are used for access.