Tool your HPC systems for data analytics
Get a Grip
I was very hesitant to use the phrase "Big Data" in the title, because it's somewhat ill defined (plus some HPC people would have given me no end of grief), so I chose to use the more generic "data analytics." I prefer this term because the basic definition refers to the process, not the size of the data or the "three Vs" [1] of Big Data: velocity, volume, and variety.
The definition I tend to favor is from TechTarget [2]: "Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information." It doesn't mention the amount of data, although the implication is that there has to be enough to be meaningful. It doesn't say anything about the velocity, or variety, or volume in the definition. It simply communicates the high-level process.
Another way to think of data analytics is as the combination of two concepts: "data analysis" and "analytics." Data analysis [3] is very similar to data analytics, in that it is the process of massaging data with the goal of discovering useful information that can suggest conclusions and support decision making. Analytics [4], on the other hand, is the discovery and communication of meaningful patterns in data. Even though one could argue that analytics is really a subset of data analysis, I prefer to combine the two terms, so that the combined term covers everything from collecting the data in raw form to examining it with algorithms or mathematics (typically implying computation) to look for possible information. I'm sure some people will disagree with me, and that's perfectly fine. We're blind men trying to define something we can't easily see, something that isn't easy to define even when you can see it. (Think of defining "art," and you get the idea.)
Data analytics is like a storm coming across the Oklahoma plains: you can see it miles away, and you had better get ready for it, or it will land on you with a pretty hard thump. The potential of data analytics (DA) has been fairly well documented. An easy example is PayPal, which uses DA and HPC for real-time fraud detection, constantly adapting its algorithms and throwing a great deal of computational horsepower at the problem. I don't want to dive into the mathematics, statistics, or machine learning of DA tools; instead, I want to take a different approach and discuss some aspects of data analytics that affect one of the audiences of this magazine – HPC people.
Specifically, I want to discuss some of the characteristics or tendencies of DA applications, methods, and tools, because these workloads are finding their way onto HPC systems. I know one director of a large HPC center who gets at least three or four requests a week from users who want to perform data analytics on the HPC systems. Digging a little deeper, the HPC staff finds that the users are mostly not "classic" HPC users and that they have their own tools and methods. Integrating their needs into existing HPC systems has proven to be more difficult than expected. As a result, it might be a good idea to present some of the characteristics or tendencies of these tools and users, so you can be prepared when the users start knocking on your door and sending you email. By the way, these tools might already be running on your systems without your knowing it.
Workload Characteristics
Before jumping in with both feet and listing all of the things that are needed in DA workloads, I think it's far better first to describe or characterize the bulk of DA workloads, which might reveal some indicators for what is needed. With these characteristics, I'm aiming for the "center of mass." I'm sure many people can come up with counterexamples, as can I, but I'm trying to develop some generalizations that can be used as a starting point.
In the subsequent sections, I'll describe some major workload characteristics, and I'll begin with the languages used in data analytics.
New Languages
The classic languages of HPC, Fortran and C/C++, are used somewhat in data analytics, but a whole host of new languages and tools are used as well. A wide variety of languages show up, particularly because Big Data is so hyped, which means everyone is trying to jump in with their particular favorite. However, a few have risen to the top:
- R [5]
- Python [6]
- Julia (up and coming) [7]
- Java [8]
- Matlab [9] and Matlab-compatible tools (Octave [10], Scilab [11], etc.)
Java is the lingua franca of MapReduce [12] and Hadoop [13]. Many toolkits for data analytics are written in Java, with the ability to be interfaced into other languages.
Because data analytics is, for the most part, about statistical methods, R, the language of statistics, is very popular. If you know a little Python, some Perl, or some Matlab, then learning R is not too difficult. It has a tremendous number of built-in functions, primarily for statistics, and great graphics for making charts. Several of its libraries are also appropriate for data analytics (Table 1).
Table 1: R Libraries and Helpers
| Software | Description | Source |
|---|---|---|
| Analytics libraries | | |
| R/parallel | Add-on package extends R by adding parallel computing capabilities | |
| Rmpi | Wrapper to MPI | |
| HPC tools | R with BLAS, LAPACK, and MPI in Linux | http://lostingeospace.blogspot.com/2012/06/r-and-hpc-blas-mpi-in-linux-environment.html |
| RHadoop [14] | R packages to manage and analyze data with Hadoop | |
| Database tools [15] | | |
| RSQLite [16] | R driver for SQLite | |
| rhbase [17] | Connectivity to HBASE | |
| graph | Package to handle graph data structures | http://www.bioconductor.org/packages/devel/bioc/html/graph.html |
| neuralnet [18] | Training neural networks | |
Python is becoming one of the most popular programming languages. The language is well suited for numerical analysis and general programming. Although it comes with a great deal of capability, lots of add-ons extend Python in the DA kingdom (Table 2).
Table 2: Python Add-Ons
| Software | Description | Source |
|---|---|---|
| Pandas | Data analysis library for data analytics | |
| scikit-learn | Machine learning tools | |
| SciPy | Open source software for mathematics, science, and engineering | |
| NumPy | A library for array objects including tools for integrating C/C++ and Fortran code, linear algebra computations, Fourier transforms, and random number capabilities | |
| matplotlib | Plotting library | |
| Database tools | | |
| sqlite3 [19] | SQLite database interface | |
| PostgreSQL [20] | Drivers for PostgreSQL | |
| MySQL-Python [21] | MySQL interface | |
| HappyBase | Library to interact with Apache HBase | |
| NoSQL | List of NoSQL packages | |
| PyBrain | Modular machine learning library | |
| ffnet | Feed-forward neural network | |
| Disco | Framework for distributed computing based on the MapReduce paradigm | |
| Hadoopy [22] | Wrapper for Hadoop using Cython | |
| Graph libraries | | |
| NetworkX | Package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks | |
| igraph | Network analysis | |
| python-graph | Library for working with graphs | |
| pydot | Interface to Graphviz's Dot language [23] | |
| graph-tool | Manipulation and statistical analysis of graphs (networks) | |
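To give a flavor of how a couple of the add-ons in Table 2 are commonly combined, here is a minimal sketch that uses NumPy to fabricate a small data set and Pandas to summarize it. This is not anyone's production workflow; the column names, values, and grouping are invented purely for illustration.

```python
# Minimal sketch: summarize a synthetic data set with NumPy and Pandas.
# All names and values here are invented for illustration only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Fabricate a small "measurement" data set
df = pd.DataFrame({
    "sensor": rng.choice(["A", "B", "C"], size=1000),
    "value": rng.normal(loc=10.0, scale=2.0, size=1000),
})

# Basic descriptive statistics per sensor
summary = df.groupby("sensor")["value"].describe()
print(summary)

# Simple derived quantity: per-sensor z-scores
df["zscore"] = df.groupby("sensor")["value"].transform(
    lambda v: (v - v.mean()) / v.std()
)
print(df.head())
```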
Julia is an up-and-coming language for HPC, but it is also drawing in DA researchers and practitioners. Julia is still a very young language; nonetheless, it has some very useful packages for data analytics (Table 3).
Table 3: Julia Packages
| Software | Description | Source |
|---|---|---|
| | Functions to support the development of machine learning algorithms | |
| | Basic statistics | |
| | Probability distributions and associated functions | |
| | Optimization functions | |
| | Library for working with tabular data | |
| | Crafty statistical graphics | |
| | Interface to matplotlib [24] | |
Matlab is a popular language in science and engineering, so it's natural for people to use it for data analytics. In general, a great deal of code is available for Matlab and Matlab-like applications (e.g., Octave and Scilab). Matlab and similar tools have data and matrix manipulation tools already built in, as well as graphics tools for plotting the results. You can write code in the respective languages of the different tools to create new functions and capabilities. The languages are reasonably close to each other, making portability easier than you might think, with the exception of graphical interfaces. Table 4 lists a few Matlab toolboxes from MathWorks and an open source toolbox for running parallel Matlab jobs. Octave and Scilab have similar functionality, but it might be spread across multiple toolboxes or come with the tool itself.
Table 4: Matlab Toolboxes
| Software | Description | Source |
|---|---|---|
| Statistics | Analyze and model data using statistics and machine learning | http://www.mathworks.com/products/statistics/?s_cid=sol_des_sub2_relprod3_statistics_toolbox |
| Data Acquisition | Connect to data acquisition cards, devices, and modules | |
| Image Processing | Image processing, analysis, and algorithm development | |
| Econometrics | Model and analyze financial and economic systems using statistical methods | http://www.mathworks.com/products/econometrics/?s_cid=HP_FP_ML_EconometricsToolbox |
| System Identification | Linear and nonlinear dynamic system models from measured input-output data | |
| Database | Exchange data with relational databases | |
| Clustering and Data Analysis | Clustering and data analysis algorithms | http://www.mathworks.com/matlabcentral/fileexchange/7473-clustering-and-data-analysis-toolbox |
| pMatlab [25] | Parallel Matlab toolbox | http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html |
These are just a few of the links to DA libraries, modules, add-ons, toolboxes, or what have you for languages that are increasingly popular in the DA world.
Single-Node Runs
Although not true for all DA workloads, the majority of applications tend to be designed for single-node systems. The code might be threaded, but very little code runs in parallel, communicating across multiple nodes on a single data set. MapReduce jobs split the data into pieces and roughly perform the same operation on each piece of the data, but with no communication between tasks. In other words, they are a collection of single-node runs with no real coupling or communication between them as part of the computations.
Just because most code executes as single-node runs doesn't mean data analytics doesn't generate a large volume of runs at the same time. In some cases, the analysis is performed with a variety of starting points or conditions. Sometimes you make different assumptions in the analysis or use different analysis techniques, resulting in the need to run a large number of single-node runs.
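As a sketch of what "lots of independent single-node runs" can look like, the following Python snippet fans a toy analysis function out over a list of starting conditions with multiprocessing. The analysis function is a stand-in; on a real cluster, each run would more likely be a separate job handed to the resource manager.

```python
# Sketch: many independent single-node analyses, no communication between them.
# run_analysis() is a stand-in for a real statistical computation.
from multiprocessing import Pool
import random

def run_analysis(seed):
    """One independent analysis run, parameterized by its starting condition."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
    mean = sum(samples) / len(samples)
    return seed, mean

if __name__ == "__main__":
    starting_conditions = range(32)          # 32 independent runs
    with Pool(processes=8) as pool:          # uses the cores of a single node
        for seed, mean in pool.map(run_analysis, starting_conditions):
            print(f"run {seed:2d}: mean = {mean:+.4f}")
```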
An example of a data analysis technique is principal component analysis [26] (PCA), in which a data set is decomposed. The resulting decomposition can then be used to reduce the data set or problem to a smaller one that is more easily analyzed. However, choosing how much of the data set to remove is something of an art, so the reduction might need to be performed several times with the amount of data reduction varied, and the analysis on the reduced data set might likewise have to be repeated many times. The end result is that the overall analysis could take a while and use a great deal of computation.
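Here is a hedged sketch of that PCA workflow using scikit-learn on synthetic data: the decomposition is repeated for several candidate numbers of components so you can see how much variance each reduction retains. The data, component counts, and final choice of five components are invented for illustration.

```python
# Sketch: decompose a synthetic data set with PCA and compare several
# candidate reductions. The data and component counts are illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 1,000 observations of 50 correlated variables
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 50))
data = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))

for n_components in (2, 5, 10, 20):
    pca = PCA(n_components=n_components).fit(data)
    retained = pca.explained_variance_ratio_.sum()
    print(f"{n_components:2d} components retain {retained:6.1%} of the variance")

# Reduce the data set with the chosen number of components and carry on
reduced = PCA(n_components=5).fit_transform(data)
print("reduced shape:", reduced.shape)
```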
Multiple nodes also come into play when NoSQL databases are used. I've seen situations in which users have their data in a NoSQL database distributed across many nodes and then run MapReduce jobs across all of the nodes. The computations employed in the map phase place a heavy emphasis on floating-point operations. The reduce phase also has some fairly heavy floating-point math, but many integer operations as well. All of these operations can have a heavy I/O component, too. The situation I'm familiar with was not that large in HPC terms – perhaps 12 to 20 nodes. However, it was just a "sample" problem. The complete problem would have started with 128 nodes and quickly moved to several thousand nodes. Although these are not MPI or closely coupled workloads, they can stress the capabilities of a large number of nodes.
Interactivity
A key aspect of data analytics is interacting with the analysis itself. Interactivity can mean multiple things. One form involves manual (i.e., interactive) preparation of the data for analysis, such as examining the data for outliers.
An outlier can be one data point or several data points that lie outside the expected range of data. A more mathematical way of stating this is that an outlier is an observation that is distant from other observations (you can substitute the word "measurement" for "observation"). The sources of outliers vary and include experimental error, instrument error, and human error. Typically, you either use robust analysis methods that can tolerate outliers or remove the outliers from the data. The process of removing outliers is something of an art that requires scrutinizing the data, often with the use of visual analysis methods.
Sometimes, outliers are retained in the data set, which requires that you use a new or different set of tools. A simple example uses the idea that the median of a data set is more robust [27] than the mean (average). If you take a value in a data set and make it extremely large (approaching infinity), then the mean obviously changes. However, the median value changes very little. This doesn't mean the median is necessarily a better statistic than the mean, just that it is more robust in the presence of outliers. It may require several different analyses of the data set to understand the effect of outliers on the analysis. Robust computations might have to be employed if the presence of outliers greatly affects non-robust analysis. In either case, it takes a fair amount of analysis either to find the outliers or to determine what analysis techniques are most appropriate.
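The mean-versus-median point is easy to demonstrate with a few lines of NumPy. This sketch injects a single extreme value into a synthetic sample and compares the two statistics; the numbers themselves are arbitrary, only the behavior matters.

```python
# Sketch: the median is more robust to a single extreme outlier than the mean.
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50.0, scale=5.0, size=1000)

print("clean   mean/median:", data.mean(), np.median(data))

# Replace one observation with an absurdly large value
data[0] = 1.0e9
print("tainted mean/median:", data.mean(), np.median(data))
# The mean jumps by orders of magnitude; the median barely moves.
```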
Visualization is a key component of data analytics. The human mind is very good at finding patterns, especially in visual information, so researchers plot various aspects of their data to get a visual representation that can lead to some sort of understanding or knowledge. During data analysis, the plots used are not always the same, because they need to adapt to the data itself. Researchers might want to try several kinds of charts on a single data set to better understand the data. Unfortunately, it is difficult to know a priori which charts to use. Whatever the case, DA systems need compute nodes with graphics cards.
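Because it is hard to know in advance which chart will be informative, analysts often generate several views of the same data. Here is a minimal matplotlib sketch of that habit, using synthetic data and arbitrary chart choices; the off-screen Agg backend stands in for rendering on a remote node.

```python
# Sketch: try several chart types on one synthetic data set with matplotlib.
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen, e.g., on a remote node
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(x, bins=30)               # distribution of one variable
axes[0].set_title("Histogram")
axes[1].scatter(x, y, s=5)             # relationship between two variables
axes[1].set_title("Scatter plot")
axes[2].boxplot([x, y])                # spread and outliers side by side
axes[2].set_title("Box plots")
fig.savefig("views.png", dpi=150)
```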
HPC systems are not typically equipped with graphics cards for visualizing data; they are generally used for computation, with the results pulled back to either an interactive system or a user's workstation. Some HPC systems offer what is termed "remote visualization." The University of Texas Advanced Computing Center (TACC) has a system named Longhorn [28] that was designed to provide remote visualization and computation in one system. HPC systems like this allow researchers to get an immediate view of their work, so they can either adjust the computations or change the analysis.
Data analytics is not like a typical HPC application that is run in batch mode with no user interaction. On the contrary, data analytics may require a great deal of interactivity.
Data Analytics Pipeline
Data analytics is a young and growing field that is trying to mature rapidly. As a result, many computations might be needed to arrive at some final information or answers. To get there, the computations are typically organized in what is called a "pipeline" (also called a workflow). The term is commonly used in biological sequence analysis, where it refers to a series of sequential steps taken to arrive at an answer: the output from one step is the input to the next. Data analytics is headed rapidly in this same direction, perhaps without explicitly stating it. Data analytics starts with raw data, then massages and manipulates it to get it ready for analysis, and the analysis itself typically consists of a series of computational steps that arrive at an answer. Sounds like a pipeline to me.
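A toy illustration of the pipeline idea in Python: each stage is an ordinary function whose output feeds the next stage. In practice, the stages might be separate programs in different languages glued together by batch scripts or a workflow tool; the stages below are invented placeholders.

```python
# Toy pipeline: raw data -> cleaned data -> analysis -> report.
# Each stage only consumes the previous stage's output.
import statistics

def ingest():
    """Stage 1: collect raw data (here, just a hard-coded sample)."""
    return [12.1, 11.8, 12.4, 95.0, 12.0, 11.9, 12.3]   # 95.0 is a bogus reading

def clean(raw):
    """Stage 2: drop obvious outliers before analysis."""
    med = statistics.median(raw)
    return [v for v in raw if abs(v - med) < 10.0]

def analyze(cleaned):
    """Stage 3: compute summary statistics."""
    return {"n": len(cleaned),
            "mean": statistics.mean(cleaned),
            "stdev": statistics.stdev(cleaned)}

def report(results):
    """Stage 4: present the results."""
    for key, value in results.items():
        print(f"{key}: {value}")

report(analyze(clean(ingest())))
```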
Some of the steps in a DA pipeline might require visual output, others an interactive capability, or you might need both. Each step may require a different language or tool, as well. The consequence is that data analytics could be very complex from a job flow perspective. This can have a definite effect on how you design HPC systems and how you set up the system for users.
These characteristics – new languages, single-node runs (but lots of them), interactivity, and DA pipelines – are common to the majority of data analytics. Keep these concepts in mind when you deal with DA workloads.
Lots of Rapidly Changing Tools
Two features of DA tools are that they vary widely and change rapidly. A number of these tools are Java based, so HPC admins will have to contend with finding a good Java runtime environment for Linux. Moreover, given recent security concerns, you might have to patch Java fairly frequently.
Other language tools experience fairly rapid changes as well. For example, the two current streams of Python are the 2.x and 3.x series. Python 3.0 chose to drop certain features of Python 2.x and add new ones. Some toolkits still work only with Python 2.x and some only with Python 3.x. Some work with both, although the code is a bit different for each. As an HPC admin, you will have to keep several different versions of tools available on every node for DA users. (Hint: You should start thinking about using Environment Modules [29] or Lmod [30] to your advantage.)
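Two of the best-known incompatibilities are shown below. The snippet is written for Python 3, with comments noting what Python 2 would have done, which is exactly why a site often has to keep both interpreters (and matching module trees) around.

```python
# Python 3 snippet; the comments note the Python 2.x behavior.

# print is a function in Python 3; in Python 2 it was a statement
# ("print 'hello'"), so old scripts fail with a SyntaxError under Python 3.
print("hello from Python 3")

# Integer division: Python 2 returned 1 for 3/2, Python 3 returns 1.5.
print(3 / 2)     # -> 1.5 in Python 3
print(3 // 2)    # -> 1 (explicit floor division in both versions)

# Strings: Python 3 strings are Unicode by default; byte data needs bytes().
data = "résumé".encode("utf-8")    # explicit encode to bytes
print(type(data))                  # -> <class 'bytes'>
```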
One general class of tools that changes fairly rapidly is NoSQL databases. As the name implies, NoSQL databases don't follow SQL design guidelines and do things differently. For example, they can store data differently to adapt to the type of data and the type of analysis, and they forgo strict consistency in favor of availability and partition tolerance. The focus of NoSQL databases is on simplicity of design, horizontal scaling, and finer control over availability.
A large number of NoSQL databases are available with different characteristics (Table 5). Depending on the type of data analytics being performed, several of the databases can be running at the same time. As an administrator, you need to define how these databases are architected. For example, do you create dedicated nodes to store the database, or do you store the database on a variety of nodes that are then allocated by the resource manager to a user who needs the resource? If so, then you need to define these nodes with special properties for the resource manager.
Table 5: Databases
| Name | URL |
|---|---|
| Wide column stores: | |
| HBase | |
| Hypertable | |
| Document stores: | |
| MongoDB | |
| Elasticsearch | |
| CouchDB | |
| Key value stores: | |
| DynamoDB | |
| Riak | |
| Berkeley DB | |
| Oracle NoSQL | http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html |
| MemcacheDB | |
| PickleDB | |
| OpenLDAP | |
| Graph: | |
| HyperGraphDB | |
| GraphBase | |
| Bigdata | |
| AllegroGraph | |
| Scientific: | |
| SciDB | |
Storage and Compute – Hadoop
One of the tools to which you will really have to pay attention is Hadoop. By Hadoop, I mean not only the Hadoop Distributed File System (HDFS) [31], but also the idea of MapReduce and how you write applications using MapReduce concepts. These are really two different things, but to make my life easier, I will use "Hadoop" to mean both, unless I specifically refer to one or the other.
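To make the MapReduce idea concrete, here is the classic word-count pattern written in the Hadoop Streaming style, where the mapper and reducer are plain scripts that read stdin and emit tab-separated key/value pairs. This is a simplified sketch: both roles are shown in one file, and the sort that Hadoop performs between the phases is simulated locally. Run it locally as `cat input.txt | python wordcount.py`; under Hadoop Streaming, the mapper and reducer would be supplied as separate scripts via the `-mapper` and `-reducer` options.

```python
# Sketch of the MapReduce word-count pattern in the Hadoop Streaming style:
# the mapper emits "word<TAB>1" lines, Hadoop sorts them by key, and the
# reducer sums the counts per word. Here both phases run locally for clarity.
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word (input sorted by key)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    mapped = sorted(mapper(sys.stdin))        # Hadoop does this shuffle/sort
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```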
Using Hadoop can greatly complicate your life as an HPC administrator. Embodied in Hadoop is the concept of moving the compute to where the data is located. Therefore, submitting a DA job that uses Hadoop is a bit more complicated, because certain nodes will contain the needed data. Although HDFS can copy the data to other nodes, that's not really the thrust of Hadoop. Therefore, your resource manager needs to be "data aware," so that it can find the nodes where the data is located or copy the data to other nodes that are available.
Another complication is that Hadoop 2, the current version of Hadoop, uses something called YARN [32]. YARN stands for Yet Another Resource Negotiator. Fundamentally, it is a resource manager similar to Slurm, Moab, OGE, or Torque. If you have DA applications that depend on Hadoop and YARN within an HPC system that already has a resource manager, you will get a situation of "Who's on First?" – that is, which resource manager "owns" or "controls" which specific resources (nodes). I think all HPC administrators know that you can't have two resource managers trying to manage the same nodes; you will have lots of problems very fast.
Most compute nodes in HPC systems either have no disk (diskless) or a single disk. Consequently, it is really difficult to use them as Hadoop nodes. You have several options at this point. One option is to give all or a portion of the compute nodes in the cluster a fair amount of local disk for storing data. If all of the nodes have local disks, then the resource manager's life is a bit easier, but you will need more racks and the general cost of the system will go up. If you make only a portion of the compute nodes appropriate for Hadoop (lots of local disks), then you need to tell the resource manager that these nodes have different properties (e.g., "Hadoop") and set up the resource scheduling appropriately. Although this is cheaper than stuffing all the nodes with disks, it is a bit more complicated.
A second option is to build HDFS on some sort of centralized storage. HDFS is a meta-filesystem, in that it is a filesystem built on top of other filesystems (usually "local" filesystems). This means you can build HDFS on top of almost any storage you want. For example, if you have centralized storage such as Lustre, you could just build HDFS on top of it [33]. However, this approach takes advantage of neither the strengths of the centralized storage nor those of HDFS.
As a third alternative, Intel has created some tools for Intel Enterprise Edition of Lustre (IEEL) that allow MapReduce applications to write directly to Lustre [34]. These tools also allow the "shuffle" phase of MapReduce to be skipped. In the current version of IEEL, version 2.0, a beta version of a tool replaces YARN with your existing resource manager. This allows you to use a single resource manager within your HPC system.
Summary
Data analytics is probably the fastest growing computational workload today. Relative to HPC, it is still done on a somewhat small scale, although companies such as PayPal are proving the need for larger scale computations. Naturally, the desire is for these computations to be done on HPC systems to avoid the cost of a second system. However, data analytics is a different workload from what you have experienced in the HPC world to date.
In this article, I've reviewed some aspects of data analytics workloads. Be ready for:
- Lots of new languages, including interfaces to traditional databases and NoSQL databases
- Lots of single-node runs (possibly lots of memory)
- Interactivity
  - Interactive login
  - Visualization
  - Graphics cards in nodes
- Data analytics pipelines
- Lots of rapidly changing tools
  - SQL tools
  - NoSQL tools
- Hadoop and storage
  - Hadoop moves computation to storage (most of the time)
  - Hadoop uses local storage
  - Hadoop 2.0 uses its own resource manager, YARN, which can easily cause problems with the resource manager
If you read through these highlights and talk to your DA users, you will see that you might need to add or change your processes and you might need to add new hardware. If you don't have DA users today, then I suggest you look a little closer or be ready for the data analytics wave to overtake you.