
Julia: A new language for technical computing


The Julia language provides a powerful parallel computing model that works across multiple cores and cluster nodes. Can this new language deliver on its bold claims of fast, easy, and parallel? By Douglas Eadline

In the early days of the personal computer, many people built and bought early desktop systems simply to explore computing on their own. Of course, they often had access to mainframe systems or even minicomputers, but something about having the computer physically next to you for your private use was appealing. As the sole user and owner, early users controlled everything, including the reset switch. Total ownership allowed early pioneers to tinker with hardware and software without concern for other users.

Some would argue that a whole new industry was launched from this "tinkering." The relatively low cost of early desktop computing allowed anyone who was curious to explore and adapt early PCs to their needs. Initially, programming tools were rare, and many early users found themselves writing assembly language programs or even toggling in machine code instructions. It was not long until Microsoft Basic was available and became one of the first high-level languages used by the early PC crowd. Languages like C and Fortran that were previously only available on larger systems soon followed. The PC revolution created a new class of "developer" – someone who had specific domain experience and a programmable PC at their disposal. Countless applications seemed to spring up overnight. Some applications went on to become huge commercial successes, and others found a niche in their specific application area.

Applying these lessons to HPC, you might ask, "How do I tinker with HPC?" The answer is far from simple. In terms of hardware, a few PCs, an Ethernet switch, and MPI get you a small cluster, or a video card and CUDA get you some GPU hardware. Like the PC revolution, a low cost of entry now invites the DIY crowd to learn and play with HPC methods, but the real question is: What software can a domain specialist use to tinker with HPC?

Many of the core HPC programming tools are too low level for most domain specialists. Learning Fortran, C/C++, MPI, CUDA, or OpenCL is a tall order. These tools tend to be "close to the hardware" and can be difficult to master without a significant time investment. To get closer to their problem domain, many technical computing users prefer languages like Python, R, and MATLAB. These higher level tools move the programming environment closer to the user's problem and are often easier to use than more traditional low-level tools.

High-level programming tools often come with a "speed for convenience" trade-off, however. Whereas performance languages like C or Fortran are statically compiled, most "convenient" languages use a slower dynamic compilation method that allows real-time interaction and tinkering with code sections. On the plus side, the higher level languages are more expressive and often require fewer lines of code to create a program.

Another issue facing all languages is parallel computing. The advent of multicore has forced the issue because the typical desktop now has at least four cores. Additionally, the introduction of multicore servers, HPC clusters, and GP-GPU computing has fragmented many traditional low-level programming models. High-level languages that try to hide these features from the user have had varying levels of success, but for the most part, parallel computation is an afterthought.

The lack of a good high-level "tinker language" for HPC has been an issue for quite a while. How can a domain expert (e.g., a biologist) quickly and easily express a problem in such a way that it can use modern HPC hardware as easily as a desktop PC? An "HPC Basic" – a language that gets users started quickly without the need to understand the details of the underlying machine architecture – is needed. (To be fair, some would suggest there should never have been Basic in the first place!)

Julia Is Not Bashful

Recently, the new language Julia [1] has seen a lot of discussion as a tool for technical computing environments. The authors explain their justification for the language as follows:

We want a language that's open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that's homoiconic (i.e., has the same representation of code and data), with true macros like Lisp, but with obvious, familiar mathematical notation like MATLAB. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as MATLAB, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled. … We want to write simple scalar loops that compile down to tight machine code using just the registers on a single CPU. We want to write A*B and launch a thousand computations on a thousand machines, calculating a vast matrix product together. [2]

The Julia site [1] has quite a bit more information about their bold plan, but the above paragraph sounds like a dream come true for many HPC users. In particular, some of the issues Julia addresses have been holes in the HPC landscape for years and seem almost impossible until you look at the micro-benchmarks in Table 1.

Table 1: Benchmark Times (Smaller is Better)*

                 Julia        Python    MATLAB      Octave      R         JavaScript
                 v3f670da0    v2.7.1    vR2011a     v3.4        v2.14.2   v8 3.6.6.11
fib              1.97         31.47     1,336.37    2,383.80    225.23    1.55
parse_int        1.44         16.50     815.19      6,454.50    337.52    2.17
quicksort        1.49         55.84     132.71      3,127.50    713.77    4.11
mandel           5.55         31.15     65.44       824.68      156.68    5.67
pi_sum           0.74         18.03     1.08        328.33      164.69    0.75
rand_mat_stat    3.37         39.34     11.64       54.54       22.07     8.12
rand_mat_mul     1.00         1.18      0.70        1.65        8.64      41.79

* Tests are relative to C++ and were run on a MacBook Pro with a 2.53GHz Intel Core 2 Duo CPU and 8GB of 1,066MHz DDR3 RAM (from the Julia website).

Considering the "P" in HPC is for Performance, the results in Table 1 should invite further investigation into Julia. One of the big assumptions about many high-level languages has been the loss of efficiency when compared with C or Fortran. This table indicates that this does not need to be the case. Even getting close to the speeds of traditional compiled languages would be a welcome breakthrough in high-level HPC tools.

In addition to speed, many features of Julia should appeal to the domain experts who use HPC. The following short list highlights some important benefits of Julia. (Consult the Julia manual [3] for a full set of features.)

One ability all high-level languages need is to "glue" together existing libraries from other sources. Too much good code is available out there to ignore or rewrite. Through the use of the LLVM compiler, Julia can use existing shared libraries compiled with GCC or Clang tools without any special glue code or compilation – even from the interactive prompt. The result is a high-performance, low-overhead method that lets Julia leverage existing libraries.
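For example, Julia's ccall interface can invoke a function in a shared C library directly, even from the interactive prompt. A minimal sketch, calling clock() from the standard C library (output omitted):

julia> t = ccall(:clock, Int32, ())          # call libc's clock() with no wrapper code
julia> println("CPU clock ticks used: $t")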

Another important HPC feature of Julia is a native parallel computing model based on two primitives: remote references and remote calls. Julia uses message passing behind the scenes but, unlike MPI, does not require the user to control the environment explicitly. Communication in Julia is generally "one-sided," meaning the programmer needs to manage only one processor explicitly in a two-processor operation. Julia also has support for distributed arrays.

Hands On

Julia is different because it was designed with HPC in mind, which makes it more exciting than most other programming languages. Julia is, among other things, open source, fast, scalable, easy to learn, and extensible. It fills a void in HPC by letting users "tinker" with hardware and software, so now it's time to look at getting some hands-on experience with Julia.

Because Julia is new, some aspects of the language are still developing, and the extremely good documentation [6] is worth consulting if you want to explore further. Because Julia is somewhat of a moving target, it is best to pull the latest release from the web; the download and build instructions [5] list the platforms that Julia currently supports.

The following examples were built and run on a Limulus [7] personal cluster running Scientific Linux 6.2 on an Intel i5-2400S with 4GB of memory. If you don't want to bother building Julia but would like to play with it on the web, check out the online Julia version [8] (do not enter Your Name or Session Name; just click the Try Julia Now bar).

To build Julia locally, move to a working directory with at least 2GB of available file space and enter:

$ git clone git://github.com/JuliaLang/julia.git

After the download completes, you will have a new julia directory. To build Julia, move into that directory, enter make, and go grab a cup of coffee, head out to lunch, or walk the dog. The Julia build takes a while and will pull down any needed packages. It also uses LLVM, an open, modular compiler framework, as a back end for the language. When the build is finished, you should find a julia binary in your working directory. To start the Julia interpreter, enter:

./julia

If you don't want to see the title (Listing 1) on subsequent startups, use

julia -q

to get the command line. It is also a good idea to set your PATH to include the new binary.

Listing 1: Julia Header on Startup

$ ./julia
               _
   _       _ _(_)_     |
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  A fresh approach to technical computing
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.0.0+86921303.rc6cb
 _/ |\__'_|_|_|\__'_|  |  Commit c6cbcd11c8 (2012-05-25 00:27:29)
|__/                   |
julia>

Like many popular interactive tools, you can enter expressions like:

julia> sqrt(2*7)+(6/4)
5.241657386773941

As I mentioned before, both a user manual and library reference are available online, so you can explore more of the Julia language using these resources. Because Julia is pointed squarely at the HPC crowd, and because parallel computing is an integral part of technical computing, I'll jump right into how Julia expresses parallelism.

Diving Into the Deep End

Before I begin, however, let me provide my standard "MPI is still great" disclaimer. Higher level languages often try to hide the details of low-level parallel communication. With this "feature" comes some loss of efficiency, similar to writing Fortran instead of low-level machine code. The trade-off is often acceptable because the gains in convenience outweigh the loss of efficiency. Not to worry: MPI is here to stay, and it is still great. Just remember that higher level languages that hide low-level MPI coding let more people play in the HPC game. As stated in the Julia manual, Julia provides a simple one-sided messaging model:

Julia's implementation of message passing is different from other environments such as MPI. Communication in Julia is generally "one-sided," meaning that the programmer needs to explicitly manage only one processor in a two-processor operation. Furthermore, these operations typically do not look like "message send" and "message receive" but rather resemble higher level operations like calls to user functions. [3]

The authors also state that Julia provides two built-in primitives:

… remote references and remote calls. A remote reference is an object that can be used from any processor to refer to an object stored on a particular processor. A remote call is a request by one processor to call a certain function on certain arguments on another (possibly the same) processor. [3]

Before I explore parallel computation, I need to start Julia on a multicore machine with two processors:

julia -q -p 2

I will look at dynamically adding processors in the next section, but for now, I will use two cores on the same machine. It also makes sense not to oversubscribe your machine (i.e., the -p argument should not exceed the number of available cores). In the example below, I use remote_call to add two numbers on another processor. The first argument is the processor number, the second is the function, and the remaining arguments are arguments to the function (in this case, I am adding 2 + 2). Then I fetch the result. I can also use a remote call to operate on previous results.

julia> r = remote_call(2, +, 2, 2)
RemoteRef(2,1,1)
julia> fetch(r)
4
julia> s = remote_call(2, +, 1, r)
RemoteRef(2,1,2)
julia> fetch(s)
5

Remote calls return immediately and do not wait for the task to complete. The processor that made the call proceeds to its next operation while the remote call happens somewhere else. A fetch() call will wait until the result is available, however. Also, you can wait for a remote call to finish by issuing a wait() on its remote reference.
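For example, the sequence below launches a remote computation, waits for it explicitly, and then retrieves the result (a minimal sketch; output omitted):

julia> r = remote_call(2, rand, 2, 2)   # returns immediately with a remote reference
julia> wait(r)                          # block until processor 2 has finished
julia> fetch(r)                         # retrieve the 2x2 random matrix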

In the above example, the need to supply processor numbers is not very portable, and as such, these primitives are not used by most programmers. A Julia macro called @spawn removes this dependency. For example:

julia> r = @spawn 7-1
RemoteRef(2,1,7)
julia> fetch(r)
6

As I will explain below, a @parallel macro is also very helpful with loops. Because Julia is interactive, it is important to keep in mind that functions defined and entered from the command line are not automatically copied to other processing cores (on the local machine or on remote nodes). To make functions available on all processing cores, use the load function, which loads a Julia source file on all known cores associated with the current Julia instance. Alternatively, Julia will load the file startup.jl (if it exists) in your home directory.
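For example, suppose the function below lives in a file called count_heads.jl (a hypothetical file name, loosely following the Julia documentation):

# count_heads.jl -- count heads in n random coin flips
function count_heads(n)
    c = 0
    for i = 1:n
        c += randbit()
    end
    c
end

Entering load("count_heads.jl") at the prompt then makes count_heads() callable from every core, not just the one where it was defined.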

Calling All Cores

Julia can use cores on the local machine and on remote machines (remote nodes will almost always be cluster nodes). The following is a brief description of how to add cores dynamically to your Julia instance. First, I will start Julia with one core. The nprocs() function will report the number of cores available to the current instance. For example:

$ julia -q
julia> nprocs()
1

If I want to add local cores, I can use the addprocs_local() function, as in Listing 2. In this case, I added one core. Note that nprocs() now will show a total of two cores:

julia> nprocs()
2

Listing 2: addprocs

julia> addprocs_local(1)
ProcessGroup(1,{LocalProcess(), Worker("10.0.0.1",9009,4,IOStream(),IOStream(),{},{},2,false)},{Location("",0), Location("10.0.0.1",9009)},2,{(1,0)=>WorkItem(bottom_func,(),false,(addprocs_local(1),1),(),(),intset(1))})

Adding remote cores (those on other machines) can be done in two ways. The first is to add the nodes explicitly with the addprocs_ssh() function. In the example in Listing 3, I am adding nodes named n0 and n2. Note that using remote nodes assumes Julia has been installed in the same location on each node or is available via a shared filesystem. Make sure the PATH variable points to your Julia binary on the remote nodes. Again, note that the number of processors (cores) has increased – to four.

Listing 3: Adding Nodes

julia> addprocs_ssh({"n0","n2"})
ProcessGroup(1,{LocalProcess(), Worker("10.0.0.1",9009,4,IOStream(),IOStream(),{},
{},2,false)  ...  },{Location("",0), Location("10.0.0.1",9009) ... Location("10.0.0.12",9009)},4,
{(1,0)=>WorkItem(bottom_func,(),false,(thunk(AST(lambda({},{{#1, #2}, {{#1, Any, 2}, {#2, Any, 2}}, {}},
begin
  #1 = top(Array)(top(Any),2)
  top(arrayset)(#1,1,"n0")
  top(arrayset)(#1,2,"n2")
  #2 = #1
  return addprocs_ssh(#2)
end
))),1),(),(),intset(1))})
julia> nprocs()
4

To verify that the remote nodes are indeed involved with the computation, I can devise a simple parallel loop using the @parallel macro. I also use the Julia run function to run hostname on each node. For example:

julia> @parallel for i=1:4
       run(`hostname`)
       end
julia> limulus
limulus
n2
n0

By selecting i=1:4, I expect Julia to use all available cores, and indeed, two local cores and the two remote nodes report in with their names. (The local host machine is named "limulus.") The nodes are used in round-robin fashion, and if the loop count is increased, Julia cycles through the available resources again.

julia> @parallel for i=1:8
              run(`hostname`)
              end
julia> limulus
limulus
limulus
limulus
n2
n0
n2
n0

To get a better feel for parallel computation, I can run the example from the Julia documentation [9]. First, see how it works on one node. The following program will generate a random bit (0 or 1) and sum the results (the + represents a parallel reduction). The tic function starts a timer, and toc reports the elapsed time.

julia> tic();
  nheads = @parallel (+) for i=1:100000000
  randbit()
  end;
  s=toc();
  println("Number of Heads: $nheads in $s seconds")
elapsed time: 11.234276056289673 seconds
Number of Heads: 50003873 in 11.234276056289673 seconds

The loop took 11.23 seconds to complete using one core. Next, if I run the exact same loop using two local cores and the two remote nodes, I get 5.67 seconds. Finally, if I restart Julia with four local cores, I get 3.71 seconds. The speed-ups certainly indicate Julia is running in parallel (you can also observe the Julia process with top on worker nodes). Note that the simple parallel loop required no information about the number or location of cores. This feature is extremely powerful because it allows applications to be written on a small number of cores (or a single core) and then run in parallel when more cores are available. Although such a method does not guarantee efficiency, removing the parallel bookkeeping from the programmer is always helpful. Many other useful parallel functions are worth exploring, including things like myid(), which gives a unique processor number that can be used for identification.
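For instance, printing myid() inside a parallel loop is a quick way to see where iterations land (a minimal sketch; the output order will vary with scheduling):

julia> @parallel for i=1:4
           println("iteration $i ran on processor $(myid())")
       end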

The final way to add cores is through Grid Engine [10] using the addprocs_sge() function. I assume other schedulers like Torque will be supported in the future. This feature, while seemingly mundane, can be very useful. It allows programs to scale dynamically over a whole cluster or cloud. Programs could be constructed to seek resources as needed and, if available, use them. Consider a user running a dynamic Julia program from their notebook or workstation. The program could request cores from a cluster or cloud; if the resources are not available, the request could be withdrawn, or the user's program could wait until the resources become available. Although this functionality is still under development, the parallel computing component of Julia is written in Julia, so improving and experimenting with the code is quite simple because there is no need to dig down into the code and modify a low-level routine. As noted, because Julia is new, some features are still under development, like removing cores or better error recovery.
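The calling convention mirrors the other addprocs variants (a sketch; I am assuming Grid Engine is installed and configured, and the worker count here is arbitrary):

julia> addprocs_sge(2)   # ask Grid Engine to schedule two worker cores
julia> nprocs()          # shows the new total once the workers check in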

Another interesting scenario for Julia is use on a desk-side HPC resource. For instance, the Limulus system I am using here is a fully functioning four-node cluster with one head/login node that is powered all the time (just like a workstation). When extra resources are needed, the user or the batch scheduler can power up nodes at any time. With the ability to request resources dynamically, a Julia program could request cores through the batch scheduler, which activates the needed nodes and runs the desired program. The user never has to think about nodes, batch queues, or administrative issues. The Julia program will manage itself and allow the user to focus more on the problem at hand rather than the details of parallel computing.

When Ever

Taking away the responsibility of managing explicit communication and synchronization from the user has advantages. How work is scheduled is now virtually transparent to the user. With Julia, multiple parallel computations are managed as tasks or co-routines. Whenever code performs a communication operation like fetch or wait, the current task is suspended and a scheduler picks another task to run. When the wait event completes (e.g., the data shows up), the task is restarted. This design has the advantage of a dynamic environment where explicit synchronization is not the responsibility of the user. In addition, dynamic scheduling allows easy implementation of master/worker divide-and-conquer algorithms.
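A small sketch illustrates the idea: two pieces of work are spawned, and each fetch simply suspends the current task until its result is ready, letting the scheduler run other work in the meantime (the array size is arbitrary):

julia> a = @spawn sum(rand(1000))   # half of the work, on any available core
julia> b = @spawn sum(rand(1000))   # the other half, runs concurrently
julia> total = fetch(a) + fetch(b)  # each fetch yields until its result arrives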

Finding Julia

Although Julia holds much promise for technical computing, it is still a young language and is currently undergoing some changes and improvements. Meanwhile, the Julia community is growing rapidly. As such, Julia is probably not ready for heavy production use at this point in time, although it is possible to tinker with Julia on almost any desktop machine. Source and binary packages are available; consult the Julia download and build instructions [5] page for more information.

I have only touched a small part of Julia's parallel capability, and I hope this partial introduction has given you a feel for the power of Julia's parallel computing model. Remember, all this goodness comes with blazingly fast execution, so creating parallel applications or algorithms is not just an academic exercise.

Finally, I do not mean to diminish Julia in any way by labeling it a "tinker language." Indeed, many "first" languages become lifetime tools, and early programs or prototypes grow and morph into larger, production-level projects. A nice quality of Julia is that there is almost no barrier to start coding right away. Those familiar with MATLAB will find it particularly easy to get started. Also, you do not need to use advanced features, like parallel computing, from the beginning.

The nice thing about "tinkering" is that you can try simple things first, test ideas, and end up with a working prototype in very little time. That your prototype runs almost as fast as "real code" is a welcome benefit.