Run your Java applications on heterogeneous hardware


Heterogeneous hardware is present in virtually every computing system: our smartphones contain a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) with several cores; our laptops typically contain a multi-core CPU with an integrated GPU plus a dedicated GPU; data centers are adding Field Programmable Gate Arrays (FPGAs) attached to their systems to accelerate specialized tasks while reducing energy consumption. Furthermore, companies are building their own hardware to accelerate specialized applications. For instance, Google has developed a processor for faster processing of TensorFlow computations, called the Tensor Processing Unit (TPU). This hardware specialization and the recent popularity of hardware accelerators are a consequence of the end of Moore's law: due to physical constraints, the number of transistors per processor no longer doubles every two years with each new CPU generation. Therefore, the way to obtain faster hardware for accelerating applications is through hardware specialization.

The main challenge of hardware specialization is programmability. Most likely, each heterogeneous device has its own programming model and its own parallel programming language. Standards such as OpenCL and SYCL, as well as map-reduce frameworks, facilitate programming for new and/or parallel hardware. However, many of these parallel programming frameworks were created for low-level programming languages such as Fortran, C, and C++.

Although these programming languages are still widely used, the reality is that industry and academia tend to use higher-level programming languages such as Java, Python, Ruby, R, and JavaScript. Therefore, the question now is: how can we use new heterogeneous hardware from these high-level programming languages?

There are currently two main answers to this question: a) via external libraries, in which users might be limited to only a set of well-known functions; and b) via a wrapper that exposes low-level parallel hardware details to the high-level programs (e.g., JOCL is a wrapper for programming OpenCL from Java, in which developers must know the OpenCL programming model, data management, thread scheduling, etc.). However, many potential users of this new parallel and heterogeneous hardware are not necessarily experts in parallel computing, and perhaps a much simpler solution is needed.

In this article, we discuss TornadoVM, a plug-in to OpenJDK that allows developers to automatically and transparently run Java programs on heterogeneous hardware, without any required knowledge of parallel computing or heterogeneous programming models. TornadoVM currently supports hardware acceleration on multi-core CPUs, GPUs, and FPGAs, and it is able to dynamically adapt its execution to the best target device by performing code migration between devices (e.g., from a multi-core system to a GPU) at runtime. TornadoVM is a research project developed at the University of Manchester (UK), and it is fully open-source and available on GitHub. In this article, we present an overview of TornadoVM and show how programmers can automatically accelerate a photography filter on multi-core CPUs and GPUs.

How does TornadoVM work?

The general idea of TornadoVM is to write or modify as few lines of code as possible, and to automatically execute that code on accelerators (e.g., on a GPU). TornadoVM transparently manages the execution, memory management, and synchronization, without the user specifying any details about the actual hardware to run on.

TornadoVM's architecture consists of a traditional layered architecture combined with a microkernel architecture, in which the core component is its runtime system. The following figure shows a high-level overview of all the TornadoVM components and how they interact with one another.



At the top level, TornadoVM exposes an API to Java developers. This API allows users to identify which methods they want to accelerate by running them on heterogeneous hardware. One important aspect of this programming framework is that it does not automatically detect parallelism. Instead, it exploits parallelism at the task level, in which each task corresponds to an existing Java method.

The TornadoVM API can also create a group of tasks, called a task-schedule. All tasks within the same task-schedule (all Java methods associated with the task-schedule) are compiled and executed on the same device (e.g., on the same GPU). By having multiple tasks (methods) as part of a task-schedule, TornadoVM can further optimize data movement between the main host (the CPU) and the target device (e.g., the GPU). This matters because memory is not shared between the host and the target devices; we need to copy the data from the CPU's main memory to the accelerator's memory (usually via a PCIe bus). These data transfers are very expensive and can hurt the end-to-end performance of our applications. By creating a group of tasks, data movement can be further optimized if TornadoVM detects that some data can stay on the target device, without the need to synchronize with the host side for every kernel (Java method) that is executed.


The following code snippet shows an example of how to program a typical map-reduce computation using TornadoVM. The class Sample contains three methods: one method that performs the vector addition (map); another method that computes the reduction (reduce); and a last one that creates the task-schedule and executes it (compute). The methods to be accelerated are map and reduce. Note that the user augments the sequential code with annotations such as @Parallel and @Reduce, which are used as hints to the TornadoVM compiler to parallelize the code. The last method (compute) creates an instance of the TaskSchedule Java class and specifies which methods to accelerate. We will go into the details of the API with a full example in the next section.

public class Sample {

    public static void map(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void reduce(float[] input, @Reduce float[] out) {
        for (@Parallel int i = 0; i < input.length; i++) {
            out[0] += input[i];
        }
    }

    public void compute(float[] a, float[] b, float[] c, float[] output) {
        TaskSchedule ts = new TaskSchedule("s0")
            .task("t0", Sample::map, a, b, c)
            .task("t1", Sample::reduce, c, output)
            .streamOut(output);
        ts.execute();
    }
}
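Stripped of the annotations and the TornadoVM classpath, the two kernels are plain Java: map is an element-wise vector addition and reduce accumulates the result into a single cell. The following self-contained sketch (sequential, no TornadoVM dependency) checks those semantics:

```java
class SampleCheck {
    // Element-wise vector addition: c[i] = a[i] + b[i]
    static void map(float[] a, float[] b, float[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Reduction: accumulate all elements of input into out[0]
    static void reduce(float[] input, float[] out) {
        for (int i = 0; i < input.length; i++) {
            out[0] += input[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {10, 20, 30, 40};
        float[] c = new float[4];
        float[] out = new float[1];
        map(a, b, c);
        reduce(c, out);
        // Sum of (a[i] + b[i]) = 11 + 22 + 33 + 44 = 110
        if (out[0] != 110f) throw new AssertionError("unexpected reduction result");
        System.out.println(out[0]);
    }
}
```

Under TornadoVM, the @Parallel loop in reduce together with the @Reduce parameter tells the compiler to generate a parallel reduction instead of this sequential accumulation.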

TornadoVM Runtime

The TornadoVM runtime layer is split into two subcomponents: a task optimizer and a bytecode generator. The task optimizer takes all tasks within the task-schedules and analyzes data dependencies among them (dataflow runtime analysis). The goal of this, as mentioned before, is to optimize data movement across tasks.

Once the TornadoVM runtime system has optimized the data transfers, it generates internal TornadoVM-specific bytecodes. These bytecodes are not visible to developers; their role is to orchestrate the execution on heterogeneous devices. We will show an example of the internal TornadoVM bytecodes in the next block.


Once the TornadoVM bytecodes have been generated, the execution engine runs them in a bytecode interpreter. The bytecodes are simple instructions that can be reordered internally to perform optimizations, for example, overlapping computation with communication.

The following code snippet shows a list of generated bytecodes for the map-reduce example shown in the previous code snippet. Every task-schedule is enclosed between BEGIN and END bytecodes. The number that follows each bytecode is the device on which all tasks within the task-schedule will execute; however, the device can be changed at any point during runtime. Recall that we are running two tasks in this particular task-schedule (a map method and a reduce method). For each method (or task), TornadoVM needs to pre-allocate the data and perform the corresponding data transfers. Therefore, TornadoVM executes COPY_IN, which allocates and copies the read-only data (such as arrays a and b from the example), and allocates space in the device buffer for the output (write-only) variables by calling the ALLOC bytecode. All bytecodes have a bytecode index (bi) that other bytecodes can refer to. For example, since the execution of many of the bytecodes is non-blocking, TornadoVM adds a barrier by running the ADD_DEP bytecode with a list of bytecode indices to wait for.


Then, to run the kernel (Java method), TornadoVM executes the LAUNCH bytecode. The first time this bytecode is executed, TornadoVM compiles the referenced method (in our example, the methods called map and reduce) from Java bytecode to OpenCL C. Since this compiler is, in fact, a source-to-source compiler (Java bytecode to OpenCL C), another compiler is needed. The latter is part of the driver of each target device (e.g., the GPU driver for NVIDIA, or the Intel driver for an Intel FPGA), and it compiles the OpenCL C to binary. TornadoVM then stores the final binary in its code cache. If the task-schedule is reused and executed again, TornadoVM obtains the optimized binary from the code cache, saving the time of re-compilation. Once all tasks have executed, TornadoVM copies the final result into host memory by running the COPY_OUT_BLOCK bytecode.

BEGIN <0>
COPY_IN <0, bi1, a>
COPY_IN <0, bi2, b>
ALLOC <0, bi3, c>
ADD_DEP <0, bi1, bi2, bi3>
LAUNCH <0, bi4, @map, a, b, c>
ALLOC <0, bi5, output>
ADD_DEP <0, bi4, bi5>
LAUNCH <0, bi7, @reduce, c, output>
COPY_OUT_BLOCK <0, bi8, output>
END <0>

The following figure shows a high-level illustration of how TornadoVM compiles and executes code from Java to OpenCL. The JIT compiler is an extension of the Graal JIT compiler for OpenCL developed at the University of Manchester. Internally, the JIT compiler builds a control flow graph (CFG) and a data flow graph (DFG) for the input program, which are optimized across different tiers of compilation. The TornadoVM JIT compiler currently has three tiers of optimization: a) architecture-independent optimizations (HIR), such as loop unrolling, constant propagation, parallel loop exploration, and parallel pattern detection; b) memory optimizations, such as alignment, in the MIR; and c) architecture-dependent optimizations. Once the code is optimized, TornadoVM traverses the optimized graph and generates OpenCL C code, as shown on the right side of the figure.


Additionally, the execution engine automatically handles memory and keeps consistency between the device buffers (allocated on the target device) and the host buffers (allocated on the Java heap). Since compilation and execution are automatically managed by TornadoVM, end-users do not have to worry about the internal details.

Testing TornadoVM

This section shows some examples of how to program and run TornadoVM. We show, as an example, a simple program that transforms an input colored JPEG image into a grayscale image. Then we show how to run it on different devices and measure its performance. All examples presented in this article are available online on GitHub.

Grayscale transformation Java code

The Java method that transforms a color JPEG image into grayscale is the following:

class Image {
  private static void grayScale(int[] image, final int w, final int s) {
    for (int i = 0; i < w; i++) {
      for (int j = 0; j < s; j++) {
        int rgb = image[i * s + j];
        int alpha = (rgb >> 24) & 0xFF;
        int red   = (rgb >> 16) & 0xFF;
        int green = (rgb >> 8)  & 0xFF;
        int blue  =  rgb        & 0xFF;
        int grayLevel = (red + green + blue) / 3;
        int gray = (alpha << 24) | (grayLevel << 16) | (grayLevel << 8) | grayLevel;
        image[i * s + j] = gray;
      }
    }
  }
}

For every pixel in the image, the alpha, red, green, and blue channels are obtained. They are then combined into a single value to produce the corresponding gray pixel, which is finally stored back into the image's array of pixels.
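As a quick standalone sanity check of that per-pixel arithmetic (the masks and shifts below follow the usual ARGB integer layout, which this filter assumes):

```java
class GrayPixel {
    // Convert one ARGB pixel to its grayscale equivalent,
    // averaging the three color channels and keeping alpha.
    static int toGray(int rgb) {
        int alpha = (rgb >> 24) & 0xFF;
        int red   = (rgb >> 16) & 0xFF;
        int green = (rgb >> 8)  & 0xFF;
        int blue  =  rgb        & 0xFF;
        int grayLevel = (red + green + blue) / 3;
        return (alpha << 24) | (grayLevel << 16) | (grayLevel << 8) | grayLevel;
    }

    public static void main(String[] args) {
        // Opaque pixel with R=30, G=60, B=90 -> gray level (30+60+90)/3 = 60
        int pixel = (0xFF << 24) | (30 << 16) | (60 << 8) | 90;
        int gray = GrayPixel.toGray(pixel);
        int level = gray & 0xFF;
        if (level != 60) throw new AssertionError("unexpected gray level: " + level);
        System.out.println(level);
    }
}
```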

Since this algorithm can be executed in parallel, it is a perfect candidate for hardware acceleration with TornadoVM. To program the same algorithm with TornadoVM, we first use the @Parallel annotation to annotate the loops that can potentially run in parallel. TornadoVM will inspect the loops and analyze whether there is any data dependency between iterations. In this case, TornadoVM will specialize the code to use 2D indexing in OpenCL. For this example, the code looks as follows:

class Image {
  private static void grayScale(int[] image, final int w, final int s) {
    for (@Parallel int i = 0; i < w; i++) {
      for (@Parallel int j = 0; j < s; j++) {
        // ... same body as the sequential version ...
      }
    }
  }
}


Note that we introduce @Parallel for the two loops. After this, we need to instruct TornadoVM to accelerate this method. To do so, we create a task-schedule as follows:

TaskSchedule ts = new TaskSchedule("s0")
    .streamIn(imageRGB)
    .task("t0", Image::grayScale, imageRGB, w, s)
    .streamOut(imageRGB);

// Execute the task-schedule (blocking call)
ts.execute();

A task-schedule is an object that describes all the tasks to be accelerated. First, we pass a name to identify the task-schedule ("s0" in our case, but it could be any name). Then, we define which Java arrays we want to stream into the tasks. This call indicates to TornadoVM that we want to copy the contents of the array every time we invoke the execute method. Otherwise, if no variables are specified in streamIn, TornadoVM creates a cached read-only copy of all variables needed for the tasks' execution.

The next call is the task invocation. We can create as many tasks as we want within the same task-schedule. As we described in the previous section, each task references an existing Java method. The arguments to the task are as follows: first we pass a name (in our case "t0", but it could be any other name); then we pass either a lambda expression or a reference to a Java method. In our case, we pass the method grayScale from the Java class Image. Finally, we pass all the parameters of the method, as in any other method call.

After that, we need to indicate to TornadoVM which variables we want to synchronize back with the host (the main CPU). In our case, we want the input JPEG image to be updated with the accelerated grayscale one. Internally, this call forces a data movement in OpenCL from device to host, copying the data from the device's global memory to the Java heap residing in the host's memory. These four lines only declare the tasks and the variables to be used; nothing is executed until the programmer invokes the execute method.

Once we have created the program, we compile it with standard javac. The TornadoSDK (once TornadoVM is installed on the machine) provides utility commands that wrap javac with all the classpaths and libraries already set.


At runtime, we use the tornado command, which is, in fact, an alias for java with all the classpaths and flags required to run TornadoVM on top of the Java Virtual Machine (JVM). But before running with TornadoVM, let's check which parallel and heterogeneous devices are available on our machine. We can query this with the following command from the TornadoVM SDK:

$ tornadoDeviceInfo
Number of Tornado drivers: 1
Total number of devices  : 4
Tornado device=0:0
    NVIDIA CUDA -- GeForce GTX 1050
Tornado device=0:1
    Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Tornado device=0:2
    AMD Accelerated Parallel Processing -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Tornado device=0:3
    Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO

On this laptop, we have an NVIDIA 1050 GPU, an Intel CPU, and Intel integrated graphics (Dell XPS 15'', 2018). As shown, all of these devices are OpenCL compatible and all drivers are already installed. Therefore, TornadoVM can consider all of these devices available for execution. Note that, on this laptop, we have two devices targeting the Intel CPU: one using the Intel OpenCL driver and another using the AMD OpenCL driver for CPUs. If no device is specified, TornadoVM uses the default one (0:0). To run our application with TornadoVM, we simply type:

$ tornado Image


To discover which device our program is running on, we can query basic debug information from TornadoVM by using the --debug flag as follows:

$ tornado --debug Image
task info: s0.t0
    platform          : NVIDIA CUDA
    device            : GeForce GTX 1050 CL_DEVICE_TYPE_GPU (available)
    dims              : 2
    global work offset: [0, 0]
    global work size  : [3456, 4608]
    local  work size  : [864, 768]

This means that we used the NVIDIA 1050 GPU (the one available in our laptop) to run this Java program. What happened underneath is that TornadoVM compiled the Java method grayScale to OpenCL at runtime and ran it on the available OpenCL-supported device, in this case the NVIDIA GPU. Additional information from the debug mode includes how many threads were used as well as the block size (local work size). This is automatically decided by the TornadoVM runtime and depends on the input size of the application. In our case, we used an input image of 3456x4608 pixels.

So far we have managed to get a Java program running automatically and transparently on a GPU. That is great, but what about performance? We are using an Intel i7-7700HQ CPU on our testbed laptop. The time it takes to run the sequential code with this input image is 1.32 seconds. On the GTX 1050 NVIDIA GPU it takes 0.017 seconds, which is 81x faster at processing the same image.
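Timings like these are easy to reproduce for the sequential baseline with a plain System.nanoTime() harness. The sketch below runs the sequential filter over a synthetic image (the image size and fill value are illustrative; absolute numbers will of course differ per machine):

```java
class TimeGrayScale {
    // Sequential grayscale filter, same shape as the article's example
    static void grayScale(int[] image, final int w, final int s) {
        for (int i = 0; i < w; i++) {
            for (int j = 0; j < s; j++) {
                int rgb = image[i * s + j];
                int alpha = (rgb >> 24) & 0xFF;
                int red   = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8)  & 0xFF;
                int blue  =  rgb        & 0xFF;
                int grayLevel = (red + green + blue) / 3;
                image[i * s + j] = (alpha << 24) | (grayLevel << 16)
                                 | (grayLevel << 8) | grayLevel;
            }
        }
    }

    public static void main(String[] args) {
        final int w = 1024, s = 1024;
        int[] image = new int[w * s];
        java.util.Arrays.fill(image, 0xFF336699); // synthetic input image
        long start = System.nanoTime();
        grayScale(image, w, s);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // After the filter, every pixel should hold equal R, G, and B channels
        int p = image[0];
        if (((p >> 16) & 0xFF) != (p & 0xFF)) throw new AssertionError("not gray");
        System.out.println("sequential run: " + elapsedMs + " ms");
    }
}
```

For the accelerated version, the same harness applies around ts.execute(); note that the first execution includes JIT compilation to OpenCL, so steady-state timings need a warm-up run.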


We can also change the device on which to run at runtime by passing the flag -D<taskScheduleName>.<taskName>.device=0:X.

For example, the following snippet shows how to run TornadoVM on the Intel integrated GPU:

$ tornado --debug -Ds0.t0.device=0:3 Image
task info: s0.t0
    platform          : Intel(R) OpenCL HD Graphics
    device            : Intel(R) Gen9 HD Graphics NEO CL_DEVICE_TYPE_GPU (available)
    dims              : 2
    global work offset: [0, 0]
    global work size  : [3456, 4608]
    local  work size  : [216, 256]

By running on all devices, we obtain the following speedup graph. The first bar shows the baseline (the sequential Java code with no acceleration), which is 1. The second bar shows the speedup of TornadoVM against the baseline when running on a multi-core (4-core) CPU. The last bars correspond to the speedups on an integrated GPU and a dedicated GPU. By running this application with TornadoVM, we get up to 81x performance improvement (NVIDIA GPU) over sequential Java, and up to 62x when running on the Intel integrated graphics card. Notice that in the multi-core configuration, TornadoVM is superlinear (27x on a 4-core CPU). This is because the generated OpenCL C code can exploit the vector instructions available per core on the CPU, such as the AVX and SSE registers.


Use cases

The previous section showed an example of a simple application in which a fairly common photography filter is accelerated. However, TornadoVM's functionality extends beyond simple programs. For example, TornadoVM can currently accelerate machine learning and deep learning applications, computer vision, physics simulations, and financial applications.

SLAM Applications

TornadoVM has been used to accelerate a complex computer vision application (Kinect Fusion) on GPUs, written in pure Java and containing around 7k lines of Java code. This application records a room with the Microsoft Kinect camera, and the goal is to perform its 3D space reconstruction in real time. To achieve real-time performance, the room must be rendered at a minimum of 30 frames per second (fps). The original Java version achieves 1.7 fps, while the TornadoVM version running on a GTX 1050 NVIDIA GPU achieves up to 90 fps. The TornadoVM version of the Kinect Fusion application is open-source and available on GitHub.


Machine Learning for the UK National Health Service (NHS)

Exus Ltd. is a company based in London which is currently improving the UK NHS system by providing predictions of patients' hospital readmissions. To do so, Exus has been correlating patients' data comprising their profiles, characteristics, and medical conditions. The algorithm used for prediction is a standard logistic regression with millions of elements as data sets. So far, Exus has accelerated the training phase of the algorithm with TornadoVM for 100K patients, from 70 seconds (the pure Java application) to only 7 seconds (a 10x performance improvement). Moreover, they have demonstrated that, using a dataset of two million patients, the execution with TornadoVM improves by 14x.
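The training kernel in such a workload is, at its core, a data-parallel gradient computation over all patients. The sketch below is not Exus's code, just a minimal batch-gradient step for logistic regression of the shape TornadoVM can parallelize (the outer loop over samples is independent up to the gradient reduction, the natural place for @Parallel/@Reduce):

```java
class LogisticStep {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // One batch gradient-descent step over all samples.
    static void step(double[][] x, int[] y, double[] weights, double lr) {
        int n = x.length, d = weights.length;
        double[] grad = new double[d];
        for (int i = 0; i < n; i++) {            // data-parallel over samples
            double z = 0;
            for (int j = 0; j < d; j++) z += weights[j] * x[i][j];
            double err = sigmoid(z) - y[i];
            for (int j = 0; j < d; j++) grad[j] += err * x[i][j]; // reduction
        }
        for (int j = 0; j < d; j++) weights[j] -= lr * grad[j] / n;
    }

    public static void main(String[] args) {
        // Tiny linearly separable set: label 1 iff the single feature > 0
        double[][] x = {{-2}, {-1}, {1}, {2}};
        int[] y = {0, 0, 1, 1};
        double[] w = {0};
        for (int iter = 0; iter < 2000; iter++) step(x, y, w, 0.5);
        if (w[0] <= 0) throw new AssertionError("weight should be positive");
        System.out.println(sigmoid(2 * w[0]) > 0.9);
    }
}
```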

Physics Simulation

We have also experimented with synthetic benchmarks and computations commonly used in physics simulation and signal processing, such as NBody and DFT. In these cases we have observed speedups of up to 4500x using an NVIDIA GP100 GPU (Pascal microarchitecture) and up to 240x using an Intel Nallatech 385a FPGA. These types of applications are computationally intensive, and the bottleneck is the kernel processing time. Thus, having a powerful parallel device specialized for these kinds of computation helps increase the overall performance.
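For reference, a DFT kernel of this kind is a doubly nested loop whose outer iterations are independent, exactly the shape that the @Parallel annotation targets. A minimal sequential version (illustrative names, not the benchmark's exact code):

```java
class Dft {
    // Naive O(n^2) DFT of a real-valued signal; every output bin k
    // is independent, so the outer loop parallelizes trivially
    // (under TornadoVM, the k loop would carry @Parallel).
    static void dft(float[] in, float[] outReal, float[] outImag) {
        int n = in.length;
        for (int k = 0; k < n; k++) {
            float sumReal = 0, sumImag = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                sumReal += in[t] * Math.cos(angle);
                sumImag -= in[t] * Math.sin(angle);
            }
            outReal[k] = sumReal;
            outImag[k] = sumImag;
        }
    }

    public static void main(String[] args) {
        // DFT of a constant signal concentrates all energy in bin 0
        float[] in = {1, 1, 1, 1};
        float[] re = new float[4], im = new float[4];
        dft(in, re, im);
        if (Math.abs(re[0] - 4f) > 1e-4) throw new AssertionError("bin 0 should be 4");
        if (Math.abs(re[1]) > 1e-4) throw new AssertionError("bin 1 should be ~0");
        System.out.println(Math.round(re[0]));
    }
}
```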

Present and future of TornadoVM

TornadoVM is currently a research project at the University of Manchester. In addition, TornadoVM is part of the European Horizon 2020 E2Data project, in which TornadoVM is being integrated with Apache Flink (a Java framework for batch and stream data processing) to accelerate typical map-reduce operations on heterogeneous and distributed-memory clusters.

TornadoVM currently supports compilation and execution on a wide variety of devices, including Intel and AMD CPUs, NVIDIA and AMD GPUs, and Intel FPGAs. We have ongoing work to also support Xilinx FPGAs; with this feature, we aim to cover all the current offerings of cloud providers. Additionally, we are integrating more compiler and runtime optimizations, such as the use of device memory tiers and of virtual shared memory, to reduce the total execution time and increase overall performance.


In this article, we discussed TornadoVM, a plug-in for OpenJDK for accelerating Java programs on heterogeneous devices. First, we described how TornadoVM compiles and executes code on heterogeneous hardware such as a GPU. Then we presented an example of programming and running TornadoVM on different devices, including a multi-core CPU, an integrated GPU, and a dedicated NVIDIA GPU. Finally, we showed that, with TornadoVM, developers can achieve high performance while keeping their applications completely hardware agnostic. We believe that TornadoVM presents an interesting approach in which the code to be added is easy to read and maintain, and at the same time it can offer high performance if parallel hardware is available in a system.

More information regarding the technical aspects of TornadoVM can be found below:


Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, James Clarkson, and Christos Kotselidis. 2019. Dynamic application reconfiguration on heterogeneous hardware. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2019). ACM, New York, NY, USA, 165-178.

Juan Fumero and Christos Kotselidis. 2018. Using compiler snippets to exploit parallelism on heterogeneous hardware: a Java reduction case study. In Proceedings of the 10th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL 2018). ACM, New York, NY, USA, 16-25.

James Clarkson, Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, Christos Kotselidis, and Mikel Luján. 2018. Exploiting high-performance heterogeneous hardware for Java programs using Graal. In Proceedings of the 15th International Conference on Managed Languages & Runtimes (ManLang '18). ACM, New York, NY, USA, Article 4, 13 pages.

TornadoVM with Juan Fumero.

Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, and Mikel Luján. 2017. Heterogeneous Managed Runtime Systems: A Computer Vision Case Study. SIGPLAN Not. 52, 7 (April 2017), 74-82.


This work is partially supported by the European Union's Horizon 2020 E2Data 780245 and ACTiCLOUD 732366 grants. Special thanks to Gerald Mema from Exus for reporting the NHS use case.

