GPU Computing using jCUDA

Sunday, September 20, 2009

jCUDA provides access to CUDA for Java programmers, exploiting the full power of GPU hardware from Java-based applications. Using jCUDA you can create cross-platform CUDA solutions that run on any operating system supported by CUDA without changing your code.

What do I need?
  1. Recent NVIDIA hardware with a CUDA driver
  2. Java SDK
  3. Eclipse (optional)
  4. jCUDA library
  5. Visual Studio, which you will probably need if you are on the Windows platform

What are the magic steps?

1. Download and install all the requirements

Check whether you have a CUDA-enabled graphics card. If you don't have one, you will need to buy one. Then install the CUDA driver for your graphics card and the Java SDK.
Finally, download the jCUDA library and use it together with your favorite IDE (if you have one).

2. Write your Java application

You can write your Java application as usual and access your GPU through the jCUDA library. However, you first need to learn how to use it. This example illustrates how to detect the available GPUs and how to execute your CUDA program on them.

Note that the CUDA program is stored inside a folder called resources, and be careful with the number of bytes occupied by pointers. The size of a pointer usually corresponds to the bit architecture of the target machine: if the compiler targets a 32-bit architecture, a pointer occupies 32 bits (4 bytes), and if it targets a 64-bit architecture, a pointer occupies 64 bits (8 bytes).
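Because the parameter offsets used later depend on that pointer width, it can help to detect it at runtime instead of hard-coding 4 or 8. A minimal sketch in plain Java; the `sun.arch.data.model` property is a common but JVM-specific way to query the data model, so this is an assumption rather than a guaranteed API:

```java
public class PointerWidth {
    /**
     * Returns the assumed size of a native pointer in bytes,
     * based on the JVM's 32- or 64-bit data model.
     */
    public static int pointerSizeBytes() {
        // "sun.arch.data.model" is "32" or "64" on HotSpot JVMs;
        // fall back to inspecting "os.arch" when the property is absent.
        String model = System.getProperty("sun.arch.data.model");
        if (model == null) {
            model = System.getProperty("os.arch").contains("64") ? "64" : "32";
        }
        return "64".equals(model) ? 8 : 4;
    }

    public static void main(String[] args) {
        System.out.println("Pointer size: " + pointerSizeBytes() + " bytes");
    }
}
```

With this, the pointer parameter's offset below could be written as `offset = PointerWidth.pointerSizeBytes();` instead of a hard-coded 8.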
CUDA cuda = new CUDA(true);

int count = cuda.getDeviceCount();
if (count == 0) {
 System.out.println("No GPU devices found");
 return;
}

System.out.println("Total number of devices: " + count);
for (int i = 0; i < count; i++) {
 CUdevice dev = cuda.getDevice(i);
 String name = cuda.getDeviceName(dev);
 System.out.println("Name: " + name);
 int version[] = cuda.getDeviceComputeCapability(dev);
 System.out.println("Version: " + String.format("%d.%d", version[0], version[1]));
 CUdevprop prop = cuda.getDeviceProperties(dev);
 System.out.println("Clock rate: " + 
                      prop.clockRate + " kHz");
 System.out.println("Max threads per block: " + 
                      prop.maxThreadsPerBlock);
}

/** Select the 1st device */
CUdevice device = cuda.getDevice(0);

/** Create a context (necessary; the exact jCUDA signature may differ) */
cuda.createContext(0, device);

/** Load the module (method name assumed from the jCUDA API) */
File cubinFile = new File("resources", "sub_module.cubin");
cuda.loadModule(cubinFile.getAbsolutePath());

/** Get the function we want (method name assumed from the jCUDA API) */
cuda.getModuleFunction("subtract");

/** Now we should allocate the necessary memory */
int memSize = Integer.SIZE / 8 * 64;
CUdeviceptr devicePtr = cuda.allocate(memSize);

int originalData[] = new int[64];
for (int i = 0; i < originalData.length; i++)
 originalData[i] = i;

/** Copy the original array to device */
cuda.copy(devicePtr, originalData, memSize);

/**
 * Setup function parameters.
 * The 1st parameter is a pointer to device memory.
 * Note: on 64-bit platforms each pointer consumes 8 bytes,
 * on 32-bit only 4.
 */
int offset = 0;
int delta = 13;
cuda.setParameter(offset, devicePtr);

/** Change to 4 if you are using a 32-bit platform */
offset = 8;
cuda.setParameter(offset, delta);

/** A simple int scalar takes 4 bytes (Integer.SIZE / 8). */
offset += Integer.SIZE / 8;

/** Now, configure the execution configuration */
cuda.setFunctionBlockShape(originalData.length, 1, 1);

/** Set the total parameter size and launch
    (method names assumed from the jCUDA API) */
cuda.setParameterSize(offset);
cuda.launch();

/** Wait for all operations to complete */
cuda.synchronizeStream(new CUstream(0));

/** Copy results back */
cuda.copy(originalData, devicePtr, memSize);

/** Verify results.... */
boolean correct = true;
for (int i = 0; i < originalData.length; i++) {
 if (originalData[i] != (i - delta) * (i - delta)) {
  System.out.println("Error at " + i + ": " + originalData[i]);
  correct = false;
 }
}

if (correct)
 System.out.println("Test passed");
else
 System.out.println("Test failed");

/** Release resources (method name assumed from the jCUDA API) */
cuda.free(devicePtr);

3. Write your CUDA code

The following example is written in CUDA, which is very similar to the C language. If you can't understand this example, well, all I can tell you is that the result equals (src - amount) * (src - amount).
extern "C" __global__ void subtract(int *src, int amount){
 src[threadIdx.x] -= amount;
 src[threadIdx.x] *= src[threadIdx.x];
}

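The host-side verification loop in the Java code expects each element to become (src - amount) * (src - amount). As a sanity check, here is a plain-Java reference of the kernel's per-element math, useful for validating results without a GPU (a sketch, not part of jCUDA):

```java
public class SubtractReference {
    /** CPU reference of the "subtract" kernel: out[i] = (src[i] - amount)^2. */
    public static int[] subtract(int[] src, int amount) {
        int[] out = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            int v = src[i] - amount;   // src[threadIdx.x] -= amount;
            out[i] = v * v;            // src[threadIdx.x] *= src[threadIdx.x];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = new int[64];
        for (int i = 0; i < data.length; i++) data[i] = i;
        int[] result = subtract(data, 13);
        System.out.println(result[0]);  // (0 - 13)^2 = 169
    }
}
```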
4. Compile your CUDA code

Before running the application, your CUDA code must be compiled to PTX, an assembly-like intermediate language, and then assembled into the cubin file format.

Here is the simplified compilation process:

If you are using Windows, you can execute the following command.
nvcc.exe -cubin -I $(SDK)/common/inc resources/
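If you would rather drive the compilation from your Java build itself, you can assemble the same nvcc command line and hand it to a ProcessBuilder. A sketch, assuming nvcc is on the PATH; the source file name "resources/sub_module.cu" is a placeholder, not a name from this post:

```java
import java.util.Arrays;
import java.util.List;

public class CompileCubin {
    /** Builds the nvcc command line used to produce a cubin file. */
    public static List<String> nvccCommand(String source) {
        return Arrays.asList("nvcc", "--cubin", source);
    }

    public static void main(String[] args) {
        List<String> cmd = nvccCommand("resources/sub_module.cu");
        // To actually run it (requires the CUDA toolkit installed):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("Command: " + String.join(" ", cmd));
    }
}
```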

If you prefer Ant, use this script together with your favorite IDE.
<?xml version="1.0" encoding="UTF-8"?>
<project name="GPGPU" basedir="." default="all">
 <target name="all">
  <exec executable="nvcc.exe">
   <arg value="resources/" />
   <arg value="--cubin" />
  </exec>
 </target>
</project>

5. Compile and run your Java application

Don't forget, GPUs are connected to the main processor by a relatively slow connection, the bus. For this reason, moving data on and off the GPU is an expensive task compared to performing calculations directly on the GPU. This can create critical bottlenecks, for instance when one wants to perform a calculation on the CPU using some data, then perform further calculations on the GPU, and then use the GPU's output on the CPU once more. The overhead introduced by data transfers through the bus can overwhelm the benefits of fast GPU computation.
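As a back-of-envelope illustration of that bottleneck, compare the time to move a buffer across the bus with the time to process it on the GPU. The bandwidth and throughput figures below are made-up, order-of-magnitude assumptions for illustration, not measurements:

```java
public class BusCost {
    /** Transfer time in seconds for `bytes` moved at `bytesPerSec`. */
    public static double transferSeconds(double bytes, double bytesPerSec) {
        return bytes / bytesPerSec;
    }

    public static void main(String[] args) {
        double mb = 1024.0 * 1024.0;
        double dataBytes = 64 * mb;           // 64 MB of input data
        double busBandwidth = 4000 * mb;      // assumed ~4 GB/s bus bandwidth
        double gpuThroughput = 100000 * mb;   // assumed on-chip processing rate

        // Data crosses the bus twice: host-to-device and device-to-host.
        double transfer = 2 * transferSeconds(dataBytes, busBandwidth);
        double compute = transferSeconds(dataBytes, gpuThroughput);
        System.out.printf("transfer: %.5f s, compute: %.5f s%n", transfer, compute);
        // Under these assumptions the round trip dwarfs the computation,
        // which is why batching as much work as possible on the GPU pays off.
    }
}
```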


Anonymous said...

Please rename the used CUDA function to "add" instead of "subtract", or correct the function's main statement

Bruno Simões said...

Thank you for your correction.

Anonymous said...

Have you compared performance with normal C Cuda? Any performance issues due to Java?


Bruno Simões said...

Hi Vince,

I don't have any comparison results on performance. However, I expect it to be slightly slower in Java. One of the main reasons is the JNI overhead. Besides that, you can do optimizations in C that are impossible in Java. Nevertheless, we should not forget that these two programming languages have different purposes.