Integrate CUDA with NetBeans + Windows 7 x64

Hello everyone. I’m trying to work with CUDA from Java, because I believe it is quicker to develop with than C code… however, I’m not able to even compile a simple sample like the piece of code below:

import java.util.*;

import jcuda.*;
import jcuda.jcublas.*;
import jcuda.jcudpp.*;
import jcuda.jcufft.*;
import jcuda.runtime.*;

public class Main
{
    public static void main(String args[])
    {
        System.out.println("Creating input data");

        // Create some input data
        int complexElements = 100;
        int floatElements = complexElements * 2;
        int memorySize = floatElements * Sizeof.FLOAT;
        float hostX[] = createRandomFloatData(floatElements);
        float hostY[] = createRandomFloatData(floatElements);

        System.out.println("Initializing device data using JCuda");

        // Allocate memory on the device using JCuda
        Pointer deviceX = new Pointer();
        Pointer deviceY = new Pointer();
        JCuda.cudaMalloc(deviceX, memorySize);
        // ...

with the output:
Creating input data
Error while loading native library with base name "JCudaRuntime"
Initializing device data using JCuda
Operating system name: Windows 7
Architecture : x86
Architecture bit size: 32
Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load native library
at jcuda.LibUtils.loadLibrary(LibUtils.java:79)
at jcuda.runtime.JCuda.assertInit(JCuda.java:225)
at jcuda.runtime.JCuda.cudaMalloc(JCuda.java:1513)
at cudatest.Main.main(Main.java:42)
Java Result: 1

I’ve downloaded all the .jar libs from the Windows x64 package into NetBeans, and copied all the DLLs that came in the same package to c:\windows\system32… NOTHING changed that exception. Can anyone help, please? :slight_smile:

Hello

In any case, to use CUDA or JCuda, you have to download the Developer Driver and CUDA Toolkit from http://developer.nvidia.com/object/cuda_3_1_downloads.html.

If you want to create your own kernels, you also need a C compiler (for example, Visual Studio). CUDA files are compiled with the "NVCC", the NVIDIA CUDA C Compiler, which requires a C compiler to be installed in the background.
But if you only want to use the runtime libraries (JCublas, JCufft and JCudpp), it should be enough to install the Driver and Toolkit.

BTW: You could put the JCuda DLLs into the root directory of your project (or at least into a directory that is visible via the PATH variable) - they do not have to be in windows/system32…

bye

Hi,

I’ve installed the NVIDIA dev driver, CUDA Toolkit 3.1 and CUDA SDK 3.1… the environment variables are set, because when I open a cmd line window and write "nvcc -V" it shows everything is OK. I also have Visual Studio 2010 and I’m able to compile and run a few SDK examples in it… even after writing down all the lib dependencies / includes in the VS properties, there are a lot of problems in most cases. But I can still use it sometimes :stuck_out_tongue:

The idea was to try and see if I was more successful with NetBeans + JCuda… I’ll try the same exercise on Linux right now, because there, at least with CUDA C, I am successful.

Sorry, it was me who posted… I just didn’t register :stuck_out_tongue:

Hi

I just noticed that it says
Architecture bit size: 32
so you are probably using a 32 bit VM. You might want to have a look at which JVM/JDK version is currently used in your NetBeans IDE: If it refers to something like
C:\Program Files (x86)\Java…
you’re using the wrong JVM. If you want to use the 64bit JCuda DLLs, you also have to use a 64bit JVM. This might already be installed on your system; in this case, you only have to point the path in your NetBeans IDE to the respective
C:\Program Files\Java…
directory.
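
BTW, a quick way to verify which JVM is actually running is to print some of its system properties. A minimal sketch (the "sun.arch.data.model" property is specific to Sun/Oracle JVMs):

public class ArchCheck
{
    public static void main(String args[])
    {
        // Prints "32" or "64" on Sun/Oracle JVMs
        System.out.println("Data model: " + System.getProperty("sun.arch.data.model"));
        // Architecture and installation directory of the running JVM
        System.out.println("os.arch   : " + System.getProperty("os.arch"));
        System.out.println("java.home : " + System.getProperty("java.home"));
    }
}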

bye

Thanks, thanks… that may be it.

As I said, I tried the same on Linux, and at first I encountered the same exception; putting the DLL equivalents (*.so files) in the root folder of the project didn’t solve it… it was actually only after I moved those libs to /usr/lib (which contains all the other shared libs) that I was able to compile and run the first example in the samples section, "JCudaRuntimeSample", successfully:

run:
Creating input data
Initializing device data using JCuda
Performing FFT using JCufft
Performing caxpy using JCublas
Performing scan using JCudpp
Result: 196.08002
BUILD SUCCESSFUL (total time: 2 seconds)

I will try your tip in windows.

OK then, under Windows it should also work with the correct JVM and the DLLs located in the project directory or in a directory visible via the PATH variable.

BTW: In a different thread, the problem that the library was not found could be solved by adding the following lines to the ~/.bash_profile file in the home directory:


LD_LIBRARY_PATH=$LD_LIBRARY_PATH:thePathContainingTheJCudaLibraries
export LD_LIBRARY_PATH
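
Alternatively, the path can also be passed directly to the JVM via the java.library.path property when starting the program - just a sketch, with the same placeholder path as above (JCuda’s LibUtils should ultimately go through System.loadLibrary, which consults this property):

java -Djava.library.path=thePathContainingTheJCudaLibraries -cp ... Main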

Hope that helps.

OK, it worked in Windows too. :slight_smile:
So, the next step was… trying to run some examples from jcuda.org, and I tried to run "KernelLauncherSample". It’s not that it went wrong, but… I think it didn’t go well. Why? Because the only output that I got was:
run:
Preparing the KernelLauncher…

and then nothing… the program keeps running and running like it was awaiting some feedback from the cl.exe compiler (I gave it the path to cl.exe inside VS 2010)… after a while I stopped the program and…

"BUILD STOPPED (total time: 16 minutes 51 seconds)"

awkward, isn’t it?

Another example that I can’t run without exceptions is "JCudaDriverGLSample". Some of the exceptions (tons of them…):
Exception in thread "AWT-EventQueue-0" java.lang.UnsatisfiedLinkError: C:\Windows\System32\jogl.dll: Can’t load IA 32-bit .dll on a AMD 64-bit platform

java.lang.reflect.InvocationTargetException

Caused by: java.lang.NoClassDefFoundError: Could not initialize class com.sun.opengl.impl.windows.WindowsGLDrawableFactory

Exception in thread "AWT-EventQueue-0" java.lang.NoClassDefFoundError: Could not initialize class com.sun.opengl.impl.windows.WindowsGLDrawableFactory

so on, so on… BTW, I’ve put the JCuda DLLs both in System32 and SysWOW64, just in case…

Hi

The KernelLauncher internally launches the NVCC on a specified .CU file. If something goes wrong there, it should actually print all the error messages that have been created by the NVCC, but in this case, it seems as if the NVCC got stuck somehow…
BTW: "BUILD STOPPED (total time: 16 minutes 51 seconds)" - You’re impressively patient :wink:
I’ve never encountered this message. It seems to be coming from the NVCC, but I cannot imagine what might prevent the NVCC from completing the compilation.
Did you try to create a KernelLauncher from an existing CUBIN file? The CUBIN may be created with
nvcc -cubin InputFile.CU -o OutputFile.cubin
It should then be possible to load the module using the KernelLauncher#load method.
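That is, roughly (the file and function name being placeholders here):

KernelLauncher k = KernelLauncher.load("OutputFile.cubin", "theFunctionName");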

Concerning the error message
Exception in thread "AWT-EventQueue-0" java.lang.UnsatisfiedLinkError: C:\Windows\System32\jogl.dll: Can’t load IA 32-bit .dll on a AMD 64-bit platform
It seems as if you also downloaded JOGL for a 32bit system. You’ll need JOGL 2.0 for a 64bit system - as far as I know, the latest builds are those in this archive. JOGL has moved several times recently, and should actually be located at Jogamp, but there don’t seem to be any binaries available…

bye

“Did you try to create a KernelLauncher from an existing CUBIN file? The CUBIN may be created with
nvcc -cubin InputFile.CU -o OutputFile.cubin
It should then be possible to load the module using the KernelLauncher#load method.”

I didn’t catch the idea but… another one popped up: is it possible to write a kernel function in C, save it in a .cu file like the one attached below, and import and use it in Java code? It would be a lot easier than putting the source code in String form.

Other thing: in jcuda.runtime, how can I create a pointer to a matrix of int (int[][])… for example, to use in cudaMalloc?

#ifndef FUNCTIONS_KERNEL_H
#define FUNCTIONS_KERNEL_H

#define TILE_WIDTH 8

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[tx][ty] = Md[(m*TILE_WIDTH + tx)*Width + Row];
        Nds[tx][ty] = Nd[Col*Width + (m*TILE_WIDTH + ty)];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[tx][k] * Nds[k][ty];
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}

#endif

This is not directly possible, but using the KernelLauncher class, it IS possible (if the NVCC does not hang up, as it did in your test).
BTW: I should mention that the KernelLauncher is only a small sample. I’m currently updating it and integrating it into a utility package, but in any case it should not be considered an "official part" of JCuda.
However, the current version of the KernelLauncher class offers 3 static methods:

KernelLauncher#compile(sourceCode, functionName, nvccArguments):
This method is used in the sample. It takes the source code as a String, stores it in a temporary .CU file, compiles this .CU file using the NVCC, and loads the resulting CUBIN file.

KernelLauncher#create(cuFileName, functionName, nvccArguments):
This is the method you may be looking for: It takes the name of a .CU file, compiles this .CU file using the NVCC, and loads the resulting CUBIN file. So you might call something like


KernelLauncher k = KernelLauncher.create("MatrixMult.cu", "MatrixMulKernel");

to create the launcher for the MatrixMulKernel - hopefully, the NVCC does not hang up again…
Note: In order to access the function with the given name, it has to be declared as extern "C" in the source file!


**extern "C"**
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{ 
...

The last method of the KernelLauncher is the one that I referred to in my previous post:
KernelLauncher#load(cubinFileName, functionName):
This function takes the name of an existing CUBIN file and loads it. So if you have already compiled your matrix multiplication kernel into a CUBIN file (and if you have declared the function as extern "C"), then you may call


KernelLauncher k = KernelLauncher.load("MatrixMult.cubin", "MatrixMulKernel");

to create a KernelLauncher for this function.

Other thing: in jcuda.runtime, how can I create a pointer to a matrix of int (int[][])… for example, to use in cudaMalloc?

I already wrote a little about this topic in this thread, maybe it will answer your question.
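
In short, and only as a minimal sketch (not an official snippet): a Java int[][] is not one contiguous block of memory, so the usual approach is to flatten the matrix into a 1D array and copy that to the device:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

// Assumed example dimensions and host data
int rows = 4;
int cols = 4;
int hostMatrix[][] = new int[rows][cols];

// Flatten the 2D array into one contiguous 1D array
int hostFlat[] = new int[rows * cols];
for (int r = 0; r < rows; r++)
{
    for (int c = 0; c < cols; c++)
    {
        hostFlat[r * cols + c] = hostMatrix[r][c];
    }
}

// Allocate device memory and copy the flattened data
Pointer deviceMatrix = new Pointer();
JCuda.cudaMalloc(deviceMatrix, rows * cols * Sizeof.INT);
JCuda.cudaMemcpy(deviceMatrix, Pointer.to(hostFlat),
    rows * cols * Sizeof.INT, cudaMemcpyKind.cudaMemcpyHostToDevice);

The kernel then addresses the matrix with an index like row * Width + col, just as the MatrixMulKernel above does.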

I think it’s time to set up an Installation Guide and a general FAQ :o …

Thanks for the tips. Of course, for someone who knows CUDA C and tries to do the equivalent things with JCuda without knowing how, it becomes a little harsh, especially when that someone likes pointers :stuck_out_tongue: An installation guide is always welcome for newcomers… but the real help would be giving us the capability to write JCuda programs + kernel algorithms in CUDA C style (kernels in separate .cu files or even in Java files) to ease the learning process, since the big thing here is to optimize functions / algorithms in existing programs (Java programs in this case)

Maybe I don’t understand, but… that’s already possible. When you have an existing program and want to optimize one part using CUDA, then you can write the CUDA kernel into a separate .CU file (the same way as you would in C), then compile this kernel into a CUBIN file, and load and execute it using the driver API. Loading and executing it is (intentionally) very similar in CUDA C and in JCuda. But if you meant that it may be tedious to use the Driver API to launch your own kernels: That’s true. In C, you can use the runtime API, and call kernels like


someKernel<<<g,b,m,s>>>(arg0, arg1);

whereas in the Driver API you have to manually set up the arguments and grid size and all that, which might involve 20 or 40 lines of code. But the intention behind the KernelLauncher was to simplify exactly that: The kernel call mentioned above could be written as


KernelLauncher k = KernelLauncher.load(cubinFile, "someKernel");
k.call(g,b,m,s, arg0, arg1);

which is very similar to the runtime API call.
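
For the matrix kernel discussed above, this would look roughly like this (only a sketch: gridSize, blockSize, sharedMem and stream correspond to the g, b, m, s above, and dMd, dNd, dPd and width are assumed to be already initialized device pointers and the matrix size):

KernelLauncher k = KernelLauncher.load("MatrixMult.cubin", "MatrixMulKernel");
k.call(gridSize, blockSize, sharedMem, stream, dMd, dNd, dPd, width);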

[QUOTE=Marco13]
But if you meant that it may be tedious to use the Driver API to launch own kernels: That’s true. In C, you can use the runtime API, and call kernels like

someKernel<<<g,b,m,s>>>(arg0, arg1);

whereas in the Driver API you have to manually set up the arguments and grid size and all that, which might involve 20 or 40 lines of code. [/QUOTE]
That is right. Between the Driver and the Runtime API I would rather choose the Runtime one… not because I’m not capable of working with the Driver API, but because I think I don’t need to reinvent the wheel, despite all the liberty the Driver API offers, don’t you think? And somehow I don’t think it will be easier to write my future CUDA algorithms in JCuda if I have to 1) compile the .cu file first and 2) work with the Runtime API but sometimes switch to the Driver API - which emphasises the catching-up JCuda has to do compared to CUDA C.

Despite all this, I am grateful to all the contributors of JCuda… it’s an excellent first step for other wrappers.

Of course, if you have the choice between C and Java for a new CUDA project, or if you only intend to write some kernels to get in touch with CUDA in general, using C has some advantages. Some language-specific limitations of Java, and the necessity to have a C compiler running in the background to support the NVCC, do not make it easier.

However, I never really understood why NVIDIA made this separation between the runtime and the driver API. Both APIs have roughly 100 functions with exactly the same functionality. The only difference seems to be that the driver API has some additional ones for device management, and the runtime API allows the simplified kernel<<<…>>> calls. I think it should have been possible to combine this in one API. But I’m not a compiler expert or anything - they probably had their reasons for it… -_-

The reason lies in here:
The Runtime API was built on top of the Driver API… probably to ease the job of the NVIDIA engineers: those who develop the hardware or the CUDA architecture itself probably use the Driver API, while all others (which includes us outsiders) may use the Runtime API… The Runtime API should abstract the hardware details away from us, or at least that is my opinion :)

EDIT: Oh and… no, I don’t wanna "mess around" with CUDA… I really want to learn it, because I’m building my thesis on it. I believe it will be a little messy in the beginning (as you can see in the matrix multiplication example), but I’m already in the process of seeking the "best" tool for developing CUDA algorithms for future apps involving image processing and other things: whether in C/C++ or in Java.

That’s why I’m thanking you in advance for all the patience and knowledge that you’ve been sharing in here ^.^

Yes, I know this image, and maybe this is the intention behind the runtime API. But… to explain what I mean, and just to pick out a small example:

Allocating device memory and filling it with host data using the Driver API:

CUdeviceptr deviceData = NULL;
cuMemAlloc(&deviceData , n * sizeof(float));
cuMemcpyHtoD(deviceData , hostData, n * sizeof(float));

Allocating device memory and filling it with host data using the Runtime API:

float *deviceData = NULL;
cudaMalloc(&deviceData , n * sizeof(float));
cudaMemcpy(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
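
(In JCuda, by the way, the driver version of this pair would look roughly like this - only a sketch, with n and hostData as above:)

// import jcuda.Pointer; import jcuda.Sizeof;
// import static jcuda.driver.JCudaDriver.*;
CUdeviceptr deviceData = new CUdeviceptr();
cuMemAlloc(deviceData, n * Sizeof.FLOAT);
cuMemcpyHtoD(deviceData, Pointer.to(hostData), n * Sizeof.FLOAT);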

I could not say that either of them was “simpler”. And although this is only a small example, it goes through the whole API. So it’s not only the functions that are nearly equal in both APIs, it’s also the structures
cudaDeviceProp <-> CUdevprop
cudaArray <-> CUarray
cudaGraphicsResource <-> CUgraphicsResource

And even the combinations thereof:
cudaMemcpy3D(cudaMemcpy3DParms p) <-> cuMemcpy3D(CUDA_MEMCPY3D pCopy)
and the “cudaMemcpy3DParms” and “CUDA_MEMCPY3D” are huge structures, which essentially contain the same information in both APIs, just with minor naming differences.

I could understand this if the driver API had, say, 250 cryptic low-level functions with confusing, complicated arguments, and the runtime API only had 50 easy-to-use functions exposing the "most commonly used" functionality. But the largest parts of both APIs are essentially equal. I don’t see the advantage of having two separate APIs that mainly differ in function names, but not in functionality.

But as I mentioned, I don’t criticise this, and if in doubt, I assume that there have been reasons for that which I simply don’t understand…

EDIT: You mentioned Image Processing: Then you probably know NPP? It supports many general image processing functions …

Yes, indeed you are right! Really… but once again, it’s better to have 2 different functions doing almost the same thing than to have none :smiley: