Another JCuda library?

Hi! I found an example program for JCuda, and I wonder whether it refers to another JCuda library.
I have problems running the code; for example, the CUDA class doesn’t exist.

http://www.think-techie.com/2009/09/gpu-computing-using-jcuda.html

Hello,

jCUDA is a different library from JCuda, and the two are completely unrelated. There is also another library called JCUDA… Fortunately, Java is case-sensitive :wink:

The library used in that link no longer seems to be maintained…? This forum is for the JCuda from http://jcuda.org/. I’m currently updating it for CUDA 4.0, and hopefully the update can be uploaded soon - there have been quite a lot of changes…

bye

I have looked at CUDA 4.0, and it looks interesting, particularly the way you can stream multiple BLAS calls. The documentation talks about better ways of calling lots of gemm’s on lots of small matrices (but the documentation seems to be cut short somehow…)

See http://developer.download.nvidia.com/compute/cuda/4_0_rc2/toolkit/docs/CUBLAS_Library.pdf

I’m excited. I have been doing lots of tests with JCUBLAS multiplying a matrix by its transpose, and the matrices have to get quite large before JCUBLAS competes with a well-written multi-threaded Java implementation. The bottleneck, of course, is getting the data in and out.

I can’t see any evidence that any matrix factorisation will be provided, however… although in my experience that is not a bottleneck for small matrices. If we can stream:

matrix multiply in CUBLAS => matrix factorisation in threaded Java => apply the factors to multiple right-hand sides in CUBLAS

in an asynchronous loop, doing the above for thousands of small matrices, that would be fine.
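
Roughly the shape I have in mind (only a rough sketch on my part; the Problem class, factorizeOnCpu and the sizes n, m, nRhs, nCpuThreads are placeholders, while the JCublas calls themselves are the existing API):

ExecutorService pool = Executors.newFixedThreadPool(nCpuThreads);
CompletionService<Problem> factored = new ExecutorCompletionService<Problem>(pool);

for (final Problem p : problems) {
    // 1) C = A^T * A on the GPU (A is m x n, column-major, already on the device)
    JCublas.cublasDgemm('T', 'N', n, n, m, 1.0, p.dA, m, p.dA, m, 0.0, p.dC, n);
    JCublas.cublasGetVector(n * n, Sizeof.DOUBLE, p.dC, 1, Pointer.to(p.hC), 1);

    // 2) factorise on the CPU while the GPU moves on to the next problem
    factored.submit(new Callable<Problem>() {
        public Problem call() {
            factorizeOnCpu(p.hC, n); // placeholder: e.g. a Cholesky factor U in the upper triangle
            return p;
        }
    });
}

// 3) as factorisations complete, apply the factors to many right-hand sides on the GPU
try {
    for (int i = 0; i < problems.size(); i++) {
        Problem p = factored.take().get();
        JCublas.cublasSetVector(n * n, Sizeof.DOUBLE, Pointer.to(p.hC), 1, p.dC, 1);
        // C = U^T * U, so solve U^T * Y = B and then U * X = Y
        JCublas.cublasDtrsm('L', 'U', 'T', 'N', n, nRhs, 1.0, p.dC, n, p.dB, n);
        JCublas.cublasDtrsm('L', 'U', 'N', 'N', n, nRhs, 1.0, p.dC, n, p.dB, n);
    }
} catch (Exception e) {
    throw new RuntimeException(e);
}
pool.shutdown();

The factorisations run on plain Java threads, so the GPU would be free to start the next multiplication while the previous factorisation is still in flight.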

Maybe the dream is not so far away.

Indeed, there is some interesting news in CUDA 4.0. Some of it requires more refactoring than I had expected at first. Although the changes are not so obvious in the API, internally they are difficult to accomplish - not so much conceptually, because they are similar to things that I have already added in JOCL, but JCuda has already… “grown” for a while now :o Maintaining it could become a full-time job, and I have several other projects running…

The non-blocking operations are obviously becoming increasingly important - especially when people have 2 or 4 GPUs in their PC. At the moment, they are NOT supported by JCuda. I hoped to have the chance to (at least basically) support them in the new version, but it might be the case that the first version of JCuda 0.4 (“beta”) will not yet officially support them, because I first have to get the API changes right.

Even for later versions, non-blocking operations will most likely only be possible with direct buffers. Most JNI binding libraries (JOGL and Jogamp-JOCL, LWJGL, JavaCL etc.) have this requirement anyhow. I always wanted to keep the possibility of using simple Java arrays as well, because they are far more convenient to use on the Java side. But the JNI specification leaves a lot of freedom in how the virtual machine may behave: things like garbage collection or the relocation of arrays could make it impossible to use arrays for non-blocking operations. In any case, I’ll try to find an appropriate solution, even if it has to be based on direct buffers.
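
To illustrate the difference on the Java side (only a small sketch using java.nio, with an arbitrary size n):

int n = 100000;

// A plain Java array: convenient, but the VM is free to move it around,
// so it can only be used safely with blocking copies
float hostArray[] = new float[n];
Pointer pArray = Pointer.to(hostArray);

// A direct buffer: allocated outside the Java heap, its address stays
// fixed, which is the property that non-blocking (asynchronous) copies rely on
FloatBuffer hostBuffer = ByteBuffer
    .allocateDirect(n * Sizeof.FLOAT)
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer();
Pointer pBuffer = Pointer.to(hostBuffer);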

I understand. It always amazes me what some people will do for free in their spare time. Maybe Nvidia can take over maintenance of JCuda some time in the future?

I am in no rush at all. I have done a lot of testing of current JCuda, and realise it does not give obvious benefits at present, so I will put the work on hold and revisit later in the year.

I have a test suite which tests lots of different ways to multiply matrices, including BLAS, and my hand-crafted threaded Java version comes out fastest every time on the sizes of matrices I deal with. It beats any other Java library I have looked at (e.g. Jama), even with the single-threaded version.

Well, there is still the intention to put JCuda into a public SVN; maybe then there will be further contributors. At the moment, most contributions consist of pre-built binaries for the different OSes (I’d be lost without these). But there are way too many tasks to be done at the moment.

So, if the plain Java version is fastest, then the matrices are probably relatively small? (Or you did not tweak your benchmark appropriately :wink: )

I think my code is pretty well optimised. I am multiplying transpose(A) x A and storing the result in C. I am modelling a workload that repeats nIts times with the same dimensions, so allocation is only done once - obviously, freeing, initialising, allocating etc. are done outside the loop. I’ve omitted some obvious things; it is based on your sample. My best Java method uses a thread pool and a CompletionService (of course created outside the iteration loop) and has numerous detailed code optimisations from many days of careful testing; it easily beats Jama.
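
For reference, the threaded version has roughly this structure (only a sketch of the shape, using java.util.concurrent and arbitrary sizes; the real code contains many more low-level optimisations):

final int nRows = 256;
final int nCols = 100;
final double[][] a = new double[nRows][nCols];  // input, filled elsewhere
final double[][] c = new double[nCols][nCols];  // result: c = a^T * a

int nThreads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(nThreads);
CompletionService<Void> completionService =
    new ExecutorCompletionService<Void>(executor);

// one task per stripe of result columns
for (int t = 0; t < nThreads; t++) {
    final int first = t * nCols / nThreads;
    final int last = (t + 1) * nCols / nThreads;
    completionService.submit(new Callable<Void>() {
        public Void call() {
            for (int i = first; i < last; i++) {
                for (int j = 0; j <= i; j++) {
                    double sum = 0;
                    for (int k = 0; k < nRows; k++) {
                        sum += a[k][i] * a[k][j];
                    }
                    c[i][j] = sum;
                    c[j][i] = sum;  // the result is symmetric
                }
            }
            return null;
        }
    });
}

// wait for all stripes to complete
try {
    for (int t = 0; t < nThreads; t++) {
        completionService.take();
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
executor.shutdown();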

Using Ddot within the Java matrix multiplication was hopeless!

int nRows = 32;
int nIts = 4096;
int nCols = 100;
while (nIts > 0) {

    JCublas.cublasInit();
    current = System.currentTimeMillis();
    Pointer d_A = new Pointer();
    Pointer d_C = new Pointer();
    JCublas.cublasAlloc(nRows * nCols, Sizeof.DOUBLE, d_A);
    JCublas.cublasAlloc(nCols * nCols, Sizeof.DOUBLE, d_C);
    for (int j = 0; j < nIts; j++) {
        // upload A, compute C = alpha * A^T * A + beta * C with DSYRK, download C
        JCublas.cublasSetVector(nRows * nCols, Sizeof.DOUBLE, Pointer.to(h_A), 1, d_A, 1);
        JCublas.cublasDsyrk('U', 'T', nCols, nRows, alpha, d_A, nCols, beta, d_C, nCols);
        JCublas.cublasGetVector(nCols * nCols, Sizeof.DOUBLE, d_C, 1, Pointer.to(h_C), 1);
    }
    JCublas.cublasFree(d_A);
    JCublas.cublasFree(d_C);
    JCublas.cublasShutdown();

    // double the rows, halve the iteration count for the next size
    nRows = nRows * 2;
    nIts = nIts / 2;
}

Well, this is using double precision. The double-precision speed of GPUs lags far behind their single-precision speed, but it’s probably just a matter of time until the performance of a GPU for this task is higher than that of Java, even in double precision.

The matrices are not as small as I expected. But it might also be the case that the frequent setVector/getVector calls eat up whatever speedup is achieved in the multiplication itself…

I am doing some more tests with single precision as well.

First, a VERY important point: single precision does more checking, and I found a bug in my code. You have to be VERY careful to get the rows and columns the right way round. Single precision gives error messages, double precision does not!!!
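
For anyone else who runs into this: as far as I can see, one can also make such problems visible explicitly (regardless of single or double precision), either by checking the status after a call or by letting JCublas throw exceptions. A small sketch:

// query the status of the last CUBLAS operation...
int status = JCublas.cublasGetError();
if (status != cublasStatus.CUBLAS_STATUS_SUCCESS) {
    System.err.println("CUBLAS error: " + cublasStatus.stringFor(status));
}

// ...or let JCublas throw a CudaException as soon as a call fails
JCublas.setExceptionsEnabled(true);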

Preliminary results do not show much difference between single and double precision, which is interesting. The bottleneck is probably some initialisation, not the actual data transfer or the actual calculations. But I need to check it later this weekend.
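
To separate those costs, I will probably time the phases individually, synchronising before each clock reading - roughly like this (a sketch for the single-precision case, assuming h_A/h_C are float arrays and d_A/d_C were allocated with Sizeof.FLOAT; cudaThreadSynchronize comes from jcuda.runtime.JCuda):

long t0 = System.currentTimeMillis();
JCublas.cublasInit();                                   // one-time initialisation
long t1 = System.currentTimeMillis();

JCublas.cublasSetVector(nRows * nCols, Sizeof.FLOAT,
    Pointer.to(h_A), 1, d_A, 1);                        // host -> device transfer
long t2 = System.currentTimeMillis();

// C = A^T * A; A is nRows x nCols column-major, so its leading dimension is nRows
JCublas.cublasSsyrk('U', 'T', nCols, nRows, 1.0f, d_A, nRows, 0.0f, d_C, nCols);
JCuda.cudaThreadSynchronize();                          // wait for the kernel to finish
long t3 = System.currentTimeMillis();

JCublas.cublasGetVector(nCols * nCols, Sizeof.FLOAT,
    d_C, 1, Pointer.to(h_C), 1);                        // device -> host transfer
long t4 = System.currentTimeMillis();

System.out.printf("init %d ms, upload %d ms, syrk %d ms, download %d ms%n",
    t1 - t0, t2 - t1, t3 - t2, t4 - t3);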

Makes me wonder whether there are other single/double differences.

I am using a 330M GPU, on a Toshiba i7 laptop.

BTW, this is the same person as the unregistered poster above - I forgot to log in.

Hmmmm, at first glance single precision is slower than double. I will have to check carefully, but it’s not the first time strange things have happened.

I’ve found an excellent series of articles:

http://www.hpcwire.com/hpcwire/2008-10-08/compilers_and_more_programming_gpus_today.html

http://www.hpcwire.com/hpcwire/2008-10-30/compilers_and_more_optimizing_gpu_kernels.html

http://www.hpcwire.com/hpcwire/2008-09-10/compilers_and_more_gpu_architecture_and_applications.html

[and probably lots of other articles by Wolfe].

Although they are pretty “old” (2008!?), they seem to give a good overview. I always wanted to create a collection of tutorials and links concerning GPGPU; maybe one day I’ll find the time, and then I could include these…

Hello there,
I’ve got a GT 550M and therefore wanted to dive into CUDA 4.0 development.
Reading this topic, I understand that JCuda is not yet compatible with CUDA 4.0.
When can we expect compatibility?
Do I understand correctly that, for now, I have to remove CUDA 4.0 from my system and install NVIDIA’s 3.2 toolkit in order to work with JCuda?
On a Win7_x86_64 system, do I need JCuda x86_64 along with Java x86_64, or can I go for the x86_32 version?

Hello

That’s right, the current version of JCuda is for CUDA 3.2. Do you plan to use any CUDA 4.0-specific features? If not, it should be possible to build the library for CUDA 4.0 from the source code that is currently available on the website.
I’m not sure whether it’s worth the effort to uninstall CUDA 4.0 and install 3.2 instead (I once tried to downgrade a version on my PC, and it did not work properly - but that does not mean it would not work for you).

I’m also not so sure when the update will be available. For some of the new functions of CUDA 4.0, some internal refactoring will be necessary. However, these functions are very specific, so I considered creating a “release candidate” quickly (since CUDA 4.0 is still a “release candidate” itself!) and performing the full update afterwards. In any case, I plan to finish at least the first release (and maybe even the final one) by the end of the month.

(One aspect that should also be addressed is that it is not possible to use the 3.2 binaries with CUDA 4.0 - there are approaches for making this possible, but they are not yet implemented in JCuda.)

**On a Win7_x86_64 system, do I need JCuda x86_64 along with Java x86_64, or can I go for the x86_32 version?**

Since you will have to install the 64bit driver and toolkit, you will also need the 64 bit version of JCuda.

bye

Hey,
thank you for the response.
I’ll fetch the sources then and compile them against 4.0rc2.