At work we have a very large Java project. When running simulations it uses almost 3 GB of memory. We run it via Condor on 64-bit Linux using JDK 1.6.0_14. The hardware is a very powerful machine (a 4U box with 16 Nehalem cores and 24 GB of memory).
I set it up to do its FFTs via jcuda's JCufft package instead of the slower commercial math product we want to replace. I expected jcuda to be much faster, but it isn't: our 5-hour runs take just as long. When we profiled the code with the old commercial product in place, the majority of the JVM time was spent computing FFTs.
I was sceptical that CUDA was really the slow part, so I wrote a small single-file Java program that builds a sample input array of 2 million random values, runs the JCuda forward FFT over it many times, and then does the same with the commercial FFT product for the same number of iterations. The commercial product doesn't support batching, so in both cases I run the 2M-value input as a single FFT. Measured with nanoTime(), JCuda was 7x or more faster than the commercial library (roughly 14 seconds for the commercial product vs. 2 seconds for JCuda). Since my use of jcuda in our simulation goes through my own FFT interface class, I made sure the test imports that class from our jar file, so the FFT code path in the test is the same as in the model.
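The core of the standalone test looks roughly like this (a simplified sketch, not the exact code; the array size, loop count, and in-place transform are illustrative):

    import java.util.Random;

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.jcufft.JCufft;
    import jcuda.jcufft.cufftHandle;
    import jcuda.jcufft.cufftType;
    import jcuda.runtime.JCuda;
    import jcuda.runtime.cudaMemcpyKind;

    public class JCufftBench
    {
        public static void main(String[] args)
        {
            final int n = 1 << 20;        // ~1M complex points -> ~2M doubles
            final int iterations = 100;   // illustrative loop count

            // Random interleaved complex input (re, im, re, im, ...)
            double[] hostData = new double[2 * n];
            Random random = new Random(0);
            for (int i = 0; i < hostData.length; i++)
            {
                hostData[i] = random.nextDouble();
            }

            // One device allocation, reused for every iteration
            Pointer deviceData = new Pointer();
            JCuda.cudaMalloc(deviceData, (long)hostData.length * Sizeof.DOUBLE);

            // One plan, created once and reused
            cufftHandle plan = new cufftHandle();
            JCufft.cufftPlan1d(plan, n, cufftType.CUFFT_Z2Z, 1);

            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++)
            {
                JCuda.cudaMemcpy(deviceData, Pointer.to(hostData),
                    (long)hostData.length * Sizeof.DOUBLE,
                    cudaMemcpyKind.cudaMemcpyHostToDevice);

                // Forward FFT using the Pointer-style call, in place on the device
                JCufft.cufftExecZ2Z(plan, deviceData, deviceData, JCufft.CUFFT_FORWARD);

                JCuda.cudaMemcpy(Pointer.to(hostData), deviceData,
                    (long)hostData.length * Sizeof.DOUBLE,
                    cudaMemcpyKind.cudaMemcpyDeviceToHost);
            }
            long end = System.nanoTime();
            System.out.println("JCufft total: " + (end - start) / 1e9 + " s");

            JCufft.cufftDestroy(plan);
            JCuda.cudaFree(deviceData);
        }
    }

The commercial library's loop is the same shape, just with its own forward-FFT call in place of the JCufft section.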
So I'm wondering if anyone has an idea why jcuda would be slow when used inside a large application like this, yet fast when run directly as in my basic test code.
My ideas:
- garbage collection (in the test, comparatively little work is being done by the GC)
- too many JNI local references being used in the model, since other JNI code runs alongside jcuda
Jcuda itself seems to handle releasing its local references OK, though. Another possible issue is the sheer amount of data: we're doing 32k-sample FFTs, 64 batches at a time, so the array sent back and forth across JNI is about 34 MB (32k samples × 64 batches × 2 doubles per complex value × sizeof(double)).
When our large sim is running, we have Jcuda and a few other JNI interfaces active at the same time; maybe some kind of JNI resource is running short?
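For reference, the batched transform in the sim is set up roughly along these lines (a simplified fragment in the same style as the test above; variable names are illustrative and error checking is omitted):

    int fftSize = 32 * 1024;   // 32k samples per FFT
    int batch   = 64;          // 64 FFTs per call
    double[] hostData = new double[fftSize * batch * 2];   // interleaved re/im
    long bytes = (long)hostData.length * Sizeof.DOUBLE;    // ~34 MB per direction

    // Plan and device buffer are created once and reused in the real code
    cufftHandle plan = new cufftHandle();
    JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_Z2Z, batch);
    Pointer deviceData = new Pointer();
    JCuda.cudaMalloc(deviceData, bytes);

    // Each call moves ~34 MB in, transforms in place, and moves ~34 MB back
    JCuda.cudaMemcpy(deviceData, Pointer.to(hostData), bytes,
        cudaMemcpyKind.cudaMemcpyHostToDevice);
    JCufft.cufftExecZ2Z(plan, deviceData, deviceData, JCufft.CUFFT_FORWARD);
    JCuda.cudaMemcpy(Pointer.to(hostData), deviceData, bytes,
        cudaMemcpyKind.cudaMemcpyDeviceToHost);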
PS: note this is jcuda 0.2.3 running against CUDA toolkit 2.3; I'll likely upgrade the 4U host to CUDA 3.0 and JCuda 0.3 on Monday.
Some other info: I used the Pointer-style fft call instead of the array-based "convenience" version; I looked at the code and decided the convenience version has extra overhead that I could avoid with the Pointer-style call. Another possible issue might be that my FFT interface mallocs the device space only once and then reuses it over and over. My FFT class also creates the plan for each size only once and then reuses it via a map whenever that size is needed again. All of this was done as an attempt to speed up our code's handling of FFTs.
My sim code has an explicit fftCleanup() method that main.exec() calls when the simulation ends; this is where I do the jcuda cudaFree() and cudaFreeHost() calls and destroy all the plans.
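Put together, the interface class is structured roughly like this (a heavily simplified sketch; class and method names other than fftCleanup() are just illustrative, and I've left out the pinned host buffers that the real cleanup also frees with cudaFreeHost()):

    import java.util.HashMap;
    import java.util.Map;

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.jcufft.JCufft;
    import jcuda.jcufft.cufftHandle;
    import jcuda.jcufft.cufftType;
    import jcuda.runtime.JCuda;
    import jcuda.runtime.cudaMemcpyKind;

    public class CudaFftInterface
    {
        // One plan per (size, batch) combination, created lazily and reused
        private final Map<String, cufftHandle> plans = new HashMap<String, cufftHandle>();

        // Device buffer allocated once and reused for every transform
        private Pointer deviceData = null;
        private long deviceBytes = 0;

        public void forward(double[] interleavedComplex, int fftSize, int batch)
        {
            long bytes = (long)interleavedComplex.length * Sizeof.DOUBLE;
            ensureDeviceBuffer(bytes);
            cufftHandle plan = getPlan(fftSize, batch);

            JCuda.cudaMemcpy(deviceData, Pointer.to(interleavedComplex), bytes,
                cudaMemcpyKind.cudaMemcpyHostToDevice);
            JCufft.cufftExecZ2Z(plan, deviceData, deviceData, JCufft.CUFFT_FORWARD);
            JCuda.cudaMemcpy(Pointer.to(interleavedComplex), deviceData, bytes,
                cudaMemcpyKind.cudaMemcpyDeviceToHost);
        }

        private void ensureDeviceBuffer(long bytes)
        {
            if (deviceData == null || bytes > deviceBytes)
            {
                if (deviceData != null)
                {
                    JCuda.cudaFree(deviceData);
                }
                deviceData = new Pointer();
                JCuda.cudaMalloc(deviceData, bytes);
                deviceBytes = bytes;
            }
        }

        private cufftHandle getPlan(int fftSize, int batch)
        {
            String key = fftSize + "x" + batch;
            cufftHandle plan = plans.get(key);
            if (plan == null)
            {
                plan = new cufftHandle();
                JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_Z2Z, batch);
                plans.put(key, plan);
            }
            return plan;
        }

        // Called once from main.exec() when the simulation ends
        public void fftCleanup()
        {
            for (cufftHandle plan : plans.values())
            {
                JCufft.cufftDestroy(plan);
            }
            plans.clear();
            if (deviceData != null)
            {
                JCuda.cudaFree(deviceData);
                deviceData = null;
            }
        }
    }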