Need advice: JCuda CUFFT is fast only outside of the project

At work we have a very large Java project. When running simulations it uses almost 3 GB of memory. We run it via Condor on 64-bit Linux using JDK 1.6.0_14. The hardware is a very powerful machine (4U form factor, 16-core Nehalem with 24 GB of memory).

I set it up to do its FFTs via the JCuda JCufft package instead of the slower commercial math product we want to replace. I expected JCuda to be much faster, but it's not: our 5-hour runs take just as long. We profiled the code with the old commercial product in place, and the majority of the JVM time is spent computing FFTs.

I was skeptical that CUDA was really this slow, so I wrote a small single-file Java program that builds a 2-million-value input array of random numbers, runs the JCuda forward FFT against it many times, and then does the same with the commercial FFT product for the same number of iterations. The commercial product doesn't support batching, so in both cases I run the 2M-point input as just one FFT. The delta time via nanoTime() showed that JCuda was 7x or more faster than the commercial library (i.e., about 14 seconds for the commercial product vs. 2 seconds for JCuda). Since my use of JCuda in our simulation goes through my own FFT interface class, I made sure to access it by importing it from our jar file. That should ensure that the FFT code path in the test is the same as in the model.
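For reference, the shape of that timing loop was roughly the following (a sketch, not my actual test code; the FftTimer class and the stand-in transform argument are made up for illustration, with either library's forward call plugged in where the UnaryOperator goes):

```java
import java.util.Random;
import java.util.function.UnaryOperator;

// Sketch of the micro-benchmark described above: time many forward FFTs over
// one large random input array. The transform argument stands in for either
// the JCufft wrapper or the commercial library's forward call.
public class FftTimer {
    public static double[] randomInput(int n) {
        Random rnd = new Random(42);
        double[] a = new double[n];
        for (int i = 0; i < n; i++) a[i] = rnd.nextDouble();
        return a;
    }

    public static long timeNanos(UnaryOperator<double[]> fft, double[] input, int iters) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) fft.apply(input);
        return System.nanoTime() - t0;
    }
}
```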

So I was wondering if anyone has an idea why using JCuda in a large application like this would be slow, while running it directly, as in my basic test code, is fast?

My ideas:

  • Garbage collection (in the test, comparatively little is being done by the GC)
  • Too many JNI local references used in the model, due to other JNI code running alongside JCuda

JCuda seems to handle releasing local references OK, though. Another possible issue is the sheer amount of data: we're doing 32k-sample FFTs, 64 batches at a time. That means the array sent back and forth across JNI is about 34 MB (32k * 64 * 2 * sizeof(double) = 33,554,432 bytes).

When our large sim is running, we have JCuda and a few other JNI interfaces active at the same time; maybe some kind of JNI resource is running short?

PS: Note this is JCuda 0.2.3 running with CUDA toolkit 2.3; I'll likely upgrade the 4U host to CUDA 3.0 and JCuda 0.3 on Monday.

Some other info: I used the "Pointer"-style FFT call instead of the array-based "convenience" version; I looked at the code and decided the convenience version has extra overhead I could avoid with the Pointer-style call. Another possible factor is that my FFT interface mallocs the device space only once for the card, and it's then reused over and over. My FFT class also creates the plan for each size only once and then reuses it via a map whenever that size is needed again. All of this was done as an attempt to speed up our code's handling of FFTs.

My sim code has an explicit "fftCleanup()" method that main.exec() calls when the simulation ends; this is where I do the JCuda cudaFree(), cudaFreeHost(), and all the plan destroys.
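The plan-reuse scheme looks roughly like this (a minimal sketch, not my actual class; PlanHandle is a made-up stand-in for JCufft's cufftHandle, and in real code planFor() would call JCufft.cufftPlan1d and cleanup() would call JCufft.cufftDestroy on each handle):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of caching one CUFFT plan per FFT size, as described above.
public class PlanCache {
    // Hypothetical stand-in for a CUFFT plan handle (cufftHandle in JCufft).
    public static final class PlanHandle {
        final int size;
        PlanHandle(int size) { this.size = size; }
    }

    private final Map<Integer, PlanHandle> plans = new HashMap<>();

    // Create the plan once per FFT size; later calls for the same size reuse it.
    public PlanHandle planFor(int size) {
        return plans.computeIfAbsent(size, PlanHandle::new);
    }

    // Called once at simulation shutdown, analogous to fftCleanup():
    // real code would destroy each plan here before clearing the map.
    public void cleanup() {
        plans.clear();
    }
}
```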

Hello Mark,

Finding an appropriate answer or even the specific reason for this offhand and remotely is difficult. However, some thoughts:

We profiled the code with the old commercial product in place and the majority of the JVM time is computing FFTs. … I wrote a small one file java program … The delta time via nanoTime() was such that Jcuda was 7x or more times faster than the commercial library

The interesting question for me is: When and where does the speedup vanish? If it’s 7 times faster in the test program, where is it wasting all the time in the real application? I can hardly imagine that it’s really the native part that becomes slower. JCufft in fact is not doing much more than passing the calls from Java to the CUFFT library, and this one should be fairly independent of the rest of the application.
Maybe this is too simplified, and possibly not applicable depending on your application context, but:

  • In the simple example, running ~15 seconds, JCufft is 7 times faster
  • In the real application, running 5 hours, JCufft is not faster at all

Can you give the real application a task that runs for, say, 1 minute, and see whether there's a speedup when using JCufft? If so, one could also profile the program when using JCufft. Some profilers offer the possibility to take "snapshots" of application runs, which can then be compared. But this would probably only make sense when a single run does not take 5 hours, but only a few minutes…

Garbage collection (in the test, comparatively little is being done by the GC)

You mean that JCufft may become slower when more GC has to be done? I cannot imagine how that would affect the library: all the time-consuming work is done by the GPU…

[QUOTE=markos]too many jni local references used in the model due to other JNI code running along with jcuda

I think jcuda seems to handle releasing local references ok?
[…]
When our large sim is running, we'd have Jcuda and a few other JNI interfaces running at one time, maybe some kind of JNI resource is running short?[/QUOTE]

I have to admit that I'm not sure whether there are any resources shared among different JNI libraries loaded in the same JVM which might explain this. Although I don't know which other JNI libraries are used, I don't think there should be noticeable interdependencies: the actual JCufft calls are embarrassingly simple, much simpler than most other JCuda JNI calls. There are no global references used in JCuda or JCufft, and hardly any local references. Local references are, by the way, automatically destroyed when the native function returns, and usually they do not need any special treatment.

[QUOTE=markos]Another possible issue is the sheer amount of data, we're doing 32k sample ffts 64 batches at a time. This means the array sent back and forth to JNI is about 34Megs (32k * 64 * 2 * sizeof(double) ) == 34MB.

Some other info, I used the "Pointer" style fft call instead of the array based "convenience" version,
[…]
Another possible issue might be that my FFT interface only mallocs the device space once for the card and it's then reused over and over.[/QUOTE]

In general, this is one of the most critical aspects: in many applications, the time spent on data transfer between host and device is the bottleneck. However, in this case, I wonder why there would be a problem only in the real application and not in the simple test case…
If the other JNI libraries you mentioned also transfer large data sets over the PCI bus, that could explain why the overall performance is not as high as expected, since everything might be limited by the PCI transfer rate. Admittedly, my hardware knowledge is too limited to make a definitive statement about that…
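As a rough back-of-the-envelope check of what one transfer costs (the ~6 GB/s effective PCIe bandwidth used here is only an assumed figure, not a measurement of your machine):

```java
// Rough transfer-time estimate for one 32k x 64 batched double-complex block.
// The 6 GB/s effective PCIe bandwidth is an assumed figure, not a measurement.
public class TransferEstimate {
    public static double transferMillis(long bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond * 1000.0;
    }

    public static void main(String[] args) {
        long bytes = 32L * 1024 * 64 * 2 * 8;   // samples * batches * (re,im) * sizeof(double)
        double ms = transferMillis(bytes, 6e9); // one direction
        System.out.printf("%d bytes, ~%.1f ms per direction%n", bytes, ms);
    }
}
```

So each batched FFT would spend on the order of 5-6 ms per direction just on the bus; whether that matters depends on how many such transfers per second the application issues.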

I think I already mentioned this in an E-mail: To achieve maximum speed, the “Pointer style” calling is indeed favorable compared to the convenience functions, because the latter are really only provided for conveniently doing a single FFT, and do some memory allocations etc. which can be avoided for repeated calls.

However, one interesting point is: how is the data transferred from the host to the device? You allocate a block on the device once, large enough to hold the largest FFT data, and then use cudaMemcpy to copy data from the host into this pre-allocated block. In which form is the data represented on the Java side? Since it comes from Java, I assume it is initially stored in a float[] array, isn't it?

I told the JVM to try its best to avoid creating a copy of the input data before copying it to the device, and instead to copy the data from Java directly, but the final decision is left to the JVM. It might create a copy of the input data, which could cause overhead. I did some tests, also with large data sets (much larger than 34 MB), and never saw it making a copy, but nobody can look under the hood of the JVM… The creation of a copy of the input data could definitely be avoided by using a direct buffer (via ByteBuffer.allocateDirect), but I don't know whether that can be seamlessly integrated into your workflow, since I assume other parts of the application rely on the data being represented as a float[] array…
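For illustration, such a direct buffer could be allocated like this (a minimal sketch; the class and method names are made up, and native byte order is used so the native side sees the bytes as-is):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Minimal sketch: allocate a direct buffer sized for a batched Re/Im-interleaved
// double-precision FFT input, viewed as a DoubleBuffer for convenient filling.
public class DirectBufferDemo {
    public static DoubleBuffer allocateFftBuffer(int samples, int batches) {
        int doubles = samples * batches * 2;   // interleaved Re/Im
        ByteBuffer bytes = ByteBuffer
                .allocateDirect(doubles * Double.BYTES)
                .order(ByteOrder.nativeOrder());
        return bytes.asDoubleBuffer();
    }
}
```

A Pointer created from such a direct buffer is guaranteed to refer to the buffer's memory directly, with no chance of the JVM copying it behind the scenes.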

bye
Marco

However, one interesting point is: How is the data transferred from the host to the device?

The application uses ordinary float[] and double[] arrays of Re/Im-interleaved data, passed to my FFT interface like this pseudocode:


double[] input = getData();                 // an array of FFT inputs, batch count of them
FFT myfft = FFT.getInstance(input.length);  // gets an FFT instance for this size
double[] result = myfft.forward(input, batch);
.....

My "forward()" is really a wrapper around the JCuda forward(); that's where the input array is wrapped in a Pointer via the "to()" function and the cudaMemcpy is done.

So yes, I'm not using a ByteBuffer for the actual passing to fft.forward(), but a float[] or double[] array.

Using a Buffer could be a good idea, but that still doesn't answer why the arrays aren't also hurting speed in the test code, which passes the data in exactly the same way. It is true, though, that in the test code I created the one input array once and then fed it into forward() over and over in a loop. That is a very artificial situation compared with the real simulation, which of course makes a new array each time.
Mark

[QUOTE=markos]The application uses ordinary float and double of Re/Im interleaved data to pass to my FFT interface like this pseudocode:

My "forward()" is really a wrapper to the Jcuda forward(), that's where that input
array is put into the Pointer via the "to()" function as the cudaMemcpy is done.
[/QUOTE]

When an array is passed to JCuda (via a Pointer), JCuda internally tries to obtain the address of the real Java array with "GetPrimitiveArrayCritical". In all cases I have tested so far, this really returns the array address and thus does not make an unnecessary copy. Apart from that, I have never really seen the memory management take a considerable amount of time, except when the JVM runs short on memory and has to work a lot to move/GC memory blocks in order to provide a contiguous block for a large new allocation. You mentioned that "in the test, comparatively little is being done by the GC". I assume this refers only to the simple test case, not to the application. But even if the application wastes some time on memory management (which might be the case when there are lots of large allocations/deallocations): this should not really influence the speed of the native CUFFT library, and should in any case affect JCufft and the other library equally.

[QUOTE=markos]So yes, I'm not using ByteBuffer to do the actual passing to the fft.forward but using
either a float or double

Using Buffer could be a good idea but that still doesn't answer why that isn't also hurting speed in
the test code which does the exact same thing to pass the data. Now it is true that in the test code I only made the one input array once and then fed it over and over into the forward() in a loop. This is a very artificial situation compared with the real simulation which of course would make a new array each time.[/QUOTE]

The hint about using direct buffers only referred to the fact that direct buffers are guaranteed to be used directly, while arrays might be copied by the JVM (thus, they might be copied in your application - nobody knows…). But as I mentioned, I also did some tests involving larger arrays and never saw the JVM make a copy of the array when using "GetPrimitiveArrayCritical". When no copy is made, there should be no difference in speed on the native side between an array and a direct buffer. But maybe I'll find some time this week to create an example which repeatedly does FFTs on newly allocated arrays / direct buffers, to find out whether this affects performance in any way…
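Such an experiment could be sketched roughly like this (a trivial in-place operation stands in for the actual CUFFT call, so this only exercises the Java-side allocation behavior; the class and method names are made up):

```java
// Sketch of the experiment described above: time a workload that reuses one
// array vs. one that allocates a fresh array per iteration. A trivial in-place
// scale stands in for the real FFT call, so only Java-side allocation differs.
public class AllocBench {
    static final int N = 32 * 1024 * 64 * 2;   // one 32k x 64 batched Re/Im block

    static void work(double[] a) {
        for (int i = 0; i < a.length; i++) a[i] *= 1.0000001;
    }

    static long timeReused(int iters) {
        double[] a = new double[N];
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) work(a);
        return System.nanoTime() - t0;
    }

    static long timeFresh(int iters) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) work(new double[N]);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int iters = 50;
        System.out.printf("reused: %d ms, fresh: %d ms%n",
                timeReused(iters) / 1_000_000, timeFresh(iters) / 1_000_000);
    }
}
```

If the "fresh" variant is dramatically slower here, the allocation/GC pattern of the real simulation would be a plausible suspect.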