I was playing around with the sample CUDPP program. I was kind of surprised to see that the CPU (Core i7, 3.3+ GHz max clock) smoked the GPU. I can’t remember the exact difference, but it was at least 3 times faster on the CPU. Is this just because Radix Sort doesn’t speed up that well? I haven’t looked at the sum scan yet. Does anyone have numbers using JCUDA out there?
Hello
In theory, Radix sort should scale better than the standard Java sort: Radix sort runs in O(kn), while the standard Java sort is a modified Quicksort with O(n log n).
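For reference, a plain-Java sketch of an LSD radix sort (hypothetical demo code, not part of CUDPP or JCudpp) makes the O(kn) structure visible: k passes of linear work, one per 8-bit digit:

```java
import java.util.Arrays;

public class RadixSortDemo {
    // LSD radix sort on non-negative ints, processing 8 bits per pass:
    // k = 4 passes, each doing O(n) work, hence O(kn) overall.
    static void radixSort(int[] a) {
        int[] buf = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            // Counting sort on the current 8-bit digit (stable).
            int[] count = new int[257];
            for (int v : a) count[((v >>> shift) & 0xFF) + 1]++;
            for (int i = 1; i < 257; i++) count[i] += count[i - 1];
            for (int v : a) buf[count[(v >>> shift) & 0xFF]++] = v;
            System.arraycopy(buf, 0, a, 0, a.length);
        }
    }

    public static void main(String[] args) {
        int[] a = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(a);
        System.out.println(Arrays.toString(a)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```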
But there are several points to consider:
- With the first call to any JCudpp or JCuda method, the native library will be loaded, which may take some time. If you want to do a benchmark, you can force the libraries to be loaded by calling JCuda.initialize()/JCudpp.initialize() before starting the timer. (These calls are usually not necessary; they exist just for this sort of “pre-loading”.)
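A minimal benchmark skeleton showing this pattern (the JCuda/JCudpp lines are kept as comments because they need the native libraries and a CUDA device; Arrays.sort stands in for the actual GPU call):

```java
import java.util.Arrays;
import java.util.Random;

public class BenchmarkSkeleton {
    public static void main(String[] args) {
        // Force the native libraries to be loaded BEFORE the timer starts,
        // so the one-time loading cost does not distort the measurement:
        // JCuda.initialize();
        // JCudpp.initialize();

        int[] data = new Random(42).ints(1 << 20).toArray();

        long before = System.nanoTime();
        Arrays.sort(data); // stand-in for the actual (J)CUDPP operation
        long after = System.nanoTime();

        System.out.printf("Sorting took %.2f ms%n", (after - before) / 1e6);
    }
}
```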
- The ‘k’ in the running time of the Radix sort is the number of relevant bits. So, for example, if you wanted to sort a few million values of which you knew that they are all < 256 (8 bits), then RadixSort would be blazingly fast. Even with 32 bits it would still be fast, but in both cases the exact speed, of course, depends on how many values have to be sorted.
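To make the role of ‘k’ concrete: when every key is known to fit in 8 bits, a single counting pass sorts the whole array (a hypothetical plain-Java sketch, not JCudpp code):

```java
import java.util.Arrays;

public class SmallKeySortDemo {
    // If every key is < 256, one 8-bit counting pass suffices:
    // k shrinks from four byte-passes (32-bit keys) to a single pass.
    static void sortSmallKeys(int[] a) {
        int[] count = new int[256];
        for (int v : a) count[v]++;
        int idx = 0;
        for (int key = 0; key < 256; key++)
            for (int c = 0; c < count[key]; c++) a[idx++] = key;
    }

    public static void main(String[] args) {
        int[] a = {200, 3, 255, 3, 0, 77};
        sortSmallKeys(a);
        System.out.println(Arrays.toString(a)); // [0, 3, 3, 77, 200, 255]
    }
}
```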
- A “naive” time measurement might include the time for the memory transfer. But JCudpp is not intended as a “drop-in replacement” for Arrays.sort. JCudpp just offers the functionality of CUDPP, and will in most cases be used when you want to work on data that is already located on the device.
In a mail I received about a similar topic a while ago, someone complained that the “scan” operation of JCudpp was slower than a plain Java implementation. I showed him some benchmarks to point out what I can summarize here briefly:
When you include the time for memory transfers in your measurements and compare several CUDPP operations to Java implementations, then:
- A scan of 32 million elements will be slower in CUDPP.
- Sorting 32 million elements will also be slower in CUDPP.
- Sorting and then immediately scanning these 32 million elements will be faster in CUDPP than in plain Java.
And of course, when you omit the time for the memory transfers, both operations will be much faster than a Java implementation even for relatively small input sizes.
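The shape of this trade-off can be sketched with purely illustrative numbers (assumed for the sketch, not measured): the transfer cost is paid once, so each additional operation chained on the device amortizes it further:

```java
public class TransferAmortization {
    public static void main(String[] args) {
        // Illustrative times in ms for 32M elements — made-up numbers
        // chosen only to show the shape of the trade-off, NOT measurements.
        double transfer = 120.0; // host->device + device->host, paid once
        double gpuSort  = 30.0;
        double gpuScan  = 10.0;
        double cpuSort  = 100.0;
        double cpuScan  = 40.0;

        System.out.printf("scan only : GPU %.0f ms vs CPU %.0f ms%n",
                transfer + gpuScan, cpuScan);                     // GPU loses
        System.out.printf("sort only : GPU %.0f ms vs CPU %.0f ms%n",
                transfer + gpuSort, cpuSort);                     // GPU loses
        System.out.printf("sort+scan : GPU %.0f ms vs CPU %.0f ms%n",
                transfer + gpuSort + gpuScan, cpuSort + cpuScan); // GPU wins
    }
}
```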
Some more… systematic benchmarks might be interesting, but it’s not trivial to create an “objective” (micro-) benchmark that simulates a “realistic” use case…
bye
Thanks. I have been considering those things. I admit I only did the “naive” testing, but from my POV it’s the “right” test. I’m going to play with it more tonight, with multiple ops. A little research I did today suggests, as you did too, that the more complicated the job, the better the speedup. In general, it looks like you have to pay for the two memory-copy ops with complexity.
I’m really interested in the new memory transfer method in CUDA 4.0. I’m not too sure how it works, but you get to skip the memory copy call, or maybe it was the malloc() call. I’m looking forward to your port.
I did just do it without the malloc/free in the timing test. I guess you only really need to do these ops once if you manage the memory correctly. This looks promising.