Hi all,
I run into a problem when the value of n is larger than 63,961,152. Is this an upper limit of my display card?
JCuda.cudaMemcpy(d_keys, Pointer.to(array), n * Sizeof.INT, cudaMemcpyKind.cudaMemcpyHostToDevice);
The error is
Exception in thread "main" jcuda.CudaException: cudaErrorLaunchFailure
at jcuda.runtime.JCuda.checkResult(JCuda.java:184)
at jcuda.runtime.JCuda.cudaMemcpy(JCuda.java:1068)
JCuda Version 0.3.0a
Display Card: GT240
OS: linux 64 bit
Lemon
Hello
It should not be, actually. I just did a web search for any limitations of cudaMemcpy or cudaMalloc, but did not find any that apply to your system. The value 63,961,152 would result in ~256MB of memory, so this should be no problem on any card with >512MB.
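As a quick check of that figure, here is a minimal sketch of the arithmetic in plain Java. The class and method names are made up for illustration; the 4-byte element size is the value that Sizeof.INT stands for.

```java
// Hypothetical helper, not part of JCuda: size of an int buffer in bytes.
final class BufferSize {
    // 4 bytes per int, the same value as jcuda.Sizeof.INT
    static final int BYTES_PER_INT = Integer.BYTES;

    // Uses long arithmetic so the product cannot wrap around.
    static long bytesForInts(long elementCount) {
        return elementCount * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        long bytes = bytesForInts(63_961_152L);
        double mib = bytes / (1024.0 * 1024.0);
        System.out.println(bytes + " bytes, about " + Math.round(mib) + " MiB");
    }
}
```

For n = 63,961,152 this gives 255,844,608 bytes, which is roughly 256MB in decimal units or 244 MiB.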
I also ran a test (with
java -Xmx1200m JCudaMemcpyTest
on WinXP 32 / GeForce GTX 280 / 1GB)
import jcuda.*;
import jcuda.runtime.JCuda;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

public class JCudaMemcpyTest
{
    public static void main(String args[])
    {
        JCuda.setExceptionsEnabled(true);
        for (int n=2; n<=(1<<30); n*=2)
        {
            int memorySize = n * Sizeof.INT;
            int host[] = new int[n];
            System.out.println("Test "+n+" ("+memorySize+" bytes)");
            Pointer devicePointer = new Pointer();
            cudaMalloc(devicePointer, memorySize);
            cudaMemcpy(devicePointer, Pointer.to(host), memorySize,
                cudaMemcpyHostToDevice);
            cudaFree(devicePointer);
            System.out.println("Test "+n+" ("+memorySize+" bytes) DONE");
        }
    }
}
It bails out as expected, when trying to allocate a 1GB chunk of memory.
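One detail to keep in mind with size computations of this kind: the expression n * Sizeof.INT is evaluated in 32-bit int arithmetic, so for n = 1<<29 the product no longer fits into an int. The test stops at the 1GB allocation (n = 1<<28) before this matters, but it could matter on a card with more memory. A minimal sketch of two safer variants, in plain Java with made-up names:

```java
// Hypothetical helper, not part of JCuda: overflow-safe buffer sizes.
final class SafeSize {
    // Returns n * 4 as a long; the cast happens before the multiplication,
    // so the result never wraps.
    static long bytesAsLong(int n) {
        return (long) n * Integer.BYTES;
    }

    // Returns n * 4 as an int, or throws ArithmeticException on overflow.
    static int bytesAsIntChecked(int n) {
        return Math.multiplyExact(n, Integer.BYTES);
    }

    public static void main(String[] args) {
        int n = 1 << 29;
        System.out.println("int product:  " + (n * Integer.BYTES)); // wraps to a negative value
        System.out.println("long product: " + bytesAsLong(n));
    }
}
```

Which variant fits depends on the parameter type of the function that receives the size; that type is not shown in this thread, so check it for the JCuda version in use.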
In any case, if this was a limit of the graphics card, you should experience problems already when trying to allocate this memory, and not when trying to copy it. Are you sure that your host pointer is valid, and the data it points to is at least 63,961,152*4 bytes?
What happens if you run the above sample program?
bye
Hi,
First, here is the result of the sample program. I set the JVM memory options to “-Xmn512m -Xmx2024m”.
Test 2 (8 bytes)
Test 2 (8 bytes) DONE
Test 4 (16 bytes)
Test 4 (16 bytes) DONE
Test 8 (32 bytes)
Test 8 (32 bytes) DONE
Test 16 (64 bytes)
Test 16 (64 bytes) DONE
Test 32 (128 bytes)
Test 32 (128 bytes) DONE
Test 64 (256 bytes)
Test 64 (256 bytes) DONE
Test 128 (512 bytes)
Test 128 (512 bytes) DONE
Test 256 (1024 bytes)
Test 256 (1024 bytes) DONE
Test 512 (2048 bytes)
Test 512 (2048 bytes) DONE
Test 1024 (4096 bytes)
Test 1024 (4096 bytes) DONE
Test 2048 (8192 bytes)
Test 2048 (8192 bytes) DONE
Test 4096 (16384 bytes)
Test 4096 (16384 bytes) DONE
Test 8192 (32768 bytes)
Test 8192 (32768 bytes) DONE
Test 16384 (65536 bytes)
Test 16384 (65536 bytes) DONE
Test 32768 (131072 bytes)
Test 32768 (131072 bytes) DONE
Test 65536 (262144 bytes)
Test 65536 (262144 bytes) DONE
Test 131072 (524288 bytes)
Test 131072 (524288 bytes) DONE
Test 262144 (1048576 bytes)
Test 262144 (1048576 bytes) DONE
Test 524288 (2097152 bytes)
Test 524288 (2097152 bytes) DONE
Test 1048576 (4194304 bytes)
Test 1048576 (4194304 bytes) DONE
Test 2097152 (8388608 bytes)
Test 2097152 (8388608 bytes) DONE
Test 4194304 (16777216 bytes)
Test 4194304 (16777216 bytes) DONE
Test 8388608 (33554432 bytes)
Test 8388608 (33554432 bytes) DONE
Test 16777216 (67108864 bytes)
Test 16777216 (67108864 bytes) DONE
Test 33554432 (134217728 bytes)
Test 33554432 (134217728 bytes) DONE
Test 67108864 (268435456 bytes)
Test 67108864 (268435456 bytes) DONE
Test 134217728 (536870912 bytes)
Test 134217728 (536870912 bytes) DONE
Test 268435456 (1073741824 bytes)
Exception in thread "main" jcuda.CudaException: cudaErrorMemoryAllocation
at jcuda.runtime.JCuda.checkResult(JCuda.java:184)
at jcuda.runtime.JCuda.cudaMalloc(JCuda.java:827)
at JCudaMemcpyTest.main(JCudaMemcpyTest.java:21)
Hi Marco,
In fact, I have two cudaMemcpy calls in my program. Does that mean the maximum memory size is divided by 2 when there are two cudaMemcpy calls?
private static void sort(int array[],int h_values[]){
    int n = array.length;
    JCuda.setExceptionsEnabled(true);
    JCudpp.setExceptionsEnabled(true);
    Pointer d_keys = new Pointer();
    Pointer d_values = new Pointer();
    JCuda.cudaMalloc(d_keys, n * Sizeof.INT);
    JCuda.cudaMalloc(d_values, n * Sizeof.INT);
    JCuda.cudaMemcpy(d_keys, Pointer.to(array), n * Sizeof.INT, cudaMemcpyKind.cudaMemcpyHostToDevice);
    JCuda.cudaMemcpy(d_values, Pointer.to(h_values), n * Sizeof.INT, cudaMemcpyKind.cudaMemcpyHostToDevice);
Lemon
Hi Marco,
I adjusted the program. The radix sort function is shown below.
private static void sort(int array[],int h_values[])
{
    int n = array.length;
    JCuda.setExceptionsEnabled(true);
    JCudpp.setExceptionsEnabled(true);
    Pointer d_keys = new Pointer();
    Pointer d_values = new Pointer();
    JCuda.cudaMalloc(d_keys, n * Sizeof.INT);
    JCuda.cudaMalloc(d_values, n * Sizeof.INT);
    JCuda.cudaMemcpy(d_keys, Pointer.to(array), n * Sizeof.INT,
        cudaMemcpyKind.cudaMemcpyHostToDevice);
    JCuda.cudaMemcpy(d_values, Pointer.to(h_values), n * Sizeof.INT,
        cudaMemcpyKind.cudaMemcpyHostToDevice);
    CUDPPConfiguration config = new CUDPPConfiguration();
    config.algorithm = CUDPPAlgorithm.CUDPP_SORT_RADIX;
    config.datatype = CUDPPDatatype.CUDPP_UINT;
    config.op = CUDPPOperator.CUDPP_ADD;
    config.options = CUDPPOption.CUDPP_OPTION_KEY_VALUE_PAIRS;
    CUDPPHandle handle = new CUDPPHandle();
    JCudpp.cudppPlan(handle, config, n, 1, 0);
    JCudpp.cudppSort(handle, d_keys, d_values, 32, n);
    Arrays.fill(array, 0);
    JCuda.cudaMemcpy(Pointer.to(array), d_keys, n * Sizeof.INT,
        cudaMemcpyKind.cudaMemcpyDeviceToHost);
    JCuda.cudaMemcpy(Pointer.to(h_values), d_values, n * Sizeof.INT,
        cudaMemcpyKind.cudaMemcpyDeviceToHost);
    JCudpp.cudppDestroyPlan(handle);
    JCuda.cudaFree(d_keys);
    JCuda.cudaFree(d_values);
}
However, this time the exception points to the following code. The value of n is 63,961,153.
JCuda.cudaMemcpy(Pointer.to(array), d_keys, n * Sizeof.INT,
    cudaMemcpyKind.cudaMemcpyDeviceToHost);
Sorry to bother you so many times.
Lemon
Hello
With the example you posted, I can reproduce the problem. However, it does not seem to be related to cudaMemcpy: when I insert the line
JCuda.cudaThreadSynchronize();
after the call to cudppSort, the exception is thrown at this line. This is obviously a case of an error code that is returned due to an asynchronous launch (as stated in the documentation of all CUDA API functions: “Note that this function may also return error codes from previous, asynchronous launches.”).
In this case, the error code seems to come from cudppSort.
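The effect of such a deferred error can be illustrated with a toy model in plain Java. This is not JCuda and makes no claim about how the CUDA runtime works internally; the class ToyDevice, its methods, and the error value are invented for illustration. It only mimics the observable behaviour: an asynchronous launch returns immediately, and its failure is reported by whichever later call synchronizes first.

```java
// Toy model only: invented names, not the JCuda or CUDA API.
final class ToyDevice {
    static final int SUCCESS = 0;
    static final int LAUNCH_FAILURE = 4; // arbitrary value chosen for this model

    private int pendingError = SUCCESS;

    // "Launches" a kernel: returns at once, a failure only becomes pending.
    int launch(boolean kernelWillFail) {
        if (kernelWillFail) {
            pendingError = LAUNCH_FAILURE;
        }
        return SUCCESS;
    }

    // An explicit synchronization point reports the pending error.
    int synchronize() {
        return pendingError;
    }

    // A blocking copy has to wait for the kernel, so it reports the same error.
    int memcpy() {
        return synchronize();
    }

    public static void main(String[] args) {
        ToyDevice device = new ToyDevice();
        System.out.println("launch returned " + device.launch(true));
        System.out.println("memcpy returned " + device.memcpy());
    }
}
```

In this model the launch itself looks successful and the following copy gets the blame, which matches the symptom described above. Adding a synchronization call directly after the suspicious launch moves the report next to its actual cause.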
I just tested the “radixsort” example from the NVIDIA SDK (which is, as far as I know, nearly the same as the CUDPP version) and for me it prints
Sorting 1048576 32-bit unsigned int keys and values
.\testradixsort.cpp(277) : cudaSafeCall() Runtime API error : unspecified launch failure.
This happens even for 1048576 values, so there is something completely wrong, but I’m not sure to what extent this also applies to CUDPP. A web search on “unspecified launch failure” radixsort yields some results, which are also related to CUDPP, but none of them are helpful…
Maybe there is an (undocumented) limit of the size for a radixsort in CUDPP? (They are already working on documentation of size limits…)
I’ll probably have to take a closer look at this and do some more tests of the CUDPP sort function to find out what the reason might be.
BTW: Does the radixsort example from NVIDIA work for you? It’s really strange that it fails for me even for such a small array…
bye
Hi Marco,
I can successfully run the radixsort example of the NVIDIA SDK (sdk_3.0_win_32), built with Visual Studio 2008.
c:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Debug\radixsort.exe Starting...
Using CUDA device [0]: GeForce GT 240
Sorting 1048576 32-bit unsigned int keys and values
radixSort, Throughput = 30.3455 MElements/s, Time = 0.03455 s, Size = 1048576 elements, NumDevsUsed = 1, Workgroup = 256
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------
It fails to run if I set the number of elements to 63,000,000 (-n=63000000).
Using CUDA device [0]: GeForce GT 240
First-chance exception at 0x7c812a6b in radixsort.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0012fc04..
The thread 'Win32 Thread' (0xc70) has exited with code 1 (0x1).
The program '[2908] radixsort.exe: Native' has exited with code 1 (0x1).
Lemon
I ran another test of the radixsort example, and now it seems to work (I had compiled it with a wrong setup).
However, it worked for 50 million elements, whereas it crashes for 65 million elements. In release mode, I got the same error message that you posted, but after compiling and running it in debug mode it showed
Sorting 65000000 32-bit unsigned int keys and values
CUDA Error RadixSort::initialize() : out of memory
I had a look at the specified function. The documentation states that the keys and values are sorted “in-place”. I assumed this meant that no additional temporary arrays are created for the keys and values, but in fact some temporary arrays are created during the initialization: one for the keys, one for the values, and 3 additional (smaller) arrays. This simply is too much, because then you have 4 * 65 million * sizeof(int) plus some smaller arrays, which is >1GB. So it is obviously not possible to sort more than ~64 million key-value pairs with CUDPP on a 1GB card…
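To make that estimate concrete, here is a back-of-the-envelope sketch in plain Java. It assumes, as described above, the two user arrays plus two temporary arrays of the same size, and it ignores the three smaller arrays, so the result is only a lower bound. The names are invented for illustration.

```java
// Hypothetical estimate, not part of CUDPP or JCuda.
final class SortFootprint {
    // keys + values supplied by the caller, plus one temporary copy of each
    static final int FULL_SIZE_ARRAYS = 4;

    // Lower bound in bytes for sorting n key-value pairs of 32-bit ints.
    static long estimateBytes(long n) {
        return FULL_SIZE_ARRAYS * n * Integer.BYTES;
    }

    public static void main(String[] args) {
        for (long n : new long[] { 50_000_000L, 65_000_000L }) {
            System.out.println(n + " pairs need at least " + estimateBytes(n) + " bytes");
        }
    }
}
```

For 50 million pairs this gives 800,000,000 bytes and for 65 million pairs 1,040,000,000 bytes, before the smaller arrays and anything else that already occupies device memory are counted.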
Depending on the application case, it could(!) be possible to sort two halves of the arrays, and merge them afterwards (on the CPU, if necessary), but of course this would have a considerable performance impact…
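A minimal sketch of the CPU-side merge step, in plain Java. It assumes that each half has already been sorted by key (for example by two separate sort calls on the GPU, which are not shown here) and that keys are compared as unsigned 32-bit values, in line with the CUDPP_UINT datatype in the configuration posted earlier. The class and method names are invented for illustration.

```java
import java.util.Arrays;

// Hypothetical helper: merges two key-sorted runs of key-value pairs on the CPU.
final class PairMerge {
    // Keys are ordered as unsigned ints; on equal keys the pair from run A comes first.
    // outKeys and outVals must have room for keysA.length + keysB.length elements.
    static void merge(int[] keysA, int[] valsA, int[] keysB, int[] valsB,
                      int[] outKeys, int[] outVals) {
        int a = 0, b = 0, out = 0;
        while (a < keysA.length && b < keysB.length) {
            if (Integer.compareUnsigned(keysA[a], keysB[b]) <= 0) {
                outKeys[out] = keysA[a];
                outVals[out++] = valsA[a++];
            } else {
                outKeys[out] = keysB[b];
                outVals[out++] = valsB[b++];
            }
        }
        // Copy whatever is left in the run that was not exhausted.
        while (a < keysA.length) {
            outKeys[out] = keysA[a];
            outVals[out++] = valsA[a++];
        }
        while (b < keysB.length) {
            outKeys[out] = keysB[b];
            outVals[out++] = valsB[b++];
        }
    }

    public static void main(String[] args) {
        int[] outKeys = new int[6];
        int[] outVals = new int[6];
        merge(new int[] { 1, 5, 9 }, new int[] { 10, 50, 90 },
              new int[] { 2, 5, 7 }, new int[] { 20, 51, 70 },
              outKeys, outVals);
        System.out.println(Arrays.toString(outKeys));
        System.out.println(Arrays.toString(outVals));
    }
}
```

The merge is a single linear pass over both halves, but it needs a second pair of host arrays for the output, and the extra transfers and the CPU pass are where the performance impact mentioned above would come from.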