I have run into some problems on both Windows and Linux.
- GT 240 on Windows XP, 32-bit
I can run the following device test.
import jcuda.runtime.*;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaError.*;
import static jcuda.runtime.cudaComputeMode.*;

public class JCudaDeviceQueryTest
{
    public static void main(String args[])
    {
        int deviceCountArray[] = new int[1];
        if (cudaGetDeviceCount(deviceCountArray) != cudaSuccess)
        {
            System.out.printf("cudaGetDeviceCount failed! CUDA Driver and Runtime version may be mismatched.\n");
        }
        int deviceCount = deviceCountArray[0];

        // This function call returns 0 if there are no CUDA capable devices.
        if (deviceCount == 0)
        {
            System.out.println("There is no device supporting CUDA");
        }

        int dev;
        int driverVersionArray[] = new int[1];
        int runtimeVersionArray[] = new int[1];
        for (dev = 0; dev < deviceCount; ++dev)
        {
            cudaDeviceProp deviceProp = new cudaDeviceProp();
            cudaGetDeviceProperties(deviceProp, dev);

            if (dev == 0)
            {
                // This function call returns 9999 for both major & minor fields, if no CUDA capable devices are present
                if (deviceProp.major == 9999 && deviceProp.minor == 9999)
                    System.out.printf("There is no device supporting CUDA.\n");
                else if (deviceCount == 1)
                    System.out.printf("There is 1 device supporting CUDA\n");
                else
                    System.out.printf("There are %d devices supporting CUDA\n", deviceCount);
            }

            // The device name is a NUL-terminated byte array; cut it at the first NUL.
            String name = new String(deviceProp.name);
            name = name.substring(0, name.indexOf(0));
            System.out.printf("Device %d: \"%s\"\n", dev, name);

            // Query the versions before printing them (these two calls were lost from the pasted listing).
            cudaDriverGetVersion(driverVersionArray);
            int driverVersion = driverVersionArray[0];
            System.out.printf(" CUDA Driver Version: %d.%d\n", driverVersion / 1000, driverVersion % 100);
            cudaRuntimeGetVersion(runtimeVersionArray);
            int runtimeVersion = runtimeVersionArray[0];
            System.out.printf(" CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, runtimeVersion % 100);

            System.out.printf(" CUDA Capability Major revision number: %d\n", deviceProp.major);
            System.out.printf(" CUDA Capability Minor revision number: %d\n", deviceProp.minor);
            System.out.printf(" Total amount of global memory: %d bytes\n", deviceProp.totalGlobalMem);
            System.out.printf(" Number of multiprocessors: %d\n", deviceProp.multiProcessorCount);
            System.out.printf(" Number of cores: %d\n", 8 * deviceProp.multiProcessorCount);
            System.out.printf(" Total amount of constant memory: %d bytes\n", deviceProp.totalConstMem);
            System.out.printf(" Total amount of shared memory per block: %d bytes\n", deviceProp.sharedMemPerBlock);
            System.out.printf(" Total number of registers available per block: %d\n", deviceProp.regsPerBlock);
            System.out.printf(" Warp size: %d\n", deviceProp.warpSize);
            System.out.printf(" Maximum number of threads per block: %d\n", deviceProp.maxThreadsPerBlock);
            System.out.printf(" Maximum sizes of each dimension of a block: %d x %d x %d\n",
                deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);
            System.out.printf(" Maximum sizes of each dimension of a grid: %d x %d x %d\n",
                deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]);
            System.out.printf(" Maximum memory pitch: %d bytes\n", deviceProp.memPitch);
            System.out.printf(" Texture alignment: %d bytes\n", deviceProp.textureAlignment);
            System.out.printf(" Clock rate: %.2f GHz\n", deviceProp.clockRate * 1e-6f);
            System.out.printf(" Concurrent copy and execution: %s\n", deviceProp.deviceOverlap != 0 ? "Yes" : "No");
            System.out.printf(" Run time limit on kernels: %s\n", deviceProp.kernelExecTimeoutEnabled != 0 ? "Yes" : "No");
            System.out.printf(" Integrated: %s\n", deviceProp.integrated != 0 ? "Yes" : "No");
            System.out.printf(" Support host page-locked memory mapping: %s\n", deviceProp.canMapHostMemory != 0 ? "Yes" : "No");
            System.out.printf(" Compute mode: %s\n",
                deviceProp.computeMode == cudaComputeModeDefault ? "Default (multiple host threads can use this device simultaneously)" :
                deviceProp.computeMode == cudaComputeModeExclusive ? "Exclusive (only one host thread at a time can use this device)" :
                deviceProp.computeMode == cudaComputeModeProhibited ? "Prohibited (no host thread can use this device)" : "Unknown");
        }
    }
}
The result is:
There is 1 device supporting CUDA
Device 0: "GeForce GT 240"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 2
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 12
Number of cores: 96
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 1.34 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
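A side note on the two version lines: CUDA reports a version as a single integer encoded as 1000 * major + 10 * minor, so the `% 100` in the test only yields the right minor digit when the minor is 0, as with 3.0 here. A minimal sketch of a decoder follows; the class and method names are invented for illustration and are not part of JCuda.

```java
public class CudaVersionDecode {
    /** Major part of a CUDA version integer encoded as 1000 * major + 10 * minor. */
    static int major(int version) {
        return version / 1000;
    }

    /** Minor part of the same encoding. */
    static int minor(int version) {
        return (version % 1000) / 10;
    }

    /** Formats the encoded version as "major.minor". */
    static String format(int version) {
        return major(version) + "." + minor(version);
    }

    public static void main(String[] args) {
        System.out.println(format(3000)); // prints 3.0
        System.out.println(format(3010)); // prints 3.1 (version % 100 would give 10 here)
    }
}
```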
However, JCudaRuntimeSample.java, which I downloaded from the official JCuda website, throws an exception when I run it. The JCuda 0.3.0a JAR is on the classpath and the required .dll files are in jre1.6.0_07/bin, so I think the locations are correct.
The output is as follows:
Creating input data
Initializing device data using JCuda
Performing FFT using JCufft
Performing caxpy using JCublas
Performing scan using JCudpp
Error while loading native library with base name "JCudpp"
Operating system name: Windows XP
Architecture : x86
Architecture bit size: 32
Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load native library
at jcuda.LibUtils.loadLibrary(LibUtils.java:74)
at jcuda.jcudpp.JCudpp.assertInit(JCudpp.java:175)
at jcuda.jcudpp.JCudpp.cudppPlan(JCudpp.java:214)
at com.wellsynergy.kmatrix.ci.util.JCudaRuntimeSample.main(JCudaRuntimeSample.java:90)
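To double-check where the JVM actually looks for native libraries, a small diagnostic along these lines could be run. It is only a sketch: the class and method names are hypothetical, the base name "JCudpp" is taken from the error message above, and it checks only that a matching file exists in a java.library.path directory, not that the library's own dependencies can be resolved.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class NativeLibCheck {
    /**
     * Returns the files found in the directories of the given library path
     * whose names contain baseName (case-insensitive).
     */
    static List<File> findCandidates(String libraryPath, String baseName) {
        List<File> matches = new ArrayList<File>();
        String needle = baseName.toLowerCase();
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty path entries
            }
            File[] files = new File(dir).listFiles();
            if (files == null) {
                continue; // not a directory, or not readable
            }
            for (File f : files) {
                if (f.getName().toLowerCase().contains(needle)) {
                    matches.add(f);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path");
        System.out.println("java.library.path = " + path);
        List<File> found = findCandidates(path, "JCudpp");
        if (found.isEmpty()) {
            System.out.println("No file containing \"JCudpp\" found on java.library.path");
        }
        for (File f : found) {
            System.out.println("Candidate: " + f.getAbsolutePath());
        }
    }
}
```

If a candidate file is listed and the load still fails, the cause is more likely something the library itself depends on than its location.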
- GTX 470 on Linux, 64-bit
The device test also runs successfully here. The output is as follows:
There is 1 device supporting CUDA
Device 0: "GeForce GTX 470"
CUDA Driver Version: 3.0
CUDA Runtime Version: 3.0
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1341325312 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.22 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
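The "Number of cores" line is not reliable for this card: the test hard-codes 8 cores per multiprocessor, which holds for compute capability 1.x, whereas 2.0 devices have 32 per multiprocessor and 2.1 devices have 48. A minimal sketch of a lookup that takes the capability into account; the class and method names are made up for illustration, and the table covers only the capabilities mentioned here.

```java
public class CudaCoreCount {
    /** Cores per multiprocessor for a compute capability, or -1 if not in this small table. */
    static int coresPerMultiprocessor(int major, int minor) {
        if (major == 1) {
            return 8;
        }
        if (major == 2 && minor == 0) {
            return 32;
        }
        if (major == 2 && minor == 1) {
            return 48;
        }
        return -1;
    }

    /** Total core count, or -1 if the compute capability is unknown to the table. */
    static int totalCores(int major, int minor, int multiProcessorCount) {
        int perMultiprocessor = coresPerMultiprocessor(major, minor);
        return perMultiprocessor < 0 ? -1 : perMultiprocessor * multiProcessorCount;
    }

    public static void main(String[] args) {
        // Values taken from the two device query outputs in this post.
        System.out.println(totalCores(1, 2, 12)); // GT 240: prints 96
        System.out.println(totalCores(2, 0, 14)); // GTX 470: prints 448
    }
}
```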
When I run JCudppSample.java, testSort passes when sorting 32 elements or fewer, but fails when sorting more than 32 elements.
When I print the exception, it shows:
Creating input data
Performing sort with Java...
Performing sort with JCudpp...
Exception in thread "main" jcuda.CudaException: cudaErrorLaunchFailure
at jcuda.runtime.JCuda.checkResult(JCuda.java:184)
at jcuda.runtime.JCuda.cudaMemcpy(JCuda.java:1068)
at JCudppSample.sort(JCudppSample.java:89)
at JCudppSample.testSort(JCudppSample.java:46)
at JCudppSample.main(JCudppSample.java:28)
I am sorry for asking so many questions. Any help would be appreciated.