Problems on Windows 32-bit and Linux 64-bit

Hi,

I ran into some problems on Windows and Linux.

  1. GT240 on Windows XP, 32-bit

I can run the following device test.

import jcuda.runtime.*;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaError.*;
import static jcuda.runtime.cudaComputeMode.*;

public class JCudaDeviceQueryTest
{
    public static void main(String args[])
    {
        JCuda.setExceptionsEnabled(true);

        int deviceCountArray[] = new int[1];
        if (cudaGetDeviceCount(deviceCountArray) != cudaSuccess)
        {
            System.out.printf("cudaGetDeviceCount failed! CUDA Driver and Runtime version may be mismatched.\n");
            System.out.printf("\nTest FAILED!\n");
            System.exit(1);
        }
        int deviceCount = deviceCountArray[0];

        // This function call returns 0 if there are no CUDA capable devices.
        if (deviceCount == 0)
        {
            System.out.println("There is no device supporting CUDA");
        }

        int dev;
        int driverVersionArray[] = new int[1];
        int runtimeVersionArray[] = new int[1];
        for (dev = 0; dev < deviceCount; ++dev)
        {
            cudaDeviceProp deviceProp = new cudaDeviceProp();
            cudaGetDeviceProperties(deviceProp, dev);

            if (dev == 0)
            {
                // This function call returns 9999 for both major & minor fields, if no CUDA capable devices are present
                if (deviceProp.major == 9999 && deviceProp.minor == 9999)
                    System.out.printf("There is no device supporting CUDA.\n");
                else if (deviceCount == 1)
                    System.out.printf("There is 1 device supporting CUDA\n");
                else
                    System.out.printf("There are %d devices supporting CUDA\n", deviceCount);
            }
            
            String name = new String(deviceProp.name);
            name = name.substring(0, name.indexOf(0));
            System.out.printf("\nDevice %d: \"%s\"\n", dev, name);

            cudaDriverGetVersion(driverVersionArray);
            int driverVersion = driverVersionArray[0];
            System.out.printf("  CUDA Driver Version:                           %d.%d\n", driverVersion / 1000, driverVersion % 100);

            cudaRuntimeGetVersion(runtimeVersionArray);
            int runtimeVersion = runtimeVersionArray[0];
            System.out.printf("  CUDA Runtime Version:                          %d.%d\n", runtimeVersion / 1000, runtimeVersion % 100);

            System.out.printf("  CUDA Capability Major revision number:         %d\n", deviceProp.major);
            System.out.printf("  CUDA Capability Minor revision number:         %d\n", deviceProp.minor);

            System.out.printf("  Total amount of global memory:                 %d bytes\n", deviceProp.totalGlobalMem);
            System.out.printf("  Number of multiprocessors:                     %d\n", deviceProp.multiProcessorCount);
            System.out.printf("  Number of cores:                               %d\n", 8 * deviceProp.multiProcessorCount);
            System.out.printf("  Total amount of constant memory:               %d bytes\n", deviceProp.totalConstMem);
            System.out.printf("  Total amount of shared memory per block:       %d bytes\n", deviceProp.sharedMemPerBlock);
            System.out.printf("  Total number of registers available per block: %d\n", deviceProp.regsPerBlock);
            System.out.printf("  Warp size:                                     %d\n", deviceProp.warpSize);
            System.out.printf("  Maximum number of threads per block:           %d\n", deviceProp.maxThreadsPerBlock);
            System.out.printf("  Maximum sizes of each dimension of a block:    %d x %d x %d\n", 
                deviceProp.maxThreadsDim[0], 
                deviceProp.maxThreadsDim[1], 
                deviceProp.maxThreadsDim[2]);
            System.out.printf("  Maximum sizes of each dimension of a grid:     %d x %d x %d\n", 
                deviceProp.maxGridSize[0], 
                deviceProp.maxGridSize[1], 
                deviceProp.maxGridSize[2]);
            System.out.printf("  Maximum memory pitch:                          %d bytes\n", deviceProp.memPitch);
            System.out.printf("  Texture alignment:                             %d bytes\n", deviceProp.textureAlignment);
            System.out.printf("  Clock rate:                                    %.2f GHz\n", deviceProp.clockRate * 1e-6f);
            System.out.printf("  Concurrent copy and execution:                 %s\n", deviceProp.deviceOverlap != 0 ? "Yes" : "No");
            System.out.printf("  Run time limit on kernels:                     %s\n", deviceProp.kernelExecTimeoutEnabled != 0 ? "Yes" : "No");
            System.out.printf("  Integrated:                                    %s\n", deviceProp.integrated != 0 ? "Yes" : "No");
            System.out.printf("  Support host page-locked memory mapping:       %s\n", deviceProp.canMapHostMemory != 0 ? "Yes" : "No");
            System.out.printf("  Compute mode:                                  %s\n", 
                deviceProp.computeMode == cudaComputeModeDefault ? "Default (multiple host threads can use this device simultaneously)" : 
                    deviceProp.computeMode == cudaComputeModeExclusive ? "Exclusive (only one host thread at a time can use this device)" : 
                        deviceProp.computeMode == cudaComputeModeProhibited ? "Prohibited (no host thread can use this device)" : "Unknown");
        }
    }
}

The result is:


There is 1 device supporting CUDA

Device 0: "GeForce GT 240"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          3.0
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         2
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     12
  Number of cores:                               96
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.34 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)


However, it throws an exception when I run JCudaRuntimeSample.java, which I downloaded from the official JCuda website. The JCuda 0.3.0a jar is on the classpath and the required .dll files are placed in jre1.6.0_07/bin. I think that location is correct.

The result is shown as follows:


Creating input data
Initializing device data using JCuda
Performing FFT using JCufft
Performing caxpy using JCublas
Performing scan using JCudpp
Error while loading native library with base name "JCudpp"
Operating system name: Windows XP
Architecture         : x86
Architecture bit size: 32
Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load native library
	at jcuda.LibUtils.loadLibrary(LibUtils.java:74)
	at jcuda.jcudpp.JCudpp.assertInit(JCudpp.java:175)
	at jcuda.jcudpp.JCudpp.cudppPlan(JCudpp.java:214)
	at com.wellsynergy.kmatrix.ci.util.JCudaRuntimeSample.main(JCudaRuntimeSample.java:90)

  2. GTX 470 on Linux 64-bit

The device test also runs successfully. The result is shown as follows:

There is 1 device supporting CUDA

Device 0: "GeForce GTX 470"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          3.0
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 1341325312 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.22 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

When I run JCudppSample.java, testSort passes when sorting <= 32 elements. However, testSort fails when sorting more than 32 elements.

I tried printing the exception; it shows:


Creating input data
Performing sort with Java...
Performing sort with JCudpp...
Exception in thread "main" jcuda.CudaException: cudaErrorLaunchFailure
	at jcuda.runtime.JCuda.checkResult(JCuda.java:184)
	at jcuda.runtime.JCuda.cudaMemcpy(JCuda.java:1068)
	at JCudppSample.sort(JCudppSample.java:89)
	at JCudppSample.testSort(JCudppSample.java:46)
	at JCudppSample.main(JCudppSample.java:28)

I am sorry for so many questions. Please give me some help~
Thanks.

Regards,
Lemon

Hello Lemon,

Concerning the first error:
Exception in thread "main" java.lang.UnsatisfiedLinkError: Could not load native library

You might be missing the CUDPP library. As mentioned on the JCudpp site, the native CUDPP library has to be present:

…in order to use JCudpp, you need an installation of CUDPP - namely, the CUDPP library file, like the CUDPP.DLL for Windows, or the libCudpp.so for Linux.

By default, the required libraries are installed automatically when you install the NVIDIA CUDA SDK. They are then contained in the NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin directory of the SDK. For example, the CUDPP DLL for 32-bit Windows may be found in
"NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win32\Release"

You may want to try copying the DLL from the SDK directory into the same directory as the JCudpp DLL (at least for a first test - the DLL is large, and later it might make more sense to put it into a directory that is visible via the PATH variable).
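If you want to verify whether the JVM can see the library at all, a small stand-alone check like the following may help. The base name "cudpp" and the file names in the sketch are assumptions; adjust them to whatever your CUDPP binary is actually called:

```java
import java.io.File;

public class NativeLibCheck
{
    public static void main(String args[])
    {
        // Directories the JVM searches for native libraries
        String path = System.getProperty("java.library.path");
        for (String dir : path.split(File.pathSeparator))
        {
            // Assumed file names - adjust to your actual CUDPP binary
            if (new File(dir, "cudpp.dll").exists() ||
                new File(dir, "libcudpp.so").exists())
            {
                System.out.println("Found a CUDPP library in: " + dir);
            }
        }

        // Let the JVM itself try to resolve the library
        try
        {
            System.loadLibrary("cudpp");
            System.out.println("cudpp: loaded");
        }
        catch (UnsatisfiedLinkError e)
        {
            System.out.println("cudpp: not found on java.library.path");
        }
    }
}
```

If the last line reports "not found", copying the DLL next to the JCudpp DLL (or into a directory on java.library.path) should fix it.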

Concerning the second error: What does your function call look like? Note that in the call
JCudpp.cudppSort(handle, d_keys, null, 32, n);
the '32' does NOT stand for the number of elements to sort, but for the number of bits that the values have. The actual number of elements to sort is given as the last parameter, 'n' in this case. The radix sort uses the bit pattern of the numbers for sorting, and CUDPP supports up to 32 bits at the moment.
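Just to illustrate the roles of the two parameters: here is a plain-Java LSD radix sort in which 'keyBits' and 'n' correspond to the last two arguments of cudppSort. This is only a CPU sketch of the idea, not JCudpp code, and it assumes non-negative keys (the sign bit would otherwise sort negative values last):

```java
import java.util.Arrays;

public class RadixSortDemo
{
    // Sorts keys.length elements using the lowest keyBits bits of each key,
    // one stable partition per bit (least significant bit first).
    public static void radixSort(int keys[], int keyBits)
    {
        int n = keys.length;
        int buf[] = new int[n];
        for (int bit = 0; bit < keyBits; bit++)
        {
            int zeros = 0;
            for (int i = 0; i < n; i++)
            {
                if (((keys[i] >>> bit) & 1) == 0) zeros++;
            }
            int z = 0, o = zeros;
            for (int i = 0; i < n; i++)
            {
                if (((keys[i] >>> bit) & 1) == 0) buf[z++] = keys[i];
                else buf[o++] = keys[i];
            }
            System.arraycopy(buf, 0, keys, 0, n);
        }
    }

    public static void main(String args[])
    {
        int keys[] = { 170, 45, 75, 90, 802, 24, 2, 66 };
        radixSort(keys, 32); // 32 = bits per key, NOT the element count
        System.out.println(Arrays.toString(keys));
        // prints [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```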

bye

Hi Marco,

For the second error, the 32 is the value of n.

The function called in the main: testSort(32);

Lemon

That’s strange… I have never experienced such an error. The message (cudaErrorLaunchFailure) indicates that there is a problem with one of the kernels. At the moment, I’m not sure whether this is a problem in the native JCudpp library for Linux 64 (which I did not compile myself, since I don’t have a Linux 64 system available) or a general problem of CUDPP on Linux 64.
Is it possible to run the ‘radixSort’ example from the NVIDIA SDK?

I can run the ‘radixSort’ example from the NVIDIA SDK. But as I mentioned in the previous thread, the GT 240 card can sort about ~63 million elements. However, it throws an exception when I sort more than 30 million elements in the NVIDIA SDK.

The exception:

Cuda error: after radixsort in file 'testradixsort.cpp' in line 287 : invalid configuration argument.

I have no idea about it. >.<

Oh yes, sorry, this was related to the previous thread.

But I’m a little bit confused now, just for clarification, please correct me where I’m wrong:

  • With the GT240 on Windows XP 32-bit, you can use JCudpp and the radixSort example for ~63 million elements.
  • With the GTX 470 on Linux 64-bit, you can use radixSort for ~30 million elements, but it shows the “invalid configuration argument” error for >30 million elements.
  • With JCudpp it shows the “cudaErrorLaunchFailure” error even for 32 (not 32 million, but 32!?) elements?

It might be the case that the “invalid configuration argument” error shows up as the “cudaErrorLaunchFailure” error in JCudpp, but that’s only a guess. A web search for “invalid configuration argument” suggests that it is related to an illegal number of threads per block. I would have to look at the CUDPP source to see whether this number is somehow computed from the “maxThreadsPerBlock” value of the device properties, and whether there might be an error there, but as far as I know, there is no way to influence this directly. And in any case, this would not explain why it gives an error with JCudpp even for 32 elements…

Hi Marco,

Concerning the first error, I solved it by following your instructions and copying cudpp.dll to the library path. Many thanks.

Concerning the second error, you are right about my situation. I am going to ask the second question (“With GTX 470 on Linux 64-bit you can use radixSort for ~30 million elements, but it shows the ‘invalid configuration argument’ error for >30 million elements”) in the CUDPP group. I hope they can help me.

If the problem occurs in the “maxThreadsPerBlock” value, is there no way to solve it?

Thanks for your help.

Hello,

The “invalid configuration argument” error seems to be related to an illegal number of threads per block. And one of the differences between the two cards is the value returned as “maxThreadsPerBlock” (512 vs. 1024). It might be that CUDPP internally computes the number of threads to use, and the case where maxThreadsPerBlock is greater than 512 is not handled correctly, but I might be wrong. I’ll have to take a look at the CUDPP source, but unfortunately, I’ll not be at my “CUDA PC” very often in the next 1-2 weeks. Let’s see what the CUDPP group says about this, and I’ll also try to have a look at it as soon as possible.
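As a back-of-the-envelope check of the launch-configuration theory (the block sizes here are guesses, since I don’t know which values CUDPP picks internally): with a one-dimensional grid, the number of blocks may not exceed maxGridSize[0], which is 65535 on both cards above. Notably, 65535 * 512 is about 33.5 million, which is roughly where the sort starts to fail:

```java
public class LaunchConfigCheck
{
    // Number of blocks needed to cover n elements with one thread per element
    public static long blocksFor(long n, int threadsPerBlock)
    {
        return (n + threadsPerBlock - 1) / threadsPerBlock;
    }

    public static void main(String args[])
    {
        long maxGridDimX = 65535; // maxGridSize[0] reported by both cards
        // Guessed block sizes - the value CUDPP uses internally is unknown here
        for (int tpb : new int[]{ 256, 512, 1024 })
        {
            System.out.printf("threads/block %4d -> max elements in a 1D grid: %d%n",
                tpb, maxGridDimX * tpb);
        }
        // 34 million elements at 512 threads per block would need more than
        // 65535 blocks, i.e. an invalid 1D launch configuration
        System.out.println(blocksFor(34000000, 512) > maxGridDimX);
        // prints true
    }
}
```

This is only a plausibility argument, of course; whether CUDPP really launches a one-dimensional grid of this shape would have to be confirmed in its source.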

If it really turns out to be a bug in CUDPP, it might be necessary to compile a (fixed) version of the CUDPP library manually (which might be … not so straightforward on Linux 64).

But I still wonder if this was really true:

  • With JCudpp it shows the “cudaErrorLaunchFailure” error even for 32 (not 32 million, but 32!?) elements?
This would indicate an error in JCudpp, and at the moment I would not have any idea what might be the reason for this error…

bye

Hi Marco,

I posted the question in the CUDPP group; would you mind discussing it with them?
The thread is: http://groups.google.com/group/cudpp/browse_thread/thread/6baf9674da421b

Thanks

Lemon

Hello,

I’m back on my “real” PC, and hopefully can have a closer look at this now. I’ll post a reply to the Google Group Topic you linked.

bye