I am on 64-bit Ubuntu, kernel 2.6.35-2.
I have installed the CUDA SDK 3.2.16.
I use JCuda 0.3.2.
Good news: I can compile my CUDA samples (.cu → .cubin) with nvcc. Also good: I can compile and launch all the JCuda examples without errors in my Eclipse IDE.
So here’s the problem:
When I launch the JCudaRuntimeDriverMixSample, found here: jcuda.org - Samples, everything goes well (compilation and execution), but I have the output:
So, to repeat: the .cu is compiled to .cubin correctly and the whole Java program runs, but the job isn’t done. The output vector is the same as the input vector.
I think it could come from the device-to-host copy? Or maybe the kernel function is never called?
But why? I haven’t modified anything in this example.
Can you try adding
Arrays.fill(vector, 0.0f);
before the last call to cuMemcpyDtoH? This will at least show whether the kernel call or the memcopy fails…
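The reasoning behind this check can be sketched in plain Java, without any JCuda calls. The class name, method name and returned messages below are hypothetical and only illustrate how the result of the copy could be interpreted; it also assumes the input vector itself is not all zeros, otherwise the two cases cannot be told apart.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JCuda: interprets the host array
// after it was filled with a sentinel and then overwritten by cuMemcpyDtoH.
final class CopyBackCheck {
    static String diagnose(float[] input, float[] afterCopy, float sentinel) {
        // Build what the host array would look like if the copy wrote nothing.
        float[] untouched = new float[afterCopy.length];
        Arrays.fill(untouched, sentinel);

        if (Arrays.equals(afterCopy, untouched)) {
            // The sentinel survived: nothing was copied back from the device.
            return "copy failed";
        }
        if (Arrays.equals(afterCopy, input)) {
            // The copy worked, but the device data is still the original input.
            return "kernel had no effect";
        }
        return "kernel modified data";
    }
}
```

With the sample's vector, one would call `CopyBackCheck.diagnose(originalInput, vector, 0.0f)` right after the last cuMemcpyDtoH, where `originalInput` is a copy of the vector taken before it was uploaded.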
BTW: The sample needs a small update: I don’t know how the reference to JCufft slipped in there; it’s not needed and should be JCublas instead… And these “*2” factors are also not needed. Huh, I must have been overworked when I wrote this…
In any case, I’ll try to test this again tomorrow, maybe I can find out whether there’s something wrong with the example.
With the Arrays.fill(vector, 0.0f); before the last call to cuMemcpyDtoH, it gives the same result, so the cuMemcpyDtoH works!
But I tried to modify the example like this:
[ul]
[li]cudaMalloc of a CUdeviceptr for a float array, BUT no cuMemcpyHtoD[/li]
[li]call the CUDA native function, which fills the array with a static value (like 4.2f)[/li]
[li]finally, get the result from the GPU into a Java float array (cuMemcpyDtoH), using the pointer from the first cudaMalloc()[/li]
[/ul]
The result is very strange when I print my array. It gives me these values:
vector [0.5, 0.33333334, 0.25, 0.2, 0.16666667]
And these are exactly the values from one of my earlier examples, when I was trying things out. I had simply filled my Java float array initially with the values 1/i, for i = 2 to n+2.
So my conclusion: when I do a cuMemcpyDtoH, it returns the values of a previous array! But I don’t understand HOW, because I have even rebooted my computer!
I did a short test with the original version, and it seemed to work (no surprise, since I had already tested it before I uploaded it). But there were some possible bugs (which I mentioned above) that might cause errors in a different environment. I’m not sure whether they could be related to what you described, but in any case: I uploaded an updated version of this sample, you might want to test it. If you still encounter problems (which is not unlikely, since it was only a minor change), I can have a look at the example that you posted, probably early next week, and try to reproduce the behavior that you just described.
The nvcc compilation works correctly and the cubin is created.
But the output of the program is: Test FAILED
In the Java sample, where the test “is the expected value equal to the hostOutput value” is done (after the kernel call, line 144), I print all the hostOutput values like this:
OK, that has been a while… Could you please add
JCudaDriver.setExceptionsEnabled(true);
as the first line of the ‘main’? Maybe this already gives a hint about what might be wrong there…
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_INVALID_SOURCE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:170)
at jcuda.driver.JCudaDriver.cuModuleLoad(JCudaDriver.java:1400)
at dl.JCudaDriverCubinSample.main(JCudaDriverCubinSample.java:53)
No, there should be nothing wrong, except what is described here: http://forum.byte-welt.de/showthread.php?t=3494
This means that you may have to add the “-arch sm_XX” parameter when compiling the CUBIN file, where “XX” stands for the Compute Capability of your card.
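As a rough illustration, the compiler call could be assembled from Java as in the sketch below. The class name, the method and the file names are hypothetical; only the nvcc options (“-cubin”, “-ptx”, “-arch”, “-o”) are real, and the Compute Capability has to be filled in for the actual card.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JCuda: builds an nvcc command line
// that targets a specific Compute Capability.
final class NvccCommand {
    static List<String> build(String cuFile, String outFile,
                              int major, int minor, boolean ptx) {
        List<String> cmd = new ArrayList<>();
        cmd.add("nvcc");
        // Emit either a PTX file or a CUBIN file.
        cmd.add(ptx ? "-ptx" : "-cubin");
        // Compute Capability 2.1 becomes "sm_21", 1.3 becomes "sm_13", ...
        cmd.add("-arch");
        cmd.add("sm_" + major + minor);
        cmd.add(cuFile);
        cmd.add("-o");
        cmd.add(outFile);
        return cmd;
    }
}
```

The resulting list can be passed to `new ProcessBuilder(cmd).inheritIO().start()` to run the compiler, for example with `NvccCommand.build("vectorAdd.cu", "vectorAdd.cubin", 2, 1, false)` for a card with Compute Capability 2.1.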
Alternatively, you could use a PTX file instead of a CUBIN file, which may be more flexible. (I’m currently updating the samples to prefer PTX files, and hopefully I can upload them this week, together with the new version of JCuda for CUDA 4.0 and a short “Getting started” tutorial, which also covers the CUBIN/PTX issue.)
I wrote a little bit about creating kernels in the newly linked tutorial. It also covers CUBIN and PTX files, and an updated “JCudaDriverSample” has been added to show how PTX files may be used instead of CUBIN files. Maybe it’s worth a look for you.
EDIT: The Link to the VectorAdd CUDA file will be fixed soon