JCudaDriverGLSample3 problem

Hello! I modified JCudaDriverGLSample3 and changed the way the objects (points) are positioned. I don’t have meshWidth and meshHeight to calculate the positions; instead I use this function (I also created an additional class, Point3D):


private void initMesh()
    {
        pBuffer = Buffers.newDirectFloatBuffer(360600 * 4);
       
        for(float x = -6.0f; x < 6.0f; x += 0.02f)
        {
            for(float z = -14.0f; z < -2.0f; z += 0.02f)
            {
                pBuffer.put(new Point3D(x, 0, z, 1).getBuffer());
            }
        }
        
        pBuffer.flip();
    }

runJava function (the sine wave pattern is a little bit simpler):


gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vertexBufferObject);
ByteBuffer byteBuffer = gl.glMapBuffer(GL3.GL_ARRAY_BUFFER, GL3.GL_READ_WRITE);
        
if (byteBuffer == null)
{
     throw new RuntimeException("Unable to map buffer");
}
        
FloatBuffer vertices = byteBuffer.asFloatBuffer();
        
for(int i = 0; i < 360600; i++)
{
     vX = vertices.get(i * 4);
     vY = vertices.get(i * 4 + 1);
     vZ = vertices.get(i * 4 + 2); 
                
     float freq = 1.5f;
     float w = (float) Math.sin(vX * freq + animationState);
                          
     vertices.put(i * 4, vX);
     vertices.put(i * 4 + 1, w); 
     vertices.put(i * 4 + 2, vZ); 
}
        
gl.glUnmapBuffer(GL3.GL_ARRAY_BUFFER);
gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
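The same update can be exercised without any GL context by running it against a plain direct FloatBuffer. This is only a sketch: the class name SineWaveCpu, the helper applySineWave and its parameters are made up for illustration; only the stride-4 (x, y, z, w) layout and the sine formula come from the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class SineWaveCpu
{
    // Applies the same sine wave as the mapped-buffer loop above to a
    // stride-4 (x, y, z, w) vertex buffer; numPoints and animationState
    // are passed in explicitly here instead of being fields.
    static void applySineWave(FloatBuffer vertices, int numPoints, float animationState)
    {
        float freq = 1.5f;
        for (int i = 0; i < numPoints; i++)
        {
            float vX = vertices.get(i * 4);
            float w = (float) Math.sin(vX * freq + animationState);
            vertices.put(i * 4 + 1, w);
        }
    }

    public static void main(String[] args)
    {
        int numPoints = 4;
        FloatBuffer vertices = ByteBuffer
            .allocateDirect(numPoints * 4 * 4)
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();
        for (int i = 0; i < numPoints; i++)
        {
            vertices.put(i * 4, i * 0.5f); // x
            vertices.put(i * 4 + 3, 1.0f); // w
        }
        applySineWave(vertices, numPoints, 0.0f);
        System.out.println("y of point 1: " + vertices.get(5));
    }
}
```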

I need to write my own .cu file, but I have no idea what the kernel should look like in this case.

Hello

I think it should look similar to the original one - as far as I can see, you can still pass the positions in as a float4*… Do you have an approach for the kernel? Otherwise, maybe I can try to adjust the original kernel this week, but I’m not so sure where I should place this task on my todo-list… -_-

BTW: A loop like


for(float z = -14.0f; z < -2.0f; z += 0.02f)

is dangerous: You never know how many steps this loop will take! Note that, due to floating point inaccuracies, a computation like


float a = -14.0f;
a += 0.02f;
...
// (100 times)
...
a += 0.02f;

will most likely NOT yield a=-12 (as expected), but instead might yield something like a=-11.99999993112 or so. You should definitely convert these loops to integers so that they will look like


float minZ = -14.0f;
for (int iz = 0; iz < 600; iz++)
{
    float z = minZ + iz * 0.02f;
    ...
}

(then you will also have a mesh size again :wink: )
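The effect can be demonstrated in isolation. This standalone sketch (the class and method names are made up for the demo, not part of the sample) counts the iterations of the float-stepped loop and shows the drift of repeated addition:

```java
public class FloatLoopDemo
{
    // Counts how many iterations a float-stepped loop actually takes.
    static int countFloatStepped(float start, float end, float step)
    {
        int count = 0;
        for (float z = start; z < end; z += step)
        {
            count++;
        }
        return count;
    }

    // Repeatedly adds 'step' to 'start', n times, accumulating rounding error.
    static float accumulate(float start, float step, int n)
    {
        float a = start;
        for (int i = 0; i < n; i++)
        {
            a += step;
        }
        return a;
    }

    public static void main(String[] args)
    {
        // May report 600 or 601 steps, depending on how the error rounds.
        System.out.println("steps: " + countFloatStepped(-14.0f, -2.0f, 0.02f));
        // Mathematically -14 + 600 * 0.02 = -2, but 0.02 is not exactly
        // representable in binary, so the accumulated value drifts.
        System.out.println("accumulated: " + accumulate(-14.0f, 0.02f, 600));
        // The integer-counted form is deterministic: always exactly 600 values.
        float minZ = -14.0f;
        System.out.println("last z of integer loop: " + (minZ + 599 * 0.02f));
    }
}
```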

First of all, I changed the loop as you mentioned:


private void initMesh()
{
     pBuffer = Buffers.newDirectFloatBuffer(width * 500 * 4);
        
     float minX = -6.0f;
     float minZ = -14.0f;
        
     for(int i = 0; i < width; i++)
     {
          for(int j = 0; j < 500; j++)
          {
               pBuffer.put(new Point3D(minX + i * 0.02f, 0, minZ + j * 0.02f, 1).getBuffer());
          }
      }
     
     pBuffer.flip();
}
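The capacity math here can be checked without Point3D. Assuming that class simply wraps four floats (x, y, z, w), the same fill can be reproduced with direct puts into a plain FloatBuffer (MeshCapacityCheck and createMesh are hypothetical names for this sketch):

```java
import java.nio.FloatBuffer;

public class MeshCapacityCheck
{
    // Builds the mesh into a FloatBuffer, 4 floats (x, y, z, w) per point.
    // Point3D is replaced by direct puts, assuming it just wraps 4 floats.
    static FloatBuffer createMesh(int width, int depth)
    {
        FloatBuffer pBuffer = FloatBuffer.allocate(width * depth * 4);
        float minX = -6.0f;
        float minZ = -14.0f;
        for (int i = 0; i < width; i++)
        {
            for (int j = 0; j < depth; j++)
            {
                pBuffer.put(minX + i * 0.02f); // x
                pBuffer.put(0.0f);             // y
                pBuffer.put(minZ + j * 0.02f); // z
                pBuffer.put(1.0f);             // w
            }
        }
        pBuffer.flip();
        return pBuffer;
    }

    public static void main(String[] args)
    {
        FloatBuffer mesh = createMesh(500, 500);
        // The loops fill the buffer exactly: no floats left over, none missing.
        System.out.println(mesh.remaining()); // 1000000 = 500 * 500 * 4
    }
}
```

With integer-counted loops the allocated capacity and the number of floats written always agree, which was not guaranteed with the float-stepped loops of the first initMesh version.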

The variable width is declared as private int width = 500. I need width to calculate the vertex positions in my kernel. I also modified some parameters in the cuda function:


private void cuda()
{
     CUdeviceptr basePointer = new CUdeviceptr();
     cuGraphicsMapResources(1, new CUgraphicsResource[]{vboGraphicsResource}, null);
     cuGraphicsResourceGetMappedPointer(basePointer, new long[1], vboGraphicsResource); 
        
     int blockX = 10;
     int blockY = 10;
     cuFuncSetBlockShape(function, blockX, blockY, 1);

     Pointer dIn = Pointer.to(basePointer);
     Pointer pWidth = Pointer.to(new int[]{width});
     Pointer pAnimationState = Pointer.to(new float[]{animationState});
        
     int offset = 0;
        
     offset = align(offset, Sizeof.POINTER);
     cuParamSetv(function, offset, dIn, Sizeof.POINTER);
     offset += Sizeof.POINTER;
        
     offset = align(offset, Sizeof.INT);
     cuParamSetv(function, offset, pWidth, Sizeof.INT);
     offset += Sizeof.INT;
        
     offset = align(offset, Sizeof.FLOAT);
     cuParamSetv(function, offset, pAnimationState, Sizeof.FLOAT);
     offset += Sizeof.FLOAT;
        
     cuParamSetSize(function, offset);
        
     int gx = 500 / blockX;
     int gy = 500 / blockY;
     cuLaunchGrid(function, gx, gy);
     cuCtxSynchronize();

     cuGraphicsUnmapResources(1, new CUgraphicsResource[]{vboGraphicsResource}, null);
}
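The grid size computation gx = 500 / blockX only works out because 500 is exactly divisible by the block size. A common, more general form is the rounding-up division sketched below (gridSize is a hypothetical helper, not part of the sample); note that with a rounded-up grid the kernel then needs a bounds check so the extra threads do not write past the buffer:

```java
public class GridSize
{
    // Rounds n / blockSize up, so that gridSize * blockSize always
    // covers all n elements (hypothetical helper, not from the sample).
    static int gridSize(int n, int blockSize)
    {
        return (n + blockSize - 1) / blockSize;
    }

    public static void main(String[] args)
    {
        System.out.println(gridSize(500, 10)); // 50, same as 500 / 10
        System.out.println(gridSize(501, 10)); // 51: one extra, partially filled block
    }
}
```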

This is my kernel:


extern "C"
__global__ void createVertices(float4* positions, int width, float animationState)
{
     int tx = blockIdx.x * blockDim.x + threadIdx.x; 
     int ty = blockIdx.y * blockDim.y + threadIdx.y; 

     float4 temp = positions[tx + ty * width];
     float freq = 1.5f;
     float wave = sinf(temp.x * freq + animationState);

     positions[tx + ty * width] = make_float4(temp.x, wave, temp.z, temp.w);
}
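The index arithmetic of this kernel can be checked on the CPU: the Java sketch below emulates one "thread" per (tx, ty) pair with the same row-major tx + ty * width mapping, using a flat float array with 4 floats per point (class and method names are illustrative only):

```java
public class KernelIndexDemo
{
    // CPU emulation of the kernel: one "thread" per (tx, ty) pair,
    // positions laid out row-major as tx + ty * width, 4 floats per point.
    static void createVertices(float[] positions, int width, int height, float animationState)
    {
        float freq = 1.5f;
        for (int ty = 0; ty < height; ty++)
        {
            for (int tx = 0; tx < width; tx++)
            {
                int base = (tx + ty * width) * 4;
                float x = positions[base];
                positions[base + 1] = (float) Math.sin(x * freq + animationState);
            }
        }
    }

    public static void main(String[] args)
    {
        int width = 4;
        int height = 3;
        float[] positions = new float[width * height * 4];
        for (int i = 0; i < width * height; i++)
        {
            positions[i * 4] = i * 0.1f; // x
        }
        createVertices(positions, width, height, 0.5f);
        System.out.println("y of point (0, 0): " + positions[1]);
    }
}
```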

Everything works fine, but I have a problem when I switch the computation between CUDA and the CPU. For example, at first useCuda is true and CUDA works. Then I switch to the CPU and it also works, but when I switch back to CUDA there is no animation. Then I switch back to the CPU and it works, and so on. CUDA works only if it runs first in display (useCuda = true at the beginning); the CPU mode works all the time.

Hello

Concerning the remaining problem, for switching between Java and CUDA: Does the same problem also exist in the original unmodified sample? (I just tested it here again, and it seemed to work…).
Does it print any exception message or so?

bye

I can’t run the original sample. This line: cuModuleLoad(module, "simpleGL_kernel.sm_10.cubin"); causes jcuda.CudaException: CUDA_ERROR_INVALID_SOURCE.

A problem with the modified version may be related to initVBO. At the beginning of the program I have already created a mesh (initMesh) and the data is stored in pBuffer. Then initVBO looks like this:


private void initVBO(GL3 gl)
{
     int buffer[] = new int[1];
     gl.glGenBuffers(1, IntBuffer.wrap(buffer));
     vertexBufferObject = buffer[0];
     gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vertexBufferObject);
     gl.glBufferData(GL3.GL_ARRAY_BUFFER, Buffers.SIZEOF_FLOAT * pBuffer.capacity(), pBuffer, GL3.GL_DYNAMIC_DRAW);
     gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
     vboGraphicsResource = new CUgraphicsResource();
     cuGraphicsGLRegisterBuffer(vboGraphicsResource, vertexBufferObject, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
}

Oh yes, that’s because the CUBIN has been compiled for older architectures.

You may try compiling the original input file (from the NVIDIA sample). Depending on the Compute Capability of your device, the command line should roughly be
nvcc -cubin -arch sm_21 simpleGL_kernel.cu -o simpleGL_kernel.cubin
(for example, to compile it for a device with Compute Capability 2.1).

Then the line
cuModuleLoad(module, "simpleGL_kernel.cubin");
should work.

Hopefully the name mangling will not be different then, but it should not be. If you encounter any errors, please let me know. (I should probably try to make this example (or a similar, basic GL example) easier to use as well…)

bye

Concerning the ‘edit’: I don’t see such a major difference, except that you are trying to initialize the data during the creation - to what extent do you think it is related to the problem?

Hmm I don’t know, especially now when I saw what happens with the original version.

I compiled the original simpleGL_kernel.cu according to compute capability 2.1. Now I can run the original program, but it doesn’t work properly (worse than the modified version). At the beginning, both programs have useCuda = true. My version works fine with CUDA; the original version produces only random points on the screen. When I switch from CUDA to the CPU, both programs work fine in CPU mode, but when I switch back to CUDA, in both cases there is no animation. When I switch back to the CPU, both versions work, and so on.

You should see what happens. I used Fraps to record movies. (I run the program, then wait a few seconds and switch the mode, then wait again a few seconds, and so on.)

http://www.speedyshare.com/files/28518474/jcuda_jogl.rar

Wow, sorry, but I’ll probably not download ~120 MB right now…

It’s normal that it works with the Java version (there are fewer possible sources of errors). I’m not sure what might go wrong when switching between the CUDA and Java modes. (Maybe I should also mention that I consider myself an expert neither in CUDA nor in OpenGL - maybe I also made an error when I created this example, although I tested and verified it as far as I could, of course…)

When the original version only shows “random points”, there is still something wrong with the invocation. Also no error messages in this case? :confused: (And… do you really have a card with Compute Capability 2.1?)

Sorry for the big size of these files, but this is specific to Fraps. I recorded the movies again and then compressed them. Now in the original sample you can see more than some points, but it still doesn’t look good. There are no error messages in either case. I have an MSI GeForce GTX 560 Ti Twin Frozr II OC.

Compressed movies:
http://www.speedyshare.com/files/28520062/jcuda.rar

OK, but I still don’t know whether you can see any error messages in any case. In doubt, it might be necessary to accept that the “feature” of switching between both modes at runtime does not work in all cases. If someone finds a different solution for this, I’ll integrate it into the sample.

What exactly do you mean by error messages? I just run the program in NetBeans and there are no error messages generated by the IDE. The switching doesn’t work in either case, but what may cause that strange effect in the original sample when the program starts in CUDA mode?

Hello

Especially when
JCudaDriver.setExceptionsEnabled(true);
is enabled, it should print a stack trace when anything goes wrong (that is, when one of the functions returns a value different from CUDA_SUCCESS).

But with GL interoperation, I also occasionally had this strange effect (when I started to create the sample), where the points seemed to be “randomly” distributed. I’m sorry, but at the moment I cannot point out a specific reason for that.

Did you have the chance to test the other (Driver API) samples, for example, the JCudaDriverCubinSample or the JCudaDriverTextureTest? This might help to see whether it’s a general problem or specifically related to the GL interoperation…

bye

I tested the original sample and the modified one on a GeForce 9600M GT (compute capability 1.1), and the result was the same as on my device.

I use JOGL2 RC2 Signed Released (jog-2.0-b23-20110303-windows-i586), so I decided to test these samples also with the latest automatic build of JOGL2 RC2 (jogl-2.0-b391-20110517-windows-i586), but nothing changed.

The first problem is that strange effect when running the original sample, and the second is the switching between CUDA and the CPU. Has the switching ever worked on your computer?

Today I tested JCudaDriverCubinSample, JCudaDriverTextureSample and JCudaDriverTextureTest. I compiled the kernels for my GeForce GTX 560 Ti with compute capability 2.1 and everything works perfectly.

JCuda/JOGL interoperability is a very useful feature, so I hope that we will resolve the problem, but right now I don’t have any idea what is wrong :frowning:

As a short wrap-up

  • The “simple” examples (without GL interop) work properly
  • Your modified version of the GL example works with CUDA, until you switch between CUDA and Java
  • The original GL example does not work at all in CUDA mode (just shows “random” points)

The original sample worked on my computer and on other computers, and switching between CUDA and Java also worked well. Until now, I could only test it on Windows machines, mainly 32 bit, but this obviously matches your setup.

The most interesting part for me at the moment is: What is the difference between the original and the modified sample? (To find out why the modified one works better than the original one).
Concerning the possibility to switch between the computing modes… (maybe I should have omitted this feature in the sample, then nobody would even have considered that this could be possible ;)) … One approach could be to reduce the program, as far as possible, to flesh out the minimal sequence of operations that reproducibly causes the problem.

At the moment I hardly see any other way to find out what is wrong there. The reasons may be related to hardware, drivers, the CUDA toolkit, the frameworks (JOGL and JCuda) or, of course, to the sample itself… This sort of debugging may be difficult… :frowning: Sorry for the inconvenience…

Although I don’t have too much hope that this will help, you may try wrapping each GL3 that is obtained from a drawable in a DebugGL3 - this should at least rule out any purely GL-related errors:
GL3 gl = drawable.getGL().getGL3();
gl = new DebugGL3(gl);

EDIT: BTW, which is your original setup (i.e. the other one, not the GeForce 9600 M GT…) ?

EDIT2: Wait a moment, the JCudaDriverTextureSample is also working? :eek: It makes even heavier usage of GL functions - so it’s more likely related to the sample itself than to the GL interop in general?!

bye
Marco

  • The original GL example does not work at all in CUDA mode (just shows “random” points) - yes, but when I switch to the CPU it works, and when I switch back to CUDA it doesn’t work; there are no more random points, just a mesh without an animation. You can see this precisely in the movie.

  • JCudaDriverCubinSample, JCudaDriverTextureSample and JCudaDriverTextureTest work perfectly

  • gl = new DebugGL3(gl); didn’t help

My setup: WinXp SP3 32 bit, Java SE Development Kit 6 Update 25, Netbeans 7.0, JOGL2 RC2 (jog-2.0-b23-20110303-windows-i586), JCuda 0.3.2a, CUDA toolkit 3.2, GeForce GTX 560 Ti (ForceWare 266.66).

I also tested the original JCudaDriverGLSample and the modified version on my friend’s laptop (Win7 32 bit, 9600M GT), but the result was the same as on my device.

This is my version of JCudaDriverGLSample: http://www.speedyshare.com/files/28545100/ModifiedSample.rar You can check the difference between the original and the modified sample.

Hello, I have problems downloading it; could you send it via PM or mail or so? Maybe I can find the time to compare both versions.

I have good news. I installed the newest drivers for my graphics card (ForceWare 270.61) and now the modified version works. However, I had the newest drivers (ForceWare 266.66) that were available when CUDA Toolkit 3.2 was released, so it looks like there were some bugs.

Unfortunately the original version still doesn’t work.

Try here: http://www.sendspace.com/file/ejstwu
or send me via PM your email address.

That’s confusing. Usually, the hint to update the drivers is the first one given when any problem is reported (anywhere, in any context :wink: ), but I forgot it in this case…

But specifically for CUDA, it’s not so obvious which driver version to use: I’m usually using the Developer Driver for my current CUDA toolkit version, which is available from the CUDA download site, together with the toolkit. The most recent one is here, but that’s for CUDA 4.0 - and the driver version is 270.51. The version that was distributed here for CUDA 3.2 has version 263.06.

In both cases, using these drivers would be a “downgrade” compared to your version 270.61, but admittedly, I’m not sure what the difference is between the “normal” drivers and the “developer drivers” - it might be the case that some CUDA-specific features do not work with the “normal” drivers, but that’s just a guess. A quick web search brought some results, but none of them with a clear statement -_-

However, I downloaded the modified sample, and will try to find possible differences. (It’s a little bit difficult, since the original sample works for me, and I have installed the CUDA 4.0 RC1, and intend to update to RC2 soon and release JCuda 0.4.0-RC2-beta1, but I think I can allocate some time early next week).

bye
Marco

Hello Ranger,

Today I had the chance to test the sample with CUDA 4.0 RC2 on a Windows Vista 64 bit machine, with … some GeForce GTX 280 (?), Compute capability 1.3: With the original CUBIN file, I also received random points, but after re-compiling the CUBIN it worked well (also switching between the compute modes). Are you sure that you used a proper CUBIN file?
I did not have the chance to take a closer look at the modified sample. But I hope that I can “finish” the update to CUDA 4.0 RC2 today, and then create samples (and KernelLauncher support) for PTX and JIT, maybe this alleviates this problem.

bye