Grain Growth Simulation

I made a simple grain growth simulation using JCuda/JOGL interoperability. The program works: in CPU mode everything is fine, but when I run the CUDA version, the growth of the grains (their shape) isn't correct.

The program is based on the JCudaDriverGLSample. I use a VBO to store the positions of the points (first half of the VBO) and the colors of the points (second half of the VBO). The positions are initialized once; after that the simulation only uses the second half of the VBO plus auxiliary objects (a buffer and an array) to store colors. Each position and color has x, y, z and w components. Each position always has w = 1.0f, but when a color has w = 0.0f, there is no grain at that cell. I simplified the rules to the extreme - each grain has the same color, so I don't have to do any extra computation.
At the beginning the program randomizes some initial grains which will grow. The grain mesh dimension is 1000 x 1000.
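
To make the indexing clearer, here is a small sketch of the layout (the names are just illustrative, not the actual fields of my program):


// Sketch of the VBO layout: width * width float4 positions,
// followed by width * width float4 colors.
static final int WIDTH = 1000;               // grainMeshWidth
static final int COMPONENTS = 4;             // x, y, z, w
static final int COLOR_OFFSET = WIDTH * WIDTH * COMPONENTS; // start of the color half

// index of the first float of the position of cell (i, j)
static int positionIndex(int i, int j)
{
    return (j + i * WIDTH) * COMPONENTS;
}

// index of the first float of the color of cell (i, j);
// a color with w == 0.0f means "no grain", a position always has w = 1.0f
static int colorIndex(int i, int j)
{
    return COLOR_OFFSET + positionIndex(i, j);
}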

The question is: where did I make a mistake when creating the kernel based on the CPU version?

CUDA settings (10 x 10 threads per block and a 100 x 100 grid, i.e. 1000 x 1000 threads - one thread per mesh cell):


cuFuncSetBlockShape(function, 10, 10, 1);
cuLaunchGrid(function, 100, 100);

The CPU version works correctly. Grains in some phase of growth: screen


private void cpu(GL gl)
    {
        gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo);
        byteBuffer = gl.glMapBuffer(GL3.GL_ARRAY_BUFFER, GL3.GL_READ_WRITE);
        
        if (byteBuffer == null)
        {
            throw new RuntimeException("Unable to map buffer");
        }
        
        data = byteBuffer.asFloatBuffer();
        
        for(int i = 0; i < 1000; i++)
        {
            for(int j = 0; j < 1000; j++)
            {
                
                n = 0;
                
                if((i > 0) && (j > 0) && (i < 999) && (j < 999)) // skip the border cells
                {
                    // the cell is empty (color w == 0), so it may take a color from a neighbour
                    if(data.get((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW) == 0)
                    {
                        
                        
                        // check the 8 neighbours and collect the colors of the non-empty ones
                        if(data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get(((((j - 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                        
                        if(data.get((((j  + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get(((j + (i - 1) * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get((((j + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                            
                        if(data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get(((((j + 1) + (i - 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                            
                        if(data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get((((j + 1) + i * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get(((((j + 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                            
                        if(data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get(((((j + 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                            
                        if(data.get((((j  + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get(((j + (i + 1) * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get((((j + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                            
                        if(data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get(((((j - 1) + (i + 1) * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            n++;
                        }
                            
                        if(data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW) != 0)
                        {
                           
                            neighbourColors[n] = new Float4(data.get((((j - 1) + i * grainMeshWidth) * 4) + offsetColor), 
                                                            data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetY),
                                                            data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetZ), 
                                                            data.get(((((j - 1) + i * grainMeshWidth) * 4) + offsetColor) + offsetW));
                            
                        }
                        
                   }
                    
                   
                   if(neighbourColors[0].getW() != 0)  // here should be some function which returns the most frequent color, but I simplified this
                   {
                       colorsBuffer.put(((j + i * grainMeshWidth) * 4), neighbourColors[0].getX());
                       colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetY), neighbourColors[0].getY());
                       colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetZ), neighbourColors[0].getZ());
                       colorsBuffer.put((((j + i * grainMeshWidth) * 4) + offsetW), neighbourColors[0].getW());
                      
                       
                       
                       for(int n = 0; n < 8; n++)
                       {
                           neighbourColors[n].reset();
                       }
                   }
                }
            }
        }
        
        
        // second pass: copy the new colors from the auxiliary buffer into the VBO
        for(int i = 0; i < 1000; i++)
        {
            for(int j = 0; j < 1000; j++)
            {
                if((i > 0) && (j > 0) && (i < 999) && (j < 999))
                {
                    if(colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetW) != 0)
                    {
                        data.put((((j + i * grainMeshWidth) * 4) + offsetColor), colorsBuffer.get((j + i * grainMeshWidth) * 4));
                        data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetY), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetY));
                        data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetZ), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetZ));
                        data.put(((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW), colorsBuffer.get(((j + i * grainMeshWidth) * 4) + offsetW));
                    }
                    
                }
                
            }
        }
        
        gl.glUnmapBuffer(GL3.GL_ARRAY_BUFFER);
        gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
    }

The CUDA version in some phase of growth (the shape of the grains should be identical to the CPU version): screen


extern "C"

__global__ void graingrowth(float4* vbo, float4* colors, int width) 

{ 
	

	int tx = blockIdx.x * blockDim.x + threadIdx.x; 
	int ty = blockIdx.y * blockDim.y + threadIdx.y; 

	
	float4 tab[8];

	for(int i = 0; i < 8; i++)
	{
		tab[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
	}

	int x = 0;

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))
	{
	     
	     // the color of cell (tx, ty) is stored in the second half of the
	     // VBO, after the 1000000 (= width * width) positions
             if(vbo[(ty + tx * width) + 1000000].w == 0.0f)
	     {
			
		  if(vbo[((ty - 1) + (tx - 1) * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[((ty - 1) + (tx - 1) * width) + 1000000].x), 
                                            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].y),
                                            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].z),
				            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].w));
	
		       x++;

	          }

		  
	          if(vbo[(ty + (tx - 1) * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[(ty + (tx - 1) * width) + 1000000].x), 
                                            (vbo[(ty + (tx - 1) * width) + 1000000].y),
                                            (vbo[(ty + (tx - 1) * width) + 1000000].z),
				            (vbo[(ty + (tx - 1) * width) + 1000000].w));
			
		       x++;	
	          
		  }

		  
	          if(vbo[((ty + 1) + (tx - 1) * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[((ty + 1) + (tx - 1) * width) + 1000000].x), 
                                            (vbo[((ty + 1) + (tx - 1) * width) + 1000000].y),
                                            (vbo[((ty + 1) + (tx - 1) * width) + 1000000].z),
				            (vbo[((ty + 1) + (tx - 1) * width) + 1000000].w));

		       x++;

	          }

		  if(vbo[((ty + 1) + tx * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[((ty + 1) + tx * width) + 1000000].x), 
                                            (vbo[((ty + 1) + tx * width) + 1000000].y),
                                            (vbo[((ty + 1) + tx * width) + 1000000].z),
				            (vbo[((ty + 1) + tx * width) + 1000000].w));

	               x++;

	          }

		  if(vbo[((ty + 1) + (tx + 1) * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[((ty + 1) + (tx + 1) * width) + 1000000].x), 
                                            (vbo[((ty + 1) + (tx + 1) * width) + 1000000].y),
                                            (vbo[((ty + 1) + (tx + 1) * width) + 1000000].z),
				            (vbo[((ty + 1) + (tx + 1) * width) + 1000000].w));

		       x++;

	          }

		  if(vbo[(ty + (tx + 1) * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[(ty  + (tx + 1) * width) + 1000000].x), 
                                            (vbo[(ty  + (tx + 1) * width) + 1000000].y),
                                            (vbo[(ty  + (tx + 1) * width) + 1000000].z),
				            (vbo[(ty  + (tx + 1) * width) + 1000000].w));

		       x++;

	          }

		  if(vbo[((ty - 1) + (tx + 1) * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[((ty - 1) + (tx + 1) * width) + 1000000].x), 
                                            (vbo[((ty - 1) + (tx + 1) * width) + 1000000].y),
                                            (vbo[((ty - 1) + (tx + 1) * width) + 1000000].z),
				            (vbo[((ty - 1) + (tx + 1) * width) + 1000000].w));
		       x++;

	          }

	          if(vbo[((ty - 1) + tx * width) + 1000000].w != 0.0f)
		  {

	               tab[x] = make_float4((vbo[((ty - 1) + tx * width) + 1000000].x), 
                                            (vbo[((ty - 1) + tx * width) + 1000000].y),
                                            (vbo[((ty - 1) + tx * width) + 1000000].z),
				            (vbo[((ty - 1) + tx * width) + 1000000].w));

	          }

	      }


	      if(tab[0].w != 0)
	      {

	           colors[(ty + tx * width)] = tab[0];
		   
	      }

	   

	}

	

	if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999))
	{
	     
             if(colors[(ty + tx * width)].w != 0.0f)
	     {

	          vbo[(ty + tx * width) + 1000000] = colors[(ty + tx * width)];
	     }

	}

}


Hello

Again, it's basically impossible to debug such a huge amount of code by just reading it - and even if it were a complete program, it would not be easy… because… well, I'm not sure, but is the code block


tab[x] = make_float4((vbo[((ty - 1) + (tx - 1) * width) + 1000000].x), 
                                            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].y),
                                            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].z),
				            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].w));

supposed to be something different than


tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

? :confused:

But in any case, one thing that is likely to cause problems: You are reading and writing data at the same time. Let's use "vbo[x][y]" as a short form for the actual access (you could also use a #define for that). Imagine that Thread (2,2) is going through the first large if-statement. In the beginning, it will check the value of vbo[1][1] and possibly write it into tab[0]. Later, at the end of the main loop, this value will be written into colors[2][2]. In the if-statement at the bottom, colors[2][2] will be written into vbo[2][2].
But remember that this is done by all threads simultaneously! So during this process, Thread (1,1) might modify the value of vbo[1][1], changing the behavior of thread (2,2). It basically means that the program will show "random" behavior.

Again, this is a fairly complex kernel in its current form, and I could not understand in depth what is done there. Maybe there is another (possibly simpler) reason for the behavior you described. But in general, reading and writing the same data with multiple threads simultaneously will cause unpredictable behavior. I'm not sure how to avoid this here, since it's not completely clear to me what the program is doing - is this some sort of "Game of Life"?

bye

These blocks of code are identical - vbo is of float4 type (in the kernel) and so is tab. make_float4 needs to get the four components of a vbo float4 element (x, y, z, w).


tab[x] = make_float4((vbo[((ty - 1) + (tx - 1) * width) + 1000000].x), 
                                            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].y),
                                            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].z),
				            (vbo[((ty - 1) + (tx - 1) * width) + 1000000].w));

I wrote the kernel in this way due to its similarity to the CPU version, where vbo is just a FloatBuffer, but in the kernel it can also be written as follows:


tab[x] = vbo[((ty - 1) + (tx - 1) * width) + 1000000];

The idea of grain growth is based on cellular automata, and on the Game of Life too. Grain growth simulation is used in metallurgy, but the rules in this version are extremely simple. When you run the program you will see what's going on (to change the mode, just comment out cuda or cpu in display and rerun the program). In a grain growth simulation, cells can only be born. The main rule is: when a cell is empty (there's no color: if(data.get((((j + i * grainMeshWidth) * 4) + offsetColor) + offsetW) == 0)) it can take the color which is most frequent among its 8 surrounding neighbours. I simplified the rules - each initial grain has the same color, so I don't have to check which color is most frequent in the tab; I can just take the first color. I use if((tx > 0) && (ty > 0) && (tx < 999) && (ty < 999)) to avoid an index out of bounds exception.
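
Just to sketch what the unsimplified rule would look like (this function is not in my program, since all grains share the same color anyway - it assumes that Float4 has a suitable equals method):


// Sketch only: return the most frequent color among the n collected
// neighbour colors (assumes n > 0).
private Float4 mostFrequentColor(Float4[] neighbourColors, int n)
{
    Float4 best = neighbourColors[0];
    int bestCount = 0;
    for (int a = 0; a < n; a++)
    {
        int count = 0;
        for (int b = 0; b < n; b++)
        {
            if (neighbourColors[b].equals(neighbourColors[a]))
            {
                count++;
            }
        }
        if (count > bestCount)
        {
            bestCount = count;
            best = neighbourColors[a];
        }
    }
    return best;
}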

Here is the complete program (based on the JCudaDriverGLSample):
Download

I'm sometimes using the same approach: Since I have not yet written many kernels, and because errors in CUDA or OpenCL programs can cause nasty, nasty crashes, I write the kernel in Java as far as possible and test it, and then convert it into a real kernel. Until now, I mainly did this for OpenCL - for that, I actually wrote a class containing some "dummy fields" and methods that allows writing the code very similarly to the "real" one. It may sound strange that you start loving ArrayIndexOutOfBoundsExceptions, but they are very convenient compared to the painful death of a buggy CUDA kernel…

However, there are aspects that cannot be translated directly into Java - namely the fact that there is no predefined order of execution. [EDIT: At least not for one million threads, of course :wink: ]. In a sequence like


data[threadIndex+1] = 1;
data[threadIndex  ] = 2;

you never know which values data[0] and data[1] will contain - thread 0 will write a '1' into data[1], and thread 1 will write a '2' into data[1] - possibly at the same time, so that data[1] afterwards contains garbage…

But I’m still not sure whether this is also the main reason here. Maybe I can try to have a look at the code, but it may take a while (I’ll be on a business trip soon).

The topic (cellular automata) is, however, quite interesting, and I know that a lot of research has already been done on bringing these to CUDA. Maybe there are also some simple kernels available on the web, which may be used as starting points for your own tests, and where the synchronization issues have already been addressed…

bye

I found the game of life kernel, but it didn’t give me an answer:


#include<iostream>
using namespace std;
#ifndef _SIMPLEGL_KERNEL_H_
#define _SIMPLEGL_KERNEL_H_


__global__ void kernel(float4* in,float4* out, unsigned int width, unsigned int height)
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
   	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    	// calculate uv coordinates
    	float u = x / (float) width;
    	float v = y / (float) height;
    	u = u -0.5f; 
    	v = v -0.5f;
    	float count = 0.0f;
	
	
	if(x > 0 && x < width -1 && y > 0 && y < height-1)
	{
		
		count = in[y*width+x+1].w + in[y*width+x-1].w + in[(y+1)*width+x].w + in[(y-1)*width+x].w +
		in[(y+1)*width+x+1].w + in[(y+1)*width+x-1].w + in[(y-1)*width+x+1].w + in[(y-1)*width+x-1].w;
	
	}					
	if(count == 3.0f && in[y*width+x].w == 0.0f)
	{
		out[y*width+x] = make_float4(u, v, 0.0f, 1.0f);
	}
	else
	{
		if((count == 2.0f || count == 3.0f) && in[y*width+x].w == 1.0f)
		{
			out[y*width+x] = make_float4(u, v, 0.0f, 1.0f);
		}
		else
		{
			out[y*width+x] = make_float4(u, v, 0.0f, 0.0f);

		}
	}


}

__global__ void initGame(float *tab, float4* pos, unsigned int width, unsigned int height)
{
	unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    	unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    	// calculate uv coordinates
    	float u = x / (float) width;
    	float v = y / (float) height;
    	u = u -0.5f; 
    	v = v -0.5f;

    
    	if(tab[y*width + x] > 0.5f)
    	{
		pos[y*width+x] = make_float4(u, v, 0.0f, 1.0f);
    	}		
    	else
    	{
		pos[y*width+x] = make_float4(u, v, 0.0f, 0.0f);
    	}
}

extern "C" void launch_initGame(float4* pos, unsigned int mesh_width, unsigned int mesh_height)
{
	
	float* tab = new float[mesh_width*mesh_height];
	
	for(unsigned int i = 0; i<mesh_width; i++)
	{
		for(unsigned int j=0; j<mesh_height; j++)
		{
			tab[j*mesh_width + i]= (float)(rand()/(double)RAND_MAX);//[0,1)			
		}
	}
			
	 // execute the kernel
    	dim3 block(16, 16, 1);
    	dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    
    	float* d_tab;
    	int size = mesh_width * mesh_height * sizeof(float);
    
    	cudaMalloc(&d_tab,size);
    	cudaMemcpy(d_tab, tab, size, cudaMemcpyHostToDevice);
    
    	initGame<<< grid, block>>>(d_tab,pos, mesh_width, mesh_height);
    	cudaFree(d_tab);
    	delete[] tab;

}

// Wrapper for the __global__ call that sets up the kernel call
extern "C" void launch_kernel(float4* in,float4* out, unsigned int mesh_width, unsigned int mesh_height)
{
    	// execute the kernel
    	dim3 block(16, 16, 1);
    	dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    	kernel<<< grid, block>>>(in, out, mesh_width, mesh_height);
}

#endif // #ifndef _SIMPLEGL_KERNEL_H_


Unfortunately I didn't find any kernels related to grain growth simulation except this one: http://www-e.uni-magdeburg.de/stoeter/ but the download link is broken.

Well, the game of life at least gives a partial answer: As far as I can see from a short glance, the input and output are completely separated there. I'm not sure whether this is an option for your simulation, but if you duplicate the input data, and the output only depends on the unmodified input, this could easily solve all synchronization issues - as long as it does not mean that you have to copy data forth and back or so. There are some… more or less "elegant" options. For a cloth simulation in OpenCL, I used the option that only each second step is really rendered, so that I could simply "swap" the buffers after each simulation step… but I'm not sure if this is applicable here.
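
In pseudo-Java, this ping-pong pattern would roughly look like this (just a sketch with made-up names):


// Each step reads ONLY from 'in' and writes ONLY to 'out', then the
// references are swapped - so no cell is read and written in the same step.
float[] in = new float[size];
float[] out = new float[size];

void step()
{
    computeNextGeneration(in, out); // reads 'in', writes 'out'
    float[] tmp = in;               // swap the buffers for the next step
    in = out;
    out = tmp;
}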
Maybe it's worth contacting the author for the source code? Depending on what you want to achieve. (I will not be able to have a closer look at this soon, maybe next week, but I'm not sure.)

Hello!

I sent an email to the author and I’m waiting for the reply.

I resolved the synchronization problem using separated input and output. Double buffering seems to be the only way to resolve problems like this.

Unfortunately I found a performance problem - I get only 30 fps when using CUDA (grainMeshWidth = 1000, so the grain mesh is 1000 x 1000). The CPU version reaches about 10 fps, but CUDA should run much faster than 30 fps.

I use a VBO to store the positions of the points (first half of the VBO) and the colors of the points (second half of the VBO). Due to the double buffering I have two VBO objects and an extra array. At the beginning I initialize the buffers and then use them when creating the VBOs.

initParticles:


grainBuffer1 = Buffers.newDirectFloatBuffer((grainMeshWidth * grainMeshWidth * 4) + (grainMeshWidth * grainMeshWidth * 4));
grainBuffer2 = Buffers.newDirectFloatBuffer((grainMeshWidth * grainMeshWidth * 4) + (grainMeshWidth * grainMeshWidth * 4));
colors = new float[grainMeshWidth * grainMeshWidth * 4]; // an extra array used in CUDA

//some initialize work

initVBO:


    private void initVBO(GL3 gl)
    {
        int buffer1[] = new int[1];
        gl.glGenBuffers(1, IntBuffer.wrap(buffer1));
        vbo1 = buffer1[0];
        gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo1);
        gl.glBufferData(GL3.GL_ARRAY_BUFFER, grainBuffer1.capacity() * Buffers.SIZEOF_FLOAT, grainBuffer1, GL3.GL_DYNAMIC_DRAW);
        gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
        
        int buffer2[] = new int[1];
        gl.glGenBuffers(1, IntBuffer.wrap(buffer2));
        vbo2 = buffer2[0];
        gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo2);
        gl.glBufferData(GL3.GL_ARRAY_BUFFER, grainBuffer2.capacity() * Buffers.SIZEOF_FLOAT, grainBuffer2, GL3.GL_DYNAMIC_DRAW);
        gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, 0);
        
        vboGraphicsResource1 = new CUgraphicsResource();
        cuGraphicsGLRegisterBuffer(vboGraphicsResource1, vbo1, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
        vboGraphicsResource2 = new CUgraphicsResource();
        cuGraphicsGLRegisterBuffer(vboGraphicsResource2, vbo2, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD);
    }

initCuda


    private void initCuda()
    {
        
        JCudaDriver.setExceptionsEnabled(true);

        cuInit(0);
        CUdevice dev = new CUdevice();
        cuDeviceGet(dev, 0);
        CUcontext glCtx = new CUcontext();
        cuGLCtxCreate(glCtx, 0, dev);

        CUmodule module = new CUmodule();
        cuModuleLoad(module, "grainkernel.cubin");

        function = new CUfunction();
        cuModuleGetFunction(function, module, "graingrowth");
        
        // size = grainMeshWidth * grainMeshWidth * 4 * Sizeof.FLOAT
        // (the size of the colors array in bytes)
        cuMemAlloc(colorsDevice, size);
        cuMemcpyHtoD(colorsDevice, Pointer.to(colors), size);
        
     }

Due to the double buffering I use a flip variable.

In display it looks like this:


        if(!flip)
        {
            gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo1);
            flip = true;
        }
        else
        {
            gl.glBindBuffer(GL3.GL_ARRAY_BUFFER, vbo2);
            flip = false;
        }

cuda:


    private void cuda()
    {
        CUdeviceptr basePointer1 = new CUdeviceptr();
        CUdeviceptr basePointer2 = new CUdeviceptr();
        cuGraphicsMapResources(1, new CUgraphicsResource[]{vboGraphicsResource1}, null);
        cuGraphicsMapResources(1, new CUgraphicsResource[]{vboGraphicsResource2}, null);
        cuGraphicsResourceGetMappedPointer(basePointer1, new long[1], vboGraphicsResource1); 
        cuGraphicsResourceGetMappedPointer(basePointer2, new long[1], vboGraphicsResource2); 
        cuFuncSetBlockShape(function, 10, 10, 1);

        Pointer dIn1 = Pointer.to(basePointer1);
        Pointer dIn2 = Pointer.to(basePointer2);
        Pointer pColors = Pointer.to(colorsDevice);
        Pointer pGrainMeshWidth = Pointer.to(new int[]{grainMeshWidth});
        
        int offset = 0;
        
        if(!flip)
        {
            offset = align(offset, Sizeof.POINTER);
            cuParamSetv(function, offset, dIn1, Sizeof.POINTER);
            offset += Sizeof.POINTER;
        
            offset = align(offset, Sizeof.POINTER);
            cuParamSetv(function, offset, dIn2, Sizeof.POINTER);
            offset += Sizeof.POINTER;
        
            offset = align(offset, Sizeof.POINTER);
            cuParamSetv(function, offset, pColors, Sizeof.POINTER);
            offset += Sizeof.POINTER;
        
            offset = align(offset, Sizeof.INT);
            cuParamSetv(function, offset, pGrainMeshWidth, Sizeof.INT);
            offset += Sizeof.INT;
        
            cuParamSetSize(function, offset);
            cuLaunchGrid(function, 100, 100);
            cuCtxSynchronize();
        }
        else
        {
            offset = align(offset, Sizeof.POINTER);
            cuParamSetv(function, offset, dIn2, Sizeof.POINTER);
            offset += Sizeof.POINTER;
        
            offset = align(offset, Sizeof.POINTER);
            cuParamSetv(function, offset, dIn1, Sizeof.POINTER);
            offset += Sizeof.POINTER;
        
            offset = align(offset, Sizeof.POINTER);
            cuParamSetv(function, offset, pColors, Sizeof.POINTER);
            offset += Sizeof.POINTER;
        
            offset = align(offset, Sizeof.INT);
            cuParamSetv(function, offset, pGrainMeshWidth, Sizeof.INT);
            offset += Sizeof.INT;
        
            cuParamSetSize(function, offset);
            cuLaunchGrid(function, 100, 100);
            cuCtxSynchronize();
        }

        cuGraphicsUnmapResources(1, new CUgraphicsResource[]{vboGraphicsResource1}, null);
        cuGraphicsUnmapResources(1, new CUgraphicsResource[]{vboGraphicsResource2}, null);
    }

The problem is not related to the code in the kernel. I can run an "empty" kernel and there are still only 30 fps (grainMeshWidth = 1000). What am I doing wrong?

Hello

From a short glance this is again hard to say. Since profiling such an application may be difficult, the most straightforward approach for finding potential bottlenecks would IMHO be to omit specific parts where possible, and see whether there is a speedup. For example, what happens if you do not really render the data, but only execute the 'cuda()' method? (Possibly even with an empty kernel?) Or, the other way around, omit parts of the cuda() method and always render the same data, to see whether it's directly related to CUDA. I could imagine that mapping/unmapping the buffers might take some time, but this is just a guess…
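
For example, the individual phases could be timed separately, roughly like this (just a sketch - 'renderScene' stands for whatever the display method actually does, and since GL calls are asynchronous, the numbers can only be rough hints):


long t0 = System.nanoTime();
cuda();                    // or skip this call entirely for comparison
long t1 = System.nanoTime();
renderScene(gl);           // or always render the same, constant data
long t2 = System.nanoTime();
System.out.printf("cuda: %.3f ms, rendering: %.3f ms%n",
    (t1 - t0) / 1e6, (t2 - t1) / 1e6);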

bye
Marco

OK, I performed a few tests, also using an "empty" kernel.

Obviously the empty kernel looks like this:


extern "C"

__global__ void graingrowth(float4* vbo1, float4* vbo2, float4* colors, int width) 
{ 
	
}

Results:

  • display without rendering and cuda: ~3600 fps
  • display + rendering: ~750 fps
  • display + cuda (empty kernel): ~2300 fps
  • display + rendering + cuda (empty kernel): ~50 fps
  • display + cuda (original kernel): ~330 fps
  • display + rendering + cuda (original kernel): ~43 fps

What do you think about this?

Well, from that there's not really a clear result, unfortunately - at least it's not possible to point out a single bottleneck. The empty kernel does not seem to make a great difference, but the rendering and the original kernel of course take some time - however, it's not completely clear why the combination of CUDA + rendering is so much slower than any other task (even when they are combined "arithmetically"). In the end, nobody knows what calls like cuGraphicsMapResources really do internally - it might be possible (not likely, but possible) that the data is only really mapped when it is accessed for rendering.
I'm not sure what the goal of this program is. It might be worthwhile to check whether it's possible to not render every frame, but maybe only every 5th frame or so - but maybe this is not applicable here. If you can send me the program in a testable form, I can try to play around and see whether I find possible reasons (although I'm not sure when I'll find the time for that; there are so many other tasks, e.g. the update for CUDA 4.0 should be published this week but will possibly be delayed until next week, JCublas, JNpp, and many other things waiting to be done… maybe I should hire some staff… -_- )
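
The "every 5th frame" idea would roughly look like this (a sketch - the counter and the renderScene method are placeholders, and whether skipping the draw works depends on how the canvas is driven):


private int frameCounter = 0;

public void display(GLAutoDrawable drawable)
{
    cuda();                            // run the simulation every frame
    if (frameCounter++ % 5 == 0)
    {
        renderScene(drawable.getGL()); // but draw only every 5th frame
    }
}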

Hi Marco again!

I translated another simple example which uses JCuda/JOGL interoperability (a particle engine) to JCuda. The example discussed above is a little bit too muddled, so this one should be better for explaining some performance issues.

Points move chaotically within a cube. The arrays which store the positions and velocities of the particles are allocated on the device and used in the kernel. Of course, the VBO stores the current positions of the particles (points).

When I create the VBO there are two possibilities:

  • the VBO only reserves memory space of the given data size (in this case the positions of the particles are
    written to the VBO for the first time within the kernel and later modified based on the velocities):

gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, null, GL3.GL_DYNAMIC_DRAW);

  • the VBO reserves memory space of the given data size and gets some initial data (in this case the initial positions of the particles; of course these values are later modified in the kernel based on the velocities):

gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, FloatBuffer.wrap(particlesPositions), GL3.GL_DYNAMIC_DRAW);

For example, when n_particles = 2000000, in the first case I get ~180 fps, in the second case ~40 fps. What can cause this difference in efficiency? I can't understand it…

Kernel:


extern "C"

__global__ void particles_kernel(float4 *vbo, float4 *pos, float4 *vel, int n)
{

	const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
	
	float4 tPos = pos[tid];
	float4 tVel = vel[tid];

	tPos.x += tVel.x;
	tPos.y += tVel.y;
	tPos.z += tVel.z;

	if(tPos.x < -2.0f)
	{
		tVel.x = -tVel.x;
	}
	else if(tPos.x > 2.0f)
	{
		tVel.x = -tVel.x;
	}


	if(tPos.y < -2.0f)
	{
		tVel.y = -tVel.y;
	}
	else if(tPos.y > 2.0f)
	{
		tVel.y = -tVel.y;
	}


	if(tPos.z < -2.0f)
	{
		tVel.z = -tVel.z;
	}
	else if(tPos.z > 2.0f)
	{
		tVel.z = -tVel.z;
	}

	pos[tid] = tPos;
	vel[tid] = tVel;

	vbo[tid] = make_float4(tPos.x, tPos.y, tPos.z, tPos.w);
}

Hello

Admittedly, I don't see a significant difference between the two lines. To my understanding, the main difference between the lines you posted should be that in the first case the actual VBO contents are undefined, whereas in the second case the buffer is filled with the 'positions'.

Is this really the only difference that causes the huge performance drop? Are you calling these lines (or additional operations in the second case) somewhere in the rendering loop?

In doubt, I could try to reproduce the example, although from what you have said so far it does not seem to be related to JCuda specifically. But it would be interesting to know what might be wrong there.

bye
Marco

This is very strange…

I noticed the same behaviour when I modified the original JCudaDriverGLSample.

Please check this on your PC. You need to add these lines:


    private float[] vertices = new float[meshWidth * meshHeight * 4]; 


    private void initVertices()
    {
        for (int x = 0; x < meshWidth; x++)
        {
            for (int y = 0; y < meshHeight; y++)
            {
               
                float u = x / (float) meshWidth;
                float w = 0.0f;
                float v = y / (float) meshHeight;
                
                u = u * 2.0f - 1.0f;
                v = v * 2.0f - 1.0f;
                
                int index = 4 * (y * meshWidth + x);
                vertices[index + 0] = u;
                vertices[index + 1] = w;
                vertices[index + 2] = v;
                vertices[index + 3] = 1.0f;
            }
        }
    }

In init add initVertices() before initVBO(gl). Then in initVBO you can write:


    gl.glBufferData(GL.GL_ARRAY_BUFFER, size, FloatBuffer.wrap(vertices),
            GL.GL_DYNAMIC_DRAW);

instead of:


    gl.glBufferData(GL.GL_ARRAY_BUFFER, size, (Buffer) null,
            GL.GL_DYNAMIC_DRAW);

Hello

Indeed, I could reproduce this behavior. Changing this single line reduces the FPS from 530 to 215 on my PC.

I tried to compare this to the "simpleGL" example from the NVIDIA SDK examples. They are using a different timing scheme. In the JCuda example, I'm counting the number of times that the 'display' function is called per second. Well, the Frames Per Second, what else ;). In the NVIDIA example, they are measuring the time that is required for one execution of the 'display' function, and try to compute the FPS from this value. I'm not sure, maybe I'm wrong, but it seems to me that this timing and the FPS computation do not make any sense at all: The ~16000 Frames Per Second it reports may be considered a "theoretical" value, based on the computation time per frame, but I also printed the value returned by
cutGetAverageTimerValue(timer)
every 100 frames: For a grid of 256x256, it is about ~0.06 milliseconds. For a grid of 4096x4096 points (i.e. 256 times as many!) it is… ~0.07 milliseconds. Hard to believe.
However, coming to the point: When I pass a 'float' array as the initial data to the 'glBufferData' function in the 'createVBO' method of the original example, the reported time also increases - from 0.07 to 0.1 milliseconds. Regardless of whether this value is correct or not: It becomes significantly slower for larger grid sizes when initial data is provided to the glBufferData function.

But this should be taken with a grain of salt: In general, such timings and measurements are difficult. And maybe I already mentioned it: Especially for Java it's nearly impossible to find out where exactly the time is consumed, on the long path which starts at Java, forks into JCuda and JOGL, and finally reaches the "Black Box" GPU which eventually merges the commands from both libraries in order to paint millions of moving red dots on the screen…

On the one hand, it's good to know that this is not an inherent problem of JCuda (or JOGL) itself, although the drop in performance is annoyingly large in JCuda -_- (at least for "small" grid sizes like 512x512 - in fact, for 4096x4096 it showed 15 FPS regardless of whether the Buffer was used or not!). But the original sample from NVIDIA also shows different timing results depending on the way glBufferData is used.
On the other hand, not knowing the real reason leaves an uncomfortable feeling: Someone might use glBufferData with a Buffer and wonder why it is "so horribly slow" - or at least, he might not know that avoiding it might bring much higher performance.
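
One possible workaround (untested, just a guess) might be to allocate the buffer without initial data and upload the contents separately - whether this actually avoids the slowdown would have to be measured:


// Allocate only the storage...
gl.glBufferData(GL.GL_ARRAY_BUFFER, size, (Buffer) null, GL.GL_DYNAMIC_DRAW);
// ...and upload the initial data afterwards
gl.glBufferSubData(GL.GL_ARRAY_BUFFER, 0, size, FloatBuffer.wrap(vertices));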

Since I don’t see a specific or deterministic way of investigating this further, I’ll probably have to accept it for now, unless someone has further ideas about that. (Maybe someone in the NVIDIA forum could help here as well?)

However, thank you for pointing this out. When someone complains about poor GL performance, I can now ask whether he's initializing the data in the call to glBufferData, and possibly help him achieve a performance gain :wink:

bye
Marco

Hi!

I would like to test one more thing.
The runJava function from JCudaDriverGLSample doesn't take advantage of a multicore CPU. As far as I know we can't be sure how the JVM will manage the threads. I don't have much experience with concurrency in Java and I'm not sure how to modify the runJava function. I have 2 cores available (a C2D CPU).

Here is an idea, but it doesn't work properly…

Additional class:


    public class VerticesRunnable implements Runnable {
    
    private int meshWidthStart;
    private int meshWidthEnd;
    private int meshHeightStart;
    private int meshHeightEnd;
    private FloatBuffer vertices;
    private float animationState;
    
    public VerticesRunnable(int meshWidthStart, int meshWidthEnd, int meshHeightStart, int meshHeightEnd, FloatBuffer vertices, float animationState)
    {
        this.meshWidthStart = meshWidthStart;
        this.meshWidthEnd = meshWidthEnd;
        this.meshHeightStart = meshHeightStart;
        this.meshHeightEnd = meshHeightEnd;
        this.vertices = vertices;
        this.animationState = animationState;
    }

    @Override
    public void run() 
    {
        for(int x = meshWidthStart; x < meshWidthEnd; x++)
        {
            for(int y = meshHeightStart; y < meshHeightEnd; y++)
            {
                 
                float u = x / (float) meshWidthStart;
                float v = y / (float) meshHeightStart;

                u = u * 2.0f - 1.0f;
                v = v * 2.0f - 1.0f;

                float freq = 4.0f;
                float w = (float) Math.sin(u * freq + animationState) *
                          (float) Math.cos(v * freq + animationState) * 0.5f;

                int index = 4 * (y * meshWidthStart + x);
                vertices.put(index + 0, u);
                vertices.put(index + 1, w);
                vertices.put(index + 2, v);
                vertices.put(index + 3, 1);
            }
        }
        
    }
}

runJava:


    private void runJava(GL gl) throws InterruptedException
    {
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, vertexBufferObject);
        ByteBuffer byteBuffer = 
            gl.glMapBuffer(GL.GL_ARRAY_BUFFER, GL2.GL_READ_WRITE);
        if (byteBuffer == null)
        {
            throw new RuntimeException("Unable to map buffer");
        }
        FloatBuffer vertices = 
            byteBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();
        
        Runnable r1 = new VerticesRunnable(0, meshWidth / 2, 0, meshHeight / 2, vertices, animationState);
        Runnable r2 = new VerticesRunnable(meshWidth / 2, meshWidth, meshHeight / 2, meshHeight, vertices, animationState);
        
        Thread t1 = new Thread(r1);
        Thread t2 = new Thread(r2);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        gl.glUnmapBuffer(GL.GL_ARRAY_BUFFER);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
    }



Hi

Indeed, that’s very interesting.

I know that there are many benchmarks and comparisons of GPU and CPU code, and in many cases the CPU code is single-threaded - although most modern CPUs have more than one core. This makes these comparisons somewhat biased and not very meaningful. (Maybe I should emphasize that this sample was not intended as a real, objective "benchmark" - just as an example for JCuda/JOGL interop - but of course, it may serve as a benchmark at least to some extent…)

Concerning the original code you posted: The u- and v-values are the position of the point inside the unit square. So these still have to be computed as

float u = x / (float) meshWidth;
float v = y / (float) meshHeight;

regardless of the fact that each thread will not process the full ‘meshWidth’ and ‘meshHeight’. These values are just used to compute the location of the point.

Similarly, the index has to be computed as

int index = 4 * (y * meshWidth + x);

because only then will the entries end up at the right position in the 'vertices' buffer (which still contains the full mesh, i.e. meshWidth*meshHeight points).

Apart from that, the creation of the runnables was not entirely right, because they only covered the upper-left and lower-right quarters of the mesh:


   0,0      w/2        w
    -------------------
    |        |        |
    |        |        |
    |  OK    | missing|
    |        |        |
h/2 -------------------
    |        |        |
    | missing|        |
    |        |  OK    |
    |        |        |
  h -------------------
 

I have slightly modified and extended the test:

    private void runJava(GL gl)
    {
        //runJavaSingle(gl);
        runJavaMulti(gl);
    }
    
    public static class VerticesRunnable implements Runnable 
    {
        private int meshWidthStart;
        private int meshWidthEnd;
        private FloatBuffer vertices;
        private float animationState;
        
        public VerticesRunnable(
            int meshWidthStart, int meshWidthEnd, 
            FloatBuffer vertices, float animationState)
        {
            this.meshWidthStart = meshWidthStart;
            this.meshWidthEnd = meshWidthEnd;
            this.vertices = vertices;
            this.animationState = animationState;
        }

        @Override
        public void run() 
        {
            for(int x = meshWidthStart; x < meshWidthEnd; x++)
            {
                for(int y = 0; y < meshHeight; y++)
                {
                     
                    float u = x / (float) meshWidth;
                    float v = y / (float) meshHeight;

                    u = u * 2.0f - 1.0f;
                    v = v * 2.0f - 1.0f;

                    float freq = 4.0f;
                    float w = (float) Math.sin(u * freq + animationState) *
                              (float) Math.cos(v * freq + animationState) * 0.5f;

                    int index = 4 * (y * meshWidth + x);
                    vertices.put(index + 0, u);
                    vertices.put(index + 1, w);
                    vertices.put(index + 2, v);
                    vertices.put(index + 3, 1);
                }
            }
            
        }
    }    
    
    private boolean printInfos = true;
    
    private void runJavaMulti(GL gl) 
    {
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, vertexBufferObject);
        ByteBuffer byteBuffer = 
            gl.glMapBuffer(GL.GL_ARRAY_BUFFER, GL2.GL_READ_WRITE);
        if (byteBuffer == null)
        {
            throw new RuntimeException("Unable to map buffer");
        }
        FloatBuffer vertices = 
            byteBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();

        int processors = Runtime.getRuntime().availableProcessors();
        if (printInfos)
        {
            System.out.println("Number of processors: "+processors);
        }
        int chunks = processors;
        
        Thread threads[] = new Thread[chunks];
        int stepSize = meshWidth / chunks; 
        for (int i=0; i<chunks; i++)
        {
            int start = i * stepSize;
            int end = start + stepSize;
            end = Math.min(meshWidth, end);
            
            if (printInfos)
            {
                System.out.println("Chunk "+i+" with elements "+start+" to "+end+" of "+meshWidth);
            }
            
            Runnable runnable = new VerticesRunnable(start, end, vertices, animationState);
            threads[i] = new Thread(runnable);
            threads[i].start();
        }
        printInfos = false;
        
        try
        {
            for (int i=0; i<chunks; i++)
            {
                threads[i].join();
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }

        gl.glUnmapBuffer(GL.GL_ARRAY_BUFFER);
        gl.glBindBuffer(GL.GL_ARRAY_BUFFER, 0);
    }

(all just one snippet from the main class - the VerticesRunnable is a static inner class here)

This way, it determines the number of processors, and divides the mesh into an appropriate number of “chunks” (which basically are small slices of the mesh, with the full meshHeight, but a width of only meshWidth/numProcessors). These are then processed by individual threads.

For a mesh of 512x512, this increases the FPS on my Quad Core PC from ~18 in the single threaded version to ~45 in the version with 4 threads, and the system monitor shows that all cores are quite busy. More tests with different mesh sizes, and different numbers of “chunks” (e.g. “numProcessors/2” or “numProcessors*2”) might also be interesting…
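
By the way, for a real application one would probably not create new threads in every frame, but reuse them - for example with an ExecutorService (a sketch, requires java.util.* and java.util.concurrent.*):


// Reuse a fixed thread pool instead of creating new threads in every frame
private final ExecutorService pool = Executors.newFixedThreadPool(
    Runtime.getRuntime().availableProcessors());

private void runChunks(List<Runnable> chunkRunnables) throws InterruptedException
{
    List<Callable<Object>> tasks = new ArrayList<Callable<Object>>();
    for (Runnable r : chunkRunnables)
    {
        tasks.add(Executors.callable(r));
    }
    pool.invokeAll(tasks); // blocks until all chunks are done
}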

Maybe I’ll clean it up a little bit more, and update the example on the website with this multithreaded version.

bye