Using JCudaDriver with JCufft

Hello,

I have finished implementing a test case that utilizes JCufft.

Here is the basic procedure:

Execute the forward cuFFT of two images → copy the data to the host → compute the element-wise multiplication of the two forward FFTs → copy back to the device → execute the inverse FFT → copy the data back to the host…

This procedure works and gets the correct answer.

Now I have a kernel that computes the element-wise multiplication. I converted all of the device pointers from the Pointer type to CUdeviceptr so that they can be passed to my kernel, and I use those same pointers in the cuFFT calls.

The answer I am getting now is 0.

Any help with getting my own kernel to work together with JCufft would be greatly appreciated. Also, thank you for providing JCuda; it is an excellent tool for accessing CUDA functionality from Java. :slight_smile:

Hi

It’s hard to guess what might be wrong there. Did you run, for example, an isolated test that verifies that the kernel for the element-wise multiplication works properly? Or do you have some example source code where the error can be reproduced? Did you use
JCudaDriver.setExceptionsEnabled(true);
to be informed about errors during the setup?
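
Exceptions can be enabled for JCufft as well, so that failing cuFFT calls are reported, too. A minimal sketch:

JCudaDriver.setExceptionsEnabled(true);
JCufft.setExceptionsEnabled(true);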

There is a “JCudaRuntimeDriverMixSample.java” on the Samples page, which shows how to mix the Runtime and Driver API (although I just noticed that this sample should be updated: it still uses a CUBIN file for the kernel…)

Otherwise, I can try to set up another test that combines JCufft with a custom kernel.
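
In general, the key point is that a CUdeviceptr is also a Pointer (CUdeviceptr extends Pointer), so the same allocation can be handed to a JCufft routine and to a driver API kernel launch. A rough sketch of this pattern (the plan, the function and the sizes are only placeholders here):

int width = 1024;
int height = 1024;
int fftSize = (width / 2 + 1) * height;

CUdeviceptr dataIn = new CUdeviceptr();
CUdeviceptr dataOut = new CUdeviceptr();
JCudaDriver.cuMemAlloc(dataIn, width * height * Sizeof.DOUBLE);
JCudaDriver.cuMemAlloc(dataOut, fftSize * Sizeof.DOUBLE * 2);

// The CUdeviceptrs can be passed directly to JCufft...
JCufft.cufftExecD2Z(plan, dataIn, dataOut);

// ...and used as kernel parameters for an own kernel via the driver API
Pointer kernelParameters = Pointer.to(
    Pointer.to(dataOut),
    Pointer.to(new int[]{ fftSize }));
JCudaDriver.cuLaunchKernel(function,
    (fftSize + 255) / 256, 1, 1,  // grid size
    256, 1, 1,                    // block size
    0, null, kernelParameters, null);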

BTW: Recently I added a “JCuda Vector utilities” library at http://jcuda.org/utilities/utilities.html . The intention of this library is to offer vector operations like element-wise multiplication. So with this library, you might not need to write your own kernel, but could possibly just do a call like
VecFloat.mul(size, result, x, y);
But of course, for non-trivial operations, custom kernels will still be necessary.

bye
Marco

I will post some code in a bit, and I will try enabling exceptions to see what happens. In the meantime, how might I register for these forums?

CUDA code, compiled with “nvcc -cubin -m64 -arch sm_21 file.cu -o file.cubin”:

One thing to note: I first compiled this code on Linux, copied the resulting CUBIN file over to Windows, and loaded it from there. Do I need to generate the CUBIN file on Windows, or should nvcc output be compatible between Windows and Linux?

extern "C"
__global__ void
elt_prod_conj(cufftDoubleComplex *fc, cufftDoubleComplex * c1, 
              cufftDoubleComplex * c2, int size)
{
  __shared__ cufftDoubleComplex sfc[THREADS_PER_BLOCK];
  __shared__ cufftDoubleComplex sc1[THREADS_PER_BLOCK];
  __shared__ cufftDoubleComplex sc2[THREADS_PER_BLOCK];

  int idx = threadIdx.x + blockIdx.x * THREADS_PER_BLOCK;

  if (idx >= size)
    return;

  sc1[threadIdx.x] = c1[idx];
  sc2[threadIdx.x] = c2[idx];

  __syncthreads();

  sfc[threadIdx.x] = cuCmul(sc1[threadIdx.x], cuConj(sc2[threadIdx.x]));

  double mag = cuCabs(sfc[threadIdx.x]);

  fc[idx] = make_cuDoubleComplex(cuCreal(sfc[threadIdx.x]) / mag,
				 cuCimag(sfc[threadIdx.x]) / mag);
}

Here is the Java code for computing the FFTs and launching the above kernel:

First, the initialization:

	public static CUfunction elt_prod_function;
	public static int fftSize;
	public static cufftHandle plan_fwd;
	public static cufftHandle plan_bwd;	

...


	public static void initPlans(int width, int height)
	{
		checkError(JCudaDriver.cuInit(0));
		CUdevice device = new CUdevice();
		checkError(JCudaDriver.cuDeviceGet(device, 0));
		CUcontext context = new CUcontext();
		checkError(JCudaDriver.cuCtxCreate(context, 0, device));

		CUmodule module = new CUmodule();
		checkError(JCudaDriver.cuModuleLoad(module, "util-cuda-bin.cubin"));

		elt_prod_function = new CUfunction();
		checkError(JCudaDriver.cuModuleGetFunction(elt_prod_function, module, "elt_prod_conj"));

		// Size of the D2Z output: (width/2+1)*height complex values
		fftSize = (width/2+1)*height;
		plan_fwd = new cufftHandle();
		plan_bwd = new cufftHandle();
		JCufft.cufftPlan2d(plan_fwd, height, width, cufftType.CUFFT_D2Z);
		JCufft.cufftPlan2d(plan_bwd, height, width, cufftType.CUFFT_Z2D);
	}

// ... Compute the FFT
	public void computeFFT()
	{
		if (!isTileRead())
			readTile();

		if (hasFFT())
			return;			
		
		double tempJ[] = new double[super.getWidth()*super.getHeight()];
		
		for (int i = 0; i < super.getPixels().length; i++)
		{
			tempJ[i] = super.getPixels()[i];
		}
		
		fft = new CUdeviceptr();
		CUdeviceptr ptr = new CUdeviceptr();
		
		JCUDAImageTile.checkError(JCudaDriver.cuMemAlloc(ptr,  super.getWidth()*super.getHeight()*Sizeof.DOUBLE));						
		JCUDAImageTile.checkError(JCudaDriver.cuMemAlloc(fft,  fftSize*Sizeof.DOUBLE*2));
		
		JCUDAImageTile.checkError(JCudaDriver.cuMemcpyHtoD(ptr, Pointer.to(tempJ), super.getWidth()*super.getHeight()*Sizeof.DOUBLE));
		
		JCUDAImageTile.checkError(JCufft.cufftExecD2Z(plan_fwd, ptr, fft));
				
		JCudaDriver.cuMemFree(ptr);
		
		JCudaDriver.cuCtxSynchronize();

	}
... checkError:
public static void checkError(int val)
	{
		if (val != 0)
		{
			Log.msg(Log.LOG_LEVEL_MANDATORY, "Error: " + val);
		}
	}
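
(As a side note: this helper only logs the raw error code. A stricter variant could translate the code into a readable name and fail fast, e.g. via CUresult.stringFor for driver API results; cuFFT calls return cufftResult codes, for which cufftResult.stringFor would be the counterpart. A sketch:)

	public static void checkErrorStrict(int val)
	{
		if (val != 0)
		{
			// Translate the numeric driver API code into a readable name and abort
			throw new RuntimeException("Error: " + CUresult.stringFor(val));
		}
	}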

... and finally, using the kernel on the two FFTs:
       
public void func(ImageTile t1, ImageTile t2)
{
		if (!t1.hasFFT())
			t1.computeFFT();

		if (!t2.hasFFT())
			t2.computeFFT();

		int numThreads = 256;
		int numBlocks = (int)Math.ceil((double)JCUDAImageTile.fftSize / (double)numThreads);
		System.out.println(JCUDAImageTile.fftSize + " blocks : " + numBlocks);
		CUdeviceptr ptr = new CUdeviceptr();
		CUdeviceptr ptrOut = new CUdeviceptr(); 				
		
		JCUDAImageTile.checkError(JCudaDriver.cuMemAlloc(ptr, JCUDAImageTile.fftSize*Sizeof.DOUBLE*2));
		JCUDAImageTile.checkError(JCudaDriver.cuMemAlloc(ptrOut, t1.getWidth()*t1.getHeight()*Sizeof.DOUBLE));
		
		jcuda.Pointer kernelParams = jcuda.Pointer.to(
				jcuda.Pointer.to(t1.getFFT()),
				jcuda.Pointer.to(t2.getFFT()),
				jcuda.Pointer.to(ptr),
				jcuda.Pointer.to(new int[]{JCUDAImageTile.fftSize}));
				
		JCUDAImageTile.checkError(JCudaDriver.cuLaunchKernel(JCUDAImageTile.elt_prod_function,
				numBlocks, 1, 1, 
				numThreads, 1, 1, 
				16384, null, 
				kernelParams, null));
		
		
						
		JCUDAImageTile.checkError(JCufft.cufftExecZ2D(JCUDAImageTile.plan_bwd, ptr, ptrOut));
		
		JCUDAImageTile.checkError(JCudaDriver.cuCtxSynchronize());

		
		double [] valsOut = new double[t1.getWidth()*t1.getHeight()];
		
		JCUDAImageTile.checkError(JCudaDriver.cuMemcpyDtoH(jcuda.Pointer.to(valsOut), ptrOut, t1.getWidth()*t1.getHeight()*Sizeof.DOUBLE));
		
		JCUDAImageTile.checkError(JCudaDriver.cuMemFree(ptr));		
		JCUDAImageTile.checkError(JCudaDriver.cuMemFree(ptrOut));
}

Just a note: I implemented the algorithm using FFTW and some native C bindings, and there I am able to get the correct answer. Also, I am now taking some CUDA code I had written in C++ and porting it to Java.

I ran my tests with exceptions turned on and the logging level set to debug, and no messages were printed to the console.

Another quick update: I used nvcc on Windows to create the CUBIN file, and I am now getting non-zero answers (hooray). They are still not correct, but it's at least a start.

Let me know if anything sticks out to you in the code, and thank you for your response.

Found the registration button (:wut:)

Fixed my bug: I passed the parameters to the CUDA function in the wrong order :wut: :wut: :twisted:
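
For reference: the kernel signature is elt_prod_conj(fc, c1, c2, size), so the output pointer has to come first in the parameter list; presumably the fix looks something like this:

		jcuda.Pointer kernelParams = jcuda.Pointer.to(
				jcuda.Pointer.to(ptr),
				jcuda.Pointer.to(t1.getFFT()),
				jcuda.Pointer.to(t2.getFFT()),
				jcuda.Pointer.to(new int[]{JCUDAImageTile.fftSize}));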

edit:

Now that I have this problem fixed, I am moving on to my next one: I am doing a parallel reduction to find the index of the maximum element. I have this implemented, but my kernel requires a template parameter. Is it possible to pass one?

Here is some example code:

template <unsigned int blockSize>
__global__ void
reduce_max_main(cufftDoubleComplex *g_idata, double *g_odata, 
		int * max_idx, unsigned int n)
{
...
}

Good to hear that the other problem is resolved. Unfortunately, it's not really possible to programmatically verify the kernel parameters (neither in JCuda nor in CUDA itself, when using the Driver API).

Concerning the templates in your reduction kernel: Templates are difficult. In general, they are supported, but the attempt to use them has a chain of implications that may be considered undesirable.

First of all, the .CU file is not part of a larger, plain C++ program. So the templates have to be instantiated explicitly, and cannot be deduced from a calling context.

But one of the main problems is Name Mangling in C++. When you create a dummy kernel like

// Dummy kernel definition
template <unsigned int blockSize>
__global__ void dummyKernel(int x)
{
    x = blockSize;
}

// Template instantiation
template __global__ void dummyKernel<1234>(int x);
template __global__ void dummyKernel<5678>(int x);

and compile it into a PTX file, then the name ‘dummyKernel’ will be mangled. Of course, it has to be, because otherwise there would be TWO functions called ‘dummyKernel’ - and they somehow have to be identified uniquely when fetching them with cuModuleGetFunction. So in the PTX file, there will be TWO entries, with names like _Z11dummyKernelILj1234EEvi and _Z11dummyKernelILj5678EEvi. Although it is possible to obtain the proper kernel with
cuModuleGetFunction(f, m, "_Z11dummyKernelILj1234EEvi");
I’m not sure whether you really want to do this…

(BTW: This name mangling is usually avoided by the extern "C" declaration - but templates are not C code, so this is not applicable here)

There are several more or less practical approaches to deal with this, ranging from simply accepting it (and using the mangled names) to #define’s or automatic code generation… but I think, as long as the templating is not absolutely crucial for the program (for example, as long as a template parameter is not needed for a type, in order to let kernels operate on arbitrary primitive types), the best strategy may be to avoid templates.

In particular, on the Samples page there is a “JCudaReduction.java” + “reduction.cu” sample that I once derived from the NVIDIA ‘Reduction’ sample (which also heavily uses templates), and which could be applicable here.
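
Independently of the sample: for a max-index reduction, one common pattern is to let the kernel write one candidate value and one candidate index per block, and to finish the (then very small) reduction on the host. A minimal host-side sketch, assuming the per-block results have already been copied back into blockMax and blockMaxIdx arrays (hypothetical names):

		// Finish the reduction on the host: blockMax[b] and blockMaxIdx[b] hold
		// the maximum value and its global index found by block b
		double max = Double.NEGATIVE_INFINITY;
		int maxIndex = -1;
		for (int b = 0; b < numBlocks; b++)
		{
			if (blockMax[b] > max)
			{
				max = blockMax[b];
				maxIndex = blockMaxIdx[b];
			}
		}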

Hope that helps

bye
Marco

“Do I need to generate the CUBIN file on Windows, or should nvcc output be compatible between Windows and Linux?”

Admittedly, I’m not sure about this. But in general, I’d recommend using PTX files anyhow, because they are less specific to the target system. Some of the samples are still using CUBINs; I’ll have to go through all of these and update them accordingly (it’s easy to change in general, but I have not yet taken the time :o )
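
Loading a PTX file from JCuda looks exactly like loading a CUBIN; a minimal sketch (the file name is just an example, and the PTX would be created with something like “nvcc -ptx -m64 -arch=sm_21 file.cu -o file.ptx”):

		// The driver compiles the PTX to device code when the module is loaded
		CUmodule module = new CUmodule();
		JCudaDriver.cuModuleLoad(module, "util-cuda-bin.ptx");

		CUfunction function = new CUfunction();
		JCudaDriver.cuModuleGetFunction(function, module, "elt_prod_conj");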