JCUDA Test program help, code included

First, THANK YOU for your work! :slight_smile:

OK, I’m putting together a simple test program that I’ll eventually use to teach others CUDA/Java interaction. I’m having problems getting the expected results.

My work is based on this C program that calculates the sum of two vectors:


#include <stdio.h>
#include <stdlib.h>    // atoi, srand, rand, malloc, exit
#include <math.h>      // fabs
#include <sys/time.h>  // gettimeofday


// The number of threads per block in the kernel
// (if we define it here, then we can use its value in the kernel,
//  for example to statically declare an array in shared memory)
const int threads_per_block = 256;


// Forward function declarations
float *GPU_add_vectors(float *A, float *B, int N);
float *CPU_add_vectors(float *A, float *B, int N);
float *get_random_vector(int N);
long long start_timer();
long long stop_timer(long long start_time, char *name);
void die(char *message);


int main(int argc, char **argv) {
	// Seed the random generator (use a constant here for repeatable results)
	srand(10);

	// Determine the vector length
	int N = 100000;  // default value
	if (argc > 1) N = atoi(argv[1]); // user-specified value

	// Generate two random vectors
	long long vector_start_time = start_timer();
	float *A = get_random_vector(N);
	float *B = get_random_vector(N);
	stop_timer(vector_start_time, "Vector generation");
	
	// Compute their sum on the GPU
	long long GPU_start_time = start_timer();
	float *C_GPU = GPU_add_vectors(A, B, N);
	long long GPU_time = stop_timer(GPU_start_time, "	            Total");
	
	// Compute their sum on the CPU
	long long CPU_start_time = start_timer();
	float *C_CPU = CPU_add_vectors(A, B, N);
	long long CPU_time = stop_timer(CPU_start_time, "\nCPU");
	
	// Compute the speedup or slowdown
	if (GPU_time > CPU_time) printf("\nCPU outperformed GPU by %.2fx\n", (float) GPU_time / (float) CPU_time);
	else                     printf("\nGPU outperformed CPU by %.2fx\n", (float) CPU_time / (float) GPU_time);
	
	// Check the correctness of the GPU results
	int num_wrong = 0;
	for (int i = 0; i < N; i++) {
		if (fabs(C_CPU[i] - C_GPU[i]) > 0.000001) num_wrong++;
	}
	
	// Report the correctness results
	if (num_wrong) printf("\n%d / %d values incorrect\n", num_wrong, N);
	else           printf("\nAll values correct\n");

}


// A GPU kernel that computes the vector sum A + B
// (each thread computes a single value of the result)
__global__ void add_vectors_kernel(float *A, float *B, float *C, int N) {
	// Determine which element this thread is computing
	int block_id = blockIdx.x + gridDim.x * blockIdx.y;
	int thread_id = blockDim.x * block_id + threadIdx.x;
	
	// Compute a single element of the result vector (if the element is valid)
	if (thread_id < N) C[thread_id] = A[thread_id] + B[thread_id];
}


// Returns the vector sum A + B (computed on the GPU)
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
	
	long long memory_start_time = start_timer();

	// Allocate GPU memory for the inputs and the result
	int vector_size = N * sizeof(float);
	float *A_GPU, *B_GPU, *C_GPU;
	if (cudaMalloc((void **) &A_GPU, vector_size) != cudaSuccess) die("Error allocating GPU memory");
	if (cudaMalloc((void **) &B_GPU, vector_size) != cudaSuccess) die("Error allocating GPU memory");
	if (cudaMalloc((void **) &C_GPU, vector_size) != cudaSuccess) die("Error allocating GPU memory");
	
	// Transfer the input vectors to GPU memory
	cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
	cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);
	
	stop_timer(memory_start_time, "\nGPU:\t  Transfer to GPU");
	
	// Determine the number of thread blocks in the x- and y-dimension
	int num_blocks = (int) ((float) (N + threads_per_block - 1) / (float) threads_per_block);
	int max_blocks_per_dimension = 65535;
	int num_blocks_y = (int) ((float) (num_blocks + max_blocks_per_dimension - 1) / (float) max_blocks_per_dimension);
	int num_blocks_x = (int) ((float) (num_blocks + num_blocks_y - 1) / (float) num_blocks_y);
	dim3 grid_size(num_blocks_x, num_blocks_y, 1);
	
	// Execute the kernel to compute the vector sum on the GPU
	long long kernel_start_time = start_timer();
	add_vectors_kernel <<< grid_size , threads_per_block >>> (A_GPU, B_GPU, C_GPU, N);
	cudaThreadSynchronize();  // this is only needed for timing purposes
	stop_timer(kernel_start_time, "	 Kernel execution");
	
	// Check for kernel errors
	cudaError_t error = cudaGetLastError();
	if (error) {
		char message[256];
		sprintf(message, "CUDA error: %s", cudaGetErrorString(error));
		die(message);
	}
	
	// Allocate CPU memory for the result
	float *C_CPU = (float *) malloc(vector_size);
	if (C_CPU == NULL) die("Error allocating CPU memory");
	
	// Transfer the result from the GPU to the CPU
	memory_start_time = start_timer();
	cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);
	stop_timer(memory_start_time, "	Transfer from GPU");
	
	// Free the GPU memory
	cudaFree(A_GPU);
	cudaFree(B_GPU);
	cudaFree(C_GPU);
	
	return C_CPU;
}


// Returns the vector sum A + B
float *CPU_add_vectors(float *A, float *B, int N) {	
	// Allocate memory for the result
	float *C = (float *) malloc(N * sizeof(float));
	if (C == NULL) die("Error allocating CPU memory");

	// Compute the sum;
	for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
	
	// Return the result
	return C;
}


// Returns a randomized vector containing N elements
float *get_random_vector(int N) {
	if (N < 1) die("Number of elements must be greater than zero");
	
	// Allocate memory for the vector
	float *V = (float *) malloc(N * sizeof(float));
	if (V == NULL) die("Error allocating CPU memory");
	
	// Populate the vector with random numbers
	for (int i = 0; i < N; i++) V[i] = (float) rand() / (float) rand();
	
	// Return the randomized vector
	return V;
}


// Returns the current time in microseconds
long long start_timer() {
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1000000 + tv.tv_usec;
}


// Prints the time elapsed since the specified time
long long stop_timer(long long start_time, char *name) {
	struct timeval tv;
	gettimeofday(&tv, NULL);
	long long end_time = tv.tv_sec * 1000000 + tv.tv_usec;
	printf("%s: %.5f sec
", name, ((float) (end_time - start_time)) / (1000 * 1000));
	return end_time - start_time;
}


// Prints the specified message and quits
void die(char *message) {
	printf("%s
", message);
	exit(1);
}

My Java test class:


import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;

public class TestCUBINCall {
	
	// Run attributes, not really important but define vector element size. 
	// This is from another larger test program I was working on.
	private int maxParticleCount = 4;
	private int dimensions = 3;
	
	// Calculate vector needs
    private int vectorElementCount = maxParticleCount*dimensions;
    private int vectorMemSize = vectorElementCount*Sizeof.FLOAT;
    
    // All input and result arrays are the same size
	private float particleCoordinates[] = new float[vectorElementCount];
	private float particlePositionModifiers[] = new float[vectorElementCount];
	private float results[] = new float[vectorElementCount];
	
	private int threads_per_block = 256;

	/**
	 * Simple test method to calculate the sum of 2 vectors
	 * 
	 * [1,1,1,1,1,1,1,1,1...] + [2,2,2,2,2,2,2,2,2...] = [3,3,3,3,3,3,3,3,3...]
	 * 
	 */
    public void testCUBIN(){
    	
    	// Initialize the driver and create a context for the first device.
        JCudaDriver.cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        JCudaDriver.cuDeviceGet(dev, 0);
        JCudaDriver.cuCtxCreate(pctx, 0, dev);


        // Load the CUBIN file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "vector_add.sm_10.cubin");


        // Obtain a function pointer to the "add_vectors_kernel" function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add_vectors_kernel");

        
        // Prepare host test data
        for(int i = 0; i < this.vectorElementCount; i++){
        	this.particleCoordinates[i] = 1;
        	this.particlePositionModifiers[i] = 2;
        	this.results[i] = 0;
        }

        // Define pointers to input vectors
        CUdeviceptr positionDevicePointer = new CUdeviceptr();
        CUdeviceptr modificationDevicePointer = new CUdeviceptr();
        CUdeviceptr outputDevicePointer = new CUdeviceptr();
        
        // Allocate memory space on the GPU
        JCudaDriver.cuMemAlloc(positionDevicePointer, this.vectorMemSize);
        JCudaDriver.cuMemAlloc(modificationDevicePointer, this.vectorMemSize);
        JCudaDriver.cuMemAlloc(outputDevicePointer, this.vectorMemSize);

        // Copy data from host to device
        JCudaDriver.cuMemcpyHtoD(positionDevicePointer, Pointer.to(this.particleCoordinates), this.vectorMemSize);
        JCudaDriver.cuMemcpyHtoD(modificationDevicePointer, Pointer.to(this.particlePositionModifiers), this.vectorMemSize);


        // Set up the execution parameters.
        int num_blocks = (int) ((float) (this.vectorElementCount + threads_per_block - 1) / (float) threads_per_block);
    	int max_blocks_per_dimension = 65535;
    	int num_blocks_y = (int) ((float) (num_blocks + max_blocks_per_dimension - 1) / (float) max_blocks_per_dimension);
    	int num_blocks_x = (int) ((float) (num_blocks + num_blocks_y - 1) / (float) num_blocks_y);
        JCudaDriver.cuFuncSetBlockShape(function, num_blocks_x, num_blocks_y, 1);


        // Set up the parameters for the function call
        Pointer dInPositions = Pointer.to(positionDevicePointer);
        Pointer dInModifiers = Pointer.to(modificationDevicePointer);
        Pointer dOut = Pointer.to(outputDevicePointer);
        Pointer vectorSize = Pointer.to(new int[]{this.vectorElementCount});
        
        // Accumulate offset used for function call
        int offset = 0;
        
        // Position
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dInPositions, Sizeof.POINTER);
        offset += Sizeof.POINTER;
        
        // Modifier
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dInModifiers, Sizeof.POINTER);
        offset += Sizeof.POINTER;
        
        // Results
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dOut, Sizeof.POINTER);
        offset += Sizeof.POINTER;

        // Vector Size
        offset = JCudaDriver.align(offset, Sizeof.INT);
        JCudaDriver.cuParamSetv(function, offset, vectorSize, Sizeof.INT);
        offset += Sizeof.INT;
        
        JCudaDriver.cuParamSetSize(function, offset);

        // Call the function.
        JCudaDriver.cuLaunch(function);
        JCudaDriver.cuCtxSynchronize();


        // Copy the device output to the host.
        JCudaDriver.cuMemcpyDtoH(Pointer.to(this.results), outputDevicePointer, this.vectorMemSize);


        // Verify the result via simple output
        // All values should be 3, ie 3,3,3,3,3,3,3,3,3,3........
        for(int i = 0; i < this.vectorElementCount; i++){
        	System.out.print(this.results[i]+",");
        }

        // Clean up.
        JCudaDriver.cuMemFree(positionDevicePointer);
        JCudaDriver.cuMemFree(modificationDevicePointer);
        JCudaDriver.cuMemFree(outputDevicePointer);
    }
    
    /**
     * Kick off the test
     * @param args
     */
	public static void main(String[] args){
		TestCUBINCall test = new TestCUBINCall();
		test.testCUBIN();
	}
}


My output: :twisted:


0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,

Can you please help me pinpoint my bug? I’m very new to all this and could use a more experienced person’s input.

Thanks!

I must have an incorrect pointer; these are my results after a reboot:

8.171241E-14,-2.6685475E-5,-1.9992675,-1.4982358E23,-4.8428295E-32,-2.74810749E11,NaN,-1.4505928E26,NaN,-1.7335166,-1.4374386,-1.6866435,

Well, I enabled logging. I think I might have a bad function reference?

(I’ll be posting my progress so others can learn what not to do)


Exception in thread "main" jcuda.CudaException: CUDA_ERROR_NOT_FOUND
	at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:139)
	at jcuda.driver.JCudaDriver.cuModuleGetFunction(JCudaDriver.java:919)
	at com.ecm.cs652.finalProject.explosions.TestCUBINCall.testCUBIN(TestCUBINCall.java:59)
	at com.ecm.cs652.finalProject.explosions.TestCUBINCall.main(TestCUBINCall.java:152)
Executing cuInit
Executing cuDeviceGet for device 0
Executing cuCtxCreate
Executing cuModuleLoad
Executing cuModuleGetFunction


package com.ecm.cs652.finalProject.explosions;

import jcuda.LogLevel;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;

public class TestCUBINCall {
	
	// Run attributes, not really important but define vector element size. 
	// This is from another larger test program I was working on.
	private int maxParticleCount = 4;
	private int dimensions = 3;
	
	// Calculate vector needs
    private int vectorElementCount = maxParticleCount*dimensions;
    private int vectorMemSize = vectorElementCount*Sizeof.FLOAT;
    
    // All input and result arrays are the same size
	private float particleCoordinates[] = new float[vectorElementCount];
	private float particlePositionModifiers[] = new float[vectorElementCount];
	private float results[] = new float[vectorElementCount];
	
	private int threads_per_block = 256;

	/**
	 * Simple test method to calculate the sum of 2 vectors
	 * 
	 * [1,1,1,1,1,1,1,1,1...] + [2,2,2,2,2,2,2,2,2...] = [3,3,3,3,3,3,3,3,3...]
	 * 
	 */
    public void testCUBIN(){
    	
    	// Enable logging
        JCudaDriver.setLogLevel(LogLevel.LOG_DEBUGTRACE);
        JCudaDriver.setExceptionsEnabled(true);
        
        
    	// Initialize the driver and create a context for the first device.
        JCudaDriver.cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        JCudaDriver.cuDeviceGet(dev, 0);
        JCudaDriver.cuCtxCreate(pctx, 0, dev);


        // Load the CUBIN file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "vector_add.sm_10.cubin");


        // Obtain a function pointer to the "add_vectors_kernel" function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add_vectors_kernel");

        
        // Prepare host test data
        for(int i = 0; i < this.vectorElementCount; i++){
        	this.particleCoordinates[i] = 1.0f;
        	this.particlePositionModifiers[i] = 2.0f;
        	this.results[i] = 0.0f;
        }

        // Define pointers to input vectors
        CUdeviceptr positionDevicePointer = new CUdeviceptr();
        CUdeviceptr modificationDevicePointer = new CUdeviceptr();
        CUdeviceptr outputDevicePointer = new CUdeviceptr();
        
        // Allocate memory space on the GPU
        JCudaDriver.cuMemAlloc(positionDevicePointer, this.vectorMemSize);
        JCudaDriver.cuMemAlloc(modificationDevicePointer, this.vectorMemSize);
        JCudaDriver.cuMemAlloc(outputDevicePointer, this.vectorMemSize);

        // Copy data from host to device
        JCudaDriver.cuMemcpyHtoD(positionDevicePointer, Pointer.to(this.particleCoordinates), this.vectorMemSize);
        JCudaDriver.cuMemcpyHtoD(modificationDevicePointer, Pointer.to(this.particlePositionModifiers), this.vectorMemSize);


        // Set up the execution parameters.
        int num_blocks = (int) ((float) (this.vectorElementCount + threads_per_block - 1) / (float) threads_per_block);
    	int max_blocks_per_dimension = 65535;
    	int num_blocks_y = (int) ((float) (num_blocks + max_blocks_per_dimension - 1) / (float) max_blocks_per_dimension);
    	int num_blocks_x = (int) ((float) (num_blocks + num_blocks_y - 1) / (float) num_blocks_y);
        JCudaDriver.cuFuncSetBlockShape(function, num_blocks_x, num_blocks_y, 1);
        

        // Set up the parameters for the function call
        Pointer dInPositions = Pointer.to(positionDevicePointer);
        Pointer dInModifiers = Pointer.to(modificationDevicePointer);
        Pointer dOut = Pointer.to(outputDevicePointer);
        Pointer vectorSize = Pointer.to(new int[]{this.vectorElementCount});
        
        // Accumulate offset used for function call
        int offset = 0;
        
        // Position
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dInPositions, Sizeof.POINTER);
        offset += Sizeof.POINTER;
        
        // Modifier
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dInModifiers, Sizeof.POINTER);
        offset += Sizeof.POINTER;
        
        // Results
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dOut, Sizeof.POINTER);
        offset += Sizeof.POINTER;

        // Vector Size
        offset = JCudaDriver.align(offset, Sizeof.INT);
        JCudaDriver.cuParamSetv(function, offset, vectorSize, Sizeof.INT);
        offset += Sizeof.INT;
        
        JCudaDriver.cuParamSetSize(function, offset);

        // Call the function.
        JCudaDriver.cuLaunch(function);
        JCudaDriver.cuCtxSynchronize();


        // Copy the device output to the host.
        JCudaDriver.cuMemcpyDtoH(Pointer.to(this.results), outputDevicePointer, this.vectorMemSize);


        // Verify the result via simple output
        // All values should be 3, ie 3,3,3,3,3,3,3,3,3,3........
        for(int i = 0; i < this.vectorElementCount; i++){
        	System.out.print(this.results[i]+",");
        }

 
        
        // Clean up.
        JCudaDriver.cuMemFree(positionDevicePointer);
        JCudaDriver.cuMemFree(modificationDevicePointer);
        JCudaDriver.cuMemFree(outputDevicePointer);
    }
    
    /**
     * Kick off the test
     * @param args
     */
	public static void main(String[] args){
		TestCUBINCall test = new TestCUBINCall();
		test.testCUBIN();
	}
}


I’ll keep going here :wink:

I forgot the extern "C" that makes the function visible (under its plain, unmangled name) to external callers.

My kernel file now contains only this code:


extern "C"
__global__ void add_vectors_kernel(float *A, float *B, float *C, int N) {
	// Determine which element this thread is computing
	int block_id = blockIdx.x + gridDim.x * blockIdx.y;
	int thread_id = blockDim.x * block_id + threadIdx.x;
	
	// Compute a single element of the result vector (if the element is valid)
	if (thread_id < N) C[thread_id] = A[thread_id] + B[thread_id];
}

Output: no more errors, but not the expected values:



3.0,-2.6685475E-5,-1.9992675,-1.4982358E23,-4.8428295E-32,-2.74810749E11,NaN,-1.4505928E26,NaN,-1.7335166,-1.4374386,-1.6866435,Executing cuInit
Executing cuDeviceGet for device 0
Executing cuCtxCreate
Executing cuModuleLoad
Executing cuModuleGetFunction
Executing cuMemAlloc of 48 bytes
Executing cuMemAlloc of 48 bytes
Executing cuMemAlloc of 48 bytes
Executing cuMemcpyHtoD of 48 bytes
Initializing pointer data for Java Pointer object 0x7f807ed439d0
Obtaining native pointer 0x1000500
Initializing pointer data for Java Pointer object 0x7f807ed439c8
Obtaining host memory from array in java buffer
Releasing host memory from array in java buffer
Executing cuMemcpyHtoD of 48 bytes
Initializing pointer data for Java Pointer object 0x7f807ed439d0
Obtaining native pointer 0x1000600
Initializing pointer data for Java Pointer object 0x7f807ed439c8
Obtaining host memory from array in java buffer
Releasing host memory from array in java buffer
Executing cuFuncSetBlockShape (1,1,1)
Executing cuParamSetv
Initializing pointer data for Java Pointer object 0x7f807ed439b8
Obtaining pointers in host memory
Releasing host memory of pointers
Executing cuParamSetv
Initializing pointer data for Java Pointer object 0x7f807ed439b8
Obtaining pointers in host memory
Releasing host memory of pointers
Executing cuParamSetv
Initializing pointer data for Java Pointer object 0x7f807ed439b8
Obtaining pointers in host memory
Releasing host memory of pointers
Executing cuParamSetv
Initializing pointer data for Java Pointer object 0x7f807ed439b8
Obtaining host memory from array in java buffer
Releasing host memory from array in java buffer
Executing cuParamSetSize
Executing cuLaunch
Executing cuCtxSynchronize
Executing cuMemcpyDtoH of 48 bytes
Initializing pointer data for Java Pointer object 0x7f807ed439d0
Obtaining host memory from array in java buffer
Initializing pointer data for Java Pointer object 0x7f807ed439c8
Obtaining native pointer 0x1000700
Releasing host memory from array in java buffer
Executing cuMemFree
Initializing pointer data for Java Pointer object 0x7f807ed439e0
Obtaining native pointer 0x1000500
Executing cuMemFree
Initializing pointer data for Java Pointer object 0x7f807ed439e0
Obtaining native pointer 0x1000600
Executing cuMemFree
Initializing pointer data for Java Pointer object 0x7f807ed439e0
Obtaining native pointer 0x1000700



Because of the first (and only) correct "3", I’m guessing I’m not launching the required number of GPU threads. Time to hack more; much closer!

I fixed it! My error lay in how my threads/blocks were being set up. I changed the code to put all of the threads into just the x-dimension of the block shape (which limits me to at most 65535 elements). A limitation, yes, but that’s OK!

You will see me in this forum more often!

Feel free to use my code in a tutorial for others to learn from if you wish :slight_smile:

-Eric-


import jcuda.LogLevel;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;

public class TestCUBINCall {
	
	// Run attributes, not really important but define vector element size. 
	// This is from another larger test program I was working on.
	private int maxParticleCount = 4;
	private int dimensions = 3;
	
	// Calculate vector needs
    private int vectorElementCount = maxParticleCount*dimensions;
    private int vectorMemSize = vectorElementCount*Sizeof.FLOAT;
    
    // All input and result arrays are the same size
	private float particleCoordinates[] = new float[vectorElementCount];
	private float particlePositionModifiers[] = new float[vectorElementCount];
	private float results[] = new float[vectorElementCount];
	
	private int threads_per_block = 256;

	/**
	 * Simple test method to calculate the sum of 2 vectors
	 * 
	 * [1,1,1,1,1,1,1,1,1...] + [2,2,2,2,2,2,2,2,2...] = [3,3,3,3,3,3,3,3,3...]
	 * 
	 */
    public void testCUBIN(){
    	
    	// Enable logging
        JCudaDriver.setLogLevel(LogLevel.LOG_DEBUGTRACE);
        JCudaDriver.setExceptionsEnabled(true);
        
        
    	// Initialize the driver and create a context for the first device.
        JCudaDriver.cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        JCudaDriver.cuDeviceGet(dev, 0);
        JCudaDriver.cuCtxCreate(pctx, 0, dev);


        // Load the CUBIN file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "vector_add_kernel_only.sm_10.cubin");


        // Obtain a function pointer to the "add_vectors_kernel" function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add_vectors_kernel");

        
        // Prepare host test data
        for(int i = 0; i < this.vectorElementCount; i++){
        	this.particleCoordinates[i] = 1.0f;
        	this.particlePositionModifiers[i] = 2.0f;
        	this.results[i] = 0.0f;
        }

        // Define pointers to input vectors
        CUdeviceptr positionDevicePointer = new CUdeviceptr();
        CUdeviceptr modificationDevicePointer = new CUdeviceptr();
        CUdeviceptr outputDevicePointer = new CUdeviceptr();
        
        // Allocate memory space on the GPU
        JCudaDriver.cuMemAlloc(positionDevicePointer, this.vectorMemSize);
        JCudaDriver.cuMemAlloc(modificationDevicePointer, this.vectorMemSize);
        JCudaDriver.cuMemAlloc(outputDevicePointer, this.vectorMemSize);

        // Copy data from host to device
        JCudaDriver.cuMemcpyHtoD(positionDevicePointer, Pointer.to(this.particleCoordinates), this.vectorMemSize);
        JCudaDriver.cuMemcpyHtoD(modificationDevicePointer, Pointer.to(this.particlePositionModifiers), this.vectorMemSize);

        
        // Limit of 65535 elements now!!!
        // 1 thread per vector element
        JCudaDriver.cuFuncSetBlockShape(function, this.vectorElementCount, 1, 1);
        
        
        // Set up the parameters for the function call
        Pointer dInPositions = Pointer.to(positionDevicePointer);
        Pointer dInModifiers = Pointer.to(modificationDevicePointer);
        Pointer dOut = Pointer.to(outputDevicePointer);
        Pointer vectorSize = Pointer.to(new int[]{this.vectorElementCount});
        
        // Accumulate offset used for function call
        int offset = 0;
        
        // Position
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dInPositions, Sizeof.POINTER);
        offset += Sizeof.POINTER;
        
        // Modifier
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dInModifiers, Sizeof.POINTER);
        offset += Sizeof.POINTER;
        
        // Results
        offset = JCudaDriver.align(offset, Sizeof.POINTER);
        JCudaDriver.cuParamSetv(function, offset, dOut, Sizeof.POINTER);
        offset += Sizeof.POINTER;

        // Vector Size
        offset = JCudaDriver.align(offset, Sizeof.INT);
        JCudaDriver.cuParamSetv(function, offset, vectorSize, Sizeof.INT);
        offset += Sizeof.INT;
        
        JCudaDriver.cuParamSetSize(function, offset);

        // Call the function.
        JCudaDriver.cuLaunch(function);
        JCudaDriver.cuCtxSynchronize();


        // Copy the device output to the host.
        JCudaDriver.cuMemcpyDtoH(Pointer.to(this.results), outputDevicePointer, this.vectorMemSize);


        // Verify the result via simple output
        // All values should be 3, ie 3,3,3,3,3,3,3,3,3,3........
        for(int i = 0; i < this.vectorElementCount; i++){
        	System.out.print(this.results[i]+",");
        }

 
        // Clean up.
        JCudaDriver.cuMemFree(positionDevicePointer);
        JCudaDriver.cuMemFree(modificationDevicePointer);
        JCudaDriver.cuMemFree(outputDevicePointer);
    }
    
    /**
     * Kick off the test
     * @param args
     */
	public static void main(String[] args){
		TestCUBINCall test = new TestCUBINCall();
		test.testCUBIN();
	}
}


With JCuda, how do I set the grid/block dimensions as one does in C?

add_vectors_kernel <<< grid_size , threads_per_block >>> (A_GPU, B_GPU, C_GPU, N);

Hello

Thank you for your precise progress description :slight_smile:

I should probably assemble some of these issues into a small HowTo/Tutorial, covering the installation, the setup of the "boilerplate code" for the initialization, and a step-by-step description of how to create your own kernel.
Especially the necessity to declare the function as
extern "C"
is something that is not obvious in the beginning. Actually, I had the same problem during my first tests - then I had a look at the intermediate PTX code that is generated by NVCC: when the extern "C" is left out, the function name becomes "mangled", and simple function names like "kernel" are renamed, taking the function signature into account, so they become something like "_Z6kernelP6float4jjf" :twisted: The functions can also be called using these mangled names, but of course, using extern "C" is much simpler…
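
For example (just as an illustration; the mangled name below is made up, and the real one depends on the compiler and the exact signature), the lookup with and without extern "C" would differ like this:

// Without extern "C", cuModuleGetFunction would have to be called with the
// mangled name found in the PTX/cubin (this particular name is hypothetical):
JCudaDriver.cuModuleGetFunction(function, module, "_Z18add_vectors_kernelPfS_S_i");

// With extern "C" in the .cu file, the plain name can be used:
JCudaDriver.cuModuleGetFunction(function, module, "add_vectors_kernel");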

Concerning your question about the thread/block/grid setup:

When calling a kernel in C (with the runtime API), the call usually is
kernel<<< gridDim, blockDim >>>(…)
where
blockDim is a dim3 that specifies the number of threads per block and
gridDim is a dim3 that specifies the number of blocks per grid (only the first two elements are used).
(Additionally, one may specify shared memory or streams, but these parameters may be omitted when they are not required.)

Using the driver API, these parameters are set up using function calls. The call to set the number of threads per block:
cuFuncSetBlockShape(function, threadsPerBlockX, threadsPerBlockY, threadsPerBlockZ);

And when launching the function, the number of blocks per grid can be specified by calling cuLaunchGrid:
cuLaunchGrid(function, blocksPerGridX, blocksPerGridY);
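
For the add_vectors_kernel above, a minimal sketch of the driver API equivalent of the <<< grid_size, threads_per_block >>> launch could look like this (the values and variable names are just placeholder assumptions; the parameter setup is the same as in the code posted earlier):

// Example configuration (N is the element count, 100000 in the C program above)
int N = 100000;
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

// Threads per block: the second <<< ... >>> argument of the runtime API
JCudaDriver.cuFuncSetBlockShape(function, threadsPerBlock, 1, 1);

// ... cuParamSetv / cuParamSetSize calls as in the code above ...

// Blocks per grid: the first <<< ... >>> argument; use cuLaunchGrid instead of cuLaunch
JCudaDriver.cuLaunchGrid(function, blocksPerGrid, 1);
JCudaDriver.cuCtxSynchronize();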

Another thing that should be in the Tutorial :wink: I think I’ll start this as soon as possible.

My pleasure. Thanks for the information. I’ll post up my finalized code in the near future for others to utilize.

extern "C", yes, that took me back loooong ago to when I was an undergrad writing x86 assembly for Turbo C++. :wink:

-E-
