Matrix Row Sum in JCUDA

Hi, here is the program I’m trying to run to compute the matrix row sums, but the resulting sum is always 0. I have tried the matrix row sum in Visual C, and the C program works fine.

The Java code is:

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoad;
import jcuda.runtime.JCuda;

/**
 *
 *
 */
public class MtrixRowSum {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {

        int M = 4, N = 4,P=16;

        float scores_h[][] = new float[M][N];
        float[] a = new float[] {(float)1.35};
        int first[][] = new int[M][N];

        int sum[] = new int[M*N*4];

        int i, j;
        //input in host array
        for (i = 0; i<M; i++)
        {
            for (j = 0; j<N; j++)
            {
                scores_h[i][j] = 1;

            }
        }
        //load the function
        cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        cuDeviceGet(dev, 0);
        cuCtxCreate(pctx, 0, dev);
        //load the module
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "matrixRowSum.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "rowSum");
        CUdeviceptr a_dev1 = new CUdeviceptr();


        // memory allocation
        CUdeviceptr a_dev[] = new CUdeviceptr[P];
        for(i=0;i<P;i++){
            a_dev[i]=new CUdeviceptr();
            // memory allocation
            cuMemAlloc(a_dev[i], Sizeof.INT*4*4);
        }
        for(i=0;i<M;i++){
            // copy the content from host to GPU
            cuMemcpyHtoD(a_dev[i], Pointer.to(scores_h[i]), Sizeof.FLOAT*4*4);
        }





        CUdeviceptr b_dev[] = new CUdeviceptr[M];
        for(i=0;i<M;i++){
            b_dev[i]=new CUdeviceptr();
            // memory allocation
            cuMemAlloc(b_dev[i], Sizeof.INT*4*4);
        }



        //Pointer object that will hold all the parameters
        Pointer kernelParameters = Pointer.to(
            Pointer.to(a_dev),
            Pointer.to(b_dev)
            );
        cuLaunchKernel(function, 1, 1, 1, P, 1, 1, 0, null, kernelParameters, null);
        //copy back the result from the GPU to host
        for(i=0;i<M;i++){
            // copy the content from GPU to host
            cuMemcpyDtoH(Pointer.to(sum),b_dev[i], Sizeof.FLOAT*4*4);

        }
        for(i=0;i<M;i++)
        {
            // print the result
            System.out.println("sum: "+sum[i]);
        }
        //free the memory...
        for(i = 0; i < P; i++)
        {
            cuMemFree(a_dev[i]);
        }
        for(i = 0; i < M; i++)
        {
            cuMemFree(b_dev[i]);
        }
    }
}

matrixRowSum.ptx is compiled from the following kernel, which works fine in Visual C 2013:


extern "C"
__global__ void RowSum(float* B, float* Sum, int N, int M)
{
int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;

if (rowIdx < N) {
float sum = 0;
for (int k = 0; k < M; k++)
sum += B[rowIdx*M + k];
Sum[rowIdx] = sum;
}
}

There is no error, but the resulting sums are:
run:
sum: 0
sum: 0
sum: 0
sum: 0
BUILD SUCCESSFUL (total time: 0 seconds)

Please tell me what changes I should make to the program.

[edit SlaterB: blog entry by @richa moved to the forum]

Just a short note: During development and tests, I’d recommend to have
JCudaDriver.setExceptionsEnabled(true);
as the first line in your main method. Otherwise, you’d have to do the error checks manually.

I’ll try out the code and write more details later today.

The line
cuModuleGetFunction(function, module, "rowSum");
caused an error CUDA_ERROR_NOT_FOUND, because the name is case sensitive - it should be
cuModuleGetFunction(function, module, "RowSum"); (capital “R”)

The memory allocations have been odd:
cuMemAlloc(a_dev[i], Sizeof.INT * 4 * 4); should likely be
cuMemAlloc(a_dev[i], Sizeof.FLOAT * N); // FLOAT and N (or M, see below)
(similarly, for the memory copies)

The M and N parameters were missing from the kernel parameters.

The roles of M and N are not entirely clear to me. Additionally, you seem to be summing pointers in the kernel (thus, basically garbage data).
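
To make the size arithmetic concrete, here is a minimal sketch in plain Java (no JCuda involved). The class and method names are made up for illustration, and `Float.BYTES` stands in for `Sizeof.FLOAT`, which is also 4. One row of N = 4 floats occupies 16 bytes, whereas the original code allocates and copies 4 * 4 * 4 = 64 bytes per row.

```java
// Hypothetical helper, only to illustrate the byte counts discussed above.
final class FloatBufferSize {

    // Number of bytes occupied by 'count' float values.
    static long floatBytes(int count) {
        return (long) count * Float.BYTES;
    }

    public static void main(String[] args) {
        int n = 4;
        // Size of one row of n floats
        System.out.println("One row:            " + floatBytes(n) + " bytes");
        // Size that the original code used for each row (4 * 4 elements)
        System.out.println("Original per row:   " + floatBytes(4 * 4) + " bytes");
    }
}
```

Copying 64 bytes out of a host array that only holds 16 bytes reads past the end of that array, so the sizes should be derived from the actual element count.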


Let’s take a step back:

You have a matrix, with M rows and N columns, right?
Do you want to represent this as a 2D array on the device? (Note that representing it as a 1D array is far easier and likely more efficient)
Then you want to compute the sum of each of the M rows, and write the results into an array of size M, right?
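
For the 1D representation, a Java `float[][]` can be flattened into a single row-major `float[]` before it is copied to the device. This is only a sketch in plain Java: the class and method names are hypothetical, and it assumes that all rows have the same length.

```java
import java.util.Arrays;

// Hypothetical helper: flattens a rectangular 2D array into row-major order.
final class RowMajorFlatten {

    // Element (r, c) of the input ends up at index r * cols + c of the result.
    // Assumes that every row has the same length as the first row.
    static float[] flatten(float[][] matrix2D) {
        int rows = matrix2D.length;
        int cols = rows == 0 ? 0 : matrix2D[0].length;
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(matrix2D[r], 0, flat, r * cols, cols);
        }
        return flat;
    }

    public static void main(String[] args) {
        float[][] scores = { { 1, 2, 3 }, { 4, 5, 6 } };
        System.out.println(Arrays.toString(flatten(scores)));
    }
}
```

The resulting array can then be passed to a single `cuMemcpyHtoD` call, instead of one copy per row.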

Yes, exactly, I want to compute the sum of each matrix row and write the results into another array. I tried to make changes as per your suggestions, but it’s not working.
As the parameters were missing, I have added them to the kernel parameters:

    Pointer.to(a_dev),
    Pointer.to(b_dev),
    Pointer.to(a_dev1),
    Pointer.to(a_dev2)
);

where a_dev1 and a_dev2 are set up as follows:

     Float[] b = new Float[] {(Float)4.f};
     Float[] d = new Float[] {(Float)4.f};
     // memory allocation
     cuMemAlloc(a_dev1, Sizeof.FLOAT);
     cuMemcpyHtoD(a_dev1, Pointer.to(b), Sizeof.FLOAT);
     
     CUdeviceptr a_dev2 = new CUdeviceptr();
     
     // memory allocation
     cuMemAlloc(a_dev2, Sizeof.FLOAT);
     cuMemcpyHtoD(a_dev2, Pointer.to(d), Sizeof.FLOAT);

The results are still the same, i.e.

run:
sum: 0
sum: 0
sum: 0
sum: 0
BUILD SUCCESSFUL (total time: 0 seconds)

It is still not working after making all the changes…

    Pointer.to(a_dev),
    Pointer.to(b_dev),
    Pointer.to(a_dev1),
    Pointer.to(a_dev2)
);

     CUdeviceptr a_dev1 = new CUdeviceptr();
     
     // memory allocation
     cuMemAlloc(a_dev1, Sizeof.FLOAT);
     cuMemcpyHtoD(a_dev1, Pointer.to(b), Sizeof.FLOAT);
     
     CUdeviceptr a_dev2 = new CUdeviceptr();
     
     // memory allocation
     cuMemAlloc(a_dev2, Sizeof.FLOAT);
     cuMemcpyHtoD(a_dev2, Pointer.to(d), Sizeof.FLOAT);

So after making these changes I’m still not able to compute the matrix row sums, and the answer is still the same:

run:
sum: 0
sum: 0
sum: 0
sum: 0
BUILD SUCCESSFUL (total time: 0 seconds)

The reason why I suggested to “take a step back” was that there are several (many) issues with the code, and I wanted to be sure that I (and you ;-)) know what you want to achieve.

Again, in the kernel, you are summing up
sum += B[rowIdx*M + k];
But “B” here is an array that was filled from “a_dev” in the host code - and this is an array of pointers. You are adding pointers, and not the values that the pointers are pointing to.

This is related to the (still unanswered) question of whether you want to represent the matrix as a 2D array or a 1D array on the device. But I assume that 1D will be OK.

I’ll try to create an appropriate sample later today.

(And I hope that it will be readable and understandable, because I will use variable names like numRows instead of M or N, and matrix instead of a_dev or B ;-))

Here is an example that should accomplish the desired task:

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxDestroy;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoad;

import java.util.Locale;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;

/**
 * An example that computes the sums of the rows of a matrix,
 * and writes the results into an array
 */
public class JCudaMatrixRowSum
{
    public static void main(String[] args)
    {
        // Enable exceptions and omit subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);

        // Create a context for the first device
        cuInit(0);
        CUcontext context = new CUcontext();
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        cuCtxCreate(context, 0, device);

        // Load the module and obtain the pointer to the kernel function
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "JCudaMatrixRowSumKernel.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "rowSums");
        
        // Create the input matrix in host memory
        int rows = 8;
        int cols = 6;
        float hostMatrix[] = createExampleMatrix(rows, cols);
        
        System.out.println("Input matrix:");
        System.out.println(createString2D(hostMatrix, rows, cols));
        
        // Copy the host data to the device
        CUdeviceptr deviceMatrix = new CUdeviceptr();
        cuMemAlloc(deviceMatrix, rows * cols * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceMatrix, Pointer.to(hostMatrix), 
            rows * cols * Sizeof.FLOAT);

        // Allocate memory for the result on the device
        CUdeviceptr deviceSums = new CUdeviceptr();
        cuMemAlloc(deviceSums, rows * Sizeof.FLOAT);
        
        // Set up and launch the kernel
        Pointer kernelParameters = Pointer.to(
            Pointer.to(deviceMatrix),
            Pointer.to(deviceSums),
            Pointer.to(new int[]{ rows }),
            Pointer.to(new int[]{ cols })
        );
        int blockSizeX = 256;
        // The kernel uses one thread per row, so the grid only has to cover 'rows'
        int gridSizeX = (rows + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function, 
            gridSizeX, 1, 1, 
            blockSizeX, 1, 1, 
            0, null, kernelParameters, null);
        
        // Copy the result from the device to the host
        float hostSums[] = new float[rows];
        cuMemcpyDtoH(Pointer.to(hostSums), deviceSums, rows * Sizeof.FLOAT);

        System.out.println("Row sums:");
        System.out.println(createString2D(hostSums, rows, 1));
        
        // Clean up
        cuMemFree(deviceSums);
        cuMemFree(deviceMatrix);
        cuCtxDestroy(context);
    }

    private static float[] createExampleMatrix(int rows, int cols)
    {
        float matrix[] = new float[rows*cols];
        int counter = 0;
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                matrix[r * cols + c] = counter;
                counter++;
            }
        }
        return matrix;
    }
    
    private static String createString2D(
        float matrix[], int rows, int cols)
    {
        String format = "%7.2f";
        StringBuffer sb = new StringBuffer();
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                float value = matrix[r * cols + c];
                String s = String.format(Locale.ENGLISH, format, value);
                sb.append(s).append(" ");
            }
            sb.append("\n");
        }
        return sb.toString();
    }
}

The JCudaMatrixRowSumKernel.cu, to be compiled to JCudaMatrixRowSumKernel.ptx:


// A kernel that computes the sums of the rows of a matrix, which is stored 
// in row-major order, and places the sums in a result array. 
extern "C"
__global__ void rowSums(float* matrix, float* sums, int rows, int cols)
{
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < rows) 
    {
        float sum = 0;
        for (int col = 0; col < cols; col++)
        {
            sum += matrix[row * cols + col];
        }
        sums[row] = sum;
    }
}
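
A short note on the launch configuration: the kernel uses one thread per row, so the launch needs at least `rows` threads in total. Launching more is harmless because of the `row < rows` check, but not necessary. The rounding-up division for the grid size can be sketched like this in plain Java (the class and method names are only for illustration):

```java
// Hypothetical helper: number of blocks needed to cover a given number of threads.
final class LaunchConfig {

    // Integer division that rounds up: the smallest grid size for which
    // gridSize * blockSize >= totalThreads (for positive arguments).
    static int gridSize(int totalThreads, int blockSize) {
        return (totalThreads + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        int blockSizeX = 256;
        // 8 rows fit into a single block of 256 threads
        System.out.println(gridSize(8, blockSizeX));
        // 1000 rows need 4 blocks (1024 threads, the last 24 stay idle)
        System.out.println(gridSize(1000, blockSizeX));
    }
}
```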

The example prints the following:


Input matrix:
   0.00    1.00    2.00    3.00    4.00    5.00 
   6.00    7.00    8.00    9.00   10.00   11.00 
  12.00   13.00   14.00   15.00   16.00   17.00 
  18.00   19.00   20.00   21.00   22.00   23.00 
  24.00   25.00   26.00   27.00   28.00   29.00 
  30.00   31.00   32.00   33.00   34.00   35.00 
  36.00   37.00   38.00   39.00   40.00   41.00 
  42.00   43.00   44.00   45.00   46.00   47.00 

Row sums:
  15.00 
  51.00 
  87.00 
 123.00 
 159.00 
 195.00 
 231.00 
 267.00 
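
If you want to cross-check the GPU results, a host-side reference can be written in plain Java with the same indexing as the kernel. This is only a sketch (the class name is hypothetical), and it is not part of the example above:

```java
import java.util.Arrays;

// Hypothetical CPU reference that mirrors the indexing of the rowSums kernel.
final class RowSumReference {

    // Sums each row of a row-major matrix with the given dimensions.
    static float[] rowSums(float[] matrix, int rows, int cols) {
        float[] sums = new float[rows];
        for (int row = 0; row < rows; row++) {
            float sum = 0;
            for (int col = 0; col < cols; col++) {
                sum += matrix[row * cols + col];
            }
            sums[row] = sum;
        }
        return sums;
    }

    public static void main(String[] args) {
        // Same contents as createExampleMatrix(8, 6): the values 0, 1, 2, ...
        int rows = 8;
        int cols = 6;
        float[] matrix = new float[rows * cols];
        for (int i = 0; i < matrix.length; i++) {
            matrix[i] = i;
        }
        System.out.println(Arrays.toString(rowSums(matrix, rows, cols)));
    }
}
```

Comparing this array with `hostSums` (allowing for a small tolerance with less trivial data) is a quick way to notice wrong sizes or parameters.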

Thanks… I understand where I went wrong, and from now on I’ll use meaningful variable names rather than M or N. Thank you :-)

hello Marco,

I ran the above program on my system, but it throws the following exception, even though I give the full path to the .ptx file:
Exception in thread “main” jcuda.CudaException: CUDA_ERROR_FILE_NOT_FOUND
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:288)
at jcuda.driver.JCudaDriver.cuModuleLoad(JCudaDriver.java:1906)
at TestSampleCuda.main(TestSampleCuda.java:238)

I think this is because I created the JCudaMatrixRowSumKernel.cu file in my workspace, but cuModuleLoad tries to access the JCudaMatrixRowSumKernel.ptx file, which is not being generated there.

If I replace the .ptx file in cuModuleLoad with the .cu file, like cuModuleLoad(module, “C:\Users\590943\workspace\Assignments\JCudaMatrixRowSumKernel.cu”);

I get:
Exception in thread “main” jcuda.CudaException: CUDA_ERROR_INVALID_IMAGE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:288)
at jcuda.driver.JCudaDriver.cuModuleLoad(JCudaDriver.java:1906)
at TestSampleCuda.main(TestSampleCuda.java:238)
Please guide me.

If you are the same person as the one who wrote the other posts:
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/21752-basic-jcuda-setup-windows.html#post140361
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/19607-matrix-row-sum-jcuda.html#post140379
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/20617-jcuda-cudaexception-cuda_error_invalid_image.html#post140362
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/21939-cl-exe-found-path.html#post140360
then please try to focus on one topic, otherwise I don’t know what I should write where, and I don’t know what your actual question is, and what exactly works or does not work.


The example above assumes that the JCudaMatrixRowSumKernel.ptx file is compiled manually, at the command line, using NVCC. You cannot load a “.CU” file with cuModuleLoad.
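
As a rough sketch of what that manual step looks like: the NVCC call has the form `nvcc -ptx <file>.cu -o <file>.ptx`. The following plain Java snippet only builds this command line; the class and method names are made up for illustration, and it does not run the compiler itself.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles the NVCC command line for a .cu file.
final class PtxCommand {

    // Derives the .ptx file name from the .cu file name.
    static String ptxFileName(String cuFileName) {
        if (!cuFileName.endsWith(".cu")) {
            throw new IllegalArgumentException("Not a .cu file: " + cuFileName);
        }
        return cuFileName.substring(0, cuFileName.length() - 3) + ".ptx";
    }

    // Builds the command: nvcc -ptx <input>.cu -o <input>.ptx
    static List<String> nvccCommand(String cuFileName) {
        return Arrays.asList(
            "nvcc", "-ptx", cuFileName, "-o", ptxFileName(cuFileName));
    }

    public static void main(String[] args) {
        System.out.println(
            String.join(" ", nvccCommand("JCudaMatrixRowSumKernel.cu")));
    }
}
```

The printed command can be typed at the command line, or the list could be passed to a `ProcessBuilder`. Either way, this assumes that `nvcc` (and, on Windows, the `cl.exe` host compiler) can be found on the PATH, and that the resulting .ptx file ends up at the path that is given to cuModuleLoad.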