Matrix operations in JCuda

Hi… I am trying to cover matrix operations like multiplication, addition, subtraction and the matrix row sum average. I have successfully done the other three, except the matrix row sum average. I am working with Visual C++ and my code is:

__global__ void rowSums(float* InputMatrix, float* RowSumAverage, int rows, int cols)
{
	int row = threadIdx.x + blockIdx.x * blockDim.x; // global row index of this thread
	if (row < rows)
	{
		float sum = 0;
		int counter = 0; // counts the non-zero elements of the row
		for (int col = 0; col < cols; col++)
		{
			sum += InputMatrix[row * cols + col];
			if (InputMatrix[row * cols + col] != 0)
			{
				counter++;
			}
		}
		// guard against division by zero for all-zero rows
		RowSumAverage[row] = (counter > 0) ? (sum / counter) : 0;
	}
}

This is also working fine… but how does parallelization exist here, since there is still a loop? When matrix addition is done, we don't need to raise the index again and again…
The code is:

__global__ void MatAdd(int A[][N], int B[][N], int C[][N]){
	int i = threadIdx.x;
	int j = threadIdx.y;

	C[i][j] = A[i][j] + B[i][j];
}

I also tried this code to do the matrix row sum, but it's not working:

__global__ void RowSum(int B[][N], int Sum[N], int *row, int *col)
{
	int i = blockDim.x * blockIdx.x + threadIdx.x;    
	int j = blockDim.y * blockIdx.y + threadIdx.y;
	if (i < *row && j < *col)
		Sum[j] += B[i][j];
}

Is there any clue how to make this work…?

Why are you posting this not in the JCuda area but in the (German) Java area? :wink:
Moved here.


Are you the user @richa, who has posted similar questions before and has the same IP, and who posted the question

I want to ask about the concept of thread, ID and block. I have visited many sites and slides on the internet but was not able to understand how one has to think about dividing the work into grids, blocks and threads. Somewhere I found
i = (blockIdx.x * blockDim.x) + threadIdx.x;
j = (blockIdx.y * blockDim.y) + threadIdx.y;

but how will we be able to understand what dimensions this is representing?

as a blog entry yesterday, after the topic before

was also posted as a blog entry first?

I had manually created the forum thread for the first blog entry; there were answers, you even answered (briefly) my forum PM, everything was fine.

The second blog entry I just deleted + sent a PM; you could have created a forum entry, nothing bad happened,

but there was no answer to my PM, and now a new user? Strange…

Sorry, my mistake…

*** Edit ***

Yeah, that's my friend @richa… she introduced me to this forum… I find this forum useful for CUDA discussion… and I am trying to port it from JCuda to CUDA… can we switch to the topic discussion now… if you have any idea, then please help me.

Have you read the topics before, at least the ones by richa? :wink:
@Marco13 is the only one answering here; no worry, an answer will come soon for sure.

There are some syntactical details, and an open question that I already asked @richa about. Most importantly:

Should the matrix be represented as a 1D array or as a 2D array???

I never got an answer, though.

So just a short answer here:


__global__ void MatAdd(int A[][N], int B[][N], int C[][N]){
	int i = threadIdx.x;
	int j = threadIdx.y;

	C[i][j] = A[i][j] + B[i][j];
}

Here you are using threadIdx.x and threadIdx.y, which are the 2D coordinates of the thread in the thread block. (NOTE: This will work only for “few” rows and columns. Usually, it should be blockDim.x * blockIdx.x + threadIdx.x - read about this in the CUDA Programming Guide and related resources).

This basically means that you have, for example, 128x128 threads. Each thread is doing a single addition. The thread at (x,y) will add the matrix elements at (x,y).
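Just to sketch how this could look for larger matrices (only a sketch: the name MatAddLarge is made up here, and it assumes a 1D, row-major array layout, which is exactly the open 1D-vs-2D question from above):

__global__ void MatAddLarge(const int* A, const int* B, int* C, int rows, int cols)
{
	// global 2D coordinates of this thread, spanning all blocks
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	int j = blockDim.y * blockIdx.y + threadIdx.y;

	// guard against the threads that fall outside the matrix
	if (i < rows && j < cols)
		C[i * cols + j] = A[i * cols + j] + B[i * cols + j];
}

It would be launched with a 2D grid that covers the whole matrix, roughly like this (dA, dB, dC being hypothetical device pointers):

dim3 block(16, 16);
dim3 grid((rows + block.x - 1) / block.x, (cols + block.y - 1) / block.y);
MatAddLarge<<<grid, block>>>(dA, dB, dC, rows, cols);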

In the row example, you have


__global__ void rowSums(float* InputMatrix, float* RowSumAverage, int rows, int cols)
{
	int row = threadIdx.x + blockIdx.x * blockDim.x; // global row index of this thread
	if (row < rows)
        ...
}

This means that you are only using 1D coordinates. This makes sense, as you want to perform the operations for all rows in parallel. So when you have 128 threads, each thread will compute the sum of one row of the matrix. Thread (x) will compute the sum of row (x) (by iterating over all columns - that's what the loop is for).
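For completeness, a launch of that kernel could look roughly like this (a sketch; dInputMatrix and dRowSumAverage are hypothetical device pointers, and 256 is just a common block size):

// one thread per row; round the grid size up so that all rows are covered
int blockSize = 256;
int gridSize = (rows + blockSize - 1) / blockSize;
rowSums<<<gridSize, blockSize>>>(dInputMatrix, dRowSumAverage, rows, cols);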

Note that you could also do a reduction of each row, to better exploit the parallelism when you have “few” rows. Such a reduction and its optimization are described in this PDF file: https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf However, this is far more difficult, unless you already have some of the building blocks available. The samples at jcuda.org - Samples contain a reduction example, but it would have to be adjusted accordingly.
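To give a rough idea, such a per-row reduction could look like this (only a sketch, not the optimized version from the PDF; it assumes one block per row and a power-of-two block size, and the name rowSumsReduction is made up here):

__global__ void rowSumsReduction(const float* matrix, float* rowSums, int cols)
{
	extern __shared__ float sdata[];
	int row = blockIdx.x; // one block per row
	int tid = threadIdx.x;

	// each thread accumulates a strided partial sum over its row
	float sum = 0;
	for (int col = tid; col < cols; col += blockDim.x)
		sum += matrix[row * cols + col];
	sdata[tid] = sum;
	__syncthreads();

	// tree reduction of the partial sums in shared memory
	for (int s = blockDim.x / 2; s > 0; s >>= 1)
	{
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}

	if (tid == 0)
		rowSums[row] = sdata[0];
}

It would be launched with one block per row and the shared memory size passed explicitly, e.g. rowSumsReduction<<<rows, 256, 256 * sizeof(float)>>>(dMatrix, dRowSums, cols); (again with hypothetical device pointers).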

:slight_smile: yeah… I have gone through all the posts by @richa… and I argued that the method they have chosen is not correct from a parallel programming point of view…

Again, it depends on many factors, and there may not be a “right” or “wrong” (but maybe “better/worse for certain application cases”).

For a matrix with 10000 rows and 10 columns, the proposed method of computing the row sums is certainly fine.
For a 1000x100 matrix it will also be OK.
For a matrix with a size of 100x1000, you will only use 100 threads, which will not exploit the available cores.
For a matrix with 10 rows and 10000 columns, the proposed method would not make any sense at all, because it would only use 10 threads. In this case, the parallelization should be done over the columns, requiring that you do a parallel reduction of the rows (a sketch of a column-parallel variant follows below).
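For such wide matrices, a simple (though not optimal) alternative to a full reduction would be to let the threads cover the columns and combine the partial results with atomicAdd. A sketch, assuming that rowSums has been zero-initialized (e.g. with cudaMemset) and a GPU that supports atomicAdd for float:

__global__ void rowSumsAtomic(const float* matrix, float* rowSums, int rows, int cols)
{
	int col = blockDim.x * blockIdx.x + threadIdx.x;
	int row = blockDim.y * blockIdx.y + threadIdx.y;

	// each thread adds one matrix element to the sum of its row
	if (row < rows && col < cols)
		atomicAdd(&rowSums[row], matrix[row * cols + col]);
}

This uses all rows*cols threads, but the atomic operations will serialize within each row, so a proper reduction will usually still be faster.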