Matrix Multiplication with KernelLauncher

Hi. In this piece of code, can you tell me what's wrong? Please? This is just a part of a program that invokes the reduction kernel given in the SDK… and at this point, "vector" is filled with "1" values across its whole size:

float *vector = NULL, *d_vector, *sum = NULL;
int size = 16384;
dim3 dimThreads( 16, 1, 1 );
dim3 dimBlocks( 1, 1, 1 );
float *d_output;

sum = new float(-1);

if ( dimBlocks.x == 1 )
{
    cout << " ok, bitch! 1" << endl;
    cudaMalloc((void**)&d_vector, size * sizeof(float));
    cudaMemcpy(d_vector, vector, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_output, sizeof(float) * dimBlocks.x);
    cudaMemset(d_output, 255, sizeof(float) * dimBlocks.x);

    reductionVector<<< dimBlocks, dimThreads >>>(d_vector, d_output, size);
    cout << "ITERATION DONE!" << endl;
    cudaMemcpy(sum, d_output, sizeof(float) * dimBlocks.x, cudaMemcpyDeviceToHost);
}

cout << "Result = " << *sum << endl;

But the result in "sum" is "nan". The question is: WHY? Of course, dimBlocks and dimThreads are configured like this just to enter this if statement.

Hello

There are many possible reasons for this. You did not say which of the kernels you are using. Note that some of them have special requirements, maybe on the block and grid size, but definitely on the shared memory. Apart from that, I'm pretty sure that cudaMemset with '255' will result in NaNs. Although this forum is about JCuda, maybe I could help you if you posted a compilable example project in a ZIP. But I cannot promise anything…
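
Here is a minimal host-only sketch of why the '255' memset produces NaNs: cudaMemset, like plain memset, sets every byte to the given value, so each 4-byte float ends up with the bit pattern 0xFFFFFFFF, which IEEE-754 interprets as a NaN. All-zero bytes, on the other hand, are exactly +0.0f.

#include <cstring>
#include <cmath>
#include <iostream>

int main()
{
    float f;
    std::memset(&f, 255, sizeof(float)); // same byte pattern that cudaMemset(..., 255, ...) writes
    std::cout << f << " (isnan: " << std::isnan(f) << ")" << std::endl; // prints nan (or -nan), isnan: 1

    std::memset(&f, 0, sizeof(float));   // all-zero bytes are +0.0f
    std::cout << f << std::endl;         // prints 0
    return 0;
}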

bye
Marco

OK, changing the cudaMemset to "cudaMemset(d_output, 0, sizeof(float) * dimBlocks.x)" allows the result to be 0.

The idea was to do the reduction with sequential accesses and more than one call to the kernel function. Gladly, I'll provide the whole program here. Please take a look:

http://ftp.ua.pt/incoming/cuda/

BTW, the output is shown as:

==========================================
=========           MENU         =========
==========================================

 1 - Reduce a random Vector
 0 - Exit
--->1
ok! 2
16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ITERATION DONE!
ok! 3
ok! 5
136
ITERATION DONE!
ok! 6
Result = 136

OK, those 16s and zeros are the per-block partial sums, and they should all be 256, since each block has 256 threads performing the reduction. Obviously, without this being right, the result will never be 20000 as it should.

Hello

I did not completely understand your code… so I reduced it (pun intended), with some glimpses at the NVIDIA sample, to make a simple test.

One of the most important points is (as I already mentioned above): you did not assign any shared memory to the kernel! It has to be given as the third argument inside the <<< brackets >>>:
theKernel<<<blocks, threads, sharedMemorySize>>>(…);

In general, I'll probably not have time to dig through this sort of code to find possible errors for you in the future. However, here's what I did (all put into a single file):

//  ======================  c++ include files ==========================
#include <iostream>
#include <stdlib.h>
#include <string.h>  // for memset()
#include <math.h>    // for ceil() in roundBlocks()

using namespace std;
//  ====================== cuda include files ==========================
#include <cuda_runtime.h>
//#include "async_kernel.cu"

/// Kernel function
extern "C" __global__ void reductionVector(float* d_iVector, float* d_oVector, int size)
{
	// each thread loads one element from global to shared mem
	unsigned int t_ID = threadIdx.x;
	unsigned int index = blockIdx.x*blockDim.x+threadIdx.x;
	extern __shared__ float partialSum[]; // float, not int, since the data is float; sized by the third <<< >>> launch argument

	partialSum[t_ID] = (index < size)? d_iVector[index] : 0;
	__syncthreads();


	// do reduction in shared mem
	for (unsigned int s=blockDim.x / 2; s>0; s>>=1) 
	{
		if (t_ID < s) {
		  partialSum[t_ID] += partialSum[t_ID + s];
		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (t_ID == 0) d_oVector[blockIdx.x] = partialSum[0];
}


/// ====================================================================

/// =======================    Functions    ============================
void 			generateValues(float *vector, int size);
int 			roundBlocks(int value1, int value2);
//extern"C" float launchReductionVector( dim3 dimBlocks , dim3 dimThreads, float* d_vector, int size);
extern "C" __global__  void reductionVector(float* d_iVector, float* d_oVector, int size);
int  			menu( void );
/// ====================================================================

int maxThreads = 256;  // number of threads per block
int maxBlocks = 64;

unsigned int nextPow2( unsigned int x ) {
    --x;
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}

void debugPrint(float *dVector, int size)
{
	float *hVector = new float[size];
	cudaMemcpy (hVector, dVector, size * sizeof(float), cudaMemcpyDeviceToHost);
	for (int i=0; i<size; i++)
	{
		cout << "Element " << i << ": " << hVector** << endl;
	}
	delete[] hVector;
}

int main ( void )
{
	float* vector = NULL, *d_vector, *sum = NULL, cpuSum = 0;
	int size = 20000;/// 16384;
	int threads = 256;
	int blocks = roundBlocks(size, threads);
	float* d_output;
	
	cudaMallocHost((void**)&vector, size * sizeof(float));
	memset (vector, 0, size * sizeof(float)); // host memory, so plain memset
					
	generateValues(vector, size);
					
	for (int i = 0; i < size; i++)
		cpuSum += vector[i];
					
	/*cout << "Generated values  =  " << endl;
	for (int j = 0; j < size; j++)
		cout << vector[j] << "   ";
	cout << "DONE" << endl;*/
					
	sum = new float(-1);
					
	cudaMalloc((void**)&d_vector, size * sizeof(float));
	cudaMemcpy (d_vector, vector, size * sizeof(float), cudaMemcpyHostToDevice);
	cudaMalloc((void**)&d_output, sizeof(float) * blocks);
	cudaMemset(d_output,0,sizeof(float) * blocks);
					
	int sharedMemorySize = threads * sizeof(float);

	cout << "For " << size << " elements using " << blocks << " blocks of " << threads << " threads, total: " << (blocks*threads) << " threads" << endl;

	reductionVector<<< blocks, threads, sharedMemorySize >>>(d_vector, d_output, size);

	// Keep reducing the per-block partial sums until a single value remains
	int s = blocks;
	while (s > 1)
	{
		threads = (s < maxThreads) ? nextPow2(s) : maxThreads;
		blocks = (s + threads - 1) / threads;

		cout << "For " << s << " elements using " << blocks << " blocks of " << threads << " threads, total: " << (blocks*threads) << " threads" << endl;
		//debugPrint(d_output, s);

		reductionVector<<<blocks, threads, sharedMemorySize>>>(d_output, d_output, s);
		s = (s + threads - 1) / threads;
	}
	cudaMemcpy (sum, d_output, sizeof(float) * blocks, cudaMemcpyDeviceToHost);

	cout << "GPU Result =   " << *sum << endl;
	cout << "CPU Result =   " << cpuSum << endl;
					
	cudaFreeHost(vector);
	cudaFree(d_vector);
	cudaFree(d_output);

	cout << "Done" << endl;
    return 0;
}


/// ====================================================================
/// ====================  Function Implementations  ====================
/// ====================================================================

int menu( void )
{
	int op = -1;
	
	while ((op < 0) || (op > 1) )
	{
		cout << "==========================================" << endl;
		cout << "=========           MENU         =========" << endl;
		cout << "==========================================" << endl << endl;
		cout << " 1 - Reduce a random Vector" << endl;
		cout << " 0 - Exit" << endl;
		cout << "--->";
		cin >> op;
	}
	return op;
}

void generateValues(float *vector, int size)
{
	for (int i = 0; i < size; i++)
		vector[i] = rand() % 10;
}

int	roundBlocks(int value1, int value2)
{	
	int ret = ceil( (float)value1 / (float)value2 );
	
	if ( ret <= 1 )
		return 1;
	
	return ret;
		
}

bye
Marco

Hi Marco. First of all, I must thank you for all your effort, really. Still, the problem is not solved… and what I intend to do is understand (with your help) the mistake… because the CPU and GPU sums MUST be equal with a function like

void generateValues(float *vector, int size)
{
	for (int i = 0; i < size; i++)
		vector[i] = 1; //rand() % 10;
}
Isn't that right? If you have a vector with 20000 elements, and all the elements are "1", the outcome should be:
For 20000 elements using 79 blocks of 256 threads, total: 20224 threads
For 79 elements using 1 blocks of 128 threads, total: 128 threads
GPU Result = 20000
CPU Result = 20000
Done

and not as it is right now:
GPU Result = 512
CPU Result = 20000
or with 90000:
For 90000 elements using 352 blocks of 256 threads, total: 90112 threads
For 352 elements using 2 blocks of 256 threads, total: 512 threads
For 2 elements using 1 blocks of 2 threads, total: 2 threads
GPU Result = -4.29487e+09
CPU Result = 90000
Done

As a matter of fact, there is still no combination that makes the output right.

Sorry for the double post… but I just realised a major issue! I pasted your main code on top of mine and got the above results, OK? Then… I saved your entire code and ran it. The output is as it should be!!! Which makes me think that one of the REAL issues was the kernel function being placed in another file… or placed in a way that it shouldn't be!!!

Is there a correct way to do it, or do I (unfortunately) have to keep kernel functions in the same file?

Hello

I've tested it here, and it worked. Did you notice that the kernel had also been modified? Maybe you are still using the old one…

bye
Marco

EDIT: OK, that was close.

I have put the kernel function into the same file. Admittedly, I did not have a look at your makefile, since I'm developing on Windows. It should be possible to compile the "combined" file (containing the kernel and the main) with NVCC at once, but you may also keep the files separated - this might even be better when you have multiple kernels or more complex programs.
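
For reference, a minimal sketch of both approaches (the file names here are just assumptions): compiling the combined file is a single call,

nvcc -o reduction reduction.cu

and if you keep the kernel in a separate .cu file, one simple option is to #include it from the main file (like the commented-out include in the code above), so that NVCC still sees the kernel and the launch in one translation unit.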

Sorry, can you explain to me the utility of
unsigned int nextPow2( unsigned int x )
{
    --x;
    // Bitwise OR assignment
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return ++x;
}

to reduce the number of threads with threads = (s < maxThreads) ? nextPow2(s) : maxThreads; ?

I just don't understand the way it works. I know that the |= operator is a bitwise OR… but why does the function do all those bitwise operations?! I don't know how it works… sorry.

Hello

This function is copied from the NVIDIA sample. As the name suggests, it returns the next highest power of 2 for a given value. You could also write it as something like

int nextPow2(int value)
{
    int result = 1;
    while (result < value) result = result * 2; 
    return result;
}

That this function works is… some sort of "magic". There are some functions which use this kind of "bit twiddling" to achieve specific results. In fact, the function is also described on the well-known "Bit Twiddling Hacks" website.
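
To see the effect, here is a worked trace (my own example values): the OR-shifts copy the highest set bit into every lower position, so all bits below it become 1, and the final increment carries over to the next power of two. The initial --x makes exact powers of two map to themselves. For x = 79, which matches the "For 79 elements using 1 blocks of 128 threads" line in the output above:

// nextPow2(79):
// --x;           x = 78  = 0b1001110
// x |= x >> 1;   x = 0b1001110 | 0b0100111 = 0b1101111  (111)
// x |= x >> 2;   x = 0b1101111 | 0b0011011 = 0b1111111  (127)
// x |= x >> 4;   x = 0b1111111 | 0b0000111 = 0b1111111  (127, no change)
// x |= x >> 8;   no change, all lower bits are already 1
// x |= x >> 16;  no change
// return ++x;    127 + 1 = 128

The power-of-two block size matters because the halving loop in the kernel (for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)) only covers every element when blockDim.x is a power of two.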

bye
Marco

Thanks. Thanks. Thanks. Thanks. Thanks. Thanks. Thanks. Thanks. Thanks. Thanks. Thanks.

One thing is funny, though: the little shared-memory part that you gladly mentioned is not presented as a MUST HAVE in any of the CUDA papers, videos, and books that I've read so far!!! And clearly it made the difference between working properly and not! This should be a major concern for developers! What if, in the

int sharedMemorySize = threads * sizeof(float);

part, there is not enough shared memory to allocate? Should we use fewer threads per block?

Well, some general information about shared memory can, of course, be found in the Programming Guide. There is always a tradeoff between the shared memory size and the number of threads that can be used. Shared memory is very limited and should be distributed wisely among the threads. I think it is literally an art to find the optimal way of exploiting the available shared memory (and registers, by the way). Unfortunately, I don't have real experience with this either - I still have a lot to learn…
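
To answer the concrete question, though: you can query the limit at runtime with cudaGetDeviceProperties and derive the block size from it. A minimal sketch (the one-float-per-thread assumption matches the reduction kernel above):

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0

    std::cout << "Shared memory per block: " << prop.sharedMemPerBlock << " bytes" << std::endl;

    // For a kernel that needs one float of shared memory per thread,
    // cap the block size so that the dynamic allocation always fits.
    int maxThreadsBySharedMem = (int)(prop.sharedMemPerBlock / sizeof(float));
    int threads = 256;
    if (threads > maxThreadsBySharedMem)
        threads = maxThreadsBySharedMem;

    std::cout << "Using " << threads << " threads per block ("
              << threads * sizeof(float) << " bytes of shared memory)" << std::endl;
    return 0;
}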