SDK reduce + jCuda

Hi,

I'm using jCuda to solve sparse systems of linear equations. I have already implemented the conjugate gradient method and the bi-conjugate gradient method. I use JCublas and JCusparse for all required BLAS operations.
My findings so far are:

  1. Depending on the size of the equation system, the speedup is somewhere between 2x and 5x.

  2. I have done some profiling of my code and found that most of the CUDA runtime is spent in the dot function.

That brings me to the idea of using the reduce example that can be found in the CUDA SDK.
To be more specific: my idea is to replace the dot function with two separate kernels.
The first one will do a simple element-wise vector multiplication and the second one will be the reduce kernel.
This should be much faster.

That brings me to the reason of this thread.
How can I define the amount of shared memory per block using jCuda?

Thanks for your help in advance.

Greetings Felix

Hello

Do you mean the cublasSdot function? It is not clear how this function is implemented. We can assume that the NVIDIA engineers are aware of the fact that it is used in many algorithms, and tried their best to provide a highly optimized implementation, or possibly even different implementations for the different compute capabilities and GPU versions. However, since the signature of this function is dictated by the Fortran BLAS conventions (where the increments incx and incy can be specified), and assuming that these will always be 1 in your case, it might at least be worth a try to implement a specialized 'dot' function.

Shared memory can be used the same way in JCuda as in CUDA: Inside of a kernel, you may declare a shared memory array like


__global__ void someKernel(...)
{
    __shared__ float someMemory[BLOCKSIZE];
...

The 'BLOCKSIZE' can either be defined with a
#define BLOCKSIZE 64
at the beginning of the file, or passed as a parameter
-D BLOCKSIZE=64
to the NVCC compiler when creating the PTX file.
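For the dynamically sized 'extern __shared__' arrays used by the SDK reduce kernels, the size is instead passed per launch as the shared-memory argument of cuLaunchKernel. A minimal sketch of such a launch (the variable names 'function', 'gridSizeX' and 'kernelParameters' are only placeholders for values from your own setup):

// Sketch, not from the original post: launching a kernel that declares
// "extern __shared__ float sdata[]" and passing the per-block shared
// memory size as the "sharedMemBytes" argument of cuLaunchKernel.
int threads = 256;
int sharedMemBytes = threads * Sizeof.FLOAT;  // one float per thread
cuLaunchKernel(function,
    gridSizeX, 1, 1,          // grid dimension
    threads, 1, 1,            // block dimension
    sharedMemBytes, null,     // dynamic shared memory size and stream
    kernelParameters, null    // kernel parameters and extra parameters
);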

bye
Marco

Thanks for the fast answer.

I had a closer look at the desired kernel.
It contains some templates.
So the new question is: is it possible to access template kernels via jCuda?

Here is the Kernel code:

#ifndef _REDUCE_KERNEL_H_
#define _REDUCE_KERNEL_H_

#include <stdio.h>

// Utility class used to avoid linker errors with extern
// unsized shared memory arrays with templated type
template<class T>
struct SharedMemory
{
    __device__ inline operator       T*()
    {
        extern __shared__ int __smem[];
        return (T*)__smem;
    }

    __device__ inline operator const T*() const
    {
        extern __shared__ int __smem[];
        return (T*)__smem;
    }
};
template<>
struct SharedMemory<double>
{
    __device__ inline operator       double*()
    {
        extern __shared__ double __smem_d[];
        return (double*)__smem_d;
    }

    __device__ inline operator const double*() const
    {
        extern __shared__ double __smem_d[];
        return (double*)__smem_d;
    }
};
template <class T>
__global__ void
reduce0(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) {
        // modulo arithmetic is slow!
        if ((tid % (2*s)) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <class T>
__global__ void
reduce1(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * tid;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <class T>
__global__ void
reduce2(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    
    sdata[tid] = (i < n) ? g_idata[i] : 0;
    
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <class T>
__global__ void
reduce3(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;

    T mySum = (i < n) ? g_idata[i] : 0;
    if (i + blockDim.x < n)
        mySum += g_idata[i+blockDim.x];  

    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s)
        {
            sdata[tid] = mySum = mySum + sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <class T, unsigned int blockSize>
__global__ void
reduce4(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;

    T mySum = (i < n) ? g_idata[i] : 0;
    if (i + blockSize < n)
        mySum += g_idata[i+blockSize];  

    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid < s)
        {
            sdata[tid] = mySum = mySum + sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid < 32)
    {
        volatile T *smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; }
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <class T, unsigned int blockSize>
__global__ void
reduce5(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;

    T mySum = (i < n) ? g_idata[i] : 0;
    if (i + blockSize < n)
        mySum += g_idata[i+blockSize];  

    sdata[tid] = mySum;
    __syncthreads();

    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }
    
    if (tid < 32)
    {
        volatile T* smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; }
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void
reduce6(T *g_idata, T *g_odata, unsigned int n)
{
    T *sdata = SharedMemory<T>();

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    
    T mySum = 0;
    while (i < n)
    {        
        mySum += g_idata[i];
        // ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
        if (nIsPow2 || i + blockSize < n)
            mySum += g_idata[i+blockSize];  
        i += gridSize;
    }
    sdata[tid] = mySum;
    __syncthreads();


    // do reduction in shared mem
    if (blockSize >= 512) { if (tid < 256) { sdata[tid] = mySum = mySum + sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] = mySum = mySum + sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] = mySum = mySum + sdata[tid +  64]; } __syncthreads(); }
    
    if (tid < 32)
    {
        volatile T* smem = sdata;
        if (blockSize >=  64) { smem[tid] = mySum = mySum + smem[tid + 32]; }
        if (blockSize >=  32) { smem[tid] = mySum = mySum + smem[tid + 16]; }
        if (blockSize >=  16) { smem[tid] = mySum = mySum + smem[tid +  8]; }
        if (blockSize >=   8) { smem[tid] = mySum = mySum + smem[tid +  4]; }
        if (blockSize >=   4) { smem[tid] = mySum = mySum + smem[tid +  2]; }
        if (blockSize >=   2) { smem[tid] = mySum = mySum + smem[tid +  1]; }
    }
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}


extern "C"
bool isPow2(unsigned int x);


////////////////////////////////////////////////////////////////////////////////
// Wrapper function for kernel launch
////////////////////////////////////////////////////////////////////////////////
template <class T>
void
reduce(int size, int threads, int blocks,
       int whichKernel, T *d_idata, T *d_odata)
{
    dim3 dimBlock(threads, 1, 1);
    dim3 dimGrid(blocks, 1, 1);

    // when there is only one warp per block, we need to allocate two warps
    // worth of shared memory so that we don't index shared memory out of bounds
    int smemSize = (threads <= 32) ? 2 * threads * sizeof(T) : threads * sizeof(T);

    // choose which of the optimized versions of reduction to launch
    switch (whichKernel)
    {
    case 0:
        reduce0<T><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
        break;
    case 1:
        reduce1<T><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
        break;
    case 2:
        reduce2<T><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
        break;
    case 3:
        reduce3<T><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size);
        break;
    case 4:
        switch (threads)
        {
        case 512:
            reduce4<T, 512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 256:
            reduce4<T, 256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 128:
            reduce4<T, 128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 64:
            reduce4<T,  64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 32:
            reduce4<T,  32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 16:
            reduce4<T,  16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  8:
            reduce4<T,   8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  4:
            reduce4<T,   4><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  2:
            reduce4<T,   2><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  1:
            reduce4<T,   1><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        }
        break;
    case 5:
        switch (threads)
        {
        case 512:
            reduce5<T, 512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 256:
            reduce5<T, 256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 128:
            reduce5<T, 128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 64:
            reduce5<T,  64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 32:
            reduce5<T,  32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case 16:
            reduce5<T,  16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  8:
            reduce5<T,   8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  4:
            reduce5<T,   4><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  2:
            reduce5<T,   2><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        case  1:
            reduce5<T,   1><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
        }
        break;      
    case 6:
    default:
        if (isPow2(size))
        {
            switch (threads)
            {
            case 512:
                reduce6<T, 512, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 256:
                reduce6<T, 256, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 128:
                reduce6<T, 128, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 64:
                reduce6<T,  64, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 32:
                reduce6<T,  32, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 16:
                reduce6<T,  16, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  8:
                reduce6<T,   8, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  4:
                reduce6<T,   4, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  2:
                reduce6<T,   2, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  1:
                reduce6<T,   1, true><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        }
        else
        {
            switch (threads)
            {
            case 512:
                reduce6<T, 512, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 256:
                reduce6<T, 256, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 128:
                reduce6<T, 128, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 64:
                reduce6<T,  64, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 32:
                reduce6<T,  32, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case 16:
                reduce6<T,  16, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  8:
                reduce6<T,   8, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  4:
                reduce6<T,   4, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  2:
                reduce6<T,   2, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            case  1:
                reduce6<T,   1, false><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata, size); break;
            }
        }
        break;      
    }
}

// Instantiate the reduction function for 3 types
template void
reduce<int>(int size, int threads, int blocks,
            int whichKernel, int *d_idata, int *d_odata);

template void
reduce<float>(int size, int threads, int blocks,
              int whichKernel, float *d_idata, float *d_odata);
              
template void
reduce<double>(int size, int threads, int blocks,
               int whichKernel, double *d_idata, double *d_odata);


#endif // #ifndef _REDUCE_KERNEL_H_

And here the Java Code:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package jcudavecreduce;

import jcuda.jcublas.JCublas;
import static jcuda.driver.JCudaDriver.*;

import java.io.*;

import jcuda.*;
import jcuda.driver.*;
import static jcuda.runtime.cudaMemcpyKind.*;
import static jcuda.runtime.JCuda.*;

public class JCudaVecReduce {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException{
        JCublas.initialize();
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);

        // Create the PTX file by calling the NVCC
        String ptxFileName = preparePtxFile("reduction_kernel.cu");

        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the ptx file.
        CUmodule module = new CUmodule();
        cuModuleLoad(module, ptxFileName);

        // Obtain a function pointer to the "reduce" function.
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "reduce");
        
        

        int numElements = 10000000;
        
        int[] input = new int[numElements];
        
        for(int i=0;i<numElements;i++){
            input[i]=1;
        }
        
        float[] output = new float[numElements];
        
        CUdeviceptr dInput = new CUdeviceptr();
        cuMemAlloc(dInput, numElements * Sizeof.INT);
        cudaMemcpy(dInput, Pointer.to(input), numElements * Sizeof.INT, cudaMemcpyHostToDevice);
        
        CUdeviceptr dOutput = new CUdeviceptr();
        cuMemAlloc(dOutput, numElements * Sizeof.FLOAT);
        cudaMemcpy(dOutput, Pointer.to(output), numElements * Sizeof.FLOAT, cudaMemcpyHostToDevice);
        
        int threadPerBlock = 512;
        int GridDimX = (numElements+threadPerBlock-1)/threadPerBlock;
        
            Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{numElements}),
            Pointer.to(new int[] {threadPerBlock}),
            Pointer.to(new int[] {GridDimX}),
            Pointer.to(new int[] {0}),
            Pointer.to(dInput),
            Pointer.to(dOutput)
        );
        
        //CUstream stream = new CUstream();
        
        // Call the kernel function.
        //int blockSizeX = 256;
        //int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
        long ownStart=System.currentTimeMillis();
        cuLaunchKernel(function,
            GridDimX,  1, 1,      // Grid dimension
            threadPerBlock, 1, 1,      // Block dimension
            0, null,               // Shared memory size and stream
            kernelParameters, null // Kernel- and extra parameters
        );
        //cuCtxSynchronize();
        long ownEnd=System.currentTimeMillis();
        
        
        cudaMemcpy(Pointer.to(output), dOutput, numElements * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
        
        System.out.println(output[0]);
    }
}

Greetings Felix

Hello

It should be possible, yes, but I’ll have to take a closer look at this. In any case, the kernel names will be “mangled”, which means that it will not be possible to obtain the function via
cuModuleGetFunction(function, module, "reduce");
(BTW: ‘reduce’ is not a kernel function, anyhow)
but instead will have to be something like
cuModuleGetFunction(function, module, "Z3reduce4fPfS_S");
or so (this name will include the information about the exact signature of the kernel, namely whether it is the ‘float’ or the ‘int’ version).

It might, in fact, be far easier to simply replace the 'T' parameter manually with 'float', put the resulting kernels into a separate file, and declare the kernel functions as 'extern "C"'. But at the moment, I don't know how the 'reduce' function is used (and how the 'whichKernel' parameter is determined). I'll have to look this up in the main function of the SDK example.
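A minimal sketch of that second approach, assuming the float-specialized kernel has been declared as, say, extern "C" __global__ void reduce_float(...) (the kernel and file names here are assumptions, not taken from the SDK):

// Sketch with assumed names: a kernel declared as extern "C" keeps its
// plain name in the PTX file, so no name mangling is involved.
CUmodule module = new CUmodule();
cuModuleLoad(module, "reduction_float.ptx");
CUfunction reduceFloat = new CUfunction();
cuModuleGetFunction(reduceFloat, module, "reduce_float");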

I'll try to allocate some time for this during the weekend, and see how to cope with the templates, or alternatively try the second approach. Eventually this might end up as a 'reduction' sample on the website, but I cannot give an exact timeline for that.

BTW: Did you manage to compile the ‘.cu’ input file into a PTX? I think it might be necessary to add an include path where it can find the ‘stdio.h’.

bye
Marco

Hi again,

I have written my own kernel. It is slower than the previous example, but the kernel itself works and seems to be faster than cublasSdot.

Here the code:

__global__ void ownSdot(int n, float *a, float *b, float *res)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float blockSum[1];
    if(threadIdx.x==0){
        blockSum[0]=0;
    }
    __syncthreads();

    float local=0;
    if(i<n){
        local=a[i]*b[i];
    }

    atomicAdd(&blockSum[0], local);
    __syncthreads();

    if(threadIdx.x==0){
        atomicAdd(&res[0], blockSum[0]);
    }
}

But copying the result back to the host takes a lot of time, which makes it even slower than cublasSdot.
Is it possible to write the result directly to host memory?
Do you have any other ideas?

Hello

Did you also try using one of the kernels from the ‘reduce’ example (by replacing ‘T’ with ‘float’) ? I think that the documentation of this example contains a detailed description of the optimizations done in each version. I’m just wondering, because the ‘atomicAdd’ will most likely be expensive, since atomic operations usually destroy a lot of parallelism.

Apart from that, I’m wondering how you measured the time. Making accurate benchmarks is difficult. You mentioned that your kernel was faster than cublasSdot, but when copying the memory back, it was slower. You should either include the time for copying in both versions, or omit it in both versions, otherwise the benchmark will not tell you anything.
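A small sketch of how such a timing can be made symmetric (the kernel or CUBLAS call is only indicated by a comment; cuCtxSynchronize is needed because kernel launches return asynchronously):

// Sketch: time both variants the same way, and synchronize before reading
// the clock, because a launch returns before the kernel has finished.
cuCtxSynchronize();
long before = System.currentTimeMillis();
// ... launch ownSdot / call cublasSdot here, and either include the copy
// of the single result value in BOTH variants or in NEITHER of them ...
cuCtxSynchronize();
long after = System.currentTimeMillis();
System.out.println("Duration: " + (after - before) + " ms");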

However, I assume that the ‘dot’ computation is only one step in a longer sequence of operations that are all performed on the device. So it should not be necessary to copy the data between the host and device, except for the initial input data and the final result, of course. (It is possible to write data to host memory that was allocated with cudaMallocHost and mapped to the device with cudaHostGetDevicePointer, but this is most likely not what you meant…)

Hopefully I'll find the time to do some more tests of my own in the next few days.

bye
Marco

As Marco13 said, your kernel is of course faster than cublasSdot, because that function also includes the time of the memory copy from device to host.

Your kernel is a very slow kernel because of all those atomic operations.

There is no need for a template in this case; you can adapt the kernel to your desired input type.

Look at this slide deck and the NVIDIA slides referenced in it:

Cuda-Dot-Product-Parallel-Reduction

Using the NVIDIA slides, I wrote my own max and normalize kernels (the latter needs a dot product) for arbitrary input sizes and float input data, which was a good learning experience.

Just follow the NVIDIA slides step by step to create your kernel.

If you need help just ask.

Hello

I had examined the 'reduce' example quite a while ago, and read the documentation, which is an excellent overview of different general optimization strategies, but I did not remember the exact structure of the source code. Now I read it again, and… it's fairly complex. It allows selecting one of 7 (or actually, one of 44 (!)) kernels, and includes sophisticated benchmarking and more. While it would be interesting to have it ported to JCuda with as few changes as possible, it would not be very convenient to read and to understand, because so many things are mixed together in this example, and specifically, it would not help you at all with your task of writing an optimized reduction.

So there are several options: I could try to do a 1:1 port of the SDK example, but will probably not do this, since some parts are not applicable in Java, and I don't like its structure anyhow. Alternatively, I could try an "improved" port, i.e. try to simplify it and make it easier to cope with, while maintaining the benchmarking possibilities. But again, this would not help you much. The last option currently seems the most suitable: I could extract 'kernel6', which is the most optimized one (and, by the way, the only one that allows input sizes that are not a power of 2), and try to create a small, general reduction example from that. I think I can allocate some time for this and upload it this week.

bye

hi,

thanks for your help.

I should be a bit more specific about my problem.
The algorithm I'm using has some parts where only the dot product of a
vector is needed to proceed.
Furthermore, one of the dot products is the stopping criterion and
needs to be transferred to the host.
At the moment I'm simply checking the overall speed via profiling
in NetBeans.
So far the CUDA code is 1.7 to 3.6 times faster with a huge matrix, n=1e7!
I will post the code of my algorithm tomorrow.

bye

Hello

I have uploaded an example showing a reduction on http://jcuda.org/samples/samples.html (JCudaReduce.java + reduction.cu).

The kernel is basically the “reduce6” kernel from the NVIDIA example, simplified to use only ‘float’ values and no templates. The sample itself contains some methods that are also adapted from the example, namely some computations of block and grid sizes.

The ‘reduce’ method is offered in different versions, specifically one version accepting a float[] array, and one version directly taking the device pointer as the input. The latter might be what you need when you want to call this as one step in a larger computation, namely after having filled the device memory with the ‘vector multiplication’ result.
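A rough usage sketch (the exact method signatures in the sample may differ slightly; the names here only illustrate the intended call pattern):

// Sketch with assumed names: 'dProducts' is a device pointer that already
// holds the element-wise products from the vector multiplication kernel.
float dot = reduce(dProducts, numElements);   // version taking a device pointer
// float sum = reduce(hostArray);             // version taking a float[] array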

The sample also performs a small “benchmark”, but only using a rough timing and a predefined setup. More precise timings, for comparing the duration of vectorMul+reduction to that of cublasSdot might be interesting as well.

bye
Marco

As mentioned before, here is my code for the conjugate gradient method in single precision.
I also have the same for double precision.

        
        boolean useSparse=true;
        boolean justAlloc=true;
        
        long start = System.currentTimeMillis();
        
        int incx=1;
        int incy=1;
        int n = rowPtr.length-1;
        float[] r = new float[n];
        
        Pointer dX = cuHandler.copyVec(x); //0
        Pointer dRowPtr = cuHandler.copyVec(rowPtr); //1
        Pointer dColInd = cuHandler.copyVec(colInd); //2
        Pointer dValues = cuHandler.copyVec(values); //3
        Pointer dB = cuHandler.copyVec(b); //4
        
        Pointer dR=null;
        Pointer dP=null;
        Pointer dX2=null;
        Pointer dR2=null;
        Pointer dV=null;
        
        if(justAlloc){
            dR = cuHandler.allocFloatVec(n);//5
            dP = cuHandler.allocFloatVec(n);//6
            dX2 = cuHandler.allocFloatVec(n);//7
            dR2 = cuHandler.allocFloatVec(n);//8
            dV = cuHandler.allocFloatVec(n);//9
        }
        else{
            dR = cuHandler.copyVec(r);//5
            dP = cuHandler.copyVec(r);//6
            dX2 = cuHandler.copyVec(r);//7
            dR2 = cuHandler.copyVec(r);//8
            dV = cuHandler.copyVec(r);//9
        }
        
        long alloc = System.currentTimeMillis();
        
        //Pointer dZeros = cuHandler.copyVec(r);
        //float[] x2=null;
        //float[] xH=null;
        //float[] rH=null;
        //float[] vH = null;
        //int status=0;
        
        
        cusparseScsrmv(cuHandler.getCuSparseHandle(),1,n,n,1,cuHandler.getCuSparseDescra(),dValues,dRowPtr,dColInd,dX,0,dR);
        //rH=cuHandler.getFloatVector(5, n);
        JCublas.cublasSaxpy(n, -1, dB, incx, dR, incy);
        //status=JCublas.cublasGetError();
        //rH=cuHandler.getFloatVector(5, n);
        JCublas.cublasScopy(n, dR, incx, dP, incy);
        float normr=mC.sqrt(JCublas.cublasSdot(n, dR, incx, dR, incy));
        //float normr=mC.sqrt(cuHandler.ownSdot(n, dR, dR));
        float alpha=normr*normr;
        int iter=0;
        float lambda=0;
        float alpha2=0;
        float vr=0;
        boolean conv=false;
        
        for(int i=0;i<maxIter;i++){
            iter=i;
            if(normr<=tol){
                System.out.println("CudaCG Iteration: "+i+" Residual: "+normr+" Duration: "+(System.currentTimeMillis()-start)+" CypTime: "+(alloc-start));
                conv=true;
                break;
            }
            cusparseScsrmv(cuHandler.getCuSparseHandle(),1,n,n,1,cuHandler.getCuSparseDescra(),dValues,dRowPtr,dColInd,dP,0,dV);
            //vH=cuHandler.getFloatVector(9, n);
            /*if(JCublas.cublasSdot(n, dV, incx, dV, incy)==0){
                System.out.println("Matrix is singular");
                break;
            }*/
            vr=JCublas.cublasSdot(n, dV, incx, dR, incy);
            //vr=cuHandler.ownSdot(n, dV, dR);
            if(vr<=0){
                System.out.println("Matrix is not positive definite");
                break;
            }
            lambda=alpha/vr;
            //xH = cuHandler.getFloatVector(0, n);
            //x2 = cuHandler.getFloatVector(7, n);
            //cudaMemcpy(dX2,dX,n*Sizeof.FLOAT,cudaMemcpyDeviceToDevice);
            JCublas.cublasScopy(n, dX, incx, dX2, incy);
            //xH = cuHandler.getFloatVector(0, n);
            //x2 = cuHandler.getFloatVector(7, n);
            JCublas.cublasSscal(n, lambda, dX2, incx);
            //x2 = cuHandler.getFloatVector(7, n);
            JCublas.cublasSaxpy(n, 1, dX2, incx, dX, incy);
            //xH = cuHandler.getFloatVector(0, n);
            
            JCublas.cublasScopy(n, dV, incx, dR2, incy);
            JCublas.cublasSscal(n, lambda, dR2, incx);
            JCublas.cublasSaxpy(n, -1, dR2, incx, dR, incy);
            
            alpha2=alpha;
            normr=mC.sqrt(JCublas.cublasSdot(n, dR, incx, dR, incy));
            //normr=mC.sqrt(cuHandler.ownSdot(n, dR, dR));
            alpha=normr*normr;
            JCublas.cublasSscal(n, alpha/alpha2, dP, incx);
            JCublas.cublasSaxpy(n, 1, dR, incx, dP, incy);
        }
        
        if(!conv){
            System.out.println("CudaCG Iteration: "+iter+" Residual: "+normr+" Duration: "+(System.currentTimeMillis()-start)+" CypTime: "+(alloc-start));
        }
        x=cuHandler.getFloatVector(0, n);
        cuHandler.freeMemory();
        return x;
    }

A while ago, someone sent me a small example of a similar algorithm, including preconditioning. He asked me not to publish it in its current form, so it has been lying on my desktop for quite a while now. I wanted to clean it up and, if he agreed, upload it as a sample with credits to him, but did not yet find the time. I also did not yet find the time for the OO wrapper for JCuda that I once intended to write: I see that you added a "Handle" class, which obviously covers some of the most tedious tasks like memory allocation and copies… this could all be much simpler with a nicer API…

However, since the 'incx' and 'incy' values are constant (1) in this case, it should be possible to replace the
float d = JCublas.cublasSdot(n, dR, incx, dR, incy)
calls with something like
myKernels.vecMul(dR, dR, dTmp, n);
float d = myKernels.reduce(dTmp, n);
(writing the products into a temporary buffer dTmp, so that dR itself is not overwritten). You have to make sure that the actual kernel can be called using the Driver API. I think the easiest way would be to initialize the context using the Driver API, because IIRC the runtime API (i.e. CUBLAS) will automatically grab any active and current context, if available. Maybe I can also try some benchmarks comparing cublasSdot and the 2-kernels approach, to see whether it's worthwhile to modify your program.
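A rough sketch of how this could be wired up, assuming a hypothetical 'vecMul' kernel compiled into the same PTX module as the reduction (the names vecMulFunction, dTmp and reduce are placeholders, not from the sample):

// Sketch with assumed names: create the context via the Driver API first,
// so that the subsequent CUBLAS/CUSPARSE calls attach to the same context.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
JCublas.cublasInit();

// Element-wise multiplication of dR with itself into a temporary buffer,
// followed by a reduction of that buffer (as in the JCudaReduce sample):
Pointer kernelParams = Pointer.to(
    Pointer.to(new int[]{n}),
    Pointer.to(dR),
    Pointer.to(dR),
    Pointer.to(dTmp));
cuLaunchKernel(vecMulFunction,
    (n + 255) / 256, 1, 1,   // grid dimension
    256, 1, 1,               // block dimension
    0, null,                 // shared memory size and stream
    kernelParams, null);     // kernel parameters and extra parameters
float d = reduce(dTmp, n);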

bye
Marco

Hi again,

I already started to work on a solution with self-written kernels. My idea is to copy the result of the dot product only once.
To do that, I'm comparing the dot product with the float tol on the GPU.
My idea is that when the dot product is smaller than the tol value, the API will produce an error status that can simply be caught on the host side.
I hope that this works as I plan it.
Another idea was to use printf to write the value of the dot product to the standard output and read it back from there.
So far that has not been a good solution, because the value was only printed after the next invocation of Memcpy.

I hope that my ideas make some sense.

bye

Umm… using errors/exceptions for control flow or using the standard output as a "messaging channel" not only sounds like a weird hack, it is almost certainly far less efficient than copying a single float value back from the device…
Or am I wrong? I assumed that you ONLY need the final result of the reduction, which is only a single float value - and the input data that is about to be reduced is already located on the device anyhow…?!

Hi,

That was also my first impression.
So I tried to get the single float value from the device after I did the reduction.
That approach was even slower than using Sdot.
I checked again with profiling which method call took the most time, and again it was cudaMemcpy.
That leads to the idea of doing everything on the GPU, with only one copy HostToDevice at the beginning and one DeviceToHost at the end.
The only issue is: how do I check whether normr is smaller or bigger than tol on the GPU, and report a true or false to the CPU?

The clean way would be to write a complete kernel (or a bunch of kernels) that can run the whole algorithm on the device, so that Java only starts this kernel.
In my case that would mean reimplementing things like sparse matrix times dense vector and so on.

I hope I'm not totally misled.

No, of course it’s not practically possible to rewrite the whole algorithm in CUDA.

Your original intention should be feasible: letting the CPU call the kernels/CUBLAS functions until the termination criterion is met. I wonder why the memory copy should create such a large overhead. Of course, there may be an overhead involved just for calling the memcopy function and all that, but since it is only a single float value, I thought it should be negligible.

Are you sure that you only copied a single value, as in the sample

float result[] = {0.0f};
cuMemcpyDtoH(Pointer.to(result), deviceBuffer, Sizeof.FLOAT);     
return result[0];

(it’s only one Sizeof.FLOAT, and not the whole array)

In general, copying between device- and host memory is the only way of communication between the device and the host, so there’s hardly a way to avoid that (except for pinned and mapped host memory, which might be worth a try, but I would not bet that it will be faster…)
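For reference, a sketch of the mapped-memory variant using the runtime API (cudaHostAlloc with the 'mapped' flag instead of plain cudaMallocHost; this is only an illustration of the idea, not something I have benchmarked):

// Sketch: page-locked host memory that is mapped into the device address
// space, so that a kernel can write its result directly to host memory.
// (May additionally require cudaSetDeviceFlags(cudaDeviceMapHost) before
// the first CUDA call, depending on the device.)
Pointer hostPtr = new Pointer();
cudaHostAlloc(hostPtr, Sizeof.FLOAT, cudaHostAllocMapped);
Pointer devicePtr = new Pointer();
cudaHostGetDevicePointer(devicePtr, hostPtr, 0);
// ... pass devicePtr to the kernel as the output location ...
cudaDeviceSynchronize(); // make sure the kernel has finished writing
// Reading the value back into a Java float[] still needs a (host-to-host)
// copy or a buffer view of the pinned memory, so it is not free either.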

Maybe I can also try some benchmarks later this week.

bye

Hi,

thank you very much for your help.

I have finished some testing with the reduction example.
My finding is that Sdot was still a little faster than any other approach.
At the moment, the already posted code seems to be a reasonable approach.

The only small improvement was to use Snrm2 instead of Sdot to calculate normr.
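Concretely, that change is just the following (with incx = 1, as in the code above):

// normr via the 2-norm, instead of the square root of the dot product:
float normr = JCublas.cublasSnrm2(n, dR, 1);
// previously: float normr = mC.sqrt(JCublas.cublasSdot(n, dR, 1, dR, 1));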

Let me know if you are interested in my final report on that topic.

Greetings Felix

Hi

Which kernel did you finally use? Your own, or the one taken from the NVIDIA example? And did you make a dedicated comparison of only the two approaches for the dot product, or did you measure the overall time for the algorithm?

bye
Marco

Hi,

Here are some results for a sparse matrix with 10 million rows,
created as in the CUDA SDK example (conjugate gradient).

Single precision, standard CUDA functions:

Testing for N=10000000…
Solving with CudaCG
CudaCG Iteration: 18 Residual: 4.6281044E-16 Duration: 2157 CypTime: 100

Solving with jCG
jCG Iteration: 18 Residual: 6.0043487E-16 Duration: 4133

Single precision, own kernels (incl. reduce) instead of Sdot or Snrm2, with cuCtxSynchronize():

Testing for N=10000000…
Solving with CudaCG
CudaCG Iteration: 18 Residual: 4.9802456E-16 Duration: 2362 CypTime: 101

Solving with jCG
jCG Iteration: 18 Residual: 6.1587856E-16 Duration: 4128

Apart from measuring the time via System.currentTimeMillis(), I also used the profiling function of NetBeans (as mentioned before).

Here are two screenshots of the results:

The time difference does not seem to be sooo large. One could also consider writing a dedicated 'dot' kernel (not split into multiplication and reduction), but I assume that there is not much potential left for improvements. It's hard to compete with CUBLAS… :wink: