cuDNN version 5

I know JCudnn is currently a port of cuDNN version 4. I am wondering whether there is a plan to upgrade to version 5 anytime soon.

The reason I ask is that I tested JCudnn on the MNIST data set, and its performance is worse than Caffe’s; I don’t know why.

Running on a K40c GPU, I get 2.6 seconds per 100 iterations, while Caffe needs only 1.6 seconds. I don’t know whether there is a potential bottleneck that I overlooked.

I measured the detailed timing of the calls to the various forward/backward methods, and they account for the majority of the 2.6 seconds. Now I have run out of ideas. Any suggestions?
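
For reference, I took the per-call timings roughly along these lines - a simplified sketch, not my actual code; the explicit synchronization matters, because the cuDNN calls return before the GPU work has finished:

import static jcuda.runtime.JCuda.cudaDeviceSynchronize;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical timing helper: attributes wall-clock time to individual
// GPU calls by synchronizing around each one. The added synchronization
// perturbs the pipeline, so the numbers are only a rough breakdown.
class OpTimer
{
    private final Map<String, Long> totalNanos =
        new LinkedHashMap<String, Long>();

    void time(String name, Runnable op)
    {
        cudaDeviceSynchronize(); // drain previously queued GPU work
        long t0 = System.nanoTime();
        op.run();                // enqueue the call under test
        cudaDeviceSynchronize(); // wait until it has actually finished
        Long old = totalNanos.get(name);
        totalNanos.put(name, (old == null ? 0 : old) + System.nanoTime() - t0);
    }

    void print()
    {
        for (Map.Entry<String, Long> e : totalNanos.entrySet())
        {
            System.out.printf("%-24s %10.3f ms%n",
                e.getKey(), e.getValue() / 1e6);
        }
    }
}

Each call is then wrapped like timer.time("convForward", () -> cudnnConvolutionForward(...)); with the actual arguments in place.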

Hello,

Of course, I’ll try to update JCudnn as well, as soon as possible. The mail from NVIDIA about the update to 5.1 and DIGITS 4 is still in my inbox, and it mentions a “2.7x faster training”…

Right now, I’m mainly working on one of my other projects. Regarding JCuda, I wanted to do an update for easier deployment, with the goal of bringing JCuda into Maven Central. This will require a review of the native library loading and some refactoring of the POMs. The CUDA 8 RC has been out for a while, and JCuda should be updated for this as well (because most likely, they’ll bring out CUDA 8 ‘final’ soon). So it’s hard to give a definite date for when JCudnn will support cuDNN 5.1. But I can try to interweave this with the other tasks, because it seems like they did not change much in the API, so it might mainly be a matter of re-compiling it against the latest cuDNN version.

Sorry for the delays,
Marco

Thanks Marco for the response.

I really appreciate that the JCudnn update is on your agenda. I think JCudnn is quite valuable, since installing Caffe and Theano on Windows is a pain, while I can run JCudnn on Windows with dependencies on nothing but jcuda and jcudnn.

Right now, my pressing issue is to find out why there is such a performance discrepancy between JCudnn and Caffe. I believe the Caffe build I made was against cuDNN version 4, so it is an apples-to-apples comparison.
There shouldn’t be any performance difference: over 75% of my total time is spent in calls to cuDNN’s forward/backward methods, and that portion alone already exceeds Caffe’s total time.

Since I have no insight into how the port is done, I am curious whether you are aware of any potential issue that might cause performance problems. Thanks.

Having a look at (some of) the libraries from https://developer.nvidia.com/deep-learning-frameworks has been on my TODO list for quite a while now. I have to admit that I did not yet go into the details of cuDNN. I built the bindings mainly due to a request here in the forum, but initially expressed my doubts that people would really want to use and program against such a spookily complex API - and this complexity was also the reason why I only ported the MNIST sample, without really trying to understand what it is doing internally :o

Like the other JCu* libraries, the JCudnn library is a very thin wrapper around cuDNN. Of course, there is always an overhead when going the Java->JNI->C path, and for certain application patterns, this overhead could even be prohibitively large. But this should not be the case for cuDNN: a considerable amount of time should be spent inside the cuDNN functions, and the launch overhead should then become negligible - although I can’t be absolutely sure here, due to my lack of knowledge of the inner workings of cuDNN.

So regarding a performance comparison between Caffe and “plain cuDNN/JCudnn”, I might need some pointers: How did you actually compare the performance? Please correct me if I’m wrong, but I think Caffe uses only a high-level description of the training parameters, and it’s nearly impossible to derive the actual implementation from it. So for example, when configuring Caffe to do this MNIST classification, this does not necessarily have anything in common with the MNIST sample, or does it?

Thanks for the reply.

I also agree that the launch overhead is very small compared with the actual runtime of the cuDNN calls.

You are right that the cuDNN interface is very complex, and I spent a considerable amount of time deciphering it. I think I have a good understanding of how it works, and I even wrote some wrapper classes to make it more palatable.

As to the performance comparison, I implemented LeNet to run on the MNIST data set. The LeNet I implemented is actually Java code compiled from a little DSL I wrote in Scala. I am sure it has the same configuration as the LeNet implementation in Theano, and it differs only slightly from the LeNet version in Caffe, though the difference shouldn’t impact performance. Both my code and Caffe’s run on batches of 500 samples, yet Caffe is substantially faster.

I think I make the same number of forward/backward calls to JCudnn as Caffe makes to cuDNN. The thin JCudnn wrapper shouldn’t account for the performance difference, since the data always stays on the GPU and is never copied out.

LeNet - another library to look at (recently, the Horn Project went on my “Have a closer look at this” list, but it might be unrelated to the low-level libraries around cuDNN)

So I’m not sure how to proceed here systematically. There are SO many parameters and degrees of freedom that might influence the learning process… The obvious/naive question would be whether you can run this in some sort of profiler, to see whether the time is really lost IN the cuDNN functions (and maybe also to see whether they are called the same number of times). Unfortunately, the newer versions of the CUDA Visual Profiler do not work any more with JCuda (the older ones, up to 4.x, worked perfectly, and I heard rumours that it should even be possible to use the newer ones with JCuda, but I have not succeeded with this so far).

Thanks Marco for the suggestion.

I didn’t know about the profiler; I will try it out soon.

From inspecting the source code of all the libraries, I am pretty sure that all cuDNN functions are called the same number of times.

I did try Theano recently, and its performance is 2.3 seconds per 100 iterations. For comparison, Caffe took 1.6 seconds and my implementation with JCuda took 2.7 seconds. The strange thing is that the convolution calls alone take almost 1.4 seconds of my time. That leaves not much room for everything else.

There may be some secret sauce in Caffe, but I can’t figure it out from the source code.

If this means that you did roughly (!) something like

for (int i=0; i<100; i++)
{
    cudnnConvolutionForward(...);
}

and measured the execution time, then I assume that this is indeed caused by a newer cuDNN version (and I assume that the result with the native cuDNN example, without Java/JCuda, would be the same).

However, I’ll try to do the updates as soon as possible, maybe the update for CUDA 8 and cuDNN 5.1 can be done in one run.

Well, my timing included convolution forward, backward data, backward filter, and backward bias. In fact, these accounted for 1.6 seconds over 100 iterations.

Another possible cause is that, to minimize memory use, I allocate and deallocate as eagerly as possible. This probably causes some overhead. I always use JCublas.cublasAlloc/cublasFree for this purpose – not sure whether this is the right way to do it.
A preliminary timing of the memory alloc/dealloc suggests that it accounts for at least 0.6 seconds.

Yes, there was a short interlude about the alloc/free performance around this post: https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/17980-jcublas-dsyrk-dgemm-benchmark-3.html#post127931

And yes: you should avoid “unnecessary” alloc/free calls whenever possible. My response to the post above mentions some approaches for this (and contains a small ““benchmark”” as well). I’m not sure to what extent this can be applied to cuDNN, because I still did not try to get a deeper understanding of what the MNIST example actually does - it seems to do some alloc/free calls, but I don’t know whether they can easily be avoided.
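
Just to sketch one such approach (hypothetical code, not part of the MNIST sample): a buffer that caches its capacity and only re-allocates when the requested size grows, so that repeated “resize” calls inside a loop stop costing a cudaFree/cudaMalloc pair each time:

import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;

import jcuda.Pointer;
import jcuda.Sizeof;

// Hypothetical capacity-caching device buffer (a sketch, not part of JCuda)
class DeviceBuffer
{
    final Pointer pointer = new Pointer();
    private long capacityBytes = 0;

    // Re-allocates only when the requested size exceeds the current capacity
    void ensureFloatCapacity(int numberOfFloatElements)
    {
        long requiredBytes = (long) numberOfFloatElements * Sizeof.FLOAT;
        if (requiredBytes > capacityBytes)
        {
            cudaFree(pointer); // freeing a not-yet-allocated pointer is a no-op
            cudaMalloc(pointer, requiredBytes);
            capacityBytes = requiredBytes;
        }
    }
}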

Beyond that, there are certain elements in the API that serve as “handles” - the MNIST example contains a dedicated “createHandles” method for all of these. They may require some alloc/free internally, and have an additional initialization overhead, but at least in the MNIST example, they are only created once.

The other functions that are called “often” (namely, setTensorDesc (=cudnnSetTensor4dDescriptor)) should have a low overhead.

I hope that this can be analyzed more systematically after the update to cuDNN 5.1, when it is clear whether it’s indeed only related to cuDNN, or whether there is an overhead implied by JCudnn (although I don’t expect this to be the case). There are still some other tasks in the queue, but I see that the interest in JCudnn seems to be far higher than I initially expected.

@typecheck May I ask which OS/CUDA version you are using?
I’m currently finishing the update for CUDA 8.0RC and cuDNN 5.1 (I did some refactoring of the code generator, so it took a bit longer than anticipated). I’d be curious to know whether the performance difference is gone with the new version, and could try to assemble a “preview version” for Win64 before finalizing the other libraries.

[QUOTE=Marco13]@typecheck May I ask which OS/CUDA version you are using?[/QUOTE]

Sorry for the late reply.

I am using cuDNN 4 and CUDA 8 (or whatever the latest version is that works with the current version of JCuda).

I am using both Windows and Linux. The performance delta is measured on Linux.

I am sure that the Theano and Caffe libraries on the Linux server I experimented with are also compiled against cuDNN 4.

The current LeNet timing over 100 iterations is: Caffe 1.6 seconds, my JCuda-based tool 1.9 seconds, Theano 2.4 seconds. I am evaluating other libraries as well. Deeplearning4j is way behind. TensorFlow is also quite slow, but I forgot the numbers. I will look into Torch7 and CNTK soon.

I didn’t run these libraries on Windows because of the difficulty of installing them there.

So if you do make a preview version, a Linux build would be preferred. I can still run on my Windows laptop if you have a build for it, but then I can only compare against previous results. FYI, the same code takes about 15 seconds on my Dell Precision M4700 laptop.

@typecheck

I just pushed an update of JCudnn for cuDNN 5.1 and CUDA 8.0.27 (RC), at https://github.com/jcuda/jcudnn/commit/37bbc972fb1d9dab499c89b25f3a15195e2188e5

It should also be considered only a “release candidate”. I have updated the MNIST sample, which I’ll just dump at the end of this post. But in general, the test coverage and procedure urgently have to be improved. (I also added https://github.com/jcuda/jcuda-main/issues/9 - hopefully, with the new native library handling, it will finally be feasible to add proper JUnit tests.)

Still, you mentioned that the other libraries you compared with (likely) used cuDNN 4.x, so comparing the performance now would no longer be an apples-to-apples comparison.

I’ll try to have a closer look and maybe try out some of the libraries that you mentioned (and if you have a recommendation of one that is particularly easy to get up and running for a simple performance comparison, I’d be glad to hear it). Maybe it’s possible to figure out if/where performance is lost. (But right now, I have some other tasks in the queue, with strict deadlines in the next few days, so I’m not sure when I can allocate the time for that).

package jcuda.jcudnn.samples;

import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;
import static jcuda.jcublas.JCublas2.cublasSgemv;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_T;
import static jcuda.jcudnn.JCudnn.CUDNN_VERSION;
import static jcuda.jcudnn.JCudnn.cudnnActivationForward;
import static jcuda.jcudnn.JCudnn.cudnnAddTensor;
import static jcuda.jcudnn.JCudnn.cudnnConvolutionForward;
import static jcuda.jcudnn.JCudnn.cudnnCreate;
import static jcuda.jcudnn.JCudnn.cudnnCreateActivationDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnCreateConvolutionDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnCreateFilterDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnCreateLRNDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnCreatePoolingDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnCreateTensorDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnDestroy;
import static jcuda.jcudnn.JCudnn.cudnnDestroyActivationDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnDestroyConvolutionDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnDestroyFilterDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnDestroyLRNDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnDestroyPoolingDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnDestroyTensorDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnFindConvolutionForwardAlgorithm;
import static jcuda.jcudnn.JCudnn.cudnnGetConvolutionForwardAlgorithm;
import static jcuda.jcudnn.JCudnn.cudnnGetConvolutionForwardWorkspaceSize;
import static jcuda.jcudnn.JCudnn.cudnnGetConvolutionNdForwardOutputDim;
import static jcuda.jcudnn.JCudnn.cudnnGetErrorString;
import static jcuda.jcudnn.JCudnn.cudnnGetPoolingNdForwardOutputDim;
import static jcuda.jcudnn.JCudnn.cudnnGetVersion;
import static jcuda.jcudnn.JCudnn.cudnnLRNCrossChannelForward;
import static jcuda.jcudnn.JCudnn.cudnnPoolingForward;
import static jcuda.jcudnn.JCudnn.cudnnSetActivationDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSetConvolutionNdDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSetFilterNdDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSetLRNDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSetPoolingNdDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSetTensor4dDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSoftmaxForward;
import static jcuda.jcudnn.cudnnActivationMode.CUDNN_ACTIVATION_RELU;
import static jcuda.jcudnn.cudnnConvolutionFwdAlgo.CUDNN_CONVOLUTION_FWD_ALGO_FFT;
import static jcuda.jcudnn.cudnnConvolutionFwdPreference.CUDNN_CONVOLUTION_FWD_PREFER_FASTEST;
import static jcuda.jcudnn.cudnnConvolutionMode.CUDNN_CROSS_CORRELATION;
import static jcuda.jcudnn.cudnnDataType.CUDNN_DATA_FLOAT;
import static jcuda.jcudnn.cudnnLRNMode.CUDNN_LRN_CROSS_CHANNEL_DIM1;
import static jcuda.jcudnn.cudnnNanPropagation.CUDNN_PROPAGATE_NAN;
import static jcuda.jcudnn.cudnnPoolingMode.CUDNN_POOLING_MAX;
import static jcuda.jcudnn.cudnnSoftmaxAlgorithm.CUDNN_SOFTMAX_ACCURATE;
import static jcuda.jcudnn.cudnnSoftmaxMode.CUDNN_SOFTMAX_MODE_CHANNEL;
import static jcuda.jcudnn.cudnnTensorFormat.CUDNN_TENSOR_NCHW;
import static jcuda.runtime.JCuda.cudaDeviceReset;
import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToDevice;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.jcudnn.JCudnn;
import jcuda.jcudnn.cudnnActivationDescriptor;
import jcuda.jcudnn.cudnnConvolutionDescriptor;
import jcuda.jcudnn.cudnnConvolutionFwdAlgo;
import jcuda.jcudnn.cudnnConvolutionFwdAlgoPerf;
import jcuda.jcudnn.cudnnFilterDescriptor;
import jcuda.jcudnn.cudnnHandle;
import jcuda.jcudnn.cudnnLRNDescriptor;
import jcuda.jcudnn.cudnnPoolingDescriptor;
import jcuda.jcudnn.cudnnTensorDescriptor;
import jcuda.runtime.JCuda;

/**
 * A port of the "mnistCUDNN" sample.<br> 
 * <br>
 * This sample expects the data files that are part of the 
 * mnistCUDNN sample to be present in a "data/" subdirectory.
 */
public class MnistJCudnn
{
    private static final int IMAGE_H = 28;
    private static final int IMAGE_W = 28;

    private static final String first_image = "one_28x28.pgm";
    private static final String second_image = "three_28x28.pgm";
    private static final String third_image = "five_28x28.pgm";
    private static final String dataDirectory = "data/";

    private static final String conv1_bin = "conv1.bin";
    private static final String conv1_bias_bin = "conv1.bias.bin";
    private static final String conv2_bin = "conv2.bin";
    private static final String conv2_bias_bin = "conv2.bias.bin";
    private static final String ip1_bin = "ip1.bin";
    private static final String ip1_bias_bin = "ip1.bias.bin";
    private static final String ip2_bin = "ip2.bin";
    private static final String ip2_bias_bin = "ip2.bias.bin";

    public static void main(String args[])
    {
        JCuda.setExceptionsEnabled(true);
        JCudnn.setExceptionsEnabled(true);
        JCublas2.setExceptionsEnabled(true);

        int version = (int) cudnnGetVersion();
        System.out.printf("cudnnGetVersion() : %d , " +
            "CUDNN_VERSION from cudnn.h : %d\n",
            version, CUDNN_VERSION);

        System.out.println("Creating network and layers...");
        Network mnist = new Network();
        
        System.out.println("Classifying...");
        int i1 = mnist.classifyExample(dataDirectory + first_image);
        int i2 = mnist.classifyExample(dataDirectory + second_image);

        mnist.setConvolutionAlgorithm(CUDNN_CONVOLUTION_FWD_ALGO_FFT);
        int i3 = mnist.classifyExample(dataDirectory + third_image);
        
        System.out.println(
            "\nResult of classification: " + i1 + " " + i2 + " " + i3);
        if (i1 != 1 || i2 != 3 || i3 != 5)
        {
            System.out.println("
Test failed!
");
        }
        else
        {
            System.out.println("
Test passed!
");
        }
        mnist.destroy();
    }

    
    // The CUDNN_TENSOR_NCHW tensor format specifies that the 
    // data is laid out in the following order: 
    // image, features map, rows, columns.
    private static class TensorLayout
    {
        int n;
        int c;
        int h;
        int w;
    }
    
    private static class Layer
    {
        int inputs;
        int outputs;
        int kernel_dim;
        Pointer data_d;
        Pointer bias_d;

        Layer(int inputs, int outputs, int kernelDim, 
            String weightsFileName, String biasFileName)
        {
            this.inputs = inputs;
            this.outputs = outputs;
            this.kernel_dim = kernelDim;

            String weightsPath = dataDirectory + weightsFileName;
            String biasPath = dataDirectory + biasFileName;

            float weights[] = readBinaryFileUnchecked(weightsPath);
            data_d = createDevicePointer(weights);

            float bias[] = readBinaryFileUnchecked(biasPath);
            bias_d = createDevicePointer(bias);
        }

        void destroy()
        {
            cudaFree(data_d);
            cudaFree(bias_d);
        }
    };
    
    private static class Network
    {
        private int convAlgorithm;
        private cudnnHandle cudnnHandle;
        private cudnnTensorDescriptor srcTensorDesc;
        private cudnnTensorDescriptor dstTensorDesc;
        private cudnnTensorDescriptor biasTensorDesc;
        private cudnnFilterDescriptor filterDesc;
        private cudnnConvolutionDescriptor convDesc;
        private cudnnPoolingDescriptor poolingDesc;
        private cudnnActivationDescriptor activDesc;
        private cudnnLRNDescriptor normDesc;
        private cublasHandle cublasHandle;
        
        private final Layer conv1;
        private final Layer conv2;
        private final Layer ip1;
        private final Layer ip2;

        Network()
        {
            convAlgorithm = -1;
            createHandles();
            
            conv1 = new Layer(1, 20, 5, conv1_bin, conv1_bias_bin);
            conv2 = new Layer(20, 50, 5, conv2_bin, conv2_bias_bin);
            ip1 = new Layer(800, 500, 1, ip1_bin, ip1_bias_bin);
            ip2 = new Layer(500, 10, 1, ip2_bin, ip2_bias_bin);
        }

        void createHandles()
        {
            cudnnHandle = new cudnnHandle();
            srcTensorDesc = new cudnnTensorDescriptor();
            dstTensorDesc = new cudnnTensorDescriptor();
            biasTensorDesc = new cudnnTensorDescriptor();
            filterDesc = new cudnnFilterDescriptor();
            convDesc = new cudnnConvolutionDescriptor();
            poolingDesc = new cudnnPoolingDescriptor();
            activDesc = new cudnnActivationDescriptor();
            normDesc = new cudnnLRNDescriptor();

            cudnnCreate(cudnnHandle);
            cudnnCreateTensorDescriptor(srcTensorDesc);
            cudnnCreateTensorDescriptor(dstTensorDesc);
            cudnnCreateTensorDescriptor(biasTensorDesc);
            cudnnCreateFilterDescriptor(filterDesc);
            cudnnCreateConvolutionDescriptor(convDesc);
            cudnnCreatePoolingDescriptor(poolingDesc);
            cudnnCreateActivationDescriptor(activDesc);
            cudnnCreateLRNDescriptor(normDesc);

            cublasHandle = new cublasHandle();
            cublasCreate(cublasHandle);
        }

        void destroy()
        {
            cudnnDestroyLRNDescriptor(normDesc);
            cudnnDestroyPoolingDescriptor(poolingDesc);
            cudnnDestroyActivationDescriptor(activDesc);
            cudnnDestroyConvolutionDescriptor(convDesc);
            cudnnDestroyFilterDescriptor(filterDesc);
            cudnnDestroyTensorDescriptor(srcTensorDesc);
            cudnnDestroyTensorDescriptor(dstTensorDesc);
            cudnnDestroyTensorDescriptor(biasTensorDesc);
            cudnnDestroy(cudnnHandle);

            cublasDestroy(cublasHandle);
            
            conv1.destroy();
            conv2.destroy();
            ip1.destroy();
            ip2.destroy();
        }


        void setConvolutionAlgorithm(int algo)
        {
            convAlgorithm = algo;
        }

        void addBias(cudnnTensorDescriptor dstTensorDesc, 
            Layer layer, int c, Pointer data)
        {
            cudnnSetTensor4dDescriptor(biasTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                1, c, 1, 1);
            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(1.0f);
            cudnnAddTensor(cudnnHandle, alpha,
                biasTensorDesc, layer.bias_d, beta, dstTensorDesc, data);
        }

        void fullyConnectedForward(Layer ip, TensorLayout t, 
            Pointer srcData, Pointer dstData)
        {
            if (t.n != 1)
            {
                System.out.println("Not Implemented");
                return;
            }
            int dim_x = t.c * t.h * t.w;
            int dim_y = ip.outputs;
            resize(dim_y, dstData);

            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(1.0f);

            // place bias into dstData
            cudaMemcpy(dstData, ip.bias_d, dim_y * Sizeof.FLOAT,
                cudaMemcpyDeviceToDevice);

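            // dstData = 1.0 * A^T * srcData + 1.0 * dstData, where A is the
            // (dim_x x dim_y) weight matrix in ip.data_d; the bias copied in
            // above is thereby included in the result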
            cublasSgemv(cublasHandle, CUBLAS_OP_T, 
                dim_x, dim_y, alpha, ip.data_d, 
                dim_x, srcData, 1, beta, dstData, 1);

            t.h = 1;
            t.w = 1;
            t.c = dim_y;
        }

        void convoluteForward(Layer conv, TensorLayout t,
            Pointer srcData, Pointer dstData)
        {
            int algo = 0; // cudnnConvolutionFwdAlgo_t

            cudnnSetTensor4dDescriptor(srcTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            int tensorDims = 4;
            int tensorOuputDimA[] = { t.n, t.c, t.h, t.w };
            int filterDimA[] = { 
                conv.outputs, conv.inputs, 
                conv.kernel_dim, conv.kernel_dim };
            
            cudnnSetFilterNdDescriptor(filterDesc, 
                CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, tensorDims, filterDimA);

            int convDims = 2;
            int padA[] = { 0, 0 };
            int filterStrideA[] = { 1, 1 };
            int upscaleA[] = { 1, 1 };
            cudnnSetConvolutionNdDescriptor(convDesc, convDims, padA,
                filterStrideA, upscaleA, CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

            // find dimension of convolution output
            cudnnGetConvolutionNdForwardOutputDim(convDesc, 
                srcTensorDesc, filterDesc, 
                tensorDims, tensorOuputDimA);
            t.n = tensorOuputDimA[0];
            t.c = tensorOuputDimA[1];
            t.h = tensorOuputDimA[2];
            t.w = tensorOuputDimA[3];

            cudnnSetTensor4dDescriptor(dstTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            if (convAlgorithm < 0)
            {
                int algoArray[] = { -1 };
                
                // Choose the best according to the preference
                System.out.println(
                    "Testing cudnnGetConvolutionForwardAlgorithm ...");
                cudnnGetConvolutionForwardAlgorithm(cudnnHandle, srcTensorDesc,
                    filterDesc, convDesc, dstTensorDesc,
                    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, 0, algoArray);
                algo = algoArray[0];

                System.out.println("Fastest algorithm is Algo " + algo);
                convAlgorithm = algo;

                // New way of finding the fastest config
                // Setup for findFastest call
                System.out.println(
                    "Testing cudnnFindConvolutionForwardAlgorithm ...");
                int requestedAlgoCount = 5;
                int returnedAlgoCount[] = new int[1];
                cudnnConvolutionFwdAlgoPerf results[] = 
                    new cudnnConvolutionFwdAlgoPerf[requestedAlgoCount];
                cudnnFindConvolutionForwardAlgorithm(cudnnHandle,
                    srcTensorDesc, filterDesc, convDesc, dstTensorDesc,
                    requestedAlgoCount, returnedAlgoCount, results);
                for (int algoIndex = 0; algoIndex < returnedAlgoCount[0]; ++algoIndex)
                {
                    System.out.printf(
                        "    %s for Algo %d (%s): %f time requiring %d memory\n",
                        cudnnGetErrorString(results[algoIndex].status),
                        results[algoIndex].algo, 
                        cudnnConvolutionFwdAlgo.stringFor(results[algoIndex].algo),
                        results[algoIndex].time, results[algoIndex].memory);
                }
            }
            else
            {
                algo = convAlgorithm;
                if (algo == CUDNN_CONVOLUTION_FWD_ALGO_FFT)
                {
                    System.out.println("Using FFT for convolution");
                }
            }

            resize(t.n * t.c * t.h * t.w, dstData);
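            // Query the workspace size required by the chosen algorithm,
            // and allocate a workspace only if one is actually needed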
            long sizeInBytesArray[] = { 0 };
            Pointer workSpace = new Pointer();
            cudnnGetConvolutionForwardWorkspaceSize(cudnnHandle, 
                srcTensorDesc, filterDesc, convDesc, dstTensorDesc, 
                algo, sizeInBytesArray);
            long sizeInBytes = sizeInBytesArray[0];
            if (sizeInBytes != 0)
            {
                cudaMalloc(workSpace, sizeInBytes);
            }
            
            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(0.0f);
            cudnnConvolutionForward(cudnnHandle, alpha, srcTensorDesc, 
                srcData, filterDesc, conv.data_d, convDesc, algo, 
                workSpace, sizeInBytes, beta, dstTensorDesc, dstData);
            addBias(dstTensorDesc, conv, t.c, dstData);
            if (sizeInBytes != 0)
            {
                cudaFree(workSpace);
            }
        }

        void poolForward(TensorLayout t, Pointer srcData,
            Pointer dstData)
        {
            int poolDims = 2;
            int windowDimA[] = { 2, 2 };
            int paddingA[] = { 0, 0 };
            int strideA[] = { 2, 2 };
            cudnnSetPoolingNdDescriptor(poolingDesc, 
                CUDNN_POOLING_MAX, CUDNN_PROPAGATE_NAN, poolDims, windowDimA, 
                paddingA, strideA);

            cudnnSetTensor4dDescriptor(srcTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            int tensorDims = 4;
            int tensorOuputDimA[] = { t.n, t.c, t.h, t.w };
            cudnnGetPoolingNdForwardOutputDim(
                poolingDesc, srcTensorDesc,
                tensorDims, tensorOuputDimA);
            t.n = tensorOuputDimA[0];
            t.c = tensorOuputDimA[1];
            t.h = tensorOuputDimA[2];
            t.w = tensorOuputDimA[3];

            cudnnSetTensor4dDescriptor(dstTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            resize(t.n * t.c * t.h * t.w, dstData);
            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(0.0f);
            cudnnPoolingForward(cudnnHandle, poolingDesc, 
                alpha, srcTensorDesc, srcData, beta, 
                dstTensorDesc, dstData);
        }

        void softmaxForward(TensorLayout t,
            Pointer srcData, Pointer dstData)
        {
            resize(t.n * t.c * t.h * t.w, dstData);

            cudnnSetTensor4dDescriptor(srcTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);
            cudnnSetTensor4dDescriptor(dstTensorDesc,
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(0.0f);
            cudnnSoftmaxForward(cudnnHandle, 
                CUDNN_SOFTMAX_ACCURATE, CUDNN_SOFTMAX_MODE_CHANNEL, 
                alpha, srcTensorDesc, srcData,
                beta, dstTensorDesc, dstData);
        }

        void lrnForward(TensorLayout t, 
            Pointer srcData, Pointer dstData)
        {
            int lrnN = 5;
            double lrnAlpha, lrnBeta, lrnK;
            lrnAlpha = 0.0001;
            lrnBeta = 0.75;
            lrnK = 1.0;
            cudnnSetLRNDescriptor(normDesc, lrnN, lrnAlpha, lrnBeta, lrnK);

            resize(t.n * t.c * t.h * t.w, dstData);

            cudnnSetTensor4dDescriptor(srcTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);
            cudnnSetTensor4dDescriptor(dstTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(0.0f);
            cudnnLRNCrossChannelForward(cudnnHandle, normDesc,
                CUDNN_LRN_CROSS_CHANNEL_DIM1, 
                alpha, srcTensorDesc, srcData,
                beta, dstTensorDesc, dstData);
        }

        
        void activationForward(TensorLayout t,
            Pointer srcData, Pointer dstData)
        {
            cudnnSetActivationDescriptor(activDesc, 
                CUDNN_ACTIVATION_RELU, CUDNN_PROPAGATE_NAN, 0.0);
            
            resize(t.n * t.c * t.h * t.w, dstData);

            cudnnSetTensor4dDescriptor(srcTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);
            cudnnSetTensor4dDescriptor(dstTensorDesc, 
                CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                t.n, t.c, t.h, t.w);

            Pointer alpha = pointerTo(1.0f);
            Pointer beta = pointerTo(0.0f);
            cudnnActivationForward(cudnnHandle, activDesc, 
                alpha, srcTensorDesc, srcData, 
                beta, dstTensorDesc, dstData);
        }

        
        int classifyExample(String imageFileName)
        {
            TensorLayout t = new TensorLayout();
            Pointer srcData = new Pointer();
            Pointer dstData = new Pointer();

            float imgData_h[] = readImageDataUnchecked(imageFileName);

            System.out.println("Performing forward propagation ...");

            cudaMalloc(srcData, IMAGE_H * IMAGE_W * Sizeof.FLOAT);
            cudaMemcpy(srcData, Pointer.to(imgData_h), IMAGE_H * IMAGE_W
                * Sizeof.FLOAT, cudaMemcpyHostToDevice);

            t.n = 1;
            t.c = 1;
            t.h = IMAGE_H;
            t.w = IMAGE_W;
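
            // The forward pass ping-pongs between srcData and dstData:
            // each step reads from one buffer and writes to the other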
            convoluteForward(conv1, t, srcData, dstData);
            poolForward(t, dstData, srcData);

            convoluteForward(conv2, t, srcData, dstData);
            poolForward(t, dstData, srcData);

            fullyConnectedForward(ip1, t, srcData, dstData);
            activationForward(t, dstData, srcData);
            lrnForward(t, srcData, dstData);

            fullyConnectedForward(ip2, t, dstData, srcData);
            softmaxForward(t, srcData, dstData);

            int max_digits = 10;
            float result[] = new float[max_digits];
            cudaMemcpy(Pointer.to(result), dstData, 
                max_digits * Sizeof.FLOAT,
                cudaMemcpyDeviceToHost);
            int id = 0;
            for (int i = 1; i < max_digits; i++)
            {
                if (result[id] < result[i])
                    id = i;
            }

            System.out.println("Resulting weights from Softmax:");
            printDeviceVector(t.n * t.c * t.h * t.w, dstData);

            cudaFree(srcData);
            cudaFree(dstData);
            return id;
        }
    }


    
    //========================================================================
    // I/O utility methods
    
    private static float[] readBinaryFile(String fileName) throws IOException
    {
        FileInputStream fis = new FileInputStream(new File(fileName));
        byte data[] = readFully(fis);
        ByteBuffer bb = ByteBuffer.wrap(data);
        bb.order(ByteOrder.nativeOrder());
        FloatBuffer fb = bb.asFloatBuffer();
        float result[] = new float[fb.capacity()];
        fb.get(result);
        return result;
    }

    private static float[] readBinaryFileUnchecked(String fileName)
    {
        try
        {
            return readBinaryFile(fileName);
        }
        catch (IOException e)
        {
            cudaDeviceReset();
            e.printStackTrace();
            System.exit(-1);
            return null;
        }
    }

    private static byte[] readFully(InputStream inputStream) throws IOException
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte buffer[] = new byte[1024];
        while (true)
        {
            int n = inputStream.read(buffer);
            if (n < 0)
            {
                break;
            }
            baos.write(buffer, 0, n);
        }
        byte data[] = baos.toByteArray();
        return data;
    }

    @SuppressWarnings("deprecation")
    private static byte[] readBinaryPortableGraymap8bitData(
        InputStream inputStream) throws IOException
    {
        DataInputStream dis = new DataInputStream(inputStream);
        String line = null;
        boolean firstLine = true;
        Integer width = null;
        Integer maxBrightness = null;
        while (true)
        {
            // The DataInputStream#readLine is deprecated,
            // but for ASCII input, it is safe to use it
            line = dis.readLine();
            if (line == null)
            {
                break;
            }
            line = line.trim();
            if (line.startsWith("#"))
            {
                continue;
            }
            if (firstLine)
            {
                firstLine = false;
                if (!line.equals("P5"))
                {
                    throw new IOException(
                        "Data is not a binary portable " + 
                        "graymap (P5), but " + line);
                }
                else
                {
                    continue;
                }
            }
            if (width == null)
            {
                String tokens[] = line.split(" ");
                if (tokens.length < 2)
                {
                    throw new IOException(
                        "Expected dimensions, found " + line);
                }
                width = parseInt(tokens[0]);
            }
            else if (maxBrightness == null)
            {
                maxBrightness = parseInt(line);
                if (maxBrightness > 255)
                {
                    throw new IOException(
                        "Only 8 bit values supported. " + 
                        "Maximum value is " + maxBrightness);
                }
                break;
            }
        }
        byte data[] = readFully(inputStream);
        return data;
    }

    private static Integer parseInt(String s) throws IOException
    {
        try
        {
            return Integer.parseInt(s);
        }
        catch (NumberFormatException e)
        {
            throw new IOException(e);
        }
    }

    private static float[] readImageData(String fileName) throws IOException
    {
        InputStream is = new FileInputStream(new File(fileName));
        byte data[] = readBinaryPortableGraymap8bitData(is);
        float imageData[] = new float[data.length];
        for (int i = 0; i < data.length; i++)
        {
            imageData[i] = (((int) data[i]) & 0xff) / 255.0f;
        }
        return imageData;
    }

    private static float[] readImageDataUnchecked(String fileName)
    {
        try
        {
            return readImageData(fileName);
        }
        catch (IOException e)
        {
            cudaDeviceReset();
            e.printStackTrace();
            System.exit(-1);
            return null;
        }
    }
    
    //========================================================================
    // utility methods
    
    private static Pointer createDevicePointer(float data[])
    {
        int size = data.length * Sizeof.FLOAT;
        Pointer deviceData = new Pointer();
        cudaMalloc(deviceData, size);
        cudaMemcpy(deviceData, Pointer.to(data), size, cudaMemcpyHostToDevice);
        return deviceData;
    }

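    // Note: this frees and re-allocates the device buffer on every call;
    // when invoked inside a training loop, the repeated cudaFree/cudaMalloc
    // overhead can become significant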
    private static void resize(int numberOfFloatElements, Pointer data)
    {
        cudaFree(data);
        cudaMalloc(data, numberOfFloatElements * Sizeof.FLOAT);
    }
    
    private static Pointer pointerTo(float value)
    {
        return Pointer.to(new float[] { value });
    }

    
    
    //========================================================================
    // debugging utility methods
    
    private static void printDeviceVector(int size, Pointer d)
    {
        float h[] = new float[size];
        cudaDeviceSynchronize();
        cudaMemcpy(Pointer.to(h), d, size * Sizeof.FLOAT,
            cudaMemcpyDeviceToHost);
        for (int i = 0; i < size; i++)
        {
            System.out.print(h[i] + " ");
        }
        System.out.println();
    }
    
    
}

Thanks Marco.

I don’t have experience with compiling RMI. Do you have a built JAR for Windows or Linux?

RMI? JNI, I guess.

Unfortunately, I can’t provide pre-built binaries for Linux.

(EDIT> I’ll upload the Windows ones ASAP, and post a message here. The remaining part of this post referred to the Linux case <EDIT)

I could provide the JARs, of course, but not the native libraries. I hope that the usual contributors will soon provide the binaries for 8.0.27, but I can’t promise anything here (in doubt, I can “ping” them, but can’t give a specific date or so).

The build process should be simple (at least, I spent a considerable amount of time making it “simple”). If you want to try it, the instructions at https://github.com/jcuda/jcuda-main should cover the creation of the natives and the JARs pretty well. (I have asked the contributor of the “Shortbuilding script” whether it needs to be updated (except for the version number).)

Sorry, I know that having to build the binaries is a nuisance, and even if it is intended to be simple, there are always unexpected border cases. So if you run into problems, you can post them here, and if you don’t want to try it, I’ll notify you as soon as the updated Linux binaries are available.

Thanks Marco. Yes, I meant to say JNI. I will give the Linux compilation a try. I will wait for your Windows build, though.

@typecheck The Windows binaries have been uploaded, and are available on the Downloads page at jcuda.org

Hope that the deployment with the natives-in-JARs works as expected…

Thanks. I tried the Windows version, but I encountered a dreaded problem:

JCudnn-0.8.0RC-windows-x86_64.dll: Can’t find dependent libraries

I have cudnn64_5.dll in the same bin directory as the rest of the CUDA DLLs, and the PATH environment variable includes that bin directory.

Any suggestions? Thanks.

What a pity. Which version of cuDNN did you download exactly? I downloaded the one from
https://developer.nvidia.com/rdp/cudnn-download -> “Download cuDNN v5.1 (August 10, 2016), for CUDA 8.0 RC” -> “cuDNN v5.1 Library for Windows 7”

(The protected link is https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod/8.0/cudnn-8.0-windows7-x64-v5.1-zip - the file name is still cudnn64_5.dll, and not cudnn64_5.1.dll or so…)

I.e. in the header file, it should say


#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 5
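
(As a side note: to my knowledge, cudnn.h combines these into a single number, CUDNN_VERSION = CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL, so a quick runtime check from Java is possible - the expected value of 5105 below follows from that formula:)

// Prints the version of the natively loaded cuDNN library;
// for cuDNN 5.1.5 this should print 5105
long loadedVersion = jcuda.jcudnn.JCudnn.cudnnGetVersion();
System.out.println("Loaded cuDNN version: " + loadedVersion);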

I could try to create a dedicated version of JCudnn for 5.0 (I think only two functions changed), but such a “downgrade” does not seem desirable.

If you have the same version, then the difference might still be between the Windows 7 and the Windows 10 version (I have Windows 8.1, so I picked the Windows 7 one…)

(Maybe cuDNN will settle down soon, with longer release cycles, fewer different versions, and fewer changes between the versions…)

Sorry for the inconvenience.

@Marco13,

Thanks, but it was due to a silly mistake on my part: I didn’t restart Eclipse. Apparently, Eclipse caches the system environment variables, and it kept looking at the CUDA 7 folder.
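
For the record, a quick check like the following would have caught it (just a sketch):

// Print what the running JVM actually sees; if PATH still points at the
// old CUDA folder, the IDE has cached a stale environment
System.out.println(System.getenv("PATH"));
System.out.println(System.getProperty("java.library.path"));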

I only found the problem after Eclipse crashed, which it does very often nowadays.

I did run a test case, and the performance improvement is minor on my laptop: 15.8 --> 15.4 seconds per 100 iterations of LeNet. Maybe if I reboot my machine later, some magic may happen.

I will run the test on the Linux server and report the results later.