Batch normalization problem

Hi Marco,

I am trying to use batch normalization in JCudnn, but I can’t get my tests for backward propagation to pass.

The call keeps returning error code 3 (CUDNN_STATUS_BAD_PARAM), which indicates bad parameters.

According to the cuDNN manual (page 112), there is a list of conditions that can cause this failure. However, I checked all of them against my inputs, and none of them should apply.

I also looked at the native code, and it appears to be correct.

Now I am really stuck. Any suggestions?

thanks

Sorry, there was not much progress on my side with cuDNN. After the update, other tasks have been more pressing.

Do you have some sort of minimal example (maybe just the basic setup and the single relevant function call) so that I can have a look? At least, this could help to verify that there is really nothing wrong in the JNI part. (And if possible, I could then try to do the same in native cuDNN, just to have a “ground truth”)

[QUOTE=Marco13]Do you have some sort of minimal example (maybe just the basic setup and the single relevant function call) so that I can have a look? […][/QUOTE]

If you don’t mind, please try the following code. The forward-training and forward-inference methods both work fine; only the backward pass doesn’t.
When I run it, I get error code 3, and the result “dx” is all zeros.

	
	@Test
	public void test() {
		 cudnnHandle cudnnHandle = new cudnnHandle();
		 int mode = 1; // spatial mode
		 Pointer one = Pointer.to(new float[]{1.0f});
		 Pointer zero = Pointer.to(new float[]{0.0f});
		 int[] dims = {2, 3, 1, 1};
		 cudnnTensorDescriptor descriptor = new cudnnTensorDescriptor();
		 JCudnn.cudnnCreateTensorDescriptor(descriptor);
		 JCudnn.cudnnSetTensor4dDescriptor(descriptor, 0, 0, dims[0], dims[1], dims[2], dims[3]);
		 cudnnTensorDescriptor norm_dptr = new cudnnTensorDescriptor();
		 JCudnn.cudnnCreateTensorDescriptor(norm_dptr);
		 JCudnn.cudnnSetTensor4dDescriptor(norm_dptr, 0, 0, 1, dims[1], 1, 1);
		 double epsilon = 1;
		 
		 Pointer x = new Pointer(), dy = new Pointer(), dx = new Pointer();
		 int size = dims[0]*dims[1]*dims[2]*dims[3];
		 JCublas.cublasAlloc(size, Sizeof.FLOAT, x); 
		 JCublas.cublasAlloc(size, Sizeof.FLOAT, dy);
		 JCublas.cublasAlloc(size, Sizeof.FLOAT, dx);
		 JCublas2.cublasSetVector(size, Sizeof.FLOAT, Pointer.to(new float[] {1,2,3,4,5,6}), 1, x, 1);
		 int norm_size = dims[1];
		 Pointer scale = new Pointer(), d_scale = new Pointer(), d_bias = new Pointer(), saved_mean = new Pointer(), saved_variance = new Pointer();
		 JCublas.cublasAlloc(norm_size, Sizeof.FLOAT, scale); 
		 JCublas.cublasAlloc(norm_size, Sizeof.FLOAT, d_scale);
		 JCublas.cublasAlloc(norm_size, Sizeof.FLOAT, d_bias);
		 JCublas.cublasAlloc(norm_size, Sizeof.FLOAT, saved_mean);
		 JCublas.cublasAlloc(norm_size, Sizeof.FLOAT, saved_variance);
		 JCublas2.cublasSetVector(norm_size, Sizeof.FLOAT, Pointer.to(new float[] {10,100,1000}), 1, scale, 1);
		 
		 int ret = cudnnBatchNormalizationBackward(cudnnHandle, mode, one, zero, one, zero,
					descriptor, x, descriptor, dy, descriptor, dx,
					norm_dptr, scale, d_scale, d_bias, 
					epsilon, saved_mean, saved_variance);
			
			
		 System.out.println(ret); 
		 float[] result = new float[size];
		 JCublas2.cublasGetVector(size, Sizeof.FLOAT, dx, 1, Pointer.to(result), 1);
		 System.out.println(Arrays.toString(result));
	}

Thanks for pointing this out. There is, indeed, an error in the JNI part. Some of the pointers have to be in host memory (and are properly given in host memory in your test), but are assumed to be in device memory in the JNI part. This assumption is actually intended as a simplification/optimization for the common case of device-memory pointers, but occasionally one slips through - particularly when the memory space is only mentioned in the documentation (and even more so when it changes between different versions).
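To illustrate the distinction with a minimal sketch (in JCuda terms; the concrete values here are just examples): a host pointer simply wraps a Java array, while a device pointer refers to memory obtained from cudaMalloc - and the JNI layer has to treat the two differently.

    // Host pointer: wraps a Java array directly (like the alpha/beta scaling factors in the test)
    Pointer alpha = Pointer.to(new float[] { 1.0f });

    // Device pointer: refers to GPU memory allocated with cudaMalloc
    Pointer x = new Pointer();
    cudaMalloc(x, 6 * Sizeof.FLOAT);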

Anyway, I’ll update it ASAP (it’s already 2:00 am here, so likely „tomorrow“) and then drop a note here.

Sorry for the inconvenience.

*** Edit ***

EDIT: By the way: if you have more (JUnit) tests and would like to contribute them, I’d be glad about that. Creating proper unit tests for the cuDNN functions is tremendously difficult, considering their dozens of parameters, which are described only with things like

Tensor descriptors and pointers in device memory for the layer’s x data, backpropagated differential dy (inputs) and resulting differential with respect to x, dx

Pointers in device memory for the resulting scale and bias differentials computed by this routine…

-_- that’s a horribly large state space, and I can hardly imagine who could be „sure to use the function properly“…

Thanks for fixing this. I will publish some code in the near future that exercises a number of JCudnn calls through wrappers.

@typecheck

Thanks again for pointing this out. I should probably have added some sort of … typecheck … for the pointers. In any case, I reviewed the types now: some parameters of cudnnBatchNormalizationBackward (and cudnnSpatialTfSamplerBackward) have to be in host memory. Additionally, the last parameters of cudnnBatchNormalizationBackward are optional (i.e. they may be null). The update is in https://github.com/jcuda/jcudnn/commit/76d3c83498ee2e0d0c086b5793de31de8c6c0aba
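For example (reusing the variable names from your test), the call may now also be made without the saved statistics:

    // savedMean / savedInvVariance are optional and may be passed as null
    cudnnBatchNormalizationBackward(cudnnHandle, mode, one, zero, one, zero,
        descriptor, x, descriptor, dy, descriptor, dx,
        norm_dptr, scale, d_scale, d_bias,
        epsilon, null, null);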

I also updated and ran the test. Some remarks:

  1. Important: You did not use cudnnCreate to create the cudnnHandle.

  2. Is there a particular reason to use the JCublas functions for allocating and copying memory? Technically it does not really matter, because (since CUDA 3.0) the CUBLAS functions like cublasAlloc are basically just wrappers around cudaMalloc etc. But it seems more “natural” to use the CUDA runtime functions, because cuDNN is otherwise unrelated to CUBLAS. (See the short sketch after this list.)

  3. Admittedly, in Java (due to the lack of typedef), the type information for types that map to primitive types is lost. But I’d still strongly recommend using the type constants instead of plain int values. E.g. instead of

int mode = 1; // spatial mode
...
cudnnSetTensor4dDescriptor(descriptor, 
    0, 0, 
    dims[0], dims[1], dims[2], dims[3]);

one should probably use

int mode = CUDNN_BATCHNORM_SPATIAL;
...
cudnnSetTensor4dDescriptor(descriptor, 
    CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 
    dims[0], dims[1], dims[2], dims[3]);

(with the appropriate static imports)
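Regarding remark 2, a minimal sketch of the equivalence - both of these allocate the same device memory for the given pointer:

    // JCublas style, as used in the original test:
    JCublas.cublasAlloc(size, Sizeof.FLOAT, x);

    // Plain CUDA runtime style (cublasAlloc is essentially a wrapper around this):
    cudaMalloc(x, size * Sizeof.FLOAT);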


Here is the updated test code (as an application):

package jcuda.jcudnn.test;

import static jcuda.jcudnn.JCudnn.cudnnBatchNormalizationBackward;
import static jcuda.jcudnn.JCudnn.cudnnCreate;
import static jcuda.jcudnn.JCudnn.cudnnCreateTensorDescriptor;
import static jcuda.jcudnn.JCudnn.cudnnSetTensor4dDescriptor;
import static jcuda.jcudnn.cudnnBatchNormMode.CUDNN_BATCHNORM_SPATIAL;
import static jcuda.jcudnn.cudnnDataType.CUDNN_DATA_FLOAT;
import static jcuda.jcudnn.cudnnTensorFormat.CUDNN_TENSOR_NCHW;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcudnn.JCudnn;
import jcuda.jcudnn.cudnnHandle;
import jcuda.jcudnn.cudnnTensorDescriptor;
import jcuda.runtime.JCuda;

public class JCudnnBatchNormalizationTest
{
    public static void main(String[] args)
    {
        JCudnn.setExceptionsEnabled(true);
        JCuda.setExceptionsEnabled(true);
        
        cudnnHandle cudnnHandle = new cudnnHandle();
        cudnnCreate(cudnnHandle);

        int[] dims = {2, 3, 1, 1};
        
        cudnnTensorDescriptor descriptor = new cudnnTensorDescriptor();
        cudnnCreateTensorDescriptor(descriptor);
        cudnnSetTensor4dDescriptor(descriptor, 
            CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 
            dims[0], dims[1], dims[2], dims[3]);
        
        cudnnTensorDescriptor norm_dptr = new cudnnTensorDescriptor();
        cudnnCreateTensorDescriptor(norm_dptr);
        cudnnSetTensor4dDescriptor(norm_dptr, 
            CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, dims[1], 1, 1);
        
        Pointer x = new Pointer();
        Pointer dy = new Pointer();
        Pointer dx = new Pointer();
        int size = dims[0]*dims[1]*dims[2]*dims[3];
        cudaMalloc(x, size * Sizeof.FLOAT);
        cudaMalloc(dy, size * Sizeof.FLOAT);
        cudaMalloc(dx, size * Sizeof.FLOAT);
        cudaMemcpy(x, Pointer.to(new float[] {1,2,3,4,5,6}), 
            size * Sizeof.FLOAT, cudaMemcpyHostToDevice);

        int norm_size = dims[1];
        Pointer scale = new Pointer();
        Pointer d_scale = new Pointer();
        Pointer d_bias = new Pointer();
        Pointer saved_mean = new Pointer();
        Pointer saved_variance = new Pointer();
        cudaMalloc(scale, norm_size * Sizeof.FLOAT);
        cudaMalloc(d_scale, norm_size * Sizeof.FLOAT);
        cudaMalloc(d_bias, norm_size * Sizeof.FLOAT);
        cudaMalloc(saved_mean, norm_size * Sizeof.FLOAT);
        cudaMalloc(saved_variance, norm_size * Sizeof.FLOAT);

        cudaMemcpy(scale, Pointer.to(new float[] {10,100,1000}), 
            norm_size * Sizeof.FLOAT, cudaMemcpyHostToDevice);

        int mode = CUDNN_BATCHNORM_SPATIAL;
        Pointer one = Pointer.to(new float[]{1.0f});
        Pointer zero = Pointer.to(new float[]{0.0f});
        double epsilon = 1;
        
        int ret = cudnnBatchNormalizationBackward(
            cudnnHandle, mode, one, zero, one, zero,
            descriptor, x, descriptor, dy, descriptor, dx,
            norm_dptr, scale, d_scale, d_bias, 
            epsilon, saved_mean, saved_variance);
           
        System.out.println(ret); 
        float[] result = new float[size];
        
        cudaMemcpy(Pointer.to(result), dx, 
            size * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
        
        System.out.println(Arrays.toString(result));    
    }
}

Regarding the update: Which OS are you using? More specifically: if you don’t want to compile it yourself, I can provide the updated Win64 binary. (I’ll likely not create a new “release” for this, because the update to CUDA 8.0 (final) is already pending anyhow - although I cannot say an exact date, it should be available “soon”.)

*** Edit ***

EDIT: BTW, the output that is printed for “dx” is still a 0-vector. But I’ll have to wrap my head around what the function is actually supposed to do before I can tell whether this is correct or not. (There are many input parameters, and some of them are 0-vectors as well, so I’m not sure.)

*** Edit ***

Another EDIT: I have checked this against the corresponding C implementation:


#include <cuda.h>
#include <cudnn.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    cudnnHandle_t cudnnHandle;
    cudnnCreate(&cudnnHandle);
    cudnnBatchNormMode_t mode = CUDNN_BATCHNORM_SPATIAL;
    int dims[] = { 2, 3, 1, 1 };
    cudnnTensorDescriptor_t descriptor;
    cudnnCreateTensorDescriptor(&descriptor);
    cudnnSetTensor4dDescriptor(descriptor, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, dims[0], dims[1], dims[2], dims[3]);
    cudnnTensorDescriptor_t norm_dptr;
    cudnnCreateTensorDescriptor(&norm_dptr);
    cudnnSetTensor4dDescriptor(norm_dptr, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, dims[1], 1, 1);
    double epsilon = 1;

    float one = 1.0f;
    float zero = 0.0f;

    float *x;
    float *dy;
    float *dx;
    int size = dims[0] * dims[1] * dims[2] * dims[3];
    cudaMalloc(&x, size * sizeof(float));
    cudaMalloc(&dy, size * sizeof(float));
    cudaMalloc(&dx, size * sizeof(float));

    float h_x[] = { 1, 2, 3, 4, 5, 6 };
    cudaMemcpy(x, h_x, size * sizeof(float), cudaMemcpyHostToDevice);
    int norm_size = dims[1];
    float *scale;
    float *d_scale;
    float *d_bias;
    float *saved_mean;
    float *saved_variance;
    
    cudaMalloc(&scale, norm_size * sizeof(float));
    cudaMalloc(&d_scale, norm_size * sizeof(float));
    cudaMalloc(&d_bias, norm_size * sizeof(float));
    cudaMalloc(&saved_mean, norm_size * sizeof(float));
    cudaMalloc(&saved_variance, norm_size * sizeof(float));

    float h_scale[] = { 10, 100, 1000 };
    cudaMemcpy(scale, h_scale, norm_size * sizeof(float), cudaMemcpyHostToDevice);

    int ret = cudnnBatchNormalizationBackward(cudnnHandle, mode, &one, &zero, &one, &zero,
        descriptor, x, descriptor, dy, descriptor, dx,
        norm_dptr, scale, d_scale, d_bias,
        epsilon, saved_mean, saved_variance);


    float *result = new float[size];
    cudaMemcpy(result, dx, size * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < size; i++)
    {
        printf("At %d have %f
", i, result**);
    }
    printf("Done");
    return 0;
}

And this also prints the 0-vector, so it is at least reasonable to say that this is “correct”…

Thank you very much for the prompt response. This is very impressive.

I run Windows on my laptop for testing, so a Windows build would be nice, as I don’t have the right tools to compile on Windows. I will do the Linux build separately.

[QUOTE=Marco13]1. Important: You did not use cudnnCreate to create the cudnnHandle.[/QUOTE]

Yes, I realize that. My original code does properly call cudnnCreate; the test code I gave you was written in ten minutes just to reproduce the error.

[QUOTE=Marco13]2. Is there a particular reason to use the JCublas functions for allocating and copying memory? […][/QUOTE]

No reason other than that I was careless in picking the alloc functions. There are multiple ways of doing the same thing, and I kept forgetting which one I had used before. I will try to use the CUDA alloc functions more consistently now.

Also, I use JCublas for matrix operations, such as implementing fully connected layers. I also use your Vec library, which is wonderful.

[QUOTE=Marco13]3. […] I’d strongly recommend to use the type constants instead of plain int values. […][/QUOTE]

Yes, I do use constant names in my production code. I put together the test code for you so that it has no dependency on my original code. It was done in a big hurry.

[QUOTE=Marco13]Here is the updated test code (as an application): […]

Regarding the update: Which OS are you using? […]

EDIT: BTW, the output that is printed for “dx” is still a 0-vector. […]

Another EDIT: I have checked this against the corresponding C implementation […] And this also prints the 0-vector, so it is at least reasonable to say that this is “correct”…[/QUOTE]

That is the correct result, since I never initialized “dy” before passing it in; it has to be non-zero for “dx” to be non-zero.
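For example (a minimal tweak to the updated test above, with made-up gradient values), filling “dy” before the backward call should then yield a non-zero “dx” - provided the saved mean/inverse variance have also been filled by a preceding forward-training call:

    // Hypothetical: copy some non-zero gradient values into dy before calling the backward function
    cudaMemcpy(dy, Pointer.to(new float[] {1, 1, 1, 1, 1, 1}),
        size * Sizeof.FLOAT, cudaMemcpyHostToDevice);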

As long as the status code returned by the call is 0, I am sure it worked!

Thank you very much; I am looking forward to the Windows build.

@typecheck

Sure, I just wanted to point these things out (and the example that you provided really helped to quickly test this - none of the samples that I know of (not even the ones from NVIDIA) use the „cudnnBatchNormalizationBackward“ function at all…)

[QUOTE=typecheck;140983]
I run Windows on my laptop for testing. So a Window build would be nice for me as I don’t have the right tools to compile on Windows.

Thank you very much and looking forward to the windows build.[/QUOTE]

Now… I’m afraid that this is the point where it becomes somewhat … „unprofessional“ or „fiddly“. I have uploaded an updated JAR with the new native library at
http://jcuda.org/jcudnn-natives-0.8.0RC-HOTFIX01-windows-x86_64.jar

But since the name (version number) of the library is still the same, you will have to delete the old version of the library manually. So if you delete
C:\Users\<YourUserName>\AppData\Local\Temp\JCudnn-0.8.0RC-windows-x86_64.dll
it will unpack the new version the next time that you use JCudnn.
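(If you prefer doing that from code, a small hypothetical snippet along these lines - using java.io.File - should have the same effect:)

    // Hypothetical: delete the previously unpacked library so that the new one is extracted on the next run
    File oldLib = new File(System.getProperty("java.io.tmpdir"),
        "JCudnn-0.8.0RC-windows-x86_64.dll");
    oldLib.delete();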

(I considered alternatives for this, like checking the file date, or adding some „-forceLibraryUpdate“ flag, but until now, creating such a „hotfix“ was hardly ever necessary…)

EDIT, BTW:

It would be great if you could drop me a note when you publish your code - I’d really be interested to see it.

Yep. It works!

I will alert you when I publish my code that calls JCudnn.

Just so you know, batch normalization is a more recent addition to cuDNN, and it is used in the current best-performing models - residual CNNs.

I’m curious to see an application (beyond the MNIST sample…). And I’ll definitely have to read some papers and maybe play around with some “toy datasets” to catch up with the most recent developments.

Marco,

Here is my project: https://github.com/deepdsl/deepdsl (DeepDSL is a domain-specific language (DSL) embedded in Scala for writing deep learning network applications). It contains some simple tests for the JCudnn functions at https://github.com/deepdsl/deepdsl/tree/master/deepdsl-java/src/test/java/deepdsl/cudnn

There are also several generated Java programs for running deep neural networks at https://github.com/deepdsl/deepdsl/tree/master/deepdsl-java/src/main/java/deepdsl/gen, which call JCudnn/JCuda through my Java wrappers.

Thanks @typecheck. From quickly skimming over this, it looks impressive (also the benchmarks). Something like https://github.com/deepdsl/deepdsl/blob/master/deepdsl-java/src/main/java/deepdsl/gen/Alexnet.java, on the other hand, looks “intimidating”, but I will try to allocate some time to have a closer look.

@Marco13 ,

Thanks for checking it out.

These Java programs are auto-generated, which is why they are long and monolithic.

I do have another potential issue with BatchNorm. In the test code below (which calls the JCudnn library in DeepDSL),
I would expect the results of “norm.forward(x, scale, bias)” and “norm.forward_inference(x, scale, bias)” to be the same.

However, while “forward” is correct, “forward_inference” is not.

The “forward” method calls the batch-norm forward-training function in JCudnn, while “forward_inference” calls the forward-inference function.
The two should produce the same results, except that the former does a little more work to calculate the running mean/variance and to save values that speed up the backward gradient.
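For reference, the standard batch-norm equations as I understand them (not specific to cuDNN): training normalizes with the statistics of the current batch, while inference normalizes with the running statistics, so the two outputs only match when those statistics agree.

    y_training  = scale * (x - batch_mean)   / sqrt(batch_variance   + epsilon) + bias
    y_inference = scale * (x - running_mean) / sqrt(running_variance + epsilon) + bias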

When I test a Deep Residual Network, the training loss goes down nicely, but the test accuracy is very low. This also seems to suggest that forward_inference has a problem.

Thanks for looking into this.

	@Test
	public void forward() {
		float[] a = {2,0, 0,-1, -2,1}, a1 = {1, 1}, a2 = {0f, 0f};
		int[] dims = {3, 2, 1, 1}, norm_dims = {1, 2, 1, 1};
		JTensorFloat t = new JTensorFloat(a, dims), 
				tscale = new JTensorFloat(a1, norm_dims), 
				tbias = new JTensorFloat(a2, norm_dims);
		JCudaTensor x = t.asJCudaTensor(),
				scale = tscale.asJCudaTensor(),
				bias = tbias.asJCudaTensor();
		
		JCudnnBatchNorm norm = new JCudnnBatchNorm("bn", dims);
		
		JCudaTensor y = norm.forward(x, scale, bias);
		System.out.println(Arrays.toString(y.asArray()));
		System.out.println(Arrays.toString(norm.running_mean.asArray()));
		System.out.println(Arrays.toString(norm.running_variance.asArray()));
		System.out.println(Arrays.toString(norm.saved_mean.asArray()));
		System.out.println(Arrays.toString(norm.saved_inv_variance.asArray())); 
		y = norm.forward_inference(x, scale, bias);
		System.out.println(Arrays.toString(y.asArray()));
      }

Again, I’ll have to allocate some time for this (i.e. check out your project and get this part up and running, or try to set up a minimal, standalone test case).

Until then: to what extent are the results “different” or “wrong”? Are they only imprecise (within some epsilon range), plainly different (looking like reasonable but random values), or obviously utterly wrong (involving 1.234e45, +/-Infinity or even NaN)?

Sorry, I should have posted the test output.

[1.2247425, 0.0, 0.0, -1.2247357, -1.2247425, 1.2247357] output of forward_training
[0.0, 0.0] running mean
[4.0, 1.0] running variance
[0.0, 0.0] saved mean
[0.61237127, 1.2247357] saved inverted standard deviation

[0.9999987, 0.0, 0.0, -0.99999493, -0.9999987, 0.99999493] output of forward_inference

A little explanation of the input: it is a batch of 3 with 2 channels.

Batch norm will try the following:

Take the input x = {2, 0, 0, -1, -2, 1} and separate it into two groups based on channel:

2, 0, -2 and 0, -1, 1

Find mean and variance for both, which are

[0, 0] for mean and
[8/3, 2/3] for variance respectively.

But the reported running variance is [4, 1].

Note that the input dimension is 3 x 2 x 1 x 1, which means a batch size of 3 and 2 channels.

So I would expect the variances to be (4 + 0 + 4)/3 and (0 + 1 + 1)/3, respectively, but they are not.

It seems that cuDNN used the sample variance formula: sum of squared errors / (N-1).

On the other hand, saved mean and saved inverted standard deviation are as expected.

They are, in fact, [0, 0] for the saved mean and [1/sqrt(8/3), 1/sqrt(2/3)] for the inverted std.
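A quick sanity check of the reported numbers for channel 0 (values {2, 0, -2}), in plain Java; epsilon is assumed to be negligibly small here:

    float[] c0 = {2, 0, -2};                    // channel 0 of the input x
    float mean = (2 + 0 - 2) / 3f;              // 0.0
    float popVariance = (4 + 0 + 4) / 3f;       // 8/3 ~ 2.667 (divide by N)
    float sampleVariance = (4 + 0 + 4) / 2f;    // 4.0         (divide by N-1)
    float savedInvStd = (float) (1.0 / Math.sqrt(popVariance)); // ~0.6124, matches the reported saved inverted std
    // The reported running variance of 4.0 matches the sample variance.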

=====================================

It seems that forward_training used the population variance, while forward_inference used the sample variance.

I am not sure why there is a difference, or whether it is causing my problem. It may not be a bug.

However, if I use “forward_training” for inference purposes, then everything seems to work.

Frankly: I have no idea what you are talking about :) But I will try to extract the relevant parts from the test case, set up an example based on your numbers, and

  1. see whether I observe the same results
  2. try it out with plain cuDNN (this is usually my first step in these cases, to make sure that the problem is really in the JCu* library, and not already in the native library)

I’ll likely be able to do this in the next few days.

@Marco13 ,

Sorry, I just edited my post after your reply.

I thought the running variance wasn’t updated in batch norm, but it was. Now I am not sure what the problem could be, except that forward_training scales the variance it uses by 1/N, while the variance it saves is scaled by 1/(N-1), where N is the batch size.

Edit: This is actually correct behavior according to the original paper on batch norm. So this is not a bug. Sorry about that.

[QUOTE=typecheck]
Edit: This is actually correct behavior according to the original paper on batch norm. So this is not a bug. Sorry about that.[/QUOTE]

So is the overall issue solved now?

(If it is not solved, this will be the first thing that I try out after setting up the project locally. Otherwise, I could probably start with looking for (simple) examples)

[QUOTE=Marco13]So is the overall issue solved now?

(If it is not solved, this will be the first thing that I try out after setting up the project locally. Otherwise, I could probably start with looking for (simple) examples)[/QUOTE]

Batch Norm is correctly implemented as far as I can tell. I tried batch norm on LeNet by adding a BN layer after the first convolution layer. It accelerated convergence as expected. Everything works.

My remaining problem is the low accuracy of the Residual Network when using batch_norm_forward_inference, but it could be a bug on my side, or I just didn’t have the patience to train the network long enough on my puny laptop.

Looks like I am not the only one with batch normalization issues: https://github.com/NVIDIA/DIGITS/issues/629

Some suggested variance clipping as a solution (https://github.com/BVLC/caffe/pull/3919#issuecomment-215174898), since the global variance may get corrupted by some very large/small per-batch variances.

They mentioned a similar issue that I observed: high training accuracy but low testing accuracy.
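For what it’s worth, a minimal sketch of what such clipping could look like (hypothetical post-processing of a host-side copy of the running variance; the bounds are made up and would need tuning):

    // Hypothetical: clamp each running-variance entry into a sane range after it has been updated
    float minVariance = 1e-5f;
    float maxVariance = 1e5f;
    for (int i = 0; i < runningVariance.length; i++) {
        runningVariance[i] = Math.max(minVariance, Math.min(maxVariance, runningVariance[i]));
    }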