Access violation in Cublas

I am using JCublas and calling cublasDsyrk

It normally works fine, but after about an hour (I am doing a lot of processing) I get an EXCEPTION_ACCESS_VIOLATION.

It seems to happen only when I have big matrices - about 500 x 150,000. That means about 75M doubles, which is 300 MB.

Is there an obvious memory limit or something I should be aware of? I am using a GeForce GT 640, which I think is fairly low spec.

My GPU specs are:

name=GeForce GT 640
totalGlobalMem=4294967296
sharedMemPerBlock=49152
regsPerBlock=65536
warpSize=32
memPitch=2147483647
maxThreadsPerBlock=1024
maxThreadsDim=[1024, 1024, 64]
maxGridSize=[2147483647, 65535, 65535]
clockRate=797000
totalConstMem=65536
major=3
minor=0
textureAlignment=512
texturePitchAlignment=32
deviceOverlap=1
multiProcessorCount=2
kernelExecTimeoutEnabled=1
integrated=0
canMapHostMemory=1
computeMode=cudaComputeModeDefault
maxTexture1D=65536
maxTexture1DMipmap=16384
maxTexture1DLinear=134217728
maxTexture2D=[65536, 65536]
maxTexture2DMipmap=[16384, 16384]
maxTexture2DLinear=[65000, 65000, 1048544]
maxTexture2DGather=[16384, 16384]
maxTexture3D=[4096, 4096, 4096]
maxTexture3DAlt=[2048, 2048, 16384]
maxTextureCubemap=16384
maxTexture1DLayered=[16384, 2048]
maxTexture2DLayered=[16384, 16384, 2048]
maxTextureCubemapLayered=[16384, 2046]
maxSurface1D=65536
maxSurface2D=[65536, 32768]
maxSurface3D=[65536, 32768, 2048]
maxSurface1DLayered=[65536, 2048]
maxSurface2DLayered=[65536, 32768, 2048]
maxSurfaceCubemap=32768
maxSurfaceCubemapLayered=[32768, 2046]
surfaceAlignment=512
concurrentKernels=1
ECCEnabled=0
pciBusID=1
pciDeviceID=0
pciDomainID=0
tccDriver=0
asyncEngineCount=1
unifiedAddressing=1
memoryClockRate=891000
memoryBusWidth=128
l2CacheSize=262144
maxThreadsPerMultiProcessor=2048
streamPrioritiesSupported=0
globalL1CacheSupported=0
localL1CacheSupported=1
sharedMemPerMultiprocessor=49152
regsPerMultiprocessor=65536
managedMemory=1
isMultiGpuBoard=0
multiGpuBoardGroupID=0

*** Edit ***


    cublasHandle handle = new cublasHandle();
    cublasCreate(handle);

    Pointer d_A = new Pointer();
    Pointer d_C = new Pointer();
    cudaMalloc(d_A, nRuns * nX * Sizeof.DOUBLE);
    cudaMalloc(d_C, nX * nX * Sizeof.DOUBLE);
    double dh_A[] = new double[nRuns * nX];
    double dh_C[] = new double[nX * nX];

    // create dh_A array

    cublasSetVector(nRuns * nX, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
    cublasSetVector(nX * nX, Sizeof.DOUBLE, Pointer.to(dh_C), 1, d_C, 1);
    Pointer pAlpha = Pointer.to(new double[]{1.0d});
    Pointer pBeta = Pointer.to(new double[]{0.0d});
    cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_T, nX, nRuns, pAlpha, d_A, nRuns, pBeta, d_C, nX);
    cublasGetVector(nX * nX, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);

    cudaFree(d_A);
    cudaFree(d_C);
    cublasDestroy(handle);


There should not be any particular limit relevant here. Of course, there are limits for the maximum allocation, but if it works once, it should work infinitely often (assuming that there is no issue with memory fragmentation).

From the description, it sounds like some sort of memory leak, but in the given code snippet, everything seems to be cleaned up properly.

(It might not be necessary to create and destroy the cublasHandle each time - it should be fine to create it once at program startup, and destroy it at the end, unless you need different handles for other purposes).

Is the given code snippet sufficient to reproduce the error when run in a while(true) loop? If so, what does the size “500 x 150,000” refer to? In particular: what are “nRuns” and “nX” in your case?

Did you set
JCublas2.setExceptionsEnabled(true);
JCuda.setExceptionsEnabled(true);
to catch any other errors that might happen here?

Additionally, when the VM crashes, it should create a “hs_err_XXXX.txt” file that may contain useful information.

nRuns is 150,000, nX is 500.

I am using JCublas in this example; I will try again with JCublas2.

I have set the exceptions to true, but it seems I haven’t trapped them properly - it doesn’t go to the catch section.

So it seems there is nothing obvious… I’ll get some more diagnostics. I am calling this bit of code tens of thousands of times, and nX varies.

OK, I have got some better information. I am now trapping exceptions properly. For nRuns more than about 140,000 I get

jcuda.CudaException: CUBLAS_STATUS_MAPPING_ERROR

Plus a window pops up with some message about the NVIDIA kernel driver not responding.

I can work around this by handling the exception; nX should generally be much less than this, so having a large number is an ‘exceptional’ case.

*** Edit ***

PS: 75M doubles is of course 600 MB, not 300 MB.

So I just started a test run, with this code snippet…

package jcuda.jcublas.test;

import static jcuda.jcublas.JCublas2.cublasCreate;
import static jcuda.jcublas.JCublas2.cublasDestroy;
import static jcuda.jcublas.JCublas2.cublasDsyrk;
import static jcuda.jcublas.JCublas2.cublasGetVector;
import static jcuda.jcublas.JCublas2.cublasSetVector;
import static jcuda.jcublas.cublasFillMode.CUBLAS_FILL_MODE_UPPER;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_T;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcublas.JCublas2;
import jcuda.jcublas.cublasHandle;
import jcuda.runtime.JCuda;

public class JCublasAccessViolationTest
{
    public static void main(String[] args)
    {
        JCublas2.setExceptionsEnabled(true);
        JCuda.setExceptionsEnabled(true);
        int nRuns = 150000;
        int nX = 500;
        int i = 0;
        while (true)
        {
            System.out.println("Run "+i);
            cublasHandle handle = new cublasHandle();
            cublasCreate(handle);

            Pointer d_A = new Pointer();
            Pointer d_C = new Pointer();
            cudaMalloc(d_A, nRuns * nX * Sizeof.DOUBLE);
            cudaMalloc(d_C, nX * nX * Sizeof.DOUBLE);
            double dh_A[] = new double[nRuns * nX];
            double dh_C[] = new double[nX * nX];

            cublasSetVector(nRuns * nX, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
            cublasSetVector(nX * nX, Sizeof.DOUBLE, Pointer.to(dh_C), 1, d_C, 1);
            Pointer pAlpha = Pointer.to(new double[]{1.0d});
            Pointer pBeta = Pointer.to(new double[]{0.0d});
            cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_T, nX, nRuns, pAlpha, d_A, nRuns, pBeta, d_C, nX);
            cublasGetVector(nX * nX, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);

            cudaFree(d_A);
            cudaFree(d_C);
            cublasDestroy(handle);
            
            long free[] = { 0 };
            long total[] = { 0 };
            JCuda.cudaMemGetInfo(free, total);
            
            System.out.println("Run "+i+" DONE, free "+free[0]);
            i++;
        }
    }
}

But based on the progress, it will take many hours (maybe a few days) until it comes close to 140000. I’m not sure how to reproduce this.

The message about the non-responding driver usually means that a kernel is simply running for too long: https://devtalk.nvidia.com/default/topic/459869/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/

Could it be that for some inputs, the cublasDsyrk call simply takes more than 2 seconds on your machine? Something like this could explain arbitrary errors, I guess.
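One way to check is to time the call directly. Below is a minimal, hypothetical helper (not part of JCuda) for doing that; the Runnable is just a stand-in for the real work. On a real run the Runnable would wrap the cublasDsyrk call followed by JCuda.cudaDeviceSynchronize(), since kernel launches return asynchronously and the clock must not be read before the kernel has actually finished:

```java
// Hypothetical helper: times an arbitrary GPU call. The Runnable is a
// placeholder; in the real program it would contain the cublasDsyrk call
// plus JCuda.cudaDeviceSynchronize() so the asynchronous launch is
// complete before the end time is taken.
public class KernelTimer
{
    public static long elapsedMillis(Runnable gpuCall)
    {
        long t0 = System.nanoTime();
        gpuCall.run();
        return (System.nanoTime() - t0) / 1_000_000L;
    }

    public static void main(String[] args)
    {
        // Stand-in CPU workload; replace with the wrapped cublasDsyrk call.
        long ms = elapsedMillis(() -> {
            double sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += Math.sqrt(i);
            if (sum < 0) System.out.println(sum); // keep the loop from being optimized away
        });
        System.out.println("Elapsed: " + ms + " ms");
    }
}
```

If anything near or above 2 seconds shows up under WDDM, the watchdog becomes the prime suspect.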

Is the “free” memory that is reported in the above program decreasing for you? It should remain constant…

You need a lower spec GPU!

With slight modification of your code


        JCublas2.setExceptionsEnabled(true);
        JCuda.setExceptionsEnabled(true);
        int nRuns = 10000;
        int nX = 500;
        int i = 0;
        while (true) {
            nRuns += 10000;
            cublasHandle handle = new cublasHandle();
            cublasCreate(handle);
            Pointer d_A = new Pointer();
            Pointer d_C = new Pointer();
            cudaMalloc(d_A, nRuns * nX * Sizeof.DOUBLE);
            cudaMalloc(d_C, nX * nX * Sizeof.DOUBLE);
            double dh_A[] = new double[nRuns * nX];
            double dh_C[] = new double[nX * nX];

            cublasSetVector(nRuns * nX, Sizeof.DOUBLE, Pointer.to(dh_A), 1, d_A, 1);
            cublasSetVector(nX * nX, Sizeof.DOUBLE, Pointer.to(dh_C), 1, d_C, 1);
            Pointer pAlpha = Pointer.to(new double[]{1.0d});
            Pointer pBeta = Pointer.to(new double[]{0.0d});
            cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_T, nX, nRuns, pAlpha, d_A, nRuns, pBeta, d_C, nX);
            cublasGetVector(nX * nX, Sizeof.DOUBLE, d_C, 1, Pointer.to(dh_C), 1);

            cudaFree(d_A);
            cudaFree(d_C);
            cublasDestroy(handle);

            long free[] = {0};
            long total[] = {0};
            JCuda.cudaMemGetInfo(free, total);
            System.out.println("Run " + i + " DONE, free " + free[0] + " nRuns " + nRuns);
            i++;
        }


I get


Run 0 DONE, free 3564519424 nRuns 20000
Run 1 DONE, free 3564519424 nRuns 30000
Run 2 DONE, free 3564519424 nRuns 40000
Run 3 DONE, free 3564519424 nRuns 50000
Run 4 DONE, free 3564519424 nRuns 60000
Run 5 DONE, free 3564519424 nRuns 70000
Run 6 DONE, free 3564519424 nRuns 80000
Run 7 DONE, free 3564519424 nRuns 90000
Run 8 DONE, free 3564519424 nRuns 100000
Run 9 DONE, free 3564519424 nRuns 110000
[GC (Allocation Failure)  2561922K->2540244K(2834944K), 0.0052055 secs]
[GC (Allocation Failure)  2540244K->2540292K(2834944K), 0.0048958 secs]
[Full GC (Allocation Failure)  2540292K->1073K(138240K), 0.3011126 secs]
Run 10 DONE, free 3564519424 nRuns 120000
Exception in thread "main" jcuda.CudaException: cudaErrorLaunchTimeout
	at jcuda.runtime.JCuda.checkResult(JCuda.java:434)
	at jcuda.runtime.JCuda.cudaFree(JCuda.java:4211)
	at datamining.Cublastest.main(Cublastest.java:50)
Java Result: 1

I think the main thing I am worried about is not how to avoid the exception, but how to recover from it. All attempts so far have failed. Even if I trap the exception, it seems to fail at the next CUDA line. Is there some kind of reset? I tried JCuda.cudaDeviceReset() after an exception, but it still throws an exception at the next CUDA statement. Same problem with JCuda.initialize().

cublasDsyrk can easily take more than 2 seconds on my machine. But my impression is that it is pretty consistent how large nRuns can get before it fails. Let me check…

I do seem to get launch timeouts if I start too large, but if it creeps up it is OK…until about 125000 nRuns.

In general regarding the exception handling:

I think it is not possible to recover from the exception at all. Note that the exception handling in JCuda is only a convenience feature that throws an exception whenever one of the underlying CUDA functions returns an error. Additionally, as indicated by the documentation of all CUDA functions, these error return codes may stem from previous, asynchronous launches. (This is obviously the case here: it looks like the cudaErrorLaunchTimeout was caused by cudaFree, but of course, it can basically only come from cublasDsyrk.)

And according to the Documentation of cudaErrorLaunchTimeout :

cudaErrorLaunchTimeout = 6

This indicates that the device kernel took too long to execute. This can only occur if timeouts are enabled - see the device property kernelExecTimeoutEnabled for more information. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.

So this error indicates that things are irrecoverably messed up.
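Given that the quoted documentation says all device allocations become invalid, the only conceivable “recovery” would be to tear everything down and rebuild from scratch. The sketch below shows that pattern only in outline: GpuRetry is a hypothetical helper (not part of JCuda), and the task and rebuild step are abstracted as callbacks so the structure can be shown without a GPU. In the real program the rebuild hook would call JCuda.cudaDeviceReset(), re-create the cublasHandle, and redo every cudaMalloc/cublasSetVector; whether the device actually comes back after a WDDM timeout is not guaranteed.

```java
import java.util.function.Supplier;

// Sketch of a "tear down and rebuild" retry. GpuRetry is hypothetical.
// jcuda.CudaException extends RuntimeException, which is why catching
// RuntimeException covers JCuda's exception-based error reporting.
public class GpuRetry
{
    public static <T> T runWithReset(Supplier<T> task, Runnable resetAndRebuild)
    {
        try
        {
            return task.get();
        }
        catch (RuntimeException e)
        {
            // In the real program: JCuda.cudaDeviceReset(), then re-create
            // the cublasHandle and ALL device allocations before retrying.
            resetAndRebuild.run();
            return task.get(); // a second failure propagates to the caller
        }
    }
}
```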


Regarding the actual error, there are several possible issues:

  1. Kernel timeouts
  2. Memory limits

Regarding the first possible reason, Kernel Timeouts:

You should definitely try disabling the Windows Watchdog, as described in the NVIDIA forum thread that I already mentioned above: "Display driver stopped responding and has recovered" WDDM Timeout Detection and Recovery - CUDA Programming and Performance - NVIDIA Developer Forums
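For reference, the WDDM watchdog (TDR) is controlled by registry values under the GraphicsDrivers key. The value names below are the ones documented by Microsoft (TdrDelay defaults to 2 seconds; TdrLevel 0 disables timeout detection, 3 is the default), but treat this .reg fragment as a sketch and double-check against the current TDR documentation before applying it; a reboot is required afterwards:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrDelay: seconds before the watchdog fires (default is 2)
"TdrDelay"=dword:0000003c
; TdrLevel: 0 disables timeout detection entirely; 3 is the default
"TdrLevel"=dword:00000000
```

(0x3c is 60 seconds; raising TdrDelay is the gentler option compared to disabling detection outright.)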

(I cannot say for certain that this will solve the error, but from the description so far, it should be worth a try…)


Regarding the second possible reason, Memory limits:

Originally I said that this should not be an issue, because from your description, it sounded like the allocated memory size remained the same all the time. But if you are constantly increasing the memory size, then you might indeed hit a wall sooner or later. There is a limit for the maximum memory allocation. Older versions of the CUDA toolkit made some vague statements. Quoting from a very old one:

The maximum size of a single allocation created by cudaMalloc or cuMemAlloc is limited to:

MIN ( ( System Memory Size in MB - 512 MB ) / 2, PAGING_BUFFER_SEGMENT_SIZE )

For Vista, PAGING_BUFFER_SEGMENT_SIZE is approximately 2GB.

I’m not sure whether this is still true for the newer versions (the newest release notes do not contain such a statement, but who knows…). So an allocation of 120000 * 500 * 8 = 480000000 bytes might already be too large.
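Whether a single allocation approaches such a limit is easy to compute up front. One detail worth hedging on: in Java, `nRuns * nX * Sizeof.DOUBLE` is evaluated in int arithmetic and would silently overflow once the product exceeds 2^31 - 1 (the sizes in this thread are still below that, at 600,000,000 bytes for 150,000 x 500). A small sketch using long arithmetic, with a hypothetical helper class:

```java
// Sketch: compute the requested allocation size in bytes using long
// arithmetic. Sizeof.DOUBLE is 8; the 8L forces the whole multiplication
// into long, avoiding silent int overflow for very large matrices.
public class AllocSize
{
    public static long bytes(long nRuns, long nX)
    {
        return nRuns * nX * 8L;
    }

    public static void main(String[] args)
    {
        System.out.println(bytes(120000, 500)); // 480000000, matching the estimate above
        System.out.println(bytes(150000, 500)); // 600000000, i.e. ~600 MB
    }
}
```

Comparing this number against the "free" value reported by cudaMemGetInfo before each cudaMalloc would show immediately whether an allocation request is the thing that fails.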

Again, this is also only a guess.

Further tests could be:

  • Does it crash when you start with nRuns = 120000 or so?
  • Does an equivalent plain C CUDA program work?
    (The latter is what I’m usually doing to make sure that the reason lies in CUDA and not in JCuda, but for this, I first need a case where it reproducibly crashes with JCuda on my machine.)

Many thanks, I will try some of these ideas, once I get my machine back (it is processing and will take another day at least before it finishes). Meanwhile I simply limit nX (it is the number of terms in the logistic model, and if too large you are simply modelling noise and predictions will be crap). This process is all done within a genetic algorithm. I call cublasDsyrk maybe 4 million times.