[cusolverDnSSgels] segmentation fault when calling cusolverDnSSgels

First of all, thanks for this awesome library!

My question:
I’ve been trying to call cusolverDnSSgels using JCusolver, but am getting a segmentation fault.

I am following the CUDA example from this StackOverflow question: "Trying to run a CusolverSSgels testcase, however it is not working".

The Java version of this example is at the end of my post.

The segmentation fault that occurs when running the Java code is as follows:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f98b9fe8b6d, pid=19450, tid=19455
#
# JRE version: OpenJDK Runtime Environment (17.0+35) (build 17+35-2724)
# Java VM: OpenJDK 64-Bit Server VM (17+35-2724, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libcusolver.so.11+0x20fb6d]  cusolverDnIRSInfosGetNiters+0xd
...

Any help on this would be greatly appreciated.

(NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5)
JCuda version from Maven Central: 11.4.1

SgelsTest.java

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcusolver.cusolverDnHandle;
import jcuda.jcusolver.cusolverStatus;

import java.util.Arrays;

import static jcuda.jcusolver.JCusolverDn.*;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

public class SgelsTest {
  public static void main(String[] args) {
    // This test follows the C implementation here:
    // https://stackoverflow.com/questions/67569389/trying-to-run-a-cusolverssgels-testcase-however-it-is-not-working
    float[] A = {6f, 7f, 6f, 5f, 5f, 5f};
    float[] y = {9f, 3f, 10f};

//    A =
//
//    6   5
//    7   5
//    6   5
//
//
//    y =
//
//    9
//    3
//    10


    // Parameters: C = number of rows (m), M = number of columns (n)
    final int C = 3;
    final int M = 2;
    final int lda = C; // leading dimension of A

    final cusolverDnHandle handle = new cusolverDnHandle();
    int status = cusolverDnCreate(handle);
    System.out.println("cusolver initialisation status = " + cusolverStatus.stringFor(status));

    Pointer dA = new Pointer();
    Pointer dy = new Pointer();
    Pointer dx = new Pointer();

    status = cudaMalloc(dA, (long) A.length * Sizeof.FLOAT);
    System.out.println("malloc A status = " + status + " " + cudaGetErrorName(status) + " " + cudaGetErrorString(status));
    status = cudaMalloc(dy, (long) y.length * Sizeof.FLOAT);
    System.out.println("malloc y status = " + status + " " + cudaGetErrorName(status) + " " + cudaGetErrorString(status));

    float[] x = new float[M];
    status = cudaMalloc(dx, (long) x.length * Sizeof.FLOAT);
    System.out.println("malloc x status = " + status + " " + cudaGetErrorName(status) + " " + cudaGetErrorString(status));

    status = cudaMemcpy(dA, Pointer.to(A), A.length * Sizeof.FLOAT, cudaMemcpyHostToDevice);
    System.out.println("memcpy A status = " + status + " " + cudaGetErrorName(status) + " " + cudaGetErrorString(status));
    status = cudaMemcpy(dy, Pointer.to(y), y.length * Sizeof.FLOAT, cudaMemcpyHostToDevice);
    System.out.println("memcpy y status = " + status + " " + cudaGetErrorName(status) + " " + cudaGetErrorString(status));

    long[] bufferSize =  { 0L };

    Pointer buffer = new Pointer();

    cudaMalloc(buffer, Sizeof.FLOAT);

    // Query the required workspace size (in bytes) for cusolverDnSSgels: m = C rows, n = M columns, nrhs = 1
    status = cusolverDnSSgels_bufferSize(handle, C, M, 1, dA, lda, dy, C, dx, M, buffer, bufferSize);
    System.out.println("status of buffer size = " + status + " " + cusolverStatus.stringFor(status));
    System.out.println("buffer size = " + bufferSize[0]);

    Pointer dWork = new Pointer();
    cudaMalloc(dWork, Sizeof.FLOAT * bufferSize[0]);

    int[] niter = { 0 };  // output: number of refinement iterations performed
    int[] dinfo = { 0 };  // output: solver status info
    status = cusolverDnSSgels(handle, C, M, 1, dA, lda, dy, C, dx, M, dWork, bufferSize[0], niter, dinfo);
    System.out.println("status of sgels = " + status + " " + cusolverStatus.stringFor(status));

    status = cudaMemcpy(Pointer.to(x), dx, x.length * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
    System.out.println("memcpy x status = " + status + " " + cudaGetErrorName(status));

    System.out.println(Arrays.toString(x));

    cudaFree(dA);
    cudaFree(dy);
    cudaFree(dx);
    cudaFree(buffer);
    cudaFree(dWork);
  }
}

PS: Here is the output before the seg fault:

cusolver initialisation status = CUSOLVER_STATUS_SUCCESS
malloc A status = 0 cudaSuccess no error
malloc y status = 0 cudaSuccess no error
malloc x status = 0 cudaSuccess no error
memcpy A status = 0 cudaSuccess no error
memcpy y status = 0 cudaSuccess no error
status of buffer size = 0 CUSOLVER_STATUS_SUCCESS
buffer size = 882688
...

There have been two bugs in JCusolver that caused this:

  • The iter parameter had been an int[] array, but was treated as a pointer internally.
  • The dinfo parameter had been an int[] array, but should have been a Pointer, because it refers to device memory - this is similar to the problem behind the linked StackOverflow question. (The resulting change at the call site is sketched below.)
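
Roughly, the call site changes like this - niter remains a host-side int[], and only dinfo changes (the full, updated program follows below):

// Before the fix: dinfo as a host-side int[] array (this crashed)
int[] dinfo = { 0 };
cusolverDnSSgels(handle, C, M, 1, dA, lda, dy, C, dx, M, dWork, bufferSize[0], niter, dinfo);

// After the fix: dinfo as a Pointer to a single int in device memory
Pointer dinfo = new Pointer();
cudaMalloc(dinfo, Sizeof.INT * 1);
cusolverDnSSgels(handle, C, M, 1, dA, lda, dy, C, dx, M, dWork, bufferSize[0], niter, dinfo);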
(The configuration of all this is a bit difficult. If I had to rewrite JCuda/JCusolver from scratch, I'd probably make **all** parameters pointers; whether a pointer refers to host or device memory could then be sorted out by the user. Right now, there is no pattern or rule for actually "knowing" the type of any pointer. The documentation on the NVIDIA website is extensive, but pretty inconsistent - and even if it _was_ consistent, there would be no way to sensibly feed that information into a code generator. So "configuring" this properly is a manual and clumsy process that I'm not particularly proud of...)

The types are fixed with this commit:

With the updated state, using a Pointer (to device memory) for dinfo, the following program seems to work:

package jcuda.jcusolver.test;
import static jcuda.jcusolver.JCusolverDn.cusolverDnCreate;
import static jcuda.jcusolver.JCusolverDn.cusolverDnSSgels;
import static jcuda.jcusolver.JCusolverDn.cusolverDnSSgels_bufferSize;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

import java.util.Arrays;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.jcusolver.JCusolver;
import jcuda.jcusolver.cusolverDnHandle;
import jcuda.runtime.JCuda;

public class SgelsTest 
{
    public static void main(String[] args) 
    {
        // Enable exceptions and omit further error checks
        JCuda.setExceptionsEnabled(true);
        JCusolver.setExceptionsEnabled(true);
        
        // This test follows the C implementation here:
        // https://stackoverflow.com/questions/67569389
        float[] A = {6f, 7f, 6f, 5f, 5f, 5f};
        float[] y = {9f, 3f, 10f};

        //    A =
        //
        //    6   5
        //    7   5
        //    6   5
        //
        //
        //    y =
        //
        //    9
        //    3
        //    10


        // Parameters: C = number of rows (m), M = number of columns (n)
        final int C = 3;
        final int M = 2;
        final int lda = C; // leading dimension of A

        final cusolverDnHandle handle = new cusolverDnHandle();
        cusolverDnCreate(handle);

        Pointer dA = new Pointer();
        Pointer dy = new Pointer();
        Pointer dx = new Pointer();

        cudaMalloc(dA, (long) A.length * Sizeof.FLOAT);
        cudaMalloc(dy, (long) y.length * Sizeof.FLOAT);

        float[] x = new float[M];
        cudaMalloc(dx, (long) x.length * Sizeof.FLOAT);

        cudaMemcpy(dA, Pointer.to(A), A.length * Sizeof.FLOAT, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, Pointer.to(y), y.length * Sizeof.FLOAT, cudaMemcpyHostToDevice);

        long[] bufferSize =  { 0L };
        Pointer buffer = new Pointer();
        cudaMalloc(buffer, Sizeof.FLOAT);
        cusolverDnSSgels_bufferSize(handle, C, M, 1, dA, lda, dy, C, dx, M, buffer, bufferSize);
        System.out.println("buffer size = " + bufferSize[0]);

        Pointer dWork = new Pointer();
        cudaMalloc(dWork, Sizeof.FLOAT * bufferSize[0]);
        int[] niter = { 0 };
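        // With the fixed bindings, dinfo must point to device memory: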
        Pointer dinfo = new Pointer();
        cudaMalloc(dinfo, Sizeof.INT * 1);
        cusolverDnSSgels(handle, C, M, 1, dA, lda, dy, C, dx, M, dWork, bufferSize[0], niter, dinfo);

        int[] hostDinfo = { -1 };
        cudaMemcpy(Pointer.to(hostDinfo), dinfo, 1 * Sizeof.INT, cudaMemcpyDeviceToHost);
        System.out.println("hostDinfo "+hostDinfo[0]);

        cudaMemcpy(Pointer.to(x), dx, x.length * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
        System.out.println(Arrays.toString(x));

        cudaFree(dA);
        cudaFree(dy);
        cudaFree(dx);
        cudaFree(buffer);
        cudaFree(dWork);
        cudaFree(dinfo);
    }
}

This prints:

buffer size = 882688
hostDinfo 0
[-6.500004, 9.700005]

which seems to be the right result.
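
(For reference, the result can be cross-checked on the host by solving the normal equations (A^T A) x = A^T y directly. A minimal, self-contained sketch, independent of JCuda:)

public class NormalEquationsCheck
{
    public static void main(String[] args)
    {
        // The same 3x2 system as above, with A stored in column-major order
        float[] A = {6f, 7f, 6f, 5f, 5f, 5f};
        float[] y = {9f, 3f, 10f};
        int m = 3, n = 2;

        // Accumulate A^T A (2x2) and A^T y (2x1); element (k,i) of A is A[i * m + k]
        double[][] AtA = new double[n][n];
        double[] Aty = new double[n];
        for (int i = 0; i < n; i++)
        {
            for (int j = 0; j < n; j++)
                for (int k = 0; k < m; k++)
                    AtA[i][j] += A[i * m + k] * A[j * m + k];
            for (int k = 0; k < m; k++)
                Aty[i] += A[i * m + k] * y[k];
        }

        // Solve the 2x2 system with Cramer's rule:
        // AtA = {{121, 95}, {95, 75}}, Aty = {135, 110}, det = 50
        double det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0];
        double x0 = (AtA[1][1] * Aty[0] - AtA[0][1] * Aty[1]) / det;
        double x1 = (AtA[0][0] * Aty[1] - AtA[1][0] * Aty[0]) / det;
        System.out.println(x0 + ", " + x1); // -6.5, 9.7
    }
}

This prints -6.5, 9.7, matching the GPU result up to single precision.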

Unfortunately, there is no simple workaround for that, and I will probably not be able to schedule a quick bugfix release. (I could try, but I would rather do this together with the release for CUDA 11.5, which is already pending as CUDA 11.5 · Issue #43 · jcuda/jcuda-main · GitHub, but I cannot give a timeline here either.)

If this is urgent, you could try to compile the native library on your own, or (if you’re on Windows, and really need a quick solution) I could send you the updated JAR+DLL, but that would really only be for quick, first tests, and should be replaced with the proper 11.5 release as soon as it is available.


Thanks, Marco. It’s not super-urgent, so will wait for the 11.5 release.
Kudos for jcuda - it’s an amazing effort.