Seeking example of double precision matrix matrix (sparse) multiply (no transpose)

Would very much like to see code which uses JCusparse.cusparseDcsrgemm, including setting up the call from cuu arrays and displaying results.

Many thanks in advance
Jack

Do you have a particular question regarding that, or just looking for “any” example?
And with “cuu arrays”, do you refer to the “coo” format?
In any case, I can try to assemble an example with cusparseDcsrgemm, but can’t give an exact date.

Thank you!
I should have been more explicit.
I am using JCusparse.cusparseDcsrgemm and starting from int[] arrays and a double[] which I strongly believe are already in CSR format, so COO is not an issue at the moment. I find myself very confused, especially when setting up the output Pointer (C) and its row and column Pointers. Whatever the case, while the routine returns Success, I am not seeing the values when doing a cudaMemcpy back. I would very much like to see a simple end-to-end example given predefined CSR inputs.

Perhaps it is appropriate that I show you my code and ask for criticisms, rather than asking for example code which may or may not exist.
I created a cleaned-up and simplified version of what I am attempting. The demo is that of a pair of 3x3 sparse matrices for which I know what the results should be, and I coded all that up in a Java class which is in this gist:

https://gist.github.com/KnowledgeGarden/ab7ba44582e3dfb381c9

The workhorse is in the method matrixMult().

Within that, I made a comment field which says:
// PRIMARY AMBIGUITY BEGINS HERE

I am trying to document this well so that it might help others, particularly those new to CUDA as am I, in the future.

I speak of “ambiguity” since documentation has proven hard to find (for me) on appropriate configurations when setting up a primary JCuda call.
An example: given a pair of sparse matrices, it is clear we can know nnzA and nnzB, but allocating space for the C matrix seems heuristic, and possibly wrong. I simply cannot make sense of that.

In any case, the class runs and returns Success, but also, returns no results.

That could be because I am not setting up the call properly, or perhaps I am not doing the proper calls to retrieve the results.

I will very much appreciate pointers to what is missing or incorrect in my code.

While waiting for my previous post to be approved, I should fill in some gaps: the environment I am using.
Work is performed on a Red Hat Enterprise Linux 7.2 64-bit i7 box with an NVIDIA GTX 960. The development environment is Eclipse; in that environment, I am able to run all the related JCuda sample programs.

*** Edit ***

Just in case, I posted a gist of all my test code here
https://gist.github.com/KnowledgeGarden/ab7ba44582e3dfb381c9
testing with two 3x3 sparse matrices.
I indicated areas in the code where I am unable to make full sense of the documentation.

I look forward to critical comments which point out what is missing or is incorrect.

Thanks!

I have not yet looked at the code in detail, but will try it out later today or tomorrow.

First, a short note:

I strongly recommend to call
JCuda.setExceptionsEnabled(true);
JCusparse.setExceptionsEnabled(true);
JCublas.setExceptionsEnabled(true);
as the first calls in your main method, at least during the development phase. When exceptions are enabled, every return code is checked automatically, and an exception with a (somewhat) helpful message will be thrown in case of errors.

(I’ve seen that you added a showReturnValue method, but manually checking all return values is tedious…)

An example: given a pair of sparse matrices, it is clear we can know nnzA and nnzB, but allocating space for the C matrix seems heuristic, and possibly wrong. I simply cannot make sense of that.

That’s the key point here, I guess. I also stumbled over this, but the corresponding documentation shows in a code snippet how to compute the number of non-zero elements for the result matrix (using cusparseXcsrgemmNnz). I started working on an example, and it seems to work well, BUT: handling the matrices and the CSR conversion stuff is a bit odd. Quite a while ago, I started creating some utility classes for this, and I intend to publish them on GitHub eventually. (Most of them are already part of the CG sample at jcuda.org - Samples, but I’m extending and refactoring them a bit - the current version does not support “double” data at all.)

Thank you @Marco13 . Adding error code immediately suggested an error: cudaErrorIllegalAddress on the first line where I call cudaMemcpy to fetch results. Studying that issue now. Many thanks!

<side comment: would be nice if this forum did not present a quick reply unless you are logged in. Retyping after logging in takes time!>

I have a question which reveals the ravages of not programming in C/C++ for a zillion years. In some sense, I am looking for a cookbook which shows how to map CUDA C examples to JCuda code. Here is the current issue. Please consider the snippet of code pointed to for cusparseXcsrgemmNnz. It’s a 2-phase operation: the first phase computes nnzC in a sparse-sparse matrix multiply, and that value is then used in the second phase. The code is this:

int baseC, nnzC;
// nnzTotalDevHostPtr points to host memory
int *nnzTotalDevHostPtr = &nnzC;
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
cudaMalloc((void**)&csrRowPtrC, sizeof(int)*(m+1));
cusparseXcsrgemmNnz(handle, transA, transB, m, n, k,
    descrA, nnzA, csrRowPtrA, csrColIndA,
    descrB, nnzB, csrRowPtrB, csrColIndB,
    descrC, csrRowPtrC, nnzTotalDevHostPtr);

In the Java signature, nnzTotalDevHostPtr will be a Pointer.

If we imagine a transliteration of that code to Java, it would start with:

int nnzC;
Pointer nnzTotalDevHostPtr = new Pointer();

I would like to know how to map
int *nnzTotalDevHostPtr = &nnzC;

into that Pointer object.
Many thanks in advance.

Again just a short hint (I’ll look at your gist today afternoon, I promise ;-)) :

Unfortunately, there is a common pattern in C APIs for this: instead of returning a value (because the return value is usually “occupied” by an error code, due to the lack of exceptions), a pointer to the desired value is passed to the function, and the function then “fills” this pointer. Emulating this in Java is a bit clumsy. In particular, local variables may only be written to via assignments - and nothing else. So the only option is to (either define some “Holder/Wrapper” class, or) let the function write into an array. For CUDA, it’s even trickier, because the pointers that are passed to these methods may be pointers to host or to device memory.

However, in this case (after setting CUSPARSE_POINTER_MODE_HOST mode), you can do

int nnzC[] = { 0 };
cusparseXcsrgemmNnz(..., Pointer.to(nnzC));

and afterwards just use nnzC[0] (or store this in a new local variable, if you want to omit the array index).
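To make this pattern concrete without any CUDA involved, here is a plain-Java sketch of the one-element-array trick. (The method name computeNnz and the value 42 are made up purely for the illustration; the real call would be cusparseXcsrgemmNnz with Pointer.to(nnzC).)

```java
// Illustration (plain Java, no CUDA required) of the one-element-array
// pattern used to emulate C out-parameters like "int *nnzTotalDevHostPtr".
public class OutParameterSketch {

    // Stands in for a C function with signature "int f(..., int *out)":
    // the return value carries a status code, and the actual result is
    // written through the pointer-like array argument.
    static int computeNnz(int[] nnzOut) {
        nnzOut[0] = 42; // the function "fills the pointer"
        return 0;       // status code, 0 == success
    }

    public static void main(String[] args) {
        int[] nnzC = { 0 };
        int status = computeNnz(nnzC);
        // prints "status=0, nnzC=42"
        System.out.println("status=" + status + ", nnzC=" + nnzC[0]);
    }
}
```

The array is only a workaround for the fact that a Java method cannot write into a local int variable of its caller.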

So, I had a look at the gist that you provided, and as already pointed out: When exceptions are enabled, one can see that there is a problem when trying to copy the 2*(nnzA+1) values from the device to the host - not least because the nnzA should be nnzC, as described above.

I did not try to fix this in the given example code. You already mentioned that you did the first steps by using cusparseXcsrgemmNnz to compute the number of nonzero elements of the result matrix.

However, I just pushed a first version of the matrix utility classes that I mentioned to GitHub: jcuda/jcuda-matrix-utils (utility classes for dense and sparse matrices in JCuda).

These classes should not be considered to be a stable, reliable, general matrix library. The interfaces/classes are basically just thin wrappers around „everything that is required to represent a matrix in CSR format“ (namely: Number of rows and columns, row pointers, column indices, values and number of nonzero elements), together with some utility methods to convert between sparse and dense host and device matrices.

Maybe the most important “feature” for testing and debugging is to convert the matrices into a (formatted) String :wink:

But again: These are just utility classes, originally intended only for some samples. There are basically zero error checks and test cases. Maybe I’ll extend and improve them over time, but until now, the version number “0.0.1-SNAPSHOT” should be taken seriously.

However, the repository also contains a sample showing how to do a DGEMM with JCusparse. The most important part for you is probably this one:

Namely the computation of the nonzero elements (and the subsequent call to cusparseDcsrgemm). This part should be fairly independent of the other matrix classes in this package, and shows the basic process, regardless of where the rowPointers/values/columnIndices etc. actually come from.
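For reference, the symbolic phase can be sketched in plain Java on the host. This only illustrates what the nnz computation accomplishes (counting, per row of C = A*B, the distinct columns that will be nonzero); it is of course not how cusparseXcsrgemmNnz is implemented on the GPU.

```java
import java.util.Arrays;

public class CsrGemmNnzSketch {

    // Symbolic phase of C = A * B for CSR matrices: count the nonzeros of
    // each row of C. A is m x k, B is k x n (n only bounds the marker
    // array). Returns rowPtrC of length m+1; rowPtrC[m] is nnzC.
    static int[] csrGemmNnz(int m, int n,
            int[] rowPtrA, int[] colIndA,
            int[] rowPtrB, int[] colIndB) {
        int[] rowPtrC = new int[m + 1];
        boolean[] seen = new boolean[n];
        for (int i = 0; i < m; i++) {
            Arrays.fill(seen, false);
            int count = 0;
            for (int a = rowPtrA[i]; a < rowPtrA[i + 1]; a++) {
                int j = colIndA[a]; // column in A == row in B
                for (int b = rowPtrB[j]; b < rowPtrB[j + 1]; b++) {
                    if (!seen[colIndB[b]]) {
                        seen[colIndB[b]] = true;
                        count++;
                    }
                }
            }
            rowPtrC[i + 1] = rowPtrC[i] + count;
        }
        return rowPtrC;
    }

    public static void main(String[] args) {
        // A = [[1,0,3],[0,2,0],[0,0,4]] in CSR (3x3)
        int[] rowPtrA = { 0, 2, 3, 4 };
        int[] colIndA = { 0, 2, 1, 2 };
        // B = 3x3 identity in CSR
        int[] rowPtrB = { 0, 1, 2, 3 };
        int[] colIndB = { 0, 1, 2 };
        int[] rowPtrC = csrGemmNnz(3, 3, rowPtrA, colIndA, rowPtrB, colIndB);
        // A * I = A, so this prints "rowPtrC = [0, 2, 3, 4], nnzC = 4"
        System.out.println("rowPtrC = " + Arrays.toString(rowPtrC)
                + ", nnzC = " + rowPtrC[3]);
    }
}
```

This also makes clear why the rowPtrC array (of size m+1) can be allocated before nnzC is known, while the value and column-index arrays of C cannot.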

[ot]
Edit:

It should be possible to post even when you are not logged in - even though the post would then be „anonymous“. However, I brought up the question here (in German) whether one can keep the entered text while logging in
[/ot]

Those hints and contributions have been very helpful. Many thanks for that.
While playing with the sample code, I added some instrumentation to DoubleMatrices to better understand how CSR works.

This is a snippet of the sample code + a trace of the CSR arrays:

hostDenseA.set(0, 0, 1.0);
hostDenseA.set(1, 1, 2.0);
hostDenseA.set(0, 2, 3.0);
hostDenseA.set(2, 3, 4.0);
0 2 3 4
0 2 1 3
1.0 3.0 2.0 4.0

Those are precisely what DoubleMatrices makes for a sparse matrix from a dense one.
Here is where I am confused:

I would have expected the CSR arrays to look like these:

0 1 0 2
0 1 2 3
1.0 2.0 3.0 4.0

Perhaps I am about to learn something I didn’t see before?

Many thanks in advance.

Maybe I should emphasize that I’m not an expert here, so it might be that the conversion methods are still buggy or plainly wrong, but … the output is


   1,000    0,000    3,000    0,000 
   0,000    2,000    0,000    0,000 
   0,000    0,000    0,000    4,000 

row [0, 2, 3, 4]
col [0, 2, 1, 3]
val [1.0, 3.0, 2.0, 4.0]

The description on Compressed Row Storage (CRS) (which, due to the name “Netlib”, suggested some credibility for me) says:

The val vector stores the values of the nonzero elements of the matrix A, as they are traversed in a row-wise fashion.

And a row-wise traversal of the matrix gives the values 1,3,2,4

The col_ind vector stores the column indexes of the elements in the val vector

The values are in the columns 0,2,1,3, respectively.

The row_ptr vector stores the locations in the val vector that start a row

val[0] is 1.0, this is where row 0 starts
val[2] is 2.0, this is where row 1 starts
val[3] is 4.0, this is where row 2 starts

So the row pointers should be [0,2,3,?]

The “?” here stands for the number of non-zeros. The linked documentation defines this as “nnz+1”, but this might just be due to the assumption of 1-based indexing. At least, according to the CUSPARSE documentation: cuSPARSE :: CUDA Toolkit Documentation :

Using this data for a quick test

        int rowsA = 4;
        int colsA = 5;
        double epsilon = 1e-10f;

        double m[][] = new double[][] {
            {1,4,0,0,0},
            {0,2,3,0,0},
            {5,0,0,7,8},
            {0,0,9,0,6},
        };
        DoubleMatrixHostDense hostDenseA = 
            DoubleMatrices.createDoubleMatrixHostDense(rowsA, colsA);
        for (int r=0; r<rowsA; r++)
        {
            for (int c=0; c<colsA; c++)
            {
                hostDenseA.set(r, c, m[r][c]);
            }
        }
        DoubleMatrixHostCSR hostCsrA = 
            DoubleMatrices.createDoubleMatrixHostCSR(hostDenseA, epsilon);
        
        System.out.println(hostCsrA);
        System.out.println("row "+Arrays.toString(hostCsrA.getRowPointers()));
        System.out.println("col "+Arrays.toString(hostCsrA.getColumnIndices()));
        System.out.println("val "+Arrays.toString(hostCsrA.getValues()));

prints the result


   1,000    4,000    0,000    0,000    0,000 
   0,000    2,000    3,000    0,000    0,000 
   5,000    0,000    0,000    7,000    8,000 
   0,000    0,000    9,000    0,000    6,000 

row [0, 2, 4, 7, 9]
col [0, 1, 1, 2, 0, 3, 4, 2, 4]
val [1.0, 4.0, 2.0, 3.0, 5.0, 7.0, 8.0, 9.0, 6.0]

which matches the example on the NVIDIA site. Particularly, the last element of the “rowPtr” array is 9, which is the number of non-zero elements. The definition there says

while the last entry contains nnz+csrRowPtrA(0). In general, csrRowPtrA(0) is 0 or 1 for zero- and one-based indexing, respectively.

So for 0-based indexing, this should just be “nnz”.
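The whole conversion can also be reproduced in a few lines of plain Java, independent of the utility classes, using the 3x4 matrix from above. (The class and method names here are made up for the illustration.)

```java
import java.util.Arrays;

public class DenseToCsrSketch {

    // Convert a dense row-major matrix to CSR with 0-based indexing.
    // The caller provides rowPtr of length rows+1, and colInd/val of
    // length nnz (known here in advance).
    static void denseToCsr(double[][] dense,
            int[] rowPtr, int[] colInd, double[] val) {
        int nnz = 0;
        for (int r = 0; r < dense.length; r++) {
            rowPtr[r] = nnz; // position where row r starts
            for (int c = 0; c < dense[r].length; c++) {
                if (dense[r][c] != 0.0) {
                    colInd[nnz] = c;
                    val[nnz] = dense[r][c];
                    nnz++;
                }
            }
        }
        rowPtr[dense.length] = nnz; // last entry is nnz for 0-based indexing
    }

    public static void main(String[] args) {
        double[][] a = {
            { 1, 0, 3, 0 },
            { 0, 2, 0, 0 },
            { 0, 0, 0, 4 },
        };
        int[] rowPtr = new int[4];
        int[] colInd = new int[4];
        double[] val = new double[4];
        denseToCsr(a, rowPtr, colInd, val);
        System.out.println("row " + Arrays.toString(rowPtr)); // [0, 2, 3, 4]
        System.out.println("col " + Arrays.toString(colInd)); // [0, 2, 1, 3]
        System.out.println("val " + Arrays.toString(val)); // [1.0, 3.0, 2.0, 4.0]
    }
}
```

It prints exactly the row/col/val arrays shown earlier in this thread, which matches what DoubleMatrices produces.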

I’d have to take a closer look at your expected output, so see why you expected it, but can try to do this later. If you think that something is wrong with the current implementation, I’d be happy to know about it.

Actually, with your explanation, I believe I am incorrect. In fact, on a deeper look, what I showed is a COO format prior to CSR conversion.
Many thanks!

I have modified my code based on what I just learned and am now able to replicate your results.
Many thanks.
I am about to open another thread regarding a simplified way to transpose a sparse matrix.

[QUOTE=jackpark]
I am about to open another thread regarding a simplified way to transpose a sparse matrix.[/QUOTE]

I’m not aware of any dedicated method for this. But two small hints:

  1. The NVIDIA documentation says
    **Note: The matrix A in CSR format has exactly the same memory layout as its transpose in CSC format (and vice versa). **
    (I didn’t yet wrap my head around this, just wanted to mention it)

  2. You rarely need to transpose. Most BLAS methods accept parameters that indicate whether matrices should be interpreted as being transposed (the “transA” and “transB” in the DGEMM example, for instance). This way, there is no need to shuffle any memory around. The parameters just cause the matrix to be interpreted accordingly.
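Regarding the first hint: it can be checked on the host with a few lines of plain Java. Reading the CSR arrays of A as if they were CSC arrays (rowPtr as colPtr, colInd as rowInd) yields exactly the transpose. (This is just a host-side demonstration; the class and method names are made up.)

```java
import java.util.Arrays;

public class CsrCscTransposeSketch {

    // Read the given arrays as CSC (ptr runs over columns, ind holds row
    // indices) and expand to a dense rows x cols matrix.
    static double[][] readAsCscDense(int rows, int cols,
            int[] ptr, int[] ind, double[] val) {
        double[][] dense = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            for (int k = ptr[c]; k < ptr[c + 1]; k++) {
                dense[ind[k]][c] = val[k];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        // A = [[1,0,3],[0,2,0]] (2x3) in CSR, 0-based indexing:
        int[] rowPtr = { 0, 2, 3 };
        int[] colInd = { 0, 2, 1 };
        double[] val = { 1, 3, 2 };

        // Interpret the SAME three arrays as CSC of a 3x2 matrix:
        double[][] t = readAsCscDense(3, 2, rowPtr, colInd, val);
        // prints [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]], which is A transposed
        System.out.println(Arrays.deepToString(t));
    }
}
```

No memory is shuffled at all; only the interpretation of the arrays changes, which is exactly why the transpose flags in the BLAS-style calls are cheap.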

One of the things I thought was that a matrix in COO format could be transposed by simply swapping cooRow[] with cooCol[]. But I’m working with CSR values. What you said about the CSR format, for me, doesn’t compute - I cannot wrap my head around that.

As it turns out, I do have a use case for multiplying A by its transpose.

Right now, I will be testing a simple 3x3 pair of sparse matrices hand transposed and multiplied – I will be passing that data into the JCuda code to see how it compares.

Incidentally, I added a static method to DoubleMatrices which lets me do in public what you already do inside that code: create a CSR host matrix from numRows, numCols, nnz, csrVal[], csrRow[] and csrCol[]. I have an external routine which crafts a sparse matrix without the use of a dense matrix, and outputs the CSR values directly.

Again, multiplying A by its transpose should be possible with

cusparseDcsrgemm(handle, 
    CUSPARSE_OPERATION_NON_TRANSPOSE,
    CUSPARSE_OPERATION_TRANSPOSE, // transpose here!
    rowsA, colsA, colsA,
    deviceCsrA...,
    deviceCsrA...)

without doing the explicit transpose at all (if there are any difficulties, I can try to create a sample/test case for this next week).

The utility methods could certainly be extended. For now, the Dense->CSR conversion was the most important one for the samples. The method for creating the CSR matrix directly should already exist


but I’ll likely rename this one to “createDoubleMatrixHostCSR” for consistency.

The first part of a multiply is to run the routine which calculates nnzC. If you wish to ultimately do a transpose on B, do you use CUSPARSE_OPERATION_TRANSPOSE on that routine as well? Would CUSPARSE_OPERATION_TRANSPOSE affect nnzC?

*** Edit ***

Oh, and thanks for pointing to create() – it’s true that I went looking for a method named like the others. Many thanks!

In the sample, I wrote

// Compute the number of nonzero elements that the result matrix will 
// have. This is done by calling "cusparseXcsrgemmNnz" with the same 
// parameters that will later be used for the "cusparseDcsrgemm" call

because you have to compute the nnzC with the same parameters. Consider


0 0      1 0     0 0 
1 1  *   1 0  =  2 0

vs


0 0      1 1     0 0 
1 1  *   0 0  =  1 1
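The effect on nnzC can be verified on the host with a small plain-Java sketch (dense multiplication only, no cuSPARSE; the class name is made up). The two B operands below are transposes of each other, yet the products have a different number of nonzeros:

```java
public class NnzDependsOnOperandSketch {

    // Plain dense matrix multiplication, C = A * B.
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, k = b.length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    static int countNonZeros(double[][] m) {
        int nnz = 0;
        for (double[] row : m)
            for (double v : row)
                if (v != 0.0) nnz++;
        return nnz;
    }

    public static void main(String[] args) {
        double[][] a  = { { 0, 0 }, { 1, 1 } };
        double[][] b  = { { 1, 0 }, { 1, 0 } };
        double[][] bT = { { 1, 1 }, { 0, 0 } }; // the transpose of b
        System.out.println("nnz(A*B)  = " + countNonZeros(multiply(a, b)));  // 1
        System.out.println("nnz(A*Bt) = " + countNonZeros(multiply(a, bT))); // 2
    }
}
```

So if the transpose flag differs between the nnz computation and the actual gemm call, the allocated size for C will simply be wrong.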

(I’ll change the method names ASAP, currently I’m busy with some (many) other things)

I am pleased to respond that I crafted a pair of hand-multiplied 3x3 dense matrices with lots of zeros, then ran them, as sparse, through my JCuda code; multiply with transpose on B performed correctly.

Please note that I am faced with repeating this exercise with complex numbers.

Many thanks!