How to pass an array of multidimensional rows and two columns

So sorry Marco, the code is Java; I sent it by mistake. Sorry for the disturbance.

As you noticed, the error is caused by trying to call a function, src.charAt(j). This is probably String#charAt from Java. But in C, there are no classes, and therefore no "member functions". In C, src is only a pointer (namely, a pointer to the start of a char array).

In order to access the elements of an array, you have to use [brackets]. So the following should be the correct, compilable solution:

__device__ void levenshteinDistance(int srcLength,char *patternRemoved, int distinctStart, int distinctEnd,char *pattern,int clStart,int clEnd, char *src,int *srcIndices, int strucStart, int strucEnd,int tokensStrucStart,int tokensStrucEnd,int tokensClStart,int tokensClEnd,int wordsClStart,int wordsClEnd,int wordsStrucStart,int wordsStrucEnd,int wordsTokensClStart,int wordsTokensClEnd, int wordsTokensStrucStart, int wordsTokensStrucEnd,int *dX,int *ResultFinal,float *TokensFinal,float *WordsTokensFinal1,float *WordsTokensFinal2,float *WordsFinal1,float *WordsFinal2,float *WordsFinal)
{
    for (int i = distinctStart; i < distinctEnd; i++) {
        for (int j = strucStart; j < strucEnd; j++) {
            if (src[j] == ',')
                dX[i * srcLength + j] = 0;
            else if (src[j] == patternRemoved[i])
                dX[i * srcLength + j] = srcIndices[j];
            else
                dX[i * srcLength + j] = dX[i * srcLength + j - 1];
        }
    }
}

extern "C"
__global__ void ComputationdStructureOnGPUSerial(int srcLength,int numTokenSrc, int srcSructuresNum, int srcWordNum, char *src,int *srcStructuresSFIndices, int *srcIndices,int *srcStructureTokensSFIndices, int *srcStartIndices, int *srcStructureWordsSFIndices,int *srcStructureWordstokensSFIndices,int *srcTokensLengths,int totalLengthDistinct, char *patternRemoved, int *distinctSFIndices,int patternLength,int numPatternToken,int patternClustersNum,int patternWordNum,char *pattern, int *patternClustersSFIndices,int *patternIndices,int *patternTokensSFIndices,int *patternStartIndices,int *PatternWordsSFIndices,int *PatternWordstokensSFIndices,int *patternTokensLengths, int *dX,int *ResultFinal,float *TokensFinal,float *WordsTokensFinal1,float *WordsTokensFinal2,float *WordsFinal1,float *WordsFinal2,float *WordsFinal)
{

    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int distinctStart,distinctEnd,clStart,clEnd,strucStart,strucEnd,tokensStrucStart,tokensStrucEnd,tokensClStart,tokensClEnd,wordsClStart,wordsClEnd,wordsStrucStart,wordsStrucEnd,wordsTokensClStart,wordsTokensClEnd,wordsTokensStrucStart,wordsTokensStrucEnd;

    if(ix<patternClustersNum){
      distinctStart = distinctSFIndices[ix];
      distinctEnd = distinctSFIndices[ix+1];
      clStart = patternClustersSFIndices[ix];
      clEnd = patternClustersSFIndices[ix+1];
      strucStart = srcStructuresSFIndices[ix];
      strucEnd = srcStructuresSFIndices[ix+1];
      tokensStrucStart = srcStructureTokensSFIndices[ix];
      tokensStrucEnd = srcStructureTokensSFIndices[ix+1];
      tokensClStart = patternTokensSFIndices[ix];
      tokensClEnd = patternTokensSFIndices[ix+1];
      wordsClStart = PatternWordsSFIndices[ix];
      wordsClEnd = PatternWordsSFIndices[ix+1];
      wordsStrucStart = srcStructureWordsSFIndices[ix];
      wordsStrucEnd = srcStructureWordsSFIndices[ix+1];
      wordsTokensClStart = PatternWordstokensSFIndices[ix];
      wordsTokensClEnd = PatternWordstokensSFIndices[ix+1];
      wordsTokensStrucStart = srcStructureWordstokensSFIndices[ix];
      wordsTokensStrucEnd = srcStructureWordstokensSFIndices[ix+1];
        levenshteinDistance(srcLength,patternRemoved, distinctStart, distinctEnd,pattern,clStart,clEnd, src,srcIndices,strucStart, strucEnd,tokensStrucStart,tokensStrucEnd,tokensClStart,tokensClEnd,wordsClStart,wordsClEnd,wordsStrucStart,wordsStrucEnd,wordsTokensClStart,wordsTokensClEnd,wordsTokensStrucStart,wordsTokensStrucEnd,dX,ResultFinal,TokensFinal,WordsTokensFinal1,WordsTokensFinal2,WordsFinal1,WordsFinal2,WordsFinal);
    }
}
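
A side note, since the topic title asks about passing a multidimensional array: dX is simply one flat 1D allocation that the kernel indexes as dX[i * srcLength + j], i.e. a row-major flattened 2D matrix. On the Java side, the corresponding allocation could roughly look like this (the dimensions here are only placeholders):

import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import static jcuda.driver.JCudaDriver.cuMemAlloc;

// Placeholder dimensions, adjust to your data:
int numDistinct = 128; // rows    (the "i" range)
int srcLength   = 256; // columns (the "j" range)

// One flat allocation; element (i, j) is at index i * srcLength + j
CUdeviceptr dX = new CUdeviceptr();
cuMemAlloc(dX, (long) numDistinct * srcLength * Sizeof.INT);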

Dear Marco,

This is the CUDA code that I need to convert to JCUDA:

// distributing jobs among devices
for (int i = 0; i < ngpus; i++)
{
    cudaSetDevice(i);
    cudaMemcpyAsync(d_A[i], h_A[i], iBytes, cudaMemcpyDefault, stream[i]);
    cudaMemcpyAsync(d_B[i], h_B[i], iBytes, cudaMemcpyDefault, stream[i]);
    iKernel<<<grid, block, 0, stream[i]>>>(d_A[i], d_B[i], d_C[i], iSize);
    cudaMemcpyAsync(gpuRef[i], d_C[i], iBytes, cudaMemcpyDefault, stream[i]);
}

I will first divide the whole data into the h_A[i] arrays according to the number of streams. Can you convert the above code to JCUDA? I have also included Pyjama in the Java project to support OpenMP, in order to use a parallel for. I have sent another, more detailed email.

                                                Thanks

Most of the code was eventually written with the driver API, so I assume this should happen here as well. Here is an example of how the setup and implementation for this could look:

package jcuda.driver.test;

import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoHAsync;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoDAsync;
import static jcuda.driver.JCudaDriver.cuStreamCreate;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUstream;

public class JCudaDriverSimpleAsyncCopy
{
    public static void main(String[] args)
    {
        int size_x = 100000;
        int size_y = 10000;

        float h_x_array[] = new float[size_x];
        Pointer h_x = Pointer.to(h_x_array);
        CUdeviceptr d_x = new CUdeviceptr();
        cuMemAlloc(d_x, size_x * Sizeof.FLOAT);

        float h_y_array[] = new float[size_y];
        Pointer h_y = Pointer.to(h_y_array);
        CUdeviceptr d_y = new CUdeviceptr();
        cuMemAlloc(d_y, size_y * Sizeof.FLOAT);

        int c = 4;
        int ns = c;
        CUstream stream[] = new CUstream[ns];
        for (int i=0; i<ns; i++)
        {
            stream[i] = new CUstream();
            cuStreamCreate(stream[i], 0); // actually create the stream (a plain "new CUstream()" would still refer to the default stream)
        }
        
        for (int i = 0; i<c; i++){
            int offx = (size_x/c)*i;
            int offy = (size_y/c)*i;
            cuMemcpyHtoDAsync(at(d_x, offx), at(h_x, offx), 
                (size_x / c) * Sizeof.FLOAT, stream[i % ns]); // byte count, not element count

            // Execute Kernel here: 
//            cuLaunchKernel(k,
//                gx, gy, gz,
//                bx, by, bz,
//                0, stream[i%ns], // Make sure to pass in the stream here!
//                kernelParameters, null
//            );
            
            cuMemcpyDtoHAsync(at(h_y, offy), at(d_y, offy), 
                (size_y / c) * Sizeof.FLOAT, stream[i % ns]); // byte count, not element count
        }        
        
        cuCtxSynchronize();
        System.out.println("Done");
    }
    
    // Assumes FLOAT data!!!
    private static Pointer at(Pointer p, int offset)
    {
        return p.withByteOffset(offset * Sizeof.FLOAT);
    }
    private static CUdeviceptr at(CUdeviceptr p, int offset)
    {
        return p.withByteOffset(offset * Sizeof.FLOAT);
    }
}

(Untested, and without a real kernel, but should get the idea across)
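
One more note: the snippet assumes that the usual driver API initialization has already happened. A minimal sketch of that setup (to run at the start of main, before the allocations) could look like this:

import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;
import static jcuda.driver.JCudaDriver.*;

// Basic driver API setup, assumed by the example above:
JCudaDriver.setExceptionsEnabled(true); // let errors throw exceptions instead of returning codes
cuInit(0);                              // initialize the driver API
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);                 // use the first device
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);        // create a context and make it current

Loading the actual kernel would additionally involve cuModuleLoad and cuModuleGetFunction, as in the other JCuda samples.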

Thanks a lot Marco. I will try the code with the aid of your answers and send it to you.

Hi Marco,

Regarding your JCudaDriverSimpleAsyncCopy code above:

I have made my own code, but if I need the time of only the concurrent kernels, how do I do this? Inside the loop of the previous code, you perform three subsequent operations: transferring a part of the data on a certain stream from host to device, then the kernel execution, and finally transferring the data from device to host. Multiple streams have overlapped operations. How can I get the time of only the overlapped kernels? Would the nvvp visual profiler be the solution, instead of events? If so, how can I use the visual profiler with JCUDA? I have only used it with CUDA applications.

                                            Thanks

Marco,
You answered the above question at jcuda - How to get CUDA event starting and ending time without using nvprof - Stack Overflow. The answer is:

There are two ways to do this:

If you can run your JCuda application via the command-line, you can profile it using the command nvprof --profile-child-processes <command to run your JCuda application>

If you cannot run your application via the command-line, open a terminal and run nvprof using the command nvprof --profile-all-processes. Nvprof will go into daemon mode and keep waiting for CUDA activity to happen. Now launch your application as usual from your IDE, and once CUDA activity happens and the application exits, nvprof will print results in its terminal session.

Can you explain it to me? I need the time of only the overlapped kernels.

Another question:

If I split the code into
1. a for loop over the streams, transferring data from host to device, then synchronizing them,
2. a for loop over the streams, running the kernels,
3. a for loop over the streams, transferring data from device to host, synchronizing them before getting the final result,
then I can place CUDA events around the overlapped kernels and get their time, since their code is isolated.

Or is there a way to keep one for loop over the streams, with the sequence of the three operations inside it? Please, help me.

This is the solution for using nvprof with JCUDA. I have used it before in CUDA, but with paths of programs inside CUDA projects.

If you can run your JCuda application via the command-line, you can profile it using the command nvprof --profile-child-processes <command to run your JCuda application>

I just need to know how to run a JCuda application via the command line. Then I am able to run the second command, nvprof --profile-child-processes.

The questions are not entirely clear to me.

I have not used the CUDA profiler extensively. I did use it once, for basic tests. But it did not work well together with Java-based applications (I think starting with CUDA 8), and I haven't regularly tried it since.

It is also not clear what time you want to measure, exactly. You can use CUDA events for timing, or try to use the profiler (but I don't know how well that works with the latest CUDA versions). If it is just about claiming that your implementation is faster than a CPU-based one, you could also just use the wall-clock time, with System.nanoTime().

Regarding the last question:

In my first, basic tests, I usually used a .BAT file as the "command to run your JCuda application". This BAT file should contain the usual command-line call to start a Java application, namely

java -cp ".;EXAMPLE.jar;OTHER.jar" YourMainClass

where the list of JAR files (EXAMPLE and OTHER here) is the list of all JAR files that you need to run your program.
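
Putting it together, the BAT file and the profiling call could roughly look like this (the JAR and class names are placeholders, adjust them to your project):

REM run.bat (placeholder names):
java -cp ".;jcuda.jar;yourapp.jar" your.package.YourMainClass

Then, in a terminal:

nvprof --profile-child-processes run.bat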

Marco,

Regarding the JCudaDriverSimpleAsyncCopy code you sent:

The kernel has several concurrent launches, with data transfers from host to device or device to host. How do I get the time from the first kernel launch to the last kernel launch, including the interference from the data transfer operations? All of them are CUDA operations. Each stream conveys a part of the data to be transferred from host to device, then the kernel execution follows, followed by the transfer in the opposite direction. The operations on the streams are overlapped.

I think that you can use the functions that are described in CUDA Driver API :: CUDA Toolkit Documentation for that.

Are CUDA events used to measure kernel timing only, or can we also measure, between the start and the end event, the data transfer from host to device, the kernel time, and the time of the transfer from device to host? I mean, three CUDA operations between the events. Can we do so?

If we can, then we can put the concurrent kernel executions and transfers inside the events. What is your opinion?

It should be possible to just create events, record their "start" and "end" time, and then compute the difference (regardless of what is done in between). Roughly like this:

        CUevent start = new CUevent();
        CUevent end = new CUevent();
        cuEventCreate(start, 0);
        cuEventCreate(end, 0);

        cuEventRecord(start, stream);  // mark the start, on the stream
        cuLaunchKernel(function,
            gridSizeX,  1, 1, blockSizeX, 1, 1,
            0, stream, kernelParameters, null
        );
        cuEventRecord(end, stream);    // mark the end, on the same stream
        cuStreamSynchronize(stream);   // wait until the "end" event was reached

        float ms[] = { 0.0f };
        cuEventElapsedTime(ms, start, end);
        System.out.println(ms[0]);

But you should be clearer about the time that you want to measure. When you have 4 streams, each kernel is associated with one stream, and each kernel takes 2 seconds on its stream (but runs in parallel to the others), then it's not clear (to me) whether you want "2 seconds" as the result, or "4*2=8 seconds"…
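
If what you want is the time of the whole overlapped region, one option is to record the events on the default stream, around the loop. This is only a sketch, and it assumes that the streams were created with the default flags (so that the legacy default stream synchronizes with them):

        CUevent start = new CUevent();
        CUevent end = new CUevent();
        cuEventCreate(start, 0);
        cuEventCreate(end, 0);

        // Recording on the default stream (null): with default (blocking)
        // streams, "end" only completes after all previously launched
        // work on all streams has finished.
        cuEventRecord(start, null);
        for (int i = 0; i < ns; i++)
        {
            // ... async HtoD copy, kernel launch, async DtoH copy,
            // all on stream[i], as in the example above ...
        }
        cuEventRecord(end, null);
        cuEventSynchronize(end);

        float ms[] = { 0.0f };
        cuEventElapsedTime(ms, start, end);
        System.out.println("Overlapped region took " + ms[0] + " ms");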

Marco,
I need to calculate the time of this code, assuming you convert it to JCUDA code:

for (int i = 0; i < nStreams; i++) {
    int offset = i * bytesPerStream;
    cudaMemcpyAsync(&d_a[offset], &a[offset], bytesPerStream, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], bytesPerStream, cudaMemcpyDeviceToHost, streams[i]);
}

I need to calculate the time of all the overlapping operations. How do I do this? This is why I suggested the visual profiler, but I cannot use it. If you can use CUDA events for this, tell me.

I could run nvprof for a JCUDA project from the command line, and it runs successfully, via the following command:

nvprof --profile-child-processes java -cp <jar path>/<jarname>.jar <packagename>.<mainclassname>

But I cannot run the visual profiler to get a visualization of all processes. Can you help me?

Another question, if this does not work: I see another solution inside the visual profiler, File… Import, which then requests CSV data generated by the command-line profiler. You know that nvprof is executed for the same project on the command line. How do I create this CSV file? Sorry for annoying you.

Can I put

CUDA EVENT START

for (int i = 0; i < nStreams; i++) {
    int offset = i * bytesPerStream;
    cudaMemcpyAsync(&d_a[offset], &a[offset], bytesPerStream, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], bytesPerStream, cudaMemcpyDeviceToHost, streams[i]);
}

CUDA EVENT END

Can this accurately get the time of all streams?
And if the cudaMemcpyAsync is part of a large function that also does other calculations, can I put it between CUDA events?

cuda event start

for (int i = 0; i < nStreams; i++) {
    int offset = i * bytesPerStream;
    largeFunction(&d_a[offset], &a[offset], bytesPerStream, streams[i]); // includes cudaMemcpyAsync at the end
    kernel<<<grid, block, 0, streams[i]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], bytesPerStream, cudaMemcpyDeviceToHost, streams[i]);
}

cuda event end

I have not really used the CUDA profiler. I started it once, with a simple example, several years ago, and saw that it "worked". Later, with CUDA 8, I tried it out, and it no longer worked. Maybe it works again with newer CUDA versions, I don't know. If you got it working in general, you can read in 1. Preparing An Application For Profiling — Profiler 12.3 documentation how to use it.

Imagine you have 4 streams. And the computation takes 2 seconds. What should the result be? 2 seconds or 8 seconds?

If you just want the wall clock time, i.e. the time from the "cuda event start" to the "cuda event end" line, you can just use

long before = System.nanoTime();
...
// The code
...
cuCtxSynchronize();
long after = System.nanoTime();
double ms = (after-before)/1e6;

to compute the number of milliseconds that it took.

You mean that even if the operations between the start time and the end time are executed on the GPU, we can still use the normal Java timer?

long before = System.nanoTime();

for (int i = 0; i < nStreams; i++) {
    int offset = i * bytesPerStream;
    largeFunction(&d_a[offset], &a[offset], bytesPerStream, streams[i]); // includes cudaMemcpyAsync at the end
    kernel<<<grid, block, 0, streams[i]>>>(&d_a[offset]);
    cudaMemcpyAsync(&a[offset], &d_a[offset], bytesPerStream, cudaMemcpyDeviceToHost, streams[i]);
}

long after = System.nanoTime();
double ms = (after-before)/1e6;

Can we also use the Java timer for a set of operations executed as follows?
cuMemcpyHtoD
kernel execution
cuMemcpyDtoH

Despite my asking several times, you still have not really said what you actually want to measure.

With

long before = System.nanoTime();
runSomeCodeHere();
long after = System.nanoTime();
double ms = (after-before)/1e6;

you can measure how long the function runSomeCodeHere took. It does not matter what the function is doing. It starts. A little bit later it finishes. Whether or not this uses CUDA does not matter.

But… I wrote cuCtxSynchronize() there. This has to be called to make sure that the computation is actually finished.

Of course, a profiler or fine-grained CUDA event tracking would provide more details (e.g. how much time was spent for memory copies, and how much for computations). But that’s more work, and … well, you’re likely trying to minimize that.

Thanks Marco. I am trying to measure the time of concurrent executions inside streams and compare it with a single kernel execution. I measure the time of running the kernel with the whole data, and of running the kernel with the data divided over multiple streams.

That’s a somewhat clear question now.

There are still many degrees of freedom. It's not obvious how exactly you are "dividing" the data. (I know that you sent me some code via mail, but I can't read all that.) Usually, you should not have to "divide" the data explicitly; you should just pass the relevant part of the data to the kernel (as it is done with the at(...) functions in JCudaDriverSimpleAsyncCopy).

But in general, it’s really that simple:

double msSingle = -1.0;
double msMulti = -1.0;
{
    long before = System.nanoTime();
    runWithSingleKernel();
    cuCtxSynchronize();
    long after = System.nanoTime();
    msSingle = (after-before)/1e6;
}
{
    long before = System.nanoTime();
    runWithMultipleStreams();
    cuCtxSynchronize();
    long after = System.nanoTime();
    msMulti = (after-before)/1e6;
}
System.out.println("With one stream, it took "+msSingle);
System.out.println("With multiple streams, it took "+msMulti);

Then, just implement the runWithSingleKernel and runWithMultipleStreams accordingly.
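
For completeness, a rough sketch of what runWithMultipleStreams could look like, based on the JCudaDriverSimpleAsyncCopy example above (the kernel launch and the parameters are placeholders, and the at(...) helpers are the ones from that example):

private static void runWithMultipleStreams(
    Pointer h_x, CUdeviceptr d_x, int size_x, CUstream stream[], int ns)
{
    int c = ns;
    for (int i = 0; i < c; i++)
    {
        int offx = (size_x / c) * i;
        int chunkBytes = (size_x / c) * Sizeof.FLOAT;

        // Copy one chunk of the input to the device, on this stream
        cuMemcpyHtoDAsync(at(d_x, offx), at(h_x, offx), chunkBytes, stream[i % ns]);

        // Launch the kernel for this chunk here, passing stream[i % ns]
        // as the stream argument of cuLaunchKernel

        // Copy the result chunk back to the host, on the same stream
        cuMemcpyDtoHAsync(at(h_x, offx), at(d_x, offx), chunkBytes, stream[i % ns]);
    }
    // The caller does cuCtxSynchronize() before taking the end time
}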