JCUDA asynchronous memory copy for multi threads

Hi,

I am using a GPU to accelerate a program written in Java. The program uses an ExecutorService to create multiple threads. My idea is that each thread launches a GPU kernel and uses streams to make the kernels run concurrently (all the threads share the same context).
However, I got errors when using asynchronous memory copies, so I wrote a simple program to narrow the problem down. It still did not work.

The following is the code I used. Parts of it are from https://forum.byte-welt.net/archive/index.php/t-4082.html.

import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
import static jcuda.runtime.JCuda.cudaSetDevice;
import static jcuda.runtime.JCuda.cudaFree;
import static jcuda.runtime.JCuda.cudaFreeHost;
import static jcuda.runtime.JCuda.cudaHostAlloc;
import static jcuda.runtime.JCuda.cudaHostAllocWriteCombined;
import static jcuda.runtime.JCuda.cudaMalloc;
import static jcuda.runtime.JCuda.cudaMemcpy;
import static jcuda.runtime.JCuda.cudaMemcpyAsync;
import static jcuda.runtime.JCuda.cudaStreamCreate;
import static jcuda.runtime.JCuda.cudaStreamDestroy;
import static jcuda.runtime.JCuda.cudaStreamSynchronize;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToDevice;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToHost;
import static jcuda.driver.JCudaDriver.*;
import jcuda.Pointer;
import jcuda.runtime.cudaEvent_t;
import jcuda.runtime.*;
import java.util.Arrays;
import java.io.*;
import java.util.Iterator;
import java.util.concurrent.*;
import jcuda.*;
import jcuda.driver.*;
import java.nio.FloatBuffer;
import java.nio.ByteOrder;
import java.util.Locale;

public class JCudaVectorAdd
{
    static ExecutorService masterExecutor;
    static ExecutorService mapExecutor;
    public static void main(String args[]) throws IOException
    {
        JCudaVectorAdd obj=new JCudaVectorAdd();
        obj.run();
        return ;
    }

    public void run()
    {
        this.masterExecutor= Executors.newSingleThreadExecutor();
        this.mapExecutor=Executors.newFixedThreadPool(1);

        for(int i=0;i<1;i++)
        {
            Masterjob masterjob=new Masterjob();
            Future<Integer>  reduceResult=masterExecutor.submit(masterjob);
            while(true)
            {
                try
                {  
                    if(  reduceResult.isDone()) 
                        break;
                }
                catch( Throwable ex)
                {
                }
            }  
        }
        masterExecutor.shutdownNow();
        mapExecutor.shutdownNow();
    }  

    private class Masterjob implements Callable<Integer>
    {
        private Masterjob()
        {
        }

        @Override
        public Integer call()
        {
            final CountDownLatch runningjobs=new CountDownLatch(1);
            try
            {  
                for(int i=0;i<1;i++)       
                {
                    mapExecutor.submit(new ReadMapReduceJob(runningjobs));
                }  
                runningjobs.await();
                return 0;
            }
            catch(Throwable ex)
            {  
                return 0;
            }
        }
    }
    private class ReadMapReduceJob implements Runnable
    {

        final CountDownLatch runningjobs;
        private ReadMapReduceJob(CountDownLatch runningjobs)
        {
            this.runningjobs=runningjobs;
        }

        @Override
        public void run()
        {
            try{
                JCudaDriver.setExceptionsEnabled(true);
                cuInit(0);
                CUdevice device = new CUdevice();
                cuDeviceGet(device, 0);
                CUcontext context = new CUcontext();
                cuCtxCreate(context, 0, device);

                for(int time=0;time<100;time++)
                {
                    int numElements = 100000;
                    Pointer A=new Pointer();
                    JCudaDriver.cuMemAllocHost(A,numElements*Sizeof.FLOAT);
                    FloatBuffer  aa= A.getByteBuffer(0,numElements*Sizeof.FLOAT).order(ByteOrder.nativeOrder()).asFloatBuffer();
                    aa.position(0);
                    float [] expecteds=new float[numElements];
                    Arrays.fill(expecteds,3.33f);
                    aa.put(expecteds,0,numElements);   

                    cudaStream_t stream_new=new cudaStream_t();
                    cudaStreamCreate(stream_new);
                    Pointer deviceInputA=new Pointer();
                    cudaMalloc(deviceInputA, numElements*Sizeof.FLOAT);
                    cudaMemcpyAsync(deviceInputA,A,numElements*Sizeof.FLOAT, cudaMemcpyHostToDevice,stream_new);
                    //cudaMemcpy(deviceInputA,A,numElements*Sizeof.FLOAT,cudaMemcpyHostToDevice);

                    Pointer hostOutput=new Pointer();    
                    JCudaDriver.cuMemAllocHost(hostOutput,numElements*Sizeof.FLOAT);
                    cudaMemcpyAsync(hostOutput,deviceInputA,numElements*Sizeof.FLOAT, cudaMemcpyDeviceToHost,stream_new);
                    //cudaMemcpy(hostOutput,deviceInputA,numElements*Sizeof.FLOAT,cudaMemcpyDeviceToHost);
                    FloatBuffer  cc= hostOutput.getByteBuffer(0,numElements*Sizeof.FLOAT).order(ByteOrder.nativeOrder()).asFloatBuffer();
                    float host_output[]=new float[numElements];
                    cc.get(host_output);
                    cc.rewind();

                    boolean equal = Arrays.equals(expecteds, host_output);
                    System.out.println("Equal? "+equal);

                    cudaFreeHost(hostOutput);
                    cudaFreeHost(A);
                    cudaFree(deviceInputA);
                    cudaStreamDestroy(stream_new);
                }      
            }
            catch( Throwable ex)   
            {
            }
            finally
            {
                runningjobs.countDown();
            }

        }
    }
}

I used CUDA 7.5 and Java 8 to compile and run. The results are sometimes wrong and sometimes right:


Equal? false
Equal? false
Equal? false
Equal? false
Equal? false
Equal? false
Equal? false
Equal? false
Equal? true
.....

If I comment out cudaMemcpyAsync() and use cudaMemcpy() instead (see the commented lines), the results are always right.

Thanks in advance :slight_smile:

I could only run a short/limited test right now, but can do a more detailed test on Sunday or Monday, if necessary.

But the result that you observe is not really unexpected:

Pointer hostOutput=new Pointer();    
JCudaDriver.cuMemAllocHost(hostOutput,numElements*Sizeof.FLOAT);

// Here you are starting the ASYNCHRONOUS memory copy
cudaMemcpyAsync(hostOutput,deviceInputA,numElements*Sizeof.FLOAT, cudaMemcpyDeviceToHost,stream_new);

// This code is executed immediately, possibly BEFORE the ASYNCHRONOUS operation is finished
FloatBuffer  cc= hostOutput.getByteBuffer(0,numElements*Sizeof.FLOAT).order(ByteOrder.nativeOrder()).asFloatBuffer();
float host_output[]=new float[numElements];
cc.get(host_output);
cc.rewind();

So you are reading the data from the hostOutput pointer into the Java host_output array while CUDA is still copying the data from the device to the hostOutput pointer.
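The underlying race can be reproduced with plain Java, no GPU required. The sketch below is only an analogy in terms of java.util.concurrent, not JCuda code: the asynchronous "copy" runs on a separate thread, reading the destination buffer too early is exactly the mistake above, and join() plays the role of cudaStreamSynchronize():

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class AsyncReadRace
{
    public static void main(String[] args)
    {
        float[] source = new float[100000];
        Arrays.fill(source, 3.33f);
        float[] hostOutput = new float[source.length];

        // The "asynchronous copy": it runs on another thread,
        // like cudaMemcpyAsync runs on a stream
        CompletableFuture<Void> copy = CompletableFuture.runAsync(
            () -> System.arraycopy(source, 0, hostOutput, 0, source.length));

        // Reading hostOutput HERE could observe incomplete data,
        // just like reading the pinned buffer before the stream is done

        copy.join(); // plays the role of cudaStreamSynchronize(stream)
        System.out.println("Equal? " + Arrays.equals(source, hostOutput));
        // prints "Equal? true"
    }
}
```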

If you immediately want to read this data, then you should simply use cudaMemcpy and not cudaMemcpyAsync.

(Alternatively, you could add

cudaMemcpyAsync(hostOutput,deviceInputA,numElements*Sizeof.FLOAT, cudaMemcpyDeviceToHost,stream_new);
cudaStreamSynchronize(stream_new); // <-- This

but this would be strange, because it would have the same effect as simply using cudaMemcpy…)

If this does not answer your question, or you want to use cudaMemcpyAsync for a reason that is not obvious in the given example, just drop me a note.

Thanks very much for your response.
I added cudaStreamSynchronize() to the original code. It works now!

In my earlier post, I did not state my idea clearly (my fault). I want to create many threads. Each thread launches a GPU kernel and uses streams so that they run concurrently (all the threads share the same context).

If I use cudaStreamSynchronize() in my program, then all the kernels can run concurrently and my problem is solved.

Thanks again for your quick reply.

Note that streams and multiple threads will not make your kernels run concurrently. Specifically: When you have one GPU, and want to execute two kernels, and each kernel needs 2 seconds, then doing this (pseudocode)

new Thread(() -> launchKernel(kernelA)).start();
new Thread(() -> launchKernel(kernelB)).start();

will not be faster than just doing this

launchKernel(kernelA);
launchKernel(kernelB);

It will take 4 seconds in both cases. When the GPU is busy with one kernel, then it cannot execute another kernel at the same time.
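This serialization can be illustrated with plain Java, again without any GPU. In the following sketch (a simplified model, not a CUDA API), a single-permit semaphore stands in for the one GPU, so two launcher threads still execute their "kernels" back to back:

```java
import java.util.concurrent.Semaphore;

public class SingleGpuSerialization
{
    // A single-permit semaphore stands in for one GPU that executes
    // only one kernel at a time (a simplified model, not a CUDA API)
    static final Semaphore gpu = new Semaphore(1);

    static void launchKernel(long millis) throws InterruptedException
    {
        gpu.acquire();            // wait until the "GPU" is free
        try
        {
            Thread.sleep(millis); // the kernel occupies the GPU for its whole duration
        }
        finally
        {
            gpu.release();
        }
    }

    public static void main(String[] args) throws Exception
    {
        long start = System.nanoTime();
        Thread a = new Thread(() -> { try { launchKernel(200); } catch (InterruptedException ignored) {} });
        Thread b = new Thread(() -> { try { launchKernel(200); } catch (InterruptedException ignored) {} });
        a.start(); b.start();
        a.join(); b.join();
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        // Two launcher threads, but the "kernels" still ran back to back
        System.out.println("Elapsed: about " + elapsedMs + " ms (roughly 400, not 200)");
    }
}
```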

The main purpose of asynchronous operations is to execute kernels and memory copies at the same time. Modern GPUs have (limited) support for “concurrent copy and compute”.

For example, consider the following operations (Again: This is VERY simplified pseudocode)

launchKernel(kernelA);
copyFromDeviceToHost(resultOfKernelA); 
launchKernel(kernelB);
copyFromDeviceToHost(resultOfKernelB); 

With synchronous copy operations, the “timeline” will look like this:


[--- execute kernel A---]
                        [--- copy result A to host---]
                                                     [--- execute kernel B---]
                                                                             [--- copy result B to host---]

When you do asynchronous copy operations, then the timeline may look like this


[--- execute kernel A---]
                        [--- copy result A to host---]
                        [--- execute kernel B---]
                                                      [--- copy result B to host---]

saving some execution time.
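This overlap can also be mimicked in plain Java (a simplified model of the hardware, not CUDA): one single-threaded executor stands in for the compute engine and another for the copy engine. Kernel B then runs while result A is being copied, so four 100 ms stages finish in roughly 300 ms instead of 400 ms:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CopyComputeOverlap
{
    static void sleep(long ms)
    {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }

    static void await(Future<?> f)
    {
        try { f.get(); } catch (Exception ignored) {}
    }

    public static void main(String[] args) throws Exception
    {
        // Two independent engines, as on a GPU with concurrent copy and compute
        ExecutorService computeEngine = Executors.newSingleThreadExecutor();
        ExecutorService copyEngine = Executors.newSingleThreadExecutor();

        long start = System.nanoTime();

        // Each "kernel" and each "copy" takes 100 ms.
        // The copy of result A may only start after kernel A has finished,
        // but kernel B can run on the compute engine at the same time.
        Future<?> kernelA = computeEngine.submit(() -> sleep(100));
        Future<?> copyA = copyEngine.submit(() -> { await(kernelA); sleep(100); });
        Future<?> kernelB = computeEngine.submit(() -> sleep(100));
        Future<?> copyB = copyEngine.submit(() -> { await(kernelB); sleep(100); });

        copyA.get();
        copyB.get();
        long elapsedMs = (System.nanoTime() - start) / 1000000;
        computeEngine.shutdown();
        copyEngine.shutdown();

        // Roughly 300 ms instead of the ~400 ms of the fully serial version
        System.out.println("Elapsed: " + elapsedMs + " ms");
    }
}
```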

But these are comparatively sophisticated, high-level optimizations. I’d recommend considering them only when you have a clear idea about your “data flow” and can foresee that this will bring a speedup. The synchronization between multiple threads and streams is difficult, and hard to debug.

I’m not an expert at all the details, and do not have much practical experience with complex CUDA applications, but you should be aware that threads+streams do not make the implementation easier.

And again: You will not be able to execute two kernels at the same time (on a single GPU).

Hi,

To the best of my knowledge, GPUs with compute capability >= 3.5 have a new hardware feature called Hyper-Q. With Hyper-Q, the GPU has multiple hardware work queues instead of a single one, so it can handle concurrent kernels and/or memory transfers in separate CUDA streams truly independently.

Here is a description of Hyper-Q:
Kepler Tuning Guide :: CUDA Toolkit Documentation

Here is a Hyper-Q programming example, including a PDF file:
CUDA Samples :: CUDA Toolkit Documentation

There is also a link which shows multiple streams running concurrently.

Now I am trying to run multiple streams concurrently in Java. If it works, I will let you know.

Thanks.

Hi,

Thanks for your kind note.

As far as I know, GPUs with compute capability >= 3.5 have a hardware feature called Hyper-Q. With Hyper-Q, the GPU has multiple hardware work queues, so it can handle concurrent kernels and/or memory transfers in separate CUDA streams truly independently.
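As a plain-Java analogy for this (a scheduling sketch, not real Hyper-Q or CUDA code): with a pool of two workers standing in for two hardware work queues, two independent 200 ms tasks finish in about 200 ms instead of the roughly 400 ms that a single queue would need.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiQueueSketch
{
    public static void main(String[] args) throws Exception
    {
        // Two workers stand in for two hardware work queues (Hyper-Q analogy)
        ExecutorService queues = Executors.newFixedThreadPool(2);

        long start = System.nanoTime();
        for (int i = 0; i < 2; i++)
        {
            // Two independent 200 ms "kernels", one per stream/queue
            queues.submit(() ->
            {
                try { Thread.sleep(200); } catch (InterruptedException ignored) {}
            });
        }
        queues.shutdown();
        queues.awaitTermination(5, TimeUnit.SECONDS);

        long elapsedMs = (System.nanoTime() - start) / 1000000;
        // About 200 ms: both tasks ran at the same time
        System.out.println("Elapsed: " + elapsedMs + " ms");
    }
}
```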

Here is a link to a description of Hyper-Q:
Kepler Tuning Guide :: CUDA Toolkit Documentation

Here is an example of streams running concurrently. The results, shown with nvprof, confirm that the streams run concurrently.

Now I am trying to run multiple streams concurrently in Java with JCUDA. The problem is that JCUDA does not support nvprof, so I cannot check the result with it. However, I have used the Multi-Process Service to run GPU calls of multiple different processes concurrently with JCUDA. (https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf)

If I find that multiple streams within one process can run concurrently with JCUDA, I will post it here. :slight_smile:

The spam filter has done its work again … The last two posts were unlocked by me.

@Shanshan Sorry for this spam filter thingy (certain posts appear as “Moderated” and have to be activated by a moderator before they become visible - usually, this should not take long, and I’m having an eye on that, particularly here in the JCuda section, but in this case, it seems to have taken 1.5 days)

And thanks for the links. I may not be entirely up to date with the features that have been introduced with Kepler. (I knew about the “concurrent copy+compute”, but not that they had real concurrent kernels.)

I wonder about some details. When each kernel occupies the whole GPU (i.e. all streaming multiprocessors), then they obviously cannot be run concurrently - but I assume that it is possible for “simpler” kernels that do not occupy all resources. I’ll have a closer look at the articles.

Thanks again, and let me know when you have an example for this in JCuda. (Maybe it could then become one of the https://github.com/jcuda/jcuda-samples …?)