
Topic: JCUDA asynchronous memory copy for multiple threads

  1. #1
    New User (thread starter)
    Joined 17.02.2017 · 4 posts
    Hi,

    I am using a GPU to accelerate a program written in Java. The program uses an "ExecutorService" to create multiple threads. My idea is that each thread launches a GPU kernel and uses streams so that the kernels run concurrently (all threads share the same context).
    However, I got errors when using asynchronous memory copies, so I wrote a simple test program to narrow the problem down. It still did not work.

    The following is the code I used. Parts of it are from https://forum.byte-welt.net/archive/...hp/t-4082.html.

    Java Code:

    import static jcuda.runtime.JCuda.cudaDeviceSynchronize;
    import static jcuda.runtime.JCuda.cudaSetDevice;
    import static jcuda.runtime.JCuda.cudaFree;
    import static jcuda.runtime.JCuda.cudaFreeHost;
    import static jcuda.runtime.JCuda.cudaHostAlloc;
    import static jcuda.runtime.JCuda.cudaHostAllocWriteCombined;
    import static jcuda.runtime.JCuda.cudaMalloc;
    import static jcuda.runtime.JCuda.cudaMemcpy;
    import static jcuda.runtime.JCuda.cudaMemcpyAsync;
    import static jcuda.runtime.JCuda.cudaStreamCreate;
    import static jcuda.runtime.JCuda.cudaStreamDestroy;
    import static jcuda.runtime.JCuda.cudaStreamSynchronize;
    import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
    import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;
    import static jcuda.driver.JCudaDriver.*;

    import jcuda.*;
    import jcuda.driver.*;
    import jcuda.runtime.*;

    import java.io.*;
    import java.nio.ByteOrder;
    import java.nio.FloatBuffer;
    import java.util.Arrays;
    import java.util.concurrent.*;

    public class JCudaVectorAdd
    {
        static ExecutorService masterExecutor;
        static ExecutorService mapExecutor;

        public static void main(String[] args) throws IOException
        {
            JCudaVectorAdd obj = new JCudaVectorAdd();
            obj.run();
        }

        public void run()
        {
            masterExecutor = Executors.newSingleThreadExecutor();
            mapExecutor = Executors.newFixedThreadPool(1);

            for (int i = 0; i < 1; i++)
            {
                Masterjob masterjob = new Masterjob();
                Future<Integer> reduceResult = masterExecutor.submit(masterjob);
                try
                {
                    // Wait for the master job to finish
                    reduceResult.get();
                }
                catch (Throwable ex)
                {
                }
            }
            masterExecutor.shutdownNow();
            mapExecutor.shutdownNow();
        }

        private class Masterjob implements Callable<Integer>
        {
            @Override
            public Integer call()
            {
                final CountDownLatch runningJobs = new CountDownLatch(1);
                try
                {
                    for (int i = 0; i < 1; i++)
                    {
                        mapExecutor.submit(new ReadMapReduceJob(runningJobs));
                    }
                    runningJobs.await();
                    return 0;
                }
                catch (Throwable ex)
                {
                    return 0;
                }
            }
        }

        private class ReadMapReduceJob implements Runnable
        {
            final CountDownLatch runningJobs;

            private ReadMapReduceJob(CountDownLatch runningJobs)
            {
                this.runningJobs = runningJobs;
            }

            @Override
            public void run()
            {
                try
                {
                    JCudaDriver.setExceptionsEnabled(true);
                    cuInit(0);
                    CUdevice device = new CUdevice();
                    cuDeviceGet(device, 0);
                    CUcontext context = new CUcontext();
                    cuCtxCreate(context, 0, device);

                    for (int time = 0; time < 100; time++)
                    {
                        int numElements = 100000;
                        Pointer A = new Pointer();
                        JCudaDriver.cuMemAllocHost(A, numElements * Sizeof.FLOAT);
                        FloatBuffer aa = A.getByteBuffer(0, numElements * Sizeof.FLOAT)
                            .order(ByteOrder.nativeOrder()).asFloatBuffer();
                        aa.position(0);
                        float[] expecteds = new float[numElements];
                        Arrays.fill(expecteds, 3.33f);
                        aa.put(expecteds, 0, numElements);

                        cudaStream_t stream_new = new cudaStream_t();
                        cudaStreamCreate(stream_new);
                        Pointer deviceInputA = new Pointer();
                        cudaMalloc(deviceInputA, numElements * Sizeof.FLOAT);
                        cudaMemcpyAsync(deviceInputA, A, numElements * Sizeof.FLOAT,
                            cudaMemcpyHostToDevice, stream_new);
                        //cudaMemcpy(deviceInputA, A, numElements * Sizeof.FLOAT, cudaMemcpyHostToDevice);

                        Pointer hostOutput = new Pointer();
                        JCudaDriver.cuMemAllocHost(hostOutput, numElements * Sizeof.FLOAT);
                        cudaMemcpyAsync(hostOutput, deviceInputA, numElements * Sizeof.FLOAT,
                            cudaMemcpyDeviceToHost, stream_new);
                        //cudaMemcpy(hostOutput, deviceInputA, numElements * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
                        FloatBuffer cc = hostOutput.getByteBuffer(0, numElements * Sizeof.FLOAT)
                            .order(ByteOrder.nativeOrder()).asFloatBuffer();
                        float[] host_output = new float[numElements];
                        cc.get(host_output);
                        cc.rewind();

                        boolean equal = Arrays.equals(expecteds, host_output);
                        System.out.println("Equal? " + equal);

                        cudaFreeHost(hostOutput);
                        cudaFreeHost(A);
                        cudaFree(deviceInputA);
                        cudaStreamDestroy(stream_new);
                    }
                }
                catch (Throwable ex)
                {
                }
                finally
                {
                    runningJobs.countDown();
                }
            }
        }
    }


    I used CUDA 7.5 and Java 8 to compile and run. The results are a mix of right and wrong.
    Code:
    Equal? false
    Equal? false
    Equal? false
    Equal? false
    Equal? false
    Equal? false
    Equal? false
    Equal? false
    Equal? true
    .....
    If I comment out cudaMemcpyAsync() and uncomment cudaMemcpy() instead, the results are all right.

    Thanks in advance :)
    Last edited by Marco13 (17.02.2017, 21:50). Reason: Formatting

  2. #2
    Global Moderator
    Joined 05.08.2008 · 4,967 posts
    I could only run a short/limited test right now, but can do a more detailed test on Sunday or Monday, if necessary.

    But the result that you observe is not really unexpected:
    Java Code:

    Pointer hostOutput = new Pointer();
    JCudaDriver.cuMemAllocHost(hostOutput, numElements * Sizeof.FLOAT);

    // Here you are starting the ASYNCHRONOUS memory copy
    cudaMemcpyAsync(hostOutput, deviceInputA, numElements * Sizeof.FLOAT, cudaMemcpyDeviceToHost, stream_new);

    // This code is executed immediately, possibly BEFORE the ASYNCHRONOUS operation is finished
    FloatBuffer cc = hostOutput.getByteBuffer(0, numElements * Sizeof.FLOAT).order(ByteOrder.nativeOrder()).asFloatBuffer();
    float[] host_output = new float[numElements];
    cc.get(host_output);
    cc.rewind();

    So you are reading the data from the hostOutput pointer into the Java host_output array while CUDA may still be copying the data from the device to the hostOutput pointer.

    If you immediately want to read this data, then you should simply use cudaMemcpy and not cudaMemcpyAsync.

    (Alternatively, you could add
    Java Code:

    cudaMemcpyAsync(hostOutput, deviceInputA, numElements * Sizeof.FLOAT, cudaMemcpyDeviceToHost, stream_new);
    cudaStreamSynchronize(stream_new); // <-- This
    but this would be strange, because it would have the same effect as simply using cudaMemcpy...)

    If this does not answer your question, or you want to use cudaMemcpyAsync for a reason that is not obvious in the given example, just drop me a note.
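    The timing issue can be illustrated without a GPU or JCuda at all. The following is a plain-Java sketch (an analogy, not JCuda code): a single-threaded executor plays the role of the asynchronous copy engine, and Future.get() plays the role of cudaStreamSynchronize(). The data is only guaranteed to be in hostOutput after get() returns.

    ```java
    import java.util.Arrays;
    import java.util.concurrent.*;

    public class AsyncCopyAnalogy
    {
        public static void main(String[] args) throws Exception
        {
            // The "copy engine": performs copies asynchronously to the caller
            ExecutorService copyEngine = Executors.newSingleThreadExecutor();

            float[] deviceData = new float[100000];
            Arrays.fill(deviceData, 3.33f);
            float[] hostOutput = new float[deviceData.length];

            // Like cudaMemcpyAsync: the copy merely STARTS here
            Future<?> copy = copyEngine.submit(() ->
                System.arraycopy(deviceData, 0, hostOutput, 0, deviceData.length));

            // Like cudaStreamSynchronize: wait until the copy has finished.
            // Reading hostOutput BEFORE this call could observe stale data.
            copy.get();

            System.out.println("Equal? " + Arrays.equals(deviceData, hostOutput));
            copyEngine.shutdown();
        }
    }
    ```

    Without the get() call, the println line could run while the "copy" is still in flight, which is exactly the situation in the JCuda code above.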

  3. #3
    New User (thread starter)
    Joined 17.02.2017 · 4 posts
    Thanks very much for your response.
    I added cudaStreamSynchronize() to the original code. It works now!

    In my earlier post I did not state my idea clearly (my fault). I want to create many threads. Each thread launches a GPU kernel and uses streams so that the kernels run concurrently (all threads share the same context).

    If I use cudaStreamSynchronize() in my program, then all the kernels can run concurrently and my problem is solved.

    Thanks again for your quick reply.

  4. #4
    Global Moderator
    Joined 05.08.2008 · 4,967 posts
    Note that streams and multiple threads will not make your kernels run concurrently. Specifically: when you have one GPU and want to execute two kernels, each of which needs 2 seconds, then doing this (pseudocode)
    Java Code:
    new Thread(() -> launchKernel(kernelA)).start();
    new Thread(() -> launchKernel(kernelB)).start();
    will not be faster than just doing this
    Java Code:
    launchKernel(kernelA);
    launchKernel(kernelB);
    It will take 4 seconds in both cases. When the GPU is busy with one kernel, it cannot execute another kernel at the same time.
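    This serialization can be mimicked in plain Java (again an analogy, not JCuda): an executor with a single worker thread stands in for the single GPU, and Thread.sleep() stands in for the 2-second kernels, scaled down to 200 ms. Two "kernels" are launched back to back, but the total time is still the sum of the kernel times:

    ```java
    import java.util.concurrent.*;

    public class SingleGpuAnalogy
    {
        public static void main(String[] args) throws Exception
        {
            // One GPU = one worker: only one "kernel" can run at a time
            ExecutorService gpu = Executors.newSingleThreadExecutor();

            // A "kernel" that needs 200 ms (standing in for the 2 seconds above)
            Runnable kernel = () -> {
                try { Thread.sleep(200); } catch (InterruptedException e) { }
            };

            long start = System.nanoTime();
            Future<?> a = gpu.submit(kernel); // "launchKernel(kernelA)"
            Future<?> b = gpu.submit(kernel); // "launchKernel(kernelB)"
            a.get();
            b.get();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // The kernels ran one after the other: ~400 ms, not ~200 ms
            System.out.println("Elapsed >= 400 ms? " + (elapsedMs >= 400));
            gpu.shutdown();
        }
    }
    ```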

    The main purpose of asynchronous operations is to execute kernels and memory copies at the same time. Modern GPUs have (limited) support for "concurrent copy and compute".

    For example, consider the following operations (Again: This is VERY simplified pseudocode)
    Java Code:
    launchKernel(kernelA);
    copyFromDeviceToHost(resultOfKernelA);
    launchKernel(kernelB);
    copyFromDeviceToHost(resultOfKernelB);

    With synchronous copy operations, the "timeline" will look like this:
    Code:
    [--- execute kernel A---]
                            [--- copy result A to host---]
                                                         [--- execute kernel B---]
                                                                                 [--- copy result B to host---]
    When you do asynchronous copy operations, then the timeline may look like this
    Code:
    [--- execute kernel A---]
                            [--- copy result A to host---]
                            [--- execute kernel B---]
                                                          [--- copy result B to host---]
    saving some execution time.
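    The second timeline can also be mimicked on the CPU (once more just a plain-Java analogy, not JCuda): one single-threaded executor stands in for the GPU's compute engine and another for its copy engine. With each operation taking 200 ms, fully serial execution would need about 800 ms, but because copying result A overlaps with kernel B, about 600 ms suffice:

    ```java
    import java.util.concurrent.*;

    public class CopyComputeOverlap
    {
        public static void main(String[] args) throws Exception
        {
            // The compute engine and the copy engine can each run one
            // operation at a time, but independently of each other
            ExecutorService compute = Executors.newSingleThreadExecutor();
            ExecutorService copy = Executors.newSingleThreadExecutor();

            // Each "kernel" or "copy" takes 200 ms
            Runnable op = () -> {
                try { Thread.sleep(200); } catch (InterruptedException e) { }
            };

            long start = System.nanoTime();
            Future<?> kernelA = compute.submit(op);
            kernelA.get();                          // kernel A must finish first
            Future<?> copyA   = copy.submit(op);    // copy result A ...
            Future<?> kernelB = compute.submit(op); // ... while kernel B runs
            copyA.get();
            kernelB.get();
            Future<?> copyB = copy.submit(op);      // copy result B
            copyB.get();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // Serial: ~800 ms. Overlapped: ~600 ms (750 ms threshold for slack)
            System.out.println("Overlapped? " + (elapsedMs < 750));
            compute.shutdown();
            copy.shutdown();
        }
    }
    ```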

    But these are comparatively sophisticated, high-level optimizations. I'd recommend considering them only when you have a clear idea of your "data flow" and can foresee that they will bring a speedup. The synchronization between multiple threads and streams is difficult to get right, and hard to debug.

    I'm not an expert in all the details and do not have much practical experience with complex CUDA applications, but you should be aware that threads + streams do not make the implementation easier.

    And again: You will not be able to execute two kernels at the same time (on a single GPU).

  5. #5
    New User (thread starter)
    Joined 17.02.2017 · 4 posts
    Hi,

    To the best of my knowledge, GPUs with compute capability >= 3.5 have a new hardware feature, Hyper-Q. With Hyper-Q, the GPU has multiple hardware work queues instead of a single one, so it can handle concurrent kernels and/or memory transfers in separate CUDA streams truly independently.

    Here is a description of Hyper-Q:
    Kepler Tuning Guide :: CUDA Toolkit Documentation

    Here is a Hyper-Q programming example, including a PDF file:
    CUDA Samples :: CUDA Toolkit Documentation

    And here is a link that shows multiple streams running concurrently:
    https://devblogs.nvidia.com/parallel...y-concurrency/

    Now I am trying to run multiple streams concurrently in Java. If it works, I will let you know.

    Thanks.

  6. #6
    New User (thread starter)
    Joined 17.02.2017 · 4 posts
    Hi,

    Thanks for your kind note.

    As far as I know, GPUs with compute capability >= 3.5 have a hardware feature, Hyper-Q. With Hyper-Q, the GPU has multiple hardware work queues, so it can handle concurrent kernels and/or memory transfers in separate CUDA streams truly independently.

    This is the link to a description of Hyper-Q:
    Kepler Tuning Guide :: CUDA Toolkit Documentation

    This is an example of streams running concurrently; the results, shown with nvprof, confirm that the streams really do run concurrently.
    https://devblogs.nvidia.com/parallel...y-concurrency/

    Now I am trying to run multiple streams concurrently in Java with JCUDA. The problem is that JCUDA does not work with nvprof, so I cannot check the result that way. However, I have already used the Multi-Process Service to run GPU calls of multiple different processes concurrently with JCUDA. (https://docs.nvidia.com/deploy/pdf/C...e_Overview.pdf)

    If I find that multiple streams within one process can run concurrently with JCUDA, I will post here.

  7. #7
    L-ectron-X (Moderator)
    Joined 16.07.2006 · Drebkau · 2,278 posts
    The spam filter has done its work again ... The last two posts were unlocked by me.
    Best regards,
    L-ectron-X


  8. #8
    Global Moderator
    Joined 05.08.2008 · 4,967 posts
    @Shanshan Sorry about the spam filter (certain posts appear as "moderated" and have to be activated by a moderator before they become visible - usually this should not take long, and I'm keeping an eye on that, particularly here in the JCuda section, but in this case it seems to have taken 1.5 days).

    And thanks for the links. I may not be entirely up to date with the features that were introduced with Kepler. (I knew about "concurrent copy and compute", but not that they had truly concurrent kernels.)

    I wonder about some details. When each kernel occupies the whole GPU (i.e. all streaming multiprocessors), then the kernels obviously cannot run concurrently - but I assume it is possible for "smaller" kernels that do not occupy all resources. I'll have a closer look at the articles.

    Thanks again, and let me know when you have an example for this in JCuda. (Maybe it could then become one of the https://github.com/jcuda/jcuda-samples ...?)
