KernelLauncher and multiple devices

When I use the KernelLauncher, I don’t have to select a specific device. How does this class choose the device on which it will execute the kernel when there are several in the computer?
I have 2 devices, and I want to run several CPU threads, each of which will execute one or several kernels. Do I have to “route” them in order to use both devices?
Hope I’m clear :P


The KernelLauncher will try to attach to the current context (obtained with cuCtxAttach). If no such context can be obtained, it will create one. The device for this context can be specified: the KernelLauncher from the JCuda Utilities has a method
public static void setDeviceNumber(int number)
that can be used to select the device for which the KernelLauncher should be created.

Actually, this functionality could not really be tested until now, since I hardly ever have the chance to use a PC with two GPUs. If you encounter any problems, please let me know.


Thank you Marco for answering so rapidly!

I created a class to “route” the devices. I’ll try it a bit later (put a low priority on this message, I won’t be able to test it this week!) and I will tell you if it works. I’m not really familiar with the notion of a context. As far as I understand, it is linked to a CPU thread, so each thread creates a context that is used for all the CUDA calls of that thread.
Here is my class. Imagine that a CPU thread creates a context pointer: the context will be created and associated with a device by calling the method “getDevice”. Does it seem correct to you?

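package Utils;
import java.util.concurrent.atomic.AtomicInteger;
import static jcuda.driver.JCudaDriver.*;
import jcuda.driver.*;

public class CudaDeviceRouter {
    static CudaDeviceRouter instance;
    static int nDevices;
    CUdevice[] devices;
    AtomicInteger curDev;

    private CudaDeviceRouter(){
        // The driver API has to be initialized before any other driver call
        cuInit(0);
        // Query the number of available devices
        int[] nd = new int[1];
        cuDeviceGetCount(nd);
        nDevices = nd[0];
        devices=new CUdevice[nDevices];
        curDev=new AtomicInteger(0);
        for (int d = 0; d<nDevices; d++) {
           devices[d] = new CUdevice();
           cuDeviceGet(devices[d], d);
        }
    }

    // synchronized, because several CPU threads may ask for the instance at once
    public static synchronized CudaDeviceRouter getInstance(){
        if (instance==null) {
            instance=new CudaDeviceRouter();
        }
        return instance;
    }

    // Creates the context 'ctx' on the next device (round robin) and returns that device
    public CUdevice getDevice(CUcontext ctx){
        int dev_nb=0;
        if (nDevices!=1) {
            dev_nb = curDev.getAndIncrement() % nDevices;
        }
        cuCtxCreate(ctx, 0, devices[dev_nb]);
        return devices[dev_nb];
    }
}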

So far, yes, but I’m not so sure what the purpose of this class would be. There’s still the dependency on the host thread (but again: I don’t have real experience with multiple GPUs either, so maybe I just don’t understand…)
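
One way to make this thread dependency explicit might be to give each host thread a fixed device index the first time it asks for one, so that it can create its context on that device once and keep using it. The following is only a plain-Java sketch of that bookkeeping, without any JCuda calls. The class name ThreadDeviceAssignment and its method are hypothetical, and the device count is assumed to be known already (for example from cuDeviceGetCount).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: remembers one device index per host thread.
class ThreadDeviceAssignment {
    // Counts how many threads have been assigned so far
    private final AtomicInteger nextDevice = new AtomicInteger(0);
    // Holds the device index of the calling thread
    private final ThreadLocal<Integer> assigned;

    ThreadDeviceAssignment(int deviceCount) {
        if (deviceCount < 1) {
            throw new IllegalArgumentException("deviceCount must be at least 1");
        }
        // The initial value is computed once per thread, on its first call,
        // so new threads are spread over the devices in round-robin order
        this.assigned = ThreadLocal.withInitial(
            () -> Math.floorMod(nextDevice.getAndIncrement(), deviceCount));
    }

    // Always returns the same index when called from the same thread
    public int deviceForCurrentThread() {
        return assigned.get();
    }
}
```

A thread would call deviceForCurrentThread() once, create its context on that device, and reuse this context for all of its later CUDA calls.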

I’m not sure I understand you…
Actually, my goal is to make sure that the CUDA computations that the CPU threads need are divided between all the devices.
For instance, I have to process several images. Each CPU thread will process one image at a time, and use one of the devices when needed… The “routing” class ensures that it is not always the same device that is used.
But maybe the driver does that automatically?

No, the driver will not do this automatically.
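
The distribution has to be done by the application. A minimal sketch of one possible approach, assuming one dedicated worker thread per device and round-robin submission of the tasks, could look like this. It is plain Java without any JCuda calls, and the names DeviceWorkerPool and DeviceTask are hypothetical. In real code, each worker thread would create the context for its device, and the kernels would be launched inside the task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one worker thread per device, tasks handed out round robin.
class DeviceWorkerPool {

    // A unit of work that is told on which device index it runs
    interface DeviceTask<T> {
        T run(int deviceIndex) throws Exception;
    }

    private final List<ExecutorService> workers = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger(0);

    DeviceWorkerPool(int deviceCount) {
        if (deviceCount < 1) {
            throw new IllegalArgumentException("deviceCount must be at least 1");
        }
        for (int d = 0; d < deviceCount; d++) {
            // Each executor owns exactly one thread, so all tasks for
            // device d run on the same host thread
            workers.add(Executors.newSingleThreadExecutor());
        }
    }

    // Picks the next device in round-robin order and queues the task on its thread
    public <T> Future<T> submit(DeviceTask<T> task) {
        final int d = Math.floorMod(next.getAndIncrement(), workers.size());
        Callable<T> call = () -> task.run(d);
        return workers.get(d).submit(call);
    }

    // Lets the worker threads end after the queued tasks are done
    public void shutdown() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
    }
}
```

With two devices, the first image would go to device 0, the second to device 1, the third to device 0 again, and so on. The threads that submit the images never touch a context themselves.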

I’m not sure at which point (and how) you want to use the devices returned by this class. The multi GPU sample from NVIDIA uses the Runtime API, and I don’t know how to do the same with the Driver API. Section 3.3.1 from the Programming Guide talks about Creating, Pushing and Popping Contexts, but I have not yet used this on my own.

It could be helpful to first try to get a basic example running, before trying to apply this technique to more complex image processing tasks inside the Plugin. It will become complicated enough.

And by the way: I cannot guarantee that the KernelLauncher offers enough flexibility to run tasks on multiple GPUs. Maybe you’ll have to extend or modify it for this. Of course, when I have the chance to use multiple GPUs, and when it’s necessary, I’ll try to extend it accordingly, but at the moment it is only targeting single-GPU usage.