Multiple threads, single context: how to handle?

Hello,
I was wondering if there is a correct way of handling a scenario like this: you have multiple threads that each allocate and fill device memory like this:

CUdeviceptr dev_expression = new CUdeviceptr();
cuMemAlloc(dev_expression, length * Sizeof.BYTE);
cuMemcpyHtoD(dev_expression, Pointer.to(postfixExp), length * Sizeof.BYTE);

But you only call the kernel on one of these threads. Is that possible in JCuda?

Right now, it seems that CUDA destroys my context as I get the “CUDA_ERROR_INVALID_CONTEXT” error.
Is there a proper way of handling this issue? I was thinking about creating a GPUWorker thread but it seems that in CUDA 4+, multiple threads can access a single context…

Thanks in advance

Hi

Although I have not yet extensively used CUDA with multiple host threads: You have to make sure that the respective context is “current” for the calling thread. I assume that you only have ONE context, is this correct? Then you will have to call cuCtxSetCurrent(theContext) in the thread before you do the cuMemAlloc call. Otherwise, the memory will not be allocated for the context that later actually tries to use the memory when the kernel is launched.
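To illustrate why the cuCtxSetCurrent call matters: the driver keeps a *per-thread* "current context", so a context created on one thread is simply not visible to another thread until that thread makes it current. Below is a minimal plain-Java model of that per-thread state (no GPU required) — `setCurrent` and `alloc` are stand-ins for JCuda's `cuCtxSetCurrent` and `cuMemAlloc`, and all names are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CurrentContextModel {
    // The CUDA driver keeps a per-thread "current context"; model it with a ThreadLocal.
    static final ThreadLocal<Object> currentContext = new ThreadLocal<>();

    static void setCurrent(Object ctx) { currentContext.set(ctx); } // ~ cuCtxSetCurrent
    static void alloc() {                                           // ~ cuMemAlloc
        if (currentContext.get() == null)
            throw new IllegalStateException("CUDA_ERROR_INVALID_CONTEXT");
    }

    public static void main(String[] args) throws Exception {
        Object theContext = new Object(); // created once, e.g. by cuCtxCreate
        setCurrent(theContext);
        alloc(); // fine on the thread that made the context current

        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Without setCurrent, the worker thread has no current context:
        Future<String> bad = pool.submit(() -> {
            try { alloc(); return "ok"; }
            catch (IllegalStateException e) { return e.getMessage(); }
        });
        System.out.println(bad.get()); // prints CUDA_ERROR_INVALID_CONTEXT

        // After making the shared context current, the same call succeeds:
        Future<String> good = pool.submit(() -> {
            setCurrent(theContext);
            alloc();
            return "ok";
        });
        System.out.println(good.get()); // prints ok
        pool.shutdown();
    }
}
```

This mirrors the error from the first post: the allocations happened on threads that never made the (single) context current.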

Some aspects of context handling have been discussed in http://forum.byte-welt.net/threads/10901-JCudaVectorAdd-for-multiple-GPU - although probably not everything that is discussed there is relevant for you, you might find it interesting.

bye
Marco

Hello Marco,
I finally had a chance to play around with the context thingy, but for the life of me I could not find a proper example of the usage of cuCtxSetCurrent. This may be more of a CUDA question than a JCuda question, and for that I apologize.

For multiple threads, are we supposed to create one CUcontext and then pass that context around to the other threads? Or does each thread have its own CUcontext object and call cuCtxSetCurrent on its own instance? It seems that for multiple GPUs, each device has its own CUcontext instance.
I implemented my program using a single CUcontext instance, and in the OpenGL interop part, strangely enough, I got CUDA out-of-memory errors whenever I tried to register the GLBuffer with CUDA! :eek: This makes absolutely no sense :frowning:

As always, any help is appreciated :slight_smile:

The GL interoperation sample may be a comparatively “complex” one to start with. The additional constraints that are imposed by things like GL buffer sharing (which is somehow bound to the GL rendering thread) make it more complicated than it has to be for the first tests.

For me it is not clear whether you have

  • multiple GPUs
  • multiple host threads
  • or both :smiley:

For the case of multiple GPUs, I created a sample a while ago (at http://forum.byte-welt.net/threads/10901-JCudaVectorAdd-for-multiple-GPU?p=77026&viewfull=1#post77026 and on the website). However, this sample uses only one host thread. Multiple host threads will make things more complicated: there is one context for each GPU, and one has to be really careful not to mess up the mapping between host threads, contexts and GPUs. But at least the case of multiple host threads and a single GPU should be manageable with a small example. As far as I understood it until now (!), this should basically work the way you described it: ONE CUcontext is created. This single CUcontext is passed to all threads. And each thread that wants to do anything with this context has to call cuCtxSetCurrent first. Of course, one has to take care of the synchronization. That is, for a sequence of operations in thread A
cuCtxSetCurrent(theCommonContext);
//*
doSomethingOn(theCommonContext);

one probably has to make sure that there is no thread B that grabs the context at line //* and causes thread A to ‘doSomethingOn’ a context that it no longer owns. But again: I have not yet extensively used multiple GPUs, multiple host threads, or even a combination thereof…
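One way to sketch that guard in Java: hold a lock around the whole set-current-plus-use sequence, so no other thread can re-bind the context in between. This is only a model of the pattern under the assumptions above — `ContextGuard` and `withContext` are made-up names, and the commented line marks where the real JCuda call would go:

```java
import java.util.concurrent.Callable;

public class ContextGuard {
    private final Object context;   // stands in for the single shared CUcontext
    private final Object lock = new Object();
    private int useCount = 0;       // plain (non-atomic) state, only touched under the lock

    ContextGuard(Object context) { this.context = context; }

    // Make the shared context current and run a sequence of calls on it atomically,
    // so no other thread can grab the context between set-current and use:
    <T> T withContext(Callable<T> work) throws Exception {
        synchronized (lock) {
            // cuCtxSetCurrent(context);  // real JCuda call would go here
            return work.call();
        }
    }

    public static void main(String[] args) throws Exception {
        ContextGuard guard = new ContextGuard(new Object());
        Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                try {
                    guard.withContext(() -> guard.useCount++); // ~ doSomethingOn(theCommonContext)
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        };
        Thread a = new Thread(worker), b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(guard.useCount); // 2000: the lock kept each sequence atomic
    }
}
```

Without the `synchronized` block, the two threads could interleave between the set-current step and the work that relies on it — exactly the hazard marked with //* above.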

Hello Marco,

Thank you for your response. My case is single GPU with multiple host threads. It is like a producer/consumer problem: the main thread constantly runs a genetic algorithm on CUDA and after each generation, the best solution is passed to the OpenGL thread so that it can be visualized. What I was doing was exactly what you mentioned. I created one CUcontext, and passed it to other threads. Before making any JCuda calls, I made sure to do this:

```
cuCtxSetCurrent(theContext);
doSomething();
```
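For what it's worth, the GA-thread-to-OpenGL-thread hand-off you describe can be sketched with a `BlockingQueue`, with the context made current on each thread before its JCuda work (commented below). All names here are illustrative, not from your code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BestSolutionHandoff {
    public static void main(String[] args) throws Exception {
        // One-slot queue: the GA thread publishes each generation's best
        // solution, the render thread consumes it.
        BlockingQueue<double[]> best = new ArrayBlockingQueue<>(1);

        Thread gaThread = new Thread(() -> {
            try {
                for (int gen = 0; gen < 3; gen++) {
                    // cuCtxSetCurrent(theContext);  // would precede the CUDA work here
                    double[] solution = { gen };     // stand-in for this generation's GA result
                    best.put(solution);              // blocks until the renderer took the last one
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        gaThread.start();

        // "OpenGL thread": takes each result as it becomes available
        for (int gen = 0; gen < 3; gen++) {
            // cuCtxSetCurrent(theContext);  // would precede GL-interop calls here
            double[] solution = best.take();
            System.out.println("visualizing generation " + (int) solution[0]);
        }
        gaThread.join();
    }
}
```

The queue takes care of the thread synchronization for the data hand-off; the context handling (one shared CUcontext, made current per thread) stays exactly as discussed above.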

Therefore at least I know I was on the right track :-)
I will play around with it more today and hopefully I can figure something out :-)
Thanks again :)

The cuCtxSetCurrent(theContext) works fine for me, thanks