Unified Memory

Hi there,

I’m having trouble getting started with “Unified Memory” using JCuda. I instantiate a Pointer p1 and then make the following call: cudaMallocManaged(p1, numBytes, cudaMemAttachGlobal). I’d then like to initialize the memory on the host side via a ByteBuffer. Although the call to cudaMallocManaged returns cudaSuccess, a call to p1.getByteBuffer() returns null, so it is not possible to access the memory. Note: I’ve verified that my GPU supports unified memory, so that’s not the issue.
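For completeness, here is a minimal sketch of what I am doing (the class name and buffer size are just placeholders):

import static jcuda.runtime.JCuda.*;

import java.nio.ByteBuffer;

import jcuda.Pointer;

public class ManagedMemoryIssue
{
    public static void main(String[] args)
    {
        Pointer p1 = new Pointer();
        long numBytes = 1024; // placeholder size
        int status = cudaMallocManaged(p1, numBytes, cudaMemAttachGlobal);
        System.out.println("Status: " + status); // prints 0 (cudaSuccess)

        // This returns null, so the memory cannot be accessed:
        ByteBuffer bb = p1.getByteBuffer(0, numBytes);
        System.out.println("Buffer: " + bb);
    }
}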

Any thoughts?

Thanks,
Steven

Hello

You got me on this one. This is indeed a bug: there is no ByteBuffer allocated for such a pointer.

As a (lame?) excuse, I can only say that I got my first card with Unified Memory support only a short while ago, and since then have not had the chance to test all the new features (“new” here meaning: features that were not supported by my GeForce 8800 :rolleyes: and that’s quite a lot…).

Although I had read the Unified Memory blog entry back when CUDA 6 was published, I probably did not (and still do not) understand all the details in depth. For example, as far as I understood, you’d need cudaMemAttachHost here, because cudaMemAttachGlobal makes the memory available only to the devices (and not to the host). Additionally, if the memory is accessible to the host, it has to be made accessible to the device explicitly by calling cudaStreamAttachMemAsync.

I just added a fix and ran a test program:

import static jcuda.jcublas.JCublas2.*;
import static jcuda.runtime.JCuda.*;

import java.nio.*;

import jcuda.*;
import jcuda.jcublas.*;
import jcuda.runtime.JCuda;

public class JCudaUnifiedMemory
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);
        JCublas.setExceptionsEnabled(true);

        // Allocate managed memory that is accessible to the host
        Pointer p = new Pointer();
        int n = 10;
        long size = n * Sizeof.FLOAT;
        cudaMallocManaged(p, size, cudaMemAttachHost);
        
        // Obtain the byte buffer from the pointer
        ByteBuffer bb = p.getByteBuffer(0, size);
        System.out.println("Buffer on host side: "+bb);

        // Fill the buffer with sample data  
        FloatBuffer fb = bb.order(ByteOrder.nativeOrder()).asFloatBuffer();
        for (int i=0; i<n; i++)
        {
            fb.put(i, i);
        }
        
        // Make the buffer accessible to all devices
        cudaStreamAttachMemAsync(null, p, 0, cudaMemAttachGlobal);
        cudaStreamSynchronize(null);

        // Use the buffer in a device operation 
        // (here, a dot product with JCublas, for example)
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);
        float[] result = { -1.0f };
        cublasSdot(handle, n, p, 1, p, 1, Pointer.to(result));
        System.out.println(result[0]);
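
        // Clean up
        cublasDestroy(handle);
        cudaFree(p);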
    }
}

and this seems to work, basically: the program prints the buffer and then the dot product, 0·0 + 1·1 + … + 9·9 = 285.0.

However, I’m not sure how to handle this. I could simply and quickly add this fix and create a new release, but I’m a bit hesitant here: I have not yet tested it thoroughly, and I’d prefer to test some application scenarios of Unified Memory before releasing it, if possible. Additionally, from what I have read so far, this Unified Memory feature opens the door for nasty bugs. For example, I’m not sure what should happen to the ByteBuffer after cudaStreamAttachMemAsync has been called. But this is only one aspect that made me raise an eyebrow. For example, the API documentation says things like

Accessing memory on the device from streams that are not associated with it will produce undefined results.

No error checking is performed by the Unified Memory system to ensure that kernels launched into other streams do not access this region.

It is a program’s responsibility to order calls to ::cudaStreamAttachMemAsync via events, synchronization or other means to ensure legal access to memory at all times.

If \p stream is destroyed while data is associated with it, the association is removed and the association reverts to the default visibility of the allocation as specified at ::cudaMallocManaged

and I think that this can have subtle implications (and possibly not-so-subtle consequences, maybe even nasty crashes when accessing the ByteBuffer after attaching the memory to a stream, or something like that…).
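For example, I would assume (and this is really only an assumption derived from the documentation quotes above, not something I have verified) that after the memory has been attached to a stream, the host may only touch the ByteBuffer again once the memory has been re-attached to the host and the stream has been synchronized. Continuing from the test program above, that might look roughly like this:

// Presumably (untested!) the memory has to be re-attached
// to the host before the ByteBuffer may be accessed again:
cudaStreamAttachMemAsync(null, p, 0, cudaMemAttachHost);
cudaStreamSynchronize(null);

// Only now should it be safe to read the buffer on the host:
System.out.println(fb.get(0));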

I’ll try to come to a conclusion here as quickly as reasonably possible. However, I cannot give a specific time for when the solution will be available in a release. If you’d like to do some basic tests (and tell me what does not work (which would be nice) and how to fix it (which would be great :D)), then I could either provide you with a “patched” Windows64 binary, which I used for the above test, or (if you’re on Linux/Mac) provide you with the updated source (it’s only a few added lines so far). Please let me know which option you’d prefer.

bye, and thanks for making me aware of this,
Marco

Version 0.7.0a of JCuda, which is available at jcuda.org - Downloads, supports the basic application pattern that I described in the code snippet above: when memory is allocated with cudaMallocManaged(p, size, cudaMemAttachHost), it is afterwards possible to obtain and access that memory via a ByteBuffer.

However, because of the open questions and potential caveats when using Unified Memory this way (also described in the post above), this should for now be considered experimental. I still have to do further tests to see where the corner cases are…

Thanks for your reply and efforts. I haven’t had a chance to explore this since I originally posted, but I will be looking at it again quite soon.

I should probably start by porting one of the “Unified Memory” samples from the NVIDIA CUDA samples to JCuda, to see some application patterns and where the caveats are, but (besides a lack of time), one prominent application case seems to be structs that are used on the host and the device - and structs are not really first-class citizens of the Java world. However, if you have remarks or recommendations regarding this topic, I’d be happy to hear about them :slight_smile:
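To sketch what I mean with the struct issue (a purely hypothetical example, nothing of this is tested): a C struct like “struct Particle { float x; float y; };” that a kernel reads from managed memory could probably only be mirrored on the Java side by writing the fields at the matching byte offsets into the ByteBuffer, along these lines (using the same imports as the test program above, and assuming 4-byte floats and no padding):

// Hypothetical: filling an array of "Particle structs" in managed memory
Pointer p = new Pointer();
int numParticles = 10;
int structSize = 2 * Sizeof.FLOAT; // x and y
cudaMallocManaged(p, numParticles * structSize, cudaMemAttachHost);
ByteBuffer bb = p.getByteBuffer(0, numParticles * structSize);
bb.order(ByteOrder.nativeOrder());
for (int i = 0; i < numParticles; i++)
{
    bb.putFloat(i * structSize, i);                       // Particle.x
    bb.putFloat(i * structSize + Sizeof.FLOAT, 2.0f * i); // Particle.y
}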