Windows Max Locked Memory

Hello,

I am using JCuda on Windows to download textures from the GPU to CPU memory.

The transfers are quite fast.

I use cudaHostAlloc(pointer, numBytes, cudaHostAllocWriteCombined);

However, I can't seem to allocate more than 10 GiB of locked memory with this function, even though I have more than 20 GiB available on the system.

When I go above that limit of around 10 GiB, CUDA returns an allocation error.
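For reference, this is roughly what my allocation code looks like (simplified; the 1 GiB chunk size and the chunk count are just illustrative):

```java
import jcuda.Pointer;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaError;

public class PinnedAllocTest
{
    public static void main(String[] args)
    {
        // Allocate pinned (page-locked) host memory in 1 GiB chunks and
        // report when the allocation starts to fail (for me: ~10 GiB total)
        long chunkBytes = 1L << 30;
        for (int i = 0; i < 20; i++)
        {
            Pointer p = new Pointer();
            int status = JCuda.cudaHostAlloc(
                p, chunkBytes, JCuda.cudaHostAllocWriteCombined);
            if (status != cudaError.cudaSuccess)
            {
                System.out.println("Failed after " + i + " GiB: "
                    + JCuda.cudaGetErrorString(status));
                return;
            }
        }
        System.out.println("All 20 GiB allocated");
    }
}
```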

Does anybody know if there is some configuration in Windows 10 that I can set so that I can allocate all the available memory in my system as locked memory for texture transfers?

Thanks in advance

Hello,

You might have a better chance of getting a helpful answer (even if it is a definite "No, that's not possible") on the NVIDIA forums or a dedicated Windows (10) forum or Q&A site. But web searches like "cudaHostAlloc limit OR maximum" bring up some results from the NVIDIA forums ("Pinned memory limit", "Max amount of host pinned memory available for allocation", …), where the discouraging (first) answer to the second question starts with…

Similar questions have been asked many times, and I have never seen a satisfactory answer, for any OS supported by CUDA.

Beyond that, the general sentiment seems to be: "That's going directly to the OS layer, and we don't have any influence on that." (Sorry, I know that wasn't really helpful…)


Taking a step back:

Depending on the exact use case, there might be options for some sort of "chunked" transfer. But eventually, at some point, the data has to be accessed from the Java side, and you might even hit the limitations of the language there: arrays (and buffers, like a ByteBuffer) are still addressed with an int.
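Just to illustrate that limitation with a sketch (the sizes are arbitrary): even if the allocation itself succeeded, a single buffer could never cover more than 2 GiB, so the host side has to juggle several of them anyway:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// A single Java array or ByteBuffer is indexed with an int, so it can hold
// at most Integer.MAX_VALUE (~2 GiB) bytes; larger data must be split.
// (Direct buffers are additionally capped by -XX:MaxDirectMemorySize.)
static List<ByteBuffer> allocateChunked(long totalBytes, int chunkBytes)
{
    List<ByteBuffer> chunks = new ArrayList<ByteBuffer>();
    for (long done = 0; done < totalBytes; done += chunkBytes)
    {
        int size = (int) Math.min(chunkBytes, totalBytes - done);
        chunks.add(ByteBuffer.allocateDirect(size));
    }
    return chunks;
}
```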

How do you currently use this data on the host side?

bye
Marco

Hello,

Thanks a lot, that's very helpful.

Thanks for the Google keywords; I should learn to make better use of operators such as OR.

Also, I realize how unspecific I was: I'm not allocating one huge chunk, but several 1.8 GiB chunks.

My end goal is to record (in memory, and then to disk) at most 20 seconds of OpenGL textures at 60 Hz in 4K (3840 x 2160).

Traditional pixel retrieval in OpenGL is too slow. With a CUDA memcpy from the texture, using the interoperability mapping functions, I achieve the maximum available throughput on my eGPU Thunderbolt 3 link (the CUDA memcpy runs at 1.5 GiB/s, which is the same figure that the CUDA-Z benchmark shows while my OpenGL app is running).
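In case the details matter, the copy path boils down to roughly this (a simplified sketch using the runtime API; pinnedHost is the destination obtained from cudaHostAlloc):

```java
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaGraphicsRegisterFlags.cudaGraphicsRegisterFlagsReadOnly;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyDeviceToHost;
import jcuda.Pointer;
import jcuda.runtime.cudaArray;
import jcuda.runtime.cudaGraphicsResource;

// Download one GL texture into pinned host memory. Simplified: in the real
// code the resource is registered once and reused for every frame, and the
// GL context must be current on the calling thread.
static void downloadTexture(int textureId, Pointer pinnedHost,
    int width, int height)
{
    final int GL_TEXTURE_2D = 0x0DE1; // constant from the GL bindings
    cudaGraphicsResource resource = new cudaGraphicsResource();
    cudaGraphicsGLRegisterImage(resource, textureId, GL_TEXTURE_2D,
        cudaGraphicsRegisterFlagsReadOnly);
    cudaGraphicsMapResources(1, new cudaGraphicsResource[] { resource }, null);
    cudaArray array = new cudaArray();
    cudaGraphicsSubResourceGetMappedArray(array, resource, 0, 0);
    // RGBA, 4 bytes per pixel, row pitch = width * 4
    cudaMemcpy2DFromArray(pinnedHost, width * 4L, array, 0, 0,
        width * 4L, height, cudaMemcpyDeviceToHost);
    cudaGraphicsUnmapResources(1, new cudaGraphicsResource[] { resource }, null);
    cudaGraphicsUnregisterResource(resource);
}
```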

1.5 GiB/s is below the actual OpenGL rendering throughput (RGBA at 4 bytes per pixel, 4K, 60 Hz is about 1.8 GiB/s), so I'm ring-buffering in 5 GiB of GPU memory, which gives me at most 20 seconds of recording before the GPU has to stall the rendering.
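Spelling out the arithmetic (with these exact figures, the window comes out a bit below the 20 seconds I see in practice):

```java
public class BandwidthBudget
{
    public static void main(String[] args)
    {
        double gib = 1024.0 * 1024 * 1024;
        long bytesPerFrame = 3840L * 2160 * 4;        // RGBA, 4 bytes/pixel: ~31.6 MiB
        double renderRate = bytesPerFrame * 60 / gib; // ~1.85 GiB/s at 60 Hz
        double copyRate = 1.5;                        // measured GiB/s over Thunderbolt 3
        double deficit = renderRate - copyRate;       // ~0.35 GiB/s piles up on the GPU
        double window = 5.0 / deficit;                // time until the 5 GiB ring buffer fills
        System.out.printf("render %.2f GiB/s, deficit %.2f GiB/s, window %.1f s%n",
            renderRate, deficit, window);
    }
}
```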

I am using one allocation per texture, so I'm already chunking the data, and it works great for recording 6 seconds.

But 10 GiB is not enough for 20 seconds; I need to use all of my computer's memory.

Hence my issue.

My solution is starting to get complex, and by the look of it, it will only grow more so. If you have a different approach for my end goal, I'd be happy to hear it. Of course, I have tried every recording tool I could find; they are either too slow or don't export the alpha channel.

This is for a personal art project.

Thanks for any help!

So you want to churn 40 gigabytes in 20 seconds - that's ambitious in any case. I thought that the allocation failures might be caused by fragmentation: the OS could certainly have a problem allocating e.g. a 10 GB block at once, because it has to be contiguous. But since you're already doing multiple, smaller allocations, this is less likely.

One NVIDIA forum thread mentioned RAMMap (Sysinternals | Microsoft Learn). I haven't looked at it yet, but it might help to gain some insight into why the allocation fails (with some background knowledge and further research, presumably). But even then, one would only know the reason, not how to solve it.

Sorry, but I have no idea how this could be tackled (beyond breaking the allocation down into smaller blocks). Asking whether you really need 4K and really need 60 Hz would also be rhetorical…

I assume that if you could simply break this down on a higher, methodological level (e.g. first capturing the first 5 seconds, then capturing seconds 5-10, then 10-15, …), you would certainly already have tried that. But if this is really generated output (via OpenGL rendering), I wonder why that isn't feasible. Is there some "real-time" (AR/webcam) input mixed in?

Thanks for your input.

I'm doing this R&D precisely to see whether I can do without breaking this down on a higher, methodological level, as you said.

Doing so would make my project more complex, since it does indeed have audio and video input that would need to be synced.

I'm going to try streaming the raw bytes to disk; my SSD is marketed as being able to write at 2 GiB/s.
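A minimal sketch of what I have in mind, assuming the pinned chunks can be exposed as direct ByteBuffers (the file name and the iteration over frames are placeholders):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Stream raw frames to disk as they come out of the ring buffer.
// Large sequential writes are the best case for the SSD's rated speed.
static void streamToDisk(Iterable<ByteBuffer> frames) throws IOException
{
    try (FileChannel out = FileChannel.open(Paths.get("capture.raw"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING))
    {
        for (ByteBuffer frame : frames)
        {
            frame.rewind();
            while (frame.hasRemaining())
            {
                out.write(frame);
            }
        }
    }
}
```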

Thanks for the help.

If you have any remaining advice on how I could improve the rate of cudaMemcpy: 1.5 GiB/s from GPU to CPU is already quite awesome, but I'm only missing about 350 MiB/s to be able to record unlimited videos.

Cheers

I don't have more specific advice for memory copies, sorry. The sample https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeMemoryBandwidths.java does a basic test of the attainable bandwidths, and memory from cudaHostAlloc should be the fastest there.

There are options for using mapped or unified memory (as shown in the samples https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeMappedMemory.java and https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeUnifiedMemory.java ). But I have to admit that I haven't yet looked at whether/how it might be possible to map texture memory (which is likely allocated by GL) into the host address space. I only ported the NVIDIA CUDA OpenGL sample to JCuda, and did nothing with CUDA+GL beyond that (in particular, I don't know anything about the relative performance of the possible other approaches).
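For reference, the core of the mapped-memory sample boils down to this (pinned host memory that is also visible to the device; numBytes is arbitrary here):

```java
import static jcuda.runtime.JCuda.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jcuda.Pointer;

// Must be set before the CUDA context is created
cudaSetDeviceFlags(cudaDeviceMapHost);

// Pinned host memory that is also mapped into the device address space
long numBytes = 1L << 20; // 1 MiB, arbitrary
Pointer host = new Pointer();
cudaHostAlloc(host, numBytes, cudaHostAllocMapped);
Pointer device = new Pointer();
cudaHostGetDevicePointer(device, host, 0);

// The same memory, viewed from the Java side
ByteBuffer buffer =
    host.getByteBuffer(0, numBytes).order(ByteOrder.nativeOrder());
```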

(BTW: If you’re interacting with GL and textures, you’re likely using the driver API at some point? The basic functionality for mapped/unified memory should be pretty similar to the runtime versions that are shown in the samples, though…)

I tried mapped memory, but without pinned memory. It seems that pinned memory removes a lot of memory-virtualization overhead. I don't know whether one can use pinned memory together with the CUDA/OpenGL memory-mapping functions.

I posted on the NVIDIA forum to ask for the most efficient way to copy from GPU to CPU: https://devtalk.nvidia.com/default/topic/1066695/cuda-programming-and-performance/fastest-way-to-copy-opengl-texture-to-cpu-memory/

The comment by Robert Crovella certainly makes sense (sometimes, taking a step back helps to ask the proper questions): Which Java OpenGL API are you using right now? JOGL or LWJGL?

Hello, thanks for the help.

I'm indeed going to try memory mapping, but I need a way to allocate pinned memory in Java, which I haven't researched yet. Right now, the only way I know to allocate pinned memory is via the CUDA allocation function.
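From the mapped-memory sample, it looks like the Pointer filled by cudaHostAlloc can be wrapped as a ByteBuffer, so something like this might already cover the "pinned memory in Java" part (untested on my side; the size matches my current chunks):

```java
import static jcuda.runtime.JCuda.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import jcuda.Pointer;

// One pinned 1.8 GiB chunk, exposed to Java as a direct ByteBuffer
// (must stay below 2 GiB, since ByteBuffer is indexed with an int)
long numBytes = (long) (1.8 * (1L << 30));
Pointer pinned = new Pointer();
cudaHostAlloc(pinned, numBytes, cudaHostAllocDefault);
ByteBuffer view =
    pinned.getByteBuffer(0, numBytes).order(ByteOrder.nativeOrder());
// ... cudaMemcpy into 'pinned', write 'view' to a FileChannel, then:
cudaFreeHost(pinned);
```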

I am using the latest version of JOGL. The fastest result I achieved while relying on OpenGL functions alone was 600 MiB/s, using mapped memory, but with a traditional, non-direct ByteBuffer.