Memory alloc overhead

Hi!

I tried to use JCublas for one of my applications. Unfortunately, the application got much slower with JCublas, so I compared the following in a test program:

  1. allocating, loading, and freeing a single double array of size 1000000 in GPU memory (cublasAlloc, cublasSetVector, cublasFree)

  2. allocating, loading, and freeing 1000000 double arrays of size 1 in GPU memory, one after another

Test 1) is finished in no time, but 2)… don't wait for it.
Now I wonder: is this an unavoidable hardware issue, a CUDA issue, a CUBLAS issue, or a JCublas issue?
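
For reference, the two variants in code look roughly like this (a simplified sketch; the surrounding cublasInit/cublasShutdown calls and all class and variable names are just for illustration):

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.jcublas.JCublas;

    public class AllocOverheadTest
    {
        public static void main(String args[])
        {
            int n = 1000000;
            double hostData[] = new double[n];

            JCublas.cublasInit();

            // Variant 1: one array of size n, allocated, loaded, and freed once
            Pointer p = new Pointer();
            JCublas.cublasAlloc(n, Sizeof.DOUBLE, p);
            JCublas.cublasSetVector(n, Sizeof.DOUBLE, Pointer.to(hostData), 1, p, 1);
            JCublas.cublasFree(p);

            // Variant 2: n arrays of size 1, one after another
            for (int i = 0; i < n; i++)
            {
                Pointer pi = new Pointer();
                JCublas.cublasAlloc(1, Sizeof.DOUBLE, pi);
                JCublas.cublasSetVector(1, Sizeof.DOUBLE, Pointer.to(hostData), 1, pi, 1);
                JCublas.cublasFree(pi);
            }

            JCublas.cublasShutdown();
        }
    }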

Hello cyau

You usually use CUDA, or specifically CUBLAS, when you have big matrices. “Big” in this case means roughly between 100x100 and 10000x10000 for dense matrices. You would not use CUBLAS to multiply two 4x4 matrices, for example.
And it is most beneficial when you want to perform many BLAS computations on the same memory. That means computations that involve a lot of arithmetic compared to the size of the memory involved, as for example in cublasSgemm, which performs on the order of n^3 operations on n^2 data. There is usually no advantage in performing a simple, memory-bound task with CUBLAS, like finding the index of the minimum element of a vector with cublasIsamin.
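
To make that concrete, here is a rough sketch of such a compute-bound case (the matrix size and all names are only examples, and error checks are omitted): the transfers move n*n elements each, while the single multiplication call performs roughly 2*n*n*n floating point operations on them.

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.jcublas.JCublas;

    public class SgemmExample
    {
        public static void main(String args[])
        {
            int n = 1000; // "big" in the sense above
            float A[] = new float[n * n];
            float B[] = new float[n * n];
            float C[] = new float[n * n];

            JCublas.cublasInit();

            Pointer dA = new Pointer();
            Pointer dB = new Pointer();
            Pointer dC = new Pointer();
            JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dA);
            JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dB);
            JCublas.cublasAlloc(n * n, Sizeof.FLOAT, dC);

            // Two transfers of n*n elements each...
            JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(A), 1, dA, 1);
            JCublas.cublasSetVector(n * n, Sizeof.FLOAT, Pointer.to(B), 1, dB, 1);

            // ...but about 2*n*n*n operations in a single call: compute-bound,
            // so this is the kind of task where CUBLAS pays off
            JCublas.cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

            JCublas.cublasGetVector(n * n, Sizeof.FLOAT, dC, 1, Pointer.to(C), 1);

            JCublas.cublasFree(dA);
            JCublas.cublasFree(dB);
            JCublas.cublasFree(dC);
            JCublas.cublasShutdown();
        }
    }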

(That’s a general problem for many potential applications of CUDA: the most expensive operations are the memory transfers, not the actual computations. For example, a single access to global memory has a latency of 600-800 cycles. There are details in the CUDA documentation, in case you are interested.)

But specifically regarding your question: It might be necessary to do some more tests to find out what is so time-consuming in your test case. The most interesting point (at least for me personally) would be to run the same test in C with plain CUBLAS, to see how far it differs from the JCublas results. Of course, there is an overhead for calling the native CUBLAS functions from JCublas, but it should be negligible compared to the time for the memory transfers and the computations. Setting 1 million elements with 1 million individual calls to cublasSetVector is not really a representative benchmark, and far from any realistic application case. It may (in the best case) give an idea of how large the overhead of a single function call really is.
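
If you want to see how large that per-call overhead is on the Java side, a rough timing sketch (just an illustration using System.nanoTime; the exact numbers will of course depend on your system) could look like this:

    import jcuda.Pointer;
    import jcuda.Sizeof;
    import jcuda.jcublas.JCublas;

    public class CallOverheadTest
    {
        public static void main(String args[])
        {
            int n = 100000; // fewer elements than in your test, to keep it quick
            double hostData[] = new double[n];

            JCublas.cublasInit();
            Pointer p = new Pointer();
            JCublas.cublasAlloc(n, Sizeof.DOUBLE, p);

            // One call transferring n elements
            long t0 = System.nanoTime();
            JCublas.cublasSetVector(n, Sizeof.DOUBLE, Pointer.to(hostData), 1, p, 1);
            long t1 = System.nanoTime();

            // n calls transferring one element each (always the same element,
            // which is fine here: we only want to measure the call overhead)
            for (int i = 0; i < n; i++)
            {
                JCublas.cublasSetVector(1, Sizeof.DOUBLE, Pointer.to(hostData), 1, p, 1);
            }
            long t2 = System.nanoTime();

            System.out.printf("one call : %8.3f ms%n", (t1 - t0) / 1e6);
            System.out.printf("n calls  : %8.3f ms, ~%d ns per call%n",
                (t2 - t1) / 1e6, (t2 - t1) / n);

            JCublas.cublasFree(p);
            JCublas.cublasShutdown();
        }
    }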

If you describe your application case and how you are using JCublas at the moment, it might be possible to give some hints on how to make it faster.

bye
Marco

Thank you for the reply. I think I have to rework my algorithm anyway…