Quick question: I have used CUDA.NET with C#, and its main problem is the huge overhead of P/Invoke. An application that achieved a 10x speedup in C++ loses all of that performance benefit when I use C# and CUDA.NET. The marshalling from managed to unmanaged code is way too expensive.
What is your experience with jCuda? Did you manage to avoid this overhead?
I’m not familiar with C# and its Platform Invocation mechanism, and in particular I do not know how the invocations, the resolution of the function calls, and the marshalling are handled internally.
For Java, there are basically two general approaches for accessing native libraries:
Using JNA (Java Native Access), which might be similar to the C# Platform Invocation: one specifies a DLL, and the method calls are resolved at runtime.
Using native functions (created manually, or with tools like GlueGen) that are implemented directly to call the library functions.
I took the latter approach: the functions of CUDA are made accessible to Java as plain native functions. Hardly any marshalling is required, except for basic primitive and pointer conversions and special cases such as pointers to arrays of pointers. The overhead for calling such a function from Java should thus be rather small. In particular, the overhead of the native call itself should be negligible, considering that many time-critical functions, including all the basic mathematical functions like Math.sin, Math.tan etc., are originally provided as native functions.
Of course, method invocations are not free: there will always be a small overhead. And of course, the greatest benefit is obtained for single method/kernel calls that are compute-intensive and take an amount of time that is “considerable” compared to the overhead of the method invocation. But I wonder in which application case the overhead of a few method invocations would eat up all the performance benefits of CUDA. Assume that some routine requires 100ms and can be done with CUDA in 10ms: for the benefit to vanish, the method invocations alone (for setting up the arguments and launching the kernel) would have to take ~90ms in C#. Can this be true?
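To make that arithmetic explicit, here is a minimal sketch. The class and method names are made up for illustration, and the model (a fixed overhead simply added to the GPU time) is a simplifying assumption, not a measurement:

```java
public class OverheadEstimate {
    // Speedup over the CPU version when a fixed invocation overhead
    // is added to the GPU time (simplified model, an assumption).
    public static double effectiveSpeedup(double cpuMs, double gpuMs, double overheadMs) {
        return cpuMs / (gpuMs + overheadMs);
    }

    // Overhead at which the GPU version is exactly as slow as the CPU version.
    public static double breakEvenOverheadMs(double cpuMs, double gpuMs) {
        return cpuMs - gpuMs;
    }

    public static void main(String[] args) {
        double cpuMs = 100.0; // routine on the CPU, from the example above
        double gpuMs = 10.0;  // same routine with CUDA, from the example above
        System.out.println(breakEvenOverheadMs(cpuMs, gpuMs));    // prints 90.0
        System.out.println(effectiveSpeedup(cpuMs, gpuMs, 0.0));  // prints 10.0
        System.out.println(effectiveSpeedup(cpuMs, gpuMs, 90.0)); // prints 1.0
    }
}
```

So under this model the 10x only disappears completely once the calls themselves cost about 90ms per run of the routine.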
I can imagine that, for example, only the first call is slow, because the native library has to be loaded. At least in Java this might be the case, depending on the strategy that is used for loading the native library. May I ask how you did the benchmarking / time measurement?
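As a rough sketch of what I mean by measuring the first call separately from the later ones: the class and method names below are hypothetical, and Math.sin is only a stand-in for whatever native call is being measured, it is not a jCuda function.

```java
import java.util.function.DoubleUnaryOperator;

public class CallTiming {
    // Returns the elapsed nanoseconds for `calls` invocations of f.
    public static long timeCalls(DoubleUnaryOperator f, int calls) {
        double sink = 0.0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += f.applyAsDouble(i * 1e-3);
        }
        long elapsed = System.nanoTime() - start;
        // Use the result so the loop cannot be optimized away entirely.
        if (Double.isNaN(sink)) {
            System.out.println(sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        DoubleUnaryOperator call = Math::sin; // stand-in for the native call (assumption)

        long first = timeCalls(call, 1);      // may include one-time setup costs
        timeCalls(call, 100_000);             // warm-up, result ignored

        int n = 1_000_000;
        long steady = timeCalls(call, n);     // steady-state measurement

        System.out.println("first call:   " + first + " ns");
        System.out.println("steady state: " + ((double) steady / n) + " ns/call");
    }
}
```

If the first number is much larger than the second, the cost is in the setup and not in the calls themselves. For real kernel launches one would also have to synchronize with the device before stopping the timer, because the launch returns before the kernel has finished.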