I have been working with a simple example to tune my JOCL software.
Many things could be done, such as using pinned memory, zero copy, single precision, and relaxed math.
Right now I'm generating the input data in a separate CPU thread and initializing JOCL in the main CPU thread. Generating the input data for the kernel sometimes takes 200-800 ms (the input is very large).
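
The setup described above can be sketched roughly as follows. This is only a minimal sketch, not the actual code: `generateInput` and `initOpenCl` are hypothetical placeholders standing in for the real input generator and the real JOCL initialization, and the timing calls are there only to show where the two measurements come from.

```java
import java.util.concurrent.CompletableFuture;

public class OverlapStartup {

    // Hypothetical placeholder for the real (very large) input generation.
    static float[] generateInput(int n) {
        float[] data = new float[n];
        for (int i = 0; i < n; i++) {
            data[i] = i * 0.5f;
        }
        return data;
    }

    // Hypothetical placeholder for the JOCL initialization
    // (platform/device lookup, context, command queue, program build).
    static String initOpenCl() {
        return "ready";
    }

    // Milliseconds elapsed since the given System.nanoTime() value.
    static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1_000_000;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();

        // Input generation runs on a separate CPU thread.
        CompletableFuture<float[]> input =
                CompletableFuture.supplyAsync(() -> generateInput(1 << 20));

        // Initialization runs on the main thread at the same time.
        long initStart = System.nanoTime();
        String state = initOpenCl();
        long initMs = elapsedMs(initStart);

        // Wait for the input before the kernel would be launched.
        float[] data = input.join();
        long totalMs = elapsedMs(start);

        System.out.println("init: " + initMs + " ms, total: " + totalMs
                + " ms, state: " + state + ", elements: " + data.length);
    }
}
```

With this layout the total startup time is roughly the longer of the two tasks, so the initialization only becomes the bottleneck when the input generation finishes first.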
What I can't figure out is how to tune the JOCL startup; as of now it takes around 400-600 ms to initialize JOCL.
Is there a good way to speed this up, such as doing the startup in 2 threads, or some other solution? It would be very nice
if this could be cut in half, to 200-300 ms.