Hello everybody, this is my first post on this forum.
I want to evaluate whether JOCL could be used to accelerate some of my software, and since I have no experience with GPU programming so far, I have started with a few tests. One of them was to measure how many tasks are executed at the same time, to see whether it matches the specifications of the GPU. To do so, I wrote a deliberately dumb, useless and time-consuming kernel: it just increments and decrements a value a large number of times.
I did not take into account the time for setting things up or sending data to the GPU. Also, since I only wanted to evaluate the number of tasks that can be executed in parallel, I did not spend time trying to use GPU-specific instructions. Here is the kernel I used:
__kernel void dummyKernel(__global int *c) {
    int gid = get_global_id(0);
    int n = 0;
    // Busy-work: count n up to 100 and back down to 0, 500 times
    for (int i = 0; i < 500; i++) {
        while (n < 100) {
            n += 1;
        }
        while (n > 0) {
            n -= 1;
        }
    }
    // Write the result back (it will always be 0)
    c[gid] = n;
}
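In case it is relevant: the host side looks roughly like the sketch below. This is only a simplified version, not my exact code - it just takes the first GPU of the first platform, omits error handling, and the range of work sizes is a placeholder. The idea is simply to enqueue the kernel with an increasing number of work-items and to time it around clFinish():

import static org.jocl.CL.*;
import org.jocl.*;

public class DummyKernelTiming {

    // The dummyKernel source from above, embedded as a string
    private static final String SOURCE =
        "__kernel void dummyKernel(__global int *c) {" +
        "  int gid = get_global_id(0);" +
        "  int n = 0;" +
        "  for (int i = 0; i < 500; i++) {" +
        "    while (n < 100) { n += 1; }" +
        "    while (n > 0) { n -= 1; }" +
        "  }" +
        "  c[gid] = n;" +
        "}";

    public static void main(String[] args) {
        CL.setExceptionsEnabled(true);

        // Take the first GPU of the first platform (placeholder choice)
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, devices, null);

        cl_context_properties props = new cl_context_properties();
        props.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        cl_context context = clCreateContext(props, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        cl_program program = clCreateProgramWithSource(context, 1, new String[]{SOURCE}, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "dummyKernel", null);

        // Run the kernel with an increasing number of work-items and time each run
        for (int numTasks = 1; numTasks <= 256; numTasks *= 2) {
            cl_mem memC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                    (long) numTasks * Sizeof.cl_int, null, null);
            clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memC));

            long[] globalWorkSize = new long[]{numTasks};
            long start = System.nanoTime();
            clEnqueueNDRangeKernel(queue, kernel, 1, null, globalWorkSize, null, 0, null, null);
            clFinish(queue); // wait until the kernel has really finished
            long end = System.nanoTime();

            System.out.println(numTasks + " work-items: " + (end - start) / 1e6 + " ms");
            clReleaseMemObject(memC);
        }
    }
}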
I tried it on two integrated graphics chipsets (Intel HD Graphics 4400 and 4600) and two graphics cards (GT 740 and GTX 760). The results of my test are shown in the following graphs:
The curves for the Intel HD Graphics 4400 and 4600 are almost linear, because they seem to be able to execute only 4 tasks at a time. We can see that the GT 740 is able to run 32 tasks in parallel and the GTX 760 goes up to 96. However, when looking at the specifications of the GPUs, I noticed something strange:
[ul]
[li]The Intel HD Graphics 4400 is supposed to have 16 execution units, not 4[/li]
[li]The Intel HD Graphics 4600 is supposed to have 20 execution units, not 4[/li]
[li]The GT 740 is supposed to have 384 CUDA cores, not 32[/li]
[li]The GTX 760 is supposed to have 1152 CUDA cores, not 96[/li]
[/ul]
There is something which I have clearly not understood.
I thought that the number of execution units indicated the number of parallel tasks, but it clearly doesn't. The 4600 also has 25% more execution units, yet there is no gain compared to the 4400 - I think the difference in the graphs comes only from different clock frequencies.
For the graphics cards, I am not sure what to make of the term "CUDA core". We can notice that 384/32 == 1152/96 == 12, so there is clearly a pattern here. Apparently OpenCL does not run one task per CUDA core; is it because CUDA cores are not equivalent to "processor cores", or is it a limitation of OpenCL? I have to admit I'm a bit confused.
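Maybe the relevant number is rather what OpenCL itself reports for the device? A minimal standalone sketch for querying it (again just taking the first GPU of the first platform) would be something like:

import static org.jocl.CL.*;
import org.jocl.*;

public class DeviceInfoQuery {
    public static void main(String[] args) {
        CL.setExceptionsEnabled(true);

        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, devices, null);

        // Number of compute units the OpenCL implementation exposes
        int[] computeUnits = new int[1];
        clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                Sizeof.cl_uint, Pointer.to(computeUnits), null);

        // Maximum number of work-items per work-group
        long[] maxWorkGroupSize = new long[1];
        clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                Sizeof.size_t, Pointer.to(maxWorkGroupSize), null);

        System.out.println("CL_DEVICE_MAX_COMPUTE_UNITS:   " + computeUnits[0]);
        System.out.println("CL_DEVICE_MAX_WORK_GROUP_SIZE: " + maxWorkGroupSize[0]);
    }
}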
I also expected the graphics cards to be faster than the Intel chipsets, but I guess this should change with tasks for which GPU-specific instructions can be used.
By the way, I think I’m using recent drivers:
extra/nvidia 343.22-2 [installed]
extra/opencl-nvidia 343.22-1 [installed]
Thank you very much for any information!