Counting cores


Hello everybody, I'm new to this forum.

I intend to evaluate the possibility of using JOCL to accelerate some of my software, and as I have so far no experience with GPU programming, I have started with a few tests. One of them was to evaluate the number of tasks executed at the same time, in order to see whether it matches the specifications of the GPU. To do so, I wrote a deliberately useless and time-consuming task: it just increases and decreases values a large number of times.

I did not take into account the time for setting things up or sending data to the GPU. Also, as I only wanted to evaluate the number of tasks that can be executed in parallel, I did not spend time trying to use GPU-specific instructions. Here's the code I used:

__kernel void dummyKernel(__global int *c) {
    int gid = get_global_id(0);
    int n = 0;
    for (int i=0; i<500; i++) {
        while (n<100) { n++; }  // increase...
        while (n>0)   { n--; }  // ...and decrease again
    }
    c[gid] = n;
}

I tried it on two graphics chipsets (Intel HD Graphics 4400 and 4600) and two graphics cards (GT 740 and GTX 760). The results of my test are shown in the following graphs:

The curves for the Intel HD Graphics 4400/4600 are almost linear because they seem to be able to execute only 4 tasks at a time. We can see that the GT 740 is able to run 32 tasks, and the GTX 760 goes up to 96. However, when looking at the specifications of the GPUs, I noticed something strange:

[ul]
[li]The Intel HD Graphics 4400 is supposed to have 16 execution units, not 4[/li]
[li]The Intel HD Graphics 4600 is supposed to have 20 execution units, not 4[/li]
[li]The GT 740 is supposed to have 384 CUDA cores, not 32[/li]
[li]The GTX 760 is supposed to have 1152 CUDA cores, not 96[/li]
[/ul]

There is something which I have clearly not understood.

I thought that the number of execution units indicated the number of parallel tasks, but it clearly doesn't. Also, the 4600 has 25% more execution units, yet shows no gain over the 4400 - I think the difference in the graphs comes only from different clock frequencies.

For the graphics cards, I am not sure what to expect from the term "CUDA core". We can notice that 384/32 == 1152/96 == 12, so there's clearly something systematic happening. Apparently OpenCL does not run one task per CUDA core; is it because CUDA cores aren't equivalent to "processor cores", or is it a limitation of OpenCL? I have to admit I'm a bit confused.

I also expected the graphics cards to be faster than the Intel chips, but I guess this should change with tasks for which GPU-specific instructions can be used.

By the way, I think I’m using recent drivers:
extra/nvidia 343.22-2 [installé]
extra/opencl-nvidia 343.22-1 [installé]

Thank you very much for any information :slight_smile:


Stream cores are grouped into small groups that share common control flow; in NVIDIA's case the group size is 32. These groups can be thought of as something like AVX/SSE in CPU terms. With SSE you can operate on multiple items at a time, so in principle Intel could have marketed its core count multiplied by the SIMD vector width. Intel doesn't do that, because SSE/AVX is somewhat less flexible than stream cores (though, on the other hand, GPGPUs aren't as flexible as individual CPU cores), and in real life wider SSE doesn't bring immediate large gains. GPUs, on the other hand, are designed to handle rasterization, which is an embarrassingly parallel task, so GPUs can utilize all of their cores effectively. You also have to take into account that instruction latencies on GPUs are much higher than on CPUs, but thread switching on GPUs is much faster (effectively free, I think) than on CPUs, where task switching is expensive and therefore isn't done more often than about 1000 times a second, even though the clocks run in the gigahertz range.
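To make the "shared control flow" point concrete, here's a toy model in plain Java (numbers invented for illustration, not from any real GPU): because all lanes of a group share one instruction pointer, a divergent branch costs the time of both paths, not just the path each lane takes.

```java
public class WarpDivergence {
    static final int GROUP_SIZE = 32; // NVIDIA's lockstep group size

    // Cycles for a group where some lanes take branch A (costA cycles)
    // and the rest take branch B (costB cycles). With a shared
    // instruction pointer, the group executes both paths serially.
    static int groupCycles(int lanesTakingA, int costA, int costB) {
        boolean anyA = lanesTakingA > 0;
        boolean anyB = lanesTakingA < GROUP_SIZE;
        return (anyA ? costA : 0) + (anyB ? costB : 0);
    }

    public static void main(String[] args) {
        System.out.println(groupCycles(32, 10, 50)); // all lanes agree: 10
        System.out.println(groupCycles(0, 10, 50));  // all lanes agree: 50
        System.out.println(groupCycles(1, 10, 50));  // one lane diverges: 60
    }
}
```

A single diverging lane makes the whole group pay for both branches, which is why branch-heavy kernels (like the test kernel above) can behave very differently from arithmetic-heavy ones.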

So you need to read the documentation for the GPUs you want to optimize for and find out how the streaming processors are grouped and what the groups share. As I said, at some level of grouping the streaming processors share an instruction pointer, and that has serious implications.

Also, because instruction latencies on GPGPUs are high and thread switching is essentially free, you need to run many more threads than there are physical units available to execute them, so that those latencies can be masked. CPUs do out-of-order execution and other tricks to extract intra-thread parallelism. GPUs don't do that; instead, they are built to maximally exploit inter-thread parallelism.
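A back-of-the-envelope sketch of that latency-masking argument (plain Java, with a hypothetical latency value - real numbers depend on the chip): if an instruction takes L cycles to complete but a new thread group can issue every cycle, you need roughly L groups resident to keep the unit busy.

```java
public class LatencyHiding {
    // Toy model: fraction of cycles the execution unit is busy when
    // `groups` thread groups are resident, each instruction takes
    // `latencyCycles` to complete, and one instruction issues per cycle.
    static double utilization(int groups, int latencyCycles) {
        return Math.min(1.0, (double) groups / latencyCycles);
    }

    public static void main(String[] args) {
        int latency = 20; // hypothetical instruction latency in cycles
        System.out.println(utilization(1, latency));  // one group: mostly idle
        System.out.println(utilization(20, latency)); // latency fully hidden
    }
}
```

This is why launching only as many work-items as there are cores underuses a GPU: oversubscription is the mechanism, not a workaround.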



An interesting test. But admittedly, this touches on details of the underlying hardware about which I can hardly make definitive statements (this is fairly complicated… and in the end, OpenCL is intended to be an abstraction of the hardware :wink: although it is not necessarily one, particularly when it comes to scheduling, work sizes etc.). There are several terms involved, with interdependencies that are hard to grasp at first glance: cores, threads, compute units, execution units, processors, blocks, grids, warps… and some of them are used with similar-but-not-equal or even completely different meanings in different contexts and by different manufacturers. I cannot say anything about the Intel chips. But for the NVIDIA cards, one can refer to the CUDA programming guide, namely the section about the Hardware Implementation:

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs).

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

[ul]
[li]For devices of compute capability 1.x, a multiprocessor consists of: 8 CUDA cores for arithmetic operations…[/li]
[li]For devices of compute capability 2.x, a multiprocessor consists of: 32 CUDA cores for arithmetic operations…[/li]
[li]For devices of compute capability 2.1, a multiprocessor consists of: 48 CUDA cores for arithmetic operations…[/li]
[li]For devices of compute capability 3.x, a multiprocessor consists of: 192 CUDA cores for arithmetic operations…[/li]
[li]For devices of compute capability 5.x, a multiprocessor consists of: 128 CUDA cores for arithmetic operations…[/li]
[/ul]

Now you mentioned

We can notice that 384/32 == 1152/96 == 12, so there’s clearly something happening.

which matches the description: this could simply indicate that both cards have 12 "streaming multiprocessors" (one could probably find this information in some spec).
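Spelling out the arithmetic from the question in plain Java (the core and task counts are the ones reported above):

```java
public class CoreRatios {
    public static void main(String[] args) {
        int gt740Cores = 384,  gt740ParallelTasks = 32;
        int gtx760Cores = 1152, gtx760ParallelTasks = 96;

        // Both cards show the same ratio of advertised CUDA cores
        // to observed parallel tasks:
        System.out.println(gt740Cores / gt740ParallelTasks);   // 12
        System.out.println(gtx760Cores / gtx760ParallelTasks); // 12
    }
}
```

The identical ratio on two different cards is what suggests a fixed grouping of cores, rather than a coincidence of the benchmark.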

BTW: I'm not sure whether the given kernel is really the best test case, because the loops consist mainly of conditional branches rather than arithmetic (and the only arithmetic is done on the int type). When I try to create an artificial workload, I usually use kernels like

__kernel void dummyKernel(__global float *c) {
    int gid = get_global_id(0);
    float result = 0;
    for (int i=0; i<50; i++) {
        result += cos(sin(cos(sin(cos(sin(cos(c[gid])))))));
    }
    c[gid] = result;
}

(but when this is used as an attempt to “measure performance”, it could still be questionable, and the results should not be overrated)

It could be interesting to see the chart with input sizes up to a value larger than 1152 (possibly with a smaller workload), to see how the curves behave…