JCudaRuntimeGLSample3, why a GLJPanel?

Hello,
I looked at your JCudaRuntimeGLSample3 example (link), and I don't understand why you chose a GLJPanel to display in.
Just by changing it to a GLCanvas, I see ~230 FPS, whereas it was at ~80 FPS before.

Another question:
I haven't finished looking at the code yet. But when I run this same CUDA/OpenGL interaction program, I get ~230 FPS, whereas running the example from the CUDA SDK, I see ~x.000 FPS.
I suppose your JCuda example uses the kernel given in the CUDA SDK, so the only difference is the language used alongside CUDA. Does this factor-10 performance difference come from Java?!
That would be crazy - OK, Java is generally slower than C, but not 10 times slower!

Hello Bertrand,

Wo-hoo - you're right: For me it's ~120 FPS with the GLJPanel and ~480 FPS with the GLCanvas. Admittedly, there is no specific reason why I used a GLJPanel. I had some difficulties with a GLCanvas in another project, due to its heavyweight nature, and in my "JOGL stub" I switched between GLJPanel and GLCanvas several times (hence the general variable name 'glComponent' :wink: ). Maybe the GLJPanel accidentally slipped into this example. I have updated the file accordingly (it also increased the FPS for the 3D texture sample).
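For reference, the switch itself is basically a one-line change (just a sketch - 'glComponent' as in the sample; 'capabilities' stands for a GLCapabilities object created beforehand, and the exact constructor arguments in the sample may differ):

// Before: lightweight Swing component
// GLJPanel glComponent = new GLJPanel(capabilities);

// After: heavyweight AWT component - noticeably higher FPS here
GLCanvas glComponent = new GLCanvas(capabilities);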

Thanks for this hint! :slight_smile:

Concerning the different FPS values between the JCuda sample and the SDK sample: You may have noticed that the JCuda sample uses a grid size of 512x512, whereas the SDK sample uses 256x256. This alone could explain a fourfold difference. But apart from that, there must be other factors involved, because using the same grid size does not result in the same FPS. At first I thought the difference might come from the different ways of measuring the FPS: In the SDK sample, the FPS are computed from the time that is required for a single call to the 'display' method, whereas in the JCuda sample, the FPS are computed from the time between two complete invocations of the 'display' method. Then I changed the way of measuring to the one used in the SDK sample, but it did not have such a great effect (it changed from 480 to 670 FPS).

It might sound like an excuse, but… to me it seems that the SDK sample is not measuring and computing the FPS properly. (And I do NOT say that my sample does so - I did my best to do a reasonable measurement, and hope it makes sense, but I'm open to suggestions about possible improvements.) The reason why I think that the FPS computation in the SDK sample is flawed is simple: When running the example on my GeForce 8800, I get between 10000 and 20000 FPS. This does not sound too realistic, but OK, maybe the GPU is fast and the geometry is simple… However, when changing the grid size to 4096x4096, it becomes awfully slow, but still prints roughly the same numbers, between 10000 and 20000 FPS. It seems as if the computation of the FPS and the actual time required for the kernel are completely unrelated. I'd have to examine this further to find out what the reason might be, but I'd be interested if anyone could confirm this beforehand… (Yes, I admit it: I just don't want to feed the rumour that Java is slow :wink: )

bye
Marco

I tried what you said: reducing the grid size and using the same way of calculating the FPS as in the SDK.
I also reduced the frame size to match the one of the sample.
And I reached almost 1,000 FPS!
It is still less than the C version, but that's a good improvement, I think.

I looked on the Internet for a "more proper" way to calculate the FPS, but it seems that your way is fine.
It is described in the OpenGL FAQ: link, point 22.020. If I find time, I'll try this approach in the C sample to see, because yes, the way they do it is a little strange.

I also found another improvement: disabling the "Depth Test" as they do in the C sample. Doing so, I reach 1,200 FPS!
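In the JOGL sample this boils down to something like the following, in the GLEventListener's init method (just a sketch - the exact GL interface returned by getGL() depends on the JOGL version):

public void init(GLAutoDrawable drawable)
{
    GL gl = drawable.getGL();

    // Disable the depth test, as the SDK sample does
    gl.glDisable(GL.GL_DEPTH_TEST);

    // ... remaining initialization ...
}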

Hello

The description in the link is:
"A simple method is to note the system time, render a frame, and note the system time again."

To me it is not clear whether this means

void display()
{
    long before = System.nanoTime();

    // All the GL stuff here
    // ...

    long after = System.nanoTime();
    System.out.println("FPS: "+computeFrom(before, after));
}

or

private long previous = 0;

void display()
{
    // All the GL stuff here
    // ...

    long current = System.nanoTime();
    System.out.println("FPS: "+computeFrom(previous, current));
    previous = current;
}
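In both variants, 'computeFrom' is only a placeholder; a hypothetical implementation could simply invert the elapsed time:

// Hypothetical helper: converts two System.nanoTime() readings into an FPS value
private static double computeFrom(long startNanos, long endNanos)
{
    long delta = endNanos - startNanos;
    if (delta <= 0)
    {
        return 0.0;
    }
    return 1.0e9 / delta;
}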

This choice of measurement may make a difference, as mentioned above (480 vs. 670 FPS), but the difference will probably depend heavily on what the program is doing in addition to the "GL stuff"… I think that in many cases, these FPS counts may primarily be used to compare different implementations of the same rendering method, but they can hardly be seen as an "absolute benchmark value" - especially across language borders like those between C and Java. For example, in JOGL there is usually an "Animator" running in such small samples. This may be roughly equivalent to something like glutIdle and glutPostRedisplay, but in both cases there is SO much happening "under the hood" that it is hard to compare the results objectively.
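Just to illustrate what the "Animator" refers to (a sketch - the listener name is made up, and the package names differ between JOGL versions):

// Continuously triggers display() calls on the component, roughly comparable
// to a glutIdle/glutPostRedisplay loop
GLCanvas glComponent = new GLCanvas();
glComponent.addGLEventListener(myGLEventListener);

Animator animator = new Animator(glComponent);
animator.start();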

For measuring the precise execution time of a CUDA kernel, there are other (and more reliable) mechanisms, like CUDA events or System.nanoTime calls wrapped around the actual function call, as you also pointed out in the other thread.
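For example, with the JCuda runtime API, an event-based measurement could look roughly like this (a sketch - the kernel launch itself is omitted, and passing null for the default stream is an assumption that may depend on the JCuda version):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaEvent_t;

// ...

cudaEvent_t start = new cudaEvent_t();
cudaEvent_t stop = new cudaEvent_t();
JCuda.cudaEventCreate(start);
JCuda.cudaEventCreate(stop);

JCuda.cudaEventRecord(start, null); // null: default stream (assumption)

// ... launch the kernel here ...

JCuda.cudaEventRecord(stop, null);
JCuda.cudaEventSynchronize(stop);

float[] elapsed = { 0.0f };
JCuda.cudaEventElapsedTime(elapsed, start, stop);
System.out.println("Kernel time: " + elapsed[0] + " ms");

JCuda.cudaEventDestroy(start);
JCuda.cudaEventDestroy(stop);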

bye
Marco