cudaMemCpy(...) cudaThreadSynchronize()

Can it be that cudaMemCpy(…) does not blocks my host in JCuda ?

I read that it should call cudaThreadSynchronize() implicitly.

In my application is a loop with 2-3 kernel invokations, in this loop a cpu data structure (tree) is updated with
the results from the kernel calls. Each Kernel invokation is already followed by an cudaThreadSynchronize()

When my Loop Counter increases I get errors in my data structure.

For small loop counter the results are perfect compared to a cpu version.

JCuda.cudaSetDeviceFlags(JCuda.cudaDeviceBlockingSync);

Did not helped that much, I think the error occurs less often but sometimes it still occurs.

I am programming a Text Mining Hierarchical Clustering framework using the GPU for the math intense
calculations for my bachelor thesis.

One big Problem is, that if I do not output every step (includes memCpy from Device to Host), my
Binary Cluster Tree on CPU side gets calculated wrong (parent has itself as child).

Sometimes it still works, but the fault rate increases as document count increases --> which increases the number of necessary iterations.

So, if i make my code very slow and copy all results from device to host back, my binary cluster tree is created perfectly.

For small document counts < 15 my code works perfect and already extreme fast compared to cpu
Some steps already 300 times faster for just 100 documents.

**So I think I am missing some cudaThreadSynchronize().

Each Kernel invokation is already followed by a call of cudaThreadSynchronize().**

A big term-document matrix (TDM) holding term frequency (TF) values for each document is created on the cpu.

Important Steps are.

[ol]
Tokenisation:
[li]CPU MemCpy: copy TDM to Device //longest operation!
[/li][li]GPU Kernel: calculate inverse document frequency and update TDM with TF-IDF values
[/li][li]GPU Kernel: normalize document column vectors
[/li]
Document Similarities:
[li]CPU Allocate: linearized document-document similarity matrix (DDM_0)
[/li][li]CPU MemCpy: row and column integer index arrays
[/li]//obsolet, could be calculated from linear index
[li]GPU Kernel: calculate pairwise document to document similarity values
[/li][li]CPU Free: TDM
[/li][li]CPU: create cluster objects with pairwise documents and their similarity values
[/li]
Error must appear at the following steps, since the CPU binary tree is here updated
Clustering: Loop (i) until only one cluster Remains
[li]GPU Kernel: get cluster with maximum similarity linear index
[/li][li]CPU MemCpy: copy max index from device to host
[/li] -
[li]CPU Allocate: one iteration smaller linearized DDM_i
[/li][li]CPU Allocate: neighbour reference array (NR-array_i)
[/li][li]GPU Kernel: update neighbours of the maximum cluster, and output neighbour linear indexes
[/li][li]CPU MemCpy: copy DDM_i, NR-array_i, and from device to host
[/li] -
[li]CPU: update CPU cluster binary tree using NR-array
[/li][li]CPU Free: NR-array_i, TDM_(i-1)//free not necessary each iteration --> offset pointer and memSet to 0
[/li]
Cleaning up:
[li]CPU Free: TDM_i, row and column integer index arrays
[/li][/ol]

**

Do I need to Block my Host on memory copy operations ?
**

I ran my programm now for around 15-20 times and the error did not occured anymore when I use two Thread.Sleep(100) in my clustering loop.
Usually it definetly happens 1 out of 5 times.

First one is after receiving the maximum cluster index method, after point 11.
Second one is after the cluster binary tree updating method, after point 17

So I assume, that my Java Thread does not block!,
either on a cudaMemCpy(…) or my cudaThreadSynchronize() calls after a kernel

Hello

This sounds like a quite complex algorithm. It’s hard (for me) to grasp it at the first glance, or figure out what might be wrong there. Additionally, I have not yet so much experience with using CUDA from multiple threads. But in general, specifically the cudaMemCpy calls should all be blocking the host thread (in contrast to the cudaMemCpy*Async calls).

(BTW: cudaThreadSynchronize is deprecated in the latest CUDA version. It should still work, but has been replaced by cudaDeviceSynchronize)

But maybe cuCtxSynchronize is more approriate here: You must obviously be mixing the Runtime and the Driver API. I’m not sure whether this may have non-obvious effects on the synchronization, but specifically for synchronization in the driver API, there is “cuCtxSynchronize”. Admittedly, the precise usage of these functions is somewhat “advanced”. My first, naive approach would be to use only the NOT-‘Async’ operations (without streams, and always from the same thread) and to call cuCtxSynchronize after the kernel launches.

Do you create your own Context, or are you connecting to the Context created by the Runtime Libraries?

bye
Marco

Some more facts:

  1. I only use one Java Thread, the main Thread
  2. I only use the runtime api, no driver api is used
  3. I do not use streams, just simple cernel calls.

The most important part is the loop in the algorithm, where the error occurs.


//list of clusters which are merged into a binary tree
//one cluster holds two children!
ArrayList<Cluster> clusters;

float[] inputClusterArray;

//loop
for(int i = 0; i < this.iterations; i++)
{
	//get index of maximum value in the array
	//calls one or two gpu kernels if necessary
	//copy result back from device to host
	maxKernel.call(inputClusterArray);
	cudaThreadSynchronize();
	int maxIndex = maxKernel.memCpy(deviceToHost);

	//allocate a new smaller array
	clusterKernel.call(inputClusterArray, maxIndex);
	cudaThreadSynchronize();
	float[] newClusterArray = clusterKernel..memCpy(deviceToHost);

	//update references to build our tree
	clusters = updateClusters(clusters, newClusterArray);

	//update inputClusterArray
	inputClusterArray = newClusterArray;
}

//at the end only the root cluster is in my cluster array!

Kernels can only be called using the Driver API, so most likely, both APIs are used. In general, they are interoperable, but maybe there as some specific potential pitfalls when it comes to the details of the differences between things like cudaDeviceSynchronize and cuCtxSynchronize. But so far, I can only guess what might be wrong. I assume that it is hardly possible to extract as small(!) example that reproducably creates wrong results…?

BTW: the code above was pseudo code, very simplified

How can I call a Kernel using only the Runtime ?,
or do you have a kernelLauncher for the Runtime ?

i use:

JCuda.cudaMalloc(...)
JCuda.cudaMemcpy(...)
JCuda.cudaMemSet(...)

jcuda.utils.KernelLauncher;
jcuda.utils.KernelLauncher.setup(...)
jcuda.utils.KernelLauncher.call(...)```

As explained in the Introduction of the Tutorial, Kernels can only be executed with the Driver API in Java.

The sequence of operations should be fine in general. Did you try adding “cuCtxSynchronize” after the kernel call? Just for the case the next iteration will start immediately by reading results that the kernel is otherwise still computing…

ah ok,

I have an idea!

CudaFree is a non blocking operation ?, isn’t it ?

I have a few device pointer arrays outside my loop,
which first get freed, and then get replaced with the new smaller pointer arrays

Maybe the replacement is faster then the freeing! Especially if the array sizes increase.

Because I also get sometimes strangely “invalid device pointer” errors

This is most likely related to the other thread: http://forum.byte-welt.de/showthread.php?p=16102#post16102 - maybe we can combine them to find an answer.

Hmm I maybe found the error.
In my CPU Version i create new cluster / node objects in the loop.

At my GPU Version I just tried to create the tree by updating child references.
With new nodes I think the error will not occur anymore.

Will keep you informed.

And I am very thankfull for your help.

That was the error :), now to my next problem ^^