JCuda for co-occurrence matrix

Hello everyone,
I am new to CUDA programming and wanted help in converting my current multithreaded CPU program to be able to run on GPU.

My task is to parallelize the co-occurrence matrix creation that I do in my traditional Java program. More specifically, I have a huge file, where each line represents a document. Each line contains about 100 words (for ease, the words are converted into word indices). The number of lines/documents is about 300K. For simplicity, let’s say the co-occurrence window is one document, i.e., how many times word X appeared with word Y in a document/line. In other words, each line/document needs to go through a nested loop.

Now, because of the magnitude of the resulting co-occurrence matrix, my current program follows a model similar to the external sort-merge algorithm, where during the merge phase I keep summing up the number of times a pair of words has repeated, thus finally yielding a complete co-occurrence matrix.

I was wondering at which steps parallelizing through the GPU would help, and whether there is any JCuda code that the community can help me get started with:
Step-(a) creating all pairs (Cartesian product) (output is multiple files, each line containing a pair of words)
Step-(b) sorting each file from step-(a)
Step-(c) merging the output of step-(b) while summing up the repetitions.
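As a small illustration of what I mean by steps (a) and (b): generate each document’s Cartesian product, buffer the pairs, and flush each sorted buffer as one run file. This is only a sketch in plain Java (the class name PairRunWriter and the tab-separated example lines are made up for illustration):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class PairRunWriter {

    // Step (a)+(b): emit each document's Cartesian product, buffer up to
    // maxLines pairs in memory, then sort the buffer and flush it as one run file.
    static List<Path> writeSortedRuns(List<String> lines, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String line : lines) {
            String[] words = line.split("\t");
            for (String w0 : words)
                for (String w1 : words)
                    buffer.add(w0 + w1);          // step (a): Cartesian product
            if (buffer.size() >= maxLines)
                runs.add(flush(buffer));          // step (b): sorted run on disk
        }
        if (!buffer.isEmpty())
            runs.add(flush(buffer));
        return runs;
    }

    private static Path flush(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("pairs", ".txt");
        Files.write(run, buffer);
        buffer.clear();
        return run;
    }

    public static void main(String[] args) throws IOException {
        List<Path> runs = writeSortedRuns(
                Arrays.asList("A\tB\tC\tD\tE", "A\tC\tF\tG\tE"), 20);
        // Each document's 25 pairs exceed the 20-line buffer, so each
        // document flushes as its own run file.
        System.out.println(runs.size());
    }
}
```

Step (c) would then be a k-way merge over these sorted run files.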

I don’t really know the process that you are describing, but from the description of the individual steps, it sounds like this might not really be a compute-intensive task. Particularly, you mentioned files, and if I understood the last part (describing steps a-c) correctly, then I’d be surprised if the file I/O wasn’t the most expensive part.

Did you do some performance tests, and could you identify a bottleneck where it’s really the CPU that is busy with computations? (Specifically, it would have to be data-parallel and not memory-bound computations in order to have the chance to achieve a good speedup by doing this on the GPU…)

Hi Marco,

Thanks for the reply. As per my benchmarking, the two bottlenecks are the Cartesian product and sorting the results of all Cartesian products.

I apologize that my description didn’t clearly state the problem I am solving.

Let me try to give you an example, I have a text file with lines

A \t B \t C \t D \t E
A \t C \t F \t G \t E

Each line in this file is one document.

My multi-threaded program (the producer) reads each line of the text file and fills a queue. The consumer thread then dequeues a line and performs the Cartesian product. The Cartesian product is essentially just a nested for loop (whose outer loop is again parallelized through streams).
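The producer/consumer split described here can be sketched in plain Java. The class and method names are made up for illustration, and a temp file with the two example documents stands in for the real input:

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class ProducerConsumerSketch {
    private static final String POISON = "__EOF__"; // sentinel ending the stream

    // Producer: reads lines (documents) into a bounded queue.
    // Consumer: dequeues each document and emits its Cartesian product of pairs.
    static List<String> producePairs(Path input) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        List<String> pairs = Collections.synchronizedList(new ArrayList<>());

        Thread producer = new Thread(() -> {
            try {
                for (String line : Files.readAllLines(input)) queue.put(line);
                queue.put(POISON);
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        Thread consumer = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(POISON)) {
                    String[] words = line.split("\t");
                    for (String w0 : words)
                        for (String w1 : words)
                            pairs.add(w0 + w1);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        Path input = Files.createTempFile("docs", ".txt");
        Files.write(input, Arrays.asList("A\tB\tC\tD\tE", "A\tC\tF\tG\tE"));
        // 5 words per document, so 25 pairs per document
        System.out.println(producePairs(input).size());
    }
}
```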

So for Doc1 - A \t B \t C \t D \t E

Cartesian product Step 1 - A: AA, AB, AC, AD, AE


Doc 2 - A \t C \t F \t G \t E

Cartesian product Step 1 - A: AA, AC, AF, AG, AE


Ideally, if the text file were small, I could have kept a hashmap to record that AA has occurred 2 times, AC has occurred 2 times, and so on. Since the text file is huge, and I need a cumulative “pair count” (similar to word count), what I do is buffer the output of the Cartesian product up to X lines in memory and then flush it to hard disk. So I have N such files, each line containing pairs like AA\n AC\n AF\n, etc., with a maximum of X lines per file. Then I simply do an external k-way sort merge to produce one big file where all the lines are lexicographically sorted. Now I can count the occurrences of each pair by scanning the lexicographically sorted file only once.
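That final single scan works because equal pairs are adjacent in a sorted file, so a run-length count is enough. A sketch of that idea (made-up names; a List stands in for the merged file):

```java
import java.util.*;

public class SortedPairCounter {

    // One pass over lexicographically sorted pairs: equal pairs are adjacent,
    // so a running counter per run is enough; no map over all pairs is needed.
    static List<String> countRuns(List<String> sortedPairs) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < sortedPairs.size()) {
            String current = sortedPairs.get(i);
            int count = 0;
            while (i < sortedPairs.size() && sortedPairs.get(i).equals(current)) {
                count++;
                i++;
            }
            result.add(current + "\t" + count);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the lexicographically sorted merged file
        List<String> sorted = Arrays.asList("AA", "AA", "AB", "AC", "AC", "AD");
        System.out.println(countRuns(sorted)); // [AA	2, AB	1, AC	2, AD	1]
    }
}
```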

So from my limited understanding of GPU parallelization, I believe the steps related to the Cartesian product and the external sort merge could be implemented in JCuda. However, I am not sure about the implementation approach in JCuda and would greatly appreciate any pointers or starter code to help me.

Hope this makes my problem a little clearer and thanks again for helping me.

Things became a little bit clearer now, at least conceptually.

It’s not (yet) clear to me why you are reading the lines in one thread and processing them in another, but I assume that this is (roughly) an attempt to decouple IO and processing, and to hide the IO latencies from the processing core.

On the one hand, disc IO and the GPU are very much unrelated, so this may not be relevant regarding CUDA right now. But in general, roughly speaking: multi-threaded disc IO tends to be slower than reading/writing one large chunk sequentially. This also applies to the fact that the outer loop of the Cartesian product is already parallelized with streams, and that the outputs of the Cartesian products are stored in files.

What is also not yet clear to me: why you are buffering this in files. But I assume that this is related to the fact that…

  • you initially do not know the „set of all words“
  • the matrix would be too large to be stored in memory

The latter would be a bit surprising, though: Even a 4000x4000 matrix should easily fit into memory (and larger ones could probably be split in some way…).
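As a rough back-of-the-envelope check (assuming the counts are stored as 4-byte ints in a dense matrix):

```java
public class MatrixMemory {

    // Memory for a dense n x n count matrix of ints (4 bytes each)
    static long matrixBytes(int n) {
        return (long) n * n * Integer.BYTES;
    }

    public static void main(String[] args) {
        // 4000 x 4000 ints: 64,000,000 bytes, roughly 61 MB
        System.out.println(matrixBytes(4000) / (1024 * 1024) + " MB");
    }
}
```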

How many distinct words are you dealing with?

Regarding the process as a whole, as far as I understood, it boils down to this pseudo code:

String line = readLine();
List<String> words = tokenize(line);
for (String w0 : words) {
    for (String w1 : words) {
        int index0 = matrixIndexFor(w0);
        int index1 = matrixIndexFor(w1);
        matrix[index0][index1]++;
    }
}

where …

  • the readLine is decoupled from the rest via producer-consumer
  • the outer loop is parallelized, probably trivially via words.parallelStream().forEach(...)
  • the matrixIndexFor(word) does not exist, because you don’t know the set of all words
  • the matrix[r][c]++ cannot be implemented like that, because you don’t know the indices

This may not be the „answer“ that you are looking for (and these may be utterly stupid questions), but…

  • Do you think that you could compute the „set of all words“ initially?
  • Do you think that you could create some lookup that maps words to their (final) matrix indices? (Maybe just a Map<String, Integer>?)
  • Do you think that the final matrix could be kept in memory?
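Assuming the answer to all three questions were „yes“, the whole process would collapse into a simple two-pass, in-memory version. A minimal sketch (made-up names, plain Java):

```java
import java.util.*;

public class InMemoryCooccurrence {

    // Pass 1: assign a matrix index to each distinct word
    // (this is the Map<String, Integer> lookup mentioned above)
    static Map<String, Integer> buildIndex(List<String> lines) {
        Map<String, Integer> indexFor = new LinkedHashMap<>();
        for (String line : lines)
            for (String word : line.split("\t"))
                indexFor.computeIfAbsent(word, w -> indexFor.size());
        return indexFor;
    }

    // Pass 2: fill the dense co-occurrence count matrix
    static int[][] buildMatrix(List<String> lines, Map<String, Integer> indexFor) {
        int n = indexFor.size();
        int[][] matrix = new int[n][n];
        for (String line : lines) {
            String[] words = line.split("\t");
            for (String w0 : words)
                for (String w1 : words)
                    matrix[indexFor.get(w0)][indexFor.get(w1)]++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A\tB\tC\tD\tE", "A\tC\tF\tG\tE");
        Map<String, Integer> indexFor = buildIndex(lines);
        int[][] matrix = buildMatrix(lines, indexFor);
        // AA occurs in both documents, AB only in the first
        System.out.println(matrix[indexFor.get("A")][indexFor.get("A")]); // 2
        System.out.println(matrix[indexFor.get("A")][indexFor.get("B")]); // 1
    }
}
```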

Even if this was possible, I doubt that CUDA would bring a good speedup here: There is basically no computation involved in all this. You are basically only reading and writing memory.

But if the answer to the questions above was „Yes“, then one could try to think of a tricky approach here - maybe some sort of reduction that can be implemented on the GPU. But until now, this is only a vague guess…