JCuda for co-occurrence matrix

Hello everyone,
I am new to CUDA programming and would like help converting my current multithreaded CPU program to run on the GPU.

My task is to parallelize the co-occurrence matrix creation that I do in my traditional Java program. More specifically, I have a huge file where each line represents a document. Each line contains about 100 words (for ease, the words are converted into word indices). The number of lines/documents is about 300K. For simplicity, let’s say the co-occurrence window is one document, i.e., how many times word X appeared with word Y in a document/line. In other words, each line/document needs to go through a nested loop.

Now, because of the magnitude of the resulting co-occurrence matrix, my current program follows a model similar to an external sort-merge algorithm, where during the merge phase I keep summing up the number of times a pair of words has repeated, finally yielding a complete co-occurrence matrix.

I was wondering at which steps parallelizing through the GPU would help, and whether there is any JCuda code the community could share to help me get started:
Step-(a) creating all pairs (Cartesian product) (output is multiple files, each line containing a pair of words)
Step-(b) sorting each file from step-(a)
Step-(c) merging the output of step-(b) while summing up the repetitions.

I don’t really know the process that you are describing, but from the description of the individual steps, it sounds like this might not really be a compute-intensive task. Particularly, you mentioned files, and if I understood the last part (describing steps a-c) correctly, then I’d be surprised if the file I/O wasn’t the most expensive part.

Did you do some performance tests, and could you identify a bottleneck where it’s really the CPU that is busy with computations? (Specifically, it would have to be data-parallel and not memory-bound computations in order to have the chance to achieve a good speedup by doing this on the GPU…)

Hi Marco,

Thanks for the reply. As per my benchmarking, the two bottlenecks are the Cartesian product and sorting the results of all Cartesian products.

I apologize that my description didn’t clearly state the problem I am solving.

Let me try to give you an example, I have a text file with lines

A \t B \t C \t D \t E
A \t C \t F \t G \t E

Each line in this file is one document.

My multithreaded program has a producer that reads each line of the text file and fills a queue. A consumer thread then dequeues each line and computes the Cartesian product. The Cartesian product is essentially just a nested for loop (whose outer loop is again parallelized through streams).
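Such a producer/consumer split with a per-line Cartesian product can be sketched in plain Java with a BlockingQueue. All class and variable names here are illustrative, not from the actual program:

```java
import java.util.*;
import java.util.concurrent.*;

public class PairProducerConsumer {
    // Sentinel value telling the consumer to stop (an illustrative convention)
    static final String EOF = "__EOF__";

    // Enumerate the Cartesian product of the words on one tab-separated line
    public static List<String> cartesian(String line) {
        String[] words = line.split("\t");
        List<String> pairs = new ArrayList<>();
        for (String w0 : words)
            for (String w1 : words)
                pairs.add(w0 + "\t" + w1);
        return pairs;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        List<String> output = Collections.synchronizedList(new ArrayList<>());

        Thread consumer = new Thread(() -> {
            try {
                String line;
                while (!(line = queue.take()).equals(EOF)) {
                    output.addAll(cartesian(line));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // The producer would normally read lines from the huge input file
        queue.put("A\tB\tC");
        queue.put(EOF);
        consumer.join();

        System.out.println(output.size()); // 3 words -> 9 pairs
    }
}
```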

So for Doc1 - A \t B \t C \t D \t E

Cartesian product Step 1 - A: AA, AB, AC, AD, AE (and so on for the remaining outer words B, C, D, E)

Doc 2 - A \t C \t F \t G \t E

Cartesian product Step 1 - A: AA, AC, AF, AG, AE (and so on for the remaining outer words)


Ideally, if the text file were small, I could have kept a hashmap recording that AA has occurred twice, AC has occurred twice, and so on. Since the text file is huge and I need a cumulative “pair count” (similar to word count), what I do is buffer the output of the Cartesian product up to X lines in memory and then flush it to the hard disk. So I have N such files, each line containing a pair like AA\n AC\n AF\n, etc., with at most X lines per file. Then I simply do an external k-way sort-merge to produce one big file where all the lines are lexicographically sorted. Now I can count the occurrences of each pair by scanning the lexicographically sorted file only once.
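That final counting pass over the sorted data can be sketched as follows. The iterator stands in for the merged, lexicographically sorted file; the names are illustrative:

```java
import java.util.*;

public class SortedPairCounter {
    // Count consecutive duplicates in a lexicographically sorted sequence
    // of pair strings: one pass, constant memory per run of equal pairs.
    public static List<String> countRuns(Iterator<String> sortedPairs) {
        List<String> counts = new ArrayList<>();
        String current = null;
        long count = 0;
        while (sortedPairs.hasNext()) {
            String pair = sortedPairs.next();
            if (pair.equals(current)) {
                count++;
            } else {
                if (current != null) counts.add(current + "\t" + count);
                current = pair;
                count = 1;
            }
        }
        if (current != null) counts.add(current + "\t" + count);
        return counts;
    }

    public static void main(String[] args) {
        // In the real program this iterator would stream the merged file
        List<String> sorted = Arrays.asList("AA", "AA", "AC", "AC", "AF");
        System.out.println(countRuns(sorted.iterator()));
        // prints [AA	2, AC	2, AF	1]
    }
}
```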

So, from my limited understanding of GPU parallelization, I believe the Cartesian product step and the external sort-merge could be implemented in JCuda. However, I am not sure about the implementation approach in JCuda and would greatly appreciate any pointers or starter code to help me.

Hope this makes my problem a little clearer and thanks again for helping me.

Things became a little bit clearer now, at least conceptually.

It’s not (yet) clear for me why you are reading the lines in one thread and processing them in another, but I assume that this is (roughly) an attempt to decouple IO and processing, and hide the IO latencies from the processing core.

On the one hand, disc IO and the GPU are very much unrelated, so this may not be relevant regarding CUDA here right now. But in general, roughly speaking: Multi-threaded disc IO tends to be slower than reading/writing one large chunk sequentially. This also refers to the fact that you mentioned that the outer loop of the cartesian product is already parallelized with streams, and the outputs of the cartesian products are stored in files.

What is also not yet clear for me: why you are buffering this in files. But I assume that this is related to the fact that…

  • you initially do not know the „set of all words“
  • the matrix would be too large to be stored in memory

The latter would be a bit surprising, though: Even a 4000x4000 matrix should easily fit into memory (and larger ones could probably be split in some way…).
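A quick back-of-the-envelope check of that claim, assuming dense int counters:

```java
public class MatrixMemory {
    // A dense vocab x vocab co-occurrence matrix of int counters
    public static long denseMatrixBytes(int vocab) {
        return (long) vocab * vocab * Integer.BYTES;
    }

    public static void main(String[] args) {
        // A 4000-word vocabulary needs about 61 MiB: fine for main memory
        System.out.println(denseMatrixBytes(4000) / (1024 * 1024) + " MiB");
        // prints 61 MiB
    }
}
```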

How many distinct words are you dealing with?

Regarding the process as a whole, as far as I understood, it boils down to this pseudo code:

String line = readLine();
List<String> words = tokenize(line);
for (String w0 : words) {
    for (String w1 : words) {
        int index0 = matrixIndexFor(w0);
        int index1 = matrixIndexFor(w1);
        matrix[index0][index1]++;
    }
}

where …

  • the readLine is decoupled from the rest via producer-consumer
  • the outer loop is parallelized, probably trivially via words.parallelStream().forEach(...)
  • the matrixIndexFor(word) does not exist, because you don’t know the set of all words
  • the matrix[r][c]++ cannot be implemented like that, because you don’t know the indices

This may not be the „answer“ that you are looking for (and it may be utterly stupid questions), but…

  • Do you think that you could compute the „set of all words“ initially?
  • Do you think that you could create some lookup that maps words to their (final) matrix indices? (Maybe just a Map<String, Integer>?)
  • Do you think that the final matrix could be kept in memory?
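If the answer to all three questions was „yes“, the whole pipeline could collapse into something like the following plain-Java sketch. The vocabulary, documents, and all names are made up for illustration:

```java
import java.util.*;

public class InMemoryCooccurrence {
    // Build the full co-occurrence matrix in memory, given a word->index lookup
    public static int[][] buildMatrix(List<String[]> documents,
                                      Map<String, Integer> indexOf) {
        int n = indexOf.size();
        int[][] matrix = new int[n][n];
        for (String[] words : documents) {
            for (String w0 : words) {
                for (String w1 : words) {
                    matrix[indexOf.get(w0)][indexOf.get(w1)]++;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, Integer> indexOf = new HashMap<>();
        indexOf.put("A", 0); indexOf.put("B", 1); indexOf.put("C", 2);
        List<String[]> docs = Arrays.asList(
            new String[]{"A", "B", "C"},
            new String[]{"A", "C"});
        int[][] m = buildMatrix(docs, indexOf);
        System.out.println(m[0][0]); // "AA" occurred in both documents -> 2
    }
}
```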

Even if this was possible, I doubt that CUDA would bring a good speedup here: There is basically no computation involved in all this; you are only reading and writing memory.

But iff the answer to the questions above was „Yes“, then one could try to think of a tricky approach here - maybe some sort of reduction, which can be implemented on the GPU. But until now, this is only a vague guess…

Hi Marco,
Sincere apologies for the delay in getting back.
Regarding your questions:

  1. Do you think that you could compute the „set of all words“ initially?
    Yes. The set of words (vocabulary) can be computed before hand.
  2. Do you think that you could create some lookup that maps words to their (final) matrix indices? (Maybe just a Map<String, Integer> ?)
    It is possible; I already convert each string word into a unique integer index, which could be used as its position in the final matrix. However, I don’t think we can create the matrix, due to the answer to (3).
  3. Do you think that the final matrix could be kept in memory?
    No, the final matrix is just too big. That is the reason why I have adopted such a circuitous route for calculating the co-occurrence matrix (i.e., first enumerating all pair combinations and persisting them to the hard drive, followed by an external merge sort and aggregation [similar to map-reduce] to get the counts of the pairs).

I was going through the regular CUDA C++ examples over the last couple of weeks and I found a code which does the cartesian product (all pair combination) on integer arrays - https://stackoverflow.com/questions/16196321/generate-cartesian-product-using-cuda-on-gpu. Now I can list the words in their unique index form (integer or long) and I believe in theory this example is applicable. But I do not know how to translate this example in JCUDA.
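Since the words are already available as unique integer indices, one option (my own assumption, not part of the linked example) would be to pack each pair into a single long, which turns the sort into a plain numeric sort instead of a string sort:

```java
public class PairEncoding {
    // Pack two non-negative 32-bit word indices into one long
    public static long encode(int w0, int w1) {
        return ((long) w0 << 32) | (w1 & 0xFFFFFFFFL);
    }

    public static int first(long pair)  { return (int) (pair >>> 32); }
    public static int second(long pair) { return (int) pair; }

    public static void main(String[] args) {
        long p = encode(42, 7);
        System.out.println(first(p) + " " + second(p)); // prints 42 7
        // Sorting a long[] of packed pairs orders them by (w0, w1),
        // matching the lexicographic order of the corresponding string pairs.
    }
}
```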

Indeed, I thought about this problem for a while back then, but now would have to re-read it to refresh my memory.

I think that one could try to use CUDA here at some point. I already mentioned the possibility to maybe compute rows (or rather sets of rows) at one time, and then applying some sort of reduction to obtain the final matrix rows. But again: This is mainly reading/writing memory (and lots of memory), and the only computation that is done is basically the matrix[i][j]++ call. It’s very unlikely that CUDA really brings a speedup here.
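A rough host-side sketch of that row-block idea: fix a range of row indices, stream over all documents once, and count only pairs whose first word falls into that range, so that only one slice of the matrix is in memory at a time. All names here are illustrative:

```java
import java.util.*;

public class RowBlockCounter {
    // Count co-occurrences only for rows [rowStart, rowEnd),
    // so one block of the matrix fits in memory at a time.
    public static int[][] countBlock(List<int[]> documents,
                                     int rowStart, int rowEnd, int vocab) {
        int[][] block = new int[rowEnd - rowStart][vocab];
        for (int[] words : documents) {
            for (int w0 : words) {
                if (w0 < rowStart || w0 >= rowEnd) continue;
                for (int w1 : words) {
                    block[w0 - rowStart][w1]++;
                }
            }
        }
        return block;
    }

    public static void main(String[] args) {
        List<int[]> docs = Arrays.asList(new int[]{0, 1, 2}, new int[]{0, 2});
        int[][] block = countBlock(docs, 0, 1, 3); // only row 0
        System.out.println(Arrays.toString(block[0])); // prints [2, 1, 2]
    }
}
```

Repeating this over successive row ranges (and writing each finished block out) would cover the whole matrix without ever holding it all in memory.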

(Or to phrase it that way: In order to use CUDA for this task at all, one would have to structure the problem and the overall process in a way where it’s eventually no longer relevant whether CUDA is used or not…)

But again: Take this with a grain of salt. There may be experts for this sort of task, who know tricky approaches that I’m not aware of…

Regarding the cartesian_product function from the answer that you linked to: Yeah, it does what it is supposed to do. But this also is something that is only about shuffling memory around. If someone did a benchmark and compared this to a plain, host-side creation of a std::pair, I’m sure it would turn out to be slower when done on the GPU. If you want to give it a try nevertheless: The most basic example at

shows how to call a simple kernel using the CUDA Driver API. You can insert another kernel there. Then you have a program that generates a laaaarge cartesian product. I’m not sure how this should help you with your task, though…

Thanks Marco. The following loop, which I use to generate co-occurrences, is identical to a Cartesian product. Because there is no way I can store the co-occurrence matrix in memory, I am generating the Cartesian products first and then sorting and reducing them on the hard disk to count the pairs. Thanks so much for the help and pointers… I will work on it!

for (String w0 : words) {
    for (String w1 : words) {
        String pair = w0 + "\t" + w1;
        // buffer the pair, flush to disk every X lines
    }
}

Again, in order to provide „constructive, profound“ hints here, I’d need a clearer idea of the current approach. You mentioned processes, threads, and disc buffering, and it’s hard to imagine where CUDA could come into play here.

I could try to sketch the computation of such a matrix in plain Java, „how I would do it“, but this might (in the worst case) be useless for you, or (in the „best“ case) solve your problem without CUDA :wink:

(For example: My solution wouldn’t involve string concatenations like w0+"\t"+w1, and would only compute half of the matrix, because it’s symmetric…)
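To illustrate, such a half-matrix variant on integer indices might look roughly like this (only a sketch, not the actual solution):

```java
import java.util.*;

public class UpperTriangleCount {
    // Count only normalized pairs (i, j) with i <= j; the full matrix
    // can be recovered from symmetry, and no strings are concatenated.
    public static Map<Long, Integer> countUpper(int[] words) {
        Map<Long, Integer> counts = new HashMap<>();
        for (int a = 0; a < words.length; a++) {
            for (int b = a; b < words.length; b++) {
                int i = Math.min(words[a], words[b]);
                int j = Math.max(words[a], words[b]);
                long key = ((long) i << 32) | j; // packed pair as map key
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, Integer> c = countUpper(new int[]{0, 1, 0});
        // Pairs: (0,0) from positions (0,0), (0,2), (2,2); key of (0,0) is 0L
        System.out.println(c.get(0L)); // prints 3
    }
}
```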

But if you can share an example of your current approach (as something that can easily be compiled and tested, maybe with some dummy data etc), and a clear comment like

// This part is slow, here, I want to use CUDA

then I could have a look at it. (Although I can, of course, not promise that I will find a sensible solution with CUDA, and even less that it will be noticeably faster than a (sensible, efficient) plain Java implementation.)