How to pass an array of multidimensional rows and two columns

I am a PhD student. I compare two ontologies O1 and O2 using GPU capability.I need to pass a collection of strings and integers in one array, for example,
O1 have two elements, TechnicalStaff, firstName.
O2 have two elements. AssociateProf, SurName.
I have to identify an id related to each ontology elements in order to know correspondences to be compared. I need to pass from host to device an array as follows. First, I have to divide the whole names into tokens and then pass

arr1 = { (Technical,1).(Staff,1),(first,2),(Name,2)}
arr2 = { (Associate,1).(Prof,1),(Sur,2),(Name,2)}

You did not provide many (technical) details. So the following answer may be a bit vague. The „best“ solution will depend on many factors that you’ll probably have to figure out by reading lots of related work, and by some experiments.

In general, if you want to move Strings from Java to the GPU, this will raise some questions. A String in Java is (basically) a char[] array. But a char in Java is different from a char in C/CUDA. In Java, a char has 16 bits, whereas in C, a char has 8 bits. So if you have to move strings from Java to CUDA, you’d have to convert it into a byte[] array, with String#getBytes.

But there is another difficulty: The strings will most likely have different lengths. So it would be really hard to find a sensible data structure in order to pass this data to CUDA.

Another, very general hint is that for GPU programming, you should usually prefer a „Structure Of Arrays“ instead of an „Array Of Structures“. When you have something like

class Point2D {
    float x;
    float y;
List<Point2D> points = ...;

and want to pass these points to the GPU, then you could try to pass it to the GPU as a float[n][2] array (i.e. an array of 2-element arrays). But for the GPU, it is usually much better (and actually easier) to instead pass it as two arrays, like

float xCoordinates[] = extractAllXCoordinates(points);
float yCoordinates[] = extractAllYCoordinates(points);
doSomethingOnGpu(xCoordinates, yCoordinates);

Some more information (and a similar example) can be found at

Again, the description of the problem lacks some technical details. But it sounds like you might consider replacing the Strings with integers, if this suits your problem. Very roughly speaking, and with many guesses about what you want to achieve:

When you have something like

class Entry {
    String word;
    int count;
List<Entry> arrayA = ...;
List<Entry> arrayB = ...;

where these arrays should be compared, then you could try to bring them into a shape that is well-suited for the GPU. This could, for example, be done like that (Pseudocode!):

// Create a map from integer values to words
Map<Integer, String> wordMap = ...;
for (String word : wordList) {
    wordMap.put(wordMap.size(), word);

// Convert the list of entries from the form
//   { { "Foo", 123 },
//     { "Bar",  45 },
//     { "Zap",  78 } }
// into two arrays
//  words = [ 1, 0, 2 ] // The integers used to look up the words in the wordMap
//  counts = [ 123, 45, 78 ]

int wordsA[] = new int[N];
int countsA[] = new int[N];
int i = 0;
for (Entry entry : arrayA) {
    wordsA[i] = wordMap.get(entry.word);
    countsA[i] = entry.count;

// (The same for the other entries)

// Finally:
compareOnGpu(wordsA, countsA, wordsB, countsB);

Once more, the disclaimer: It’s not clear how your input data is structured, what you want to compute, and how you return the result for the computation back from the GPU to Java. But these are some general hints, maybe they are applicable in a similar form to your problem.

Thanks a lot Mr.Marco.

In order not to mislead you. I have made a kernel that accepts two tokens (two character set of characters) using Levenstein distance based on warp shuffle instructions.

All I need to pass a vector of strings from Jcuda to cuda. Knowing that these strings belong to different clusters and different words. Each Cluster contains a set of words.

For example, we divide O1 into two clusters,
C11 = {Bone Organ, Short} , C12 = {Flat Bone,Nasal Bone,Scapula}
And divide O2 into two clusters,
C21 = {Bone, Skull,Phalanx } , C12 = {Flat Bone,Nasal Bone}

We have used an algorithm to detect similar clusters to be
C11 matches C21
C12 matches both C21 and C22
Our problem to find similar words between clusters via GPU using Levenstein.

Required tasks:

  1. How to pass all of the tokens of the clusters of O1 and O2 as strings from Jcuda to cuda.
  2. In your opinion, how to discriminate words inside GPU. To return from GPU with result that for instance, Bone Organ inside O1 is similar to Bone inside O2.
    My solution is to make a vector for the number of a cluster and another for the number of word inside the cluster.

So, sent from the host. a vector of strings (tokens), a vector of cluster number and a vector of word number inside the cluster. All of these once for O1 and another for O2.

What is your opinion, please?

Again, it’s difficult to give specific recommendations here, without really knowing the tasks and goals in all detail.

But if I understood you correctly, you basically have a kernel roughly like this:

__global__ void levenshteinDistance(
    char *wordA, int wordLengthA, 
    char *wordB, int wordLengthB, float *result) { ... }

that computes the distance between words. (If so: I wonder how this is implemented. This already is fairly non-trivial…)

But it’s not clear what your input data is, for the overall algorithm, on the host side. Some guesses: If you have something like

String C11[][] = new String[][] {
    {"Bone Organ", "Short"} , 
    {"Flat Bone" , "Nasal Bone", "Scapula"} 

then it might already be pretty difficult to convert this into a form that can sensibly used in CUDA. There is no such thing like a String in C/CUDA, but only char*. Additionally, GPUs don’t like memory indirections, and handling 2D arrays in CUDA is cumbersome (so squeezing your data into a single, 1D char* array would be best, but then, it’s very likely that the conversion and data transfer will eat up all performance gains that you might get from CUDA…)

O.K. I’ll do the tokenization on host. So, I need to pass a data structure from host to device as follows
{ Bone, Organ, Short, Flat, Bone, Nasal Bone,Scapula}
from Jcuda to cuda, give me the code. please.
I am so sorry for inconvenience.

There are several options for this:

  1. You could convert the data into a 2D array that is then passed to the GPU - basically a byte[][] on Java side, which is a char** on C side. But 2D arrays are very inconvenient to handle with CUDA, and not very efficient (you’d have to copy each word individually)

  2. You could convert the data into a 1D-representation of a 2D array, with a fixed word length.

    Nasal Bone

    This would basically mean that you store the data in an 1D (!) array that has the size of (numberOfWords)x(lengthOfLongestWord), and additionally, store the length of each word in an int[] array.

  3. You could store the data in a 1D array, and an int[] array storing the length of each word (or the start index of each word, equivalently)

    BoneOrganShortNasal Bone
       4|   5|   5|       10|

The decision might eventually affect the performance (and maybe even significantly). It might be that storing the data in an n*16 or n*32 array (as in version 2.) has some advantage due to the more regular memory access patterns. But I don’t have enough experience here to give a recommendation.

If it is only about the code, I can probably post a short snippet for either of the approaches.

A snapshot of the levenstein distance kernel I made for the project

__global__ void ComputationgldOnGPU(
    char *str, char *patternRemoved, char *pattern, 
    int *dX, int *EDPrevious, int *EDLast, 
    const int nx, const int ny, const int nz) {

I compare between a string and a pattern (They are two tokens). I work on the characters of the two tokens

I need to pass an array of tokens with different lengths from Jcuda to cuda.Since the ontology is long, we will pass the ontology as partitions. A set of clusters has to be sent at one time . How can be the code of passing a set of tokens of different sizes from host to device? Give me the code, please on the host of Jcuda.

No offense, but my current impression is that you have the task to ~„do something on the GPU, because GPUs are fast“, but you don’t have an idea about how to do it, and how to do it in a way so that you even have the chance of seeing a speedup.

However, here is an example that copies an array of strings from Java to the GPU. The kernel trivially copies these to previously allocated target memory, which is then copied back to Java, and the copies are printed:

package jcuda.driver.test;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

public class JCudaStringsSample
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void copy(int n, char **src, char **dst, int *lengths)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    if (i<n)" + "\n" +
        "    {" + "\n" +
        "        for (int j=0; j<lengths[i]; j++) dst[i][j] = src[i][j];" + "\n" +
        "    }" + "\n" +
        "}" + "\n";
    public static void main(String[] args)
        // Default initialization
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        nvrtcProgram program = new nvrtcProgram();
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, ptx[0]);
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "copy");

        // Define input data and copy it to the host
        List<String> srcStrings = Arrays.asList(
            "This", "will", "be", "horribly", "inefficient");
        int n = srcStrings.size();
        byte srcHost[][] = new byte[n][];
        for (int i=0; i<n; i++)
            srcHost[i] = srcStrings.get(i).getBytes();
        CUdeviceptr srcPointers[] = new CUdeviceptr[n];
        int lengthsHost[] = new int[n];
        for (int i=0; i<n; i++)
            CUdeviceptr srcPointer = new CUdeviceptr();
            int length = srcStrings.get(i).length();
            lengthsHost[i] = length;
            cuMemAlloc(srcPointer, length * Sizeof.BYTE);
            cuMemcpyHtoD(srcPointer,[i]), length * Sizeof.BYTE);
            srcPointers[i] = srcPointer;
        CUdeviceptr srcDevice = new CUdeviceptr();
        cuMemAlloc(srcDevice, n * Sizeof.POINTER);
        cuMemcpyHtoD(srcDevice,, n * Sizeof.POINTER);

        CUdeviceptr lengthsDevice = new CUdeviceptr();
        cuMemAlloc(lengthsDevice, n * Sizeof.INT);
        cuMemcpyHtoD(lengthsDevice,, n * Sizeof.INT);

        // Define output data 
        CUdeviceptr dstPointers[] = new CUdeviceptr[n];
        for (int i=0; i<n; i++)
            CUdeviceptr dstPointer = new CUdeviceptr();
            cuMemAlloc(dstPointer, lengthsHost[i] * Sizeof.BYTE);
            dstPointers[i] = dstPointer;
        CUdeviceptr dstDevice = new CUdeviceptr();
        cuMemAlloc(dstDevice, n * Sizeof.POINTER);
        cuMemcpyHtoD(dstDevice,, n * Sizeof.POINTER);
        // Run the kernel
        Pointer kernelParameters =
        int blockSizeX = 256;
        int gridSizeX = (n + blockSizeX - 1) / blockSizeX;
            gridSizeX,  1, 1,
            blockSizeX, 1, 1,
            0, null,
            kernelParameters, null

        // Copy output data to the host
        byte dstHost[][] = new byte[n][];
        for (int i=0; i<n; i++)
            dstHost[i] = new byte[lengthsHost[i]];
        for (int i=0; i<n; i++)
            CUdeviceptr dstPointer = dstPointers[i];
                dstPointer, lengthsHost[i] * Sizeof.BYTE);
        // Print the result
        for (int i=0; i<n; i++)
            System.out.println("Result " + i + " is " + new String(dstHost[i]));
        // Clean up omitted here

The output is

Result 0 is This
Result 1 is will
Result 2 is be
Result 3 is horribly
Result 4 is inefficient

and there is a message hidden in there, maybe you can decipher it.

Appropriate websearches will bring some answers that essentially say what I already told you, an we’re not even going into the details of things like managed memory here…

Thanks a lot Mr.Marco. The code you sent is a starting point for my implementation, but from your point of view, the two ontologies, I use, contains thousands or millions of words that have to be compared. If I transfer two words of the compared ontologies each time from CPU to GPU. This will take a long time. I need to transfer a bulk of words, How to do this programmatically in your opinion?

As I already tried to ask earlier: It is important to know

  • How the words are stored on the CPU/Java side
  • What you are doing with the words on the CPU/Java side
  • What you are going to do on the GPU/CUDA side
  • What the result of your computations are on the GPU/CUDA side

For example, if you wanted to only do your computations in Java, then you’d simply use a List<String> most of the time. Maybe you’d also use a Map<String, Integer> that stores (for each word) how often it appears. You might also have something like a Map<Pair<String>, Integer> that stores co-occurances (i.e. how often each pair of words appears together in some texts or ontologies).

But there is no String in CUDA. And there is no List. And no Map. And no Pair. So you have to bring the data into a form that can be used in CUDA. And how exactly you do that depends on the computations that you want to do.

Some computations would be trivial in Java. For example, if you have a List<String> words, and want to compute a word count, then you can just write

Map<String, Integer> count =
    .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum);

and you’re done. One line of code. Doing that in CUDA would require hundreds of lines of complicated code, and it is very unlikely that the CUDA implementation could be faster than the Java implementation, unless you’re a CUDA expert and find exactly the right way to leverage the GPU for this task.

Broadly speaking, you’d have to

  • convert the List<String> into a char *words, int *wordLengths,
  • copy the whole data from the CPU to the GPU
  • compute the word counts (which is not trivial, because you have thousands of threads, and you’d need some synchronization, maybe with atomic variables and various __syncthreads, possibly eating up all performance gains)
  • copy the results back from the GPU to the CPU

Dealing with structures like ontologies might be even more complicated, depending on what exactly you want to do.

The example that I posted earlier shows how to copy an array/list of words with different lengths from the GPU to the GPU. But that’s inefficient. A more efficient solution would be to convert the data into a „flat“ structure, like a char *words, int *wordLengths (and you have to write the CUDA kernel code accordingly). Even more efficient approaches could be based on Unified Memory (as in, or Mapped Memory (as in - adjusted for the CUDA Driver API). (And for a comparison of the different memory transfer methods, have a look at

All this assumes that you can find a sensible representation of your data that is „sensible“ for the CPU and the GPU, considering the actual computations that you have to do on this data on the GPU.

(And maybe I should emphasize again that I cannot give really profound advice here: I have created the CUDA bindings, but I’m not a CUDA expert - meaning that I have not yet written many „larger, complex“ CUDA applications, and can not always say which approach would be the „best“ for a certain problem).