How to pass an array of multidimensional rows and two columns

I am a PhD student. I compare two ontologies O1 and O2 using GPU capability. I need to pass a collection of strings and integers in one array. For example,
O1 has two elements: TechnicalStaff, firstName.
O2 has two elements: AssociateProf, SurName.
I have to identify an id related to each ontology element in order to know which correspondences are to be compared. I need to pass an array from host to device as follows. First, I have to divide the whole names into tokens, and then pass

arr1 = { (Technical,1),(Staff,1),(first,2),(Name,2) }
arr2 = { (Associate,1),(Prof,1),(Sur,2),(Name,2) }

You did not provide many (technical) details. So the following answer may be a bit vague. The „best“ solution will depend on many factors that you’ll probably have to figure out by reading lots of related work, and by some experiments.


In general, if you want to move Strings from Java to the GPU, this will raise some questions. A String in Java is (basically) a char[] array. But a char in Java is different from a char in C/CUDA. In Java, a char has 16 bits, whereas in C, a char has 8 bits. So if you have to move strings from Java to CUDA, you’d have to convert it into a byte[] array, with String#getBytes.
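
As a small illustration of the 16-bit-char vs. 8-bit-byte point (plain Java, only for demonstration):

```java
import java.nio.charset.StandardCharsets;

public class CharVsByte
{
    public static void main(String[] args)
    {
        String word = "Bone";

        // A Java char[] has 16-bit elements - not what CUDA expects for a char*
        char chars[] = word.toCharArray();

        // For plain ASCII data, getBytes with US_ASCII yields one
        // 8-bit byte per character, matching a C char
        byte bytes[] = word.getBytes(StandardCharsets.US_ASCII);

        System.out.println(chars.length + " chars, " + bytes.length + " bytes");
    }
}
```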

But there is another difficulty: The strings will most likely have different lengths. So it would be really hard to find a sensible data structure in order to pass this data to CUDA.


Another, very general hint is that for GPU programming, you should usually prefer a „Structure Of Arrays“ instead of an „Array Of Structures“. When you have something like

class Point2D {
    float x;
    float y;
}
List<Point2D> points = ...;

and want to pass these points to the GPU, then you could try to pass it to the GPU as a float[n][2] array (i.e. an array of 2-element arrays). But for the GPU, it is usually much better (and actually easier) to instead pass it as two arrays, like

float xCoordinates[] = extractAllXCoordinates(points);
float yCoordinates[] = extractAllYCoordinates(points);
doSomethingOnGpu(xCoordinates, yCoordinates);

Some more information (and a similar example) can be found at https://en.wikipedia.org/wiki/AoS_and_SoA
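
A minimal host-side sketch of such an extraction (the Point2D class and method names mirror the pseudocode above; this is only illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class SoAExample
{
    static class Point2D
    {
        float x;
        float y;
        Point2D(float x, float y) { this.x = x; this.y = y; }
    }

    // Extract all x-coordinates into one flat array (Structure Of Arrays)
    static float[] extractAllXCoordinates(List<Point2D> points)
    {
        float result[] = new float[points.size()];
        for (int i = 0; i < points.size(); i++)
        {
            result[i] = points.get(i).x;
        }
        return result;
    }

    static float[] extractAllYCoordinates(List<Point2D> points)
    {
        float result[] = new float[points.size()];
        for (int i = 0; i < points.size(); i++)
        {
            result[i] = points.get(i).y;
        }
        return result;
    }

    public static void main(String[] args)
    {
        List<Point2D> points = Arrays.asList(
            new Point2D(1.0f, 2.0f), new Point2D(3.0f, 4.0f));
        System.out.println(Arrays.toString(extractAllXCoordinates(points))); // [1.0, 3.0]
        System.out.println(Arrays.toString(extractAllYCoordinates(points))); // [2.0, 4.0]
    }
}
```

The resulting float[] arrays are flat and contiguous, which is exactly the form that can be copied to the GPU with a single memcpy each.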


Again, the description of the problem lacks some technical details. But it sounds like you might consider replacing the Strings with integers, if this suits your problem. Very roughly speaking, and with many guesses about what you want to achieve:

When you have something like

class Entry {
    String word;
    int count;
}
List<Entry> arrayA = ...;
List<Entry> arrayB = ...;

where these arrays should be compared, then you could try to bring them into a shape that is well-suited for the GPU. This could, for example, be done like that (Pseudocode!):

// Create a map from words to integer values
Map<String, Integer> wordMap = ...;
for (String word : wordList) {
    wordMap.put(word, wordMap.size());
}

// Convert the list of entries from the form
//   { { "Foo", 123 },
//     { "Bar",  45 },
//     { "Zap",  78 } }
// into two arrays
//  words = [ 1, 0, 2 ] // The integer IDs assigned to the words in the wordMap
//  counts = [ 123, 45, 78 ]

int wordsA[] = new int[N];
int countsA[] = new int[N];
int i = 0;
for (Entry entry : arrayA) {
    wordsA[i] = wordMap.get(entry.word);
    countsA[i] = entry.count;
    i++;
}

// (The same for the other entries)

// Finally:
compareOnGpu(wordsA, countsA, wordsB, countsB);
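
To make the pseudocode above concrete, here is a self-contained sketch of the conversion (the words and counts are made up; the IDs are assigned in order of appearance):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordsToIndices
{
    public static void main(String[] args)
    {
        List<String> entryWords = Arrays.asList("Foo", "Bar", "Zap");
        int entryCounts[] = { 123, 45, 78 };

        // Assign each distinct word a unique integer ID
        Map<String, Integer> wordMap = new LinkedHashMap<String, Integer>();
        for (String word : entryWords)
        {
            if (!wordMap.containsKey(word))
            {
                wordMap.put(word, wordMap.size());
            }
        }

        // Build the flat int[] arrays that could be passed to the GPU
        int words[] = new int[entryWords.size()];
        int counts[] = new int[entryWords.size()];
        for (int i = 0; i < entryWords.size(); i++)
        {
            words[i] = wordMap.get(entryWords.get(i));
            counts[i] = entryCounts[i];
        }
        System.out.println(Arrays.toString(words));  // [0, 1, 2]
        System.out.println(Arrays.toString(counts)); // [123, 45, 78]
    }
}
```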

Once more, the disclaimer: It’s not clear how your input data is structured, what you want to compute, and how you return the result for the computation back from the GPU to Java. But these are some general hints, maybe they are applicable in a similar form to your problem.

Thanks a lot, Mr. Marco.

In order not to mislead you: I have made a kernel that accepts two tokens (two sets of characters) and computes the Levenshtein distance based on warp shuffle instructions.

All I need is to pass a vector of strings from JCuda to CUDA, knowing that these strings belong to different clusters and different words. Each cluster contains a set of words.

For example, we divide O1 into two clusters,
C11 = {Bone Organ, Short}, C12 = {Flat Bone, Nasal Bone, Scapula}
and divide O2 into two clusters,
C21 = {Bone, Skull, Phalanx}, C22 = {Flat Bone, Nasal Bone}

We have used an algorithm to detect similar clusters, so that
C11 matches C21
C12 matches both C21 and C22
Our problem is to find similar words between clusters via GPU using the Levenshtein distance.

Required tasks:

  1. How to pass all of the tokens of the clusters of O1 and O2 as strings from JCuda to CUDA.
  2. In your opinion, how to discriminate the words inside the GPU, so as to return from the GPU with a result that, for instance, Bone Organ inside O1 is similar to Bone inside O2.
    My solution is to make one vector for the cluster number and another for the number of the word inside the cluster.

So, I would send from the host: a vector of strings (tokens), a vector of cluster numbers, and a vector of word numbers inside the cluster - all of these once for O1 and once for O2.

What is your opinion, please?

Again, it’s difficult to give specific recommendations here, without really knowing the tasks and goals in all detail.

But if I understood you correctly, you basically have a kernel roughly like this:

__global__ void levenshteinDistance(
    char *wordA, int wordLengthA, 
    char *wordB, int wordLengthB, float *result) { ... }

that computes the distance between words. (If so: I wonder how this is implemented. This already is fairly non-trivial…)

But it’s not clear what your input data is, for the overall algorithm, on the host side. Some guesses: If you have something like

String C11[][] = new String[][] {
    {"Bone Organ", "Short"} , 
    {"Flat Bone" , "Nasal Bone", "Scapula"} 
};

then it might already be pretty difficult to convert this into a form that can sensibly be used in CUDA. There is no such thing as a String in C/CUDA, but only char*. Additionally, GPUs don’t like memory indirections, and handling 2D arrays in CUDA is cumbersome (so squeezing your data into a single, 1D char* array would be best - but then, it’s very likely that the conversion and data transfer will eat up all performance gains that you might get from CUDA…)

O.K., I’ll do the tokenization on the host. So, I need to pass a data structure from host to device as follows
{ Bone, Organ, Short, Flat, Bone, Nasal Bone, Scapula }
from JCuda to CUDA. Give me the code, please.
I am so sorry for the inconvenience.

There are several options for this:

  1. You could convert the data into a 2D array that is then passed to the GPU - basically a byte[][] on the Java side, which is a char** on the C side. But 2D arrays are very inconvenient to handle with CUDA, and not very efficient (you’d have to copy each word individually)

  2. You could convert the data into a 1D-representation of a 2D array, with a fixed word length.

    Bone......
    Organ.....
    Short.....
    Nasal Bone
    

    This would basically mean that you store the data in a 1D (!) array that has the size of (numberOfWords)x(lengthOfLongestWord), and additionally store the length of each word in an int[] array.

  3. You could store the data in a 1D array, and an int[] array storing the length of each word (or the start index of each word, equivalently)

    BoneOrganShortNasal Bone
       4|   5|   5|       10|
    

The decision might eventually affect the performance (and maybe even significantly). It might be that storing the data in an n*16 or n*32 array (as in version 2.) has some advantage due to the more regular memory access patterns. But I don’t have enough experience here to give a recommendation.

If it is only about the code, I can probably post a short snippet for either of the approaches.
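
For example, a host-side sketch of approach 3. could look like this (assuming plain ASCII tokens; the names are only illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class FlattenWords
{
    public static void main(String[] args)
    {
        List<String> words = Arrays.asList("Bone", "Organ", "Short", "Nasal Bone");
        int n = words.size();

        // One flat byte array containing all words, plus the start
        // index and length of each word
        int startIndices[] = new int[n];
        int lengths[] = new int[n];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++)
        {
            startIndices[i] = sb.length();
            lengths[i] = words.get(i).length();
            sb.append(words.get(i));
        }
        byte flat[] = sb.toString().getBytes(StandardCharsets.US_ASCII);

        System.out.println(new String(flat, StandardCharsets.US_ASCII));
        System.out.println(Arrays.toString(startIndices)); // [0, 4, 9, 14]
        System.out.println(Arrays.toString(lengths));      // [4, 5, 5, 10]
    }
}
```

The flat byte[] and the two int[] arrays could then each be copied to the device with a single cuMemAlloc/cuMemcpyHtoD pair.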

A snapshot of the Levenshtein distance kernel I made for the project:

__global__ void ComputationgldOnGPU(
    char *str, char *patternRemoved, char *pattern, 
    int *dX, int *EDPrevious, int *EDLast, 
    const int nx, const int ny, const int nz) {
    ...
}

I compare a string and a pattern (they are two tokens). I work on the characters of the two tokens.

I need to pass an array of tokens with different lengths from JCuda to CUDA. Since the ontology is long, we will pass the ontology as partitions: a set of clusters has to be sent at one time. What would the code for passing a set of tokens of different sizes from host to device look like? Please give me the host-side JCuda code.

No offense, but my current impression is that you have the task to ~„do something on the GPU, because GPUs are fast“, but you don’t have an idea about how to do it, and how to do it in a way so that you even have the chance of seeing a speedup.

However, here is an example that copies an array of strings from Java to the GPU. The kernel trivially copies these to previously allocated target memory, which is then copied back to Java, and the copies are printed:

package jcuda.driver.test;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

/**
 */
public class JCudaStringsSample
{
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void copy(int n, char **src, char **dst, int *lengths)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    if (i<n)" + "\n" +
        "    {" + "\n" +
        "        for (int j=0; j<lengths[i]; j++) dst[i][j] = src[i][j];" + "\n" +
        "    }" + "\n" +
        "}" + "\n";
    
    public static void main(String[] args)
    {
        // Default initialization
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, ptx[0]);
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "copy");

        // Define input data and copy it to the device
        List<String> srcStrings = Arrays.asList(
            "This", "will", "be", "horribly", "inefficient");
        int n = srcStrings.size();
        byte srcHost[][] = new byte[n][];
        for (int i=0; i<n; i++)
        {
            srcHost[i] = srcStrings.get(i).getBytes();
        }
        CUdeviceptr srcPointers[] = new CUdeviceptr[n];
        int lengthsHost[] = new int[n];
        for (int i=0; i<n; i++)
        {
            CUdeviceptr srcPointer = new CUdeviceptr();
            int length = srcHost[i].length;
            lengthsHost[i] = length;
            cuMemAlloc(srcPointer, length * Sizeof.BYTE);
            cuMemcpyHtoD(srcPointer, Pointer.to(srcHost[i]), length * Sizeof.BYTE);
            srcPointers[i] = srcPointer;
        }
        CUdeviceptr srcDevice = new CUdeviceptr();
        cuMemAlloc(srcDevice, n * Sizeof.POINTER);
        cuMemcpyHtoD(srcDevice, Pointer.to(srcPointers), n * Sizeof.POINTER);

        CUdeviceptr lengthsDevice = new CUdeviceptr();
        cuMemAlloc(lengthsDevice, n * Sizeof.INT);
        cuMemcpyHtoD(lengthsDevice, Pointer.to(lengthsHost), n * Sizeof.INT);

        // Define output data 
        CUdeviceptr dstPointers[] = new CUdeviceptr[n];
        for (int i=0; i<n; i++)
        {
            CUdeviceptr dstPointer = new CUdeviceptr();
            cuMemAlloc(dstPointer, lengthsHost[i] * Sizeof.BYTE);
            dstPointers[i] = dstPointer;
        }
        CUdeviceptr dstDevice = new CUdeviceptr();
        cuMemAlloc(dstDevice, n * Sizeof.POINTER);
        cuMemcpyHtoD(dstDevice, Pointer.to(dstPointers), n * Sizeof.POINTER);
        
        // Run the kernel
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{n}),
            Pointer.to(srcDevice),
            Pointer.to(dstDevice),
            Pointer.to(lengthsDevice)
        );
        int blockSizeX = 256;
        int gridSizeX = (n + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX,  1, 1,
            blockSizeX, 1, 1,
            0, null,
            kernelParameters, null
        );
        cuCtxSynchronize();

        // Copy output data to the host
        byte dstHost[][] = new byte[n][];
        for (int i=0; i<n; i++)
        {
            dstHost[i] = new byte[lengthsHost[i]];
        }
        for (int i=0; i<n; i++)
        {
            CUdeviceptr dstPointer = dstPointers[i];
            cuMemcpyDtoH(Pointer.to(dstHost[i]), 
                dstPointer, lengthsHost[i] * Sizeof.BYTE);
        }
        
        // Print the result
        cuCtxSynchronize();
        for (int i=0; i<n; i++)
        {
            System.out.println("Result " + i + " is " + new String(dstHost[i]));
        }
        
        // Clean up omitted here
    }
    
   
}

The output is

Result 0 is This
Result 1 is will
Result 2 is be
Result 3 is horribly
Result 4 is inefficient

and there is a message hidden in there, maybe you can decipher it.

Appropriate web searches will bring up some answers that essentially say what I already told you, and we’re not even going into the details of things like managed memory here…

Thanks a lot, Mr. Marco. The code you sent is a starting point for my implementation. But consider this: the two ontologies I use contain thousands or millions of words that have to be compared. If I transfer two words of the compared ontologies at a time from CPU to GPU, this will take a long time. I need to transfer a bulk of words. How can I do this programmatically, in your opinion?

As I already tried to ask earlier: It is important to know

  • How the words are stored on the CPU/Java side
  • What you are doing with the words on the CPU/Java side
  • What you are going to do on the GPU/CUDA side
  • What the results of your computations are on the GPU/CUDA side

For example, if you wanted to only do your computations in Java, then you’d simply use a List<String> most of the time. Maybe you’d also use a Map<String, Integer> that stores (for each word) how often it appears. You might also have something like a Map<Pair<String>, Integer> that stores co-occurrences (i.e. how often each pair of words appears together in some texts or ontologies).

But there is no String in CUDA. And there is no List. And no Map. And no Pair. So you have to bring the data into a form that can be used in CUDA. And how exactly you do that depends on the computations that you want to do.

Some computations would be trivial in Java. For example, if you have a List<String> words, and want to compute a word count, then you can just write

Map<String, Integer> count = words.stream()
    .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));

and you’re done. One line of code. Doing that in CUDA would require hundreds of lines of complicated code, and it is very unlikely that the CUDA implementation could be faster than the Java implementation, unless you’re a CUDA expert and find exactly the right way to leverage the GPU for this task.

Broadly speaking, you’d have to

  • convert the List<String> into a char *words, int *wordLengths,
  • copy the whole data from the CPU to the GPU
  • compute the word counts (which is not trivial, because you have thousands of threads, and you’d need some synchronization, maybe with atomic variables and various __syncthreads, possibly eating up all performance gains)
  • copy the results back from the GPU to the CPU

Dealing with structures like ontologies might be even more complicated, depending on what exactly you want to do.

The example that I posted earlier shows how to copy an array/list of words with different lengths from the CPU to the GPU. But that’s inefficient. A more efficient solution would be to convert the data into a „flat“ structure, like a char *words, int *wordLengths (and you have to write the CUDA kernel code accordingly). Even more efficient approaches could be based on Unified Memory (as in JCudaDriverUnifiedMemory.java), or Mapped Memory (as in JCudaRuntimeMappedMemory.java - adjusted for the CUDA Driver API). (And for a comparison of the different memory transfer methods, have a look at JCudaRuntimeMemoryBandwidths.java).

All this assumes that you can find a sensible representation of your data that is „sensible“ for the CPU and the GPU, considering the actual computations that you have to do on this data on the GPU.

(And maybe I should emphasize again that I cannot give really profound advice here: I have created the CUDA bindings, but I’m not a CUDA expert - meaning that I have not yet written many „larger, complex“ CUDA applications, and can not always say which approach would be the „best“ for a certain problem).

Dear Mr. Marco,

How are you? I have come up with an idea for the ontology matching problem. This is the main code:

public static void main(String[] args) throws IOException
{
    JCudaDriver.setExceptionsEnabled(true);

    // Create the PTX file by calling the NVCC
    String ptxFileName = preparePtxFile("D:\\NetBeanProjects\\JcudaStringsSample\\src\\jcudastringssample\\programSourceCodeKernel.cu");

    // Initialize the driver and create a context for the first device
    cuInit(0);
    CUdevice device = new CUdevice();
    cuDeviceGet(device, 0);
    CUcontext context = new CUcontext();
    cuCtxCreate(context, 0, device);

    // Load the PTX file
    CUmodule module = new CUmodule();
    cuModuleLoad(module, ptxFileName);

    // Obtain a function pointer to the "copy" function
    CUfunction function = new CUfunction();
    cuModuleGetFunction(function, module, "copy");

    List<String> srcStrings = Arrays.asList(
        "This", "will", "be", "horribly", "inefficient");
    int n = srcStrings.size();
    String result = String.join(",", srcStrings);
    byte[] byteArr = result.getBytes();
    // print the byte[] elements
    //System.out.println("String to byte array: " + Arrays.toString(byteArr));

    CUdeviceptr srcPointer = new CUdeviceptr();
    int length = result.length();
    //System.out.println("length " + length);
    cuMemAlloc(srcPointer, length * Sizeof.BYTE);
    cuMemcpyHtoD(srcPointer, Pointer.to(byteArr), length * Sizeof.BYTE);
}

In order to pass all tokens of the string to the GPU at once, I have made srcStrings = {This,will,be,horribly,inefficient} on the host.

I need to make a kernel that cuts the tokens apart again inside the GPU and pastes them into dstStrings. Can you help me with the threads?

int idx = blockIdx.x * blockDim.x + threadIdx.x;

If a thread finds a , (comma), it cuts all characters before it into one token and then pastes them into dstStrings using

strcpy(dstStrings , "This,");
strcat(dstStrings , "will,");
strcat(dstStrings , "be,");
strcat(dstStrings , "horribly,");
strcat(dstStrings , "inefficient");

The cutting runs concurrently, and the concatenation also runs concurrently, using threads. I have thought deeply about the code, but I fail.

Can you help me, please?

It’s really difficult. It would be a pity if you spent your (and my) time for ~„just doing something on the GPU with CUDA, because that’s ‚research‘ and you can get a PhD for that“, without making sure that the result makes any sense whatsoever.

I tried to explain extensively that you really have to think carefully about how you want to structure your data, how to send it to the GPU, and what you are actually going to do with that. I showed implementation options and tried to explain the possible pros and cons, including the crucial question about the string lengths.

Now, to put it that way: Trying to tokenize a string on the GPU by searching for the "," comma does not make sense, at all. You have to be aware of the fact that the kernel is executed by 1000s of threads, in parallel, at the same time. Searching for a "," is nothing that you can sensibly do there. That’s simply not what CUDA is intended for. Besides, you already have this information on Java side, so you can just pass it to the GPU as it is.

Here is another example:

It takes the List<String> as the input, and the kernel writes the last letter of each of these strings into the output memory.

To accomplish that, the program joins the Strings into a single string, which is then passed to the GPU. It also stores the end indices of the individual strings inside the combined string as an int[] array, and also passes this to the GPU. The „end indices“ are basically the accumulated string lengths. This is *only* intended for this example: Yes, this does not make sense in general. No, you should not just copy and paste this and try to solve your task with that. You should think about whether you need…

  1. The string lengths
  2. The string start indices
  3. The string end indices
  4. The accumulated start indices
  5. The accumulated end indices
  6. Some combination of these

and you should think about this carefully, depending on what you want to do with these strings in the kernel, and then try to implement it.
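
For reference, a small sketch showing how the lengths, start indices, and (accumulated) end indices could be computed on the host side (plain Java, only illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class StringIndexVariants
{
    public static void main(String[] args)
    {
        List<String> srcStrings = Arrays.asList(
            "This", "will", "be", "horribly", "inefficient");
        int n = srcStrings.size();

        int lengths[] = new int[n];      // plain string lengths
        int startIndices[] = new int[n]; // start of each string in the joined string
        int endIndices[] = new int[n];   // (exclusive) end of each string
        int offset = 0;
        for (int i = 0; i < n; i++)
        {
            lengths[i] = srcStrings.get(i).length();
            startIndices[i] = offset;
            offset += lengths[i];
            endIndices[i] = offset;
        }
        System.out.println(Arrays.toString(lengths));      // [4, 4, 2, 8, 11]
        System.out.println(Arrays.toString(startIndices)); // [0, 4, 8, 10, 18]
        System.out.println(Arrays.toString(endIndices));   // [4, 8, 10, 18, 29]
    }
}
```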

Maybe the example is helpful in any way, nevertheless:

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import java.util.Arrays;
import java.util.List;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

/**
 */
public class JCudaStringsWithLengths
{
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void copy(int numStrings, char *srcStringsJoined, int *srcStringEndIndices, char *lastLetters)" + "\n" +
        "{" + "\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;" + "\n" +
        "    if (i<numStrings)" + "\n" +
        "    {" + "\n" +
        "        int endIndex = srcStringEndIndices[i];" + "\n" +
        "        char lastLetter = srcStringsJoined[endIndex-1];" + "\n" +
        "        lastLetters[i] = lastLetter;" + "\n" +
        "    }" + "\n" +
        "}" + "\n";
    
    public static void main(String[] args)
    {
        // Default initialization
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, ptx[0]);
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "copy");

        // Define input data and copy it to the device:
        
        // The input is a list of strings. It will be concatenated
        // into ONE string:
        List<String> srcStrings = Arrays.asList(
            "This", "will", "be", "horribly", "inefficient");
        int numStrings = srcStrings.size();
        String srcStringsJoined = String.join("", srcStrings);
        byte[] srcStringsJoinedData = srcStringsJoined.getBytes();
        int totalLength = srcStringsJoinedData.length;

        // The endIndex of each string is stored in an array: 
        int srcStringEndIndices[] = new int[srcStrings.size()];
        int endIndex = 0;
        for (int i=0; i<numStrings; i++)
        {
            endIndex += srcStrings.get(i).length();
            srcStringEndIndices[i] = endIndex;
        }
        
        // Copy the string data to the device
        CUdeviceptr srcStringsJoinedPointer = new CUdeviceptr();
        cuMemAlloc(srcStringsJoinedPointer, totalLength * Sizeof.BYTE);
        cuMemcpyHtoD(srcStringsJoinedPointer, 
            Pointer.to(srcStringsJoinedData), totalLength * Sizeof.BYTE);

        // Copy the string end indices to the device
        CUdeviceptr srcStringEndIndicesPointer = new CUdeviceptr();
        cuMemAlloc(srcStringEndIndicesPointer, numStrings * Sizeof.INT);
        cuMemcpyHtoD(srcStringEndIndicesPointer, 
            Pointer.to(srcStringEndIndices), numStrings * Sizeof.INT);
        
        // Allocate output memory
        CUdeviceptr dstDevice = new CUdeviceptr();
        cuMemAlloc(dstDevice, numStrings * Sizeof.BYTE);
        
        // Run the kernel
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{numStrings}),
            Pointer.to(srcStringsJoinedPointer),
            Pointer.to(srcStringEndIndicesPointer),
            Pointer.to(dstDevice)
        );
        int blockSizeX = 256;
        int gridSizeX = (numStrings + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX,  1, 1,
            blockSizeX, 1, 1,
            0, null,
            kernelParameters, null
        );
        cuCtxSynchronize();

        // Copy output data to the host
        byte lastLetters[] = new byte[numStrings];
        cuMemcpyDtoH(Pointer.to(lastLetters), dstDevice, 
            numStrings * Sizeof.BYTE);
        
        // Print the result
        cuCtxSynchronize();
        System.out.println("Last letters: " + new String(lastLetters));
        
        // Clean up omitted here
    }
}

Thanks a lot for your help. I have already made code for the Levenshtein distance on the GPU for two tokens. All I need is to pass all tokens as one or a few strings and then split them inside the GPU to be compared using Levenshtein. I am sorry if I waste your time, but I learn from your thoughts and coding.

Again, it will almost certainly not make sense to try to split the Strings on the GPU. You already have the necessary information (namely the string lengths) on the CPU side. The GPU will not automatically be faster at anything than the CPU, because the GPU is only fast for very specific operations.

I wrote more about this in this Stack Overflow answer.

However, if you have a simple kernel/class/project that should only compute the levenshtein distance, and there is a specific problem, you could try to

  1. post this here, with code that is complete and can be compiled and tested as far as possible
  2. (if you don’t want to share your code publicly) : Send me an e-Mail with a ZIP containing the required files, under jcuda@jcuda.org - maybe I can have a look

(And at the risk of demotivating you: I’m going out on a limb and say that I’m reasonably sure that one could write a program (in a fraction of the time) that does the same, without CUDA, in plain Java, and is faster than the CUDA version - but that’s a guess. There is some learning involved in any case, so performance and simplicity might not be the most important thing about your task right now…)
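
To illustrate that last point: a plain-Java Levenshtein distance (the standard dynamic-programming formulation with two rolling rows - not the warp-shuffle version discussed here) only takes a few lines:

```java
public class Levenshtein
{
    // Classic dynamic-programming Levenshtein distance,
    // keeping only two rows of the DP matrix
    static int distance(String a, String b)
    {
        int prev[] = new int[b.length() + 1];
        int curr[] = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++)
        {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++)
            {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(
                    curr[j - 1] + 1,     // insertion
                    prev[j] + 1),        // deletion
                    prev[j - 1] + cost); // substitution
            }
            int tmp[] = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args)
    {
        System.out.println(distance("kitten", "sitting")); // 3
        System.out.println(distance("Bone", "Bone"));      // 0
    }
}
```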

Thanks for the cooperation. I will complete the code with the benefit of your simple examples. I will send you the code to your private email.

Dear Mr. Marco,

I have measured kernel times using events in C. What is the equivalent way to measure kernel time in Java, resembling this:
cudaEvent_t begin, stop;
cudaEventCreate(&begin);
cudaEventCreate(&stop);
cudaEventRecord(begin);
sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, begin, stop);
printf("sumMatrixOnGPU2D <<<(%d,%d), (%d,%d)>>> elapsed %f ms\n",
    grid.x, grid.y, block.x, block.y, milliseconds);

An example of how events can be used can be derived from https://github.com/jcuda/jcuda-samples/blob/2e6e62d0a463a6ebca6ca230bd015f96b955f08e/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeMemoryBandwidths.java#L227 (it’s very similar to the C code, except for the usual C/Java-specific differences)

Dear Mr.Marco,

I have sent you an email to jcuda@jcuda.org. Please take a look.

I have received your mails and will try to have a closer look at them during the weekend (I cannot promise anything, but will have a look).

Mr. Marco,
I have sent a reply to your email. I only need to get past the first stage of the split.