How to pass a multidimensional array of rows and two columns

I posted this question to Marco by email:

Hi Marco,

How are you? All the examples of CUDA kernels start running the threads from index 0 up to a certain limit, such as the following:

__global__ void ComputationgldOnGPU(char *str, char *patternRemoved, char *pattern, int nx, int ny, int nz, int *dX, int *dFinal)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;

    if (ix < ny) {
        for (int i = 0; i <= nx; i++) {
            if (i == 0)
                dX[ix * (nx+1) + i] = 0;
            else {
                if (str[i-1] == patternRemoved[ix])
                    dX[ix * (nx+1) + i] = i;
                else
                    dX[ix * (nx+1) + i] = dX[ix * (nx+1) + i - 1];
            }
        }
        __syncthreads();
    }
}

ix runs from index 0 up to the size of the patternRemoved array.

My question is: if I want the code to be applied starting from thread 5 up to any limit I select, how can I do that? In parallel, I will send different ranges of threads to a device function that performs this code, called from the kernel function.
I’d like to say: if (ix between 5 and 10) then code…

Thanks

I’m not entirely sure whether I understood the question correctly, but I’ll try:

In general, the index ix that is computed with

int ix = blockIdx.x * blockDim.x + threadIdx.x;

refers to the thread. It really refers to the hardware, i.e. which of the „processing units“ on the GPU is actually performing these computations. (This is somewhat simplified, but conveys the idea.)

When you use this index to access an array, then you often do it like this:

someArray[ix] = someValue;

But of course, you don’t have to. The ix is just a value; you can do with it whatever you want (as long as it makes sense - that’s the crucial point…).
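For example, nothing stops you from restricting the work to a fixed range of indices directly, which is basically your „if (ix between 5 and 10)“. Here is a minimal sketch (the kernel name and the written value are made up, and whether such a hard-coded range makes sense depends on your data layout):

__global__ void rangeExample(char *someArray)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;

    // Only the threads with index 5..10 (inclusive) execute the body;
    // all other threads do nothing:
    if (ix >= 5 && ix <= 10) {
        someArray[ix] = 'x';
    }
}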

So when you say

I want to specify applying the code to start from thread 5 to any length I select

Then I assume that you do not really want to refer to „thread 5“. I assume that you just want to process a different part of the array. The thread index (ix) always starts at 0. So I assume that you just want to add an offset that is used when accessing the array:

__global__ void  compute(char *someArray, int offset, int arrayLength)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix + offset < arrayLength) {
        someArray[ix + offset] = someValue;
    }
}

For example, if you have an array with length 1000, and you have 10 threads, then you can call (pseudocode) :

byte someArray[] = new byte[1000];
int offset = 100;
callKernel(someArray, offset, someArray.length);

Inside the kernel the value of ix will always be 0…9 (because there are 10 threads). But because of the offset, the line

someArray[ix + offset] ...

will access the array elements 100, 101, … 109.
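Written out as a small, self-contained example in plain CUDA C (not JCuda - the kernel name and the written value 'x' are just placeholders), the whole thing could look like this:

__global__ void compute(char *someArray, int offset, int arrayLength)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix + offset < arrayLength) {
        someArray[ix + offset] = 'x'; // placeholder for someValue
    }
}

int main()
{
    const int arrayLength = 1000;
    const int offset = 100;
    char *deviceArray = NULL;
    cudaMalloc(&deviceArray, arrayLength);
    cudaMemset(deviceArray, 0, arrayLength);

    // One block with 10 threads: ix is 0..9, so with the offset,
    // the kernel writes to the elements 100..109 of the array.
    compute<<<1, 10>>>(deviceArray, offset, arrayLength);
    cudaDeviceSynchronize();

    cudaFree(deviceArray);
    return 0;
}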

Thanks a lot, Marco. I have another question to complete the kernel I am writing. If I have the following code

__device__ void levenshteinDistance(char *str, int strStart, int strLength, char *patternRemoved, int patternRemovedStart, int patternRemovedLength, int *dXIndividual, int *dXFinal)
{
    int iy = blockIdx.x * blockDim.x + threadIdx.x;
    int offset = strStart;
    if (iy+offset < patternRemovedLength) {
        for (int i = 0; i <= strLength; i++) {
            if (i == 0)
                dXIndividual[(iy+offset) * (strLength+1) + i] = 0;
            else {
                if (str[i-1] == patternRemoved[iy+offset])
                    dXIndividual[(iy+offset) * (strLength+1) + i] = i;
                else
                    dXIndividual[(iy+offset) * (strLength+1) + i] = dXIndividual[(iy+offset) * (strLength+1) + i - 1];
            }
        }
        __syncthreads();
    }
}
extern "C"
__global__ void ComputationdXOnGPU(int numStr, char *str, int *strStartIndices, int *strIndividualLengths,int numPatternRemoved, char *patternRemoved, int *patternRemovedStartIndices,int *patternRemovedIndividualLengths, int *dXFinal)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix<numStr)
    {
        for (int i=0; i<numPatternRemoved; i++) 
        {
            int strStart = strStartIndices[ix];
            int strLength = strIndividualLengths[ix];
            int patternStart = patternRemovedStartIndices[i];
            int patternRemovedLength = patternRemovedIndividualLengths[i];
            int size = (strLength+1) * patternRemovedLength;
            int dXIndividual [size];

            levenshteinDistance(str,strStart, strLength, patternRemoved, patternRemovedStart, patternRemovedLength, dXIndividual, dXFinal);
        }
    }
}
  1. Is this possible in CUDA: to define an index ix for the threads in the global function and another one in the device function? I need parallelization in both functions: parallelization to send the tokens, and parallelization to apply some function to each pair of tokens.

  2. Please review the declarations inside the global function, purely as C syntax, and how they are passed to levenshteinDistance:
    levenshteinDistance(str, strStart, strLength, patternRemoved, patternRemovedStart, patternRemovedLength, dXIndividual, dXFinal);

  3. I will adjust dXFinal to carry the final output after receiving your answers.

I once sent you a complete, compilable example (containing the .java file and the .cu file) that showed how you could compute a basic Levenshtein distance on the GPU, and you could just compile and start it.

What do you expect me to do with your kernel snippets? I cannot review this code without any context. I cannot test this code. I don’t even know for sure what the code is supposed to do.

However, once more (and frankly: annoyingly), I have to make guesses and mention stuff that should be clear if you had ever read a single tutorial about CUDA:

Is this available in cuda? to define an index ix for threads in the global function and another one in the device function.

No, this is not „available“. The result of int ix = blockIdx.x * blockDim.x + threadIdx.x; will always be the same, regardless of where it is computed. When the names are ix and iy, it apparently refers to „2D coordinates“, and of course you can compute int iy = blockIdx.y * blockDim.y + threadIdx.y; (note the .y instead of the .x), but you can do this

if and only if

you set up your kernel launch to be a 2D kernel, and lay out your memory accordingly, and think about what you are actually trying to achieve and how to achieve this. (So do not just change the x to y and ask me why it’s not working or what else you have to change).
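Just to illustrate what „setting up a 2D kernel“ involves, here is a generic, self-contained sketch (not related to your code - all names and sizes are made up): the index is computed from the .x and .y components, the memory is laid out row-major, and the launch configuration on the host side is 2D as well:

__global__ void kernel2D(float *data, int width, int height)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix < width && iy < height) {
        data[iy * width + ix] = 0.0f; // row-major 2D access
    }
}

int main()
{
    int width = 512;
    int height = 256;
    float *data = NULL;
    cudaMalloc(&data, width * height * sizeof(float));

    // A 2D launch configuration: 16x16 threads per block,
    // and enough blocks to cover the whole width*height domain:
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    kernel2D<<<grid, block>>>(data, width, height);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}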

More generally:

I need parallelization in both functions. parallelization to send the tokens, and parallelization to apply some function on each pair of tokens.

Maybe you think that it would be nice to write into your thesis „Yeah, I parallelized everything“. But it’s really exhausting to crawl along with your attempts to do so. I mean, I struggle to pull myself together and do things that I think could make sense. But here, I just feel like I am wasting my time. Try to do something that may be useful and make sense, instead of trying to „parallelize stuff for the sake of having parallelized stuff“.

I am sorry, Marco, for annoying you. I have read the book titled „Professional CUDA C Programming“ and have been applying a lot of the examples in it for one year. I feel I have only a little knowledge of CUDA programming, even after reading a lot of the book.

I know you sent me your take on Levenshtein, and I made the best use of it. However, the papers I am applying for Levenshtein diverge from the Levenshtein implementations found on websites: they parallelize the comparison of two tokens.

Thanks a lot for all your responses. Each one teaches me something new. I do not need to use a two-dimensional grid.

To simplify the question: if I do the following and use only ix,

__device__ void levenshteinDistance(Parameters)
{
    int offset = strStart;
    if (ix+offset < patternRemovedLength) {
        code......................................................
        ..............................................................
        ...............................................................
        __syncthreads();
    }
}



extern "C"
__global__ void ComputationdXOnGPU(Parameters)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix<numStr)
    {
        for (int i=0; i<numPatternRemoved; i++) 
        {
           code.............................
          .......................................
          .......................................
            levenshteinDistance(parameters); (Here, we call a device function)
        }
  }
} 

I want to know whether ix in both if statements starts at 0, inside both the global and the device function.
I mean: the outer if statement does not affect or limit the values of the inner one, right?

The computation of blockIdx.x * blockDim.x + threadIdx.x; will give the same result in both functions.

Think of blockIdx, blockDim and threadIdx as „global variables“.

__device__ void exampleA(int a) {
    int indexA = blockIdx.x * blockDim.x + threadIdx.x;
}
__device__ void exampleB(int a) {
    int indexB = blockIdx.x * blockDim.x + threadIdx.x;
}
__global__ void ComputationdXOnGPU(Parameters) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    exampleA(123);
    exampleB(234);
}

then index, indexA and indexB will always have the same value. And this value will be „the index of the thread“ that is executing these functions. When you have a „work size“ of 1000, then the kernel (and the function calls that it contains) will be called 1000 times, by different threads, all at the same time. In the first thread, index, indexA and indexB will be 0. In the thread with index 347, index, indexA and indexB will be 347.

You have to use this index to access different parts of the data. When you are talking about „words“ and „tokens“, then you have to … (and maybe I already mentioned that once or twice or … more often) think about how to lay out and access your data with all these threads.
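One common layout for many variable-length „tokens“ is to concatenate all characters into a single array and keep separate arrays with the start index and length of each token - roughly what your strStartIndices and strIndividualLengths seem to be doing. A sketch (the kernel name and the trivial per-token computation are made up):

__global__ void processTokens(const char *chars, const int *starts,
    const int *lengths, int numTokens, int *results)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix < numTokens) {
        // Each thread processes exactly one token:
        const char *token = chars + starts[ix];
        int sum = 0;
        for (int i = 0; i < lengths[ix]; i++) {
            sum += token[i]; // e.g. a trivial per-token computation
        }
        results[ix] = sum;
    }
}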


More broadly:

It parallelizes the comparison of two tokens.

I’m not sure what that means. If this refers to things like the PDF file for „A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations“ that you sent me, then keep in mind: The people who are writing something like this are (likely) teams of experts who gathered knowledge and a deep understanding, and invested months and maybe years to come up with the proposed solution. They are not working on „Ontology matching implemented with Java“. They are going down to the metal, and talk about their kernels on the level of instructions, warp sizes, memory coalescing and tricky uses of shared memory.

I just cannot help you with that. And one reason why I’m annoyed is: you seem to think that I can help you. Of course, I could now dive into this topic (and I’d certainly not do that with JCuda, but with plain CUDA - of course!). I could read books and papers and perform benchmarks and consider implementation options. I could spend a few years on that.

But I have to work, to earn money, to pay my rent, and if I have some spare time, I carefully think about whether I use it for leisure, or to do that friggin update for CUDA 11.2 that has been pending for so long, or for helping you to get a PhD - particularly when you ask questions that are literally answered in the first search result of cuda threadidx blockidx - Google Search

I sent you a question in private.

And I have answered, repeating what I already said a dozen times, but obviously, the message does not come across:

You want a PhD, so you have to find a solution.
I cannot help you with that.

Thanks. I’ll try. Here, I will ask general questions about JCuda.

To be clear: I’d really like to help you. And if you review my answers in this thread, you’ll see that I tried to explain everything that I know and that I could say for sure in great detail.

But when the questions cover the range between basics (like „What is the ‚thread index‘?“) and high-level (and at the same time highly specific) questions (like „What’s the kernel code for [this-and-that paper]?“), then I just cannot help you.

I’ll still try to answer questions that are at least somewhat specific for JCuda, or broader questions that may be of interest for others and that I can answer sensibly.

Thanks a lot Marco.

I am so sorry for disturbing you, Marco. I have made part of the code. I sent you a private email related to JCuda.

When I run the program with JCuda, it gives

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:2139)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.computeResult(OntologyJcudaProjectFinalDivideOther.java:436)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.main(OntologyJcudaProjectFinalDivideOther.java:60)
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:1330: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:936: Java returned: 1

Note that I have fixed all syntax errors in the kernel, and I have already created the PTX file from the .cu file.
How can I turn the above error into understandable statements, pointing to lines of code, that I can easily fix? Also, I sent all the code by email if you have time to take a look.

I have fixed all problems. It is still the same error. The code is in the email.
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:2139)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.computeResult(OntologyJcudaProjectFinalDivideOther.java:407)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.main(OntologyJcudaProjectFinalDivideOther.java:62)
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:1330: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:936: Java returned: 1
BUILD FAILED (total time: 2 seconds)

I also have created the PTX file. What is the problem? Look, there is an earlier post for you about the same problem.

That linked error is totally unrelated. The error code CUDA_ERROR_LAUNCH_FAILED only means that ~„something is wrong“. It does not tell you what is wrong. You have to figure this out on your own. And this is impossible if you don’t exactly know what you’re doing.
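In plain CUDA C, the usual way to narrow such an error down is to check the result of every single API call and of the kernel launch itself - a sketch (JCuda does essentially the same for you when exception checks are enabled):

#include <stdio.h>

#define CHECK(call)                                        \
    do {                                                   \
        cudaError_t err = (call);                          \
        if (err != cudaSuccess) {                          \
            printf("Error '%s' at line %d\n",              \
                cudaGetErrorString(err), __LINE__);        \
        }                                                  \
    } while (0)

// After a kernel launch:
//     myKernel<<<grid, block>>>(...);
//     CHECK(cudaGetLastError());      // reports launch-configuration errors
//     CHECK(cudaDeviceSynchronize()); // reports errors during kernel execution

Running the program under cuda-memcheck (or compute-sanitizer in newer toolkits) can also point to invalid memory accesses inside the kernel.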

So in this case, I can only say that I always have to say:

Your code is wrong.

Yeah. That’s it. That doesn’t help you much.

However: When you add …

private static KernelData createKernelData( ...) {

    ...

    kernelData.deviceResultFinal = deviceResultFinal; // <--------- THIS LINE
          
    return kernelData;
}

… then it will „„work““. (Of course, it will not really work. It will not do what you want it to do. But ~„the crash will disappear“, and you might think that you’re one step closer to your goal, until you encounter the next error… it’s tedious…)

Thanks a lot. It works. I had put this statement in earlier projects and omitted it here. Great love to you, Marco.

A post was moved to a new topic: Restoring deleted CUDA file from PTX

A post was moved to a new topic: Computing execution time from events

Hi Marco,

How can I learn about JCudaMP: OpenMP/Java on CUDA? Are there any projects or code for it?