How to pass a two-dimensional array with multiple rows and two columns

I am sorry, Marco, for annoying you. I have read the book "Professional CUDA C Programming" and have been applying many of the examples in it for a year. I feel I still have only a little knowledge of CUDA programming, even after reading a lot of the book.

I know you sent me your thoughts on Levenshtein distance, and I am making the best use of them. However, the papers I am applying diverge from the Levenshtein implementations found on most sites: they parallelize the comparison of two tokens.

Thanks a lot for all your responses; each one teaches me something new. I do not need to use a two-dimensional grid.

To simplify the question: if I do the following and use only ix,

__device__ void levenshteinDistance(Parameters)
{
    int offset = strStart;
    if (ix + offset < patternRemovedLength) {
        // ...
        __syncthreads();
    }
}



extern "C"
__global__ void ComputationdXOnGPU(Parameters)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    if (ix<numStr)
    {
        for (int i=0; i<numPatternRemoved; i++) 
        {
           code.............................
          .......................................
          .......................................
            levenshteinDistance(parameters); (Here, we call a device function)
        }
  }
} 

I want to know whether ix starts at 0 in both if statements, inside the global and the device function.
I mean: does the outer if statement have any effect on, or limit, the values seen by the inner one?

The computation blockIdx.x * blockDim.x + threadIdx.x will give the same result in both functions.

Think of blockIdx, blockDim and threadIdx as "global variables".

__device__ void exampleA(int a) {
    int indexA = blockIdx.x * blockDim.x + threadIdx.x;
}
__device__ void exampleB(int a) {
    int indexB = blockIdx.x * blockDim.x + threadIdx.x;
}
__global__ void ComputationdXOnGPU(Parameters) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    exampleA(123);
    exampleB(234);
}

then index, indexA and indexB will always have the same value. And this value will be "the index of the thread" that is executing these functions. When you have a "work size" of 1000, then the kernel (and the function calls that it contains) will be called 1000 times, by different threads, all at the same time. In the first thread, index, indexA and indexB will be 0. In the thread with (zero-based) index 347, they will all be 347.

You have to use this index to access different parts of the data. When you are talking about "words" and "tokens", then you have to … (and maybe I already mentioned that once or twice or … more often) think about how to lay out and access your data with all these threads.
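For illustration, a minimal sketch of this pattern (the kernel name processAll, the element count and the doubling operation are all hypothetical): each thread uses its index to pick "its" element of the data, and a guard protects against the extra threads of the last block.

extern "C"
__global__ void processAll(int numElements, const float *input, float *output)
{
    // Each thread computes its own global index ...
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // ... and uses it to access "its" part of the data. The guard is
    // needed because the grid usually contains a few more threads
    // than there are elements.
    if (index < numElements)
    {
        output[index] = input[index] * 2.0f;
    }
}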


More broadly:

It parallelizes the comparison of two tokens.

I'm not sure what that means. If this refers to things like the PDF file for "A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations" that you sent me, then keep in mind: The people who write something like this are (likely) teams of experts who gathered knowledge and a deep understanding, and invested months and maybe years to come up with the proposed solution. They are not working on "Ontology matching implemented with Java". They go down to the metal, and talk about their kernels on the level of instructions, warp sizes, memory coalescing and tricky uses of shared memory.

I just cannot help you with that. And one reason why I'm annoyed is: You seem to think that I can help you. Of course, I could now dive into this topic (and I'd certainly not do that with JCuda, but with plain CUDA - of course!). I could read books and papers and perform benchmarks and consider implementation options. I could spend a few years on that.

But I have to work, to earn money, to pay my rent, and if I have some spare time, I carefully think about whether I use it for leisure, or to do that friggin update for CUDA 11.2 that has been pending for so long, or for helping you to get a PhD - particularly when you ask questions that are literally answered by the first result of a Google search for "cuda threadidx blockidx".

I sent you a question in private.

And I have answered, repeating what I already said a dozen times, but obviously, the message does not come across:

You want a PhD, so you have to find a solution.
I cannot help you with that.

Thanks, I'll try. Here, I will ask general questions about JCuda.

To be clear: I’d really like to help you. And if you review my answers in this thread, you’ll see that I tried to explain everything that I know and that I could say for sure in great detail.

But when the questions cover the range between basics (like "What is the 'thread index'?") and high-level (and at the same time highly specific) questions (like "What's the kernel code for [this-and-that paper]?"), then I just cannot help you.

I’ll still try to answer questions that are at least somewhat specific for JCuda, or broader questions that may be of interest for others and that I can answer sensibly.

Thanks a lot Marco.

I am so sorry for disturbing you, Marco. I have made part of the code. I sent you an email in private related to JCuda.

When I run the program with JCuda, it gives:

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:2139)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.computeResult(OntologyJcudaProjectFinalDivideOther.java:436)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.main(OntologyJcudaProjectFinalDivideOther.java:60)
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:1330: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:936: Java returned: 1

Note that I have fixed all the syntax errors in the kernel, and I have already created the PTX file from the .cu file.
How can I turn the above error into understandable statements about the lines of code, so that I can easily solve it? Also, I sent all the code by email, if you have time to take a look.

I have fixed all the problems, but it is still the same error. The code is in the email.
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:2139)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.computeResult(OntologyJcudaProjectFinalDivideOther.java:407)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.main(OntologyJcudaProjectFinalDivideOther.java:62)
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:1330: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:936: Java returned: 1
BUILD FAILED (total time: 2 seconds)

I have also created the PTX file. What is the problem? Look, there is an earlier post of yours about the same problem.

That linked error is totally unrelated. The error code CUDA_ERROR_LAUNCH_FAILED only means that ~"something is wrong". It does not tell you what is wrong. You have to figure this out on your own. And this is impossible if you don't know exactly what you're doing.
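To see why the exception only shows up at cuCtxSynchronize: a kernel launch is asynchronous, and an error during kernel execution is usually only reported by the next synchronizing call. A minimal plain-CUDA sketch (badKernel is hypothetical and deliberately broken):

#include <cstdio>

// A deliberately broken kernel: it writes through a NULL pointer.
__global__ void badKernel(int *data)
{
    data[threadIdx.x] = threadIdx.x;
}

int main()
{
    badKernel<<<1, 256>>>(NULL);

    // The launch itself "succeeds": only the launch configuration
    // is checked here ...
    cudaError_t launchErr = cudaGetLastError();

    // ... but the crash during execution is only reported at the next
    // synchronization - just like at cuCtxSynchronize in the stack trace.
    cudaError_t syncErr = cudaDeviceSynchronize();

    printf("launch: %s, sync: %s\n",
        cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
    return 0;
}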

So in this case, I can only say that I always have to say:

Your code is wrong.

Yeah. That’s it. That doesn’t help you much.

However: When you add …

private static KernelData createKernelData(...) {

    ...

    kernelData.deviceResultFinal = deviceResultFinal; // <--------- THIS LINE

    return kernelData;
}

… then it will "work". (Of course, it will not really work. It will not do what you want it to do. But ~"the crash will disappear", and you might think that you're one step closer to your goal, until you encounter the next error… it's tedious…)

Thanks a lot, it works. I had put this statement in earlier projects, but omitted it here. Great love to you, Marco.

A post was moved to a new topic: Restoring deleted CUDA file from PTX

A post was moved to a new topic: Computing execution time from events

Hi Marco,

How can I learn about JCudaMP (OpenMP/Java on CUDA)? Are there any projects or code for it?

As far as I know, JCudaMP was only a research project that somebody wrote about eleven years ago. It is (from my knowledge) totally unrelated to JCuda. You may try contacting the authors (e.g. via https://dl.acm.org/doi/10.1145/1808954.1808959 )

Marco,

If I have the following code on the host side,

CUdeviceptr deviceXPattern = new CUdeviceptr();
cuMemAlloc(deviceXPattern, totallength0 * totallengthDistinct * Sizeof.INT);

This means that deviceXPattern has to be passed to the kernel, filled there by some calculations, and then copied back to the host.

Instead of deviceXPattern, I need to define an XPattern matrix inside the kernel, as an internal device variable of size totallength0 * totallengthDistinct,

knowing that totallength0 and totallengthDistinct are passed from the host.

It is possible to allocate memory in kernels. I once created a sample for that, and hesitated to add it to the jcuda-samples repository, because such allocations should be used with care, and only when you know exactly what you're doing. But … maybe that doesn't matter, so I just added it via this commit: Added example for allocation in kernel · jcuda/jcuda-samples@3de3654 · GitHub

Marco, here is what I mean by my question: I do not need to pass an empty structure deviceXPattern to the device, because I will not return it to the host again; I already know how to size it from the host. Another array, deviceWordsFinal, is the only one that will be returned to the host.

I need to define deviceXPattern inside the device as a matrix of totallength0 * totallengthDistinct, fill it with calculations, and then use it to compute deviceWordsFinal.

How can I define deviceXPattern inside the device as a two-column matrix of integers? It will be used only on the device.

The example shows that you can do

float* data = (float*) malloc(rows * columns * sizeof(float));

in the kernel. If this is not what you need, maybe write down what you would write in Java; then it may be possible to "translate" that to CUDA.
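For illustration, a minimal sketch of such an in-kernel allocation, assuming the totallength0 and totallengthDistinct parameters from the question (the kernel name and the zero-filling loop are hypothetical; only the malloc/free pattern is from the sample):

extern "C"
__global__ void exampleKernel(int totallength0, int totallengthDistinct)
{
    // Each thread that executes this allocates its own matrix on the
    // device heap. In-kernel malloc may fail, so check the result.
    int *xPattern = (int*) malloc(
        totallength0 * totallengthDistinct * sizeof(int));
    if (xPattern == NULL)
    {
        return;
    }

    // Treat the flat allocation as a matrix: element (row, col) is
    // stored at index row * totallengthDistinct + col.
    for (int row = 0; row < totallength0; row++)
    {
        for (int col = 0; col < totallengthDistinct; col++)
        {
            xPattern[row * totallengthDistinct + col] = 0;
        }
    }

    // ... fill xPattern with the actual calculations, and use the
    // results to compute deviceWordsFinal ...

    // Device-side allocations must be freed in the kernel.
    free(xPattern);
}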