How to pass a multidimensional array of rows and two columns

Thanks, I’ll try. I will ask general questions about JCuda here.

To be clear: I’d really like to help you. And if you review my answers in this thread, you’ll see that I tried to explain everything that I know and that I could say for sure in great detail.

But when the questions cover the range between basics (like "What is the 'thread index'?") and high-level (and at the same time highly specific) questions (like "What's the kernel code for [this-and-that paper]?"), then I just cannot help you.

I’ll still try to answer questions that are at least somewhat specific for JCuda, or broader questions that may be of interest for others and that I can answer sensibly.

Thanks a lot Marco.

I am so sorry for disturbing you, Marco. I have written part of the code. I sent you a private email related to JCuda.

When I run the program in JCuda, it gives

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:2139)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.computeResult(OntologyJcudaProjectFinalDivideOther.java:436)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.main(OntologyJcudaProjectFinalDivideOther.java:60)
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:1330: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:936: Java returned: 1

I have already fixed all syntax errors in the kernel and created the PTX file from the .cu file.
How can I translate the above error into understandable statements, down to the lines of code, so that I can easily solve it? I also sent all the code by email, if you have time to take a look.

I have fixed all problems, but it is still the same error. The code is in the email.
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:2139)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.computeResult(OntologyJcudaProjectFinalDivideOther.java:407)
at ontologyjcudaprojectfinaldivideother.OntologyJcudaProjectFinalDivideOther.main(OntologyJcudaProjectFinalDivideOther.java:62)
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:1330: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyJcudaProjectFinalOther\nbproject\build-impl.xml:936: Java returned: 1
BUILD FAILED (total time: 2 seconds)

I have also created the PTX file. What is the problem? There is an earlier post addressed to you about the same problem.

That linked error is totally unrelated. The error code CUDA_ERROR_LAUNCH_FAILED only means that, roughly, "something is wrong". It does not tell you what is wrong. You have to figure this out on your own. And this is impossible if you don’t know exactly what you’re doing.

So in this case, I can only say what I always have to say:

Your code is wrong.

Yeah. That’s it. That doesn’t help you much.

However: When you add …

private static KernelData createKernelData( ...) {

    ...

    kernelData.deviceResultFinal = deviceResultFinal; // <--------- THIS LINE
          
    return kernelData;
}

… then it will "work". (Of course, it will not really work. It will not do what you want it to do. But the crash will disappear, and you might think that you’re one step closer to your goal, until you encounter the next error… it’s tedious…)
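For readers hitting the same crash: the missing line means that the KernelData object is returned with its deviceResultFinal field still unset, so the launch is handed an invalid pointer. A minimal plain-Java sketch of why that one assignment matters (hypothetical names; a plain Object stands in for a jcuda.driver.CUdeviceptr):

```java
// Sketch with assumed names; Object stands in for jcuda.driver.CUdeviceptr.
public class KernelDataSketch {
    static class KernelData {
        Object deviceResultFinal; // stays null until explicitly assigned
    }

    static KernelData createKernelData(Object deviceResultFinal) {
        KernelData kernelData = new KernelData();
        // Without this assignment the field remains null, and the kernel
        // launch receives an invalid pointer (-> CUDA_ERROR_LAUNCH_FAILED).
        kernelData.deviceResultFinal = deviceResultFinal;
        return kernelData;
    }

    public static void main(String[] args) {
        Object devicePointer = new Object();
        KernelData data = createKernelData(devicePointer);
        System.out.println(data.deviceResultFinal == devicePointer);
    }
}
```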

Thanks a lot, it works! I used this statement in earlier projects but omitted it here. Great love to you, Marco.

A post was moved to a new topic: Restoring deleted CUDA file from PTX

A post was moved to a new topic: Computing execution time from events

Hi Marco,

How can I learn about JCudaMP (OpenMP/Java on CUDA)? Are there any projects or code for it?

As far as I know, JCudaMP was only a research project that somebody wrote about eleven years ago. To my knowledge, it is totally unrelated to JCuda. You could try contacting the authors (e.g. via https://dl.acm.org/doi/10.1145/1808954.1808959 )

Marco,

If I have the following code inside the host side,

CUdeviceptr deviceXPattern = new CUdeviceptr();
cuMemAlloc(deviceXPattern, totallength0 * totallengthDistinct * Sizeof.INT);

This means that deviceXPattern has to be passed to the kernel, filled there by some calculations, and then returned back to the host.

Instead of deviceXPattern, I need to define an XPattern matrix inside the kernel (as an internal device variable) consisting of totallength0 * totallengthDistinct elements,

knowing that totallength0 and totallengthDistinct are passed from the host.
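One Java detail worth double-checking in an allocation like the one above: cuMemAlloc takes a long byte count, but an expression such as totallength0 * totallengthDistinct * Sizeof.INT is evaluated in int arithmetic first and can silently overflow for large matrices. A plain-Java sketch (the sizes and the SIZEOF_INT stand-in are assumptions for illustration):

```java
// Demonstrates int overflow in a byte-count expression before it is
// widened to long; SIZEOF_INT stands in for jcuda.Sizeof.INT.
public class AllocSizeSketch {
    public static void main(String[] args) {
        int totallength0 = 40_000;        // assumed example size
        int totallengthDistinct = 20_000; // assumed example size
        final int SIZEOF_INT = 4;

        // int * int * int wraps around before the assignment widens it:
        long overflowed = totallength0 * totallengthDistinct * SIZEOF_INT;
        // Casting the first factor forces the whole product into long math:
        long correct = (long) totallength0 * totallengthDistinct * SIZEOF_INT;

        System.out.println(overflowed == correct); // false: int math wrapped
        System.out.println(correct);               // 3200000000 bytes
    }
}
```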

It is possible to allocate memory in kernels. I once created a sample for that, and hesitated to add it to the jcuda-samples repository, because such allocations should be used with care, and only when you know exactly what you’re doing. But… maybe that doesn’t matter, so I just added it via this commit: Added example for allocation in kernel · jcuda/jcuda-samples@3de3654 · GitHub

Marco, what I mean by my question is: I do not need to pass an empty structure deviceXPattern to the device, because I will not return it to the host; I already know how to size it from the host. Only one final array, deviceWordsFinal, will be returned to the host.

I need to define deviceXPattern inside the device as a matrix of size totallength0 * totallengthDistinct, fill it with calculations, and then use it to compute deviceWordsFinal.

How do I define deviceXPattern inside the device as a matrix of integers with two columns? It will be used only on the device.

The example shows that you can do

float* data = (float*) malloc(rows * columns * sizeof(float));

in the kernel. If this is not what you need, maybe write what you would write in Java, then it may be possible to "translate" that to CUDA.
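To illustrate that "translation": a 2D array dX[row][col] in Java corresponds to a single flat buffer indexed row-major on the device, which is exactly what the malloc line above gives you. A small plain-Java sketch (sizes chosen arbitrarily for illustration):

```java
// Row-major flattening: what a Java 2D array becomes on the device side.
public class RowMajorSketch {
    public static void main(String[] args) {
        int rows = 3, columns = 2; // e.g. totalLengthDistinct x srcLength

        int[][] java2d = new int[rows][columns]; // what you write in Java
        int[] flat = new int[rows * columns];    // what malloc gives in CUDA

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columns; c++) {
                java2d[r][c] = r * 10 + c;
                flat[r * columns + c] = r * 10 + c; // row-major index
            }
        }
        // Element (2,1) lives at flat index 2 * columns + 1 = 5.
        System.out.println(java2d[2][1] == flat[2 * columns + 1]);
        System.out.println(flat[5]);
    }
}
```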

Marco, there is an email on private related to declarations inside kernel.

Marco,

The error is due to an internal declaration. When I limit the kernel to the first internal declaration, for example:

extern "C"
__global__ void ComputationdClustersInternelOnGPU(
    int numTokenSrc, int numWordSrc, int srcLength, char *src,
    int *srctokensSFIndices, int *srctokensLength, int *srcIndices,
    int *srcStartIndices, int totalLengthDistinct, char *patternRemoved,
    int numTokenPattern, int numWordPattern, int patternLength, char *pattern,
    int *patterntokensSFIndices, int *patterntokensLength, int *patternIndices,
    int *patternStartIndices, float *WordsFinal)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int* dX = (int*) malloc(srcLength * totalLengthDistinct * sizeof(int));
    if (ix < totalLengthDistinct)
    {
        for (int i = 0; i < srcLength; i++) {
            if (src[i] == ',')
                dX[ix * srcLength + i] = 0;
            else
            {
                if (src[i] == patternRemoved[ix])
                    dX[ix * srcLength + i] = srcIndices[i];
                else if (src[i] != patternRemoved[ix])
                    dX[ix * srcLength + i] = dX[ix * srcLength + i - 1];
            }
        }
    }
    __syncthreads();

    for (int i = 0; i < srcLength * totalLengthDistinct; i++) {
        printf("Elements of an array");
        printf("%d\n", dX[i]);
    }
}

When I run the kernel, it outputs all zeros for the matrix dX, and this is the reason for the final error mentioned in the email. But when I declare dX on the host and send it to the kernel, the matrix contains the right output. What is wrong with the declaration on the device? It is only an int matrix of size srcLength * totalLengthDistinct.

0
0
0
0
… (31 zeros in total)

Can you help me,please?

You’re doing something wrong.

I cannot help you.

How often do I have to repeat that?

I think that you still have not grasped the fact that the kernel is run thousands of times in parallel. It seems like you’re just writing plain, procedural, sequential code in the kernel, and expect it to be faster, because "it is CUDA".

This is wrong.

You have to develop a deep understanding of the CUDA programming model, and a clear idea of how your problem can be mapped to it, in order to achieve a speedup, and in order to create an implementation that 'works' in the first place.


You’re allocating some memory in the kernel. There are constraints for that. I don’t know these constraints. But here is an example that does "some" allocation in the kernel. There are two lines that are marked with NOTE - See message. When you uncomment these lines, you will see different behavior:

  • When you do not call free for the memory that you allocated, it will not work
  • When the memory that you try to allocate is too large, it will not work

I have no idea what your current code looks like, and what might be wrong there. My request to send me a ZIP file with the project, so that I can test it, was ignored. So you’ll have to figure that out on your own. But even if you send me the project: your problem is simply not my problem. You want that PhD. Go for it.

package jcuda.driver.test;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemGetInfo;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import jcuda.Pointer;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

public class JCudaKernelAllocTest
{
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void testKernel(int n)" + "\n" +
        "{" + "\n" +
        "    int* array = (int*) malloc(n * sizeof(int));" + "\n" +
        "    for (int i=0; i<n; i++)" + "\n" +
        "    {" + "\n" +
        "        array[i] = i;" + "\n" +
        "    }" + "\n" +
        "    for (int i=0; i<n; i++)" + "\n" +
        "    {" + "\n" +
        "        //printf(\"%d\\n\",array[i]);" + "\n" +
        "    }" + "\n" +
        
        // NOTE - See message
        // "    free(array);" + "\n" +
        
        "}" + "\n";
    
    public static void main(String[] args)
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);

        CUfunction function = defaultInitialization();

        int runs = 50;
        int n = 500;

        // NOTE - See message
        //n = 5000;
        
        for (int i=0; i<runs; i++)
        {
            printMemoryInfo();
            runKernel(function, n);
            printMemoryInfo();
        }
        
        System.out.println("Done");
    }

    private static void runKernel(CUfunction function, int n)
    {
        System.out.println(
            "Running kernel, allocating " + (n * n * 4) + " bytes");
        
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{n})
        );
        int blockSizeX = 256;
        int gridSizeX = (n + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX,  1, 1,
            blockSizeX, 1, 1,
            0, null,
            kernelParameters, null
        );
        cuCtxSynchronize();
        
        System.out.println(
            "Running kernel done");
    }
    
    private static void printMemoryInfo()
    {
        long free[] = { -1 };
        long total[] = { -1 };
        cuMemGetInfo(free, total);
        System.out.println("Free: " + free[0] + " total: " + total[0]);
    }

    private static CUfunction defaultInitialization()
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        
        // Use the NVRTC to create a program by compiling the source code
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        
        // Print the compilation log (for the case there are any warnings)
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        
        // Obtain the PTX ("CUDA Assembler") code of the compiled program
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);

        // Create a CUDA module from the PTX code
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, ptx[0]);

        // Obtain the function pointer to the "add" function from the module
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "testKernel");
        
        return function;
    }
}

Dear Marco,

You wrote to me before about calculating time for JCuda programs:

An example of how events can be used can be derived from https://github.com/jcuda/jcuda-samples/blob/2e6e62d0a463a6ebca6ca230bd015f96b955f08e/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeMemoryBandwidths.java#L227 (it’s very similar to the C code, except for the usual C/Java-specific differences)

float elapsedTimeMs = elapsedTimeMsArray[0];

Does that mean microseconds or milliseconds?