How to pass an array of multidimensional rows and two columns

As far as I know, JCudaMP was only a research project that somebody wrote about eleven years ago. It is (from my knowledge) totally unrelated to JCuda. You may try contacting the authors (e.g. via https://dl.acm.org/doi/10.1145/1808954.1808959 )

Marco,

If I have the following code inside the host side,

CUdeviceptr deviceXPattern = new CUdeviceptr();
cuMemAlloc(deviceXPattern, totallength0 * totallengthDistinct * Sizeof.INT);

This means that deviceXPattern has to be passed to the kernel, filled there by some calculations, and then copied back to the host.

Instead of deviceXPattern, I need to define an XPattern matrix of size totallength0 * totallengthDistinct inside the kernel (as an internal device variable), knowing that totallength0 and totallengthDistinct are passed from the host.

It is possible to allocate memory in kernels. I once created a sample for that, and hesitated to add it to the jcuda-samples repository, because such allocations should be used with care and only when you exactly know what you’re doing. But … maybe that doesn’t matter, so I just added it via this commit: Added example for allocation in kernel · jcuda/jcuda-samples@3de3654 · GitHub

Marco, what I mean by my question is: I do not need to pass an empty structure deviceXPattern to the device, and I will not return it to the host again, because I know how to resize it from the host. Only one final array, deviceWordsFinal, will be returned to the host.

I need to define deviceXPattern inside the device as a matrix of size totallength0 * totallengthDistinct, fill it with calculations, and then use it for deviceWordsFinal.

How do I define deviceXPattern inside the device as a matrix with two integer dimensions? It will be used only on the device.

The example shows that you can do

float* data = (float*) malloc(rows * columns * sizeof(float));

in the kernel. If this is not what you need, maybe write what you would write in Java; then it may be possible to "translate" that to CUDA.

Marco, there is a private email related to declarations inside the kernel.

Marco,

The error is due to the internal declaration. When I limit the kernel to the first internal declaration, for example:

extern "C"
__global__ void ComputationdClustersInternelOnGPU(int numTokenSrc,int numWordSrc,int srcLength, char *src,int *srctokensSFIndices,int *srctokensLength,int *srcIndices, int *srcStartIndices,int totalLengthDistinct, char *patternRemoved,int numTokenPattern,int numWordPattern,int patternLength,char *pattern,int *patterntokensSFIndices,int *patterntokensLength,int *patternIndices,int *patternStartIndices,float *WordsFinal)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int* dX = (int*) malloc(srcLength * totalLengthDistinct * sizeof(int));
    if (ix < totalLengthDistinct)
    {
        for (int i = 0; i < srcLength; i++)
        {
            if (src[i] == ',')
                dX[ix * srcLength + i] = 0;
            else
            {
                if (src[i] == patternRemoved[ix])
                    dX[ix * srcLength + i] = srcIndices[i];
                else if (src[i] != patternRemoved[ix])
                    dX[ix * srcLength + i] = dX[ix * srcLength + i - 1];
            }
        }
    }
    __syncthreads();

    for (int i = 0; i < srcLength * totalLengthDistinct; i++)
    {
        printf("Elements of an array");
        printf("%d\n", dX[i]);
    }
}

When I run the kernel, it gives all zeros as the output for the matrix dX, and this is the reason for the final error mentioned in the email. However, when I declare dX on the host and send it to the kernel, it gives the right output in the matrix. What is wrong with the declaration on the device? It is only an int matrix of size srcLength * totalLengthDistinct.

0
0
0
(… and so on: every printed value is 0)

Can you help me,please?

You’re doing something wrong.

I cannot help you.

How often do I have to repeat that?

I think that you still have not internalized the fact that the kernel is run thousands of times in parallel. It seems like you’re just writing plain, procedural, sequential code in the kernel, and expect it to be faster because "it is CUDA".

This is wrong.

You have to develop a deep understanding of the CUDA programming model, and a clear idea about how your problem can be mapped to that model, in order to achieve a speedup, and in order to create an implementation that 'works' in the first place.


You’re allocating some memory in the kernel. There are constraints for that. I don’t know all of these constraints. But here is an example that does "some" allocation in the kernel. There are two lines that are marked with NOTE - See message. When you uncomment these lines, you will see different behavior:

  • When you are not calling free for the memory that you allocated, then it will not work
  • When the memory that you are trying to allocate is too large, then it will not work

I have no idea what your current code looks like, and what might be wrong there. My request to send me a ZIP file with the project, so that I can test it, was ignored. So you’ll have to figure that out on your own. But even if you send me the project: Your problem is simply not my problem. You want that PhD. Go for it.

package jcuda.driver.test;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemGetInfo;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import jcuda.Pointer;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

public class JCudaKernelAllocTest
{
    private static String programSourceCode = 
        "extern \"C\"" + "\n" +
        "__global__ void testKernel(int n)" + "\n" +
        "{" + "\n" +
        "    int* array = (int*) malloc(n * sizeof(int));" + "\n" +
        "    for (int i=0; i<n; i++)" + "\n" +
        "    {" + "\n" +
        "        array[i] = i;" + "\n" +
        "    }" + "\n" +
        "    for (int i=0; i<n; i++)" + "\n" +
        "    {" + "\n" +
        "        //printf(\"%d\\n\",array[i]);" + "\n" +
        "    }" + "\n" +
        
        // NOTE - See message
        // "    free(array);" + "\n" +
        
        "}" + "\n";
    
    public static void main(String[] args)
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);

        CUfunction function = defaultInitialization();

        int runs = 50;
        int n = 500;

        // NOTE - See message
        //n = 5000;
        
        for (int i=0; i<runs; i++)
        {
            printMemoryInfo();
            runKernel(function, n);
            printMemoryInfo();
        }
        
        System.out.println("Done");
    }

    private static void runKernel(CUfunction function, int n)
    {
        System.out.println(
            "Running kernel, allocating " + (n * n * 4) + " bytes");
        
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{n})
        );
        int blockSizeX = 256;
        int gridSizeX = (n + blockSizeX - 1) / blockSizeX;
        cuLaunchKernel(function,
            gridSizeX,  1, 1,
            blockSizeX, 1, 1,
            0, null,
            kernelParameters, null
        );
        cuCtxSynchronize();
        
        System.out.println(
            "Running kernel done");
    }
    
    private static void printMemoryInfo()
    {
        long free[] = { -1 };
        long total[] = { -1 };
        cuMemGetInfo(free, total);
        System.out.println("Free: " + free[0] + " total: " + total[0]);
    }

    private static CUfunction defaultInitialization()
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        
        // Use the NVRTC to create a program by compiling the source code
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        
        // Print the compilation log (for the case there are any warnings)
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        
        // Obtain the PTX ("CUDA Assembler") code of the compiled program
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);

        // Create a CUDA module from the PTX code
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, ptx[0]);

        // Obtain the function pointer to the "testKernel" function from the module
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "testKernel");
        
        return function;
    }
}

Dear Marco,

You wrote to me before about calculating time for JCuda programs:

An example of how events can be used can be derived from https://github.com/jcuda/jcuda-samples/blob/2e6e62d0a463a6ebca6ca230bd015f96b955f08e/JCudaSamples/src/main/java/jcuda/runtime/samples/JCudaRuntimeMemoryBandwidths.java#L227 (it’s very similar to the C code, except for the usual C/Java-specific differences)

float elapsedTimeMs = elapsedTimeMsArray[0];

Does it mean microsecond or millisecond?

ms should always mean milliseconds.

(If it were microseconds, it would be μs, but this should not appear in source code, so it would be written as us or spelled out as microseconds.)

Here, ms means milliseconds.

Hi Marco,

I am now working through examples on Hyper-Q with CUDA, and I will try to convert them to JCuda. But I ran into a problem when applying the example simpleHyperqDepth.cu from Wrox.com.

The example is as follows.

#include "../common/common.h"
#include <stdio.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <math.h>


#define N 300000
#define NSTREAM 4

__global__ void kernel_1()
{
	double sum = 0.0;

	for (int i = 0; i < N; i++)
	{
		sum = sum + tan(0.1) * tan(0.1);
	}
}
__global__ void kernel_2()
{
	double sum = 0.0;

	for (int i = 0; i < N; i++)
	{
		sum = sum + tan(0.1) * tan(0.1);
	}
}
__global__ void kernel_3()
{
	double sum = 0.0;

	for (int i = 0; i < N; i++)
	{
		sum = sum + tan(0.1) * tan(0.1);
	}
}

__global__ void kernel_4()
{
	double sum = 0.0;

	for (int i = 0; i < N; i++)
	{
		sum = sum + tan(0.1) * tan(0.1);
	}
}

int setenv(const char *name, const char *value, int overwrite)
{
	int errcode = 0;
	if (!overwrite) {
		size_t envsize = 0;
		errcode = getenv_s(&envsize, NULL, 0, name);
		if (errcode || envsize) return errcode;
	}
	return _putenv_s(name, value);
}

int main(int argc, char **argv)
{
	int n_streams = NSTREAM;
	int isize = 1;
	int iblock = 1;
	int bigcase = 0;
	// get argument from command line
	if (argc > 1) n_streams = atoi(argv[1]);
	if (argc > 2) bigcase = atoi(argv[2]);
	float elapsed_time;
	// set up max connections
	char* iname = "CUDA_DEVICE_MAX_CONNECTIONS";
	setenv(iname, "9", 1);
	char *ivalue = getenv(iname);
	printf("%s = %s\n", iname, ivalue);
	int dev = 0;
	cudaDeviceProp deviceProp;
	CHECK(cudaGetDeviceProperties(&deviceProp, dev));
	printf("> Using Device %d: %s with num_streams=%d\n", dev, deviceProp.name,
		n_streams);
	CHECK(cudaSetDevice(dev));
	// check if the device supports Hyper-Q
	if (deviceProp.major < 3 || (deviceProp.major == 3 && deviceProp.minor < 5))
	{
		if (deviceProp.concurrentKernels == 0)
		{
			printf("> GPU does not support concurrent kernel execution (SM 3.5 "
				"or higher required)\n");
			printf("> CUDA kernel runs will be serialized\n");
		}
		else
		{
			printf("> GPU does not support HyperQ\n");
			printf("> CUDA kernel runs will have limited concurrency\n");
		}
	}
	printf("> Compute Capability %d.%d hardware with %d multi-processors\n",
		deviceProp.major, deviceProp.minor, deviceProp.multiProcessorCount);
	printf("> Concurrent Kernels %d\n",deviceProp.concurrentKernels);
	// Allocate and initialize an array of stream handles
	cudaStream_t *streams = (cudaStream_t *)malloc(n_streams * sizeof(
		cudaStream_t));
	for (int i = 0; i < n_streams; i++)
	{
		CHECK(cudaStreamCreate(&streams[i]));
	}
	// run kernel with more threads
	if (bigcase == 1)
	{
		iblock = 512;
		isize = 1 << 12;
	}

	// set up execution configuration
	dim3 block(iblock);
	dim3 grid(isize / iblock);
	printf("> grid %d block %d\n", grid.x, block.x);
	// create events
	cudaEvent_t start, stop;
	CHECK(cudaEventCreate(&start));
	CHECK(cudaEventCreate(&stop));

	// record start event
	CHECK(cudaEventRecord(start, 0));
	// dispatch jobs with depth-first ordering
	for (int i = 0; i < n_streams; i++)
	{
		kernel_1 << <grid, block, 0, streams[i] >> >();
		kernel_2 << <grid, block, 0, streams[i] >> >();
		kernel_3 << <grid, block, 0, streams[i] >> >();
		kernel_4 << <grid, block, 0, streams[i] >> >();
	}
}

When I run the project, the statements
kernel_1 << <grid, block, 0, streams[i] >> >();
kernel_2 << <grid, block, 0, streams[i] >> >();
kernel_3 << <grid, block, 0, streams[i] >> >();
kernel_4 << <grid, block, 0, streams[i] >> >();

give the error "expected an expression". What is wrong?

They are quoted from the example.

Secondly, can anyone help me with implementing Hyper-Q with JCuda?

You had already asked me about Hyper-Q in a mail (30.01.2021, 03:50), and I already mentioned that I have not yet used this technology, and cannot provide profound help here.

But looking at the example (from professional-cuda-c-programming/simpleHyperqDepth.cu at master · deeperlearning/professional-cuda-c-programming · GitHub ) and the error message:

You are apparently trying to compile a CUDA runtime file. "Apparently", because you didn’t provide any context.

The given file is a C file with a CUDA kernel. It has to be compiled with NVCC, with a C compiler in the background. The syntax of

    kernel_1 <<<grid, block, 0, streams[i] >>>();

should already have told you that, if you had any idea about what you are doing there.

However, here is the same example, ported to the CUDA Driver API, with JCuda:

package jcuda.test;

import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuDeviceGetAttribute;
import static jcuda.driver.JCudaDriver.cuEventCreate;
import static jcuda.driver.JCudaDriver.cuEventElapsedTime;
import static jcuda.driver.JCudaDriver.cuEventRecord;
import static jcuda.driver.JCudaDriver.cuEventSynchronize;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoadData;
import static jcuda.driver.JCudaDriver.cuStreamCreate;
import static jcuda.nvrtc.JNvrtc.nvrtcCompileProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcCreateProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcDestroyProgram;
import static jcuda.nvrtc.JNvrtc.nvrtcGetPTX;
import static jcuda.nvrtc.JNvrtc.nvrtcGetProgramLog;

import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdevice_attribute;
import jcuda.driver.CUevent;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.CUstream;
import jcuda.driver.JCudaDriver;
import jcuda.nvrtc.JNvrtc;
import jcuda.nvrtc.nvrtcProgram;

// Based on https://github.com/deeperlearning/professional-cuda-c-programming/
// blob/master/examples/chapter06/simpleHyperqDepth.cu
public class JCudaHyperqExample
{
    /**
     * The source code of the program that will be compiled at runtime:
     * 
     * Note: The function should be declared as  
     * extern "C"
     * to make sure that it can be found under the given name.
     */
    private static String programSourceCode = 
            "#define N 300000" + "\n" +
            "#define NSTREAM 4" + "\n" +
            "" + "\n" +
            "extern \"C\"" + "\n" +
            "__global__ void kernel_1()" + "\n" +
            "{" + "\n" +
            "    double sum = 0.0;" + "\n" +
            "" + "\n" +
            "    for(int i = 0; i < N; i++)" + "\n" +
            "    {" + "\n" +
            "        sum = sum + tan(0.1) * tan(0.1);" + "\n" +
            "    }" + "\n" +
            "}" + "\n" +
            "" + "\n" +
            "extern \"C\"" + "\n" +
            "__global__ void kernel_2()" + "\n" +
            "{" + "\n" +
            "    double sum = 0.0;" + "\n" +
            "" + "\n" +
            "    for(int i = 0; i < N; i++)" + "\n" +
            "    {" + "\n" +
            "        sum = sum + tan(0.1) * tan(0.1);" + "\n" +
            "    }" + "\n" +
            "}" + "\n" +
            "" + "\n" +
            "extern \"C\"" + "\n" +
            "__global__ void kernel_3()" + "\n" +
            "{" + "\n" +
            "    double sum = 0.0;" + "\n" +
            "" + "\n" +
            "    for(int i = 0; i < N; i++)" + "\n" +
            "    {" + "\n" +
            "        sum = sum + tan(0.1) * tan(0.1);" + "\n" +
            "    }" + "\n" +
            "}" + "\n" +
            "" + "\n" +
            "extern \"C\"" + "\n" +
            "__global__ void kernel_4()" + "\n" +
            "{" + "\n" +
            "    double sum = 0.0;" + "\n" +
            "" + "\n" +
            "    for(int i = 0; i < N; i++)" + "\n" +
            "    {" + "\n" +
            "        sum = sum + tan(0.1) * tan(0.1);" + "\n" +
            "    }" + "\n" +
            "}" + "\n";    
    
    /**
     * Entry point of this sample
     * 
     * @param args Not used
     */
    public static void main(String[] args)
    {
        // Enable exceptions and omit all subsequent error checks
        JCudaDriver.setExceptionsEnabled(true);
        JNvrtc.setExceptionsEnabled(true);

        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Check that concurrent kernels are supported
        int[] attributeArray = { 0 };
        cuDeviceGetAttribute(attributeArray, 
            CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, device);
        System.out.println("Concurrent kernels supported? "+attributeArray[0]);
        
        
        // Compile the source code and obtain the function
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(
            program, programSourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);
        String programLog[] = new String[1];
        nvrtcGetProgramLog(program, programLog);
        System.out.println("Program compilation log:\n" + programLog[0]);        
        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);
        CUmodule module = new CUmodule();
        cuModuleLoadData(module, ptx[0]);
        
        CUfunction kernel_1 = new CUfunction();
        cuModuleGetFunction(kernel_1, module, "kernel_1");
        CUfunction kernel_2 = new CUfunction();
        cuModuleGetFunction(kernel_2, module, "kernel_2");
        CUfunction kernel_3 = new CUfunction();
        cuModuleGetFunction(kernel_3, module, "kernel_3");
        CUfunction kernel_4 = new CUfunction();
        cuModuleGetFunction(kernel_4, module, "kernel_4");
        
        
        int n_streams = 4;
        CUstream[] streams = new CUstream[n_streams];

        for (int i = 0 ; i < n_streams ; i++)
        {
            streams[i] = new CUstream();
            cuStreamCreate(streams[i], 0);
        }

        CUevent start = new CUevent();
        CUevent stop = new CUevent();
        cuEventCreate(start, 0);
        cuEventCreate(stop, 0);

        cuEventRecord(start, null);

        int numElements = 100000000;
        int blockSizeX = 256;
        int gridSizeX = (numElements + blockSizeX - 1) / blockSizeX;
        
        for (int i = 0; i < n_streams; i++)
        {
            cuLaunchKernel(kernel_1,
                gridSizeX,  1, 1,
                blockSizeX, 1, 1,
                0, streams[i],  
                null, null
            );
            cuLaunchKernel(kernel_2,
                gridSizeX,  1, 1,
                blockSizeX, 1, 1,
                0, streams[i],  
                null, null
            );
            cuLaunchKernel(kernel_3,
                gridSizeX,  1, 1,
                blockSizeX, 1, 1,
                0, streams[i],  
                null, null
            );
            cuLaunchKernel(kernel_4,
                gridSizeX,  1, 1,
                blockSizeX, 1, 1,
                0, streams[i],  
                null, null
            );
        }
        
        cuEventRecord(stop, null);
        cuEventSynchronize(stop);

        float[] milliseconds = { 0.0f };
        cuEventElapsedTime(milliseconds, start, stop);
        System.out.printf("Measured time for parallel execution = %.3fs\n",
               milliseconds[0] / 1000.0f);

        // TODO Clean up (release streams and events).
    }
}

Compile it. Run it. Now you can write into your thesis that you used Hyper-Q, and you can write into your CV that you have experience with modern technologies like NVIDIA CUDA Hyper-Q. Isn’t that great?

Marco, please do not be angry. I have completed running ontologies with warp shuffle and other parallel techniques. It achieves great results on the GPU, with a reduction in time of up to 80%. Thanks for your help; I benefited from your experience to achieve this.

After doing so, I have to improve the time further. I need to apply my kernels with the Hyper-Q technology on large ontologies. I need your general help, and my code could benefit from it.

I need to divide the large ontology data into partitions and have all partitions execute in parallel on the GPU.

Do you have any other examples written in JCuda like the one you sent, please? One kernel applied on a set of streams, where each stream executes a portion of the ontology data.

Without being disrespectful, this is a public forum, Marco13 is doing his project in his free time, this includes answering questions.

This is not meant as free professional consultancy for people who have needed help getting their job done for over a year.

Maybe just listen to people when they say "I cannot help you"?

I’m not "angry". At least, not in a concerning way. It is rather frustration. I’ve been developing JCuda since 2008 (and it is only one of dozens of open source projects that I’m working on). I have spent literally thousands of hours on projects like this.
For free.
I did not get anything from that.
Not a single cent.
No fame.
No PhD diploma.
Nothing.

Now, I could try to come up with an estimate of how many hours of personal, direct, specific support I already provided for you. I could multiply that with the average hourly wage of a freelancer, and that would probably be a 4-digit $$$$ number. But that’s not even the point.

The point is that you apparently have absolutely no idea what you are doing there. Everything that I have seen so far has been…

  • You asked a vague question
  • I provided a patient answer, trying to elaborate the relevant aspects of CUDA programming that I know, and containing code snippets and samples that could be a basis for your further work
  • The next thing was then usually that you tweaked around in my example code, broke things and didn’t know why, or wanted extensions for the code and didn’t know how.

If I had to hire somebody for a job, and saw this thread, then this would be a red flag. (Yeah, I know: my cynicism would also be a red flag, but … "Scratch the surface of any cynic, and you will find a wounded idealist underneath.") But fortunately for you, all this does not matter. You will receive your PhD nevertheless. You will quickly become a manager in an IT department. No clue about anything whatsoever, but with lots of responsibility, and when push comes to shove, they’ll not fire the managers, but the people who implemented everything. Idiots like me.


I have not yet worked with Hyper-Q. I have shown you the basic example. You should have been able to port that example to JCuda on your own, but I did it for you, once again. Applying this highly specific technology to a highly specific problem that is solved with highly specific code (that I do not even have access to) is plainly not possible for me. You’ll have to do this on your own.

I’m not only open for questions, I actually appreciate questions about JCuda …

  • as long as they are specific enough so that they can be answered sensibly at all (and not "How to implement something with ontologies with CUDA?")
  • and as long as I can answer them (and when the question is "How to solve a problem with Hyper-Q?", and I don’t even really know what Hyper-Q is, then it’s unlikely that I can provide a sensible answer).

Hi Marco,

When I run the project, it gives the following error,

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffc6803e1a8, pid=8132, tid=6860
#
# JRE version: Java(TM) SE Runtime Environment (15.0.1+9) (build 15.0.1+9-18)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (15.0.1+9-18, mixed mode, sharing, tiered, compressed oops, g1 gc, windows-amd64)
# Problematic frame:
# C  [nvcuda.dll+0x3ce1a8]
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# D:\NetBeanProjects\OntologyThresholdSerial2023Final\hs_err_pid8132.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
D:\NetBeanProjects\OntologyThresholdSerial2023Final\nbproject\build-impl.xml:1355: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyThresholdSerial2023Final\nbproject\build-impl.xml:961: Java returned: 1
BUILD FAILED (total time: 15 seconds)

I have reinstalled the CUDA and Java environments, but I still have the error. Why? The program was running successfully before. I have sent the bug report by email.

I have another error

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_NOT_FOUND
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuModuleGetFunction(JCudaDriver.java:2587)
at ontologythresholdserial2023final.ParallelLevenstein.defaultInitialization(ParallelLevenstein.java:166)
at ontologythresholdserial2023final.ParallelLevenstein.&lt;init&gt;(ParallelLevenstein.java:80)
at ontologythresholdserial2023final.OntologyThresholdSerial2023Final.main(OntologyThresholdSerial2023Final.java:426)
D:\NetBeanProjects\OntologyThresholdSerial2023Final\nbproject\build-impl.xml:1355: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyThresholdSerial2023Final\nbproject\build-impl.xml:961: Java returned: 1
BUILD FAILED (total time: 4 seconds)

It has created the .ptx file, but there are still errors.

The last one that says CUDA_ERROR_NOT_FOUND:

This is most likely caused by a call like

cuModuleGetFunction(kernel, module, "kernelName");

where the name "kernelName" is wrong. Make sure that your .cu file contains the kernel as

extern "C"
__global__ void kernelName(....) 

The first one (the access violation) may have many reasons. Without the source code, it’s impossible to tell what is wrong there.

Marco,

If you have time, please give me an answer.

When I run the following kernel, it gives the same error.

__device__ void levenshteinDistance(int srcLength,char *patternRemoved, int distinctStart, int distinctEnd,char *pattern,int clStart,int clEnd, char *src,int *srcIndices, int strucStart, int strucEnd,int tokensStrucStart,int tokensStrucEnd,int tokensClStart,int tokensClEnd,int wordsClStart,int wordsClEnd,int wordsStrucStart,int wordsStrucEnd,int wordsTokensClStart,int wordsTokensClEnd, int wordsTokensStrucStart, int wordsTokensStrucEnd,int *dX,int *ResultFinal,float *TokensFinal,float *WordsTokensFinal1,float *WordsTokensFinal2,float *WordsFinal1,float *WordsFinal2,float *WordsFinal)
{
   
        for (int i = distinctStart; i < distinctEnd; i++) {
            for (int j = strucStart; j < strucEnd; j++) {
                if (src.charAt(j) == ',')
                  dX[i * srcLength + j] = 0;
                 else
                {
                  if (src.charAt(j) == patternRemoved.charAt(i))
                    dX[i * srcLength + j] = srcIndices[j];
                  else if (src.charAt(j) != patternRemoved.charAt(i))
                    dX[i * srcLength + j] = dX[i * srcLength + j-1];
                }
            }
             
        }
}

extern "C"
__global__ void ComputationdStructureOnGPUSerial(int srcLength,int numTokenSrc, int srcSructuresNum, int srcWordNum, char *src,int *srcStructuresSFIndices, int *srcIndices,int *srcStructureTokensSFIndices, int *srcStartIndices, int *srcStructureWordsSFIndices,int *srcStructureWordstokensSFIndices,int *srcTokensLengths,int totalLengthDistinct, char *patternRemoved, int *distinctSFIndices,int patternLength,int numPatternToken,int patternClustersNum,int patternWordNum,char *pattern, int *patternClustersSFIndices,int *patternIndices,int *patternTokensSFIndices,int *patternStartIndices,int *PatternWordsSFIndices,int *PatternWordstokensSFIndices,int *patternTokensLengths, int *dX,int *ResultFinal,float *TokensFinal,float *WordsTokensFinal1,float *WordsTokensFinal2,float *WordsFinal1,float *WordsFinal2,float *WordsFinal)
{

     int ix = blockIdx.x * blockDim.x + threadIdx.x;
     int distinctStart,distinctEnd,clStart,clEnd,strucStart,strucEnd,tokensStrucStart,tokensStrucEnd,tokensClStart,tokensClEnd,wordsClStart,wordsClEnd,wordsStrucStart,wordsStrucEnd,wordsTokensClStart,wordsTokensClEnd,wordsTokensStrucStart,wordsTokensStrucEnd;


     if(ix<patternClustersNum){
       distinctStart = distinctSFIndices[ix];
       distinctEnd = distinctSFIndices[ix+1];
       clStart = patternClustersSFIndices[ix];
       clEnd = patternClustersSFIndices[ix+1];
       strucStart = srcStructuresSFIndices[ix];
       strucEnd = srcStructuresSFIndices[ix+1];
       tokensStrucStart = srcStructureTokensSFIndices[ix];
       tokensStrucEnd = srcStructureTokensSFIndices[ix+1];
       tokensClStart = patternTokensSFIndices[ix];
       tokensClEnd = patternTokensSFIndices[ix+1];
       wordsClStart = PatternWordsSFIndices[ix];
       wordsClEnd = PatternWordsSFIndices[ix+1];
       wordsStrucStart = srcStructureWordsSFIndices[ix];
       wordsStrucEnd = srcStructureWordsSFIndices[ix+1];
       wordsTokensClStart = PatternWordstokensSFIndices[ix];
       wordsTokensClEnd = PatternWordstokensSFIndices[ix+1];
       wordsTokensStrucStart = srcStructureWordstokensSFIndices[ix];
       wordsTokensStrucEnd = srcStructureWordstokensSFIndices[ix+1];
levenshteinDistance(srcLength,patternRemoved, distinctStart, distinctEnd,pattern,clStart,clEnd, src,srcIndices,strucStart, strucEnd,tokensStrucStart,tokensStrucEnd,tokensClStart,tokensClEnd,wordsClStart,wordsClEnd,wordsStrucStart,wordsStrucEnd,wordsTokensClStart,wordsTokensClEnd,wordsTokensStrucStart,wordsTokensStrucEnd,dX,ResultFinal,TokensFinal,WordsTokensFinal1,WordsTokensFinal2,WordsFinal1,WordsFinal2,WordsFinal);

 

     }



}

it gives the following error:

Executing
nvcc -m64 -ptx D:\NetBeanProjects\OntologyThresholdSerial2023Test\src\ontologythresholdserial2023test\OntologyThresholdSerial2023.cu -o D:\NetBeanProjects\OntologyThresholdSerial2023Test\src\ontologythresholdserial2023test\OntologyThresholdSerial2023.ptx
nvcc process exitValue 1
errorMessage:
D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(6): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(10): error: expression must have class type

Exception in thread "main" java.io.IOException: Could not create .ptx file: D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(6): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(10): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(10): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(12): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(10): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(12): error: expression must have class type

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(12): error: expression must have class type

5 errors detected in the compilation of "C:/Users/COMPUT~1/AppData/Local/Temp/tmpxft_00001b9c_00000000-10_OntologyThresholdSerial2023.cpp1.ii".

outputMessage:
OntologyThresholdSerial2023.cu

D:/NetBeanProjects/OntologyThresholdSerial2023Test/src/ontologythresholdserial2023test/OntologyThresholdSerial2023.cu(12): error: expression must have class type

5 errors detected in the compilation of "C:/Users/COMPUT~1/AppData/Local/Temp/tmpxft_00001b9c_00000000-10_OntologyThresholdSerial2023.cpp1.ii".

    at ontologythresholdserial2023test.ParallelLevenstein.preparePtxFile(ParallelLevenstein.java:130)
    at ontologythresholdserial2023test.ParallelLevenstein.<init>(ParallelLevenstein.java:79)
    at ontologythresholdserial2023test.OntologyThresholdSerial2023Test.main(OntologyThresholdSerial2023Test.java:426)
D:\NetBeanProjects\OntologyThresholdSerial2023Test\nbproject\build-impl.xml:1355: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyThresholdSerial2023Test\nbproject\build-impl.xml:961: Java returned: 1
BUILD FAILED (total time: 3 seconds)

The error "expression must have class type" comes from src.charAt(j), because src is passed as a pointer in the device function. When I searched for the error, I found that it needs the object, not the pointer. How do I solve this, please?

Note that I have used src.charAt(j) inside a __global__ function in another kernel, and it succeeded:

extern "C"
__global__ void ComputationdStructureOnGPUParallel