How to pass an array of multidimensional rows and two columns

Thanks a lot Marco. I know exactly how to divide the data.
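For reference, a minimal sketch of how data could be split into contiguous, nearly equal groups, one per stream (the class and method names here are hypothetical, not from the project):

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {
    // Split a list into n nearly equal, contiguous chunks (one per CUDA stream)
    static <T> List<List<T>> partition(List<T> data, int n) {
        List<List<T>> groups = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < n; i++) {
            int from = i * size / n;      // chunk boundaries via integer arithmetic
            int to = (i + 1) * size / n;  // covers all elements with no overlap
            groups.add(new ArrayList<>(data.subList(from, to)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) data.add(i);
        List<List<Integer>> groups = partition(data, 4);
        System.out.println(groups.size());  // 4
        System.out.println(groups.get(0));  // [0, 1]
        System.out.println(groups.get(3));  // [7, 8, 9]
    }
}
```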

When I run concurrent kernel execution after dividing the data, it gives the following error.

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_INVALID_HANDLE
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:359)
at jcuda.driver.JCudaDriver.cuLaunchKernel(JCudaDriver.java:16930)
at ontologythresholdserial2023test.ParallelLevenstein.computeStructureParallelResultStream(ParallelLevenstein.java:4878)
at ontologythresholdserial2023test.ParallelLevenstein.applyingHyperQ(ParallelLevenstein.java:4449)
at ontologythresholdserial2023test.OntologyThresholdSerial2023Test.main(OntologyThresholdSerial2023Test.java:441)
D:\NetBeanProjects\OntologyThresholdSerial2023Test\nbproject\build-impl.xml:1355: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyThresholdSerial2023Test\nbproject\build-impl.xml:961: Java returned: 1

This is the code for the main function:

public void applyingHyperQ(List<ArrayList<Cluster>> lstCsrcF, List<Cluster> lstCdest) {
    divideDataForHyperQ(4, lstCsrcF, lstCdest);
    cuInit(0);
    CUdevice device = new CUdevice();
    cuDeviceGet(device, 0);
    CUcontext context = new CUcontext();
    cuCtxCreate(context, 0, device);

    // Check that concurrent kernels are supported
    int[] attributeArray = { 0 };
    cuDeviceGetAttribute(attributeArray,
        CUdevice_attribute.CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, device);
    System.out.println("Concurrent kernels supported? " + attributeArray[0]);

    int ns = structuresGroup.length;
    int numElements = 0;
    CUstream[] streams = new CUstream[ns];
    StructureKernelData[] sClusterDataArray = new StructureKernelData[ns];
    System.out.println("Streams Structures");
    for (int i = 0; i < ns; i++) {
        streams[i] = new CUstream();
        cuStreamCreate(streams[i], 0);
    }

    int blockSizeX = 256;
    int gridSizeX;
    for (int i = 0; i < ns; i++) {
        sClusterDataArray[i] = createKernelDataApproachStream(
            structuresGroup[i], clustersGroup[i], streams[i]);
        numElements = GetMaxLength(structuresGroup[i], clustersGroup[i]);
        // Round up once: either integer (+ blockSizeX - 1) division or Math.ceil, not both
        gridSizeX = (numElements + blockSizeX - 1) / blockSizeX;
        ResultWordsFinal[i] = computeStructureParallelResultStream(
            function, sClusterDataArray[i], streams[i], gridSizeX, blockSizeX);
    }

    // Clean up in a separate loop, after all launches have been issued
    for (int i = 0; i < ns; i++) {
        cleanStructureKernelData(sClusterDataArray[i]);
        cuStreamDestroy(streams[i]);
    }
}
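One detail worth double-checking in the launch configuration: `Math.ceil` on a `double` division and the integer `(+ blockSizeX - 1)` trick are two alternative ways of rounding up, and combining both overshoots the grid size by one block whenever `numElements` is not a multiple of `blockSizeX`. A minimal sketch of the difference:

```java
public class GridSizeSketch {
    public static void main(String[] args) {
        int numElements = 300;
        int blockSizeX = 256;

        // Correct: integer round-up division -> 2 blocks for 300 elements
        int gridA = (numElements + blockSizeX - 1) / blockSizeX;

        // Correct alternative: Math.ceil on the plain ratio -> also 2
        int gridB = (int) Math.ceil((double) numElements / blockSizeX);

        // Combining both rounds up twice -> 3 blocks (one too many)
        int gridC = (int) Math.ceil((double) (numElements + blockSizeX - 1) / blockSizeX);

        System.out.println(gridA + " " + gridB + " " + gridC); // 2 2 3
    }
}
```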

This is the code for initializing the input data and transferring it from host to device:
public StructureKernelData createKernelDataApproachStream(
       List<ArrayList<Cluster>> lstCsrcF, List<Cluster> lstCdest,CUstream stream)
    {
        List<String> tokenslst0 = lstStructureAllTokens(lstCsrcF);
        List<String> tokenslst1 = lstclusterAllTokens(lstCdest);
        //int structuresNum = lstCsrcF.size();
        //int clustersNum = lstCdest.size();
        int structuresNum = getStructureSpecialSize(lstCsrcF);
        int clustersNum =  getClusterSpecialSize(lstCdest);
        int numtokenslst0 = tokenslst0.size();
        int numtokenslst1 = tokenslst1.size();
        int[] lstCsrcFTokensSizes = lstStructureSize(lstCsrcF);
        int[] lstCdestTokensSizes = lstClusterSize(lstCdest);
        int[] structureTokensSFIndices = clusterSFIndices(lstCsrcFTokensSizes);
        int[] patternTokensSFIndices = clusterSFIndices(lstCdestTokensSizes);
        int[] lstWordsNum0 = lstStructureLengths(lstCsrcF);
        int[] structureWordsSFIndices = clusterSFIndices(lstWordsNum0);
        int wordsNum0 = totalStructureLength(lstCsrcF);
        int[] lstWordsNum1 = lstClusterLengths(lstCdest);
        int[] clusterWordsSFIndices = clusterSFIndices(lstWordsNum1);
        int wordsNum1 = totalClusterLengthFinal(lstCdest);
        int[] tokensCount0 = structureTokensCount(lstCsrcF);
        int[] tokensCount1 = clustersTokensCount(lstCdest);
        
        int[] tokensSFIndices0 = clusterSFIndices(tokensCount0);
        int[] tokensSFIndices1 = clusterSFIndices(tokensCount1);
        int[] tokensLength0  = clusterTokensLength(tokenslst0);
        int[] tokensLength1  = clusterTokensLength(tokenslst1);
        String srcStringsJoined0 = Join(tokenslst0).toLowerCase();
        String srcStringsJoined1 = Join(tokenslst1).toLowerCase();
        int srcStringsJoinedIndex0[] = stringIndex(srcStringsJoined0,',');
        int srcStringsJoinedIndex1[] = stringIndex(srcStringsJoined1,',');
        String PatternJoinedDistinct = patternDistinctConversion(lstCdest).toLowerCase();
        int[] totallengthArray0 = extractLengthArray0(lstCsrcF);
        int[] structuresSFIndices0 = clusterSFIndices(totallengthArray0);
        int[] totallengthArray1 = extractLengthArray1(lstCdest);
        int[] clustersSFIndices1 =  clusterSFIndices(totallengthArray1);
        int[] totallengthDistinct = extractLengthDistinct(lstCdest);
        int[] distinctSFIndices =  clusterSFIndices(totallengthDistinct);
        int srcStringStartIndices0[] = characterOccurence(srcStringsJoined0,',');
        //int srcStringEndIndices0[] = new int[stringlst0.size()];
        //int srcIndividualLengths0[] =   new int[stringlst0.size()];
        int srcStringStartIndices1[] = characterOccurence(srcStringsJoined1,',');
        int totallength0 = extractLength(srcStringsJoined0);
        int totallength1 = extractLength(srcStringsJoined1);
        int totallengthDistinctPattern = extractTotalLengthDistinct(lstCdest);
        byte[] stringData0 = extractData(srcStringsJoined0);
        byte[] stringData1 = extractData(srcStringsJoined1);
        byte[] stringDistinctData = extractData(PatternJoinedDistinct);
        
        
        CUdeviceptr devicestringData0 = copyToDeviceByteStream(stringData0,stream);
        CUdeviceptr devicelstCsrcFTokensSizes = copyToDeviceINTStream(lstCsrcFTokensSizes,stream);
        CUdeviceptr devicestructureTokensSFIndices = copyToDeviceINTStream(structureTokensSFIndices,stream);
        CUdeviceptr devicelstWordsNum0 = copyToDeviceINTStream(lstWordsNum0,stream);
        CUdeviceptr devicestructureWordsSFIndices = copyToDeviceINTStream(structureWordsSFIndices,stream);
        CUdeviceptr devicetokensCount0 = copyToDeviceINTStream(tokensCount0,stream);
        CUdeviceptr devicetokensSFIndices0 = copyToDeviceINTStream(tokensSFIndices0,stream);
        CUdeviceptr devicetokensLength0 = copyToDeviceINTStream(tokensLength0,stream);
        CUdeviceptr devicesrcStringsJoinedIndex0 = copyToDeviceINTStream(srcStringsJoinedIndex0,stream);
        CUdeviceptr devicetotallengthArray0 = copyToDeviceINTStream(totallengthArray0,stream);
        CUdeviceptr devicestructuresSFIndices0 = copyToDeviceINTStream(structuresSFIndices0,stream);
        CUdeviceptr devicesrcStringStartIndices0 = copyToDeviceINTStream(srcStringStartIndices0,stream);
               
        CUdeviceptr devicestringData1 = copyToDeviceByteStream(stringData1,stream);
        CUdeviceptr devicelstCdestTokensSizes = copyToDeviceINTStream(lstCdestTokensSizes,stream);
        CUdeviceptr devicepatternTokensSFIndices = copyToDeviceINTStream(patternTokensSFIndices,stream);
        CUdeviceptr devicelstWordsNum1 = copyToDeviceINTStream(lstWordsNum1,stream);
        CUdeviceptr deviceclusterWordsSFIndices = copyToDeviceINTStream(clusterWordsSFIndices,stream);
        CUdeviceptr devicetokensCount1 = copyToDeviceINTStream(tokensCount1,stream);
        CUdeviceptr devicetokensSFIndices1 = copyToDeviceINTStream(tokensSFIndices1,stream);
        CUdeviceptr devicetokensLength1 = copyToDeviceINTStream(tokensLength1,stream);
        CUdeviceptr devicesrcStringsJoinedIndex1 = copyToDeviceINTStream(srcStringsJoinedIndex1,stream);
        CUdeviceptr devicetotallengthArray1 = copyToDeviceINTStream(totallengthArray1,stream);
        CUdeviceptr deviceclustersSFIndices1 = copyToDeviceINTStream(clustersSFIndices1,stream);
        CUdeviceptr devicesrcStringStartIndices1 = copyToDeviceINTStream(srcStringStartIndices1,stream);
        CUdeviceptr devicedistinctSFIndices = copyToDeviceINTStream(distinctSFIndices,stream);
        CUdeviceptr devicestringDistinctData = copyToDeviceByteStream(stringDistinctData,stream);
        CUdeviceptr devicetotallengthDistinct = copyToDeviceINTStream(totallengthDistinct,stream);
        CUdeviceptr deviceXPattern = new CUdeviceptr();
        cuMemAlloc(deviceXPattern, totallength0 * totallengthDistinctPattern * Sizeof.INT);
        
        CUdeviceptr deviceResultFinal = new CUdeviceptr();
        cuMemAlloc(deviceResultFinal, totallength0 * totallength1 * Sizeof.INT);
        
        CUdeviceptr deviceTokensFinal = new CUdeviceptr();
        cuMemAlloc(deviceTokensFinal, numtokenslst0 * numtokenslst1 * Sizeof.FLOAT);
       
        CUdeviceptr deviceWordsTokensFinal1 = new CUdeviceptr();
        cuMemAlloc(deviceWordsTokensFinal1, numtokenslst0 * wordsNum1 * Sizeof.FLOAT);
        
        CUdeviceptr deviceWordsTokensFinal2 = new CUdeviceptr();
        cuMemAlloc(deviceWordsTokensFinal2, wordsNum0 * numtokenslst1 * Sizeof.FLOAT);
        
        CUdeviceptr deviceWordsFinal1 = new CUdeviceptr();
        cuMemAlloc(deviceWordsFinal1, wordsNum0 * wordsNum1 * Sizeof.FLOAT);
        
        CUdeviceptr deviceWordsFinal2 = new CUdeviceptr();
        cuMemAlloc(deviceWordsFinal2, wordsNum0 * wordsNum1 * Sizeof.FLOAT);
        
               
        CUdeviceptr deviceWordsFinal = new CUdeviceptr();
        cuMemAlloc(deviceWordsFinal, wordsNum0 * wordsNum1 * Sizeof.FLOAT);
        StructureKernelData sClusterData = new StructureKernelData();
        sClusterData.tokenslst0 = tokenslst0;
        sClusterData.tokenslst1 = tokenslst1;
        sClusterData.structuresNum = structuresNum;
        sClusterData.clustersNum = clustersNum;
        sClusterData.numtokenslst0 = numtokenslst0;
        sClusterData.numtokenslst1 = numtokenslst1;
        sClusterData.lstCdestTokensSizes = lstCdestTokensSizes;
        sClusterData.lstCsrcFTokensSizes = lstCsrcFTokensSizes;
        sClusterData.structureTokensSFIndices = structureTokensSFIndices;
        sClusterData.patternTokensSFIndices = patternTokensSFIndices;
        sClusterData.lstWordsNum0 = lstWordsNum0;
        sClusterData.structureWordsSFIndices = structureWordsSFIndices;
        sClusterData.wordsNum0 = wordsNum0;
        sClusterData.lstWordsNum1 = lstWordsNum1;
        sClusterData.clusterWordsSFIndices = clusterWordsSFIndices;
        sClusterData.wordsNum1 = wordsNum1;
        sClusterData.tokensCount0 = tokensCount0;
        sClusterData.tokensCount1= tokensCount1;
        sClusterData.tokensSFIndices0 = tokensSFIndices0;
        sClusterData.tokensSFIndices1 = tokensSFIndices1;
        sClusterData.tokensLength0 = tokensLength0;
        sClusterData.tokensLength1 = tokensLength1;
        sClusterData.srcStringsJoined0 = srcStringsJoined0;
        sClusterData.srcStringsJoined1 = srcStringsJoined1;
        sClusterData.srcStringsJoinedIndex0 = srcStringsJoinedIndex0;
        sClusterData.srcStringsJoinedIndex1 = srcStringsJoinedIndex1;
        sClusterData.PatternJoinedDistinct = PatternJoinedDistinct;
        sClusterData.totallengthArray0 = totallengthArray0;
        sClusterData.totallengthArray1 = totallengthArray1;
        sClusterData.structuresSFIndices0 = structuresSFIndices0;
        sClusterData.clustersSFIndices1= clustersSFIndices1;
        sClusterData.distinctSFIndices = distinctSFIndices;
        sClusterData.totallengthDistinct = totallengthDistinct;
        sClusterData.srcStringStartIndices0 = srcStringStartIndices0;
        sClusterData.srcStringStartIndices1 = srcStringStartIndices1;
        sClusterData.totallength0 = totallength0;
        sClusterData.totallength1 = totallength1;
        sClusterData.totallengthDistinctPattern = totallengthDistinctPattern;
        sClusterData.devicestringData0 = devicestringData0;
        sClusterData.devicelstCsrcFTokensSizes = devicelstCsrcFTokensSizes;
        sClusterData.devicestructureTokensSFIndices = devicestructureTokensSFIndices;
        sClusterData.devicelstWordsNum0 = devicelstWordsNum0;
        sClusterData.devicestructureWordsSFIndices = devicestructureWordsSFIndices;
        sClusterData.devicetokensCount0 = devicetokensCount0;
        sClusterData.devicetokensSFIndices0 = devicetokensSFIndices0;
        sClusterData.devicetokensLength0 = devicetokensLength0;
        sClusterData.devicesrcStringsJoinedIndex0 = devicesrcStringsJoinedIndex0;
        sClusterData.devicetotallengthArray0 = devicetotallengthArray0;
        sClusterData.devicestructuresSFIndices0 = devicestructuresSFIndices0;
        sClusterData.devicesrcStringStartIndices0 = devicesrcStringStartIndices0;
               
        sClusterData.devicestringData1 = devicestringData1;
        sClusterData.devicelstCdestTokensSizes = devicelstCdestTokensSizes;
        sClusterData.devicepatternTokensSFIndices = devicepatternTokensSFIndices;
        sClusterData.devicelstWordsNum1 = devicelstWordsNum1; // reuse the pointer copied above instead of allocating a second copy
        sClusterData.deviceclusterWordsSFIndices = deviceclusterWordsSFIndices;
        sClusterData.devicetokensCount1 = devicetokensCount1;
        sClusterData.devicetokensSFIndices1 = devicetokensSFIndices1;
        sClusterData.devicetokensLength1 = devicetokensLength1;
        sClusterData.devicesrcStringsJoinedIndex1 = devicesrcStringsJoinedIndex1;
        sClusterData.devicetotallengthArray1 = devicetotallengthArray1;
        sClusterData.deviceclustersSFIndices1 = deviceclustersSFIndices1;
        sClusterData.devicesrcStringStartIndices1 = devicesrcStringStartIndices1;
        
        
        sClusterData.devicestringDistinctData = devicestringDistinctData;
        sClusterData.devicetotallengthDistinct = devicetotallengthDistinct;
        sClusterData.devicedistinctSFIndices = devicedistinctSFIndices;
        sClusterData.deviceXPattern = deviceXPattern;
        sClusterData.deviceResultFinal = deviceResultFinal;
        sClusterData.deviceTokensFinal = deviceTokensFinal;
        sClusterData.deviceWordsTokensFinal1 = deviceWordsTokensFinal1;
        sClusterData.deviceWordsTokensFinal2 = deviceWordsTokensFinal2;
        sClusterData.deviceWordsFinal1 = deviceWordsFinal1;
        sClusterData.deviceWordsFinal2 = deviceWordsFinal2;
        sClusterData.deviceWordsFinal = deviceWordsFinal;
        return sClusterData; 
        
    }

This is what copyToDeviceByteStream, used inside the function above, looks like, for example:

public CUdeviceptr copyToDeviceByteStream(byte hostData[], CUstream stream)
    {
        CUdeviceptr deviceData = new CUdeviceptr();
        // Use the driver API consistently: cuMemAlloc, not the runtime API's cudaMalloc
        cuMemAlloc(deviceData, hostData.length * Sizeof.BYTE);
        cuMemcpyHtoDAsync(deviceData, Pointer.to(hostData),
            hostData.length * Sizeof.BYTE, stream);
        cuStreamSynchronize(stream);
        return deviceData;
    }

This is the kernel launch function:
 public float[]  computeStructureParallelResultStream(
        CUfunction function, StructureKernelData sClusterData,CUstream stream,int gridSizeX,int blockSizeX)
    {
        int clustersNum = sClusterData.clustersNum;
        int structuresNum = sClusterData.structuresNum;
        int totalTokens0 = sClusterData.numtokenslst0;
        int totalTokens1 = sClusterData.numtokenslst1;
        int totallength0 = sClusterData.totallength0;
        int totallength1 = sClusterData.totallength1;
        int wordsNum0 = sClusterData.wordsNum0;
        int wordsNum1 = sClusterData.wordsNum1;
        int totallengthDistinctPattern = sClusterData.totallengthDistinctPattern;
        
                      
            Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{totallength0}),
            Pointer.to(new int[]{totalTokens0}),
            Pointer.to(new int[]{structuresNum}),
            Pointer.to(new int[]{wordsNum0}),
            Pointer.to(sClusterData.devicestringData0),
            Pointer.to(sClusterData.devicestructuresSFIndices0),
            Pointer.to(sClusterData.devicesrcStringsJoinedIndex0),
            Pointer.to(sClusterData.devicestructureTokensSFIndices),
            Pointer.to(sClusterData.devicesrcStringStartIndices0),
            Pointer.to(sClusterData.devicestructureWordsSFIndices),
            Pointer.to(sClusterData.devicetokensSFIndices0),
            Pointer.to(sClusterData.devicetokensLength0),
            Pointer.to(new int[]{totallengthDistinctPattern}),
            Pointer.to(sClusterData.devicestringDistinctData),
            Pointer.to(sClusterData.devicedistinctSFIndices),
            Pointer.to(new int[]{totallength1}),
            Pointer.to(new int[]{totalTokens1}),
            Pointer.to(new int[]{clustersNum}),
            Pointer.to(new int[]{wordsNum1}),
            Pointer.to(sClusterData.devicestringData1),
            Pointer.to(sClusterData.deviceclustersSFIndices1),
            Pointer.to(sClusterData.devicesrcStringsJoinedIndex1),
            Pointer.to(sClusterData.devicepatternTokensSFIndices),
            Pointer.to(sClusterData.devicesrcStringStartIndices1),
            Pointer.to(sClusterData.deviceclusterWordsSFIndices),
            Pointer.to(sClusterData.devicetokensSFIndices1),
            Pointer.to(sClusterData.devicetokensLength1),
            Pointer.to(sClusterData.deviceXPattern),
            Pointer.to(sClusterData.deviceResultFinal),
            Pointer.to(sClusterData.deviceTokensFinal),
            Pointer.to(sClusterData.deviceWordsTokensFinal1),
            Pointer.to(sClusterData.deviceWordsTokensFinal2),
            Pointer.to(sClusterData.deviceWordsFinal1),
            Pointer.to(sClusterData.deviceWordsFinal2),
            Pointer.to(sClusterData.deviceWordsFinal)
                       
           
           
        );
       
       
        cuLaunchKernel(function,
            gridSizeX,  1, 1,
            blockSizeX, 1, 1,
            0, stream,
            kernelParameters, null
        );
        // Note: cuCtxSynchronize() blocks on the whole context and thus serializes
        // the streams; the per-stream cuStreamSynchronize() below is sufficient here
        cuCtxSynchronize();

        float[] ResultWordsFinal = new float[wordsNum0 * wordsNum1];
        cuMemcpyDtoHAsync(Pointer.to(ResultWordsFinal), sClusterData.deviceWordsFinal,
            sClusterData.wordsNum0 * sClusterData.wordsNum1 * Sizeof.FLOAT, stream);
        cuStreamSynchronize(stream);

        return ResultWordsFinal;
    }

What is the error, please? Note that the kernel runs successfully in single-stream (serial) mode. I will also apply multiple streams to the well-known vector-add project and see the results.
Thanks Marco.

Marco,

I have solved the above error (Exception in thread "main" jcuda.CudaException: CUDA_ERROR_INVALID_HANDLE). It was caused by repeating the following GPU initialization:

cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
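A common way to avoid initializing the GPU twice is to guard the setup with a flag, so that repeated calls become no-ops. A hypothetical sketch (the class, field, and method names are made up; the real body would call cuInit/cuDeviceGet/cuCtxCreate exactly once):

```java
public class GpuInitSketch {
    private static boolean initialized = false;

    // One-time setup guard: returns true only on the call that actually initializes
    static synchronized boolean initOnce() {
        if (initialized) {
            return false;   // already initialized, skip
        }
        initialized = true; // real code would create the CUDA context here
        return true;
    }

    public static void main(String[] args) {
        System.out.println(initOnce()); // true  (performed initialization)
        System.out.println(initOnce()); // false (skipped, already done)
    }
}
```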

I removed it and the message disappeared, but when I run the project it gives the following result.
Matrix0
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 4 out of bounds for length 4
at ontologythresholdserial2023test.ParallelLevenstein.applyingHyperQ(ParallelLevenstein.java:4491)
at ontologythresholdserial2023test.OntologyThresholdSerial2023Test.main(OntologyThresholdSerial2023Test.java:441)
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.060606044, 0.0, 0.100000024, 0.066666685, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.125, 0.16666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.72727275, 0.18181819, 0.27272725, 0.15151514, 0.090909064, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.20833333, 0.57738096, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.125, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.111111104, 0.0, 0.111111104, 0.0, 0.18181819, 0.111111104, 0.07407407, 0.111111104, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.18181819, 0.18181819, 0.8181818, 0.090909064, 1.0, 0.090909064, 0.15151514, 0.090909064, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.19047618, 0.111111104, 0.16666667, 0.15151514, 0.09523809, 1.0, 0.09523809]
170
Matrix1, Matrix2, and Matrix3 each print exactly the same 170-element array as Matrix0 (the duplicate output is omitted here).
D:\NetBeanProjects\OntologyThresholdSerial2023Test\nbproject\build-impl.xml:1355: The following error occurred while executing this line:
D:\NetBeanProjects\OntologyThresholdSerial2023Test\nbproject\build-impl.xml:961: Java returned: 1
BUILD FAILED (total time: 7 seconds)

Why the ArrayIndexOutOfBoundsException? Also, from the results it appears as if all streams work only on the first partition of the data, even though I put the partitions in a list of arrays and pass each one to its dedicated stream.

Also, when I set ns = 1 (the number of streams) as an experiment, it gives:

Exception in thread "main" java.lang.NullPointerException: Cannot read the array length because "this.ResultWordsFinal[i]" is null
at ontologythresholdserial2023test.ParallelLevenstein.applyingHyperQ(ParallelLevenstein.java:4491)
at ontologythresholdserial2023test.OntologyThresholdSerial2023Test.main(OntologyThresholdSerial2023Test.java:441)

The ArrayIndexOutOfBoundsException indicates that you are trying to access an array at an index that is not smaller than the length of the array. For example, when you have

int array[] = new int[10];
int value = array[11];

then the second line will cause an ArrayIndexOutOfBoundsException.

A NullPointerException means that you are trying to dereference something that is null. For example, when you have

String example = null;
int length = example.length();

then the second line will cause a NullPointerException.
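Both failure modes can be ruled out with explicit guards before touching the array. A minimal illustration (the helper name `safeGet` is made up for this sketch):

```java
public class GuardSketch {
    // Returns the element at index i, or a fallback when the array is null
    // or the index is out of range
    static int safeGet(int[] array, int i, int fallback) {
        if (array == null || i < 0 || i >= array.length) {
            return fallback;
        }
        return array[i];
    }

    public static void main(String[] args) {
        int[] array = new int[10];
        array[5] = 42;
        System.out.println(safeGet(array, 5, -1));  // 42
        System.out.println(safeGet(array, 11, -1)); // -1 instead of ArrayIndexOutOfBoundsException
        System.out.println(safeGet(null, 0, -1));   // -1 instead of NullPointerException
    }
}
```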


Thanks a lot Marco, I will try to solve the problems.

Marco,

When I run the unified memory example at jcuda-samples/JCudaDriverUnifiedMemory.java at 66d72e3044b2c2e3df4b54f62f22bb5f10349b71 · jcuda/jcuda-samples · GitHub on an NVIDIA GeForce GTX 860M, it stops working.

It does not print "Device does not support managed memory", which means that my device supports unified memory.

But when it reaches

ByteBuffer bb = p.getByteBuffer(0, size);
System.out.println("Buffer on host side: " + bb);

the NetBeans IDE stops working! What is wrong? I need to apply unified memory in my application so I can do without host-to-device and device-to-host transfers.
Do you have another example that worked for you before?

Do the runtime samples jcuda-samples/JCudaRuntimeMappedMemory.java at master · jcuda/jcuda-samples · GitHub , jcuda-samples/JCudaRuntimeMemoryBandwidths.java at master · jcuda/jcuda-samples · GitHub or jcuda-samples/JCudaRuntimeUnifiedMemory.java at master · jcuda/jcuda-samples · GitHub work?

(The next step would be to try out cuda-samples/Samples/0_Introduction/UnifiedMemoryStreams at master · NVIDIA/cuda-samples · GitHub or other CUDA samples, if you can compile and start them)

Thanks Marco, but do the last samples (cuda-samples/Samples/0_Introduction/UnifiedMemoryStreams at master · NVIDIA/cuda-samples · GitHub) have a JCuda version?

Marco,

Another question: have you experimented with unified memory before and gotten better timing results? I sent the same question to the NVIDIA CUDA forum, and their response was:

It would probably be true for a Jetson platform. In the general case, I wouldn’t say that is true; I have never read that anywhere except the Jetson docs, and I wouldn’t support that idea in the general case. Since you mentioned the 860M (which is not Jetson), I think we could rule that out.
And they sent me this link.

Would I be wasting my time on a useless effort?

do the last samples […] have a JCuda version?

Not all of them. The point was only to make sure that the functionality that you want to use works on your GPU at all. (If it does not work with CUDA, then it will not work with JCuda. If it does work with CUDA, then it should also work with JCuda. But it can be hard to make guarantees here)

Another question: have you experimented with unified memory before and gotten better timing results?
[…]
Would I be wasting my time on a useless effort?

I am not a CUDA expert. I have hardly used CUDA at all. I don’t know anything about the performance of different approaches. Robert Crovella appears to be a trustworthy source. When he says that you will not see a performance benefit, then I’d assume that this is true.

Yes, I have tested the unified memory examples in the book CUDA C Programming before, and they do not give a speedup. In any case, I will try the first unified memory example in JCuda and see the results.
Thanks a lot Marco

Marco,

I have read the examples on shared memory and will apply it to my project. I need you to tell me whether I am correct or not. This is a portion of my project where I will test using shared memory or not. This is the global-memory version; all of the inputs are used in different parts throughout the program.

extern "C"
__global__ void ComputationdClustersOnGPUShuffle(int numTokenSrc,int numWordSrc,int srcLength, char *src,int *srctokensSFIndices,int *srctokensLength,int *srcIndices, int *srcStartIndices,int totalLengthDistinct, char *patternRemoved,int numTokenPattern,int numWordPattern,int patternLength,char *pattern,int *patterntokensSFIndices,int *patterntokensLength,int *patternIndices,int *patternStartIndices,int *dX,int *ResultFinal,float *TokensFinal,float *WordsTokensFinal1,float *WordsTokensFinal2,float *WordsFinal1,float *WordsFinal2,float *WordsFinal)
{
       int ix = blockIdx.x * blockDim.x + threadIdx.x;
       int  min_val = 0,var1 = 0, var2 = 0;
       int Avar, Bvar, Cvar, Dvar;
       float maxleven1 = 0.0f,resultName = 0.0f,sumleven = 0.0f,sumfinal = 0.0f;
       //int diff;
       //sumfinal = 0.0f,resultName = 0.0f,maxleven1 = 0.0f
       if(ix<totalLengthDistinct)
        {
            for (int i = 0; i < srcLength; i++) {
               if (src[i] == ',')
                  dX[ix * srcLength + i] = 0;
               else
                {
                  if (src[i] == patternRemoved[ix])
                dX[ix * srcLength + i] = srcIndices[i];
                  else if (src[i] != patternRemoved[ix])
                dX[ix * srcLength + i] = dX[ix * srcLength +  i-1];
                }
             }
             
        }
        __syncthreads();

}




extern "C"
__global__ void ComputationdClustersOnGPUShuffle(int numTokenSrc,int numWordSrc,int srcLength, char *src,int *srctokensSFIndices,int *srctokensLength,int *srcIndices, int *srcStartIndices,int totalLengthDistinct, char *patternRemoved,int numTokenPattern,int numWordPattern,int patternLength,char *pattern,int *patterntokensSFIndices,int *patterntokensLength,int *patternIndices,int *patternStartIndices,int *dX,int *ResultFinal,float *TokensFinal,float *WordsTokensFinal1,float *WordsTokensFinal2,float *WordsFinal1,float *WordsFinal2,float *WordsFinal)
{
       // The shared array size must be a compile-time constant (same as the block size)
       const int SHARED_STRING_LEN = 256;
       __shared__ char patternRemoved_shared[SHARED_STRING_LEN];

       int ix = blockIdx.x * blockDim.x + threadIdx.x;
       // Each block stages its own slice of patternRemoved; the shared array is
       // block-local, so it must be indexed with threadIdx.x, not the global ix
       if (ix < totalLengthDistinct)
           patternRemoved_shared[threadIdx.x] = patternRemoved[ix];
       __syncthreads(); // make the staged data visible to all threads in the block

       int  min_val = 0, var1 = 0, var2 = 0;
       int Avar, Bvar, Cvar, Dvar;
       float maxleven1 = 0.0f, resultName = 0.0f, sumleven = 0.0f, sumfinal = 0.0f;
       if (ix < totalLengthDistinct)
        {
            for (int i = 0; i < srcLength; i++) {
               if (src[i] == ',')
                  dX[ix * srcLength + i] = 0;
               else
                {
                  if (src[i] == patternRemoved_shared[threadIdx.x])
                dX[ix * srcLength + i] = srcIndices[i];
                  else
                dX[ix * srcLength + i] = dX[ix * srcLength + i - 1];
                }
             }
        }
        __syncthreads();
}

If this code is correct, it means that throughout the program, only the inputs that are indexed by ix will be replaced by shared-memory variables. A shared-memory variable is created for every block inside the grid, so I will declare a shared-memory variable for every global-memory variable in my code that is indexed by ix.

The one and only example of CUDA shared memory that I ever implemented was jcuda-samples/JCudaReduction.java at master · jcuda/jcuda-samples · GitHub , and the kernel is basically copied literally from the CUDA sample.

If you have a question about JCuda, then ask it, and I’ll try to answer. If you have a general question about CUDA/JCuda programming, then ask it in a new thread, in a form that can be answered, and that may be useful for others, and I might try to answer there. But right now, you just posted a piece of messy, undocumented code, without any real question.

I have already spent dozens of hours for this thread here, and there is no reason for me to continue doing this. I’m a freelancer. We could start negotiating with 100€/hour, but just reading and understanding your code would probably take me several hours, and even then, I could not tell you what is "right" or "wrong".

I will no longer respond to this sort of "question".

It’s OK.

Marco,

I am writing in the paper:

Acknowledgments

The authors would like to thank Marco Hutter, Global Moderator @ https://forum.byte-welt.net for supporting us during the development and implementation of the proposed approaches by providing useful parts of the CUDA code.

Is that accurate for your biography?

It does not make much sense to mention the forum. If you want to say anything that includes my name, then

… Marco Hutter, author of JCuda, for …

should be OK (but if you don’t mention me at all, that’s also fine).

But related to that: Are you mentioning JCuda in any other form (e.g. as one of the references)?

Yes, jcuda.org - Java bindings for CUDA is cited. Thanks a lot Marco.