jcuda.CudaException: CUDA_ERROR_INVALID_IMAGE

To compile my code, I run:
nvcc -m64 -dc a.cu b.cu
nvcc -dlink a.o b.o -o a.ptx

When it gets to:
JCudaDriver.cuModuleLoad(module, ptxFileName);
it gives me that error.
Am I doing something wrong?

It’s hard to guess only from the error message what might be wrong there. From the other thread, I see that this is on a Linux (likely 64bit). Can you provide more information about the kernels and the device that you are using?
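One thing that can be checked right away is whether the file passed to cuModuleLoad is really PTX text. As far as I know, nvcc -dlink without -cubin writes an object file meant for the host linker, whatever name is given to -o, and that would explain an invalid image. Below is a minimal sketch of such a check in plain Java. The class and method names are made up for illustration, and looking for the .version and .target directives is only a heuristic, not an official validation:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of JCuda
public class ModuleFileCheck
{
    // Heuristic classification based on the leading bytes of a file
    static String guessKind(byte[] data)
    {
        // ELF magic number: 0x7F 'E' 'L' 'F'. A CUBIN is an ELF file,
        // but so is a Linux object file, so "ELF" alone proves little.
        if (data.length >= 4 && data[0] == 0x7F
            && data[1] == 'E' && data[2] == 'L' && data[3] == 'F')
        {
            return "ELF";
        }
        // A PTX module is text that starts with .version and .target
        int n = Math.min(data.length, 4096);
        String head = new String(data, 0, n, StandardCharsets.ISO_8859_1);
        if (head.contains(".version") && head.contains(".target"))
        {
            return "PTX";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) throws IOException
    {
        // "a.ptx" is the file name from the first post
        Path path = Paths.get(args.length > 0 ? args[0] : "a.ptx");
        if (!Files.isRegularFile(path))
        {
            System.out.println("No such file: " + path);
            return;
        }
        System.out.println(
            path + " looks like: " + guessKind(Files.readAllBytes(path)));
    }
}
```

If a file named .ptx is reported as ELF or UNKNOWN, it is not PTX text, and the file extension alone will not make the driver accept it as such.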

Ubuntu, 64 bit. I have 16.04. If it’s of any help, this is my current kernel version:
Linux 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

About the device: Nvidia 980 Ti

Edit: Basically my question is: I have a.cu that depends on b.cu. How do I run them from Java? The command nvcc -ptx doesn’t work for multiple files.

Ah, OK. As you noticed, you cannot combine multiple .CU files into a single .PTX file. There are basically two options. I’m not sure which one is easier for you in the long run, or which one you consider more appropriate.

As an example, consider the following files (which are just the well-known vector addition, split into two files) :

multipleKernelFiles01.cu


#include "multipleKernelFiles02.h"

extern "C"
__global__ void add(int n, float *a, float *b, float *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i<n)
    {
        addElements(a+i, b+i, sum+i);
    }
}

multipleKernelFiles02.h



extern "C"
__device__ void addElements(float *a, float *b, float *sum);

multipleKernelFiles02.cu


extern "C"
__device__ void addElements(float *a, float *b, float *sum)
{
    *sum = *a + *b;
}

Option 1: Compile them into a CUBIN file

Compile the .CU files into object files:
nvcc -m64 -dc multipleKernelFiles01.cu multipleKernelFiles02.cu

(The resulting files will have the extension .obj on Windows, but .o on Linux. I’ll write .obj here; adjust this as necessary.)

Then, link the object files into a single CUBIN:
nvcc -m64 -dlink multipleKernelFiles01.obj multipleKernelFiles02.obj -cubin -o multipleKernelFiles.cubin -arch=sm_52

Note that CUBIN files are rather specific for the target device. Particularly, you have to specify the target architecture, which is -arch=sm_52 in my case.

Then, the resulting CUBIN file can be loaded into a module:

        CUmodule module = new CUmodule();
        cuModuleLoad(module, "multipleKernelFiles.cubin");

Option 2: Compile them into PTX, and link them at runtime

Compile each of the input files into a PTX:
nvcc -m64 multipleKernelFiles01.cu -ptx
nvcc -m64 multipleKernelFiles02.cu -ptx

Link them at runtime, using the cuLink* driver functions.

Below is an example that shows both approaches, particularly the runtime linking (which may be a bit fiddly…)

import static jcuda.driver.JCudaDriver.*;

import java.io.*;

import jcuda.*;
import jcuda.driver.*;

public class JCudaMultipleKernelFiles
{
    private static CUmodule loadModuleFromCubin()
    {
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "multipleKernelFiles.cubin");
        return module;
    }
    
    private static CUmodule loadModuleFromPtxFiles()
    {
        // Start the runtime linking
        CUlinkState state = new CUlinkState();
        JITOptions jitOptions = new JITOptions();
        cuLinkCreate(jitOptions, state);
        
        // Add the PTX files to the module to be created
        cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_PTX, 
            "multipleKernelFiles01.ptx", jitOptions);
        cuLinkAddFile(state, CUjitInputType.CU_JIT_INPUT_PTX, 
            "multipleKernelFiles02.ptx", jitOptions);
        
        // Finish the linking
        long size[] = { 0 };
        Pointer image = new Pointer();
        cuLinkComplete(state, image, size);
        
        // Load the module from the image data
        CUmodule module = new CUmodule();
        cuModuleLoadDataEx(module, image,
            0, new int[0], Pointer.to(new int[0]));
        cuLinkDestroy(state);
        
        return module;
    }
    
    
    public static void main(String args[]) throws IOException
    {
        // Default initialization
        JCudaDriver.setExceptionsEnabled(true);
        cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        cuDeviceGet(dev, 0);
        cuCtxCreate(pctx, 0, dev);
        
        //CUmodule module = loadModuleFromCubin();
        CUmodule module = loadModuleFromPtxFiles();
        
        // Obtain the function
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "add");

        // The remaining part of this sample is the same as for the vector 
        // add sample from http://jcuda.org/samples/JCudaVectorAdd.java
        int numElements = 1000;
        float hostInputA[] = new float[numElements];
        float hostInputB[] = new float[numElements];
        for(int i = 0; i < numElements; i++)
        {
            hostInputA[i] = (float)i;
            hostInputB[i] = (float)i;
        }
        CUdeviceptr deviceInputA = new CUdeviceptr();
        cuMemAlloc(deviceInputA, numElements * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputA, Pointer.to(hostInputA),
            numElements * Sizeof.FLOAT);
        CUdeviceptr deviceInputB = new CUdeviceptr();
        cuMemAlloc(deviceInputB, numElements * Sizeof.FLOAT);
        cuMemcpyHtoD(deviceInputB, Pointer.to(hostInputB),
            numElements * Sizeof.FLOAT);
        CUdeviceptr deviceOutput = new CUdeviceptr();
        cuMemAlloc(deviceOutput, numElements * Sizeof.FLOAT);
        Pointer kernelParameters = Pointer.to(
            Pointer.to(new int[]{numElements}),
            Pointer.to(deviceInputA),
            Pointer.to(deviceInputB),
            Pointer.to(deviceOutput)
        );
        int blockSizeX = 256;
        int gridSizeX = (int)Math.ceil((double)numElements / blockSizeX);
        cuLaunchKernel(function, 
            gridSizeX, 1, 1, 
            blockSizeX, 1, 1, 0, 
            null, kernelParameters, null);
        cuCtxSynchronize();

        float hostOutput[] = new float[numElements];
        cuMemcpyDtoH(Pointer.to(hostOutput), deviceOutput,
            numElements * Sizeof.FLOAT);
        boolean passed = true;
        for(int i = 0; i < numElements; i++)
        {
            float expected = i+i;
            if (Math.abs(hostOutput[i] - expected) > 1e-5)
            {
                System.out.println(
                    "At index "+i+ " found "+hostOutput[i]+
                    " but expected "+expected);
                passed = false;
                break;
            }
        }
        System.out.println("Test "+(passed?"PASSED":"FAILED"));
        cuMemFree(deviceInputA);
        cuMemFree(deviceInputB);
        cuMemFree(deviceOutput);
    }
    
    
}

Thank you! I will try them tomorrow.

(Did you mark this as “Solved” intentionally? I’m curious which of the two approaches you eventually took.)

Yes, it was by accident (I thought it was the other one). I took the cubin path, and with the few lines:
cudaDeviceProp props = new cudaDeviceProp();
JCuda.cudaGetDeviceProperties(props,0);
int maj = props.major;
int min = props.minor;
I think I solved the problem of being targeted to a specific architecture. However, I keep getting: java.io.IOException: Could not create cubin file: nvcc fatal : Value ‘sm_52’ is not defined for option ‘gpu-architecture’

Reading online, it seems that the problem is that I have an old version of CUDA. It seems that 6.5 is installed. I removed, purged and autoremoved everything, and I’m about to reinstall it. I will get back to you if something is still wrong, but I hope it will work!

Update: The cubin works just fine!!!
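For reference, the two nvcc calls from the CUBIN option can be assembled in Java once major and minor are known, so that the architecture is not hard-coded. The following is only a sketch: the class and method names are hypothetical, the flags are copied from the commands earlier in this thread, and the file names are those of the example. An older toolkit may still reject the resulting flag, as happened above with CUDA 6.5.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of JCuda
public class NvccCommands
{
    // Builds e.g. "-arch=sm_52" from compute capability 5.2
    static String archFlag(int major, int minor)
    {
        return "-arch=sm_" + major + minor;
    }

    // Object files are .obj on Windows and .o elsewhere
    static String objectFileName(String cuFileName, String osName)
    {
        String base = cuFileName.endsWith(".cu")
            ? cuFileName.substring(0, cuFileName.length() - 3)
            : cuFileName;
        boolean windows =
            osName.toLowerCase(Locale.ROOT).contains("windows");
        return base + (windows ? ".obj" : ".o");
    }

    // Step 1: nvcc -m64 -dc <all .cu files>
    static List<String> compileCommand(List<String> cuFiles)
    {
        List<String> command =
            new ArrayList<>(Arrays.asList("nvcc", "-m64", "-dc"));
        command.addAll(cuFiles);
        return command;
    }

    // Step 2: nvcc -m64 -dlink <objects> -cubin -o <cubin> -arch=sm_XY
    static List<String> linkCommand(List<String> cuFiles, String osName,
        String cubinFileName, int major, int minor)
    {
        List<String> command =
            new ArrayList<>(Arrays.asList("nvcc", "-m64", "-dlink"));
        for (String cuFile : cuFiles)
        {
            command.add(objectFileName(cuFile, osName));
        }
        command.addAll(Arrays.asList(
            "-cubin", "-o", cubinFileName, archFlag(major, minor)));
        return command;
    }

    public static void main(String[] args)
    {
        List<String> cuFiles = Arrays.asList(
            "multipleKernelFiles01.cu", "multipleKernelFiles02.cu");
        String os = System.getProperty("os.name");
        // 5 and 2 stand in for props.major and props.minor
        System.out.println(String.join(" ", compileCommand(cuFiles)));
        System.out.println(String.join(" ",
            linkCommand(cuFiles, os, "multipleKernelFiles.cubin", 5, 2)));
    }
}
```

Each returned list can be handed to a java.lang.ProcessBuilder to actually run nvcc, which also works when the program is started from an IDE, provided that nvcc is on the PATH.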

Good to hear that. I’ll likely refer to this thread when the question of how to use multiple .CU files comes up again, and maybe add it to the website/FAQ if necessary.

Sir, can you please tell me the solution for the same problem (jcuda.CudaException: CUDA_ERROR_INVALID_IMAGE)? I am using Windows.

I answered at https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/21752-basic-jcuda-setup-windows.html#post140148

I want to compile the files into a CUBIN file. For that:
1. Do I need to run this command (nvcc -m64 -dc multipleKernelFiles01.cu multipleKernelFiles02.cu) in Eclipse (Windows)? I think we have to run this command at the CMD prompt, right?
So what course of action should I take now to compile the program with the CUBIN approach in Eclipse?
2. How do I link the object files into a single CUBIN in Eclipse (Windows) and specify the target architecture?

Thanks…!!!

If you are the same person as the one who wrote the other posts:
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/21752-basic-jcuda-setup-windows.html#post140361
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/19607-matrix-row-sum-jcuda.html#post140379
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/20617-jcuda-cudaexception-cuda_error_invalid_image.html#post140362
https://forum.byte-welt.net/byte-welt-projekte-projects/jcuda/21939-cl-exe-found-path.html#post140360
then please try to focus on one topic, otherwise I don’t know what I should write where, and I don’t know what your actual question is, and what exactly works or does not work.


If you want to run a basic test, then you will likely have a single .CU file, and don’t need to link anything manually. If your goal is to combine multiple .CU files into a single module (either a CUBIN, or PTX files linked at runtime), then post #4 of this thread should show you how this can be done. If you have questions regarding that, please be more specific.