Creating an empty surface

I tried to create an empty surface and write some data to it, but it failed with CUDA_ERROR_ILLEGAL_ADDRESS on the cuCtxSynchronize() call.

Here is the code (it works perfectly in C++):

```
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8 * Sizeof.BYTE, 0, 0, 0,
    cudaChannelFormatKind.cudaChannelFormatKindUnsigned);

_cuArray = new cudaArray();
checkResult(cudaMallocArray(_cuArray, channelDesc, width, height));

checkResult(cudaMemcpyToArray(_cuArray, 0, 0, Pointer.to(pixels),
    Sizeof.BYTE * _width * _height, cudaMemcpyKind.cudaMemcpyHostToDevice));

cudaResourceDesc resDesc = new cudaResourceDesc();
resDesc.resType = cudaResourceType.cudaResourceTypeArray;
resDesc.array_array = _cuArray;

_surfOutput = new cudaSurfaceObject();
checkResult(cudaCreateSurfaceObject(_surfOutput, resDesc));
```

All of the calls above return cudaSuccess.

The kernel:

```
extern "C" __global__ void kernel(cudaSurfaceObject_t surface)
{
    const int2 pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                               blockIdx.y * blockDim.y + threadIdx.y);
    unsigned char data = 7;
    surf2Dwrite(data, surface, pos.x * sizeof(data), pos.y);
}
```

Simple as that.

But after calling cuLaunchKernel, the subsequent cuCtxSynchronize call returns CUDA_ERROR_ILLEGAL_ADDRESS, and I can't figure out why :(

If the surf2Dwrite call is commented out, it works.
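For completeness, here is roughly how I launch the kernel (a sketch; the PTX file name and the module/function setup are just placeholders for my actual loading code):

```
// Sketch: kernel compiled to "kernel.ptx" beforehand (placeholder name)
CUmodule module = new CUmodule();
cuModuleLoad(module, "kernel.ptx");
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "kernel");

// The surface object handle is passed like any other kernel argument
Pointer kernelParams = Pointer.to(Pointer.to(_surfOutput));
cuLaunchKernel(function,
    width, height, 1,   // grid: one 1x1 block per pixel
    1, 1, 1,            // block size
    0, null,            // shared memory, stream
    kernelParams, null);
cuCtxSynchronize();     // <- fails here with CUDA_ERROR_ILLEGAL_ADDRESS
```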

That’s a bit difficult. I don’t have any device with Compute Capability 3, which is required for surface objects.

It may be possible to find illegal writes with cuda-memcheck, but if it works in C++, it's not unlikely that something is wrong in the bindings. I'll try to have a look at the code paths that are related to surface objects, but I'm not sure when/whether I'll be able to guess what may be wrong there…

One more thing: the surface works correctly if it is created from an OpenGL texture ID.
I think the issue is somewhere in the cudaMallocArray and cudaMemcpyToArray functions.

That narrows it down, and I can test this, maybe later today. (Does it also show an error when you call cuCtxSynchronize directly after cudaMemcpyToArray?)

No

The cudaMemcpyToArray call can be removed entirely.
I also tried to pass JCuda.cudaArraySurfaceLoadStore to cudaMallocArray, but it didn't help either.

*** Edit ***

I tried to put all the C++ code into Java_jcuda_runtime_JCuda_cudaCreateSurfaceObjectNative, and still no luck.

So it has nothing to do with cudaMallocArray.
Now I don't know where to look.

Admittedly, I don’t (yet ;-)) know the whole API by heart, but according to the description, I think that the cudaArraySurfaceLoadStore flag will be required in cudaMallocArray in order to use it for surface access.
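If JCuda exposes the corresponding overload, that would look roughly like this (a sketch; I'm assuming a cudaMallocArray variant with a trailing flags argument):

```
// Sketch: request surface load/store support at allocation time
// (assumes a cudaMallocArray overload that accepts a flags parameter)
checkResult(cudaMallocArray(_cuArray, channelDesc, width, height,
    JCuda.cudaArraySurfaceLoadStore));
```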

Additionally, the surface reference documentation says:

> Before a kernel can use a surface reference to access a CUDA array, the surface reference must be bound to the CUDA array using cudaBindSurfaceToArray().

I’m not sure whether this should be called directly after creating the surface, or immediately before the kernel launch, but maybe you’ll find this out…

With cudaSurfaceObject, cudaBindSurfaceToArray is not needed anymore. With cudaArraySurfaceLoadStore, the result is the same.
BTW, here is the code that works in C++:

```
#include <iostream>
#include <cstring>
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <texture_indirect_functions.h>

__global__ void mykernel_image(cudaSurfaceObject_t dst)
{
    const int2 pos = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                               blockIdx.y * blockDim.y + threadIdx.y);
    unsigned char res255 = 7;
    surf2Dwrite(res255, dst, pos.x * sizeof(res255), pos.y);
}

int main(int argc, char* argv[])
{
    cudaError error;

    int width = 256;
    int height = 256;

    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
    cudaArray *cArray;
    error = cudaMallocArray(&cArray, &channelDesc, width, height, cudaArraySurfaceLoadStore); // or cudaArrayDefault, does not matter

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cArray;

    cudaSurfaceObject_t surface;
    error = cudaCreateSurfaceObject(&surface, &resDesc);

    mykernel_image<<<dim3(256, 256, 1), dim3(1, 1, 1)>>>(surface);

    error = cudaDeviceSynchronize();

    unsigned char *h_data = new unsigned char[width * height];
    error = cudaMemcpyFromArray(h_data, cArray, 0, 0, width * height * sizeof(unsigned char), cudaMemcpyDeviceToHost);

    for (int i = 0; i < width * height; ++i)
    {
        std::cout << (int)h_data[i] << std::endl;
    }

    delete[] h_data;
    cudaDestroySurfaceObject(surface);
    cudaFreeArray(cArray);

    return 0;
}
```

In the end, it prints 7 for every element of h_data.

Sorry, I mixed up surface references and surface objects there.

At the moment, I'm not sure about the most appropriate way of debugging this further. Again, I have no chance to test this on my own. I can continue trying to trace the relevant code paths and see whether there is an "obvious" bug, but I might miss a subtle one.

You could give cuda-memcheck a try: when you create a batch file like "startTest.bat" that starts a minimal test program via the console, you can call
cuda-memcheck startTest.bat
and see whether it prints anything about invalid memory accesses (although this is a bit unlikely, since you say that it works with texture IDs, but it may be worth a try).

In general, a minimal test program (which may even assume that the kernel is already present as a PTX file) could be helpful if someone with a Compute Capability 3.0 card wants to test this…

OK, good news for me :slight_smile:

The surface was created before this call:

```
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
```

and the kernel was launched after it, so they were using different contexts.

BTW, calling this whole snippet is useless in my case; everything works without it.
Meanwhile, NVIDIA says about **cuInit**:

> Initializes the driver API and must be called before any other function from the driver API

I suppose all cu… functions are driver API functions.
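For reference, the ordering that works for me now looks roughly like this (a sketch; error checks and the resource descriptor setup are omitted):

```
// Sketch: no manual context creation. The first Runtime API call creates
// a context implicitly, and the later Driver API calls (cuLaunchKernel,
// cuCtxSynchronize) operate on that same context.
checkResult(cudaMallocArray(_cuArray, channelDesc, width, height));
checkResult(cudaCreateSurfaceObject(_surfOutput, resDesc));
// ... cuModuleLoad, cuModuleGetFunction, cuLaunchKernel, cuCtxSynchronize ...
```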

OK, good to hear that, on the one hand.

On the other hand: I had already noticed that you mixed the Runtime and the Driver API. I thought about pointing this out, but in general it is possible to mix the two APIs, and it should not cause any problems.

In this case, it did, due to the new context being created. The context management (and the subtle differences between the context management of the Runtime and the Driver API) has changed slightly between the different CUDA versions. Originally, the Runtime and Driver API were separated more strictly, but there have also been changes to the context management itself.

For the general difference between the Runtime and the Driver API, you may refer to the Tutorial Introduction.

Referring to the context management, this has the following implications:

- The functions that are defined in the jcuda.runtime.JCuda class are the Runtime API functions. These functions start with cuda…. When you call the first function from the Runtime API, a new CUDA context will automatically be created.
- In contrast to that, the functions from jcuda.driver.JCudaDriver, which start with cu…, are the Driver API functions. If you want to use these functions, you have to do the context management on your own, by calling cuInit(0) etc. (the block that you quoted); see the sketch after this list.
- But when you have already called a Runtime function, the context was created at that point, so you don't have to create one manually.
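For example, if you want to stay in control of the context, you can set it up via the Driver API before the first Runtime call (a sketch; as far as I know, the Runtime will then attach to the current Driver context, but the exact behavior depends on the CUDA version):

```
// Sketch: manual context setup *before* the first Runtime API call,
// so that both APIs operate on the same context
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);

// Runtime API calls made afterwards should use this context
// instead of implicitly creating their own
```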

This is a bit simplified, and for details, you should refer to the CUDA Programming Guide, particularly to the section about context management.

So to summarize: It's probably not a bad idea to settle on one API. In this case, you used cudaCreateSurfaceObject from the Runtime API, but you could also have used cuSurfObjectCreate from the Driver API (see Surface Object Management in the Driver API), using the appropriate structure for the format description etc. The differences between the two APIs are negligible in this regard: basically every functionality (except for the differences pointed out in the tutorial link) is available in both APIs. (Which creates a load of redundancy, particularly for bindings like JCuda, but that's a different story…)
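Using the pure Driver API, the surface creation could look roughly like this (a sketch based on the Driver API documentation; I have not run it):

```
// Sketch: creating a surface object with the Driver API (jcuda.driver.*).
// The array must be created with the CUDA_ARRAY3D_SURFACE_LDST flag,
// which requires cuArray3DCreate (cuArrayCreate has no flags parameter)
CUDA_ARRAY3D_DESCRIPTOR arrayDesc = new CUDA_ARRAY3D_DESCRIPTOR();
arrayDesc.Width = width;
arrayDesc.Height = height;
arrayDesc.Depth = 0;                // 0 means a 2D array
arrayDesc.Format = CUarray_format.CU_AD_FORMAT_UNSIGNED_INT8;
arrayDesc.NumChannels = 1;
arrayDesc.Flags = CUDA_ARRAY3D_SURFACE_LDST;
CUarray array = new CUarray();
cuArray3DCreate(array, arrayDesc);

CUDA_RESOURCE_DESC resourceDesc = new CUDA_RESOURCE_DESC();
resourceDesc.resType = CUresourcetype.CU_RESOURCE_TYPE_ARRAY;
resourceDesc.array_hArray = array;

CUsurfObject surface = new CUsurfObject();
cuSurfObjectCreate(surface, resourceDesc);
```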