General Question about Sizes

Hi,

I'm programming a neural network library for my bachelor's thesis, and I have run into a problem.

It seems as though there are size restrictions on JCublas.cublasAlloc and/or JCublas.cublasSgemm, since one of them is not doing the right thing.

Let me explain my problem: when I run my library with 16 (or fewer) neurons it works; when I run it with 17 (or more) it doesn't.
The first operations to fail (Sgemm) use 17x17 matrices passed as float[] of length 289 (17*17) and 17x1 matrices passed as float[] of length 17 (17*1). I wouldn't think this should be a problem, since that size is still ridiculously small.
Btw, the operations simply return arrays of 17 zeros (17x17 times 17x1 = 17x1).

The first thing I'd like to try is to find out which device is being used. I'm sure there are methods for this - could anyone point me in the right direction?

And if anybody has an idea of how to solve this, I would much appreciate it.

I have not included code, because it is quite large and it wouldn't be easy to understand much of it on its own. I could include it if needed.

cheers
Noodles

*** Edit ***

I have found the culprit, but I have no idea what's going wrong.

CuDeviceHolder (programmed by myself; just a container that handles pointers)
[SPOILER]```package Matrix;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;
import jcuda.jcublas.JCublas;

public class CuDeviceHolder {

	public CuDeviceHolder(Mat2 mat) {
		matrix = mat;
		ptr = new CUdeviceptr();
		JCublas.cublasAlloc(matrix.elements(), Sizeof.FLOAT, ptr);
		JCublas.cublasSetVector(matrix.elements(), Sizeof.FLOAT, Pointer.to(matrix.data()), 1, ptr, 1);
	}

	Mat2 matrix;
	CUdeviceptr ptr;

	public CUdeviceptr get() {
		return ptr;
	}

	public void free() {
		JCudaDriver.cuMemFree(ptr);
	}

	public void print() {
		GPUOp.print(ptr,matrix.elements());
	}

}```[/SPOILER]

[SPOILER]```
public static CUdeviceptr GateBulk(CuDeviceHolder env, Mat2 weights,int neurons) {
	CuDeviceHolder d_weights = new CuDeviceHolder(weights);
	//d_weights.print();
	CUdeviceptr result = new CUdeviceptr();
	JCublas.cublasAlloc(neurons, Sizeof.FLOAT, result);
	JCublas.cublasSgemm('n', 'n', neurons, 1, neurons, 1.0f, d_weights.get(), neurons, env.get(), neurons, 0.0f, result, neurons);
	//print(result,neurons);
	d_weights.free();
	result = fNormalize(neurons,result,true);
	return result;
}

public static CUdeviceptr GateBulk(CuDeviceHolder env, Mat2 weights,int neurons,int input) {
	CuDeviceHolder d_weights = new CuDeviceHolder(weights);		
	CUdeviceptr result = new CUdeviceptr();
	JCublas.cublasAlloc(neurons, Sizeof.FLOAT, result);
	JCublas.cublasSgemm('n', 'n', input, 1, neurons, 1.0f, d_weights.get(), input, env.get(), input, 0.0f, result, input);
	d_weights.free();
	result = fNormalize(neurons,result,true);
	return result;
}

public static void doCalc(Mat2 result,Mat2 outputWeights, Mat2 inputInGateWeights, Mat2 inputOutGateWeights, Mat2 inputChangeGateWeights, Mat2 inputValues, Mat2 outGateWeights,
Mat2 changeGateWeights, Mat2 inGateWeights, Mat2 internalNeuronValues,Mat2 externalNeuronValues) {

	//result.print();
	//inputValues.print();
	//internalNeuronValues.print();
	//externalNeuronValues.print();
	
	//outGateWeights.print();
	//inGateWeights.print();
	//changeGateWeights.print();
	
	int neurons = internalNeuronValues.elements();
	int input = inputValues.elements();
	int output = result.elements();
	
	//
	CuDeviceHolder d_ENV = new CuDeviceHolder(externalNeuronValues);
	CUdeviceptr d_yOut = GateBulk(d_ENV,outGateWeights,neurons);
	CUdeviceptr d_yIn = GateBulk(d_ENV,inGateWeights,neurons);
	CUdeviceptr d_yCh = GateBulk(d_ENV,changeGateWeights,neurons);
	
	//print(d_yOut,neurons);
	//print(d_yIn,neurons);
	//print(d_yCh,neurons);
	
	CuDeviceHolder d_I = new CuDeviceHolder(inputValues);
	
	//PROBLEM
	print(d_yOut,neurons);
	CUdeviceptr d_yOut2 = GateBulk(d_I,inputOutGateWeights,neurons,input);
	print(d_yOut,neurons);
```[/SPOILER]

The line I have marked with //PROBLEM is where the problem ensues. For some reason
```CUdeviceptr d_yOut2 = GateBulk(d_I,inputOutGateWeights,neurons,input);```
influences d_yOut, but it shouldn't.

Beforehand, d_yOut has the right values (basically random values); after the execution of
```CUdeviceptr d_yOut2 = GateBulk(d_I,inputOutGateWeights,neurons,input);```
d_yOut contains just zeros, and this carries through the rest of doCalc().

For completeness
CuDeviceHolder
[SPOILER]```package Matrix;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;
import jcuda.jcublas.JCublas;

public class CuDeviceHolder {

	public CuDeviceHolder(Mat2 mat) {
		matrix = mat;
		ptr = new CUdeviceptr();
		JCublas.cublasAlloc(matrix.elements(), Sizeof.FLOAT, ptr);
		JCublas.cublasSetVector(matrix.elements(), Sizeof.FLOAT, Pointer.to(matrix.data()), 1, ptr, 1);
	}
	
	Mat2 matrix;
	CUdeviceptr ptr;
	
	public CUdeviceptr get() {
		return ptr;
	}
	
	public void free() {
		JCudaDriver.cuMemFree(ptr);
	}
	
	public void print() {
		GPUOp.print(ptr,matrix.elements());
	}
}```[/SPOILER]

GPUOp(the class that contains doCalc())
[SPOILER]```package Matrix;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;
import jcuda.jcublas.JCublas;
import jcuda.vec.VecFloat;

public class GPUOp {
	
	public static float[] add(float[] hostInputA,float[] hostInputB) {
		int n = hostInputA.length;
		
        VecFloat.init();
        
        CUdeviceptr deviceX = new CUdeviceptr();
        JCudaDriver.cuMemAlloc(deviceX, n * Sizeof.FLOAT);
        JCudaDriver.cuMemcpyHtoD(deviceX, Pointer.to(hostInputA), n * Sizeof.FLOAT);

        CUdeviceptr deviceY = new CUdeviceptr();
        JCudaDriver.cuMemAlloc(deviceY, n * Sizeof.FLOAT); 
        JCudaDriver.cuMemcpyHtoD(deviceY, Pointer.to(hostInputB), n * Sizeof.FLOAT);

        CUdeviceptr deviceResult = new CUdeviceptr();
        JCudaDriver.cuMemAlloc(deviceResult, n * Sizeof.FLOAT);

        VecFloat.add(n, deviceResult, deviceX, deviceY);

        float hostResult[] = new float[n];
        JCudaDriver.cuMemcpyDtoH(Pointer.to(hostResult), deviceResult, n * Sizeof.FLOAT);

        JCudaDriver.cuMemFree(deviceX);
        JCudaDriver.cuMemFree(deviceY);
        JCudaDriver.cuMemFree(deviceResult);
        
        VecFloat.shutdown();
        
		return hostResult;
	}
	
	public static float[] mul(float[] hostInputA,float[] hostInputB,int ar,int ac,int br,int bc) {

        JCublas.cublasInit();

        float[] result = new float[ar*bc];
        
        CUdeviceptr d_A = new CUdeviceptr();
        CUdeviceptr d_B = new CUdeviceptr();
        CUdeviceptr d_C = new CUdeviceptr();
        JCublas.cublasAlloc(ar*ac, Sizeof.FLOAT, d_A);
        JCublas.cublasAlloc(br*bc, Sizeof.FLOAT, d_B);
        JCublas.cublasAlloc(ar*bc, Sizeof.FLOAT, d_C);

        JCublas.cublasSetVector(ar*ac, Sizeof.FLOAT, Pointer.to(hostInputA), 1, d_A, 1);
        JCublas.cublasSetVector(br*bc, Sizeof.FLOAT, Pointer.to(hostInputB), 1, d_B, 1);
        JCublas.cublasSetVector(ar*bc, Sizeof.FLOAT, Pointer.to(result), 1, d_C, 1);
        
        JCublas.cublasSgemm('n', 'n', ar, bc, ac, 1.0f, d_A, ar, d_B, br, 0.0f, d_C, ar);

        JCublas.cublasGetVector(ar*bc, Sizeof.FLOAT, d_C, 1, Pointer.to(result), 1);

        JCublas.cublasFree(d_A);
        JCublas.cublasFree(d_B);
        JCublas.cublasFree(d_C);

        JCublas.cublasShutdown();
        
		return result;
	}
	
	public static CUdeviceptr fNormalize(int n,CUdeviceptr d_In,boolean clean) {
		CUdeviceptr d_Out = new CUdeviceptr();
		JCublas.cublasAlloc(n, Sizeof.FLOAT, d_Out);
		
		VecFloat.exp(n, d_Out, d_In); // x = e^x
		VecFloat.scalarDiv(n, d_Out, 1f, d_Out); // x = e^(-x)
		VecFloat.scalarAdd(n, d_Out, 1f, d_Out); // x = 1+e^(-x)
		VecFloat.scalarDiv(n, d_Out, 1f, d_Out); // x = 1/(1+e^(-x))
		
		if (clean)
			JCudaDriver.cuMemFree(d_In);
		return d_Out;
	}
	
	public static CUdeviceptr gNormalize(int n,CUdeviceptr d_In,boolean clean) {
		CUdeviceptr d_Out = new CUdeviceptr();
		JCublas.cublasAlloc(n, Sizeof.FLOAT, d_Out);
		
		VecFloat.exp(n, d_Out, d_In); // x = e^x
		VecFloat.scalarDiv(n, d_Out, 1f, d_Out); // x = e^(-x)
		VecFloat.scalarAdd(n, d_Out, 1f, d_Out); // x = 1+e^(-x)
		VecFloat.scalarDiv(n, d_Out, 2f, d_Out); // x = 2/(1+e^(-x))
		VecFloat.subScalar(n, d_Out, d_Out, 1f); // x = (2/(1+e^(-x)))-1
		
		if (clean)
			JCudaDriver.cuMemFree(d_In);
		return d_Out;
	}
	
	public static CUdeviceptr GateBulk(CuDeviceHolder env, Mat2 weights,int neurons) {
		CuDeviceHolder d_weights = new CuDeviceHolder(weights);
		//d_weights.print();
		CUdeviceptr result = new CUdeviceptr();
		JCublas.cublasAlloc(neurons, Sizeof.FLOAT, result);
		JCublas.cublasSgemm('n', 'n', neurons, 1, neurons, 1.0f, d_weights.get(), neurons, env.get(), neurons, 0.0f, result, neurons);
		//print(result,neurons);
		d_weights.free();		
		result = fNormalize(neurons,result,true);
		return result;
	}
	
	public static CUdeviceptr GateBulk(CuDeviceHolder env, Mat2 weights,int neurons,int input) {
		CuDeviceHolder d_weights = new CuDeviceHolder(weights);		
		CUdeviceptr result = new CUdeviceptr();
		JCublas.cublasAlloc(neurons, Sizeof.FLOAT, result);
		JCublas.cublasSgemm('n', 'n', input, 1, neurons, 1.0f, d_weights.get(), input, env.get(), input, 0.0f, result, input);
		d_weights.free();
		result = fNormalize(neurons,result,true);
		return result;
	}
	
	public static void doCalc(Mat2 result,Mat2 outputWeights, Mat2 inputInGateWeights, Mat2 inputOutGateWeights, Mat2 inputChangeGateWeights, Mat2 inputValues, Mat2 outGateWeights,
			Mat2 changeGateWeights, Mat2 inGateWeights, Mat2 internalNeuronValues,Mat2 externalNeuronValues) {
		
		//result.print();
		//inputValues.print();
		//internalNeuronValues.print();
		//externalNeuronValues.print();
		
		//outGateWeights.print();
		//inGateWeights.print();
		//changeGateWeights.print();
		
		int neurons = internalNeuronValues.elements();
		int input = inputValues.elements();
		int output = result.elements();
		
		//
		CuDeviceHolder d_ENV = new CuDeviceHolder(externalNeuronValues);
		CUdeviceptr d_yOut = GateBulk(d_ENV,outGateWeights,neurons);
		CUdeviceptr d_yIn = GateBulk(d_ENV,inGateWeights,neurons);
		CUdeviceptr d_yCh = GateBulk(d_ENV,changeGateWeights,neurons);
		
		//print(d_yOut,neurons);
		//print(d_yIn,neurons);
		//print(d_yCh,neurons);
		
		CuDeviceHolder d_I = new CuDeviceHolder(inputValues);
		
		//PROBLEM
		print(d_yOut,neurons);
		CUdeviceptr d_yOut2 = GateBulk(d_I,inputOutGateWeights,neurons,input);
		print(d_yOut,neurons);
		
		VecFloat.add(neurons,d_yOut,d_yOut,d_yOut2);		
		JCudaDriver.cuMemFree(d_yOut2);
		CUdeviceptr d_yIn2 = GateBulk(d_I,inputInGateWeights,neurons,input);
		VecFloat.add(neurons,d_yIn,d_yIn,d_yIn2);
		JCudaDriver.cuMemFree(d_yIn2);
		CUdeviceptr d_yCh2 = GateBulk(d_I,inputInGateWeights,neurons,input);
		VecFloat.add(neurons,d_yCh,d_yCh,d_yCh2);
		JCudaDriver.cuMemFree(d_yCh2);
		d_I.free();
		//
		
		//
		CUdeviceptr d_Ch = new CUdeviceptr();
		JCublas.cublasAlloc(neurons, Sizeof.FLOAT, d_Ch);
		VecFloat.mul(neurons, d_Ch, d_yIn, d_yCh);
		JCudaDriver.cuMemFree(d_yIn);
		JCudaDriver.cuMemFree(d_yCh);
		//
		
		//
		CuDeviceHolder d_INV = new CuDeviceHolder(internalNeuronValues);
		VecFloat.add(neurons, d_INV.get(), d_INV.get(), d_Ch);
		JCudaDriver.cuMemFree(d_Ch);
		//
		
		//
		CUdeviceptr d_nI = fNormalize(neurons,d_INV.get(),false);
		float[] h_INV = new float[neurons];
		JCublas.cublasGetVector(neurons, Sizeof.FLOAT, d_INV.get(), 1, Pointer.to(h_INV), 1);
		internalNeuronValues.set(h_INV);
		d_INV.free();
		//
		
		//
		VecFloat.mul(neurons,d_ENV.get(), d_yOut, d_nI);
		JCudaDriver.cuMemFree(d_yOut);
		JCudaDriver.cuMemFree(d_nI);
		//
		
		//
		CuDeviceHolder d_OW = new CuDeviceHolder(outputWeights);
		CUdeviceptr d_Out = new CUdeviceptr();
		JCublas.cublasAlloc(output, Sizeof.FLOAT, d_Out);
		JCublas.cublasSgemm('n', 'n', output, 1, neurons, 1.0f, d_OW.get(), output, d_ENV.get(), neurons, 0.0f, d_Out, output);
		//
		
		//
		float[] h_ENV = new float[neurons];
		JCublas.cublasGetVector(neurons, Sizeof.FLOAT, d_ENV.get(), 1, Pointer.to(h_ENV), 1);
		externalNeuronValues.set(h_ENV);
		d_ENV.free();
		d_OW.free();
		//
		
		float[] h_Out = new float[output];
		JCublas.cublasGetVector(output, Sizeof.FLOAT, d_Out, 1, Pointer.to(h_Out), 1);
		result.set(h_Out);
		JCudaDriver.cuMemFree(d_Out);
		
		
	}
	
	public static void init() {
		JCublas.cublasInit();
		VecFloat.init();
	}
	
	public static void close() {
		JCublas.cublasShutdown();
		VecFloat.shutdown();
	}
	
	public static void print(CUdeviceptr ptr,int size) {
		float[] h_ENV = new float[size];
		JCublas.cublasGetVector(size, Sizeof.FLOAT, ptr, 1, Pointer.to(h_ENV), 1);
		System.out.println("######################");
		for (int i=0;i<h_ENV.length;i++) {
			System.out.println(h_ENV[i]);
		}
		System.out.println("######################");
	}
}```[/SPOILER]
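As a side note on the fNormalize/gNormalize chains in GPUOp above: per the inline comments, they compute the logistic sigmoid 1/(1+e^(-x)) and 2/(1+e^(-x))-1 (which equals tanh(x/2)). A plain-Java mirror of the same step chain can be handy for spot-checking device results against a CPU reference (the class and method names below are illustrative, not part of the library):

```java
// CPU mirror of the fNormalize/gNormalize step chains (exp -> 1/x -> +1 -> 1/x),
// useful for comparing GPU output against a known-good value. Illustrative only.
public class NormalizeCheck {

	// Same operation order as fNormalize in GPUOp:
	static float fNormalize(float x) {
		float t = (float) Math.exp(x); // e^x
		t = 1f / t;                    // e^(-x)
		t = 1f + t;                    // 1 + e^(-x)
		return 1f / t;                 // 1/(1+e^(-x))  == logistic sigmoid
	}

	// Same operation order as gNormalize (2/(1+e^(-x)) - 1 == tanh(x/2)):
	static float gNormalize(float x) {
		float t = (float) Math.exp(x);
		t = 1f / t;
		t = 1f + t;
		t = 2f / t;
		return t - 1f;
	}

	public static void main(String[] args) {
		System.out.println(fNormalize(0f)); // 0.5
		System.out.println(gNormalize(0f)); // 0.0
	}
}
```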

Mat2 (my matrix container class)
[SPOILER]```package Matrix;

import Util.Random;

public class Mat2 {
	
	private int rows;
	private int cols;
	private float[] data;
	private int elements;
	private boolean rowFirst = false;
	
	public Mat2(int rows,int cols) {
		this.rows = rows;
		this.cols = cols;
		elements = this.rows*this.cols;
		data = new float[elements];
	}
	
	public void set(float[] data) {
		this.data = data;
	}
	
	public void set(int row,int col,float val) {
		data[loc(row,col)] = val;
	}
	
	public float get(int row,int col) {
		return data[loc(row,col)];
	}
	
	public int rows() {
		return rows;
	}
	
	public int cols() {
		return cols;
	}
	
	public float[] data() {
		return data;
	}
	
	public int elements() {
		return elements;
	}
	
	public Mat2 add(Mat2 second) {
		Mat2 result = new Mat2(rows,cols);
		if (!rowFirst) result.colFirst();
		result.data = GPUOp.add(data, second.data);
		return result;
	}
	
	public Mat2 mul(Mat2 second) {
		Mat2 result = new Mat2(rows,second.cols);
		if (!rowFirst) result.colFirst();
		result.data = GPUOp.mul(data, second.data,rows,cols,second.rows,second.cols);
		return result;
	}
	
	public Mat2 sk(float skalar) {
		return null;
	}
	
	public Mat2 copy() {
		Mat2 result = new Mat2(rows,cols);
		if (!rowFirst) result.colFirst();
		for (int r=0;r<rows;r++) {
			for (int c=0;c<cols;c++) {
				result.set(r, c, get(r,c));
			}
		}
		return result;
	}
	
	public int loc(int row,int col) {
		if (rowFirst) {
			if (row>=rows) throw new WrongDimensionException("Not Enough Rows:"+row+"/"+rows);
			if (col>=cols) throw new WrongDimensionException("Not Enough Cols:"+col+"/"+cols);
			return (rows*row)+col;
		} else {
			if (row>=rows) throw new WrongDimensionException("Not Enough Cols:"+row+"/"+rows);
			if (col>=cols) throw new WrongDimensionException("Not Enough Rows:"+col+"/"+cols);
			return (rows*col)+row;
		}
	}
	
	public boolean colFirst() {
		if (!rowFirst) return false;
		rowFirst = false;
		return true;
	}
	
	public boolean rowFirst() {
		if (rowFirst) return false;
		rowFirst = true;
		return true;
	}
	
	@Override
	public String toString() {
		String result = "";
		for (int r=0;r<rows;r++) {
			if (r!=0) result+="\n";
			for (int c=0;c<cols;c++) {
				if(c!=0) result+=",";
				result += get(r,c);
			}
		}
		return result;
	}

	public void randomize(float min, float max) {
		for (int i=0;i<data.length;i++) {
			data[i]=Random.getRandom().getFloat(min, max);
		}
	}

	public void id() {
		for (int i=0;i<(rows<cols?rows:cols);i++) {
			set(i,i,1f);
		}
	}
	
	public void removeRow(int i) {
		rows -=1;
		elements = rows*cols;
		float[] newData = new float[elements];
		int c = 0;
		for (int j=0;j<data.length;j++) {
			if (j%(rows+1) == i){
				c++;
			} else {
				newData[j-c]=data[j];
			}
		}
		data = newData;
	}
	
	public void removeCol(int i) {
		cols -=1;
		elements = rows*cols;
		float[] newData = new float[elements];
		for (int j=0;j<data.length;j++) {
			if (j < rows*i){
				newData[j]=data[j];
			} else if (j >= rows+rows*i){
				newData[j-rows]=data[j];
			}			
		}
		data = newData;
	}
	
	public void addRow(int i,float min,float max) {
		rows+=1;
		elements = rows*cols;
		float[] newData = new float[elements];
		int c=0;
		for (int j=0;j<newData.length;j++) {
			if (j%(rows)==i) {
				newData[j]=Random.getRandom().getFloat(min, max);
				c++;
			} else {
				newData[j]=data[j-c];
			}
		}
		data=newData;
	}
	
	public void addCol(int i,float min,float max) {
		cols+=1;
		elements = rows*cols;
		float[] newData = new float[elements];
		for (int j=0;j<newData.length;j++) {
			if (j < rows*i){
				newData[j]=data[j];
			} else if (j >= rows+rows*i){
				newData[j]=data[j-rows];
			} else {
				newData[j]=Random.getRandom().getFloat(min, max);
			}
		}
		data = newData;
	}
	
	public void mutate(float min,float max) {
		int r=Random.getRandom().getInt(0, elements-1);
		data[r]=data[r]+Random.getRandom().getFloat(min, max);
		/*if (data[r]>5) data[r]=5;
		if (data[r]<-5) data[r]=-5;*/
	}
	
	public void print() {
		System.out.println("#####################################");
		for (int i=0;i<rows;i++) {
			for (int j=0;j<cols;j++) {
				if (j>0) System.out.print(",");
				System.out.print(get(i,j));
			}
			System.out.println();
		}
		System.out.println("#####################################");
	}
	
}```[/SPOILER]

Hello

That's indeed a lot of code, and I haven't read it completely - I only tried to follow the path that you suggested might be causing the error.

“Likely”, because this is sometimes hard to determine. Many CUDA calls are asynchronous, and errors (like writing to invalid memory areas) may not show up immediately, but cause seemingly unrelated errors later.

So if you did not already do this, you should add

JCuda.setExceptionsEnabled(true);
JCudaDriver.setExceptionsEnabled(true);
JCublas.setExceptionsEnabled(true);

at the beginning of your main method. I’d recommend to do this in general, particularly during development, because it can already catch many errors that otherwise would remain unnoticed. (Manual error checking in CUDA is tedious - nobody does it…)

If this does not bring any insights, I’ll have to read the relevant code path more thoroughly. Until now, I did not spot any “obvious” errors. (Of course, an example where the error can be reproduced would be helpful - but I guess that creating such an example would already reveal the actual culprit ;-))

bye
Marco

Hi Marco,

I did as you told me, and no errors were reported.

The next thing I would like to try is to find out which device the code is using, since I'm not sure whether the CPU is being used instead of the GPU - in which case the CPU's more limited memory might be causing the problem.

I initially had a hard time installing CUDA and the like.

cheers
Noodles

Regarding the devices: CUDA currently cannot be run on a CPU. (There are approaches - GPU Ocelot | a Dynamic Compilation Framework for GPU Computing - but usually not.) Unless you have multiple CUDA-capable devices, the question does not arise. When you have multiple devices, you can select the desired device using the driver API, create a context for that device, and attach the current thread to this context (it's a bit fiddly), but for plain JCublas calls on a single device, none of this should be necessary.

Regarding the error: I'll try to re-read the code tomorrow, but until now, I didn't see anything "obviously" wrong. (Not having the possibility to test or reproduce this makes it harder…)

A (probably simple) step could be to try what happens when you comment out the lines
JCublas.cublasSgemm('n', 'n', input, 1, neurons, 1.0f, d_weights.get(), input, env.get(), input, 0.0f, result, input);
and
result = fNormalize(neurons,result,true);
of the GateBulk method, and see whether it still affects a (seemingly) unrelated pointer. In particular, I have not yet checked/verified the parameters to cublasSgemm - e.g. whether the given sizes match the actual matrix sizes. If they do not match, this might cause other memory areas to be overwritten. But this is just a guess for now, to narrow down the search space.

(If everything else fails, trying to run it with cuda-memcheck could be an option, but maybe this is not even necessary)

Hi Marco,

You saved my life! You were right - I had the dimensions mixed up. Instead of
JCublas.cublasSgemm('n', 'n', input, 1, neurons, 1.0f, d_weights.get(), input, env.get(), input, 0.0f, result, input);
it should have been
JCublas.cublasSgemm('n', 'n', neurons, 1, input, 1.0f, d_weights.get(), neurons, env.get(), input, 0.0f, result, neurons);

I got mixed up by the column-major ordering of the matrices; in Sgemm you are supposed to pass the leading dimension, not the number of columns.
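For reference, for the column-major product y = W·x with W of size neurons × input, cublasSgemm wants m = neurons, n = 1, k = input, and leading dimensions lda = neurons, ldb = input, ldc = neurons. The mapping can be checked on the CPU with a naive column-major loop (a self-contained sketch; the sgemm method below is only an illustrative stand-in for the CUBLAS call, not JCublas code):

```java
// Naive CPU reference for column-major SGEMM (C = alpha*A*B + beta*C),
// to sanity-check the parameter mapping before calling cublasSgemm.
public class SgemmCheck {

	// A is m x k (lda >= m), B is k x n (ldb >= k), C is m x n (ldc >= m).
	// Column-major: element (i,j) lives at index i + j*ld.
	static void sgemm(int m, int n, int k, float alpha,
			float[] a, int lda, float[] b, int ldb,
			float beta, float[] c, int ldc) {
		for (int j = 0; j < n; j++) {
			for (int i = 0; i < m; i++) {
				float sum = 0f;
				for (int p = 0; p < k; p++) {
					sum += a[i + p * lda] * b[p + j * ldb];
				}
				c[i + j * ldc] = alpha * sum + beta * c[i + j * ldc];
			}
		}
	}

	public static void main(String[] args) {
		// y = W * x with W of size neurons x input (here 3 x 2), x of size 2 x 1.
		int neurons = 3, input = 2;
		float[] w = {1, 2, 3,  4, 5, 6}; // column-major: col 0 = (1,2,3), col 1 = (4,5,6)
		float[] x = {1, 1};
		float[] y = new float[neurons];
		// Correct mapping: m = neurons, n = 1, k = input, lda = ldc = neurons, ldb = input
		sgemm(neurons, 1, input, 1.0f, w, neurons, x, input, 0.0f, y, neurons);
		System.out.println(y[0] + " " + y[1] + " " + y[2]); // 5.0 7.0 9.0
	}
}
```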

cheers
Noodles

Good to hear that.

(Yes, JCublas is only a very thin layer around CUBLAS, and does not introduce additional error checks or so. Given the large number of parameters of CUBLAS functions, their “similar” (and confusing) meanings and the inherent brain-twist of row-major vs. column-major, it’s easy to stumble over this…)
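To illustrate that brain-twist: the very same float[] reads as two different matrices depending on the convention. A small self-contained sketch (plain Java, not JCublas code):

```java
// The same flat array, interpreted row-major vs. column-major.
public class MajorOrder {
	public static void main(String[] args) {
		float[] data = {1, 2, 3, 4, 5, 6};
		int rows = 2, cols = 3;
		// Row-major (C/Java convention): (i,j) -> i*cols + j   => [[1,2,3],[4,5,6]]
		// Column-major (CUBLAS/Fortran): (i,j) -> i + j*rows   => [[1,3,5],[2,4,6]]
		System.out.println("row-major (0,1)    = " + data[0 * cols + 1]); // 2.0
		System.out.println("column-major (0,1) = " + data[0 + 1 * rows]); // 3.0
	}
}
```

This is why the leading dimension (the stride between columns in column-major storage) must match how the array was actually laid out, independently of which dimension one thinks of as "columns".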