Performance Optimization

Hi,

this is really going over my head.
I just made my classes a bit more readable by removing commented-out lines and such, and made some of my functions easier to use, and the problem was fixed. I went over my code and can't see any difference, but the error is gone.

Also, I don't quite understand why, but the “faster” approach of using multiple threads isn't actually faster; it's about three times as slow, so I'm sticking to computing things sequentially. Anyway, thank you for all your help, and here's the version that for some reason works.

[SPOILER]```
package Matrix;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.JCudaDriver;
import jcuda.jcublas.JCublas;
import jcuda.runtime.JCuda;
import jcuda.vec.VecFloat;

public class GPUOp {
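// Logistic sigmoid applied elementwise over d_In.rows() values: d_Out = 1/(1+e^(-d_In)); d_Out may be the same buffer as d_In.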
public static void fNormalize(Mat2 d_Out,Mat2 d_In) {
	VecFloat.exp(d_In.rows(), d_Out.getDevice().get(), d_In.getDevice().get()); // x = e^x
	VecFloat.scalarDiv(d_In.rows(), d_Out.getDevice().get(), 1f, d_Out.getDevice().get()); // x = e^(-x)
	VecFloat.scalarAdd(d_In.rows(), d_Out.getDevice().get(), 1f, d_Out.getDevice().get()); // x = 1+e^(-x)
	VecFloat.scalarDiv(d_In.rows(), d_Out.getDevice().get(), 1f, d_Out.getDevice().get()); // x = 1/(1+e^(-x))
}

// Scaled sigmoid applied elementwise over n values: d_Out = 2/(1+e^(-d_In)) - 1 (= tanh(d_In/2)), mapping into (-1,1).
public static void gNormalize(Mat2 d_Out,int n,Mat2 d_In) {
	VecFloat.exp(n, d_Out.getDevice().get(), d_In.getDevice().get()); // x = e^x
	VecFloat.scalarDiv(n, d_Out.getDevice().get(), 1f, d_Out.getDevice().get()); // x = e^(-x)
	VecFloat.scalarAdd(n, d_Out.getDevice().get(), 1f, d_Out.getDevice().get()); // x = 1+e^(-x)
	VecFloat.scalarDiv(n, d_Out.getDevice().get(), 2f, d_Out.getDevice().get()); // x = 2/(1+e^(-x))
	VecFloat.subScalar(n, d_Out.getDevice().get(), d_Out.getDevice().get(), 1f); // x = (2/(1+e^(-x)))-1
}

public static void GateBulkNeuron(Mat2 d_Out,Mat2 env, Mat2 weights) {
	int neurons = env.rows();
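	// d_Out (neurons x 1) = weights (neurons x neurons) * env (neurons x 1)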
	JCublas.cublasSgemm('n', 'n', neurons, 1, neurons, 1.0f, weights.getDevice().get(), neurons, env.getDevice().get(), neurons, 0.0f, d_Out.getDevice().get(), neurons);		
	fNormalize(d_Out,d_Out);
}

public static void GateBulkInput(Mat2 d_Out,Mat2 input, Mat2 weights) {
	int numNeurons = weights.rows();
	int numInputs = weights.cols();
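	// d_Out (numNeurons x 1) = weights (numNeurons x numInputs) * input (numInputs x 1)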
	JCublas.cublasSgemm('n', 'n', numNeurons, 1, numInputs, 1.0f, weights.getDevice().get(), numNeurons, input.getDevice().get(), numInputs, 0.0f, d_Out.getDevice().get(), numNeurons);
	fNormalize(d_Out,d_Out);
}

public static void doCalcIns(Mat2 in1,Mat2 in2,Mat2 in3,Mat2 inputValues,Mat2 inputOutGateWeights,Mat2 inputInGateWeights,Mat2 inputChangeGateWeights) {
	GateBulkInput(in1,inputValues,inputOutGateWeights);
	GateBulkInput(in2,inputValues,inputInGateWeights);
	GateBulkInput(in3,inputValues,inputChangeGateWeights);
}

public static void doCalc(Mat2 in1, Mat2 in2, Mat2 in3, Mat2 outGateWeights,
		Mat2 changeGateWeights, Mat2 inGateWeights, Mat2 internalNeuronValues,Mat2 externalNeuronValues,Mat2 calc1,Mat2 calc2,Mat2 calc3) {	
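	// One gated update step: each gate g = fNormalize(gateWeights * externalNeuronValues);
	// internal += (inGate + in2) * (changeGate + in3), then external = (outGate + in1) * fNormalize(internal).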
	int neurons = internalNeuronValues.elements();
	
	GateBulkNeuron(calc1,externalNeuronValues,outGateWeights);
	//calc1 used d_yOut
	GateBulkNeuron(calc2,externalNeuronValues,inGateWeights);
	//calc2 used d_yIn
	GateBulkNeuron(calc3,externalNeuronValues,changeGateWeights);
	//calc3 used d_yCh
	
	VecFloat.add(neurons,calc1.getDevice().get(),calc1.getDevice().get(),in1.getDevice().get());
	VecFloat.add(neurons,calc2.getDevice().get(),calc2.getDevice().get(),in2.getDevice().get());
	VecFloat.add(neurons,calc3.getDevice().get(),calc3.getDevice().get(),in3.getDevice().get());
	
	VecFloat.mul(neurons, calc2.getDevice().get(), calc2.getDevice().get(), calc3.getDevice().get());
	//calc3 free
	
	VecFloat.add(neurons, internalNeuronValues.getDevice().get(), internalNeuronValues.getDevice().get(), calc2.getDevice().get());
	//calc2 free
	
	fNormalize(calc2,internalNeuronValues);
	//calc2 used
	
	VecFloat.mul(neurons,externalNeuronValues.getDevice().get(), calc1.getDevice().get(), calc2.getDevice().get());
	//calc1 free
	//calc2 free
}

public static void getOutput(Mat2 result,Mat2 outputWeights,Mat2 externalNeuronValues) {
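	// result (output x 1) = outputWeights (output x neurons) * externalNeuronValues (neurons x 1), copied back to the host.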
	int neurons = externalNeuronValues.elements();
	int output = result.elements();
	CUdeviceptr d_Out = new CUdeviceptr();
	JCublas.cublasAlloc(output, Sizeof.FLOAT, d_Out);
	JCublas.cublasSgemm('n', 'n', output, 1, neurons, 1.0f, outputWeights.getDevice().get(), output, externalNeuronValues.getDevice().get(), neurons, 0.0f, d_Out, output);
	float[] h_Out = new float[output];			
	JCublas.cublasGetVector(output, Sizeof.FLOAT, d_Out, 1, Pointer.to(h_Out), 1);
	result.set(h_Out);
	JCublas.cublasFree(d_Out);
}

public static void init() {
	JCuda.setExceptionsEnabled(true);
	JCudaDriver.setExceptionsEnabled(true);
	JCublas.setExceptionsEnabled(true);
	JCublas.cublasInit();
	VecFloat.init();
}

public static void close() {
	JCublas.cublasShutdown();
	VecFloat.shutdown();
}

public static void print(CUdeviceptr ptr,int size) {
	float[] h_ENV = new float[size];
	JCublas.cublasGetVector(size, Sizeof.FLOAT, ptr, 1, Pointer.to(h_ENV), 1);
	System.out.println("######################");
	for (int i=0;i<h_ENV.length;i++) {
		System.out.println(h_ENV[i]);
	}
	System.out.println("######################");
}

}
```
[/SPOILER]

PS: Don't worry about my deadlines, I don't have any yet. I postponed setting deadlines until I have a working version, because the actual thesis isn't about whether this program works or not, it's just about using the program to analyse things.

cheers
Noodles

Well. That may have been embarrassingly simple. Although I did not (yet) verify in detail what you are computing there (i.e. how large the matrices are etc.), adding

    public static void fNormalize(Mat2 d_Out,int n,Mat2 d_In) {
        System.out.println(n+" vs "+d_In.rows());
...

in the previous version of GPUOp in hindsight seems to reveal the culprit: it sometimes prints lines like

    197 vs 17

So whatever you have been using the “n” for (and regardless of where it came from), it should possibly have been “d_In.rows()” all the time…

(This seemed to be the only relevant difference that a git diff revealed…)
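
For what it's worth, a plain size check inside fNormalize would have caught this immediately. A minimal sketch, reusing the Mat2 accessors from your own class (so treat it as a suggestion rather than a drop-in replacement):

    public static void fNormalize(Mat2 d_Out, int n, Mat2 d_In) {
        // Guard against the mismatch seen above ("197 vs 17"):
        if (n != d_In.rows()) {
            throw new IllegalArgumentException(
                    "n=" + n + " does not match d_In.rows()=" + d_In.rows());
        }
        // ... the VecFloat calls as before, all driven by d_In.rows()
    }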


Nevertheless, I still have to take a look at the context handling in the Vec library. (But there are several other tasks in the queue right now, and although there remains the uncanny feeling that “the context handling might be wrong”, it did not seem to really cause an error even for this extreme case of 50 threads - so the priority of this has just decreased significantly…)

Hi again,

you are right, I didn't notice the difference:

        int neurons = inputOutGateWeights.cols();//wrong
        int neurons = inputOutGateWeights.rows();//right
        int input = inputValues.elements();
       
        GateBulk(in1,inputValues,inputOutGateWeights,neurons,input);
        GateBulk(in2,inputValues,inputInGateWeights,neurons,input);
        GateBulk(in3,inputValues,inputChangeGateWeights,neurons,input);
    }

This might have produced the error; it is truly hard to find all these errors.
Thank you, that helped a lot, since now I know it was really an error I made.

About VecFloat not being thread-safe: there are two ways in which I could imagine VecFloat not being thread-safe:

1. I call VecFloat.exp(n,ptr1,1f,ptr2) twice at the same time with ptr1 and ptr2 being the same in each thread, meaning ptr1 and ptr2 are accessed twice, which might produce problems.

2. I call VecFloat.exp(n,ptr1,1f,ptr2) and VecFloat.exp(n,ptr3,1f,ptr4) at the same time; this could be non-thread-safe because both operations are trying to use the same context, and as such malloc and free produce problems because they are not thread-safe.

Could you explain which of the two scenarios it is, or whether it is both?

cheers
Noodles

About the thread-safety: Again, I have to review this in more detail.

But regarding your examples:

  1. One should not expect that executing
    VecFloat.exp(n,result,...,...); and
    VecFloat.exp(n,result,...,...); (where „result“ is the SAME pointer in both cases)
    on different threads yields reasonable results. In both calls, the kernel will be executed on the same device. So one will be called first, and then the other one, and the „result“ data will simply be overwritten - you just don’t know which one was executed first and which one was executed second.

However, of course, it should „work“, meaning that it should NOT cause nasty crashes.

  2. A call to
    VecFloat.exp(n,resultA,1.0f,inputA); and
    VecFloat.exp(n,resultB,1.0f,inputB); (where the pointers are completely unrelated)
    on different threads should be safe as well. The vector library functions are NOT doing any allocation/free calls internally (and this shouldn’t matter, by the way). The calls are basically ONLY executing the pre-defined vector kernels. And again, these kernel calls will be passed to the device, and the device will handle them.

But which device?

This question leads to some of the actual questions that I'll have to investigate (roughly - I still have to sort my thoughts here):

First of all: There may be multiple devices. Right now, the Vec-library creates a context for the first device, and does not offer an option to use another device, or even multiple devices/contexts at the same time. This is the „TODO“ that I already linked to.

Second: The points that are related to the answer by tmurray in „CUDA, Context and Threading“ (CUDA Programming and Performance, NVIDIA Developer Forums):

  • a context belongs to a single device
  • a thread has a single context bound at a time (again, ignoring context stack stuff)
  • a context can be bound to multiple threads simultaneously

    The only time things get crazy is when you’re mixing runtime-created and driver-created contexts in the same app.

The VecFloat class attaches to any existing context during initialization. And this will/may be the primary, runtime-created context that was created for CUBLAS. If no such context exists, it will create one; this will not be a primary context. Multiple threads can be bound to one context, and the context that a given thread is bound to can be switched. There seems to be no information about which context a call is bound to when a driver API call is made on a thread that did not explicitly bind to any context using „cuCtxSetCurrent“. When memory is allocated on one thread, it will be allocated in the context that is currently bound to the calling thread. This may be a context that is different from the one that is used in the VecFloat class.

Now, what does all this mean?
Heck, I don’t know -_-
The possible interweavings of runtime-contexts, possible user-created contexts and the context that is used or created by the Vec-classes simply have to be reviewed. It might be „easy“ or „safe“ for your particular case, because you are only using the primary, runtime-created context of CUBLAS. But others might want to create multiple contexts, and use the VecFloat class to operate on memory that was allocated in different contexts. So they might need the possibility to specify which context this should be. I’ll have to sort out

  • whether there has to be some „cuCtxSetCurrent“ call inside of the VecFloat methods (a user-side sketch of this follows after this list)
  • whether it will be the responsibility of the user to call something like „VecFloat.attachToCurrentThread()“ before using the VecFloat class on a different thread
  • whether it will be the responsibility of the user to call something like „VecFloat.setCurrentContext(myContext)“ before using the VecFloat class on data that belongs to „myContext“
  • whether one VecFloat instance has to be created for each context (because the modules that are created during the initialization of the VecFloat class belong to one context only)
  • whether the VecFloat will need a (context-) „handle“, like it is used in the newer versions of the runtime libraries, and which has to be passed to all method calls
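
To make the first point a bit more concrete: with the current state of the library, a user who wants to issue VecFloat/CUBLAS calls from another thread would have to share the context via the raw driver API themselves. A rough sketch, using only plain JCudaDriver calls (whether this is sufficient in all cases is exactly what still has to be reviewed):

    // On the thread where GPUOp.init() ran, remember the context that is current there
    // (with the code above, that is the primary context created by CUBLAS):
    CUcontext context = new CUcontext();
    JCudaDriver.cuCtxGetCurrent(context);

    // A worker thread then makes that context current before touching VecFloat or JCublas:
    Thread worker = new Thread(() -> {
        JCudaDriver.cuCtxSetCurrent(context);
        // ... VecFloat / JCublas calls issued from this thread now refer to the same context
    });
    worker.start();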

I have added this as https://github.com/jcuda/jcuda-vec/issues/1 , so that it won’t get lost so easily, but I’m not sure when I’ll have the time to tackle this.

Hi once more,

I have done some testing, and it seems as though executing functions like
JCublas.cublasSgemm()
VecFloat.exp()

from multiple threads does not speed up the calculation at all, meaning they cannot run simultaneously. Is this true? Shouldn't the GPU be able to have multiple host threads doing calculations at once?

cheers
Noodles

I'm not sure what you mean. But when you have two calls, like cublasSgemm and exp, and each of them takes, say, 2 seconds, then executing both will take 4 seconds, even when they are issued by different host threads. It's the GPU that is busy all the time. (The fact that it receives these commands from different host threads does not make the execution on the GPU any faster.)

The tricky questions (that I sketched above) aim more at the case where you have two GPUs. Then you should be able to send cublasSgemm to the first GPU and exp to the second GPU, so that together they only take ~2 seconds instead of 4. But this is not (yet) possible.
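
If you want to see this on your own machine, here is a rough, self-contained timing sketch (sizes and iteration counts are arbitrary, and it assumes, as in your own tests, that calling VecFloat from plain worker threads works at all in your setup). The total GPU work is the same in both cases, so the two timings should come out roughly equal, give or take launch overhead:

```
import jcuda.Sizeof;
import jcuda.driver.CUdeviceptr;
import jcuda.jcublas.JCublas;
import jcuda.runtime.JCuda;
import jcuda.vec.VecFloat;

public class SerializationTest {
    public static void main(String[] args) throws InterruptedException {
        JCublas.cublasInit();
        VecFloat.init();

        int n = 1 << 22; // arbitrary test size: 4M floats per buffer
        CUdeviceptr dA = new CUdeviceptr();
        CUdeviceptr dB = new CUdeviceptr();
        JCublas.cublasAlloc(n, Sizeof.FLOAT, dA);
        JCublas.cublasAlloc(n, Sizeof.FLOAT, dB);
        // (The buffers are left uninitialized - only the timing matters here.)

        // Case 1: all work issued sequentially from one thread
        long t0 = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            VecFloat.exp(n, dA, dA);
            VecFloat.exp(n, dB, dB);
        }
        JCuda.cudaDeviceSynchronize(); // kernel launches are asynchronous
        long sequential = System.nanoTime() - t0;

        // Case 2: the same total work, issued from two host threads
        // NOTE: whether these worker threads pick up the right context is the open question above.
        long t1 = System.nanoTime();
        Thread ta = new Thread(() -> { for (int i = 0; i < 100; i++) VecFloat.exp(n, dA, dA); });
        Thread tb = new Thread(() -> { for (int i = 0; i < 100; i++) VecFloat.exp(n, dB, dB); });
        ta.start(); tb.start();
        ta.join(); tb.join();
        JCuda.cudaDeviceSynchronize();
        long threaded = System.nanoTime() - t1;

        System.out.println("sequential: " + sequential / 1e6 + " ms");
        System.out.println("two threads: " + threaded / 1e6 + " ms");

        JCublas.cublasFree(dA);
        JCublas.cublasFree(dB);
        VecFloat.shutdown();
        JCublas.cublasShutdown();
    }
}
```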