I think so because you used the cublasSetVector function in your matrix multiplication examples. The cublasSetMatrix function seems to do the same; it just has slightly different parameters.
Matrices are practically always represented as 1D arrays. The cublasSetMatrix functions offer some further options for filling individual rows/columns (or, so to speak, "sub-matrices"), which are not required when filling the whole matrix.
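For example, a minimal sketch using the JCublas functions (the sizes here are made up, and cublasAlloc is only used for brevity): a whole column-major matrix stored as a 1D array can be copied with either function, while cublasSetMatrix additionally allows copying only a sub-matrix via the row/column counts and leading dimensions:

// A 4x3 matrix, column-major, stored as a 1D array of 12 floats
float hostMatrix[] = new float[4 * 3];
Pointer deviceMatrix = new Pointer();
JCublas.cublasAlloc(4 * 3, Sizeof.FLOAT, deviceMatrix);

// Copying it as one vector of 12 consecutive elements...
JCublas.cublasSetVector(4 * 3, Sizeof.FLOAT, Pointer.to(hostMatrix), 1, deviceMatrix, 1);

// ...or as a 4x3 matrix, with a leading dimension of 4 on both sides.
// Passing smaller row/column counts here would copy only a sub-matrix.
JCublas.cublasSetMatrix(4, 3, Sizeof.FLOAT, Pointer.to(hostMatrix), 4, deviceMatrix, 4);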
Of course, some operations that could be added to such a "vector operation utility library" are straightforward and nearly trivial. In fact, kernels for the implementation of the CUDA math functions could be created with something as simple as a macro in a text editor. A slightly more interesting part might be the API on the Java side. Again, there is a straightforward solution for the simplest case:
public static void mapTanf(Pointer input, Pointer output, int size)
And that’s it.
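The kernel behind it is exactly the kind of code that such a macro could emit (a sketch, with the usual index computation and bounds check):

extern "C" __global__ void mapTanf(float *input, float *output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
    {
        output[i] = tanf(input[i]);
    }
}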
However, every step beyond that bears its own challenges. For example, you mentioned the "mapAddn" function. It has no direct representation in the CUDA math functions, but it is such a basic building block that it should definitely be included. What else do we need? Of course, the first answers are easy to find, just off the top of my head:
void addf(Pointer input, float valueToAdd, Pointer output, int size) // Add to each
void mulf(Pointer input, float factor, Pointer output, int size) // Multiply each
void addf(Pointer inputA, Pointer inputB, Pointer output, int size) // Element-wise add
void mulf(Pointer inputA, Pointer inputB, Pointer output, int size) // Element-wise mul
...
But you already mentioned one point that goes beyond simple arithmetic: the "mapGt" function. So there should be a mapGt, mapGte, mapEq, mapNeq, …
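Assuming that these comparison kernels write 1.0f or 0.0f into the output (the exact semantics would of course still have to be defined), a mapGt kernel could look like this:

extern "C" __global__ void mapGtf(float *inputA, float *inputB, float *output, int size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size)
    {
        output[i] = (inputA[i] > inputB[i]) ? 1.0f : 0.0f;
    }
}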
What else could we need? Things like scans/reductions, of course (a sketch of a simple reduction kernel follows below).
And permutations.
And…
You can see that this will hardly come to an end.
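Just to illustrate the reduction point: even a plain sum reduction is already a multi-step affair. A minimal sketch of a block-wise sum reduction (deliberately simple, and assuming a power-of-two block size) could be:

extern "C" __global__ void reduceSumf(float *input, float *output, int size)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (or 0 past the end) into shared memory
    sdata[tid] = (i < size) ? input[i] : 0.0f;
    __syncthreads();

    // Tree-based reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes the partial sum of this block; the partial sums
    // still have to be reduced again (or summed up on the host)
    if (tid == 0)
    {
        output[blockIdx.x] = sdata[0];
    }
}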
That's why I had this idea of writing a general Vector Processing Machine. This idea was inspired by the NESL language described in the PhD thesis of Guy Blelloch, "Vector Models for Data-Parallel Computing", from http://www.cs.cmu.edu/~blelloch/pubs.html (at the bottom of the page; it's 18 years old, but more up to date than ever…).
I've been doing some research about the possible applications and implementations, and most importantly, about the options for choosing the instruction set of such a machine. One could write a Java library with hundreds or thousands (!) of utility functions for specific CUDA kernels. But I wanted to reduce this to a "minimal" (or at least, very small) instruction set that is capable of emulating all functions that could possibly be required, and to use these functions as building blocks for more complex ones.
So I started an implementation according to the description in the thesis, but basically got stuck at the concept of "segmented vectors". It is an astonishingly powerful concept, and it does not look very complicated at first glance, but I found it tremendously hard to implement, and it never worked properly. Maybe I'll give it another try, but I'm not sure when I will find the time for that.
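To sketch the concept: a segmented vector is a plain vector of values together with a segment descriptor that defines where each segment begins and ends, and all operations respect these segment boundaries. A sequential Java illustration of a segmented sum (the names and the length-based descriptor are just one possible choice; the thesis also uses other representations):

// The values [5,1, 3,4,9, 2] with the segment lengths [2,3,1] represent
// the segmented vector [[5,1],[3,4,9],[2]], and its segmented sum is
// the vector of per-segment sums [6,16,2].
static float[] segmentedSum(float[] values, int[] segmentLengths)
{
    float[] sums = new float[segmentLengths.length];
    int offset = 0;
    for (int s = 0; s < segmentLengths.length; s++)
    {
        for (int i = 0; i < segmentLengths[s]; i++)
        {
            sums[s] += values[offset + i];
        }
        offset += segmentLengths[s];
    }
    return sums;
}

The hard part is to do this in parallel on the GPU, for all segments at once, without such a sequential loop over the segments.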
But since this "Vector Processing Machine" is postponed and might never be implemented at all, a first step would certainly be what you proposed, namely to implement a first set of "simple" vector operations like the ones mentioned above.
What I can already say is that the concept of having such a set of library functions is definitely feasible. I already did this with OpenCL/JOCL: there I had to do some computations, for example a simple numerical integration for a simulation, with kernels that (in pseudocode) did roughly things like
__kernel void integrate(__global float *positions, __global float *velocities, __global float *accelerations, float timeStep)
{
    int t = get_global_id(0); // thread index
    velocities[t] += accelerations[t] * timeStep;
    positions[t] += velocities[t] * timeStep;
    ...
}
Then I broke this down into a set of individual "basic vector instruction kernels". For example, a "scaleAdd" kernel that just did the operation "vectorA += vectorB * scalar". Using these kernels, I could build the original operation in Java like this:
void integrate(Pointer positions, Pointer velocities, Pointer accelerations, float timeStep)
{
    scaleAdd(velocities, accelerations, timeStep);
    scaleAdd(positions, velocities, timeStep);
    ...
}
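For the CUDA-based library that we are talking about, such a scaleAdd kernel would look roughly like this (a sketch; I'm assuming here that the vector length is passed in as an additional parameter):

extern "C" __global__ void scaleAdd(float *vectorA, float *vectorB, float scalar, int size)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < size)
    {
        // vectorA += vectorB * scalar, element-wise
        vectorA[t] += vectorB[t] * scalar;
    }
}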
And in the end, the computation that was based on these "basic vector instruction kernels" was only a few percent slower than the highly specific kernel (but of course, much more flexible and versatile!).
I'll have some business trips in the next few days, but will definitely have a closer look at this (and respond to your mails!) at the beginning of next week.
bye
Marco