Generic mapping function

Does JCuda include a generic mapping function?
The CUDA Math Api is here: http://docs.nvidia.com/cuda/cuda-math-api/index.html
It offers tons of useful functions that I would like to apply element-wise to vectors or matrices.
In principle my idea is to say something like:

JCuda.map(tanh, myDoubleMatrix, result);

When writing CUDA-C code directly, there is no need for such a construct, because it is easy to write it as a kernel, which may look like:


__global__
void mapTan(double* seq, double* result) {
  int i = threadIdx.x;
  result[i] = tan(seq[i]);
}

What I imagine JCuda to have is a generic construct, like the one above, where we can replace the part “tan(seq[i])”.
Is there a way to solve this with Java? This would be very helpful, and greatly reduce the need to write and compile kernels manually.

Hello Mitt

There is no such function in JCuda at the moment, and such a functionality will not be added to JCuda directly, because JCuda itself is only intended as a wrapper for the API as it is.

But in general, this is a very interesting proposal :slight_smile: But instead of adding such a function directly to JCuda, it should be possible to implement this as some sort of utility library. The kernels could be integrated as precompiled PTX files into a JAR, and loaded on demand.

I’ll definitely consider this, and post any thoughts or results here.

Thanks for this suggestion!
Marco

BTW: I once started writing a general „vector processing machine“, with the intention to go far beyond the plain application of math functions to individual array elements. It was supposed to roughly offer functionality similar to that of NESL ( NESL: A Nested Data-Parallel Language (3.1) ). But this will probably never reach a mature and usable state. A library that only offers the specific CUDA math functions is a far more realistic goal :wink:

Yes, an additional lib would be perfectly fine for this too, of course. Perhaps there is a way to autogenerate those, say, 400 functions in a .cu file and compile them into a .ptx file.
Basically there could be several hundred such tiny functions, and I wonder why NVidia does not provide this in its 900 MB toolkit somewhere.

What I would like to do is this:
I uploaded several matrices A, B, C, … onto the GPU, and reserved a few result matrices R1, R2, …, which have the correct amount of memory to store the result of an operation on A alone, on A and A, or on A and B. As I understand it, in CUDA a matrix is just an array. I think this, because in your matrix multiplication examples you used the cublasSetVector function. The cublasSetMatrix function seems to do the same, it just has slightly different parameters.
Now I want to apply one of the math functions on A. For example I want to add a number to each element of A, and store the result in R1 (which has the same size as A).

__global__
void mapAddn(double* seq, double* result, double n) {
  int i = threadIdx.x;
  result[i] = seq[i] + n;
}

Or implement greater-than this way:

__global__
void mapGt(float* seq1, float* seq2, float* result) {
  int i = threadIdx.x;
  if (seq1[i] > seq2[i])
    result[i] = 1.0f;
  else
    result[i] = 0.0f;
}

This is untested code. I am not a C hacker (doing Clojure on the JVM), just wanted to illustrate what I was thinking of.
Unfortunately I can not simply try this out, because I can’t get nvcc running under Windows. Well, it runs, but constantly produces strange errors. I never used Visual Studio, so I have no idea how to get that working.

In principle I could start a small JVM project that autogenerates the source code, for floats and for doubles. This would take just a few hours. If you could give me a perfectly valid code template of the examples above, then I would start this project immediately and contact you. Btw, I wrote two emails to the contact email address that I found on the jcuda.org website.

I think this, because in your matrix multiplication examples you used the cublasSetVector function. The cublasSetMatrix function seems to do the same, it just has slightly different parameters.

Matrices are practically always represented as 1D arrays. The cublasSetMatrix function offers some further options for filling individual rows/columns (or, so to speak, „sub-matrices“), which are not required when filling the whole matrix.
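
Just to illustrate this with a small, untested sketch (the sizes are picked arbitrarily): when copying a complete matrix that is stored as a 1D array, both calls transfer exactly the same data.

int rows = 4;
int cols = 3;
float hostA[] = new float[rows * cols]; // a rows x cols matrix, stored column-major as a 1D array

JCublas.cublasInit();
Pointer deviceA = new Pointer();
JCublas.cublasAlloc(rows * cols, Sizeof.FLOAT, deviceA);

// For the full matrix, both calls copy exactly the same rows*cols elements:
JCublas.cublasSetVector(rows * cols, Sizeof.FLOAT, Pointer.to(hostA), 1, deviceA, 1);
JCublas.cublasSetMatrix(rows, cols, Sizeof.FLOAT, Pointer.to(hostA), rows, deviceA, rows);

JCublas.cublasFree(deviceA);
JCublas.cublasShutdown();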

Of course, some operations that could be added into such a „vector operation utility library“ are straightforward and nearly trivial. In fact, kernels for the implementation of the CUDA math functions could be created with something as simple as a macro in a text editor. A slightly more interesting part might be the API on the Java side. Again, there is a straightforward solution for the simplest case:

public static void mapTanf(Pointer input, Pointer output, int size)

And that’s it.
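
Internally, such a method would not have to do much more than look up the precompiled kernel and launch it via the driver API. A rough, untested sketch (the PTX file and kernel names are made up, the cuInit/context setup is omitted, and static imports from jcuda.driver.JCudaDriver are assumed):

public static void mapTanf(Pointer input, Pointer output, int size)
{
    // Load the precompiled PTX and obtain the kernel (both could of course be cached)
    CUmodule module = new CUmodule();
    cuModuleLoad(module, "vecKernels.ptx");           // hypothetical file name
    CUfunction function = new CUfunction();
    cuModuleGetFunction(function, module, "mapTanf"); // hypothetical kernel name

    // Kernel parameters: (float* input, float* output, int size)
    Pointer kernelParameters = Pointer.to(
        Pointer.to(input),
        Pointer.to(output),
        Pointer.to(new int[]{ size })
    );

    // A simple execution layout that covers all 'size' elements
    int blockSize = 256;
    int gridSize = (size + blockSize - 1) / blockSize;
    cuLaunchKernel(function,
        gridSize, 1, 1,    // grid dimension
        blockSize, 1, 1,   // block dimension
        0, null,           // shared memory size and stream
        kernelParameters, null);
    cuCtxSynchronize();
}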

However, every step beyond that bears its own challenges. For example, you mentioned the „mapAddn“ function. It has no direct representation in the CUDA math functions, but it is such a basic building block that it should definitely be included. What else do we need? Of course, the first answers are easy to find, just off the top of my head:

void addf(Pointer input, float valueToAdd, Pointer output, int size) // Add to each
void mulf(Pointer input, float factor, Pointer output, int size) // Multiply each
void addf(Pointer inputA, Pointer inputB, Pointer output, int size) // Element-wise add
void mulf(Pointer inputA, Pointer inputB, Pointer output, int size) // Element-wise mul
...

But you already mentioned one point that goes beyond simple arithmetic: The „mapGt“ function. So there should be a mapGt, mapGte, mapEq, mapNeq, …
What else could we need? Things like scan/reductions, of course.
And permutations.
And…
You see that this will hardly find an end :wink:

That’s why I had this idea of writing a general Vector Processing Machine. This idea was inspired by the NESL language described in the PhD Thesis of Guy Blelloch: „Vector Models for Data-Parallel Computing“, from http://www.cs.cmu.edu/~blelloch/pubs.html (at the bottom of the page - it’s 18 years old, but more up to date than ever…).

I’ve been doing some research about the possible applications and implementations, and most importantly, about the options for choosing the instruction set of such a machine. One could write a Java library with hundreds or thousands (!) of utility functions for specific CUDA kernels. But I wanted to reduce this to a „minimal“ (or at least, very small) instruction set that should be capable of emulating all functions that could possibly be required, and use these functions as building blocks for more complex ones.

So I started this implementation according to the description in the thesis, but basically got stuck at the concept of „segmented vectors“. It is an astonishingly powerful concept, and it does not look very complicated at first glance, but I found it tremendously hard to implement - and it never worked properly :frowning: . Maybe I’ll give it another try, but I am not sure when I will find the time for that.

But since this „Vector Processing Machine“ is postponed and might never be implemented at all, one first step would certainly be what you proposed, namely to implement the first set of „simple“ vector operations like the ones mentioned above.

What I can already say is that the concept of having such a set of library functions is definitely feasible. I already did this with OpenCL/JOCL: There I had to do some computations, for example, a simple numerical integration for a simulation, with kernels that (in pseudocode) did roughly things like

void integrate(float *positions, float *velocities, float *accelerations, float timeStep)
{
    int t = ... // thread index
    velocities[t] += accelerations[t] * timeStep;
    positions[t] += velocities[t] * timeStep;
    ...
}

Then I broke this down into a set of individual „basic vector instruction kernels“. For example, one „scaleAdd“ kernel, that just did the operation „vectorA += vectorB * scalar“. Using these kernels, I could build the original operation in Java like this:

void integrate(Pointer positions, Pointer velocities, Pointer accelerations, float timeStep)
{
    scaleAdd(velocities, accelerations, timeStep);
    scaleAdd(positions, velocities, timeStep);
    ...
}

And in the end, the computation that was based on these „basic vector instruction kernels“ was only a few percent slower (but of course, much more flexible and versatile!) than the highly specific kernel.

I’ll have some business trips in the next few days, but will definitely have a closer look at this (and respond to your mails!) beginning of next week.

bye
Marco

Aloha Marco, thanks for your reply.

Matrices are practically always represented as 1D arrays
I don’t know what the technical implications of this are, but the first CUDA example in the NVidia docs is this:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C) {
     int i = threadIdx.x;
     C[i] = A[i] + B[i];
}

int main() {
     ...
     // Kernel invocation with N threads
     VecAdd<<<1, N>>>(A, B, C);
     ...
}

(see the CUDA C++ Programming Guide)

From this I imagine that a kernel invocation is basically like a for loop, only that all iterations are executed at the same time. Of course, all at the same time is not possible with 5k * 5k matrices, but given a hypothetical GPU, it could indeed run all additions in parallel. About the technical issue: I saw that besides threadIdx, there are also blockIdx and blockDim. They don’t seem to be required here, if we can just say: start 5k*5k threads and run the VecAdd function above. But perhaps there is a limit on N, so that 25 million is too big a number in practice, and we need to use this blockIdx thing.

You explained that a totally generic construct is not a trivial thing to implement. My opinion here is to start with the basics. It is unfortunate that we cannot have the perfect version in the next couple of days, but we can easily come up with a script that generates nearly all the code for tons of useful functions. It’s true, it won’t be the optimal solution. But my question is: is it better to have 400 useful pre-compiled kernels than to have none? I think: yes!
Especially given that it can be automatically generated.

What does this mean? In its CUDA Math Lib, NVidia offers nearly 80 functions for floats, and the same number of functions for doubles. Also, a few operations are offered as C operators (rather than function calls), namely: +, -, *, /, <, >, <=, >=, ==.

Let’s only concentrate on functions that have one or two parameters. So this is bad news for fans of the sincospif function ( http://docs.nvidia.com/cuda/cuda-math-api/index.html#group__CUDA__MATH__SINGLE_1gab8978300988c385e0aa4b6cba44225e ), but some win and some lose :wink:

Functions that have one parameter, such as sin, cos or tanh, are most trivial to auto-generate.
It would always be:


extern "C"
__global__ void «FUNCTION»(«TYPE»* input, «TYPE»* output)
{
    int i = threadIdx.x;
    // or: int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = «FUNCTION»(input[i]);
}

Where «FUNCTION» needs to be replaced by those 10-40 CUDA functions that operate on a single argument, and where «TYPE» needs to be replaced once by float and once by double.

Now about functions that take two args and combine them with an operator (i.e., infix notation):
Variant 1:


extern "C"
__global__ void «OPname»(«TYPE»* input, «TYPE» n, «TYPE»* output)
{
    int i = threadIdx.x;
    // or: int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = input[i] «OP» n;
}

Variant 2:


extern "C"
__global__ void «OPname»(«TYPE»* input1, «TYPE»* input2, «TYPE»* output)
{
    int i = threadIdx.x;
    // or: int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = input1[i] «OP» input2[i];
}

Again, versions are needed for floats and doubles.
Last but not least, the same code again for functions that take two args like the operators do, but that use prefix notation. Again two variants are needed; here only the example where both inputs are arrays:

Variant 1:


extern "C"
__global__ void «FUNCTION»(«TYPE»* input1, «TYPE»* input2, «TYPE»* output)
{
    int i = threadIdx.x;
    // or: int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = «FUNCTION»(input1[i], input2[i]);
}

That’s it. We only need to come up with a naming scheme that makes sense, and then we can auto-generate this code. Then of course the Java side will also be required, but it will be as simple as the kernel stuff.
Perhaps a size argument is required, as you indicated.

So, with this we would have a nice JCuda addon that gives us 200-400 very useful functions, that I can finally use to implement my neural networks :smiley:
Also I guess, lots of people are writing 1-5 of such kernels for most of their apps, again and again. And those guys have to go through the driver API, learn kernel writing, etc.

I will be glad to hear from you,
sunny greetings,
Mitt

Hello

That’s exactly how the execution should be imagined. Of course, the GPU does some scheduling internally. If there are only 1000 cores but 2000 data elements to process, then the work will be done in 2 batches. But the intention is that this is done transparently and entirely in hardware, and on software side, it „feels like“ having 2000 cores.
However, the Block- and Grid sizes are important:

  • The threadIdx specifies the thread index inside a block, and there is a maximum value for that.
  • The blockIdx specifies the block index inside a grid, and there also is a maximum for that.
    When you run the „deviceQuery“ sample from the SDK, you will see something like

  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1

(so the vectors could have a maximum size of 512 * 65535 ≈ 33 million elements on my old GeForce :wink: ). There are some rather simple „patterns“ that can be applied to compute an appropriate execution layout (in terms of block and grid sizes). Computing the optimal execution layout is a science, but hopefully not so much for the simple kernels that we are currently talking about.
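
For example, a minimal sketch of such a pattern on the Java side (assuming the kernel receives the size n as a parameter and simply ignores indices >= n):

int n = 25 * 1000 * 1000;  // e.g. all elements of a 5k x 5k matrix
int blockSize = 256;       // threads per block, safely below the device limit
int gridSize = (n + blockSize - 1) / blockSize; // enough blocks to cover all n elements
// The kernel then computes  int i = blockIdx.x * blockDim.x + threadIdx.x;
// and simply returns when i >= n.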

You explained that a totally generic construct is not a trivial thing to implement. My opinion here is to start with the basics.


Also I guess, lots of people are writing 1-5 of such kernels for most of their apps, again and again. And those guys have to go through the driver API, learn kernel writing, etc.

In general, I totally agree with you - especially with the last sentence: I have experienced myself that it can be „annoying“ having to launch the whole Visual Studio + NVCC + Driver API + … pipeline just to implement a trivial vector addition, so such a library will certainly be beneficial for many users, since many operations can be built from these small, simple and canonical building blocks. There are still some doubts whether this simple and straightforward approach could turn out to be a limiting factor for further generalizations, but one could argue that the more general approach (regardless of how it might look) should be capable of doing what this library is intended for.

What does this mean? In its CUDA Math Lib, NVidia offers nearly 80 functions for floats, and the same number of functions for doubles. Also, a few operations are offered as C operators (rather than function calls), namely: +, -, *, /, <, >, <=, >=, ==.

Let’s only concentrate on functions that have one or two parameters. So this is bad news for fans of the sincospif function

It’s probably not much effort to include the 3-parameter functions as well. It partially depends on HOW the functions will be generated. It’s clear that this will be some sort of auto-generation. Maybe things like http://velocity.apache.org/engine/index.html could be used here, but maybe a small custom code generator could do the job as well. I’ll probably implement some of the functions manually, maybe one or two of each kind (Vector+Scalar, Vector+Comparator, Vector+Vector, …), to get a better feeling and overview of the requirements, and to see what might be the most appropriate approach here.
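
To give an idea of how simple such a generator could be - a minimal, untested sketch for the one-argument kernels, using plain string replacement (the kernel names are made up, and a size parameter with a bounds check is added, as discussed above):

// The kernel template; «NAME», «TYPE» and «FUNCTION» are filled in below
private static final String TEMPLATE =
    "extern \"C\"\n" +
    "__global__ void «NAME»(«TYPE»* input, «TYPE»* output, int n)\n" +
    "{\n" +
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n" +
    "    if (i < n) output[i] = «FUNCTION»(input[i]);\n" +
    "}\n";

private static String generateOneArgKernel(String name, String type, String function)
{
    return TEMPLATE
        .replace("«NAME»", name)
        .replace("«TYPE»", type)
        .replace("«FUNCTION»", function);
}

// e.g. generateOneArgKernel("vec_tanhf", "float",  "tanhf");
//      generateOneArgKernel("vec_tanh",  "double", "tanh");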

bye
Marco

Sounds good. Yes, a first test with 2-3 handwritten mini-kernels sounds good. Once such a lib has been written, it would appear as if it were part of the Runtime API, which is very nice. I guess you would compile it into .ptx files, so that the code can basically run on every platform. Although it will be interesting to see how it will work on 32 vs. 64 bit systems. And one thing remains: it might be interesting to play a bit with the numbers for grid size, blocks per grid and threads per block. On new cards, much higher numbers are allowed. So if we have a variable that can be filled while auto-generating the code, we could output two versions: one which should run on pretty much every card, and one that targets compute capability 3.0 or higher, so that 1024 threads per block might be useful. Perhaps :wink:

About this Velocity lib: it might be useful, but I would instead write a few lines of Clojure code that auto-generate the CUDA and Java code. At least for me this would be much easier, because Clojure is +/- the most powerful programming language ever :smiley:

At the moment, I assume (or hope? :wink: ) that the 32/64-bit issue can be resolved pragmatically, by compiling the kernels in both versions and simply loading the appropriate version at runtime.
The block/grid size thing should also be no problem: It does not affect the PTX file. It is only a matter of how the kernel is invoked, and of course, this can be chosen to match the available resources of the machine that the whole thing is running on.
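
In code, the 32/64-bit part could be as simple as the following sketch (the system property is specific to Sun/Oracle JVMs, and the file names are made up):

// Pick the PTX file that matches the JVM "bitness" at runtime
String model = System.getProperty("sun.arch.data.model"); // "32" or "64" on Sun/Oracle JVMs
String ptxFileName = "vecKernels_" + model + ".ptx";      // hypothetical file names
CUmodule module = new CUmodule();
cuModuleLoad(module, ptxFileName);
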
I’m really curious to see whether this all works as expected.

So I started experimenting a little. As I already mentioned, the basic concept itself might be feasible. But of course, there is a tremendous number of degrees of freedom for the implementation. I’ll try to summarize some of the first (and/or most important) ones that I came along and will have to think about.

First of all, I’m not sure whether it’s worth putting too much effort into a templating scheme. The code is simple and can be generated with a TextPad macro, and it’s unlikely that the set of mathematical functions or operators like <, <=, ==, … or +, -, *, / will change dramatically in the future :wink: Most of the effort related to this is tearing apart the definitions and classifying them properly into arithmetic, comparison, 1-arg math functions, 2-arg math functions etc. But again, this also has to be done only once.

The next big and important question is the API. I’m always very careful with that. There is a reason why JCuda and JOCL only map the original APIs to Java: If there’s something odd, it’s not my fault. (Although it’s definitely justified to say that mapping a C-API to Java is „odd“ :o ).

In the case of these vector functions, there is of course the „brute-force-option“ to put everything into a giant class with hundreds of static methods, but that’s ugly (even uglier than this mapping is for the existing JCuda classes :wink: ).

I thought about different options. I won’t list them all, but post a SKETCH of what I’m currently thinking about. This is in NO WAY close to a final design of what I intend to do, but only what I’m playing around with:

public class VecSketch01
{
    public static void main(String[] args)
    {
        Pointer x = null;
        Pointer y = null;
        Pointer z = null;
        long n = 0;
        
        Vec vec = Vectors.create();
        
        // x = 1
        vec.set(x, 1, n);

        // y = acos(x)
        vec.math().acos(y, x, n);
        
        // z = x + y
        vec.arithmetic().add(z, x, y, n);
        
        // z = z + 0.5 
        vec.arithmetic().add(z, z, 0.5, n);
        
        vec.shutdown();
        
    }
}


/*
 * - The interfaces are equal for double- and float case. Should
 *   there be different interfaces? In fact, one should be able
 *   to apply all these operations to integers as well!
 *   (And sometimes, they even involve mixed types, 
 *    like float+integer for permute(...)
 * 
 * - The instance is created at runtime for either 32 or 64 bit
 * 
 * - All methods receive the result vector as the first argument 
 *   and the size 'n' as the last argument
 */ 

// Element-wise math functions from CUDA
interface VecMath
{
    void acos(Pointer result, Pointer x, long n);
    void pow(Pointer result, Pointer x, Pointer y, long n);
    //...
}

// Element-wise comparison operators <, <=, == ... 
// Setting the value 1.0 for 'true' and 0.0 for 'false'
interface VecComparison
{
    void lt(Pointer result, Pointer x, Pointer y, long n);
    void lte(Pointer result, Pointer x, Pointer y, long n);
    //...
}

// Arithmetic
interface VecArithmetic
{
    // Element-wise
    void add(Pointer result, Pointer x, Pointer y, long n);
    void mul(Pointer result, Pointer x, Pointer y, long n);
    
    // Element-wise vector and scalar. All these functions accept
    // a double value as the scalar, which may internally
    // be cast to 'float' for the 'float' implementation
    void add(Pointer result, Pointer x, double y, long n);
    void mul(Pointer result, Pointer x, double y, long n);
    
}

// Scan, reduce, permute... These are ALL important. In fact,
// with a proper 'scan', most other operations could be
// emulated...
enum Operator
{
   ADD, MUL, MIN, MAX 
}
interface VecScan
{
    double scan(Operator op, Pointer x, long n);
}
interface VecReduction
{
    double reduce(Operator op, Pointer x, long n);
}
interface VecPermutation
{
    void permute(Pointer result, Pointer x, Pointer i, long n);
}

// The general interface for vector computations. This allows
// basic maintenance, and offers the math/comparison... etc
// interfaces 
interface Vec
{
    Pointer at(Pointer x, int elementsOffset);
    void set(Pointer result, double value, long n);
    void shutdown();
    
    VecMath math();
    VecComparison comparison();
    VecArithmetic arithmetic();
    VecScan scan();
    VecReduction reduction();
    VecPermutation permutation();
}

class Vectors 
{
    public static Vec create()
    {
        // Create a new instance of an implementation
        // of the Vec interface, either for 32 or for
        // 64 bit
        return null;
    }
}

There are already some issues inlined in the comments.

The implementations of these interfaces may look trivial at first glance, but will nevertheless require some maintenance for the CUfunction pointers etc., although there certainly is no rocket science involved.

However, I’m still not really happy with the direction in which this is going. It’s „close“ to a generic vector processing machine, but it’s probably not easy to add the missing parts. In order to be a little more generic and flexible, there would have to be dedicated classes for representing ‚Vector‘ instances (basically wrapping a Pointer and a ‚size‘). And in order to turn this into something that has the potential to be REALLY cool, it would also have to support segmented vectors. So I’m currently also thinking about whether and how this could be seen as a possible „backend“ for such a vector instruction machine…

So long,
Marco

No feedback here?

At the moment, I’m working on one possible infrastructure for such a backend. One option is to have the named functions available in an interface (regardless of how this interface is offered to the client). Namely, something like

public void acos(Pointer result, Pointer x, long n) ...
public void pow(Pointer result, Pointer x, Pointer y, long n) ...
// ... + hundreds more

// Usage:
v.acos(p0, p1, n);
v.pow(p0, p1, p2, n);

Alternatively, one could offer an interface like

    public void call(String name, Object ... args)

// Usage:
v.call("acos", p0, p1, n);
v.call("pow", p0, p1, p2, n);

Of course, the latter would be very error-prone: A single typo in the ‘name’, or a wrong number of parameters would scr*w up the whole thing. But it would be very generic: ALL functions could be called with this interface. Maybe it will just become another layer between the first version and the actual kernel call…
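
Internally, such a generic 'call' would mainly have to translate the given arguments into kernel parameters. A rough, untested sketch of that part (only handling Pointer, int and long arguments here; the CUfunction lookup is assumed to happen elsewhere, e.g. in a map called 'functions'):

public void call(String name, Object ... args)
{
    CUfunction function = functions.get(name); // assuming the CUfunctions were obtained before
    Pointer kernelParameters[] = new Pointer[args.length];
    for (int i = 0; i < args.length; i++)
    {
        if (args[i] instanceof Pointer)
        {
            kernelParameters[i] = Pointer.to((Pointer)args[i]);
        }
        else if (args[i] instanceof Integer)
        {
            kernelParameters[i] = Pointer.to(new int[]{ (Integer)args[i] });
        }
        else if (args[i] instanceof Long)
        {
            kernelParameters[i] = Pointer.to(new long[]{ (Long)args[i] });
        }
        // ... further argument types (float, double) accordingly
    }
    // ... then launch the kernel with Pointer.to(kernelParameters)
    // and an execution layout derived from the size argument
}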

In my opinion it should be a very simple start. One lesson that I learned over the years is that the very simple things, with a low level of complexity, win.
I honestly don’t see any problem having one class JCudaMath or whatever, that contains 400 or so static methods. Thinking in Java is evil. Object orientation is not required. What we have are algorithms, not classes, hierarchies or whatever.

There can be at least 2 versions for each function: one for floats and one for doubles. The developer has to make sure that all his arrays/vectors/matrices already reside on the device.

I’m not sure whether it is a good idea to throw a few hundred kernels into one large .CU file and load hundreds of CUfunction objects on the Java side. I thought about at least some structure in order to classify the functions, but am not sure whether this is appropriate. I thought about some degree of type safety, but of course, through the ‘Pointer’ class, this is always the responsibility of the user. I also thought about extensibility, and even whether there should be some sort of “plugin” mechanism so that this could easily be extended with own functions, but who knows whether this is necessary at all. And of course, I’d still like to implement a real vector machine, but this requires more time.
I’m somewhat wary of investing a considerable amount of time in creating such a library, with the chance that it might turn out to be an unmaintainable one-shot. If the intention is just to circumvent the compilation of one-line kernels, a set of 400 precompiled PTX files (or a PTX file with 400 kernels) would be sufficient. But I’ll think about that thoroughly…
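
Regarding the „hundreds of CUfunction objects“: they would not have to be created eagerly. A small, untested sketch of a lazy lookup (field names made up, assuming the module was loaded once from the PTX file):

private CUmodule module; // loaded once from the PTX file, via cuModuleLoad
private final Map<String, CUfunction> functions = new HashMap<String, CUfunction>();

private CUfunction getFunction(String name)
{
    CUfunction function = functions.get(name);
    if (function == null)
    {
        function = new CUfunction();
        cuModuleGetFunction(function, module, name);
        functions.put(name, function);
    }
    return function;
}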

I guess that too much thinking won’t help here. IMO it’s best to auto-generate 95% of those 1-2 line functions and put them in one class with static methods. This will waste, say, 3 MB of RAM, and perhaps the same on the card, I don’t know. But it will be a tiny amount of memory anyway.
Currently I see no advantage in having this split into multiple files or multiple classes. As there is no serious memory waste, and we are talking kilobytes here, we won’t have to do thinking and planning as if we wanted to send a robot to Mars :wink:

There is also nothing blocking extensions. Currently, everybody who wants to use those simple functions has to implement them themselves. This will still be possible (and may be required in practice). It just means that people can get started within a few moments, just write their code and call functions. Whoever wants to extend it will write a .cu file, compile it to PTX and load it, as he/she does today.

Certainly, yes - I usually try to find sustainable and generic solutions for things like this, but maybe it’s best to anticipate that it will just be a quarry of functions. I’ll try to create such a simple and flat version tomorrow or during the weekend.

A first version of the “JCudaVec” library is now available at http://jcuda.org/utilities/utilities.html