RNG through jCuda

Has anyone developed a port of CUDA’s Mersenne Twister random number generator for jCuda? I’m not too experienced with C/C++, so I’m having difficulty interpreting the host code of the Mersenne Twister classes well enough to get it up and running through the jCuda driver API.

If anyone has done so, would you be willing to share some of the code for any of us newbies starting out?

This may make a useful addition in a Utilities package down the road :wink:

Hello

I had a short look at the C source code, and it looks straightforward at first glance. I’ll try to port this example ASAP; maybe I’ll find the time this evening (although there are still some other tasks in the queue…)

Something like this might be interesting for a utilities package, but there are MANY things that might be interesting as "building blocks" within a set of higher-level utility classes :wink: Most prominently http://developer.nvidia.com/object/npp_home.html , or a set of general vector/scan instructions… I’m already working on that, so the queue certainly won’t be empty too soon :wink:

bye
Marco

OK then, the handling of the CPU part is of course not straightforward in Java, since it involves some structs. But at least accessing the GPU part via JCuda is not too complicated. It also involves a struct, but only a very simple one that is filled on the host side and read in the kernel.

Here’s a small example, mainly based on the GPU-related part of the original MersenneTwister example.

/*
 * JCuda - Java bindings for NVIDIA CUDA driver and runtime API
 * http://www.jcuda.org
 *
 * Copyright 2010 Marco Hutter - http://www.jcuda.org
 */

import java.io.*;
import java.nio.ByteBuffer;

import jcuda.*;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

/**
 * This is a port of the GPU part of the NVIDIA CUDA 
 * Mersenne Twister Random Number Generator example.<br />
 * <br />
 * Required files:
 * <ul>
 *   <li>
 *     <b>MersenneTwister.compute_10.sm_10.cubin</b> - The 
 *     SM 10 CUBIN file that is created from the original
 *     example when keeping the preprocessed files by adding 
 *     the <code>--keep</code> parameter to the NVCC call.
 *   </li>
 *   <li>
 *     <b>MersenneTwister.dat</b> - The data file that is
 *     contained in the original example.
 *   </li>
 * </ul>
 */
public class JCudaDriverMersenneTwister
{
    /**
     * The name of the CUBIN file
     */
    private static final String cubinFileName = 
        "MersenneTwister.compute_10.sm_10.cubin";
    
    /**
     * The name of the data file that is loaded
     */
    private static final String dataFileName = 
        "MersenneTwister.dat";
    
    // Variable declarations as in the original example
    private static final int PATH_N = 24000;
    private static final int MT_RNG_COUNT = 4096;
    private static final int N_PER_RNG = 
        alignUp(divUp(PATH_N, MT_RNG_COUNT), 2);
    private static final int RAND_N = MT_RNG_COUNT * N_PER_RNG;
    private static final int SEED = 777;

    /**
     * This is the size of the stripped Mersenne Twister structure
     * that is defined in the original sample as follows:
     * <pre>
     * typedef struct
     * {
     *     unsigned int matrix_a; 
     *     unsigned int mask_b;
     *     unsigned int mask_c;
     *     unsigned int seed;
     * } mt_struct_stripped;
     * </pre>
     */
    private static int sizeof_mt_struct_stripped = 4 * Sizeof.INT;
    
    /**
     * This variable is originally declared as
     * <code>static mt_struct_stripped h_MT[MT_RNG_COUNT];</code>
     * 
     * Since this data is filled by reading the data from a 
     * file, and it has to be copied to the device, this is
     * not stored as an array of Objects of a class that 
     * resembles the original structure, but simply as a
     * direct byte buffer.
     */
    private static ByteBuffer h_MT = 
        ByteBuffer.allocateDirect(MT_RNG_COUNT * sizeof_mt_struct_stripped);
    
    
    /**
     * The CUDA module that is created from the CUBIN file
     */
    private static CUmodule module;

    /**
     * The entry point of this sample.
     * 
     * @param args Not used
     */
    public static void main(String args[])
    {
        // Initialize the driver and create a context for the first device.
        System.out.println("Initializing CUDA driver...");
        JCudaDriver.setExceptionsEnabled(true);
        cuInit(0);
        CUcontext pctx = new CUcontext();
        CUdevice dev = new CUdevice();
        cuDeviceGet(dev, 0);
        cuCtxCreate(pctx, 0, dev);

        // Load the module from the CUBIN file
        System.out.println("Loading module from "+cubinFileName+"...");
        module = new CUmodule();
        cuModuleLoad(module, cubinFileName);

        // Obtain the function pointers to the "RandomGPU" 
        // and the "BoxMuller" functions
        CUfunction randomGPU = new CUfunction();
        cuModuleGetFunction(randomGPU, module, "_Z9RandomGPUPfi");
        CUfunction boxMullerGPU = new CUfunction();
        cuModuleGetFunction(boxMullerGPU, module, "_Z12BoxMullerGPUPfi");

        // Initialize the data for the samples
        System.out.println("Initializing data for "+PATH_N+" samples...");
        float h_RandGPU[] = new float[RAND_N];
        CUdeviceptr d_Rand = new CUdeviceptr();
        cuMemAlloc(d_Rand, RAND_N * Sizeof.FLOAT);

        // Load the twister configuration from the input data file
        System.out.println("Loading GPU twister configuration...");
        loadMTGPU(dataFileName);
        seedMTGPU(SEED);

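        // Both kernels have the signature (float*, int), as encoded in
        // the "Pfi" suffix of their mangled names, so the arguments
        // (d_Rand, N_PER_RNG) have to be set before the launch
        for (CUfunction function : new CUfunction[]{ randomGPU, boxMullerGPU })
        {
            cuParamSetv(function, 0, Pointer.to(d_Rand), Sizeof.POINTER);
            cuParamSeti(function, Sizeof.POINTER, N_PER_RNG);
            cuParamSetSize(function, Sizeof.POINTER + Sizeof.INT);
        }

        System.out.println("Generating random numbers on GPU...");
        int numIterations = 20;
        // The iteration with i == -1 is a warm-up run, as in the original
        for (int i = -1; i < numIterations; i++)
        {
            if (i == 0)
            {
                cuCtxSynchronize();
            }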
            cuFuncSetBlockShape(randomGPU, 128, 1, 1);
            cuLaunchGrid(randomGPU, 32, 1);

            cuFuncSetBlockShape(boxMullerGPU, 128, 1, 1);
            cuLaunchGrid(boxMullerGPU, 32, 1);
        }
        cuCtxSynchronize();

        System.out.println("Reading back the results...");
        cuMemcpyDtoH(Pointer.to(h_RandGPU), d_Rand, RAND_N * Sizeof.FLOAT);

        System.out.println("Results: "+stringFor(h_RandGPU, 6));
        
        System.out.println("Shutting down...");
        cuMemFree(d_Rand);
    }

    /**
     * Returns a String containing up to 'max' elements of the
     * given array.
     * 
     * @param array The array
     * @param max The maximum number of elements
     * @return The String for the array
     */
    private static String stringFor(float array[], int max)
    {
        int n = Math.min(max, array.length);
        StringBuilder sb = new StringBuilder("[");
        for (int i=0; i<n; i++)
        {
            sb.append(String.valueOf(array[i]));
            if (i<n-1)
            {
                sb.append(", ");
            }
        }
        if (max < array.length)
        {
            sb.append(", ...");
        }
        sb.append("]");
        return sb.toString();
    }
    
    
    /**
     * Align a to nearest higher multiple of b
     * 
     * @param a The value to align
     * @param b The alignment
     * @return The aligned value
     */
    private static int alignUp(int a, int b)
    {
        return ((a % b) != 0) ? (a - a % b + b) : a;
    }

    /**
     * Computes the quotient a/b, rounded up to the next highest 
     * integral value 
     * 
     * @param a Dividend
     * @param b Divisor
     * @return The rounded quotient
     */
    private static int divUp(int a, int b)
    {
        return ((a % b) != 0) ? (a / b + 1) : (a / b);
    }

    
    /**
     * Load the twister configuration from the file with the given name, 
     * and store its contents in the "h_MT" ByteBuffer
     * 
     * @param fname The file name
     */
    private static void loadMTGPU(String fname)
    {
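        DataInputStream dis = null;
        try
        {
            dis = new DataInputStream(new FileInputStream(fname));
            byte buffer[] = new byte[sizeof_mt_struct_stripped];
            for (int i = 0; i < MT_RNG_COUNT; i++)
            {
                // readFully blocks until one whole struct has been read,
                // whereas a plain read() may return fewer bytes
                dis.readFully(buffer);
                h_MT.put(buffer);
            }
            h_MT.position(0);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        finally
        {
            // Close the stream on success as well as on failure
            if (dis != null)
            {
                try
                {
                    dis.close();
                }
                catch (IOException ex)
                {}
            }
        }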
    }

    /**
     * Initialize the twister with the given seed.
     * 
     * @param seed The seed
     */
    private static void seedMTGPU(int seed)
    {
        int i;
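        // Byte buffers are big-endian by default. Use the native byte
        // order, so that putInt below writes the seed in the same layout
        // as the raw struct bytes that are copied from the data file
        ByteBuffer MT = ByteBuffer.allocateDirect(
            MT_RNG_COUNT * sizeof_mt_struct_stripped).order(
                java.nio.ByteOrder.nativeOrder());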
        byte buffer[] = new byte[sizeof_mt_struct_stripped];
        for (i = 0; i < MT_RNG_COUNT; i++)
        {
            // In the original example, this is simply an assignment:
            // MT[i] = h_MT[i];
            // Since the data here is not stored in arrays of 
            // structures, but in direct buffers, the data
            // is copied from the first buffer and stored in
            // the second
            h_MT.get(buffer);
            MT.put(buffer);
            
            // The last field of the structure is the int
            // that stores the seed. In the original 
            // example, this is just an assignment
            // MT[i].seed = seed;
            // Since the structures are stored in byte buffers
            // this value is set manually here:
            MT.position(MT.position() - Sizeof.INT);
            MT.putInt(seed);
        }

        // Copy the current data to the global 'ds_MT' variable
        // of the module
        CUdeviceptr ds_MT = new CUdeviceptr();
        cuModuleGetGlobal(ds_MT, new int[1], module, "ds_MT");
        cuMemcpyHtoD(ds_MT, Pointer.to(MT), 
            MT_RNG_COUNT * sizeof_mt_struct_stripped);
    }

}

If it turns out to be helpful, I may also upload it on the website.
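
For anyone wondering how the buffer sizes in the example come about: here is a small stand-alone sketch of the arithmetic behind N_PER_RNG and RAND_N. The class name SampleSizes is only made up for this illustration, and divUp/alignUp are written in a more compact form than in the sample, but they compute the same values for positive inputs.

```java
// Hypothetical helper class, only for illustrating the size computation
class SampleSizes
{
    // Quotient a/b, rounded up (for positive a and b)
    static int divUp(int a, int b)
    {
        return (a + b - 1) / b;
    }

    // Smallest multiple of b that is not smaller than a
    static int alignUp(int a, int b)
    {
        return divUp(a, b) * b;
    }

    public static void main(String[] args)
    {
        int pathN = 24000;      // PATH_N in the sample
        int rngCount = 4096;    // MT_RNG_COUNT in the sample

        // 24000 / 4096 is about 5.86, which is rounded up to 6.
        // 6 is already even, so aligning to 2 leaves it unchanged.
        int nPerRng = alignUp(divUp(pathN, rngCount), 2);
        int randN = rngCount * nPerRng;

        System.out.println("N_PER_RNG = " + nPerRng); // 6
        System.out.println("RAND_N    = " + randN);   // 24576
    }
}
```

So the device buffer holds 24576 floats, slightly more than the 24000 requested samples. The even count per generator matters because the Box-Muller step consumes the numbers in pairs.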

By the way: I already addressed the issue of handling structs in the context of JOCL, but I think that it is very similar for JOCL and for JCuda, so I’ll possibly upload some utility classes for struct handling soon. Another item in the queue :wink:
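
Just to give an idea of what such struct handling could look like, here is a minimal sketch for the mt_struct_stripped from the example. The class MtStructStripped and its methods are hypothetical and not part of JCuda or JOCL. The sketch assumes that the struct consists of four tightly packed 32-bit ints, and that the host byte order is the one the device expects.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical Java-side view of one mt_struct_stripped record
class MtStructStripped
{
    // Four 32-bit unsigned ints, as in the C typedef
    static final int SIZE = 4 * Integer.BYTES;

    int matrixA;
    int maskB;
    int maskC;
    int seed;

    // Allocates a direct buffer for 'count' records. The native byte
    // order is set explicitly, because buffers default to big-endian
    static ByteBuffer allocate(int count)
    {
        return ByteBuffer.allocateDirect(count * SIZE)
            .order(ByteOrder.nativeOrder());
    }

    // Writes this record at the given record index. Absolute puts are
    // used, so the position of the buffer is left untouched
    void writeTo(ByteBuffer buffer, int index)
    {
        int base = index * SIZE;
        buffer.putInt(base, matrixA);
        buffer.putInt(base + 4, maskB);
        buffer.putInt(base + 8, maskC);
        buffer.putInt(base + 12, seed);
    }

    // Reads the record at the given record index
    static MtStructStripped readFrom(ByteBuffer buffer, int index)
    {
        int base = index * SIZE;
        MtStructStripped s = new MtStructStripped();
        s.matrixA = buffer.getInt(base);
        s.maskB = buffer.getInt(base + 4);
        s.maskC = buffer.getInt(base + 8);
        s.seed = buffer.getInt(base + 12);
        return s;
    }

    public static void main(String[] args)
    {
        ByteBuffer buffer = allocate(4096);

        // Equivalent of the C assignment "MT[10].seed = 777;"
        MtStructStripped s = readFrom(buffer, 10);
        s.seed = 777;
        s.writeTo(buffer, 10);

        System.out.println("Seed of record 10: " + readFrom(buffer, 10).seed);
    }
}
```

A buffer that is filled this way can be passed to cuMemcpyHtoD via Pointer.to(buffer), as in the example. For structs with mixed field types, the padding that the C compiler inserts would have to be considered as well, which this sketch does not attempt.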

Thanks for putting together this example Marco!

I haven’t quite gotten it working, mainly due to the cubin file:

I’m only working in Java (i.e., I haven’t gotten the original example running in VS), and tried to use your prepareCubinFile(MersenneTwister_kernel.cu) with the ‘-keep’ parameter added to get the appropriate .cubin file. This didn’t work, however. Any suggestions?

Thanks again

Hello

Unfortunately, CUDA code cannot be compiled without a C compiler, so you still need, for example, Visual Studio or GCC to compile the CUBIN files.

In case you have a 32bit machine, I have uploaded the CUBIN file at http://jcuda.org/samples/MersenneTwister.compute_10.sm_10.cubin. This will not work on a 64bit system, but maybe I’ll have the chance to upload a 64bit CUBIN by next week.

bye
Marco

Hi Marco,

I am using 64bit and also have VS’08 installed, but I wanted to compile the .cu and associated files straight from Java, using the prepareCubinFile() method or something similar. In the past, I’ve only used the prepareCubinFile() method for a single .cu file, where all my CUDA method calls are within that .cu file. So I’m guessing I just need to figure out which extra command line arguments to pass to prepareCubinFile() to get the Mersenne Twister example working?

Thanks again

Ah, OK, I see. The MersenneTwister example is not so simple: It does not contain all the required code in a single .cu file. It includes MersenneTwister.h, has the kernels in a separate file, and additionally references “shrUtils.h” and some “cutils*” headers.

It is in general possible to add the required paths as include paths to the command line in the “prepareCubinFile” method. The files
MersenneTwister.cu
MersenneTwister.h
MersenneTwister_gold.cpp
MersenneTwister_kernel.cu
have to be in the project root directory, and the include paths may be added like this:


        String includes = 
            "-I\"YOURPATHTO/NVIDIA GPU Computing SDK/shared/inc\" "+
            "-I\"YOURPATHTO/NVIDIA GPU Computing SDK/C/common/inc\" "+
            "-I\"YOURPATHTO/Microsoft SDKs/Windows/v6.0A/include\" ";
            
        String command = 
            "nvcc " + modelString + " -arch sm_11 -cubin "+includes+" "+
            cuFile.getPath()+" -o "+cubinFileName;

(replacing “YOURPATHTO” with the respective path on your system)

But of course, this is only a crude workaround: hard-coded, system-specific paths are never nice… -_- A more elegant solution would be to extract the relevant parts from MersenneTwister_kernel.cu and put them together in a single file that can be compiled with the original “prepareCubinFile” method. Maybe I’ll find the time to do this next month; then this could also be uploaded as an example. For the time being, hopefully the extended include directories in the command line already allow you to do what you intended.
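
To make the workaround a bit less crude, the nvcc command could be assembled from a list of include directories instead of one concatenated string. The following is only a sketch: the class NvccCommand and its build method are made up for this illustration, and the directories in the main method are placeholders. The nvcc options themselves (-m32/-m64, -arch, -cubin, -I, -o) are the same ones as in the command above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles the nvcc command line
class NvccCommand
{
    static List<String> build(String cuFile, String cubinFile,
        String arch, int modelBits, List<String> includeDirs)
    {
        List<String> command = new ArrayList<String>();
        command.add("nvcc");
        command.add("-m" + modelBits); // -m32 or -m64
        command.add("-arch");
        command.add(arch);
        command.add("-cubin");
        for (String dir : includeDirs)
        {
            // Each list element is passed to the process as one
            // argument, so paths with spaces need no extra quotes
            command.add("-I" + dir);
        }
        command.add(cuFile);
        command.add("-o");
        command.add(cubinFile);
        return command;
    }

    public static void main(String[] args)
    {
        // Placeholder directories; replace them with the real SDK paths
        List<String> includes = Arrays.asList(
            "YOURPATHTO/NVIDIA GPU Computing SDK/shared/inc",
            "YOURPATHTO/NVIDIA GPU Computing SDK/C/common/inc");
        List<String> command = build("MersenneTwister.cu",
            "MersenneTwister.cubin", "sm_11", 64, includes);
        System.out.println(command);

        // The list could then be started as a process with
        // new ProcessBuilder(command).start();
    }
}
```

The include directories could then come from a configuration file or a system property, so that at least the Java code itself stays free of system-specific paths.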

Hi Marco,

Thanks for the reply. It works now using the NVCC arguments you provided. I just added:

private static String nvccArgs =
    "--keep " +
    "-I\"C:/CUDA/CUDA_SDK/shared/inc\" " +
    "-I\"C:/CUDA/CUDA_SDK/C/common/inc\" " +
    "-I\"C:/Program Files (x86)/Microsoft SDKs/Windows/v6.0A/include\" ";

And put all the MersenneTwister files (.h, .cu, .cpp) in the same project folder which the prepareCubinFile() method pointed to. E.g.,

String cubinFileName = JCudaUtils.prepareCubinFile("cudaFiles/MersenneTwister.cu", nvccArgs);

Thanks again! :slight_smile:

Fine.

By the way: The functionality of the “prepareCubinFile” method has already been extracted into the “KernelLauncher” class available on http://jcuda.org/samples/samples.html , and there you already have the possibility to specify additional nvcc arguments. This class should currently only be considered as a sample and not as an “official” utility class, but maybe one day it will (slightly modified) be part of a larger utilities package…

Hi Marco,

Sorry that I used JCudaUtils.prepareCubinFile() in the example above instead of KernelLauncher.prepareCubinFile(). JCudaUtils is my own utility class with some useful methods I’ve been collecting, so I hope this thread didn’t give folks the wrong impression that there’s currently an official JCudaUtils class!

I do find the method quite handy, so I don’t have to separately compile the .cu file every time. I have extended it to compare the age of the .cu file with that of the .cubin file (if (cuFile.lastModified() < cubinFile.lastModified()) …), so I don’t unnecessarily recompile every time, and don’t have to remember to force a recompile as in the KernelLauncher method.
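
For anyone who wants to do the same, the check is roughly the following. This is only a sketch of the idea, not the code from my utility class: the names CubinCache and needsCompile are made up here, and the file names in the main method are just examples.

```java
import java.io.File;

// Hypothetical helper for deciding whether nvcc has to be called again
class CubinCache
{
    // Returns true if the CUBIN file is missing, or if it is not newer
    // than the .cu file that it was compiled from
    static boolean needsCompile(File cuFile, File cubinFile)
    {
        if (!cubinFile.exists())
        {
            return true;
        }
        return cuFile.lastModified() >= cubinFile.lastModified();
    }

    public static void main(String[] args)
    {
        File cuFile = new File("cudaFiles/MersenneTwister.cu");
        File cubinFile = new File("cudaFiles/MersenneTwister.cubin");
        if (needsCompile(cuFile, cubinFile))
        {
            System.out.println("Compiling " + cuFile + "...");
            // Call nvcc here, for example via prepareCubinFile
        }
        else
        {
            System.out.println("Using existing " + cubinFile);
        }
    }
}
```

One limitation: this only looks at the .cu file itself. If a header that it includes (like MersenneTwister.h) changes, the check will not notice, so a forced recompile is still needed in that case.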

Thanks again for your hard work with jCuda, it’s becoming extremely useful to my work.

There are several tasks which may be tedious with the pure low-level CUDA API. When I find the time to collect some useful methods and utility classes into a package, maybe you want to contribute the ones you already created…? :slight_smile: