Question about JCudppSample.java in JCuda

Hi,

I am using JCudppSample.java to do sorting arrays. And I would like to sort keys and value, however the JCudppSample.java is just for keys only.

If I want to sort keys and values , what should I do?

It has exception if I just change the code from

config.options = CUDPPOption.CUDPP_OPTION_KEYS_ONLY;

to

config.options = CUDPPOption.CUDPP_OPTION_KEY_VALUE_PAIRS;

Any samples code for sorting keys and value?
Thanks everyone.

Regards,
Lemon

Hello Lemon,

You also have to provide an array of values, and pass it in to the CUDPP sort call.

Here’s a small example:

import java.util.*;

import jcuda.*;
import jcuda.jcudpp.*;
import jcuda.runtime.*;

/**
 * This is a sample class demonstrating the application of JCudpp for
 * performing a sort of key-value pairs
 */
public class JCudppSortPairsSample
{
    public static void main(String args[])
    {
    	JCuda.setExceptionsEnabled(true);
    	JCudpp.setExceptionsEnabled(true);

    	int n = 10;

        System.out.println("Creating input data");
        int h_keys[] = createRandomIntData(n);
        float h_values[] = createRandomFloatData(n);

        System.out.println("input  keys  : "+Arrays.toString(h_keys));
        System.out.println("input  values: "+Arrays.toString(h_values));
        
        // Allocate memory on the device and copy host data to device
        System.out.println("Copying input data to device");
        Pointer d_keys = new Pointer();
        JCuda.cudaMalloc(d_keys, n * Sizeof.INT);
        JCuda.cudaMemcpy(d_keys, Pointer.to(h_keys), n * Sizeof.INT, 
                cudaMemcpyKind.cudaMemcpyHostToDevice);

        Pointer d_values = new Pointer();
        JCuda.cudaMalloc(d_values, n * Sizeof.FLOAT);
        JCuda.cudaMemcpy(d_values, Pointer.to(h_values), n * Sizeof.FLOAT, 
                cudaMemcpyKind.cudaMemcpyHostToDevice);
        
        // Create a CUDPPConfiguration for a radix sort of
        // an array of ints 
        System.out.println("Setting up sort plan");
        CUDPPConfiguration config = new CUDPPConfiguration();
        config.algorithm = CUDPPAlgorithm.CUDPP_SORT_RADIX;
        config.datatype = CUDPPDatatype.CUDPP_UINT;
        config.op = CUDPPOperator.CUDPP_ADD;
        config.options = CUDPPOption.CUDPP_OPTION_KEY_VALUE_PAIRS;
        
        // Create a CUDPPHandle for the sort operation
        CUDPPHandle handle = new CUDPPHandle();
        JCudpp.cudppPlan(handle, config, n, 1, 0);
        
        // Execute the sort operation
        System.out.println("Executing CUDPP sort");
        JCudpp.cudppSort(handle, d_keys, d_values, 32, n);

        Arrays.fill(h_keys, 0);
        Arrays.fill(h_values, 0);
        
        // Copy the result from the device to the host
        System.out.println("Copying result data back to host");
        JCuda.cudaMemcpy(Pointer.to(h_keys), d_keys, n * Sizeof.INT, 
                cudaMemcpyKind.cudaMemcpyDeviceToHost);
        JCuda.cudaMemcpy(Pointer.to(h_values), d_values, n * Sizeof.FLOAT, 
                cudaMemcpyKind.cudaMemcpyDeviceToHost);

        // Print the results
        System.out.println("result keys  : "+Arrays.toString(h_keys));
        System.out.println("result values: "+Arrays.toString(h_values));
        
        // Clean up
        JCudpp.cudppDestroyPlan(handle);
        JCuda.cudaFree(d_keys);
        JCuda.cudaFree(d_values);
    }

    private static Random random = new Random(0);

    /**
     * Creates an array of the specified size, containing
     * the numbers from 0 to n-1, shuffled randomly  
     */
    private static int[] createRandomIntData(int n)
    {
        int x[] = new int[n];
        for (int i = 0; i < n; i++)
        {
            x** = i;
        }
        for (int i=0; i<n; i++)
        {
        	int i0 = random.nextInt(n);
        	int i1 = random.nextInt(n);
        	int temp = x[i0];
        	x[i0] = x[i1];
        	x[i1] = temp;
        }
        return x;
    }

    /**
     * Creates an array of the specified size, containing some random data
     */
    private static float[] createRandomFloatData(int n)
    {
        float x[] = new float[n];
        for (int i = 0; i < n; i++)
        {
            x** = random.nextFloat();
        }
        return x;
    }

    
}

Hi Marco,

Thanks again for your help.

One more question about cudaEventElapsedTime(). If i want to calculate the time spend to JCudpp.cudppSort, how to use cudaEventElapsedTime to count the time??

Lemon

Hello Lemon,

Since there are no asynchronous operations involved (and would not make any sense, since the CUDPP call can not be associated with a stream) it is not really necessary to measure the execution time using CUDAs event mechanism.

However, the appended example sorts arrays with 100k…100M elements, and compares the times

  • for sorting with Java
  • for sorting with JCudpp, measured with Java (using System.nanotime())
  • for sorting with JCudpp, measured with CUDA (using cudaEventElapsedTime())
    It shows that the time measured with Java is pretty close to the time measured with CUDA.

I’m currently sitting at a
CPU: Intel Core 2 Quad @ 2.5GHz, Java 1.6
GPU: NVIDIA GeForce GTX 280
computer, and for larger arrays, the JCudpp sort is approximately 3 times faster than the Java sort implementation. This is, however, not a real “benchmark”!!!. On the one hand, to reliably measure the time, multiple sorting passes with different configurations would have to be performed, since the running time of a sorting algorithm may heavily depend on the sortedness of the input data. On the other hand, the JCudpp sort is actually not intended as a “drop in replacement” for Arrays.sort. Much of the time in the example is spent for the data transfer between the host and the device (and the example also prints the time that is acutally spent for the JCudpp sorting call).

CUDPP is a library that provides efficient implementations for common parallel tasks, which usually are performed as individual steps in a more complex sequence of operations. I assume that the real potential of JCudpp will be unveiled when it is possible to use the sort function for data that already is stored on the device, and manipulated with own kernels that are loaded and executed with JCuda. This is NOT yet possible, due to the missing possiblity to mix runtime- and driver API calls in CUDA 2.3. But with CUDA 3.0 this will be possible, and first tests show that it will most likely be possible with JCuda for CUDA 3.0 as well.

Note that this program should be started with something like
java -Xmx1000m JCudppSampleTiming
since it allocates some large arrays.


import java.util.*;

import jcuda.*;
import jcuda.jcudpp.*;
import jcuda.runtime.*;

/**
 * An example that sorts arrays of different sizes with Java and
 * with JCudpp, and compares the execution times, which are 
 * measured via the CUDA event mechanism. <br />
 * <br />
 * This should not be considered as a benchmark.
 */
public class JCudppSampleTiming
{
    public static void main(String args[])
    {
        // Early initialization of the native libraries.
        // This is not really required and will be done
        // automatically during the first call, but is
        // used here to avoid measuring the time that
        // is required for the initialization.
        JCuda.initialize();
        JCudpp.initialize();

        testSort(100000);
        testSort(1000000);
        testSort(10000000);
        testSort(100000000);
    }
    
    /**
     * Test the JCudpp sort operation for an array of size n
     * 
     * @param n The array size
     */
    public static boolean testSort(int N)
    {
        System.out.println("Creating input data, consisting of "+N+" elements");
        int array[] = createRandomIntData(N);
        int arrayRef[] = array.clone();

        // Sort with Java, and measure and print the time
        long before = System.nanoTime();
        Arrays.sort(arrayRef);
        long after = System.nanoTime();
        double durationMS = (after-before) / 1e6;
        System.out.println("Elapsed time for Java   sort measured with Java "+durationMS);

        // Sort with JCudpp, and measure and print the time
        sort(array);
        
        boolean passed = Arrays.equals(array, arrayRef);
        System.out.println("testSort "+(passed?"PASSED":"FAILED")); 
        return passed;
    }
    
    /**
     * Implementation of sort using JCudpp
     * 
     * @param array The array to sort
     */
    private static void sort(int array[])
    {
        int n = array.length;
        
        // Create the events that will be used for measuring the total time
        cudaEvent_t startTotal = new cudaEvent_t();
        cudaEvent_t stopTotal = new cudaEvent_t();
        JCuda.cudaEventCreate(startTotal);
        JCuda.cudaEventCreate(stopTotal);

        // Create the events that will be used for measuring the sort time
        cudaEvent_t startSort = new cudaEvent_t();
        cudaEvent_t stopSort = new cudaEvent_t();
        JCuda.cudaEventCreate(startSort);
        JCuda.cudaEventCreate(stopSort);

        // Start the event recording
        JCuda.cudaEventRecord(startTotal, null);
        long before = System.nanoTime();
        
        // Allocate memory on the device
        Pointer d_keys = new Pointer();
        JCuda.cudaMalloc(d_keys, n * Sizeof.INT);
        
        // Copy the input array from the host to the device
        JCuda.cudaMemcpy(d_keys, Pointer.to(array), n * Sizeof.INT, 
            cudaMemcpyKind.cudaMemcpyHostToDevice);
        
        // Create a CUDPPConfiguration for a radix sort of
        // an array of ints 
        CUDPPConfiguration config = new CUDPPConfiguration();
        config.algorithm = CUDPPAlgorithm.CUDPP_SORT_RADIX;
        config.datatype = CUDPPDatatype.CUDPP_UINT;
        config.op = CUDPPOperator.CUDPP_ADD;
        config.options = CUDPPOption.CUDPP_OPTION_KEYS_ONLY;
        
        // Create a CUDPPHandle for the sort operation
        CUDPPHandle handle = new CUDPPHandle();
        JCudpp.cudppPlan(handle, config, n, 1, 0);
        
        // Execute the sort operation
        JCuda.cudaEventRecord(startSort, null);
        JCudpp.cudppSort(handle, d_keys, null, 32, n);
        JCuda.cudaEventRecord(stopSort, null);
        JCuda.cudaEventSynchronize(stopSort);
        float elapsedTimeSort[] = new float[1];
        JCuda.cudaEventElapsedTime(elapsedTimeSort, startSort, stopSort);

        // Copy the result from the device to the host
        JCuda.cudaMemcpy(Pointer.to(array), d_keys, n * Sizeof.INT, 
            cudaMemcpyKind.cudaMemcpyDeviceToHost);

        // Compute and print the elapsed time for the events
        JCuda.cudaEventRecord(stopTotal, null);
        JCuda.cudaEventSynchronize(stopTotal);
        float elapsedTimeTotal[] = new float[1];
        JCuda.cudaEventElapsedTime(elapsedTimeTotal, startTotal, stopTotal);
        long after = System.nanoTime();
        double durationMS = (after-before) / 1e6;
        System.out.println("Elapsed time for JCudpp sort measured with CUDA "+
            elapsedTimeTotal[0]+" (sorting: "+elapsedTimeSort[0]+")");
        System.out.println("Elapsed time for JCudpp sort measured with CUDA "+
            durationMS);
        
        // Clean up
        JCuda.cudaEventDestroy(startTotal);
        JCuda.cudaEventDestroy(stopTotal);        
        JCuda.cudaEventDestroy(startSort);
        JCuda.cudaEventDestroy(stopSort);        
        JCudpp.cudppDestroyPlan(handle);
        JCuda.cudaFree(d_keys);
        
    }

    /**
     * Creates an array of the specified size, containing some random data
     */
    private static int[] createRandomIntData(int n)
    {
        Random random = new Random(0);
        int x[] = new int[n];
        for (int i = 0; i < n; i++)
        {
            x** = random.nextInt(10);
        }
        return x;
    }

    
}

Hi Marco,

Thanks your quick reply. It’s a great support for me.
Thanks very much.

Lemon