Error while loading native library with base name "JCudpp"

Dippo · January 17, 2011 at 09:10

Hi,

I got JCuda working in Java Processing, but I get an error in the Cuda Sort example.

import jcuda.*;
import jcuda.jcudpp.*;
import jcuda.runtime.*;
import jcuda.runtime.JCuda.*;

/**
 * This is a sample class demonstrating the application of JCudpp for
 * performing a sort of an integer array with 1000000 elements.
 */
void setup() {
  noLoop();
}

void draw() {    
  testSort(1000000);
}

/**
 * Test the JCudpp sort operation for an array of size n
 *
 * @param n The array size
 */
static boolean testSort(int N) {
  println("Creating input data");
  int array[] = createRandomIntData(N);
  int arrayRef[] = array.clone();
  println("Performing sort with Java...");
  Arrays.sort(arrayRef);
  println("Performing sort with JCudpp...");
  Cuda_sort(array);
  boolean passed = Arrays.equals(array, arrayRef);
  println("testSort "+(passed?"PASSED":"FAILED"));
  return passed;
}

/**
 * Implementation of sort using JCudpp
 *
 * @param array The array to sort
 */
static void Cuda_sort(int array[]) {
  int n = array.length;
  // Allocate memory on the device
  Pointer d_keys = new Pointer();
  JCuda.cudaMalloc(d_keys, n * Sizeof.INT);
  // Copy the input array from the host to the device
  JCuda.cudaMemcpy(d_keys, Pointer.to(array), n * Sizeof.INT,
  cudaMemcpyKind.cudaMemcpyHostToDevice);
  // Create a CUDPPConfiguration for a radix sort of
  // an array of ints
  CUDPPConfiguration config = new CUDPPConfiguration();
  config.algorithm = CUDPPAlgorithm.CUDPP_SORT_RADIX;
  config.datatype = CUDPPDatatype.CUDPP_UINT;
  config.op = CUDPPOperator.CUDPP_ADD;
  config.options = CUDPPOption.CUDPP_OPTION_KEYS_ONLY;
  // Create a CUDPPHandle for the sort operation
  CUDPPHandle handle = new CUDPPHandle();
  //JCudpp.cudppPlan(handle, config, n, 1, 0);
  // Execute the sort operation
  JCudpp.cudppSort(handle, d_keys, null, 32, n);
  Arrays.fill(array, 0);
  // Copy the result from the device to the host
  JCuda.cudaMemcpy(Pointer.to(array), d_keys, n * Sizeof.INT,
  cudaMemcpyKind.cudaMemcpyDeviceToHost);
  // Clean up
  //JCudpp.cudppDestroyPlan(handle);
  JCuda.cudaFree(d_keys);
}

/**
 * Creates an array of the specified size, containing some random data
 */
static int[] createRandomIntData(int n) {
  Random random = new Random(0);
  int x[] = new int[n];
  for (int i = 0; i < n; i++) {
    x[i] = random.nextInt(10);
  }
  return x;
}

The result from the console:

Creating input data
Performing sort with Java...
Performing sort with JCudpp...
Error while loading native library with base name "JCudpp"
Operating system name: Windows 7
Architecture : amd64
Architecture bit size: 64
processing.app.debug.RunnerException: UnsatisfiedLinkError: Could not load native library
at processing.app.Sketch.placeException(Sketch.java:1543)
at processing.app.debug.Runner.findException(Runner.java:582)
at processing.app.debug.Runner.reportException(Runner.java:558)
at processing.app.debug.Runner.exception(Runner.java:498)
at processing.app.debug.EventThread.exceptionEvent(EventThread.java:367)
at processing.app.debug.EventThread.handleEvent(EventThread.java:255)
at processing.app.debug.EventThread.run(EventThread.java:89)
Exception in thread "Animation Thread" java.lang.UnsatisfiedLinkError: Could not load native library
at jcuda.LibUtils.loadLibrary(LibUtils.java:79)
at jcuda.jcudpp.JCudpp.assertInit(JCudpp.java:175)
at jcuda.jcudpp.JCudpp.cudppSort(JCudpp.java:489)
at CUDA_sort.Cuda_sort(CUDA_sort.java:86)
at CUDA_sort.testSort(CUDA_sort.java:56)
at CUDA_sort.draw(CUDA_sort.java:41)
at processing.core.PApplet.handleDraw(Unknown Source)
at processing.core.PApplet.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)

It goes wrong on the line: JCudpp.cudppSort(handle, d_keys, null, 32, n);
The only thing I can think of is that I am missing a file called cudpp64_32_16.dll, because this is the only file I don’t have as a DLL. But I am not sure.
Does anybody know?

Greetings, Dippo

Marco13 · January 17, 2011 at 12:08

Hello

CUDPP is not developed by NVIDIA. This is why the required DLLs are not installed together with the NVIDIA CUDA Toolkit.

But fortunately, the required DLL is contained in the CUDA SDK, because CUDPP is used in some of the SDK examples. If you install the CUDA SDK, then the required DLL should be somewhere in “…\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\bin\win64\Release\cudpp64_32_16.dll”. But if you do not specifically need the functionalities of CUDPP, you don’t have to install the SDK, of course.
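To check whether the DLL is actually visible to the sketch, something like the following may help. This is only a minimal sketch in plain Java: the class name `NativeLibraryCheck` and the helper `findOnPath` are made up for this illustration and are not part of JCuda. It looks for the file in the directories listed in `java.library.path` and in the `PATH` environment variable, on the assumption that Windows resolves a dependent DLL such as cudpp64_32_16.dll through its own search order, which includes `PATH`.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class NativeLibraryCheck {

    /**
     * Hypothetical helper: returns every file named fileName that exists
     * in one of the directories of the given path list.
     */
    static List<File> findOnPath(String pathList, String fileName) {
        List<File> hits = new ArrayList<File>();
        if (pathList == null || pathList.isEmpty()) {
            return hits;
        }
        for (String dir : pathList.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The DLL name is taken from the thread; pass another name as argument if needed
        String dllName = args.length > 0 ? args[0] : "cudpp64_32_16.dll";

        // Where the JVM looks for libraries loaded via System.loadLibrary
        System.out.println("java.library.path: "
            + findOnPath(System.getProperty("java.library.path"), dllName));

        // Where the operating system may look for dependent libraries
        System.out.println("PATH: "
            + findOnPath(System.getenv("PATH"), dllName));
    }
}
```

If both lists are empty, the DLL is probably not in a place where it can be found, and copying it next to the other JCuda DLLs would be the first thing to try.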

bye
Marco

Dippo · January 17, 2011 at 16:38

Hi,

Thank you, it works. Is it OK if I use your code for a demonstration on www.processing.org?

import jcuda.*;
import jcuda.jcudpp.*;
import jcuda.runtime.*;
import jcuda.runtime.JCuda.*;
int N, JstartTime, JendTime, CstartTime, CendTime;
/**
 * This is a sample class demonstrating the application of JCudpp for
 * performing a sort of an integer array whose number of elements
 * depends on the screen size. The screen size is resizable!
 *
 * Note for Processing users: The CUDPP DLL that JCudpp needs can be
 * found in the CUDA SDK. If you install the CUDA SDK, then the required DLL is in
 * "...\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\common\bin"
 * The cudpp64_32_16.dll is the one that is needed in
 * "C:\Users\<user>\Documents\Processing\libraries\jcuda\library"
 *
 * Copyright 2009 Marco Hutter - http://www.jcuda.org
 * Edited by Dippo - http://processing.org/
 */

void setup() {
  size(500,500,JAVA2D);      
  frame.setResizable(true);
  PFont font;
  font = createFont("ArialMT-48",32);   
  textFont(font);
}

void draw() {    
  frame.setTitle("Frame Number "+str(frameCount));   
  int N=width*height; 
  println("Creating input data");
  int array[] = createRandomIntData(N);
  int arrayRef[] = array.clone();  
  print("Performing sort with Java...");
  JstartTime=millis();
  Arrays.sort(arrayRef);  
  JendTime=millis();
  println("Java sort : "+(JendTime-JstartTime));    
  print("Performing sort with JCudpp...");  
  CstartTime=millis();
  Cuda_sort(array);
  CendTime=millis();
  println("CUDA sort : "+(CendTime-CstartTime));
  loadPixels();
  for (int i=0;i<width*height/2;++i) { // upper half by Java sort = arrayRef[]
    pixels[i]=color(arrayRef[i],arrayRef[i],arrayRef[i]);
    // lower half by CUDA sort = array[]
    pixels[i+width*height/2]=color(array[i+width*height/2],
    array[i+width*height/2],array[i+width*height/2]);
  }
  updatePixels();    
  fill(0,210,210,151);  
  text(width*height+" random numbers",50,100);
    fill(0,102,153,151);  
  text("Java sort : "+(JendTime-JstartTime)+" milliseconds", 50,200);
  fill(0,102,153,121);
  text("CUDA sort : "+(CendTime-CstartTime)+" milliseconds", 50,400);
  fill(0,102,153,121);
  text("<<Resize the screen>>", 50,480);
  boolean passed = Arrays.equals(array, arrayRef);
  println("testSort "+(passed?"PASSED":"FAILED"));
}

/**
 * Implementation of sort using JCudpp
 *
 * @param array The array to sort
 */
static void Cuda_sort(int array[]) {
  int n = array.length;  
  // Allocate memory on the device
  Pointer d_keys = new Pointer();
  JCuda.cudaMalloc(d_keys, n * Sizeof.INT);
  // Copy the input array from the host to the device
  JCuda.cudaMemcpy(d_keys, Pointer.to(array), n * Sizeof.INT,
  cudaMemcpyKind.cudaMemcpyHostToDevice);
  // Create a CUDPPConfiguration for a radix sort of
  // an array of ints
  CUDPPConfiguration config = new CUDPPConfiguration();
  config.algorithm = CUDPPAlgorithm.CUDPP_SORT_RADIX;
  /** 
   * Algorithms supported by CUDPP. Used to create appropriate plans using cudppPlan. 
   * CUDPP_COMPACT Compact
   * CUDPP_RAND_MD5 Pseudo Random Number Generator using MD5 hash algorithm
   * CUDPP_REDUCE Reduction
   * CUDPP_SCAN Scan
   * CUDPP_SEGMENTED_SCAN Segmented scan
   * CUDPP_SORT_INVALID Placeholder at end of enum
   * CUDPP_SORT_RADIX Radix sort within chunks, merge sort to merge chunks together
   * CUDPP_SPMVMULT Sparse matrix - vector multiplication
   */
  config.datatype = CUDPPDatatype.CUDPP_UINT;
  /**
   * Datatypes supported by CUDPP algorithms. 
   * CUDPP_CHAR   Character type (C char) - Closest Java type: byte
   * CUDPP_FLOAT  Float type (C float) - Closest Java type: float
   * CUDPP_INT    Integer type (C int) - Closest Java type: int
   * CUDPP_UCHAR  Unsigned character (byte) type (C unsigned char) - 
   *              Closest Java type: byte
   * CUDPP_UINT   Unsigned integer type (C unsigned int) - Closest Java type: int
   */
  config.op = CUDPPOperator.CUDPP_ADD;
  config.options = CUDPPOption.CUDPP_OPTION_KEYS_ONLY;
  // Create a CUDPPHandle for the sort operation
  CUDPPHandle handle = new CUDPPHandle();
  JCudpp.cudppPlan(handle, config, n, 1, 0);
  // cudppPlan(CUDPPHandle planHandle, CUDPPConfiguration config, 
  // long n, long rows, long rowPitch)          
  // Execute the sort operation
  JCudpp.cudppSort(handle, d_keys, null, 32, n);
  Arrays.fill(array, 0);
  // Copy the result from the device to the host
  JCuda.cudaMemcpy(Pointer.to(array), d_keys, n * Sizeof.INT,
  cudaMemcpyKind.cudaMemcpyDeviceToHost);
  // Clean up
  JCudpp.cudppDestroyPlan(handle);
  JCuda.cudaFree(d_keys);
}

/**
 * Creates an array of the specified size, containing some random data
 */
int[] createRandomIntData(int n) {
  Random random = new Random(0);
  int x[] = new int[n];  
  for (int i = 0; i < n; i++) {
    x[i] = random.nextInt(255);
  }    
  return x;
}

Marco13 · January 18, 2011 at 02:42

Of course this is OK. Maybe it should rather be „Copyright by Dippo“ (and maybe „based on a sample from …“).

Admittedly, I’m not so familiar with Processing, and when I started the example, I wondered what the application was supposed to show. Then I commented out the „sort“ calls and saw the result: It’s an „interesting“ way of painting a gradient.

The real performance benefits of CUDA mainly show up when there is a large amount of data, to which a sequence of data-parallel, compute-intensive tasks is applied - like sorting, then normalizing, then inverting or something, … (all on the GPU) and afterwards obtaining the final result.

But I could imagine that there may be some („realistic“) application cases for CUDA in Processing, and it’s good to see an example that shows that the combination is working in general.

Dippo · January 18, 2011 at 08:22

Hi,

Thank you. I don’t want to have a discussion about stealing somebody else’s code. Is it OK if I convert all the other examples to Java Processing?

Processing is a graphical Java language, so combining CUDA with Processing is a win-win situation. One of the problems with Processing is speed. CUDA is the only option to solve that, as far as I can see.

The test results from my application can be misleading. In Processing there is an option to make an executable file (Windows/OSX/Linux) or an applet (the app runs in the browser), but neither option works. It compiles, but the executable and the applet don’t work. When I compile in the IDE, there is no error and everything works as intended. Anyhow, I don’t think that there is much of a difference.

If fewer than 100,000 records need to be sorted, Java sort is faster than CUDA sort. (Please note that I didn’t check the CPU usage yet. It can still be smarter to use CUDA instead of Java sort.)
If 250,000 records need to be sorted, CUDA sort is 2 times faster than Java sort.
If 1,600,000 records need to be sorted, CUDA sort is 3 times faster than Java sort.

Now that this is working, I can think about finding more uses for it. For example Conway’s Game of Life (http://www.openprocessing.org/visuals/?visualID=13658) or fractals (http://www.openprocessing.org/visuals/?visualID=333). The problem is that when I alter the algorithm in CUDA, the compiler gives an error and it doesn’t show which line is incorrect. From this point on I am depending on your examples and expertise, because I know they are working.

Thanks for your support! Dippo

Marco13 · January 18, 2011 at 10:09

Hello,

Maybe I should mention on the website that the samples are basically “public domain”. Credits are always appreciated, but the samples do not contain real intellectual property.

I know Processing in general, but just played with a few examples, and don’t know the details of deployment in Executables or as Applets. It may be difficult to use an Applet, because the platform-specific libraries are required on the client side (which may have an ATI card, by the way…). The executable could be a solution, but I’m not sure how this is solved. Which problems (error messages) did you run into with this approach?

BTW: A possible alternative to CUDA could be OpenCL - namely, one of the Java bindings for OpenCL, which are listed at http://jocl.org/. It has the advantage of being independent of the vendor and can even run on a simple CPU. An OpenCL installation (for example, the CUDA Toolkit) will still be required on the client side, but nevertheless, the distribution may be simpler as well: On https://oss.sonatype.org/content/repositories/releases/org/jocl/jocl/0.1.4c/ there is a JAR which also contains the native libraries (I have not yet uploaded this latest version to the website, but will do this soon).
OpenCL does not yet have so many runtime libraries, but the community is growing. Olivier Chafik recently started a collection of OpenCL kernels on http://code.google.com/p/nativelibs4java/source/browse/trunk/libraries/OpenCL/LibCL/src/main/resources/LibCL, which already cover some image processing routines.

How are you compiling the CUDA source codes? In general, the compiler should point out the Line where the error occurred…

bye
Marco

Dippo · January 19, 2011 at 08:53

Hi,

With applets you get a direct result, without separately downloading libraries, source code, etc. So making an applet gets people more interested.
Making an executable is not really important, because you will only do this when you have made an application that needs to run every day. No errors show up while compiling to an applet or an executable. I also asked in the Processing forum if anybody knows a solution. I rather think the problem lies in Processing than in your libraries, because when I compile in the IDE, everything works OK.

What I mean by compiling CUDA source code is that I alter your examples to get new results. For example, I saw that jcudpp also has a random number generator, so I altered the above code (with JCurandSample.Java as a reference). When I compile, the console tells me that a bug occurred and asks whether I want to report it. I would show you the exact text, but I’ve deleted my source code because I think that I made a mistake.

I will look into OpenCL. But currently, I am impressed with CUDA.
Btw, I found a small error in your code. Remember that I said in my previous reply that Java Sort is faster than Cuda Sort?
It’s because of this line:
private static int[] createRandomIntData(int n)
…
x[i] = random.nextInt(10);

So, 1,000,000 numbers ranging from 0 to 9. If you change it to nextInt(255), Java Sort is still faster with 100,000 numbers. BUT when you change it to nextInt(2147483647), Cuda Sort is three times faster with 100,000 numbers, and TEN times faster with 1,600,000 numbers!
Normally, values up to 255 are enough for me, but seeing this… I am having nightmares…
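The CPU side of this observation can be reproduced without a GPU. The following is a rough sketch, not a careful benchmark: it times `Arrays.sort` for the three value ranges mentioned above. The class name `SortRangeTiming`, the helper names, the warm-up run and the number of repetitions are all choices made for this illustration; the timings will depend on the machine and the Java version, so no expected output is given.

```java
import java.util.Arrays;
import java.util.Random;

class SortRangeTiming {

    /** Creates n pseudo-random ints in the range [0, bound). */
    static int[] createRandomIntData(int n, int bound, long seed) {
        Random random = new Random(seed);
        int[] x = new int[n];
        for (int i = 0; i < n; i++) {
            x[i] = random.nextInt(bound);
        }
        return x;
    }

    /** Sorts a copy of the data (the input stays untouched) and returns the time in ns. */
    static long timeSortNanos(int[] data) {
        int[] copy = data.clone();
        long start = System.nanoTime();
        Arrays.sort(copy);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1600000;
        int[] bounds = { 10, 255, Integer.MAX_VALUE };
        for (int bound : bounds) {
            int[] data = createRandomIntData(n, bound, 0);
            timeSortNanos(data); // warm-up run, result ignored
            long best = Long.MAX_VALUE;
            for (int run = 0; run < 5; run++) {
                best = Math.min(best, timeSortNanos(data));
            }
            System.out.println("nextInt(" + bound + "): "
                + (best / 1000000.0) + " ms (best of 5)");
        }
    }
}
```

Comparing these numbers with the CUDA timings for the same `n` should show whether the speedup comes from the GPU getting faster or from Java sort getting slower on the wider value range.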

Good link, thanks. (http://code.google.com/p/nativelibs4java/source/browse/trunk/libraries/OpenCL/LibCL/src/main/resources/LibCL)
I am currently busy with SGEMM, but I am looking for a way to make it visual.

Thank you!

Btw, you said normalize and then invert. Is that Reduce and option_backward?
And… because I have to think big… Do you know Edward Wilson, the man who knows everything about ants? He needs supercomputers to emulate the behaviour of ants, because he can’t put a camera on an ant’s head to see what happens in an ants’ nest. Ants are blind, computers too.

Marco13 · January 19, 2011 at 10:23

Hello

I’m not sure what you wanted to say about Applets. The native library is always necessary, and there’s no way to circumvent this. But maybe I just misunderstood this.

OpenCL and CUDA are quite similar. When you know how to write a CUDA kernel, you also know how to write an OpenCL kernel. The API is also very similar (IMHO OpenCL has a slightly „cleaner“ and more intuitive API - but maybe this is just subjective).

Admittedly, I didn’t expect that the range of the numbers inside the array would have such a big impact on the speed - but once you know it, it seems plausible: When there are only 10 different numbers in the array, quicksort may terminate quite early in each recursion step. The main conceptual difference between the sort algorithms is that Java is using QuickSort, and CUDPP is using a RadixSort. With RadixSort, the runtime for sorting n integers does not grow with O(n log n), but with O(n*bits) - and the number of bits can be given in the call. So when you know that you only have values between 0 and 255, you could call
JCudpp.cudppSort(handle, d_keys, null, 8, n);
(instead of using 32 there), and it should be even faster then.
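To illustrate why the number of bits matters, here is a small CPU-side radix sort in plain Java. It is only a sketch of the general idea (least significant digit first, 8 bits per pass) and does not claim to be how CUDPP implements its sort; the class name `RadixSortSketch` and the `keyBits` parameter are made up for this illustration. Keys are treated as unsigned, and only the lowest `keyBits` bits are looked at, so the result is fully sorted only if all values fit into that many bits.

```java
import java.util.Arrays;
import java.util.Random;

class RadixSortSketch {

    /**
     * Sorts the array in place by its lowest keyBits bits, 8 bits per pass.
     * Returns the number of passes over the data that were needed.
     */
    static int radixSort(int[] a, int keyBits) {
        final int radixBits = 8;
        final int buckets = 1 << radixBits;
        final int mask = buckets - 1;
        int[] src = a;
        int[] dst = new int[a.length];
        int passes = 0;
        for (int shift = 0; shift < keyBits; shift += radixBits) {
            // Count how often each digit occurs (stored at index digit + 1)
            int[] count = new int[buckets + 1];
            for (int v : src) {
                count[((v >>> shift) & mask) + 1]++;
            }
            // Prefix sum: count[d] becomes the first output index for digit d
            for (int b = 0; b < buckets; b++) {
                count[b + 1] += count[b];
            }
            // Stable scatter into the other buffer
            for (int v : src) {
                dst[count[(v >>> shift) & mask]++] = v;
            }
            int[] t = src;
            src = dst;
            dst = t;
            passes++;
        }
        // After an odd number of passes the result is in the temporary buffer
        if (src != a) {
            System.arraycopy(src, 0, a, 0, a.length);
        }
        return passes;
    }

    public static void main(String[] args) {
        Random random = new Random(0);
        int[] data = new int[1000000];
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextInt(255);
        }
        int[] ref = data.clone();
        Arrays.sort(ref);
        int passes = radixSort(data, 8);
        System.out.println("passes: " + passes
            + ", matches Arrays.sort: " + Arrays.equals(data, ref));
    }
}
```

With `keyBits = 8` the loop runs once over the data, with `keyBits = 32` it runs four times, which is the O(n*bits) behaviour in miniature.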

What I said about „normalize and invert“ was just intended as an example for a possible sequence of operations that could take place on the same data. The point is that very often the bottleneck for CUDA applications is the transfer of memory between the host and the device.

I did not know Edward Wilson. Are you going to write an ant simulation with CUDA?

bye
Marco