org.jocl.CLException: CL_DEVICE_NOT_FOUND


I am using JOCL and I am trying to access the GPU from the cleanup() method of the mapper class of a Hadoop job.
(Note: my normal code without map-reduce works fine on the GPU.)
When I execute my map-reduce code, it throws the error below.

attempt_201309171647_0021_m_000000_1: No protocol specified
attempt_201309171647_0021_m_000000_1: No protocol specified
13/09/20 18:03:01 INFO mapred.JobClient: Task Id : attempt_201309171647_0021_m_000000_2, Status : FAILED
org.jocl.CLException: CL_DEVICE_NOT_FOUND
at org.jocl.CL.checkResult(
at org.jocl.CL.clGetDeviceIDs(
at com.testMR.jocl.WordCountMapper.cleanup(
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.Child$
at Method)
at org.apache.hadoop.mapred.Child.main(

It cannot find the GPU. What could be the possible reason?
Rohit Sarewar


What is the device_type constant that you are passing to clGetDeviceIDs? It should be CL_DEVICE_TYPE_GPU (or CL_DEVICE_TYPE_ALL to also obtain CPUs).

Can you run one of the minimal, basic examples from the samples page on this PC?





Hi Marco
I can run these basic examples on my GPU.
This is a sample JOCL code which I tried to execute on the GPU from my Hadoop map-reduce code.
You can find "final long deviceType = CL_DEVICE_TYPE_GPU;" in the code snippet below.

I have an AMD GPU on my machine.
CL_DEVICE_VENDOR: Advanced Micro Devices, Inc.

If I change this to CPU instead of GPU (i.e. final long deviceType = CL_DEVICE_TYPE_CPU), then the mapper runs to completion and the job is successful.
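Since GPU works outside Hadoop but only CPU works inside it, one workaround (until the real cause is found) would be to try device types in order of preference and fall back. This is only a sketch of that idea in plain Java, with no JOCL dependency: the numeric constants 4 and 2 are the OpenCL header values of CL_DEVICE_TYPE_GPU and CL_DEVICE_TYPE_CPU, and the lookup function is a stand-in for a clGetDeviceIDs count query.

```java
import java.util.function.LongFunction;

public class DeviceTypeFallback {
    // Try each OpenCL device type in order of preference and return the first
    // one for which the platform reports at least one device.
    static long chooseDeviceType(long[] preferredTypes, LongFunction<Integer> countDevices) {
        for (long type : preferredTypes) {
            if (countDevices.apply(type) > 0) {
                return type;
            }
        }
        throw new IllegalStateException("No OpenCL devices found for any requested type");
    }

    public static void main(String[] args) {
        // Stand-in for clGetDeviceIDs: simulate a platform that (as in the
        // failing Hadoop task) reports no GPUs but one CPU.
        // 4 == CL_DEVICE_TYPE_GPU, 2 == CL_DEVICE_TYPE_CPU in the OpenCL headers.
        LongFunction<Integer> simulatedCount = type -> (type == 2L) ? 1 : 0;
        long chosen = chooseDeviceType(new long[]{ 4L, 2L }, simulatedCount);
        System.out.println("Falling back to device type " + chosen); // prints "Falling back to device type 2"
    }
}
```

In the real mapper one would pass each candidate type to clGetDeviceIDs and catch the CL_DEVICE_NOT_FOUND CLException instead of counting.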

Please find the code snippet (Mapper class) below:

package com.testMR.jocl;
import static org.jocl.CL.CL_CONTEXT_PLATFORM;
import static org.jocl.CL.CL_DEVICE_TYPE_ALL;
import static org.jocl.CL.CL_DEVICE_TYPE_GPU;
import static org.jocl.CL.CL_DEVICE_TYPE_CPU;
import static org.jocl.CL.CL_MEM_COPY_HOST_PTR;
import static org.jocl.CL.CL_MEM_READ_ONLY;
import static org.jocl.CL.CL_MEM_READ_WRITE;
import static org.jocl.CL.CL_TRUE;
import static org.jocl.CL.clBuildProgram;
import static org.jocl.CL.clCreateBuffer;
import static org.jocl.CL.clCreateCommandQueue;
import static org.jocl.CL.clCreateContext;
import static org.jocl.CL.clCreateKernel;
import static org.jocl.CL.clCreateProgramWithSource;
import static org.jocl.CL.clEnqueueNDRangeKernel;
import static org.jocl.CL.clEnqueueReadBuffer;
import static org.jocl.CL.clGetDeviceIDs;
import static org.jocl.CL.clGetPlatformIDs;
import static org.jocl.CL.clReleaseCommandQueue;
import static org.jocl.CL.clReleaseContext;
import static org.jocl.CL.clReleaseKernel;
import static org.jocl.CL.clReleaseMemObject;
import static org.jocl.CL.clReleaseProgram;
import static org.jocl.CL.clSetKernelArg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.jocl.CL;
import org.jocl.Pointer;
import org.jocl.Sizeof;
import org.jocl.cl_command_queue;
import org.jocl.cl_context;
import org.jocl.cl_context_properties;
import org.jocl.cl_device_id;
import org.jocl.cl_kernel;
import org.jocl.cl_mem;
import org.jocl.cl_platform_id;
import org.jocl.cl_program;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private static String programSource =
        "__kernel void "+
        "sampleKernel(__global const float *a,"+
        "             __global const float *b,"+
        "             __global float *c)"+
        "{"+
        "    int gid = get_global_id(0);"+
        "    c[gid] = a[gid] * b[gid];"+
        "}";

    // Hadoop supported data types
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException
    {
        //context.write(arg0, arg1);
    }

    protected void cleanup(Context context)
        throws IOException, InterruptedException
    {
        // Create input- and output data
        int n = 10;
        float srcArrayA[] = new float[n];
        float srcArrayB[] = new float[n];
        float dstArray[] = new float[n];
        for (int i=0; i<n; i++)
        {
            srcArrayA[i] = i;
            srcArrayB[i] = i;
        }
        Pointer srcA = Pointer.to(srcArrayA);
        Pointer srcB = Pointer.to(srcArrayB);
        Pointer dst = Pointer.to(dstArray);

        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_GPU;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];

        // Obtain a device ID
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        cl_device_id device = devices[deviceIndex];

        // Create a context for the selected device
        cl_context openCL_context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device},
            null, null, null);

        // Create a command-queue for the selected device
        cl_command_queue commandQueue =
            clCreateCommandQueue(openCL_context, device, 0, null);

        // Allocate the memory objects for the input- and output data
        cl_mem memObjects[] = new cl_mem[3];
        memObjects[0] = clCreateBuffer(openCL_context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_float * n, srcA, null);
        memObjects[1] = clCreateBuffer(openCL_context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_float * n, srcB, null);
        memObjects[2] = clCreateBuffer(openCL_context,
            CL_MEM_READ_WRITE,
            Sizeof.cl_float * n, null, null);

        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(openCL_context,
            1, new String[]{ programSource }, null, null);

        // Build the program
        clBuildProgram(program, 0, null, null, null, null);

        // Create the kernel
        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);

        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memObjects[1]));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memObjects[2]));

        // Set the work-item dimensions
        long global_work_size[] = new long[]{n};
        long local_work_size[] = new long[]{1};

        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
            global_work_size, local_work_size, 0, null, null);

        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[2], CL_TRUE, 0,
            n * Sizeof.cl_float, dst, 0, null, null);

        // Release kernel, program, and memory objects
        for (cl_mem memObject : memObjects)
        {
            clReleaseMemObject(memObject);
        }
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(openCL_context);

        // Verify the result
        boolean passed = true;
        final float epsilon = 1e-7f;
        for (int i=0; i<n; i++)
        {
            float x = dstArray[i];
            float y = srcArrayA[i] * srcArrayB[i];
            boolean epsilonEqual = Math.abs(x - y) <= epsilon * Math.abs(x);
            if (!epsilonEqual)
            {
                passed = false;
            }
        }
        //System.out.println("Test "+(passed?"PASSED":"FAILED"));
        context.write(new Text("Passed"), new IntWritable(1));
        if (n <= 10)
        {
            //System.out.println("Result: "+java.util.Arrays.toString(dstArray));
            context.write(new Text(java.util.Arrays.toString(dstArray)), new IntWritable(2));
        }
    }
}

I have used the identity reducer.
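For reference, the relative-epsilon verification at the end of cleanup() can be reproduced in plain Java without any OpenCL dependency; the arrays below mirror what the kernel computes (c[gid] = a[gid] * b[gid]):

```java
public class EpsilonCheck {
    // Relative-error comparison, as used in the verification loop of the mapper.
    static boolean epsilonEqual(float x, float y, float epsilon) {
        return Math.abs(x - y) <= epsilon * Math.abs(x);
    }

    public static void main(String[] args) {
        int n = 10;
        float[] a = new float[n];
        float[] b = new float[n];
        float[] c = new float[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = i;
            c[i] = a[i] * b[i]; // what the kernel would have written into dstArray
        }
        boolean passed = true;
        for (int i = 0; i < n; i++) {
            if (!epsilonEqual(c[i], a[i] * b[i], 1e-7f)) {
                passed = false;
            }
        }
        System.out.println("Test " + (passed ? "PASSED" : "FAILED")); // prints "Test PASSED"
    }
}
```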

This is a little bit strange. Did you also run the device query sample from the samples page? It shows how all available devices can be listed. When you start it manually, it should probably list one CPU and one GPU. What happens if you try to list all devices in the same way in a program that is called via Hadoop? That is, what does it print when you insert a snippet like this…

// Obtain the number of platforms
int numPlatforms[] = new int[1];
clGetPlatformIDs(0, null, numPlatforms);

System.out.println("Number of platforms: "+numPlatforms[0]);

// Obtain the platform IDs
cl_platform_id platforms[] = new cl_platform_id[numPlatforms[0]];
clGetPlatformIDs(platforms.length, platforms, null);

// List all devices of all platforms
for (int i=0; i<platforms.length; i++)
{
    // Obtain the number of devices for the current platform
    int numDevices[] = new int[1];
    clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, null, numDevices);

    cl_device_id devicesArray[] = new cl_device_id[numDevices[0]];
    clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, numDevices[0], devicesArray, null);

    System.out.println("There are "+numDevices[0]+" devices "+
        "on platform "+i+": "+Arrays.asList(devicesArray));
}

at the beginning of your hadoop code?

Hi Marco

The device query sample successfully detects CPU and GPU when executed without map reduce.

But when I also tried executing it with three options in my map task:

    1. My map task runs successfully, but it detects only the CPU.

    2. My map task runs successfully, but it detects only the CPU.

    3. My map task fails, throwing the error mentioned below:

attempt_201309171647_0021_m_000000_1: No protocol specified
attempt_201309171647_0021_m_000000_1: No protocol specified
13/09/20 18:03:01 INFO mapred.JobClient: Task Id : attempt_201309171647_0021_m_000000_2, Status : FAILED
org.jocl.CLException: CL_DEVICE_NOT_FOUND
at org.jocl.CL.checkResult(
at org.jocl.CL.clGetDeviceIDs(


OK, then for now, we can only say that - for some reason - it simply does not detect the GPU when it is run as a MapReduce job.

There had been some experiments for using JCuda with Hadoop, but there seemed to be no "hardware" problems. Only some settings concerning the library path etc., but no problems with devices not being found.

I’m afraid that I can’t give any specific hints right now. I could only guess what it might be related to (also because I still did not have the chance to have a closer look at Hadoop and how it works :frowning: ). Is there any sort of "abstraction" that causes the cluster to be run on something like its own virtual machine? (Not the JVM of course, but another one, maybe something like an emulation layer for multiple nodes on one PC?) In doubt, one could also ask a Hadoop expert in the respective forums… Maybe someone has an idea why certain hardware resources cannot be accessed from a Hadoop job…?

Hi Marco

We fixed it :slight_smile:

The problem we had with the AMD OpenCL codes running on Hadoop was that the map-reduce code didn’t have access to the GPU cards: it needed the GUI services provided by the X server to use the GPU compute resource.

From what I understand, AMD OpenCL codes (for users other than root) can’t be run without access to an X server.

According to this thread, AMD is working on getting OpenCL to work without an X server.

The solution that I found to get OpenCL codes to run on Hadoop is adapted from a thread that suggests steps to get OpenCL codes to run through an SSH login without a GUI.

The following are the steps that I followed:

  1. Edit the 'lightdm' user’s shell using the 'chsh' command and set it to /bin/bash:
     $ sudo chsh lightdm
     When it prompts, type: /bin/bash
  2. Open /etc/rc.local and add the following line before 'exit 0':
     su -l lightdm -c "sleep 30 ; export DISPLAY=:0 ; xhost +local:"
  3. Create a file /etc/profile.d/ and add the following inside:
     export COMPUTE=:0
     #export DISPLAY=:0
     #export GPU_MAX_ALLOC_PERCENT=100
     #export GPU_MAX_HEAP_SIZE=100

     if [ ! -z "$DISPLAY" ]; then
         xhost +local:
     fi
  4. The commented-out entries above are for testing other settings in case this setup does not work, but for us it worked as it is.
  5. Give execute permissions to the above script:
     $ sudo chmod 755 /etc/profile.d/
  6. The X setup resets if one logs in/out from lightdm, so a corresponding entry was added into /etc/lightdm/lightdm.conf.
  7. Reboot the system so that the environment variables are set for all users (including mapred); now we can run OpenCL codes from Hadoop.
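Given that explanation, one could add a small sanity check at the start of the mapper to fail with a clearer message when the X-related environment variables are missing. This is only a hypothetical sketch: COMPUTE and DISPLAY are the variables set in the steps above, while the class and method names here are my own invention.

```java
public class EnvCheck {
    // Describe whether the AMD OpenCL runtime is likely to see the GPU,
    // based on the COMPUTE/DISPLAY variables discussed in the steps above.
    static String describeEnv(String display, String compute) {
        if (compute != null) {
            return "COMPUTE=" + compute + " (GPU should be visible)";
        }
        if (display != null) {
            return "DISPLAY=" + display + " (GPU may be visible)";
        }
        return "Neither COMPUTE nor DISPLAY is set: the AMD runtime will likely "
            + "report CL_DEVICE_NOT_FOUND for GPU devices";
    }

    public static void main(String[] args) {
        // In the mapper this would run before any clGetPlatformIDs call.
        System.out.println(describeEnv(System.getenv("DISPLAY"), System.getenv("COMPUTE")));
    }
}
```

Logging this from setup() or cleanup() would make it immediately obvious whether the task attempt inherited the environment configured in /etc/profile.d/.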


Great :slight_smile:

I could never have figured this out on my own, since I have very limited possibilities for running tests on Linux - and even less in a specific setup like a Hadoop cluster.

Thanks for this information, it will probably be very helpful for people who want to use the GPU in Hadoop!