WordCount with Hadoop, GPU, and OpenCL

Dear users,

I’m new to this forum, so first of all I would like to say hello :slight_smile:

I’m working with Hadoop, OpenCL, and JOCL.

I’m looking for example code of a WordCount in a Hadoop environment, written for the GPU (OpenCL through JOCL).

Do you have something? Or can you recommend some reading where I can start working on that?

thanks in advance

AMmiraglio

Hello AMmiraglio :slight_smile:

I’m not aware of any existing example for a Word-Count with Hadoop and JOCL. There have been some questions about using JOCL (or JCuda, or GPUs in general) on Hadoop clusters. Maybe threads like http://forum.byte-welt.net/threads/10208-org-jocl-CLException-CL_DEVICE_NOT_FOUND?p=66055&viewfull=1#post66055 will help to circumvent some caveats.

A web search (like '"word count" gpu') of course brings some results, but I have not yet had a closer look at this, so I’m not sure what might really be helpful here. Something like http://www.cse.ust.hk/gpuqp/Mars.html might bring some first insights concerning a possible implementation of a WordCount on the GPU in general, but since it is in C/CUDA and not related to Hadoop, it might still involve some effort to extract the relevant ideas and port them to OpenCL. In general, the topic of GPUs on Hadoop clusters is also addressed at http://wiki.apache.org/hadoop/CUDA%20On%20Hadoop , but not in very much detail.

Do you already have particular experience with either of the fields, GPU programming or Hadoop, or even both?

bye
Marco

Hello Marco,

thanks a lot for your answer :slight_smile:

I’ve already had a look at http://forum.byte-welt.net/threads/1...ll=1#post66055 and followed the steps :wink:

Yes, typing 'word count gpu' into Google (and/or something more specific like 'word count hadoop over gpu', etc.) brings some results, but they are a bit far off :slight_smile: In the sense that you need to put in some effort (for example, to port from CUDA to OpenCL).

Not really. I’ve been working with GPUs since September (for a university course), and then I explored the world of Hadoop a little bit (I’ve just set up the local environment and tested the WordCount :slight_smile: ).

So, what do you mean when you say “you are not aware of any existing example for a Word-Count with Hadoop and JOCL”? :slight_smile:

*** Edit ***

PS: I’ve also found this: https://github.com/cpieloth/GPGPU-on-Hadoop

I meant that I did NOT know that the GPGPU-on-Hadoop repository ( https://github.com/cpieloth/GPGPU-on-Hadoop ) existed :smiley: - so thanks for posting this link! :slight_smile:

There’s a lot of research going on, in the area of GPU computing and in the area of Hadoop. So it’s not surprising that there is also a lot of research for the combination of the two. But from your question, it seemed that you are not particularly looking for interesting research papers. It sounded more like you’re looking for a basic implementation example, to get started quickly with your own experiments.

(And admittedly, although I’m not constantly monitoring the whole GPU-related research landscape, it’s at least surprising for me that I was not aware of this GitHub repo, considering that it is already 2 years old… maybe I found it once, but it got lost on my „list of things that I should have a closer look at“ …?)

However: Isn’t the GitHub repository exactly what you have been looking for?

In any case, I’ll have at least a short look at the corresponding Master’s thesis ( http://christof.pieloth.org/studienarbeiten/masterarbeit - it’s in German, so it probably won’t help you much, but if I stumble upon any interesting or important insights (although I probably will not read the whole thing), I can post them here ;)).

bye
Marco

Dear Marco,

you’re welcome :wink:

Yes, exactly: I was looking for a very basic implementation example, in order to start with my experiments.

I think the same, but I need some help figuring out how to set up the environment.

If we have a look at → GPGPU-on-Hadoop / hadoop_ocl_link_test / src /

there are tests of different libraries for linking OpenCL to the MapReduce framework (Hadoop Streaming, Hadoop Pipes, JavaCL, JOCL), using an example job that extracts the maximum temperature of a year.

  1. Hadoop case

Have a look at the main code in → MaxTemperature.java

if (args.length < 3) {
    System.out.println("Arguments: ");
    return;
}

It needs 3 input parameters. Questions:

  • What is the JobName? The name of the class, so → MaxTemperature?

  • Where can I find an input file (like the one they used)?

Thanks

*** Edit ***

I mean, in which format does the input file have to be written?

Something like:

2000 4 5 12 25 30 7
2001 21 30 5 8 0 9
2002 30 12 5 1 36
.
.
.
.

I am not really familiar with Hadoop (although becoming familiar with it has been on my ‘todo’ list for quite a while now), and I have not yet looked at the details of the code (but I should probably put this on my ‘todo’ list as well ;)). But according to my understanding, the “JobName” is just an arbitrary, user-defined name. I cannot say for sure whether it is used somewhere for identification purposes or so…
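A quick way to see that it is arbitrary: in a typical driver, the name is simply passed along when the Job is created (just a sketch; the name string here is made up):

Job job = new Job(conf, "any name you like"); // the name shows up in the job overview
// or, equivalently:
job.setJobName("any name you like");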

Concerning the file format: The MaxTemperatureMapper seems to process the lines of the input file using the DataSet class. According to this, the relevant parts of the input lines are:


--------------1234-------123456------------------------------------------------------------------------123456------

Where the first block contains the year and the third block contains the maximum temperature of this year. (The second block contains another temperature, but is not used here. Also, the temperatures from the file may be double values, and are multiplied by 10 (for whatever reason) while they are read).
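To make the column layout concrete, here is a minimal, standalone sketch of this kind of fixed-column parsing. The year offsets can be read off the format line above (the year block starts at index 14); the max-temperature offsets are only assumptions for illustration - the authoritative values are the constants in the repository’s DataSet class:

public class DataSetSketch {

    // "Missing" marker, as a placeholder for the real DataSet.MISSING value
    public static final int MISSING = Integer.MIN_VALUE;

    private static final int YEAR_START = 14; // first character of the year block
    private static final int YEAR_END = 17;   // last character (inclusive)
    private static final int MAX_START = 105; // assumed start of the max-temperature block
    private static final int MAX_END = 110;   // assumed end (inclusive)

    public static String getYear(String line) {
        return line.substring(YEAR_START, YEAR_END + 1);
    }

    public static int getMax(String line) {
        // The temperature may be a double value in the file, and is
        // multiplied by 10 while it is read (as mentioned above)
        double value = Double.parseDouble(line.substring(MAX_START, MAX_END + 1));
        return (int) (value * 10);
    }
}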

Dear Marco,

Here the result of my first try:

input file:

--------------1234-------023456------------------------------------------------------------------------023456------
--------------1201-------123456------------------------------------------------------------------------123456------
--------------2003-------053456------------------------------------------------------------------------053456------
--------------2001-------043456------------------------------------------------------------------------043456------

error messages:

gabriele@gabriele-K52JU:~$ hadoop jar MaxTemperature.jar MaxTemperature /user/gabriele/input /user/gabriele/output
13/11/19 21:40:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/19 21:40:29 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/19 21:40:29 INFO input.FileInputFormat: Total input paths to process : 1
13/11/19 21:40:29 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/19 21:40:30 INFO mapred.JobClient: Running job: job_local386907794_0001
13/11/19 21:40:30 INFO mapred.LocalJobRunner: Waiting for map tasks
13/11/19 21:40:30 INFO mapred.LocalJobRunner: Starting task: attempt_local386907794_0001_m_000000_0
13/11/19 21:40:30 INFO util.ProcessTree: setsid exited with exit code 0
13/11/19 21:40:30 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@15d3352
13/11/19 21:40:30 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/gabriele/input/temperatures.txt:0+469
13/11/19 21:40:30 INFO mapred.MapTask: io.sort.mb = 100
13/11/19 21:40:30 INFO mapred.MapTask: data buffer = 79691776/99614720
13/11/19 21:40:30 INFO mapred.MapTask: record buffer = 262144/327680
13/11/19 21:40:30 INFO mapred.MapTask: Starting flush of map output
13/11/19 21:40:30 INFO mapred.MapTask: Finished spill 0
13/11/19 21:40:30 INFO mapred.LocalJobRunner: Map task executor complete.
13/11/19 21:40:30 WARN mapred.LocalJobRunner: job_local386907794_0001
java.lang.Exception: java.lang.StringIndexOutOfBoundsException: String index out of range: 18
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)

Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 18
at java.lang.String.substring(String.java:1907)
at gsod.DataSet.getYear(DataSet.java:19)
at hadoop.MaxTemperatureMapper.map(MaxTemperatureMapper.java:29)
at hadoop.MaxTemperatureMapper.map(MaxTemperatureMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
13/11/19 21:40:31 INFO mapred.JobClient: map 0% reduce 0%
13/11/19 21:40:31 INFO mapred.JobClient: Job complete: job_local386907794_0001
13/11/19 21:40:31 INFO mapred.JobClient: Counters: 0


After some tests I’ve realized that it does not like this instruction in MaxTemperatureMapper.java:

year = DataSet.getYear(line);

It is weird, no? And why “String index out of range: 18”?

PS: Marco :slight_smile: Can we communicate another way sometime (Facebook, Gmail, or any kind of chat)? I really need to figure this out as soon as possible and set up a basic environment that works.
Maybe trying together will be easier :slight_smile:

thanks

Admittedly, I have not tested it, and my assumption (namely that it simply reads the individual lines from the input file) may be wrong.

Are you sure that there is no empty line at the end of the file? That would explain the message: getYear calls line.substring(YEAR_START, YEAR_END + 1) with YEAR_END + 1 = 18, and for a line that is shorter than 18 characters (such as an empty one), substring throws a StringIndexOutOfBoundsException that reports this end index, 18.

In any case, it should not be too hard to find out what’s wrong there. One option would be to run it in a debugger, but I assume that’s not sooo easy in this case. The other would be to simply insert a

public static String getYear(String line) {
    // Debug output: show exactly which line arrives here, with
    // markers to make leading/trailing whitespace visible
    System.out.println("That's the line: >"+line+"<");
    return line.substring(YEAR_START, YEAR_END + 1);
}

into DataSet.java.

We could possibly arrange a chat ‘meeting’ (there’s a chat channel here, and #bytewelt as an IRC channel; I just have to see how this stuff works :D). However, I’m not sure whether I can help you with specific questions about this library. But I can try to get it running on my own, and see whether I encounter any problems (and whether I find solutions for them ;))

Ok, no problem :slight_smile:

In fact, if I print it, it tries to read an empty line:

That’s the line: >--------------1234-------023456------------------------------------------------------------------------023456------<
234560.0
That’s the line: >--------------1201-------123456------------------------------------------------------------------------123456------<
1234560.0
That’s the line: >--------------2003-------053456------------------------------------------------------------------------053456------<
534560.0
That’s the line: >--------------2001-------043456------------------------------------------------------------------------043456------<
434560.0
That’s the line: ><

So what I did was to add a line with the prefix STN---, since the mapper contains

if (line.startsWith("STN---"))
    return;

which should skip it.

--------------1234-------023456------------------------------------------------------------------------023456------
--------------1201-------123456------------------------------------------------------------------------123456------
--------------2003-------053456------------------------------------------------------------------------053456------
--------------2001-------043456------------------------------------------------------------------------043456------
STN-----------1988-------043456------------------------------------------------------------------------043456------

but still the same problem:

That’s the line: >--------------1234-------023456------------------------------------------------------------------------023456------<
234560.0
That’s the line: >--------------1201-------123456------------------------------------------------------------------------123456------<
1234560.0
That’s the line: >--------------2003-------053456------------------------------------------------------------------------053456------<
534560.0
That’s the line: >--------------2001-------043456------------------------------------------------------------------------043456------<
434560.0
That’s the line: ><

*** Edit ***

Ok, it is working now :)

public void map(LongWritable key, Text value,
        MaxTemperatureMapper.Context context) throws IOException,
        InterruptedException {

    line = value.toString();

    System.out.println("That's the line from mapper: >"+line+"<");

    /* if (line.startsWith("STN---")) */
    if (line.isEmpty()) // skip the trailing empty line instead
    {
        return;
    }

    year = DataSet.getYear(line);
    airTemperature = DataSet.getMax(line);

    if (airTemperature != DataSet.MISSING) {
        context.write(new Text(year), new IntWritable(airTemperature));
    }
}

}

*** Edit ***


Ok, now the next step is to test the JOCL implementation :slight_smile: (jocl/MaxTemperature.java),

which uses the GPU in the reduce part (MaxTemperatureReducer).

Of course, I’ll post some errors :slight_smile: The map task completes correctly, but the reduce part has some problems :frowning:

gabriele@gabriele-K52JU:~$ hadoop jar MaxTemperature.jar MaxTemperature /user/gabriele/input /user/gabriele/output
13/11/20 09:37:46 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/20 09:37:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/11/20 09:37:46 INFO input.FileInputFormat: Total input paths to process : 1
13/11/20 09:37:46 WARN snappy.LoadSnappy: Snappy native library not loaded
13/11/20 09:37:46 INFO mapred.JobClient: Running job: job_local1521899308_0001
13/11/20 09:37:46 INFO mapred.LocalJobRunner: Waiting for map tasks
13/11/20 09:37:46 INFO mapred.LocalJobRunner: Starting task: attempt_local1521899308_0001_m_000000_0
13/11/20 09:37:46 INFO util.ProcessTree: setsid exited with exit code 0
13/11/20 09:37:46 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@95149e
13/11/20 09:37:46 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/gabriele/input/temperatures.txt:0+700
13/11/20 09:37:46 INFO mapred.MapTask: io.sort.mb = 100
13/11/20 09:37:47 INFO mapred.MapTask: data buffer = 79691776/99614720
13/11/20 09:37:47 INFO mapred.MapTask: record buffer = 262144/327680
13/11/20 09:37:47 INFO mapred.MapTask: Starting flush of map output
13/11/20 09:37:47 INFO mapred.MapTask: Finished spill 0
13/11/20 09:37:47 INFO mapred.Task: Task:attempt_local1521899308_0001_m_000000_0 is done. And is in the process of commiting
13/11/20 09:37:47 INFO mapred.LocalJobRunner:
13/11/20 09:37:47 INFO mapred.Task: Task 'attempt_local1521899308_0001_m_000000_0' done.
13/11/20 09:37:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local1521899308_0001_m_000000_0
13/11/20 09:37:47 INFO mapred.LocalJobRunner: Map task executor complete.
13/11/20 09:37:47 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1b647b9
13/11/20 09:37:47 INFO mapred.LocalJobRunner:
13/11/20 09:37:47 INFO mapred.Merger: Merging 1 sorted segments
13/11/20 09:37:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 68 bytes
13/11/20 09:37:47 INFO mapred.LocalJobRunner:
13/11/20 09:37:47 WARN mapred.LocalJobRunner: job_local1521899308_0001
java.lang.NoClassDefFoundError: org/jocl/CLException
at jocl.MaxTemperatureReducer.setup(MaxTemperatureReducer.java:26)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
Caused by: java.lang.ClassNotFoundException: org.jocl.CLException
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
… 5 more

13/11/20 09:37:47 INFO mapred.JobClient: map 100% reduce 0%
13/11/20 09:37:47 INFO mapred.JobClient: Job complete: job_local1521899308_0001
13/11/20 09:37:47 INFO mapred.JobClient: Counters: 20
13/11/20 09:37:47 INFO mapred.JobClient: File Input Format Counters
13/11/20 09:37:47 INFO mapred.JobClient: Bytes Read=700
13/11/20 09:37:47 INFO mapred.JobClient: FileSystemCounters
13/11/20 09:37:47 INFO mapred.JobClient: FILE_BYTES_READ=80999
13/11/20 09:37:47 INFO mapred.JobClient: HDFS_BYTES_READ=700
13/11/20 09:37:47 INFO mapred.JobClient: FILE_BYTES_WRITTEN=150048
13/11/20 09:37:47 INFO mapred.JobClient: Map-Reduce Framework
13/11/20 09:37:47 INFO mapred.JobClient: Reduce input groups=0
13/11/20 09:37:47 INFO mapred.JobClient: Map output materialized bytes=72
13/11/20 09:37:47 INFO mapred.JobClient: Combine output records=0
13/11/20 09:37:47 INFO mapred.JobClient: Map input records=10
13/11/20 09:37:47 INFO mapred.JobClient: Reduce shuffle bytes=0
13/11/20 09:37:47 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
13/11/20 09:37:47 INFO mapred.JobClient: Reduce output records=0
13/11/20 09:37:47 INFO mapred.JobClient: Spilled Records=6
13/11/20 09:37:47 INFO mapred.JobClient: Map output bytes=54
13/11/20 09:37:47 INFO mapred.JobClient: Total committed heap usage (bytes)=159907840
13/11/20 09:37:47 INFO mapred.JobClient: CPU time spent (ms)=0
13/11/20 09:37:47 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
13/11/20 09:37:47 INFO mapred.JobClient: SPLIT_RAW_BYTES=123
13/11/20 09:37:47 INFO mapred.JobClient: Map output records=6
13/11/20 09:37:47 INFO mapred.JobClient: Combine input records=0
13/11/20 09:37:47 INFO mapred.JobClient: Reduce input records=0

Any ideas? :slight_smile:

ATM, I can only refer to things like http://stackoverflow.com/questions/17046744/noclassdeffounderror-in-wordcount-program , where they are talking about a HADOOP_CLASSPATH and a -libjars option. I thought that the scripts that are included in the repository should take care of settings like these, but it’s obviously more complex. (In fact, I doubt that I will ever have the chance to test this on my own, since I’m currently limited to Windows machines, and the setup there seems to require even more prerequisites than on Linux…)
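By the way, the “Use GenericOptionsParser for parsing the arguments. Applications should implement Tool” warning in your log points in the same direction: the -libjars option is only evaluated when the driver parses the generic options, e.g. via ToolRunner. A minimal sketch of such a driver (untested, and the class names are only assumptions based on the repository) could look like this:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: MaxTemperatureDriver <input> <output>");
            return -1;
        }
        // getConf() already contains whatever the generic options have set up
        Job job = new Job(getConf(), "MaxTemperature");
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-libjars, -D, ...) and
        // passes only the remaining arguments on to run()
        System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
    }
}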

Ok, but I think the problem here is something else.

In the file MaxTemperatureReducer.java, it gets into trouble when it calls the constructor

this.maxVal = new maxValueJOCL.MaxValueSimple();


public class MaxTemperatureReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {

    // max memory of the OpenCL device, MAX_VALUES % 64 = 0
    private static final int MAX_VALUES = 65536;

    private int[] buffer;

    private maxValueJOCL.MaxValueAbstract maxVal;

    @Override
    protected void setup(MaxTemperatureReducer.Context context)
            throws IOException, InterruptedException {

        // these are the two lines where it fails:
        this.maxVal = new maxValueJOCL.MaxValueSimple();
        this.maxVal.initialize(CL.CL_DEVICE_TYPE_GPU);

        this.buffer = new int[MAX_VALUES];
    }

I mean, everything should be visible; all the classes are included…

Yes, it’s indeed strange that it complains about CLException not being found (and not about the main CL class, for example…). However, I have no idea how Hadoop resolves classes on each node (e.g. why a URLClassLoader is involved there), or how the ‘classpath’ for each node can be specified. I’m currently going through http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html , maybe I can at least get Hadoop itself running on a Windows machine - and afterwards, try to dig deeper into the WordCount example, although I have no idea how to “translate” all the scripts that are involved there…

*** Edit ***

So, I’ve been messing around with Hadoop for a while now, but have not yet gotten it running on Windows. I also encountered a “NoClassDefFoundError” and got stuck with it for a while, but that was related to the attempt to run this stuff inside Cygwin :rolleyes: Now I’m struggling with some errors that are related to the ‘Apache NativeIO’ libraries. Obviously, Hadoop does not know how to load them, because it does not know which OS it is running on. In fact, a “similar” error seems to still be under investigation ( https://issues.apache.org/jira/browse/MAPREDUCE-5451 ) and is not fixed yet in the current release.

I’m not in the position to say this. I know that it sounds arrogant. I know that people might feel tempted to refer me to the ‘Dunning Kruger Effect’ when I say this. But I HAVE to say it: Even though Hadoop may be a tremendously complex piece of software: When you write anything in Java in a professional context, and it is not possible to run the result on any platform by glueing together a bunch of libraries and a single ‘Main.java’, then this is a huge FAIL. That’s it.

Ok, cool,

I look forward to seeing you reach my status, to find out whether you get the same problem.

:slight_smile:

I’m not sure WHEN I will have time to continue with that, and HOW much time I should invest here. It’s obviously not a JOCL-specific question, but more related to Hadoop in general and the Wordcount example in particular. But I wanted to give Hadoop a try anyhow, so I’ll continue to fiddle around … at least I have not yet given up :wink:

I really cannot understand it…

If I type:

hadoop jar MaxTemperature.jar MaxTemperature /user/gabriele/input /user/gabriele/output

and MaxTemperature.jar contains all the classes that it needs, WHY is it not able to find them? :frowning: java.lang.NoClassDefFoundError: org/jocl/CLException

I shouldn’t need to add a classpath or include third-party libraries in the MapReduce job… ALL the classes are already in my JAR…

So, why???

PS: never give up :slight_smile:

Yes, I also tried to continue with that, and … (well, I’ll omit some gory details here) … I think that I might eventually get it running. Somehow.
(BTW: Which Hadoop version are you using? They changed some of the infrastructure, and … well, I read a lot about YARN etc. recently…)

However: The “MaxTemperature.jar” does not contain CLException.class, does it? I’m pretty sure that you still have to define where the JOCL JAR is located. I still haven’t grasped the “big picture” of Hadoop, I guess (and if it exists at all, the developers seem to have tried very hard to hide it), but … what does it print when you type
hadoop classpath
and is the JOCL JAR contained in the resulting list?

In any case, I’ll try to continue probably on Monday.

The script at https://github.com/cpieloth/GPGPU-on-Hadoop/blob/master/hadoop_ocl_link_test/runTime/jocl/runJOCL.sh tries to copy some JAR files around; you should make sure that you adjust the JOCL version number there (from 0.1.6 to 0.1.9).

Is it closed?

This is the solution:
http://grepalex.com/2013/02/25/hadoop-libjars/

Thanks. It seems to be rather hard to find up-to-date information for configuration details like this for Hadoop, so that may indeed be helpful for others.
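For future readers, the approach from that article boils down to making the extra JAR visible both to the client JVM and to the tasks, roughly like this (the paths and the JOCL file name are placeholders):

export HADOOP_CLASSPATH=/path/to/JOCL-0.1.9.jar
hadoop jar MaxTemperature.jar MaxTemperature -libjars /path/to/JOCL-0.1.9.jar /user/gabriele/input /user/gabriele/output

HADOOP_CLASSPATH makes the JAR visible to the driver, and -libjars ships it to the map/reduce tasks - which only works if the driver parses the generic options, e.g. via ToolRunner, as sketched above.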