Intel OpenCL Beta tested and working

I’m doing a little research on how OpenCL performs on CPUs, and I just noticed the new release of the Intel OpenCL SDK, so I had to check how it works. Here are my results:

First of all, it works out of the box (on Windows at least), so the note about JOCL not being tested on Intel OpenCL can be removed.

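Second, it runs faster than AMD’s CPU runtime, apparently because it uses SSE4.1 while AMD’s runtime uses only SSE2. It also supports doubles (which AMD’s runtime doesn’t support on my CPU). In the device query results below, the devices appear in platform order, so the first i3 entry (vendor GenuineIntel) is the CPU as seen by AMD’s runtime and the second (vendor Intel(R) Corporation) is the CPU as seen by Intel’s runtime: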

Number of platforms: 3
Number of devices in platform NVIDIA CUDA: 1
Number of devices in platform AMD Accelerated Parallel Processing: 1
Number of devices in platform Intel(R) OpenCL: 1
--- Info for device GeForce GT 420M: ---
CL_DEVICE_NAME:                         GeForce GT 420M
CL_DEVICE_VENDOR:                         NVIDIA Corporation
CL_DRIVER_VERSION:                         275.27
CL_DEVICE_TYPE:                                CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS:                2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:        3
CL_DEVICE_MAX_WORK_ITEM_SIZES:                0 / 0 / 0 
CL_DEVICE_MAX_WORK_GROUP_SIZE:                1024
CL_DEVICE_MAX_CLOCK_FREQUENCY:                1000 MHz
CL_DEVICE_ADDRESS_BITS:                        32
CL_DEVICE_MAX_MEM_ALLOC_SIZE:                240 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:                961 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:        no
CL_DEVICE_LOCAL_MEM_TYPE:                local
CL_DEVICE_LOCAL_MEM_SIZE:                48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:        64 KByte
CL_DEVICE_QUEUE_PROPERTIES:                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES:                CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:                1
CL_DEVICE_MAX_READ_IMAGE_ARGS:                128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:                8
CL_DEVICE_SINGLE_FP_CONFIG:                CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA 
CL_DEVICE_2D_MAX_WIDTH                        0
CL_DEVICE_2D_MAX_HEIGHT                        0
CL_DEVICE_3D_MAX_WIDTH                        0
CL_DEVICE_3D_MAX_HEIGHT                        0
CL_DEVICE_3D_MAX_DEPTH                        0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>        CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1


--- Info for device Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz                : ---
CL_DEVICE_NAME:                         Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz                
CL_DEVICE_VENDOR:                         GenuineIntel                   
CL_DRIVER_VERSION:                         2.0                
CL_DEVICE_TYPE:                                CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS:                4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:        3
CL_DEVICE_MAX_WORK_ITEM_SIZES:                0 / 0 / 0 
CL_DEVICE_MAX_WORK_GROUP_SIZE:                1024
CL_DEVICE_MAX_CLOCK_FREQUENCY:                2527 MHz
CL_DEVICE_ADDRESS_BITS:                        64
CL_DEVICE_MAX_MEM_ALLOC_SIZE:                2048 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:                3958 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:        no
CL_DEVICE_LOCAL_MEM_TYPE:                global
CL_DEVICE_LOCAL_MEM_SIZE:                32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:        64 KByte
CL_DEVICE_QUEUE_PROPERTIES:                CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:                1
CL_DEVICE_MAX_READ_IMAGE_ARGS:                128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:                8
CL_DEVICE_SINGLE_FP_CONFIG:                CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF 
CL_DEVICE_2D_MAX_WIDTH                        0
CL_DEVICE_2D_MAX_HEIGHT                        0
CL_DEVICE_3D_MAX_WIDTH                        0
CL_DEVICE_3D_MAX_HEIGHT                        0
CL_DEVICE_3D_MAX_DEPTH                        0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>        CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 0


--- Info for device Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz : ---
CL_DEVICE_NAME:                         Intel(R) Core(TM) i3 CPU       M 380  @ 2.53GHz 
CL_DEVICE_VENDOR:                         Intel(R) Corporation
CL_DRIVER_VERSION:                         1.1
CL_DEVICE_TYPE:                                CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS:                4
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:        3
CL_DEVICE_MAX_WORK_ITEM_SIZES:                0 / 0 / 0 
CL_DEVICE_MAX_WORK_GROUP_SIZE:                1024
CL_DEVICE_MAX_CLOCK_FREQUENCY:                2530 MHz
CL_DEVICE_ADDRESS_BITS:                        64
CL_DEVICE_MAX_MEM_ALLOC_SIZE:                989 MByte
CL_DEVICE_GLOBAL_MEM_SIZE:                3958 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT:        no
CL_DEVICE_LOCAL_MEM_TYPE:                global
CL_DEVICE_LOCAL_MEM_SIZE:                32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:        128 KByte
CL_DEVICE_QUEUE_PROPERTIES:                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES:                CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT:                1
CL_DEVICE_MAX_READ_IMAGE_ARGS:                128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS:                128
CL_DEVICE_SINGLE_FP_CONFIG:                CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST 
CL_DEVICE_2D_MAX_WIDTH                        0
CL_DEVICE_2D_MAX_HEIGHT                        0
CL_DEVICE_3D_MAX_WIDTH                        0
CL_DEVICE_3D_MAX_HEIGHT                        0
CL_DEVICE_3D_MAX_DEPTH                        0
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>        CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2

To test its speed, I evaluated about 10 million combinations with a function that uses a lot of pow and sqrt calls and a few integrals of those. The test results:

Intel OpenCL - i3 380M - 104 seconds
AMD OpenCL - i3 380M - 132 seconds
Nvidia OpenCL - GT420M - 24 seconds
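
To put those timings in relation to each other, here is a small sketch that computes the ratios. The helper name `ratio` is made up for illustration; the numbers are just the three results above.

```shell
#!/bin/sh
# ratio A B: print A/B with two decimals.
# Hypothetical helper for this post, not part of any OpenCL SDK.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "AMD CPU runtime vs Intel CPU runtime: $(ratio 132 104)x"
echo "Intel CPU runtime vs NVIDIA GPU:      $(ratio 104 24)x"
echo "AMD CPU runtime vs NVIDIA GPU:        $(ratio 132 24)x"
```

So on this workload AMD’s runtime takes roughly 27% longer than Intel’s, and the GT 420M is between about four and six times faster than the CPU runtimes.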

The Intel runtime is a little faster here, but I think it will depend on the task, and it’s always better to have double and SSE4.1 support than not.
I’ll try to find a way to install the runtime on my Ubuntu machine and check how it works there. If you find anything interesting in this device query result and want me to check something, let me know :smiley:

Sorry for the multiple posts, but I couldn’t edit the first one.
It works on Ubuntu, but you need to do the following:


1) First of all grab the rpm package from http://software.intel.com/en-us/articles/download-intel-opencl-sdk/.

2) Install the rpm and alien packages (`sudo apt-get install rpm alien`).

3) Convert the rpm package to deb using alien - `fakeroot alien --to-deb <intel's rpm package filename>`.
The conversion prints some warnings; I wouldn't pay any attention to them.

4) Install the newly created deb package. `sudo dpkg -i intel-ocl-sdk-suse+11.1_1.1-2_amd64.deb`

5) One extra package you need to install for the library to work is libnuma. `sudo apt-get install libnuma1`

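6) Make sure the ICD is installed. `echo "/usr/lib64/OpenCL/vendors/intel/libintelocl.so" | sudo tee /etc/OpenCL/vendors/intelocl64.icd`
(With `sudo echo ... > file` the redirection is performed by your own non-root shell, so writing to /etc fails; piping into `sudo tee` avoids that.)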

7) Conveniently, the package also installs the OpenCL headers in /usr/include/CL. The main library 
(libOpenCL.so) is installed in /usr/lib64. If you don't have any other OpenCL platform installed on your 
system, I suggest moving it to /usr/lib (run `sudo ldconfig` afterwards); if you already have this library 
(for example, the NVIDIA driver also contains it), just leave it there.

8) Since the libraries are installed in a non-standard location for Ubuntu (/usr/lib64/OpenCL/vendors/intel), 
you'll need to adjust your LD_LIBRARY_PATH. I usually do this using a script, but you can just run:
`export LD_LIBRARY_PATH=/usr/lib64/OpenCL/vendors/intel:$LD_LIBRARY_PATH`
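
For the script variant mentioned in step 8, something along these lines should do. It is only a sketch: the function name `prepend_ld_path` and the `INTEL_OCL_DIR` variable are made up here, and the directory is the one from the steps above.

```shell
#!/bin/sh
# Hypothetical wrapper: put the Intel runtime directory first in
# LD_LIBRARY_PATH, then run whatever command was passed in.
INTEL_OCL_DIR="${INTEL_OCL_DIR:-/usr/lib64/OpenCL/vendors/intel}"

prepend_ld_path() {
    # Only add the ':' separator when LD_LIBRARY_PATH already has content;
    # an empty entry would make the loader search the current directory.
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        LD_LIBRARY_PATH="$1:$LD_LIBRARY_PATH"
    else
        LD_LIBRARY_PATH="$1"
    fi
    export LD_LIBRARY_PATH
}

prepend_ld_path "$INTEL_OCL_DIR"

# Replace this shell with the wrapped command, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Saved under a name of your choice (say `with-intel-ocl.sh`), it would be used as `./with-intel-ocl.sh java -cp <your classpath> <your main class>`, so the setting only applies to that one process instead of the whole session.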

Anyway, it works on Linux with JOCL.

Hello kacperpl1,

Thanks for this detailed information, great to hear that :slight_smile: I’ll try to update the site soon and maybe include your installation description for Linux.

It’s interesting that there are some subtle differences between AMD’s view of the CPU and Intel’s view (driver version, max. alloc. size, but also the clock speed…)

Concerning the speed: Although I’m not familiar with the details of the different SSE versions, I assume that the SSE instructions will most likely bring an advantage for vector operations (i.e. operations involving cl_float4, for example).

bye
Marco

The driver installation description is not mine; it has been on the net for a while. However, I updated step 6 to write the absolute path to the library instead of just its name, so that Ubuntu can load the library. By the way, the annoying thing is that there is no runtime-only version of the installer like the one AMD provides. The Windows version is really fat (190 MB) and the Linux version is 10 times smaller, which is a bit weird.
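
A quick way to check the ICD step is a small script that reads each .icd file and reports whether the library it names exists. This is only a sketch: the function name `check_icds` is made up, and it assumes the ICD directory /etc/OpenCL/vendors from step 6 with one library path on the first line of each file.

```shell
#!/bin/sh
# check_icds [DIR]: report where each .icd file in DIR points.
# Hypothetical helper; DIR defaults to the directory used in step 6.
check_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    for icd in "$dir"/*.icd; do
        [ -e "$icd" ] || continue          # glob matched nothing
        lib=$(head -n 1 "$icd")            # first line holds the library
        case "$lib" in
            /*)
                # Absolute path: the file must exist at exactly that place.
                if [ -e "$lib" ]; then
                    echo "ok: $icd -> $lib"
                else
                    echo "missing: $icd -> $lib"
                fi
                ;;
            *)
                # Bare name: found only if the loader search path covers it.
                echo "name only: $icd -> $lib"
                ;;
        esac
    done
}

check_icds /etc/OpenCL/vendors
```

A "name only" line is the situation the original description had; whether it loads then depends on the library search path, which is why the absolute path is the safer choice.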

I’m not using any vector operations in my kernel, so that’s not the cause here. I think the speedup comes from a faster compiler provided by Intel. In my algorithm I’m doing an iteration like this:

  1. prepare the kernel for the current part of the data
  2. prepare the current part of the data
  3. run the kernels
  4. read the output data
  5. go to 1

If I run this with a small range of data in each step 2, the loop runs many more iterations and the per-iteration kernel preparation dominates. In that case there is a big gap between AMD’s and Intel’s runtimes, around 60 s vs 10 s (by the way, NVIDIA does it in 15 s), which suggests that AMD’s kernel compiler is slower.

Apart from that, you’re right, of course: SSE4.1 support will speed up vector calculations.

Oh, and PS: there aren’t any gigantic memory leaks here, just the same small amount as on other platforms.