Smith Waterman algorithm with OpenCL

duymap · 3. Februar 2014 um 02:38

Hi Everyone,

I am researching how to parallel Smith Waterman algorithm by OpenCL. Do anyone has experience about this ? Any idea or suggestion is appreciated.

I found a few of SW implementations by CUDA but I don’t know much about CUDA.

Thanks,
Duy.

Marco13 · 3. Februar 2014 um 03:00

Hello,

The Smith Waterman alignment is a comparatively common task to be solved on the GPU. The CUDA showcase at http://www.nvidia.com/object/cuda_showcase_html.html brings up quite some resources when searching for ‘Waterman’. Some of them are certainly conceptual theses/papers that contain information about the general approaches for porting this algorithm to the GPU, but it contains some implementations as well. In many cases, it is not sooo difficult to port an algorithm from CUDA to OpenCL: The concepts and the actual kernel code are rather similar. But even when directly doing a websearch for something like “opencl smith waterman”, one quickly finds implementations like http://li5.ziti.uni-heidelberg.de/oclsw/ . I did not have a closer look at this yet (neither tried to implement it on my own), but maybe it’s a good starting point…?

bye
Marco

duymap · 4. Februar 2014 um 00:40

Hi Macro13,

Thanks. I found it as you mentioned and beside that, I also found another one https://github.com/l-urence/smith-waterman. But the point is source code of them is more complicate than I need. I just need to demonstrate Smith-Waterman by OpenCL simply. I read the idea how to parallel this algorithm, the idea is simple and I understood, but I don’t know how to implement it by OpenCL because I am still newbie in OpenCL. I attached document to explain the idea how to parallel it at https://drive.google.com/file/d/0B1_p06g81l5ANkt1WDdhMXFQNXM/edit?usp=sharing , If you don’t mind, can you please take a look and give me some idea, I don’t know how to parallely in matrix like this.

Thanks,
Duy.

Marco13 · 4. Februar 2014 um 01:16

The document only describes the basic SmithWaterman algorithm, and mentions that it CAN be parallelized. There are certainly some details to consider when trying to implement it in OpenCL. However, from a first quick look at the source in the GitHub directory, the kernel looks rather simple (at least much simpler than I expected). Have you tried to get this running? What is your actual intention?

duymap · 4. Februar 2014 um 01:51

Hi Marco,

I have not tried to run this code yet , but I have read code twice. I intend to follow the approach of github source code because it seems simpler. However, I have 2 points that I don’t clear.

The first one is slice and sub variable, I don’t know why the purpose of them.
The second one is the code in github just resolve for 2 sequence string with same length. I am not sure it works with 2 sequence string with different length. That is the reason why I created document and ask you some advices.

I still keep trying to understand purpose of “slice” and “sub” in this code. In the other hand, I posted this thread to ask some help whether I can re-write again with my own way or not to make it simpler.

I am going to do the final assignment to prove that OpenCL can work on multiple platforms and how to handle it. Currently I can handle OpenCL by using CPU ( intel ) and GPU ( Nvidia ) with basic algorithm that I discussed with you before. Now I need to demo OpenCL handling with a few algorithm examples such as: Smith waterman, graphic and file sorting in hard disk. Hopefully I can finish this on time, since I finished them I will share w/ you to post them as example in jocl website.

Thanks,
Duy.

Marco13 · 4. Februar 2014 um 04:00

In the implementation on GitHub, the matrix is divided into quadratic sub-matrices. The “sub” variable describes the size of these submatrices, and the “slice” is the number of the diagonal of submatrices:


-------------------------------
|         |         |         |
| Slice 0 | Slice 1 | Slice 2 |
|         |         |         |
-------------------------------
|         |         |         |
| Slice 1 | Slice 2 | Slice 3 |
|         |         |         |
-------------------------------
|         |         |         |
| Slice 2 | Slice 3 | Slice 4 |
|         |         |         |
-------------------------------

The “subIndexes” store the coordinates of these sub-matrices in the large matrix.

The fact that this implementation assumes

that both strings have the same length AND
that the length is evenly divisible by the size of the submatrix
is a simplification. This is not intended to be a “General Sequence Alognment Framework”, but only a proof of concept. Every generalization increases the complexity, and makes it necessary to treat special cases (and (for OpenCL), to introduce padding of the matrices).

The description of this approach in https://github.com/l-urence/smith-waterman/blob/735260a2a94d28a6b29230fffedf3dc3ff782cb8/ausarbeitung/psys_dna.pdf?raw=true is in German, but maybe you’ll find the algorithm or images helpful anyhow.

duymap · 4. Februar 2014 um 08:52

Hi Macro,

Many thanks! Your information is really helpful. I will try to understand code based on your information.

Yes, this implementation assume 2 sequence string with same length. If I can modify it generally to work with any pair of sequence string, I will.

Thanks so much again,
Duy.

*** Edit ***

[QUOTE=Marco13]In the implementation on GitHub, the matrix is divided into quadratic sub-matrices. The “sub” variable describes the size of these submatrices, and the “slice” is the number of the diagonal of submatrices:


-------------------------------
|         |         |         |
| Slice 0 | Slice 1 | Slice 2 |
|         |         |         |
-------------------------------
|         |         |         |
| Slice 1 | Slice 2 | Slice 3 |
|         |         |         |
-------------------------------
|         |         |         |
| Slice 2 | Slice 3 | Slice 4 |
|         |         |         |
-------------------------------

The “subIndexes” store the coordinates of these sub-matrices in the large matrix.

The fact that this implementation assumes

that both strings have the same length AND
that the length is evenly divisible by the size of the submatrix
is a simplification. This is not intended to be a “General Sequence Alognment Framework”, but only a proof of concept. Every generalization increases the complexity, and makes it necessary to treat special cases (and (for OpenCL), to introduce padding of the matrices).

The description of this approach in https://github.com/l-urence/smith-waterman/blob/735260a2a94d28a6b29230fffedf3dc3ff782cb8/ausarbeitung/psys_dna.pdf?raw=true is in German, but maybe you’ll find the algorithm or images helpful anyhow.[/QUOTE]

Hi Marco,

Now I could understand what code said. We just need set sub = 1 to resolve my problem. I also found something wrong in kernel.cl, I think we don’t need any for loop in cl code.

I have took time to port it into Java code and debug today and understood a lot of things. I will try to accomplish this code to resolve any pair of sequence with different length and the point is how to identify subindex values in case 2 different lengths of sequence strings.

Anyway, thanks so much. Since I have done, I will send you my implementation to use it as example in your website.

Thanks,
Duy.

Marco13 · 4. Februar 2014 um 08:54

Well, I’m not sure whether a sub-matrix-size of 1 will not defeat any parallelism from OpenCL. As far as I understood, the kernel is applied to each sub-matrix - and when the sub-matrix has a size of 1x1, then each kernel will only be executed by 1 thread (instead of 1000s, as it should be with OpenCL…)

duymap · 5. Februar 2014 um 02:06

Yes, I tried with sub = 1 and after each loop, the values of subIndex will store coordinates of all cells in same diagonal. I am still trying to see what I am expecting is right or wrong. Today, I have ported opencl in C by jocl, I modified kernel.cl a little bit to follow my approach but I got the error when set mode for CL Read WRITE :(. The error show below:

Using platform 0 of 2: Intel® OpenCL
Device : Intel® Core™ i5-3210M CPU @ 2.50GHz
Exception in thread “main” org.jocl.CLException: CL_INVALID_HOST_PTR
at org.jocl.CL.checkResult(CL.java:686)
at org.jocl.CL.clCreateBuffer(CL.java:5445)
at sw.opencl.SW_OpenCL.sw(SW_OpenCL.java:194)
at sw.opencl.SW_OpenCL.main(SW_OpenCL.java:286)

I am not sure why got this exception because the variable was set that used for read and write in cl file. I attached kernel.cl file and SW_OpenCL.java file in order to you can take a look why got this exception and suggest me what kind of mode I should use ?

Link: https://app.box.com/s/qvbo6e4p7pmo1lomjbsd

Sorry for disturbing you again.

Thanks,
Duy.

*** Edit ***

I could fix error CL_INVALID_HOST_PTR. Thanks.

duymap · 15. Februar 2014 um 01:28

[QUOTE=Marco13]In the implementation on GitHub, the matrix is divided into quadratic sub-matrices. The “sub” variable describes the size of these submatrices, and the “slice” is the number of the diagonal of submatrices:


-------------------------------
|         |         |         |
| Slice 0 | Slice 1 | Slice 2 |
|         |         |         |
-------------------------------
|         |         |         |
| Slice 1 | Slice 2 | Slice 3 |
|         |         |         |
-------------------------------
|         |         |         |
| Slice 2 | Slice 3 | Slice 4 |
|         |         |         |
-------------------------------

The “subIndexes” store the coordinates of these sub-matrices in the large matrix.

The fact that this implementation assumes

that both strings have the same length AND
that the length is evenly divisible by the size of the submatrix
is a simplification. This is not intended to be a “General Sequence Alognment Framework”, but only a proof of concept. Every generalization increases the complexity, and makes it necessary to treat special cases (and (for OpenCL), to introduce padding of the matrices).

The description of this approach in https://github.com/l-urence/smith-waterman/blob/735260a2a94d28a6b29230fffedf3dc3ff782cb8/ausarbeitung/psys_dna.pdf?raw=true is in German, but maybe you’ll find the algorithm or images helpful anyhow.[/QUOTE]

Hi Macro,

I have done to port from c code into java code. Link here:

https://drive.google.com/file/d/0B1_p06g81l5AYU03OGNvRTVSQUE/edit?usp=sharing ( kernel.cl )
https://drive.google.com/file/d/0B1_p06g81l5AS0djTGlnNy1naXM/edit?usp=sharing ( SW_OpenCL_V2.java )

As I said before, If I used submatrix is 1, the performance is not good. So I have to tried with bigger submatrix size and the performance was improved so much.

But I realized that the above code still have a few of limitations:

Just applied for spare matrix only.
If the size of square matrix is prime ( 7 x 7 ) we can not select sub matrix size :(.

So I intend to develop it as general algorithm to resolve above restrictions:


|         |         |         |         |
| Slice 0 | Slice 1 | Slice 2 |	Slide 3	|
|         |         |         |			|
-----------------------------------------
|         |         |         |			|
| Slice 1 | Slice 2 | Slice 3 |	Slide 4	|
|         |         |         |			|	
-----------------------------------------
|         |         |         |			|
| Slice 2 | Slice 3 | Slice 4 |	Slide 5	|
|         |         |         |			|
-----------------------------------------

As you see, my propose is not square matrix and assume we have more one more slide ( slide 5 ). Each sample is the same slice should be submatrix , and I believe the sub matrix is not square too. So do you have any idea or suggestions about it ? I appreciate to if you let me know.

Thanks,
Duy.

Marco13 · 15. Februar 2014 um 07:31

(I’m currently not at my developing PC, but can try this out and have a look at this on Monday/Tuesday)

The limitations that you mentioned (namely that is must be a square matrix, and that matrices with a size of “prime * prime” can not easily be subdivided) are simplifying the implementation, and circumventing these will make the implementation more difficult. However, a commonly used approach to handle sizes that are not divisible (that is, “prime” sizes) is padding: You round the size up to the next largest multiple of the intended submatrix size.

int matrixSize = 13; // A prime number
int submatrixSize = 5; 
int snappedMatrixSize = snap(matrixSize, submatrixSize); // Will be 15

private static int snap(int value, int divisor)
{
    return value + divisor - 1 - (value - 1) % divisor;
}

and either fill the “invalid” (padded) entries of the larger matrix with zeros, or handle them as appropriate for the particular case. (That’s just a general pattern, I have not yet thought about how this could be applied here…)

duymap · 16. Februar 2014 um 05:47

Hi Marco,

Thanks for your idea. It’s great idea because if we can round it up and avoid prime size of matrix. So I would like to share my idea, I think I am going to change the algorithm from proof concept into general algorithm. I think we have 2 cases:

If col and row of matrix are not prime number.

Assume we have col = 9 and row = 6. We will do below steps:

Find the greatest divisor of 9 and 6, it should be 3. And 3 should be size of submatrix.

/**
	 * Get greatest divisor between a and b and it can not equal a or b
	 * @param a
	 * @param b
	 * @return
	 */
	public int getGreatestDivisor(int a, int b) {
		int i = 0;
		for (i = a; i > 0; i--) {
			if (a % i == 0 && b % i == 0 && i !=a && i != b) {
				break;
			}
		}
		return i;
	}

So now size of submatrix is 3, and we can divide not square matrix ( 9 x 6 ) into square sub-matrixes and calculate, the problem here how to calculate how many diagonals and how many coordinates on the same diagonal, I modified existing code a little bit to make it works with any row and col, please see:


public int min(int a, int b, int c) {
      return Math.min(a, Math.min(b, c));
}

public void compute(int row, int col, int sub) {

	   row = row / sub;
	   col = col / sub;
		
	    for (int line=1; line<=(row + col -1); line++)
	    {
	        /* Get column index of the first element in this line of output.
	           The index is 0 for first ROW lines and line - ROW for remaining
	           lines  */
	        int start_col =  Math.max(0, line-row);
	 
	        // number of diagonals of matrix
	        int subLength = min(line, (col-start_col), row);
	        int[] subIndexes = new int[subLength*2];

	        /* store coordinates of cells on the same diagonal */
	        for (int j=0; j< subLength; j++) {
	        	subIndexes[2*j] = (Math.min(row, line)-j-1)*sub;;
	        	subIndexes[2*j+1] = (start_col+j)*sub;;
	        }
	    }
	}

So now we have subLength ( number of global work items in kernel ), subIndexes ( coordinates on the same diagonal ), we can do w/ OpenCL.

If one of them ( row or col ) is prime number or both are prime number:

We have to round row or col to make sure they are not prime number anymore. But I don’t know how to choose the best submatrix to do this.
Since we round them up, we can calculate as my above method to determine subIndexes and subLength.

And I have one question, how we choose the best submatrix size in case row = col ?

Please let me know your idea.

Marco13 · 16. Februar 2014 um 07:20

(Still not on my development PC ;))

IIRC, each submatrix is handled by ONE kernel call. So the submatrix size could, in its most general form, be determined by the hardware capabilities. And one can assume that this “padding” will be necessary in most cases. For example, when the size of the Matrix is 194x178, (which is (972)x(892), then the GCD will be 2, but this is still too small for a submatrix.

So if the GPU has 1024 cores, one could use a submatrix size of 32x32, and for a matrix of size 194x178, one would pad the matrix to be 224x224 (that is, (732)x(732)) in order to exploit as many work items as possible in each kernel call.

But these are just first thought, I still have to take a closer at this…

duymap · 16. Februar 2014 um 07:31

[QUOTE=Marco13](Still not on my development PC ;))

IIRC, each submatrix is handled by ONE kernel call. So the submatrix size could, in its most general form, be determined by the hardware capabilities. And one can assume that this “padding” will be necessary in most cases. For example, when the size of the Matrix is 194x178, (which is (972)x(892), then the GCD will be 2, but this is still too small for a submatrix.

So if the GPU has 1024 cores, one could use a submatrix size of 32x32, and for a matrix of size 194x178, one would pad the matrix to be 224x224 (that is, (732)x(732)) in order to exploit as many work items as possible in each kernel call.

But these are just first thought, I still have to take a closer at this…[/QUOTE]

Thanks so much Marco, It’s really great to discuss with you.

Yes, you can take a closer look whenever you go back w/ development PC and please think about my suggestion to make it generally ;).

Sorry if this is a stupid question. How we know the cores of GPU ? We have to write code to know or investigate via manufacture specification ?

Thanks again Marco.

Duy.

Marco13 · 16. Februar 2014 um 13:22

Obtaining the actual number of cores is a bit difficult, because every manufacturer has his own „interpretation“ of what a „core“ actually is. But in this case, one would probably obtain the CL_DEVICE_MAX_WORK_GROUP_SIZE via clDeviceGetInfo. This is 512 on my GeForce 8800, so one could use a submatrix size of 22x22, for example.

I’m not sure how far such an algorithm can be generalized. Hundreds of scientists have been working for decades on problems like this, created large amount of research results, and one will certainly not be able to simply write a „General GPU-based Sequence Alignment Framework“ from scratch. But at least the generalization for different string lengths (that is, non-quadratic matrices) should be possible.

duymap · 17. Februar 2014 um 03:28

Yes, my purpose here is how to make it generic ( that is non-quadratic matrices ). I think I can do that with above code I posted. If you have time, please help to review.

The second thing that I am concerning is how to round up matrix if its col or row is prime number and the calculation to determine the best of submatrix after rounding matrix up.

Thanks,
Duy.

*** Edit ***

This is code that I modified to make it generic ( not using round matrix up ). Please let me know your idea

https://drive.google.com/file/d/0B1_p06g81l5ANVlhbVdFOTdHX00/edit?usp=sharing
https://drive.google.com/file/d/0B1_p06g81l5ANVlhbVdFOTdHX00/edit?usp=sharing

Thanks,
Duy.

Marco13 · 18. Februar 2014 um 03:04

This was twice the same link (namely the kernel in both cases) I assume that the host code had to be modified as well? I guess it was based on the GitHib code, right?

duymap · 18. Februar 2014 um 05:43

Sorry, my mistake. The link for java code here: https://drive.google.com/file/d/0B1_p06g81l5AWlo3ekVjN3F3R28/edit?usp=sharing

Yes, It based on code from github. What I am concerning now are:

how to round matrix up enough to select the best sub matrix ?
How to estimate best size of sub-matrix ?

Thanks,
Duy.

Marco13 · 18. Februar 2014 um 10:07

As a first step, I added

        int maxWorkGroupSize = getMaxWorkGroupSize(devices[deviceIndex]);
        System.out.println("maxWorkGroupSize "+maxWorkGroupSize);

        sub = (int)Math.sqrt(maxWorkGroupSize);
        System.out.println("submatrix size "+sub);
        newRow = row / sub;
        newCol = col / sub;

using a method

	private static int getMaxWorkGroupSize(cl_device_id device)
	{
		ByteBuffer buffer = ByteBuffer.
			allocate(Sizeof.size_t).
			order(ByteOrder.nativeOrder());
		clGetDeviceInfo(device, CL.CL_DEVICE_MAX_WORK_GROUP_SIZE, 
				Sizeof.size_t, Pointer.to(buffer), null);
		if (Sizeof.size_t == 4)
		{
			return buffer.getInt();
		}
		else
		{
			return (int)buffer.getLong();
		}
	}

On my card, this already is noticably faster, but … there is quite a lot that has to be … let’s say “revieved”, I’ll have to invest more time for this.

duymap · 18. Februar 2014 um 20:08

Thanks Marco, I will try on my laptop. I will keep you post.

Thanks,
Duy.