Counting entries in Excel table

system · 24. Februar 2012 um 00:18

Hi Marco. I’d like to do a (maybe) symple work:

Read this .xls file (with this java reader file that i’ve);
Divide the file read between the different processors CUDA (for division horizontal lines);
Each partition of the initial table contains in the first n columns the healthy subjects and from column n +1 the sick subjects
For each probe (row) must count the occurrences of each symbol (a / a, …) for both classes healthy and sick building two new tables: one for sound and another for the sick, where the first column contains the probeID and the first line the symbols (a /a, …).

Thanks for any kind of help.

Marco13 · 24. Februar 2012 um 01:38

Hello

I have splitted this from the original thread since it seems fairly unrelated to matrix products.

I’m not sure whether this is a use-case where CUDA may be very beneficial. If I understood correctly, you only want to count the number of occurances of strings. There is nothing time-critical about that, and nothing that is obviously data-parallel.

Counting the number of occurances can easily be done in Excel. And if it should be done in Java, it can easily be done while the table is read. And if this is the time-critical part: Reading it as a CSV file might be much faster. Doing this in CUDA would require to translate the Strings into identifiers (integer-IDs) which would take much longer than simply counting them.

Can you elaborate why and how you intend to use CUDA for this task?

bye
Marco

system · 27. Februar 2012 um 09:33

Hi Marco. I am sorry for not having answered earlier. I have to do the Fisher’s test with the count of the number of occurances. It is possible ?

Marco13 · 27. Februar 2012 um 10:48

Hello

As far as I can see from http://en.wikipedia.org/wiki/Fisher's_exact_test , the actual computation is rather simple. So I assume that counting the elements is what takes most of the time.

Have you implemented this test in plain Java? Where do you think that a data-parallel algorithm may applied there?

bye
Marco

system · 28. Februar 2012 um 01:59

Hi Marco. In addition to the count of the number of occurances, the gpu should run this Fisher’s test like this:

FisherExactTest fe = new FisherExactTest(1000);
double twoTail = fe.getTwoTailedP(10, 6, 8, 7);

in a data-parallel algorithm. This is possible ?

Marco13 · 28. Februar 2012 um 02:41

Well… it’s not possible (for me) to understand this piece of code in a reasonable amount of time, especially what the … values (or indices?) a,b,c and d actually mean.

But roughly speaking: Whenever you have a pattern like

for (int i=0; i<n; i++)
{
    result** = computeSomething(i); // Does not depend on other results
}

it’s pretty straightforward to implement this in CUDA.

Counting may possibly be done efficiently with a reduction (to avoid using atomics). But in order to get the most out of CUDA, you have to exactly know the algorithm, and for example, also consider the accesses to global memory (i.e. whether the problem is “compute-bound” or “memory-bound”).

system · 28. Februar 2012 um 10:03

I just want to speed up the read of the file of the first sheet named BC_10_06_10, then create 2 new tables with the first row named with:

a/a a/c a/g a/t c/a c/c c/g c/t g/a g/c g/g g/t t/a t/c t/g t/t nocall -/- -/a -/c -/g -/t a/- c/- g/- t/-

and the first column named with the Probe Set ID:

AM_10001 AM_10002 … AM_15506

This new tables contains, respectively, healthy subjects (the subjects in the columns of the first table with black color) and the sick subjects (the subjects in the columns of the first table with red color). In this new tables there are the count of the a/a, a/c, a/g of each AM_10001 …

To finish, i have to do the Fisher’s test with the number of occurances of a/a a/c …

All with the help of Cuda (Jcuda).

Please help me

Marco13 · 28. Februar 2012 um 11:17

I’ll try to help you, but you also have to work on your own.

This code does NOT yet use JCuda for anything. It is just reading the input data, namely the first Sheet of the Table, which has to be present as a CSV file. It is implemented as a quick hack. Not as much a hack as the XLS reader, but still ugly.

Are the values correct? How should the Fisher test be applied to the ‘count’ tables?

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;


public class JCudaFisherTestMain
{
    private static final Map<String, Integer> valueToIndex = new LinkedHashMap<String, Integer>();
    static
    {
        valueToIndex.put("A/A",  0);
        valueToIndex.put("A/C",  1);
        valueToIndex.put("A/G",  2);
        valueToIndex.put("A/T",  3);
        valueToIndex.put("C/A",  4);
        valueToIndex.put("C/C",  5);
        valueToIndex.put("C/G",  6);
        valueToIndex.put("C/T",  7);
        valueToIndex.put("G/A",  8);
        valueToIndex.put("G/C",  9);
        valueToIndex.put("G/G", 10);
        valueToIndex.put("G/T", 11);
        valueToIndex.put("T/A", 12);
        valueToIndex.put("T/C", 13);
        valueToIndex.put("T/G", 14);
        valueToIndex.put("T/T", 15);
        valueToIndex.put("NoCall", 16);
        valueToIndex.put("-/-", 17);
        valueToIndex.put("-/A", 18);
        valueToIndex.put("-/C", 19);
        valueToIndex.put("-/G", 20);
        valueToIndex.put("-/T", 21);
        valueToIndex.put("A/-", 22);
        valueToIndex.put("C/-", 23);
        valueToIndex.put("G/-", 24);
        valueToIndex.put("T/-", 25);
    }

    
    public static void main(String[] args) throws Exception
    {
        SimpleTableData table = SimpleTableData.read("FirstTable.csv");
        List<Integer> sickColumns = Arrays.asList(
            2, 3, 13, 14, 15, 16, 17, 18,
            19, 20, 21, 22, 25, 27
        );
        Set<Integer> set = new LinkedHashSet<Integer>();
        for (int i=1; i<table.getNumCols(); i++)
        {
            set.add(i);
        }
        set.removeAll(sickColumns);
        List<Integer> healthyColumns = new ArrayList<Integer>(set);

        SimpleTableData sickTable = SimpleTableData.create(table, sickColumns);        
        SimpleTableData healthyTable = SimpleTableData.create(table, healthyColumns);        

        System.out.println("Start of sick table");
        System.out.println(sickTable.toString(5));

        System.out.println("Start of healthy table");
        System.out.println(healthyTable.toString(5));
        
        int sickCount[] = createCountMatrix(sickTable);
        int healthyCount[] = createCountMatrix(healthyTable);
        
        System.out.println("Sick count");
        System.out.println(toString2D(sickCount, valueToIndex.size()));

        System.out.println("Healthy count");
        System.out.println(toString2D(healthyCount, valueToIndex.size()));
        
    }
    
    
    private static int[] createCountMatrix(TableData tableData)
    {
        int result[] = new int[valueToIndex.size() * tableData.getNumRows()];
        for (int r=0; r<tableData.getNumRows(); r++)
        {
            for (int c=0; c<tableData.getNumCols(); c++)
            {
                String value = tableData.get(r, c);
                Integer index = valueToIndex.get(value);
                if (index != null)
                {
                    int arrayIndex = r * valueToIndex.size() + index;
                    result[arrayIndex]++;
                }
            }
        }
        return result;
    }
    
    public static String toString2D(int a[], int columns)
    {
        StringBuilder sb = new StringBuilder();
        for (int i=0; i<a.length; i++)
        {
            if (i>0 && i % columns == 0)
            {
                sb.append("
");
            }
            sb.append(String.format("%4s ", String.valueOf(a**)));
        }
        return sb.toString();
    }
    
}



interface TableData
{
    String get(int row, int col);
    int getNumRows();
    int getNumCols();
}


class SimpleTableData implements TableData
{
    private List<String> headers = null;
    private List<List<String>> data = null;
    
    public static SimpleTableData create(SimpleTableData other, List<Integer> columns)
    {
        SimpleTableData result = new SimpleTableData();
        result.headers = new ArrayList<String>();
        for (Integer column : columns)
        {
            result.headers.add(other.getHeader(column));
        }
        result.data = new ArrayList<List<String>>();
        for (int r=0; r<other.getNumRows(); r++)
        {
            List<String> row = new ArrayList<String>();
            for (Integer column : columns)
            {
                row.add(other.get(r, column));
            }
            result.data.add(row);
        }
        return result;
    }
    
    public static SimpleTableData read(String csvFileName) throws IOException
    {
        SimpleTableData result = new SimpleTableData();
        result.data = new ArrayList<List<String>>();
        BufferedReader r = new BufferedReader(
            new InputStreamReader(new FileInputStream(csvFileName)));
        String line = null;
        while (true)
        {
            line = r.readLine();
            if (line == null)
            {
                break;
            }
            String tokens[] = line.split(",");
            if (result.headers == null)
            {
                result.headers = Arrays.asList(tokens);
            }
            else
            {
                result.data.add(Arrays.asList(tokens));
            }
        }
        return result;
    }
    
    @Override
    public int getNumRows()
    {
        return data.size();
    }
    @Override
    public int getNumCols()
    {
        return headers.size();
    }
    @Override
    public String get(int row, int col)
    {
        return data.get(row).get(col);
    }

    public List<String> getHeaders()
    {
        return headers;
    }
    public String getHeader(int col)
    {
        return headers.get(col);
    }
    public List<String> getRow(int row)
    {
        return data.get(row);
    }
    
    
    public List<String> getColumn(String name)
    {
        for (int i=0; i<getNumCols(); i++)
        {
            if (getHeader(i).equals(name))
            {
                return getColumn(i);
            }
        }
        return null;
    }
    
    public List<String> getColumn(final int col)
    {
        return new AbstractList<String>()
        {
            @Override
            public String get(int index)
            {
                return SimpleTableData.this.get(col, index);
            }

            @Override
            public int size()
            {
                return getNumRows();
            }};
    }
    
    public String toString(int numRows)
    {
        StringBuilder sb = new StringBuilder();
        sb.append(headers);
        sb.append("
");
        for (int i=0; i<numRows; i++)
        {
            sb.append(getRow(i));
            sb.append("
");
        }
        sb.append("...
");
        return sb.toString();
    }
}

system · 2. März 2012 um 09:26

I know Marco but, without your help, for me is impossible :(.

If you mean at the values a/a a/c a/g … yes are corrects. I tried to run it, but, with netbeans 6.9, return this error:

Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException: 2
at java.util.Arrays$ArrayList.get(Arrays.java:3381)
at jcudafishertestmain.SimpleTableData.getHeader(Main.java:204)
at jcudafishertestmain.SimpleTableData.create(Main.java:140)
at jcudafishertestmain.Main.main(Main.java:64)
Java Result: 1

I’ve turned the .xls file (the first sheet) in the Csv format and put the path at the row 51 with this look: “C:/Documents and Settings/User/Documents/NetBeansProjects/JCudaFisherTestMain/src/jcudafishertestmain/FirstTable.csv”;

All in just one java class (the Main class that Netbeans make by itself).

Contrary to what i said, the healthy should be the first n and the n+1 should be the sick (or opposite). Maybe it’s this the problem ?

In few words:
From the two tables healthy and sick (that just divides the first big table in healthy and sick) should be created two new tables where the first row contains a/a a/c g/g… and the first column contains the probe set id: each row, in this case, should contains the number of occurrences a/a a/c… presents in the first two tables healthy and sick.
Now i (should) have for each row the information to do the Fisher test.
For example:

Healthy’s occurrences matrix:

           a\a   a\-    t

probe_1 2 3 0
probe_2 1 0 4

Sick’s occurrences matrix:

           a	    a\c   nocall

probe_1 0 3 6
probe_2 3 0 4

Now i should to do the Fisher’s test (with the help of jcuda) for all combinations between healthy and sick for each probe:

probe 1 : { a\a,a }, {a\a,a\c}, {a\a,nocall}, {a-,a }, {a-,a\c} …
probe 2: { a\a,a }, …

and so on for each probe and alleles (a/a a/c a/g …) with value 0 or >1

Marco13 · 2. März 2012 um 13:03

One step after the other. I see that it might be difficult to do something with CUDA/JCuda at all, and I’ll try to help you with that. Nevertheless, you should be able to find out the reason for an ArrayIndexOutOfBoundsException. What does the input file look like? What does it print after the lines
“Start of sick table”
and
“Start of healthy table”
?

system · 2. März 2012 um 22:50

Thanks a lot. For the error of ArrayIndexOutOfBoundsException, maybe is the index of the for at the line 57. The input file look like this.
After the lines “Start of sick table” and “Start of healthy table” it prints the tables such as a string equal to the value of 5 or not?

Marco13 · 3. März 2012 um 06:03

In order to read this file, the Line 170
String tokens[] = line.split(",");
has to be changed to
String tokens[] = line.split(";");
This is the character that separates the entries - for me it was “,” (maybe because it was experted with a German OpenOffice), but in your case, this separator char is “;”

The program should then divide the input table into two tables, one for the healthy and one for the sick. Each of these tables should contain the ‘counted’ number of entries, and should be printed as follows


Sick count
   0    0    0    0    0   13    0    0    0    0    0    0    0    0    0    0    1    0    ...
   0    0    0    0    0    0    0    0    0    0   14    0    0    0    0    0    0    0    ...
   0    0    0    0    0    2    0    8    0    0    0    0    0    0    0    3    1    0    ...
...

This should be the number of occurances of each token:

valueToIndex.put("A/A",  0);
valueToIndex.put("A/C",  1);
valueToIndex.put("A/G",  2);
valueToIndex.put("A/T",  3);
valueToIndex.put("C/A",  4);
valueToIndex.put("C/C",  5);
...

So the first line of this printed table means:
For AM_10001 there are
0 occurances of “A/A”
0 occurances of “A/C”
0 occurances of “A/G”
0 occurances of “A/T”
0 occurances of “C/A”
13 occurances of “C/C”

Which seems to match the first row of the ‘sick’ input table:


AM_10001;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/G;C/C;NoCall;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C;C/C

If this is all correct, we can think about how to do the Fisher test. Probably this should first be implemented in Java, in order to quickly check how it works, and to see if the results are correct. Then we can think about how to implement it with CUDA.

system · 3. März 2012 um 07:36

I changed the „,“ in the „;“ and now it works great.

Is all corrects but, to make it more general, the healthy should be the first n and the sick the n+1 randomly: for example if the sick are from the column 21 to 33, i should put this information in the program so that when the program read the csv file (or the xls, if is the same thing), the program know that from the 2nd to 20th column there are the information on the healthy, instead from the 21th (included) start the information for the sick.

This is the Fisher’s test in java with which can see if the results are correct. Thanks Marco for your help and for the time spent for me until now.

Marco13 · 3. März 2012 um 17:50

At the moment, the information about the healthy/sick elements are hard-coded in the program. It should probably be read from a file, but that’s not important for the actual task at the moment. I’ll try to have a closer look at the fisher test in the next few days when I find the time

Marco13 · 8. März 2012 um 13:36

Have not yet looked at the Fisher test, but have also not forgotten it. I’m busy ATM, but can hopefully continue soon

system · 11. März 2012 um 08:11

Ok Marco. No problem. I’ll try to do something by myself for now. Thanks anyway.

Marco13 · 14. März 2012 um 07:40

Hello

I tried to have a closer look at the Fisher test and your current implementation.

Did you use logarithmic values in the ‘factorials’ table in order to save multiplications (and using additions instead)?

Which of the methods (getCumlativeP/getTwoTailedP/Left/Right) are required? Can you explain in your own words (and for someone who’s not so keen on statistics) what exactly these methods are doing?

bye

system · 21. März 2012 um 10:01

Hi Marco. I am sorry for not having answered earlier. The Fisher test implemented in java is only for example (because in CUDA is not possible to use java’s objects). I have to do an algorithm like this:

The Fisher’s formula is „only“ composed by a certain number of products, sums and calculations of factorials (features that maybe makes the Fisher’s test an excellent candidate to resolution using CUDA).

I have not created this Fisher test class but should be used in the following way:

FisherExactTest f = new FisherExactTest(1000);
       double pvalue = f. getTwoTailedP();

That’s all (for now).

Marco13 · 21. März 2012 um 11:28

I’ve seen this forumla in the Code and on Wikipedia. For me it is not entirely clear which values this formula should be applied to.

Most explainations of the Fisher test refer to 2x2 tables. According to the explaination on http://mathworld.wolfram.com/FishersExactTest.html , it should be possible to apply it to larger tables, but how are a,b,c,d computed?

According to what I have read so far, I would have expected that the columns of the table should be “healthy” and “sick”, and each row should contain the sum (for all subjects) of each observation “A/A”, “T/T” etc.

But maybe I got something wrong. Now there are two tables, one with the “Healthy” and one with the “Sick” counts. Which parameters should not be passed to the “twoTailedP” method? (Isn’t it necessary to create all permutations of tables? If this is true, CUDA may help to reduce the computation time from 1000000 years to 100000 years or so… -_- )

system · 23. März 2012 um 23:46

Hi Marco. Here’s the correct invocation (the former one was wrong):

FisherExactTest f = new FisherExactTest(1000); 
	
double pvalue = f. getTwoTailedP(2,3,4,5);

where a=2, b=3, c=4, d=5 (for example).

And here there’s a complete example for the problem:

Hypothesis: sick are from the column Sub_1 to Sub_5 (included):

ProbeSetID Sub_1 Sub_2 Sub_3 Sub_4 Sub_5 Sub_6 Sub_7 Sub_8
AM_10001 C/C C/C C/C C/C C/C C/C C/C C/C
AM_10002 G/G G/G G/G G/G G/G G/G G/G G/G
AM_10003 C/T C/T C/T C/T C/T T/T C/C C/T
AM_10004 G/G G/G G/G G/G G/G G/G G/G G/G
AM_10005 C/T C/T C/T C/T C/T T/T C/C C/T
AM_10006 T/T C/T C/T C/T C/T T/T C/C C/T
AM_10008 A/A A/A A/A A/A A/A A/A A/A A/A
AM_10010 C/C C/T C/T C/T C/T C/C T/T C/T
AM_10011 A/G A/G A/G A/G A/G A/A G/G A/G
AM_10012 G/G G/G G/G G/G G/G G/G G/G G/G
AM_10013 A/G A/G A/G A/G A/G A/A G/G A/G
AM_10014 C/T C/T C/T C/T C/T C/C T/T C/T
AM_10016 C/T C/T C/T C/T C/T T/T C/C C/T
AM_10017 G/G G/G G/G G/G G/G G/G G/G G/G
AM_10019 NoCall A/A A/A A/A NoCall A/A A/A A/A
AM_10020 C/T C/C C/T C/T C/C C/C C/T C/C

The the first step is reading the file with the purpose of creating two new tables as follows:

Sick’s table:

ProbeSetID Sub_1 Sub_2 Sub_3 Sub_4 Sub_5
AM_10001 C/C C/C C/C C/C C/C
AM_10002 G/G G/G G/G G/G G/G
AM_10003 C/T C/T C/T C/T C/T
AM_10004 G/G G/G G/G G/G G/G
AM_10005 C/T C/T C/T C/T C/T
AM_10006 T/T C/T C/T C/T C/T
AM_10008 A/A A/A A/A A/A A/A
AM_10010 C/C C/T C/T C/T C/T
AM_10011 A/G A/G A/G A/G A/G
AM_10012 G/G G/G G/G G/G G/G
AM_10013 A/G A/G A/G A/G A/G
AM_10014 C/T C/T C/T C/T C/T
AM_10016 C/T C/T C/T C/T C/T
AM_10017 G/G G/G G/G G/G G/G
AM_10019 NoCall A/A A/A A/A NoCall
AM_10020 C/T C/C C/T C/T C/C

Healthy’s table:

ProbeSetID Sub_6 Sub_7 Sub_8
AM_10001 C/C C/C C/C
AM_10002 G/G G/G G/G
AM_10003 T/T C/C C/T
AM_10004 G/G G/G G/G
AM_10005 T/T C/C C/T
AM_10006 T/T C/C C/T
AM_10008 A/A A/A A/A
AM_10010 C/C T/T C/T
AM_10011 A/A G/G A/G
AM_10012 G/G G/G G/G
AM_10013 A/A G/G A/G
AM_10014 C/C T/T C/T
AM_10016 T/T C/C C/T
AM_10017 G/G G/G G/G
AM_10019 A/A A/A A/A
AM_10020 C/C C/T C/C

When i’ve split healthy from sick, i can proceed with the creations of tables of occurrences for healthy and sick. The tables of occurrences have the following characteristics.
The first column is the same for both tables, while the first row will contain the alleles identified in the two tables ie:

Healthy’s occurrences table:

ProbeSetID A/A A/G C/C C/T G/G T/T
AM_10001 0 0 3 0 0 0
AM_10002 0 0 0 0 3 0
AM_10003 0 0 1 1 0 1
AM_10004 …and so on
AM_10005
AM_10006
AM_10008
AM_10010
AM_10011
AM_10012
AM_10013
AM_10014
AM_10016
AM_10017
AM_10019
AM_10020

Sick’s occurrences table:

ProbeSetID A/A A/G C/C …. continues
AM_10001 0 0 4
AM_10002 … continues
AM_10003 …
AM_10004
AM_10005
AM_10006
AM_10008
AM_10010
AM_10011
AM_10012
AM_10013
AM_10014
AM_10016
AM_10017
AM_10019
AM_10020

Is carried out in the same way as that of the healthy.

Once done to count how many times each allele is present in every row i can calculate the Fisher’s test as follows:

The rows are the same in both tables and identified by probeID.
Now i have all the info to create and calculate individual fisher’s test. The Fisher’s test obtainable from the first row in this case are only one as the only alleles in the two classes with occurrence >0 are alleles C/C and C/C.
Therefore we obtain for the probe AM_10001 (where Class A are healthy and Class B are sick):

Probe Set ID: AM_10001

                  ClassA  ClassB

C/C 3 0
C/C 0 4

And then we have to do this for all probes going to consider all possible combinations of alleles with occurrence >0 in the two classes but identified by the same probe. I hope i explained myself.