Matrix multiply with diagonal

I have been using the JCublas cublasDsyrk routine to multiply a matrix by its transpose, producing an upper triangular result, i.e. XtX.

I actually want to calculate XtDX, where D is a diagonal matrix. I need to calculate (and factorise) this matrix many times, with the same X but a changing D. This is a standard step in the Newton-Raphson solution of the logistic regression equations.

I couldn't find anything in BLAS to do this, so I guess I have to code it myself. Any pointers? X is 100,000 x 100, so D is a 100,000 x 100,000 diagonal matrix (stored as a vector) and the result XtDX is 100 x 100. The factorisation I'm not bothered about; it is very fast. Constructing the matrix is the big bottleneck (don't believe all those people saying factorising is the bottleneck).
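For reference, because D is diagonal it never needs to be formed explicitly: each entry of the 100 x 100 result is just a weighted inner product of two columns of X. A minimal pure-Java sketch of that (illustrative method name, no CUDA involved):

```java
// Computes C = XtDX where D is diagonal, passed as the vector d.
// X is n x p (row-major), d has length n, the result C is p x p.
// Only the upper triangle is filled, matching what cublasDsyrk produces.
static double[][] xtdx(double[][] X, double[] d) {
    int n = X.length, p = X[0].length;
    double[][] C = new double[p][p];
    for (int i = 0; i < n; i++) {
        double[] row = X[i];
        double w = d[i];
        for (int j = 0; j < p; j++) {
            double wj = w * row[j];
            for (int k = j; k < p; k++) {  // upper triangle only
                C[j][k] += wj * row[k];
            }
        }
    }
    return C;
}
```

A straightforward CUDA port of this loop nest would likely be much slower than a CUBLAS-based formulation, since CUBLAS kernels are heavily tuned.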

Should I start with a raw matrix multiply and go from there? I have done no CUDA programming. Obviously CUBLAS is optimised, but would a basic implementation be much slower?

I've found a suitable answer: construct a sqrt(D) diagonal matrix, form sqrt(D)X, and apply syrk. This seems to give the same answers as my original pure Java, after a bit of fiddling.
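For anyone finding this later: the trick works because (sqrt(D)X)t (sqrt(D)X) = XtDX, which is exactly the XtX shape that syrk computes, provided every diagonal entry of D is non-negative (true for the logistic Hessian weights). The scaling step might look something like this in plain Java (hypothetical method name; in practice the scaled matrix would be handed to cublasDsyrk):

```java
// Scale row i of X by sqrt(d[i]), so that S-transpose times S
// equals XtDX. Requires d[i] >= 0 for all i.
static double[][] scaleRowsBySqrt(double[][] X, double[] d) {
    int n = X.length, p = X[0].length;
    double[][] S = new double[n][p];
    for (int i = 0; i < n; i++) {
        double s = Math.sqrt(d[i]);
        for (int j = 0; j < p; j++) {
            S[i][j] = s * X[i][j];
        }
    }
    return S;
}
```

This scaling pass is O(np), negligible next to the O(np^2) syrk, which is why the approach pays off.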

@NigelEssence Good that you managed to solve this. I noticed this thread earlier, but wasn't sure how to respond. The topic is rather specific (and after naively throwing "Newton Raphson CUBLAS" into Google, I wasn't sure whether the actual question was more of a mathematical/conceptual one - I'm not necessarily familiar with all possible application cases of CUBLAS :wink: ). Would you mind adding some details about how you are computing the square root, or maybe some performance insights?

If you look up CUDA and logistic regression, you should be able to find various research papers. The problem arises when you have large data sets (in my case > 100,000 rows) but fewer variables (e.g. 50), so creating the Hessian completely dominates everything else. Java is fine for the other parts. In my algorithm the Hessian also has to be created very many times - maybe 100,000.
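To spell out why only D changes between iterations: in logistic regression the Hessian is XtDX with diagonal weights d_i = p_i(1 - p_i), where p_i is the fitted probability under the current coefficients. A sketch of computing those weights (illustrative names, assuming X is n x p row-major):

```java
// Newton-Raphson weights for logistic regression: d[i] = p_i * (1 - p_i),
// where p_i = sigmoid(x_i . beta). These are always in [0, 0.25],
// so the sqrt(D) trick above is safe.
static double[] logisticWeights(double[][] X, double[] beta) {
    int n = X.length, p = beta.length;
    double[] d = new double[n];
    for (int i = 0; i < n; i++) {
        double eta = 0.0;                       // linear predictor x_i . beta
        for (int j = 0; j < p; j++) eta += X[i][j] * beta[j];
        double prob = 1.0 / (1.0 + Math.exp(-eta));
        d[i] = prob * (1.0 - prob);
    }
    return d;
}
```

Since X is fixed across iterations, only this O(np) weight pass and the XtDX product need redoing each time.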

I tested out JCuda a couple of years ago; my problems were too small to benefit from it then, but now I am getting a speedup of about 3x on this problem.

In principle it would probably go faster with special-purpose CUDA code, since the X matrix doesn't change, only the diagonal D matrix. But I will leave that problem to the younguns!

Hi, how are you?

Recently I have been using ALGLIB for some calculations, especially with large sparse matrices, and I want to compute At multiplied by A. But I can't find such an interface in ALGLIB.

Could you please help me? Thank you…

If you're talking about ALGLIB, then this is in no way related to JCuda or JCublas…