OK, I've never been particularly familiar with statistics and probability (it scares me - too many paradoxes are based on it! :eek: ).
I wonder which parts of the application have been identified as bottlenecks and should be accelerated with data-parallel GPU processing, and what the intended "interface" for these functions looks like.
For example, if the computation is built on some object-oriented structure of classes (for vectors and the like), it may be hard to port to CUDA. If you rely mainly on interfaces, it might be easier (that depends). And if you are already working on plain float arrays, it should be easiest.
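Just to illustrate what I mean (a minimal sketch; the Vector3 class and the flatten method are purely made-up names): with an object-oriented layout the data is scattered over many small heap objects and has to be "flattened" into one contiguous array before it can be copied to the GPU, whereas a plain float array can be handed to JCuda/JOCL directly as a single buffer.

```java
// Minimal sketch (hypothetical names) comparing an object-oriented data
// layout with a flat float array that can be copied to the GPU as one buffer.
public class DataLayoutSketch
{
    // Object-oriented layout: each vector is a separate heap object.
    // The GPU cannot use this directly; the data has to be flattened
    // before each kernel call.
    static class Vector3
    {
        float x, y, z;
        Vector3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    // Converts the object-oriented representation into a single float
    // array (x0,y0,z0, x1,y1,z1, ...) that can be passed to JCuda/JOCL.
    static float[] flatten(Vector3[] vectors)
    {
        float[] result = new float[vectors.length * 3];
        for (int i = 0; i < vectors.length; i++)
        {
            result[i * 3 + 0] = vectors[i].x;
            result[i * 3 + 1] = vectors[i].y;
            result[i * 3 + 2] = vectors[i].z;
        }
        return result;
    }

    public static void main(String[] args)
    {
        Vector3[] vectors = { new Vector3(1, 2, 3), new Vector3(4, 5, 6) };
        float[] flat = flatten(vectors);
        System.out.println(java.util.Arrays.toString(flat));
    }
}
```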
The same applies to the computation and the "data-processing workflow" itself: depending on how compute-intensive the analysis actually is, and how the GPU-accelerated step is supposed to be integrated into the rest of the application, there may be more or fewer hurdles to overcome.
Although I'm certainly NOT a CUDA expert: If you can give an example of such a computation, and maybe some info about the data structures that are involved, I might be able to give a more precise answer.
(BTW: I assume that this was only meant as an example, but if you intend to define the function that is to be executed only at runtime, you might consider using OpenCL/JOCL. It is conceptually similar to CUDA, and it comes with a built-in compiler, so the actual kernel function can easily be defined at runtime.)
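To show what I mean (a rough sketch based on the usual JOCL vector-addition example; the kernel name "add" and the sample data are just placeholders): the kernel source is an ordinary Java String, so it can be assembled or even generated at runtime and then compiled with clCreateProgramWithSource / clBuildProgram.

```java
import static org.jocl.CL.*;
import org.jocl.*;

// Rough sketch: building and running a kernel whose source code is
// defined at runtime as a plain Java String (modeled on the usual
// JOCL vector-addition example; names and data are placeholders).
public class RuntimeKernelSketch
{
    public static void main(String[] args)
    {
        setExceptionsEnabled(true);

        // Obtain a platform, a device, a context and a command queue
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_DEFAULT, 1, devices, null);
        cl_context_properties properties = new cl_context_properties();
        properties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        cl_context context = clCreateContext(properties, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        // The kernel source is an ordinary String - it could just as well
        // be assembled or generated at runtime
        String source =
            "__kernel void add(__global const float *a," +
            "                  __global const float *b," +
            "                  __global       float *c)" +
            "{" +
            "    int gid = get_global_id(0);" +
            "    c[gid] = a[gid] + b[gid];" +
            "}";

        // Compile the source at runtime and create the kernel
        cl_program program = clCreateProgramWithSource(
            context, 1, new String[]{ source }, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "add", null);

        // Some sample data
        int n = 4;
        float[] a = { 1, 2, 3, 4 };
        float[] b = { 5, 6, 7, 8 };
        float[] c = new float[n];
        cl_mem memA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_float * n, Pointer.to(a), null);
        cl_mem memB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            Sizeof.cl_float * n, Pointer.to(b), null);
        cl_mem memC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
            Sizeof.cl_float * n, null, null);

        // Set the arguments, run the kernel and read back the result
        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(memA));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(memB));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(memC));
        clEnqueueNDRangeKernel(queue, kernel, 1, null,
            new long[]{ n }, null, 0, null, null);
        clEnqueueReadBuffer(queue, memC, true, 0,
            Sizeof.cl_float * n, Pointer.to(c), 0, null, null);

        System.out.println(java.util.Arrays.toString(c)); // [6.0, 8.0, 10.0, 12.0]
    }
}
```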