Hello,
So I tried this locally. I set up the https://git.elphel.com/Elphel/imagej-elphel project (I had to use a workaround in Eclipse so that it would find the tools.jar: it had to be started with a dedicated path name so that it ran with a JDK, not a JRE). Also, because I usually have the latest CUDA version installed (for obvious reasons), I updated the version to 10.2.0 in the POM (a comment there indicated that 9.2 was only required for some TensorFlow compatibility).
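For reference, the change was roughly like this; this is only a sketch, since I don't remember the exact property name the project's pom.xml uses for this:

```xml
<properties>
    <!-- CUDA toolkit version for the JCuda dependencies
         (hypothetical property name). Bumped to 10.2.0; the
         comment in the POM said 9.2 was only needed for
         TensorFlow compatibility. -->
    <jcuda.version>10.2.0</jcuda.version>
</properties>
```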
But I’ve seen that you seem to have switched off the "separate compilation" attempt in the current version. Now, I could try to generate .CU files that are filled with dummy functions so that they are intentionally too large, and then compare the behavior of the Java version against a native version that only compiles and loads these files, but I’m not sure whether that is a sensible way to go.
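The dummy-function idea would look roughly like this: a tiny generator that emits a .cu file containing many trivial kernels, purely to inflate the resulting PTX. A sketch, not tied to the actual build (all names are made up):

```java
import java.util.stream.IntStream;

public class DummyCuGenerator {

    // Emit CUDA source with 'count' trivial kernels, purely to
    // inflate the compiled PTX size for load-time experiments.
    static String generate(int count) {
        StringBuilder sb = new StringBuilder();
        IntStream.range(0, count).forEach(i ->
            sb.append("extern \"C\" __global__ void dummy").append(i)
              .append("(float *data) { data[0] += ").append(i)
              .append(".0f; }\n"));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print a small sample; in a real experiment one would
        // write thousands of kernels to a file and run nvcc on it.
        System.out.print(generate(3));
    }
}
```

The generated file could then be compiled with nvcc and loaded both via JCuda and via a minimal native program, to see where the size-dependent behavior differs.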
According to your last comment, it seems you could work around this by making the PTX smaller. So I think I (have to) consider this a "low-priority" issue for now, one that I might look into if it becomes a blocker.
BTW: I had a short look to see how JCuda is actually used there. The Eyesis_Correction class has a main method and can be started, but I don’t know exactly what I can do with that. The Eyesis_Correction.java file has more than 10000 lines of code; I could probably spend years just trying to understand that single file. And that’s not even close to the point where the GPUTileProcessor is used. That happens in another class, and frankly, method signatures like
public double [][][][][][] clt_aberrations_quad_corr(
        final ImageDttParameters imgdtt_params,
        final int macro_scale,
        final int [][] tile_op,
        final double [][] disparity_array,
        final double [][][] image_data,
        final boolean [][] saturation_imp,
        final double [][][][] clt_corr_combo,
        final double [][][][][] clt_corr_partial,
        final double [][] clt_mismatch,
        final double [][] disparity_map,
        final double [][][][] texture_tiles,
        final int width,
        final double corr_fat_zero,
        final boolean corr_sym,
        final double corr_offset,
        final double corr_red,
        final double corr_blue,
        final double corr_sigma,
        final boolean corr_normalize,
        final double min_corr,
        final double max_corr_sigma,
        final double max_corr_radius,
        final boolean max_corr_double,
        final int corr_mode,
        final double min_shot,
        final double scale_shot,
        final double diff_sigma,
        final double diff_threshold,
        final boolean diff_gauss,
        final double min_agree,
        final boolean dust_remove,
        final boolean keep_weights,
        final GeometryCorrection geometryCorrection,
        final GeometryCorrection geometryCorrection_main,
        final double [][][][][][] clt_kernels,
        final int kernel_step,
        final int transform_size,
        final int window_type,
        final double [][] shiftXY,
        final double disparity_corr,
        final double [][][] fine_corr,
        final double corr_magic_scale,
        final double shiftX,
        final double shiftY,
        final int debug_tileX,
        final int debug_tileY,
        final boolean no_fract_shift,
        final boolean no_deconvolution,
        final int threadsMax,
        final int globalDebugLevel)
are impressive, but definitely not the good kind of impressive…
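Just to illustrate what I mean: a signature like that usually calls for a parameter object, so that call sites name what they set and sensible defaults cover the rest. A minimal sketch with a builder (all class, field, and default names here are hypothetical, not taken from the actual code):

```java
// Hypothetical parameter object bundling a few of the correlation
// settings that are currently passed as loose parameters.
public class CorrParams {
    final double fatZero;
    final double red;
    final double blue;
    final double sigma;

    private CorrParams(Builder b) {
        this.fatZero = b.fatZero;
        this.red = b.red;
        this.blue = b.blue;
        this.sigma = b.sigma;
    }

    // Builder so callers only set what they need, by name.
    public static class Builder {
        private double fatZero = 0.05; // made-up defaults
        private double red = 0.25;
        private double blue = 0.25;
        private double sigma = 0.8;

        public Builder fatZero(double v) { fatZero = v; return this; }
        public Builder red(double v) { red = v; return this; }
        public Builder blue(double v) { blue = v; return this; }
        public Builder sigma(double v) { sigma = v; return this; }
        public CorrParams build() { return new CorrParams(this); }
    }

    public static void main(String[] args) {
        // Only two values are overridden; the rest keep defaults.
        CorrParams p = new CorrParams.Builder()
                .fatZero(0.1)
                .red(0.3)
                .build();
        System.out.println(p.fatZero + " " + p.red + " " + p.blue);
    }
}
```

With something like this, a call to the method could shrink to a handful of grouped objects instead of ~50 positional arguments.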
So on the one hand, I’m curious, and would like to know where the performance bottlenecks are and how they are solved with JCuda. But it seems like the effort to zoom into the right part of the code here is prohibitively large.
However, if you encounter any further issues with JCuda, just let me know.
(And if there’s a magic place with these lines…
public static void main(String[] args) {
    runWithCPU("inputImage.png");
    runWithGPU("inputImage.png");
}
that can be used as an entry point for further code browsing (and profiler runs and such), that would be great…)
bye
Marco