I think it’s possible there are small variations between our code examples that we may have missed, so I will give you a copy of the FFT code I’m using, as well as the dummy data generator, so that you can try to reproduce the error exactly. Now if only I could find a text hosting service that isn’t blocked at work…
Ok, here is the full FFT code I’m using:
And here is the dummy data generator code I’m using:
And once again, I really appreciate your help and any advice you can give me!
You may have overlooked this post: http://forum.byte-welt.de/showthread.php?p=17971#post17971 But your new posts largely cover the questions from there, so that’s OK. I already ran a test with the code from this post and a 256MB file, comparing it to the memory-mapped file version. It hardly made a difference, but maybe I’ll examine it further on another occasion. (I’m just not sure whether the limiting factor for the speed here is the computation of the FFT, or the data transfer from/to the hard disk…)
I downloaded your code and tried to start it. Unfortunately, I’m still on a Win32 machine here, so there is a limit of about ~1.3 GB of memory allocation, which clearly is not enough. Furthermore, I only have a 1GB card. So I reduced it to use MEGABYTE (1048576) and availableVRAM=256, which fits into the 1GB without any rounding hassle. This test ran without any issue. (The strange thing is that the first run took only ~30 seconds, whereas subsequent runs took 1:40 minutes - I’ll have to examine that more closely.)
What exactly goes wrong in this test (with 1 ‘GIGABYTE’)? Is it still the “cudaErrorLaunchFailure”? If necessary, I can try to run a test on a PC in my office, with Win64, to check whether there’s a problem that only occurs with larger input sizes.
What happens if you reduce the size to 256MB as well?
Code with 256MB for Copy&Paste-convenience:
package tests.jcufft.ross;

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.util.Date;
import java.util.concurrent.TimeUnit;

import jcuda.Pointer;
import jcuda.jcufft.*;
import jcuda.runtime.*;

// From http://forum.byte-welt.de/showthread.php?t=4011&page=3
/**
 * Benchmarks a Real-to-Complex, 1D, forward, single-precision FFT using CUFFT (and JCuda for the wrapper).
 * The program discards the second half of the output array (the mirror image satisfying the symmetry
 * condition), meaning that the input and output file sizes are the same. Reading and writing from/to disk
 * uses a direct ByteBuffer of the transfer block size. Benchmarking is done using System.currentTimeMillis().
 *
 * Usage: java JCufftBenchmark <absolute input file path> <fft size>
 *
 * @author Ross
 */
public class JCufftBenchmark {
    public static final int FLOAT = 4;
    public static final int GIGABYTE = 1073741824;
    public static final int MEGABYTE = 1048576;

    public static void main(String[] args) throws IOException {
        JCuda.setExceptionsEnabled(true);
        JCufft.setExceptionsEnabled(true);

        // Parse arguments using some (very) basic error checking
        if (args.length != 2) {
            System.err.println("Wrong number of arguments. Usage: java JCufftBenchmark </file/path> <fftsize>");
            System.exit(-1);
        }
        int fftSize = 0;
        String path = args[0];
        try {
            fftSize = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.err.println("FFT size must be an integer. Program will now exit.");
            System.exit(-1);
        }
        File input = new File(path);
        try {
            if (input.getParentFile().getUsableSpace() < input.length()) {
                System.err.println("Java reports the directory does not exist, is not usable, or there is not enough space to write the output to the path you specified.");
                System.exit(-1);
            }
        } catch (NullPointerException e) {
            System.err.println("File path must be absolute (for example, /home/user/data/file.dat).");
            System.exit(-1);
        }
        String outpath = input.getParentFile().getParent() + "/outputs/" + input.getName().replaceFirst("\\.dat$", "") + "jcufft" + fftSize + ".fft";
        String logpath = input.getParentFile().getParent() + "/logs/" + input.getName().replaceFirst("\\.dat$", "") + "jcufft" + fftSize + ".log";
        File output = new File(outpath);

        // availableVRAM sets the amount of memory (in megabytes) you think your CUDA card has available
        // after taking into account allocation for calculations, etc.; i.e., it should NOT be set to the
        // total VRAM size, but to a large portion of it, and it should be a whole number (because I'm too
        // lazy to code the rounding)
        int availableVRAM = 256;
        int bufferSize = MEGABYTE * availableVRAM;

        // Number of FFTs to run on this block/buffer size (for the CUFFT call)
        int batches = bufferSize / (fftSize * FLOAT);

        // Number of buffers to copy in/out of CUDA memory (this is partly why everything must line up
        // perfectly in gigabyte sizes - none of this is designed for non-whole file sizes)
        long buffers = input.length() / bufferSize;

        // Note that the VRAM setting also affects buffer allocation
        System.out.println("Allocate " + bufferSize);
        ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize);
        FloatBuffer floatBuff = inputBuff.asFloatBuffer();
        float[] jcufft = new float[bufferSize / FLOAT];

        System.out.println("\n-------------------------------------------");
        System.out.println("Starting JCufft 1D R2C FFT benchmark.");
        System.out.println("Input file  : " + path);
        System.out.println("Output file : " + outpath);
        System.out.println("FFT size    : " + fftSize);
        System.out.println("Log file    : " + logpath);
        System.out.println("-------------------------------------------");
        System.out.println("\nFile will be split into " + (int)buffers + " CUDA buffers. Now running...");

        // Start the timer before commencing I/O
        long start = System.currentTimeMillis();
        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new FileOutputStream(output).getChannel();
        long size = inchannel.size();
        do {
            // Read a block and transfer the data into the float array
            inputBuff.clear();
            inchannel.read(inputBuff);
            floatBuff.rewind();
            floatBuff.get(jcufft);

            // Create pointers to host and device memory, allocate device memory
            Pointer float_host_input = Pointer.to(jcufft);
            Pointer float_device_input = new Pointer();
            JCuda.cudaMalloc(float_device_input, bufferSize);

            // Copy the data to the device, perform the FFT in-place, copy back to the host
            JCuda.cudaMemcpy(float_device_input, float_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            cufftHandle plan = new cufftHandle();
            JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
            JCufft.cufftExecR2C(plan, float_device_input, float_device_input);
            JCuda.cudaMemcpy(float_host_input, float_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);
            JCufft.cufftDestroy(plan);
            JCuda.cudaFree(float_device_input);

            // Write the data to the output file
            floatBuff.clear();
            floatBuff.put(jcufft);
            inputBuff.rewind();
            outchannel.write(inputBuff);
        } while (inchannel.position() != size);

        // Finish benchmarking and write the result to the log file
        inchannel.close();
        outchannel.close();
        long end = System.currentTimeMillis();
        long total = end - start;
        System.gc();
        System.out.println("Finished! Job took " + total + " milliseconds. Look in " + logpath + " for the results.");
        logResults(logpath, total, fftSize);
    }

    /**
     * Write the results to disk.
     * @param filename name of the log file
     * @param time duration of the job run, in milliseconds
     * @param size FFT size of the job
     */
    public static void logResults(String filename, long time, int size) {
        String[] properties = { "os.name", "os.version", "os.arch", "java.vendor", "java.version" };
        try {
            BufferedWriter out = new BufferedWriter(new FileWriter(filename, false));
            out.write(new Date().toString());
            out.newLine();
            out.write("System properties:");
            out.newLine();
            out.write("    OS.name      = " + System.getProperty(properties[0]));
            out.newLine();
            out.write("    OS.version   = " + System.getProperty(properties[1]));
            out.newLine();
            out.write("    JRE arch     = " + System.getProperty(properties[2]));
            out.newLine();
            out.write("    Java.vendor  = " + System.getProperty(properties[3]));
            out.newLine();
            out.write("    Java.version = " + System.getProperty(properties[4]));
            out.newLine();
            out.write("    Available processors = " + Runtime.getRuntime().availableProcessors());
            out.newLine();
            out.write("    FFT size = " + size);
            out.newLine();
            out.newLine();
            out.write("Time taken:");
            out.newLine();
            out.write(time + " milliseconds.");
            out.newLine();
            out.write(String.format("%d minutes, %d seconds.",
                TimeUnit.MILLISECONDS.toMinutes(time),
                TimeUnit.MILLISECONDS.toSeconds(time) - TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(time))));
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
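For completeness: since the class is in the package tests.jcufft.ross, it has to be started with the fully qualified name, for example (the file name and FFT size here are just placeholders):

java tests.jcufft.ross.JCufftBenchmark /home/user/data/test.dat 131072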
Sorry, you’re right, I didn’t see that post. Yes, it is “cudaErrorLaunchFailure”; I tried your modification, and it still gives the same error (which means it’s probably not a JCuda problem, but hardware/software differences between our platforms). I guess you are right that it has something to do with memory allocation in the end - I kept decreasing GIGABYTE until I got down to 512KB, at which point it ran with no issue, but that sort of defeats the purpose of minimizing memory transfers. Maybe CUDA can’t do an in-place FFT on such a huge chunk? I am running Java with the flags “-d64 -server -Xms6g -Xmx6g”, if it makes any difference, on a 64-bit Ubuntu machine.
I would try out the hardware theory (I’m using a GT620 with 96 cores right now), but unfortunately my GTX670 took a little dive yesterday while I was working on its watercooling :twisted:
Interesting - 256MB does work if I pull the CUFFT plan allocation back into the loop. If the plan allocation is outside the loop (i.e. once), only 512KB works. 1GB does not work either way, but I’d be happy with 256MB. Trying to confirm the results are correct now.
Also, every time I get the random question (to confirm I am not a bot) about how many legs a cow has, neither “4” nor “four” works.
EDIT> I was writing this post while you wrote your last one <EDIT
So that’s really 512KB and not 512MB? Of course, that does not make much sense, and I don’t see a reason why it should not work with larger sizes. I’m slowly running out of ideas now. Unfortunately, I have no way to run tests on a Linux 64 machine at all, so this is really difficult.
One next step could be to adapt the “NVIDIA Corporation\NVIDIA GPU Computing SDK 4.2\CUDALibraries\src\simpleCUFFT” example from the CUDA SDK (or create a similar example from scratch) in order to verify that the problem is not related to JCufft itself, and to see whether it’s possible to execute larger FFTs at all. Although I cannot see a reason why there should be such a general problem in JCufft, I cannot rule out this option as long as I have not tested it. If you have installed the SDK, can you run the simpleCUFFT example in general? If so, I can try to adapt this sample to resemble the JCufftBenchmark more closely, so that this can be tested as well.
Yeah, I didn’t notice that I had pulled the plan allocation and destruction out of the main processing loop earlier; if I leave it in there (like you did in your 256MB example, which I didn’t copy and paste at first), I can actually get transfers up to 512MB working, with it crashing out at 1GB (which is actually OK with me, haha).
But if I allocate just one plan, it will not accept anything greater than 512KB memory copies. I think this has more to do with how CUDA works with memory internally than with JCufft. Unfortunately I don’t have the SDK installed; there are some modifications needed to the makefiles to get it to compile that I haven’t done yet.
I think I am going to run a big file (64GB) through 512MB buffers, and then compare it, FFT by FFT, using your threshold method from the JCufft sample. If it works correctly, I think that will prove that
- it’s not a good idea to run in-place FFTs on chunks as big as 1GB, and
- a CUFFT plan needs to be allocated and destroyed each time you perform an FFT,
which would make more sense than some internal JCuda issue.
OK, there may be some not-so-obvious (or not obviously documented) limitations for “large” FFTs. So I could understand that it works for 256MB, but not for 1GB, for whatever reason, and this would IMHO be acceptable.
But in any case, the fact that it works for a specific size when the plan is created during each run, whereas it bails out (for the same size) when one attempts to reuse the plan, makes the whole idea of a ‘plan’ absurd. To quote from the CUFFT documentation:
“The advantage of this approach is that once the user creates a plan, the library stores whatever state is needed to execute the plan multiple times without recalculation of the configuration.”
(it also adds some notes about the memory requirements there, and finally says:)
“One can create a CUFFT plan and perform multiple transforms on different data sets by providing different input and output pointers.”
That’s exactly the point of a plan: It SHOULD be reused :verzweifel:
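Just to make it explicit: the reuse pattern that the documentation describes would look roughly like this. This is only a minimal sketch with made-up sizes, and it uses an out-of-place transform so that the default data layout of cufftPlan1d applies:

import jcuda.Pointer;
import jcuda.jcufft.JCufft;
import jcuda.jcufft.cufftHandle;
import jcuda.jcufft.cufftType;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class PlanReuseSketch {
    public static void main(String[] args) {
        int fftSize = 1024; // made-up sizes, only for illustration
        int batches = 16;
        int inBytes = fftSize * batches * 4; // real input
        int outBytes = (fftSize / 2 + 1) * 2 * batches * 4; // complex output: n/2+1 values per batch

        Pointer deviceIn = new Pointer();
        Pointer deviceOut = new Pointer();
        JCuda.cudaMalloc(deviceIn, inBytes);
        JCuda.cudaMalloc(deviceOut, outBytes);

        // Create the plan ONCE...
        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);

        float[] hostIn = new float[fftSize * batches];
        float[] hostOut = new float[(fftSize / 2 + 1) * 2 * batches];

        // ...and execute it for each chunk of data
        for (int chunk = 0; chunk < 4; chunk++) {
            JCuda.cudaMemcpy(deviceIn, Pointer.to(hostIn), inBytes, cudaMemcpyKind.cudaMemcpyHostToDevice);
            JCufft.cufftExecR2C(plan, deviceIn, deviceOut);
            JCuda.cudaMemcpy(Pointer.to(hostOut), deviceOut, outBytes, cudaMemcpyKind.cudaMemcpyDeviceToHost);
        }

        // Destroy the plan and free the memory once, at the end
        JCufft.cufftDestroy(plan);
        JCuda.cudaFree(deviceIn);
        JCuda.cudaFree(deviceOut);
    }
}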
Maybe I’ll try to create an example based on the ‘simpleCUFFT’ one, because I’d really like to make sure that this issue is not related to JCufft. (I can’t imagine that it is, because the plan management is fairly simple, and IF there was a problem, I’d rather expect it to occur in the opposite case - namely, when the plan is created and destroyed multiple times.) However, since I could not yet reproduce the error here, and cannot test it on a Linux64 machine, I’m not sure whether such a test would bring any new insights at all…
Have you tried pulling the plan allocation out of the loop in your 256MB test? If the plan is really at fault, I think it should give you errors on your machine as well… it is odd indeed; I know plans are generally supposed to be reused, from using them in FFTW. I just plotted the output from JTransforms and JCufft for a 4GB file and it seems to be close to identical.
I tried it before, and just tried it again to be sure: It works for 256 MB also when the creation of the plan and the allocation of the device memory (and of course, their destruction and freeing) are pulled out of the loop.
But I just managed to start it with availableVRAM=512 (MB), by carefully tweaking the Xmx to a size that still allows the float arrays to be allocated, but leaves enough space for the direct buffers: Now I also received a cudaErrorLaunchFailure - regardless of whether the setup was done inside or outside of the loop. After another look at the CUFFT documentation, this does not seem too surprising:
In some transforms, the temporary space allocation can be as low as the input data size.
So one probably has to assume that the maximum possible size for an FFT is at most(!) half of the available memory - and considering that it may not be a single bit more (but possibly even less, depending on the FFT type), 256MB seems to be a valid limit for a 1GB card, and 512MB seems to be valid for a 2GB card.
Of course, one could try to squeeze out the last bit by using odd sizes of, say, 800MB on a 2GB card, but this might not be worth the hassle.
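A quick sanity check for the sizing could query the free memory at runtime. A sketch, assuming (per the quote above) that the temporary workspace may need as much as the input itself:

import jcuda.runtime.JCuda;

public class ChunkSizeSketch {
    public static void main(String[] args) {
        // Query free and total device memory (in bytes)
        long[] free = new long[1];
        long[] total = new long[1];
        JCuda.cudaMemGetInfo(free, total);

        // If the workspace can be as large as the input, then input + workspace
        // must both fit, so a chunk may use at most about half of the free memory
        long maxChunk = free[0] / 2;
        System.out.println("Free VRAM : " + free[0] / 1048576 + " MB of " + total[0] / 1048576 + " MB");
        System.out.println("Max chunk : ~" + maxChunk / 1048576 + " MB");
    }
}

This would also explain the numbers above: after the driver and display take their share of a 1GB card, half of the remaining free memory lands in the region of 256MB.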
BTW, another important point: Have you ever tried running your program with the line
JCufft.cufftExecR2C
commented out, and compared the running time? At least for me, by far the largest part of the time is consumed by disk I/O - but of course, this may depend on many factors, like the disk and the OS (or whether you have an SSD or still such an old magnetic one like mine).
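For a quick check of the disk part alone, one could also time only the reads with the same kind of direct buffer, without any CUDA calls at all. A minimal sketch (the buffer size is a placeholder, and of course this only covers the reads; the writes would add a similar share):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class IoOnlyTiming {
    public static void main(String[] args) throws IOException {
        // Same read pattern as the benchmark, but the data is simply discarded,
        // so the measured time is (almost) pure disk I/O
        ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1048576);
        long start = System.currentTimeMillis();
        FileChannel in = new FileInputStream(args[0]).getChannel();
        while (in.read(buffer) != -1) {
            buffer.clear();
        }
        in.close();
        System.out.println("I/O only: " + (System.currentTimeMillis() - start) + " ms");
    }
}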
However, some aspects, especially the different behavior depending on whether the setup/shutdown is done in the loop or outside, are still hard to explain…
Hah, strange… I’ve got it working with correct output at 512MB “buffers” now, so at this point I’m just happy everything works, though I’m still not quite sure what the problem with the plan allocation might be. 512MB is not a big compromise compared to 1GB, and it’s good you found some details in the official documentation to support the size limits (much of my searching of the “unofficial” documentation has been hampered by the fact that the NVIDIA forums have been down for like 2 weeks now, like you said).
I have not tried commenting out the actual calculation, although I’m sure you’re right about disk/memory I/O being the bulk of computation time. I did run a test to compare with JTransforms, however, and the calculation improvement from using CUDA is still noticeable; for a 16GB file, 128k FFTs took 15m 13s with JTransforms, and 10m 42s with CUDA. I don’t want to use my SSD, as writing and reading random, non-compressible data is what SSD torture tests do to simulate “use over time”; as you probably know, SSDs degrade over time, and I don’t want to ruin mine too early for a project I’m doing out of curiosity.
OK, it’s good to hear that it works now and brings a noticeable speedup. However, the “fragility” still concerns me a little. Of course, it’s a “border case” to do FFTs that are so large (and especially so close to the size of the available RAM), but it’s still difficult to say which sizes are valid, and where some of the differences in behavior come from.
I just tried to run a test on a Win64 machine, with larger buffer size, but noticed that this card also has “only” 1GB of RAM…
I think I finally figured it out - you have to use cufftPlanMany() for batched input, not cufftPlan1d(). At least that’s what I got from reading the first few pages of the CUFFT guide - I am not sure if I’m correct; I am not able to test this at the moment. Will let you know.
When I read the guide again because of this thread, I also thought for a moment that it might be necessary to use cufftPlanMany. But after having a closer look, I thought that it was only a more “general” plan creation function, and that for the case of a “simple” 1D plan, cufftPlan1d could be used as well (since it also receives a parameter for the number of batches). However, maybe I misinterpreted something. Did you notice a different behavior with cufftPlanMany? (I’d have to review the specification in order to figure out the appropriate parameters for the data layout - I think they are used in the latest version, although the JCufft documentation still says they are unused; this also has to be updated…)
Well, after some more digging, it appears you are correct, in theory; cufftPlanMany() is a faster batched FFT, where the batches can execute in parallel, while cufftPlan1d() simply loops through them. So technically both should work. Still not at a computer with a CUDA card, so I’ll try this out later.
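If I’m reading the layout parameters right, the call I plan to try would look something like this - not tested yet, and the null embeds and the n/2+1 output distance are my assumptions from the CUFFT docs:

import jcuda.jcufft.JCufft;
import jcuda.jcufft.cufftHandle;
import jcuda.jcufft.cufftType;

public class PlanManySketch {
    public static void main(String[] args) {
        int fftSize = 131072; // placeholder values
        int batches = 512;

        cufftHandle plan = new cufftHandle();
        // rank 1, one transform size; null for inembed/onembed should select
        // the default packed layout; idist/odist are the distances between
        // consecutive batches (n real inputs, n/2+1 complex outputs)
        JCufft.cufftPlanMany(plan, 1, new int[]{ fftSize },
            null, 1, fftSize,
            null, 1, fftSize / 2 + 1,
            cufftType.CUFFT_R2C, batches);

        // ...execute with cufftExecR2C as usual, then destroy
        JCufft.cufftDestroy(plan);
    }
}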
By the way, I forgot to tell you: for 512MB buffer sizes (my current maximum), if I use the convenience ExecR2C method that takes float arrays, I get a cudaErrorMemoryAllocation. However, if I do all the mallocs and copies manually with Pointers, everything works OK (with the exact same array sizes - I checked). Any idea why that might be?
Indeed, there is something wrong in the convenience methods: in the R2C and C2R methods, it is NOT checked whether the input and output arrays are the same array. So it always assumes that the transform is NOT “in-place” and tries to allocate the memory twice. (In the R2R and C2C methods, it checks for an in-place transform explicitly, and allocates the memory only once if it is in-place.)
This might be because in the first versions of the CUFFT documentation, the sections about the memory layouts were not as detailed as they are now. But maybe it’s also because I’m not so familiar with the actual application of FFTs, and was not aware that C2R and R2C could be in-place as well :o (However, I’ve recently been trying to get a little more involved by using JCufft for frequency analysis of WAV sound data - sure, it may be a “simple” use case, but it may be a way to gain some insights.)
In any case, thanks for pointing this out. I’ll update the C2R/R2C convenience methods accordingly, so that they also detect an in-place transform.
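Conceptually, the missing check boils down to nothing more than a reference comparison. A simplified sketch (not the actual JCufft source):

public class InPlaceCheckSketch {
    // If the caller passes the same float array as input and output, the
    // transform is in-place and only ONE device buffer has to be allocated;
    // otherwise, two are needed. Note that this is reference equality,
    // not a comparison of the array contents.
    static int buffersNeeded(float[] input, float[] output) {
        return (input == output) ? 1 : 2;
    }

    public static void main(String[] args) {
        float[] data = new float[8];
        System.out.println(buffersNeeded(data, data));          // 1 (in-place)
        System.out.println(buffersNeeded(data, new float[8]));  // 2 (out-of-place)
    }
}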
bye
Marco
Great, thanks. I had a chance to fire up an Amazon GPU instance today with 3GB of VRAM, and buffer sizes bigger than 512MB work with the exact same code. So that specific exception was definitely down to the hardware (not the plan allocation thing - that still fails).