Pointer arithmetic & Java arrays

Thank you for taking the time to develop JCuda, it is very helpful. I have a question about byte & float arrays.
My application requires taking FFTs of a large byte array (i.e. > 1GB), however the sample points in my “signal” file are actually floats, so every 4 bytes is a single point. I tried creating a Pointer to the array, copying it to CUDA memory, then running CUFFT on the entire array inside memory (using something like FFT size = 1024, and the appropriate number of batches).

I assumed JCuda/CUDA would “see” every 4 bytes of the array as a float, but instead the output I got was 1GB of max value bytes (i.e. 7fff). If I convert the byte array to a float array, copy it and FFT it that way, then copy back the float array and convert it to a byte array, the output is correct. Is there any way to force JCuda to look at the byte array as a float array without doing all the shuffling between byte and float? With 1GB-2GB buffers/arrays this slows down my application to the point where CPU FFT is faster. I can provide code if necessary.

Hello

Some points are not clear. Are you using JCufft, or your own FFT implementation?

Regardless of that, it should be no problem to “interpret” a Java byte[] array as a float array on the CUDA side: the data is copied as it is, and once it is on the GPU, it is treated as a void* pointer anyhow.
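
For illustration, a minimal sketch of what this means in practice (a hypothetical helper method; it uses only the jcuda.Pointer / JCuda runtime calls that appear later in this thread, and assumes the byte[] already contains IEEE-754 float data):

// Sketch: the GPU side never sees a Java element type. Raw bytes are
// copied up, and the same bytes are copied back into a float[] - no
// conversion happens anywhere. Whether the resulting floats are
// meaningful depends only on the byte order of the raw data.
public static float[] runRawBytesAsFloats(byte rawBytes[])
{
    Pointer device = new Pointer();
    JCuda.cudaMalloc(device, rawBytes.length);
    JCuda.cudaMemcpy(device, Pointer.to(rawBytes), rawBytes.length,
        cudaMemcpyKind.cudaMemcpyHostToDevice);

    // ... run kernels or CUFFT here, treating 'device' as a float* ...

    float result[] = new float[rawBytes.length / 4];
    JCuda.cudaMemcpy(Pointer.to(result), device, rawBytes.length,
        cudaMemcpyKind.cudaMemcpyDeviceToHost);
    JCuda.cudaFree(device);
    return result;
}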

For one moment, I thought that this might be some little endian/big endian issue.

A code snippet might be helpful here, just to avoid any misinterpretation.

bye
Marco

It may be an endianness problem… Good to know it’s considered a void pointer, I guess I’m doing something wrong then. I am using JCufft.

Here is the byte array code that outputs an array of 7fff bytes for any FFT size above 1024:

// bufferSize here is a gigabyte, since I'm processing 1024^3 bytes at once
ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize);
byte jcufft[] = new byte[bufferSize];

        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new FileOutputStream(output).getChannel();
        long size = inchannel.size();

        do {

            // Read and transfer data to byte array
            inputBuff.clear();
            inchannel.read(inputBuff);
            inputBuff.rewind();
            inputBuff.get(jcufft);

            // Create pointers to host and device memory, allocate memory
            Pointer byte_host_input = Pointer.to(jcufft);
            Pointer byte_device_input = new Pointer();
            JCuda.cudaMalloc(byte_device_input, bufferSize);

            // Copy data to device, perform FFT, copy back to host
            JCuda.cudaMemcpy(byte_device_input, byte_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            cufftHandle plan = new cufftHandle();
            JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
            JCufft.cufftExecR2C(plan, byte_device_input, byte_device_input);
            JCuda.cudaMemcpy(byte_host_input, byte_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);

            JCufft.cufftDestroy(plan);
            JCuda.cudaFree(byte_device_input);

            // Write data to file
            inputBuff.clear();
            inputBuff.put(jcufft);
            inputBuff.rewind();
            outchannel.write(inputBuff);

        } while (inchannel.position() != size);

Here is the inefficient float array code that gives back correct results for all FFT sizes:


// bufferSize here should be 1GB, each float is 4 bytes
ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize);
FloatBuffer floatBuff = inputBuff.asFloatBuffer();
float jcufft[] = new float[bufferSize / 4];

        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new FileOutputStream(output).getChannel();
        long size = inchannel.size();

        do {

            // Read and transfer data to byte array
            inputBuff.clear();
            inchannel.read(inputBuff);
            floatBuff.rewind();
            floatBuff.get(jcufft);

            // Create pointers to host and device memory, allocate memory
            Pointer float_host_input = Pointer.to(jcufft);
            Pointer float_device_input = new Pointer();
            JCuda.cudaMalloc(float_device_input, bufferSize);

            // Copy data to device, perform FFT, copy back to host
            JCuda.cudaMemcpy(float_device_input, float_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            cufftHandle plan = new cufftHandle();
            JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
            JCufft.cufftExecR2C(plan, float_device_input, float_device_input);
            JCuda.cudaMemcpy(float_host_input, float_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);

            JCufft.cufftDestroy(plan);
            JCuda.cudaFree(float_device_input);

            // Write data to file
            floatBuff.clear();
            floatBuff.put(jcufft);
            inputBuff.rewind();
            outchannel.write(inputBuff);

        } while (inchannel.position() != size);

I may have to do more tests with endianness when I get back home from vacation; all my input/output is written using ByteBuffers on Linux.

Hello

The endianness was just a first guess, I have not thought about it in detail.

Also, you mentioned…

Here is the byte array code that outputs an array of 7fff bytes for any FFT size above 1024:

Does this mean that for smaller input sizes, the code works as expected? (This would obviously mean that it can’t be an endianness issue!).

Apart from that, just by looking at the code, I’m not sure what might be the reason for the error. Have you enabled
JCufft.setExceptionsEnabled(true);
JCuda.setExceptionsEnabled(true);
just to make sure that there’s nothing “obviously” going wrong?

I was surprised about the ‘bufferSize’ of 1GB: That’s a LOT. I thought that there was a limit on the size of memory allocations (at least in OpenCL on my GeForce 8800 (1GB), a device query program reports 256MB as the largest possible memory allocation - I did not find any specific information for CUDA with a quick websearch, but could imagine that there is a similar limit). But this can hardly be the problem here, because it seems to work for the ‘float’ case, with the same buffer size.
In any case, if you enable exceptions as described above, cudaMalloc should cause an exception to be thrown if the memory cannot be allocated for whatever reason.

Unfortunately, you cannot send me the input data via mail :wink: so debugging is a bit difficult at the moment.

BTW: You might consider restructuring the code a little: It’s probably not very efficient to create and tear down the cufftPlan for each block. Creating a plan may allocate and prepare some internal data structures - this should not be done when it is not necessary. But at the moment, this could be seen as another “optimization”; the primary goal is to get it working.

bye

Thanks for your suggestions, Marco. Unfortunately I am not at home, so I cannot test them right now, but I will do so as soon as I get home in a few days.

Yes, for FFT sizes < 1024, the output seems to be correct (note this is the FFT size, not the input size, which is 1GB for both the float and byte tests). Are you certain CUDA cannot take FFTs of signed byte arrays? Can input data only be float or double?

I set the buffer size to 1GB in order to minimize copying from host to device; the idea is to break up my file (which is around 1TB) into 1GB chunks (my VRAM is 2GB), copy each chunk into CUDA memory, run FFTs on that chunk, then write it out to file. This should be a lot faster than copying in the bytes needed for a single FFT every time.

I will try enabling the exceptions - I did not know there was an option like that (will check device query for any memory allocation limits as well). Thanks for the tip about the plan allocation; I forgot that it can be put outside the loop - might definitely shave some time off. I will update this thread in a few days when I’ve had a chance to run my code with exceptions enabled. Thanks again.

Hello

That’s strange: I just did a quick test, and until now, it seems as if it really is related to the byte order. (Did you verify the results, or do you assume that they are correct because they are not ‘NaN’ or ‘7FFF’?). In this test (see the code below) I introduced a flag ‘reverseOrderTest’, which causes the byte order of the ‘jcufft’ byte array to be reversed (programmatically). The results are compared to the float version that you posted, and until now, it seems that the byte order is the reason. (At least here, on my Win32 machine…)

Yes, definitely.

I will try enabling the exceptions - I did not know there was an option like that (will check device query for any memory allocation limits as well).

The CUDA device query does not say anything about a limit. I found some forum threads indicating that there might be a limit, but no clear official statements so far…

I’d recommend setExceptionsEnabled(true) during development in general. Otherwise, checking for errors is tedious. Even the NVIDIA CUDA SDK contains utility macros (CUDA_SAFE_CALL etc.) to make this easier.
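
For comparison, this is roughly the manual checking that it spares you (a sketch; cudaMalloc returns an int error code, and cudaError.stringFor turns it into a readable message):

// Without exceptions enabled, every single call has to be checked by hand
Pointer devicePointer = new Pointer();
int status = JCuda.cudaMalloc(devicePointer, bufferSize);
if (status != cudaError.cudaSuccess)
{
    throw new RuntimeException(
        "cudaMalloc failed: " + cudaError.stringFor(status));
}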

I set the buffer size to 1GB in order to minimize copying from host to device; the idea is to break up my file (which is around 1TB) into 1GB chunks (my VRAM is 2GB), copy each chunk into CUDA memory, run FFTs on that chunk, then write it out to file. This should be a lot faster than copying in the bytes needed for a single FFT every time.

Thanks for the tip about the plan allocation; I forgot that it can be put outside the loop - might definitely shave some time off.

Certainly, that sounds feasible. I added another test in this code: In the “runTestMapped” method, the plan creation and memory allocation are pulled out of the loop. Additionally, I’m using memory-mapped files there. As far as I know, these are intended for a use case like this: mapping a region of an (unmanageably large) file into memory, manipulating it, and committing the data back - and they are said to be fast. The byte order reversal is still necessary, but maybe some time can be saved there anyhow. Note that I have not really used the mechanism of memory-mapped files before, and I have not tested it extensively (especially not with really “large” files), but it may be worth a try…

package tests.jcufft;

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.channels.FileChannel.MapMode;

import jcuda.Pointer;
import jcuda.jcufft.*;
import jcuda.runtime.*;

public class JCufftByteOrderTest
{
    private static boolean reverseOrderTest = false;
    
    public static void main(String[] args) throws IOException
    {
        JCuda.setExceptionsEnabled(true);
        JCufft.setExceptionsEnabled(true);
        
        int fftSize = 128;
        int bufferSize = fftSize * 4;
        int totalSize = bufferSize * 4;
        int batches = bufferSize / fftSize;
        
        File input = new File("JCufftBatchedTest_input.dat");
        if (!input.exists())
        {
            createDummyData(input, totalSize);
        }
        File outputFloat = new File("JCufftBatchedTest_output_float.dat");
        File outputByte = new File("JCufftBatchedTest_output_byte.dat");
        File outputByteRev = new File("JCufftBatchedTest_output_byte_rev.dat");
        File outputMapped = new File("JCufftBatchedTest_output_mapped.dat");
        
        System.out.println("Float:");
        runTestFloat(fftSize, bufferSize, batches, input, outputFloat);
        printOutputData(outputFloat, totalSize);

        System.out.println("Byte (not reversed)");
        reverseOrderTest = false;
        runTestByte(fftSize, bufferSize, batches, input, outputByte);
        printOutputData(outputByte, totalSize);
        
        System.out.println("Byte (reversed)");
        reverseOrderTest = true;
        runTestByte(fftSize, bufferSize, batches, input, outputByteRev);
        printOutputData(outputByteRev, totalSize);
        
        System.out.println("Mapped");
        runTestMapped(fftSize, bufferSize, batches, input, outputMapped);
        printOutputData(outputMapped, totalSize);
        
    }


    private static void runTestByte(int fftSize, int bufferSize, int batches, File input, File output) throws FileNotFoundException, IOException
    {
        ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize);        
        
        // bufferSize here is a gigabyte, since I'm processing 1024^3 bytes at once
        byte jcufft[] = new byte[bufferSize];

        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new FileOutputStream(output).getChannel();
        long size = inchannel.size();

        do {

            // Read and transfer data to byte array
            inputBuff.clear();
            inchannel.read(inputBuff);
            inputBuff.rewind();
            inputBuff.get(jcufft);
            
            if (reverseOrderTest)
            {
                jcufft = reverseByteOrder(jcufft);
            }

            // Create pointers to host and device memory, allocate memory
            Pointer byte_host_input = Pointer.to(jcufft);
            Pointer byte_device_input = new Pointer();
            JCuda.cudaMalloc(byte_device_input, bufferSize);

            // Copy data to device, perform FFT, copy back to host
            JCuda.cudaMemcpy(byte_device_input, byte_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            cufftHandle plan = new cufftHandle();
            JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
            JCufft.cufftExecR2C(plan, byte_device_input, byte_device_input);
            JCuda.cudaMemcpy(byte_host_input, byte_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);

            JCufft.cufftDestroy(plan);
            JCuda.cudaFree(byte_device_input);

            if (reverseOrderTest)
            {
                jcufft = reverseByteOrder(jcufft);
            }
            
            // Write data to file
            inputBuff.clear();
            inputBuff.put(jcufft);
            inputBuff.rewind();
            outchannel.write(inputBuff);

        } while (inchannel.position() != size);   

        outchannel.close();
        inchannel.close();
    }
        
    

    private static void runTestFloat(int fftSize, int bufferSize, int batches, File input, File output) throws FileNotFoundException, IOException
    {
        ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize);
        FloatBuffer floatBuff = inputBuff.asFloatBuffer();
        
        // bufferSize here should be 1GB, each float is 4 bytes
        float jcufft[] = new float[bufferSize / 4];

        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new FileOutputStream(output).getChannel();
        long size = inchannel.size();

        do 
        {
            // Read and transfer data to byte array
            inputBuff.clear();
            inchannel.read(inputBuff);
            floatBuff.rewind();
            floatBuff.get(jcufft);

            // Create pointers to host and device memory, allocate memory
            Pointer float_host_input = Pointer.to(jcufft);
            Pointer float_device_input = new Pointer();
            JCuda.cudaMalloc(float_device_input, bufferSize);

            // Copy data to device, perform FFT, copy back to host
            JCuda.cudaMemcpy(float_device_input, float_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            cufftHandle plan = new cufftHandle();
            JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
            JCufft.cufftExecR2C(plan, float_device_input, float_device_input);
            JCuda.cudaMemcpy(float_host_input, float_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);

            JCufft.cufftDestroy(plan);
            JCuda.cudaFree(float_device_input);
           
            // Write data to file
            floatBuff.clear();
            floatBuff.put(jcufft);
            inputBuff.rewind();
            outchannel.write(inputBuff);

        } while (inchannel.position() != size);
        
        outchannel.close();
        inchannel.close();
    }
        
    

    
    private static void runTestMapped(int fftSize, int bufferSize, int batches, File input, File output) throws FileNotFoundException, IOException
    {
        // bufferSize here is a gigabyte, since I'm processing 1024^3 bytes at once
        float jcufft[] = new float[bufferSize/4];

        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new RandomAccessFile(output, "rw").getChannel();
        long size = inchannel.size();
        long position = 0;
        
        // Create pointers to host and device memory, allocate memory
        Pointer byte_host_input = Pointer.to(jcufft);
        Pointer byte_device_input = new Pointer();
        JCuda.cudaMalloc(byte_device_input, bufferSize);

        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
        
        do 
        {
            ByteBuffer mappedInput = inchannel.map(MapMode.READ_ONLY, position, bufferSize);
            FloatBuffer inputBuffer = mappedInput.asFloatBuffer();
            inputBuffer.get(jcufft);
            
            // Copy data to device, perform FFT, copy back to host
            JCuda.cudaMemcpy(byte_device_input, byte_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            JCufft.cufftExecR2C(plan, byte_device_input, byte_device_input);
            JCuda.cudaMemcpy(byte_host_input, byte_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);

            // Write data to file
            MappedByteBuffer mappedOutput = outchannel.map(MapMode.READ_WRITE, position, bufferSize);
            FloatBuffer outputBuffer = mappedOutput.asFloatBuffer();
            outputBuffer.put(jcufft);
            mappedOutput.force();
            
            position += bufferSize;

        } while (position < size);   

        JCufft.cufftDestroy(plan);
        JCuda.cudaFree(byte_device_input);
        
        outchannel.close();
        inchannel.close();
    }
    
    
    
    
    
    private static byte[] reverseByteOrder(byte input[])
    {
        byte[] output = new byte[input.length];
        for (int i=0; i<input.length; i+=4)
        {
            output[i+0] = input[i+3];
            output[i+1] = input[i+2];
            output[i+2] = input[i+1];
            output[i+3] = input[i+0];
        }
        return output;
    }
    
    
    private static void createDummyData(File file, int size) throws IOException
    {
        DataOutputStream dos = new DataOutputStream(
            new FileOutputStream(file));
        for (int i=0; i<size; i++)
        {
            dos.writeFloat((float)Math.sin(i*0.1f));
        }
        dos.close();
    }
    
    private static void printOutputData(File file, int size) throws IOException
    {
        DataInputStream dis = new DataInputStream(
            new FileInputStream(file));
        for (int i=0; i<size; i++)
        {
            float f = dis.readFloat();
            System.out.printf("%7s", String.format("%.3f", f));
            if ((i+1)%20 == 0)
            {
                System.out.println(", ");
            }
            else
            {
                System.out.print(", ");
            }
        }
        System.out.println("
");
        dis.close();
    }
}

Well, I enabled exceptions, and it seems it should have been throwing cudaErrorLaunchFailure for the Memcpy call this entire time… I had only checked the first 1024 bytes of my output for correctness. However, I’m not really sure how it was “pretending” to do FFTs on 1GB allocations without throwing any errors or seg faults… really confused.

Through careful experimentation I’ve figured out that it will throw a cudaErrorLaunchFailure for malloc’ing anything 1024KB or greater… I’m testing this on a GT620 with compute 2.1 and 2GB VRAM. This is strange; what’s the point of having 2GB memory if you can’t transfer more than 1023KB at a time? This number doesn’t seem to correspond to the “Total constant memory on the device” or “Size of L2 cache” variables I get from deviceQuery either… how do game designers transfer over huge texture files then? I have the feeling there’s something very wrong here. Any ideas Marco?

In the meantime, I’m copying some of your code for properly checking the validity of the output against JTransforms…
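
Such a threshold comparison might look like this (a hedged sketch - the helper name and tolerance handling are made up for illustration, not the actual code referred to above):

// Compare two result arrays element-wise against a small tolerance,
// since GPU and CPU FFTs will not match bit-for-bit
private static boolean equalWithThreshold(float a[], float b[], float epsilon)
{
    if (a.length != b.length)
    {
        return false;
    }
    for (int i = 0; i < a.length; i++)
    {
        if (Math.abs(a[i] - b[i]) > epsilon)
        {
            return false;
        }
    }
    return true;
}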

Also, I experimented some more with passing the byte array, and I still haven’t found a solution. To demonstrate what I mean, I performed an 8-point real-to-complex FFT on the values

[1, 2, 3, 4, 5, 6, 7, 8]

If you use any online FFT calculator (such as http://www.random-science-tools.com/maths/FFT.htm) or JTransforms, the results should be a 16-element array (with the second half being the redundant mirror data), approximately

[36, 0, -4, 9.65, -4, 4, -4, 1.65, -4, 0, -3.99, -1.65, -3.99, -4.0, -3.99, -9.65].

Here is some code to demonstrate my point:

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import jcuda.Pointer;
import jcuda.jcufft.*;
import jcuda.runtime.*;

class JCufftBenchmark
{
    public static void main(String[] args) throws IOException
    {
        JCuda.setExceptionsEnabled(true);
        JCufft.setExceptionsEnabled(true);
        runFloat();
        runByte();
        runByteReversed();

    }

    public static void runFloat()
    {
        float input[] =
        { 1, 2, 3, 4, 5, 6, 7, 8 };
        // Memory
        Pointer host = Pointer.to(input);
        Pointer device = new Pointer();
        JCuda.cudaMalloc(device, 32);
        JCuda.cudaMemcpy(device, host, 32, cudaMemcpyKind.cudaMemcpyHostToDevice);
        // FFT and copy back
        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, 8, cufftType.CUFFT_R2C, 1);
        JCufft.cufftExecR2C(plan, device, device);
        JCuda.cudaMemcpy(host, device, 32, cudaMemcpyKind.cudaMemcpyDeviceToHost);
        // Free
        JCuda.cudaFree(device);
        JCufft.cufftDestroy(plan);
        // Print
        System.out.println("
Regular floats: ");
        for (int i = 0; i < 8; i++)
        {
            System.out.println(input[i]);
        }
    }

    public static void runByte()
    {
        float floats[] =
        { 1, 2, 3, 4, 5, 6, 7, 8 };
        byte input[] = float2Byte(floats);
        // Memory
        Pointer host = Pointer.to(input);
        Pointer device = new Pointer();
        JCuda.cudaMalloc(device, 32);
        JCuda.cudaMemcpy(device, host, 32, cudaMemcpyKind.cudaMemcpyHostToDevice);
        // FFT and copy back
        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, 8, cufftType.CUFFT_R2C, 1);
        JCufft.cufftExecR2C(plan, device, device);
        JCuda.cudaMemcpy(host, device, 32, cudaMemcpyKind.cudaMemcpyDeviceToHost);
        // Free
        JCuda.cudaFree(device);
        JCufft.cufftDestroy(plan);
        // Print
        float output[] = byte2Float(input);
        System.out.println("
Bytes (converted to floats for printing): ");
        for (int i = 0; i < 8; i++)
        {
            System.out.println(output[i]);
        }
    }

    public static void runByteReversed()
    {
        float floats[] =
        { 1, 2, 3, 4, 5, 6, 7, 8 };
        byte input[] = float2Byte(floats);
        // Memory
        Pointer host = Pointer.to(input);
        Pointer device = new Pointer();
        JCuda.cudaMalloc(device, 32);
        JCuda.cudaMemcpy(device, host, 32, cudaMemcpyKind.cudaMemcpyHostToDevice);
        // FFT and copy back
        cufftHandle plan = new cufftHandle();
        JCufft.cufftPlan1d(plan, 8, cufftType.CUFFT_R2C, 1);
        JCufft.cufftExecR2C(plan, device, device);
        JCuda.cudaMemcpy(host, device, 32, cudaMemcpyKind.cudaMemcpyDeviceToHost);
        // Free
        JCuda.cudaFree(device);
        JCufft.cufftDestroy(plan);
        // Print
        byte outputReversed[] = reverseBytes(input);
        float output[] = byte2Float(outputReversed);
        System.out.println("
Bytes in reverse order (converted to floats for printing): ");
        for (int i = 0; i < 8; i++)
        {
            System.out.println(output[i]);
        }
    }

    public static float[] byte2Float(byte[] input)
    {
        float output[] = new float[input.length / 4];
        ByteBuffer bytes = ByteBuffer.wrap(input);
        FloatBuffer floats = bytes.asFloatBuffer();
        floats.get(output);
        return output;
    }

    public static byte[] float2Byte(float[] input)
    {
        byte output[] = new byte[input.length * 4];
        ByteBuffer bytes = ByteBuffer.wrap(output);
        FloatBuffer floats = bytes.asFloatBuffer();
        floats.put(input);
        return output;
    }

    private static byte[] reverseBytes(byte input[])
    {
        // Body omitted in the original post ("message limit");
        // it is identical to reverseByteOrder(...) shown earlier
        byte[] output = new byte[input.length];
        for (int i = 0; i < input.length; i += 4)
        {
            output[i + 0] = input[i + 3];
            output[i + 1] = input[i + 2];
            output[i + 2] = input[i + 1];
            output[i + 3] = input[i + 0];
        }
        return output;
    }
}

And the output is as follows:

Regular floats:
36.0
0.0
-4.0
9.656855
-4.0
4.0
-4.0
1.6568542

Bytes (converted to floats for printing):
1.193969E-38
0.0
33281.5
102.5
2.3510246E-38
-2.5388514E38
6208.0
-0.23633003

Bytes in reverse order (converted to floats for printing):
3.22142E-40
0.0
-9.2652E-41
7.3633E-41
-1.4E-45
-2.2957E-41
6.9691E-41
-4.1162E-41

You see that the float example gives the expected non-redundant Fourier coefficients (the first half of the full output array), but the byte data is nonsensical no matter how you arrange it (unless I am really bad at reading scientific notation). I’m also going to try to get a workstation with a GTX670 up and running to see if the malloc() limit issue isn’t because of Compute 2.1. I talked to our in-house CUDA expert, who said I should be able to malloc all the free memory available, which is surely not below 1024KB on a 2GB card. Thanks in advance for any advice you can offer.

Woops, didn’t notice that you reversed the bytes BEFORE the FFT as well… that does indeed fix it, thanks!

Hello

First, concerning the code and the actual problem: According to your last post, I assume that this is fixed now? (I also compared the results of the code snippet that I posted to the results computed with JTransforms, but omitted this test in the code that I posted - now that I know that you’re also using it, I can include it when necessary.)

BTW: The code that you posted seemed to be somehow pre-formatted using BBCode tags - that’s probably why it was too large. I tried to repair it, hope that’s OK. You can just post the Java code by copying the plain code from the IDE into the ‘Java’ tags that appear when selecting ‘Java’ from the dropdown list.

Concerning the memory allocation:

Through careful experimentation I’ve figured out that it will throw a cudaErrorLaunchFailure for malloc’ing anything 1024KB or greater… I’m testing this on a GT620 with compute 2.1 and 2GB VRAM. This is strange; what’s the point of having 2GB memory if you can’t transfer more than 1023KB at a time? This number doesn’t seem to correspond to the “Total constant memory on the device” or “Size of L2 cache” variables I get from deviceQuery either… how do game designers transfer over huge texture files then? I have the feeling there’s something very wrong here. Any ideas Marco?

As I mentioned, there are (or may be) some constraints. I did a websearch for more info, but did not find any official statement from NVIDIA. Unfortunately, the NVIDIA forum has been down for quite a while now, but the Google cache contains some threads about this topic, and there are related questions at Stack Overflow. According to some statements there, many factors may influence the maximum possible memory allocation size - it may also be platform dependent, and so on.

One of the reasons for the failure to allocate a large memory block may be memory fragmentation - VERY simplified:


Free Memory initially: 2GB     [                ]
Allocate 'A' block of 500MB    [AAAA            ]
Allocate 'b' block of 500MB    [AAAAbbbb        ]
Allocate 'A' block of 500MB    [AAAAbbbbAAAA    ]
Allocate 'b' block of 500MB    [AAAAbbbbAAAAbbbb]
Delete all 'b' blocks          [AAAA    AAAA    ]
Free Memory is now 1GB - but no single block 
of this size can be allocated!

But I’m not sure whether this could be the reason here.

Additionally, as I mentioned, creating an FFT plan may allocate some memory internally. In the worst case, it has to allocate a large memory block, for example, when an ‘in-place FFT’ should be performed and therefore some auxiliary memory is required. But I do NOT know anything about the internals of CUFFT, so again, I don’t know to what extent this may be one part of the problem.

I just ran this test on a 1GB card: It simply allocates and frees memory, in blocks ranging from 10MB to 2000MB (in 10 MB steps). It bails out at about 980MB, which could be expected. I wonder whether it’s possible to allocate a block of >1GB on a 2GB card at all (although the conditions in this test are as artificial as they can be, of course).

package tests;

import jcuda.Pointer;
import jcuda.runtime.JCuda;

public class MaxMallocTest
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);
        long free[] = {0};
        long total[] = {0};
        for (int i=1000000*10; i<1000000*2000; i+=1000000*10)
        {
            Pointer p = new Pointer();
            JCuda.cudaMemGetInfo(free, total);
            System.out.println("Before allocating "+i+" bytes, free: "+free[0]+" total: "+total[0]);
            JCuda.cudaMalloc(p, i);
            JCuda.cudaMemGetInfo(free, total);
            System.out.println("After  allocating "+i+" bytes, free: "+free[0]+" total: "+total[0]);
            JCuda.cudaFree(p);
            JCuda.cudaMemGetInfo(free, total);
            System.out.println("After  freeing    "+i+" bytes, free: "+free[0]+" total: "+total[0]);
        }
    }
}

OK, thanks very much Marco, all good points that I have to research. I actually also coded a separate program to test just memcopies with large amounts, and ran your JCufft sample (which I think tests 8MB); both worked fine on the same card. According to my friend, CUFFT can’t allocate more than 300-400MB at most.

A moment ago I got a CUFFT_SETUP_FAILED exception when using the same code as I originally posted (with exceptions enabled this time), running on a Tesla card instead of a GTX670. I even rebooted the machine to make sure all the CUDA memory was freed. I keep getting these exceptions, yet the output files are not blank, nor are they a copy of the original file, so calculation still somehow goes on. I will compare the output files from JTransforms and JCufft using your threshold method to see if the output is actually correct, but a thought popped into my head - is it possible I am getting these seemingly random exceptions because I’m using CUDA 4.2 with JCuda 4.1?

Also, yes, the endianness issue is fixed, thanks :slight_smile: I benchmarked multiple runs, however, and reversing the bytes manually using a for-loop turns out to be much, much slower than playing around with Byte and Float buffers, so I decided to keep my previous approach… but at least I know now why it works. Taking the cufftPlan allocation out of the main loop gave a 6.3% improvement in computation time for a 64GB file (using 1GB blocks), so that was also a great tip.

[QUOTE=Ross]A moment ago I got a CUFFT_SETUP_FAILED exception when using the same code as I originally posted (with exceptions enabled this time), running on a Tesla card instead of a GTX670.

a thought popped into my head - is it possible I am getting these seemingly random exceptions because I’m using CUDA 4.2 with JCuda 4.1?[/quote]

Are they really purely random, or happening reproducibly (at least, to some extent)? If they are, for example, always happening after a certain running time, it could again be a matter of resource management. If they randomly happen even during the first call, I may have a problem… -_-
I’m not so sure about the difference in the version number: They actually have to match, but I think between 4.1 and 4.2 the CUFFT library did not change at all. Or to put it differently: If this was a problem in this case, it would probably not work at all. (In fact, all the JCu* libraries usually do nothing more than forward the method calls directly to the CUDA libraries via JNI, so when the methods are found, the behavior should not differ much from that in plain CUDA.)

I benchmarked multiple runs, however, and reversing the bytes manually using a for-loop turns out to be much, much slower than playing around with Byte and Float buffers

Wo-ho :eek: this was NOT meant as a solution for this problem! I just introduced this ‘reversal method’ in order to pin down the problem and to demonstrate that it is solely caused by the byte order. Of course, in the real case, you will not create a new array and fill it byte by byte, but instead use a FloatBuffer that was created from a ByteBuffer with the proper ByteOrder. The FloatBuffer also has to reverse the bytes, but does this internally, on the fly, and in the case of direct buffers with some help from magic Sun-internal classes.
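
In code, that looks roughly like this (a sketch; LITTLE_ENDIAN is an assumption about the signal file format here, since Java buffers default to BIG_ENDIAN):

// A FloatBuffer view with an explicit byte order swaps the bytes on
// the fly - no manual reversal loop is needed
ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize)
    .order(ByteOrder.LITTLE_ENDIAN);
FloatBuffer floatBuff = inputBuff.asFloatBuffer();

// fill 'inputBuff' from the FileChannel as before, then:
floatBuff.rewind();
floatBuff.get(jcufft); // the bytes are reordered during this get()
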
BTW, I’m curious: Did you also do benchmarks with Memory Mapped Files? If not, maybe I’ll run a short test with that, I’m wondering how fast this really is…

bye
Marco

I’m not sure if they are random; so far it does not seem like it, and I don’t think the problem is your code. I will do more testing and verify the output when exceptions are enabled before saying anything for certain. I will definitely get to the bottom of this.

I have not worked with buffers enough to know that they are fast :smiley: In fact, I thought allocating such big buffers would slow things down, but you are right: because they do these things internally, it is a very fast solution to the problem. I have not benchmarked the memory-mapped files - I want to figure out the exceptions I am getting before moving on to optimizations. Do let me know if you decide to time it; I am also very curious to see if it is significantly faster…

OK, I’ll be happy to hear whether you can find the reason for the failed initialization. (Unless it’s a hardly reproducible, randomly occurring bug in JCufft :o :wink: )

The memory handling in general is certainly something where you can tweak the code in many places. It’s hard to tell beforehand what the “best” solution will look like, because the speed at the end depends on many factors. It’s usually faster to have a few large memory copies than many small ones (with the same total size). But once the data is read from or written to a file, this will certainly influence the performance as well. You also have the choice between direct buffers and buffers that are created by FloatBuffer.wrap(floatArray), but especially for large buffers, direct buffers might be beneficial, because in this case, it is guaranteed that the data can be accessed directly on the native side.
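
To make the two options concrete (a sketch; the sizes are arbitrary):

float data[] = new float[1024];

// Heap-backed: a view of an existing Java array; the data may have
// to be copied for native access
FloatBuffer wrapped = FloatBuffer.wrap(data);

// Direct: memory outside the garbage-collected heap, guaranteed to
// be directly accessible on the native side
FloatBuffer direct = ByteBuffer.allocateDirect(1024 * 4)
    .order(ByteOrder.nativeOrder())
    .asFloatBuffer();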

In any case, I’m curious about any insights concerning the “best practices” that you can derive from a real application case. (I’m lacking real experience here, admittedly: I hardly created more than the libraries and the small samples on the website…)

[QUOTE=Marco13]That’s strange: I just did a quick test, and until now, it seems as if it really is related to the byte order. […] (full post, including the JCufftByteOrderTest code with runTestFloat, quoted in its entirety above)[/QUOTE]



Well, I’ve sort of pinpointed the first exception I was getting - the cudaErrorLaunchFailure (JCuda.checkResult(357), JCuda.cudaMemcpy(2964)). If you look back at the code you wrote in that post, runTestFloat is identical to what I am doing right now, and the Memcpy call is what is throwing the exception. I tried creating a separate float array, even a separate byte array, but the copy back to host fails, and I can’t figure out why. I’m running with 6GB of heap space and plenty of RAM/disk space, so I don’t think it’s a space issue.

And actually, if I comment out copying back to host, the cudaFree call after that will throw the same exception.

Hello

It’s difficult: Note that basically ALL CUDA methods carry a small note in their documentation: “Note that this function may also return error codes from previous, asynchronous launches.”

The “cudaErrorLaunchFailure” can obviously not be caused by a cudaMemcpy or cudaFree. It is caused by a previous launch, and this can basically only be one that happens internally, maybe inside (!) the CUFFT functions. Can you create a minimal example where the error occurs, maybe by just keeping the parts of the “JCufftByteOrderTest” that are closest to the structure that you are already using?
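
One way to pin down where such a deferred error really comes from (a sketch, assuming exceptions are enabled as above):

// Synchronizing right after the suspect call forces any pending
// asynchronous launch failure to surface here, instead of in a
// later, innocent cudaMemcpy or cudaFree
cufftExecR2C(plan, float_device_input, float_device_input);
JCuda.cudaDeviceSynchronize();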

I tried to create such a test, but am not sure how closely it resembles your application case, or whether the error occurs there as well (it’s only using an artificial file of 16MB size that is created at the first start… if necessary, I can try it with a 1GB file, but am not sure how long it will take to write such a file :wink: )

package tests.jcufft;

import static jcuda.jcufft.JCufft.*;
import static jcuda.runtime.JCuda.*;

import java.io.*;
import java.nio.*;
import java.nio.channels.FileChannel;

import jcuda.Pointer;
import jcuda.jcufft.*;
import jcuda.runtime.*;

public class JCufftByteOrderTestMin
{
    public static void main(String[] args) throws IOException
    {
        JCuda.setExceptionsEnabled(true);
        JCufft.setExceptionsEnabled(true);
        
        int fftSize = 1024;
        int bufferSize = fftSize * 1024;
        int totalSize = bufferSize * 4;
        int batches = bufferSize / fftSize;
        
        File input = new File("JCufftByteOrderTestMin_input.dat");
        if (!input.exists())
        {
            createDummyData(input, totalSize);
        }
        File outputFloat = new File("JCufftByteOrderTestMin_output_float.dat");

        System.out.println("Float:");
        runTestFloat(fftSize, bufferSize, batches, input, outputFloat);
        printOutputData(outputFloat, totalSize);

    }

    private static void runTestFloat(int fftSize, int bufferSize, int batches, File input, File output) throws IOException
    {
        ByteBuffer inputBuff = ByteBuffer.allocateDirect(bufferSize);
        FloatBuffer floatBuff = inputBuff.asFloatBuffer();
        
        // bufferSize here should be 1GB, each float is 4 bytes
        float jcufft[] = new float[bufferSize / 4];

        // Create pointers to host and device memory, allocate memory
        Pointer float_host_input = Pointer.to(jcufft);
        Pointer float_device_input = new Pointer();
        cudaMalloc(float_device_input, bufferSize);

        // Prepare the CUFFT plan
        cufftHandle plan = new cufftHandle();
        cufftPlan1d(plan, fftSize, cufftType.CUFFT_R2C, batches);
        
        FileChannel inchannel = new FileInputStream(input).getChannel();
        FileChannel outchannel = new FileOutputStream(output).getChannel();
        long size = inchannel.size();

        System.out.println("Processing "+size+" bytes ("+(size/4)+" elements) with fftSize: "+fftSize+" batches: "+batches);
        do 
        {
            System.out.println(
                "Processing bytes "+inchannel.position()+
                " to "+(inchannel.position()+inputBuff.capacity()));

            // Read and transfer data to byte array
            inputBuff.clear();
            inchannel.read(inputBuff);
            floatBuff.rewind();
            floatBuff.get(jcufft);
            
            // Copy data to device, perform FFT, copy back to host
            cudaMemcpy(float_device_input, float_host_input, bufferSize, cudaMemcpyKind.cudaMemcpyHostToDevice);
            cufftExecR2C(plan, float_device_input, float_device_input);
            cudaMemcpy(float_host_input, float_device_input, bufferSize, cudaMemcpyKind.cudaMemcpyDeviceToHost);
           
            // Write data to file
            floatBuff.clear();
            floatBuff.put(jcufft);
            inputBuff.rewind();
            outchannel.write(inputBuff);

        } while (inchannel.position() != size);

        // Clean up
        cufftDestroy(plan);
        cudaFree(float_device_input);
        
        outchannel.close();
        inchannel.close();
    }
        
    
    private static void createDummyData(File file, int size) throws IOException
    {
        System.out.println("Creating "+file+" with "+(size*4)+" bytes");
        
        DataOutputStream dos = new DataOutputStream(
            new FileOutputStream(file));
        for (int i=0; i<size; i++)
        {
            dos.writeFloat((float)Math.sin(i*0.1f));
        }
        dos.close();
    }
    
    private static void printOutputData(File file, int size) throws IOException
    {
        DataInputStream dis = new DataInputStream(
            new FileInputStream(file));
        for (int i=0; i<size; i++)
        {
            float f = dis.readFloat();
            System.out.printf("%7s", String.format("%.3f", f));
            if ((i+1)%20 == 0)
            {
                System.out.println(", ");
            }
            else
            {
                System.out.print(", ");
            }
            
            if (i > 1024)
            {
                System.out.println("...");
                break;
            }
        }
        System.out.println("
");
        dis.close();
    }
}

Sorry for the multiple posts :frowning: I can’t edit as a guest. Here are some more developments:

I copied and pasted your byte order code, as well as the malloc test code, and ran those. Your float FFT code ran fine and is identical to mine except for one thing - you test a very small number of values, while I’m doing 1GB (eventually I plan to run this on 768GB files, so the big CUDA memory buffer is a must for me). Your max malloc code bailed out at 1.8GB on my 2GB card, which is around what I expected.

The only deduction I can make when I put these two together is that the “scratch” space the CUDA plan allocates for the FFT makes it impossible to allocate 1GB on the card. However, I added CUFFT plan allocation to your malloc test (allocating space for a size-1024 FFT, which means there are 262144 batches for 1GB of data), and it takes up only 2162688 bytes! (~2MB).

So if the malloc test shows a max of 1.8GB, and I copy over 1GB, do an in-place FFT, and try to copy it back, something goes wrong. I am stumped.