CUDA cuFFT vs FFTW
My initial plan was to perform the FFT on the GPU, since it intuitively sounds fast to exploit the massive parallelism that a GPU provides. However, depending on the size of the FFT, GPU-based implementations may actually be outperformed by running the FFT on the CPU.
Transferring memory buffers to the GPU is a slow operation, and this transfer overhead affects the overall performance of a GPU-based FFT. When comparing CUDA cuFFT with serial FFTW, it has been shown that for small N, approximately N <= 4096, the gain from running cuFFT is lost to the slow memory transfers.
Figure from the lecture "Fast Fourier Transforms (FFTs) and Graphical Processing Units (GPUs)" by Kate Despain, University of Maryland Institute for Advanced Computer Studies; originally from the University of Waterloo (2007). It compares FFTW and cuFFT, both including and excluding memory transfer time. The Y axis shows flops and the X axis shows the size of the FFT working set.
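To make the transfer-overhead argument concrete, here is a rough back-of-the-envelope cost model. It is only a sketch: every constant in it (CPU and GPU throughput, bus bandwidth, fixed latency) is a made-up illustrative assumption, not a measurement, and the function names are my own. The only standard ingredient is the conventional flop count of 5 N log2(N) for a complex FFT of N points.

```python
import math

# All four constants below are illustrative assumptions, not measurements.
CPU_FLOPS = 2e9           # assumed sustained CPU FFT throughput, flop/s
GPU_FLOPS = 50e9          # assumed sustained GPU FFT throughput, flop/s
BUS_BYTES_PER_S = 3e9     # assumed host<->device bandwidth, bytes/s
FIXED_OVERHEAD_S = 30e-6  # assumed launch + transfer latency per transform, s
BYTES_PER_ELEMENT = 8     # single-precision complex: 2 x 4-byte floats


def fft_flops(n):
    """Conventional flop count for a complex FFT of n points: 5 n log2(n)."""
    return 5.0 * n * math.log2(n)


def cpu_time(n):
    """Modelled CPU time in seconds: pure compute, no transfers."""
    return fft_flops(n) / CPU_FLOPS


def gpu_time(n):
    """Modelled GPU time in seconds: compute + two bus transfers + latency."""
    # Copy the input to the device and the result back: two transfers.
    transfer = 2 * n * BYTES_PER_ELEMENT / BUS_BYTES_PER_S
    return fft_flops(n) / GPU_FLOPS + transfer + FIXED_OVERHEAD_S


for n in (256, 1024, 4096, 65536, 262144):
    winner = "GPU" if gpu_time(n) < cpu_time(n) else "CPU"
    print(f"N={n:>6}: cpu={cpu_time(n) * 1e6:9.1f} us  "
          f"gpu={gpu_time(n) * 1e6:9.1f} us  -> {winner}")
```

With these assumed numbers the CPU wins for the smallest sizes and the GPU wins for the largest. Where exactly the crossover falls depends entirely on the constants, so this should not be read as reproducing the N <= 4096 figure from the lecture.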
In my project I'm running a 2D FFT, since I'm simulating a water surface, so 4096 elements corresponds to a 64x64 grid, which is a rather small surface. In the original research paper by Jerry Tessendorf, Simulating Ocean Water, it is stated that: "The values of N and M can be between 16 and 2048, in powers of two. For many situations, values in the range 128 to 512 are sufficient." It is also concluded that, for interactive performance, a CPU of 1 GHz or more should nearly suffice for a 512x512 grid.
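The grid-size arithmetic can be checked quickly. This small sketch assumes, as I do in the text, that the total element count of the 2D grid is what should be compared against the 1D break-even size; the names `THRESHOLD` and `grid_elements` are just for illustration.

```python
THRESHOLD = 4096  # approximate break-even size N quoted from the lecture


def grid_elements(side):
    """Total number of elements in a square side x side grid."""
    return side * side


for side in (64, 128, 256, 512):
    n = grid_elements(side)
    print(f"{side}x{side}: {n} elements, {n / THRESHOLD:g}x the threshold")
```

So a 64x64 grid sits exactly at the threshold, while the 128 to 512 range suggested by Tessendorf gives 4 to 64 times as many elements.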
With this in mind, I think I will still be able to use cuFFT and achieve reasonable performance, as 512x512 = 262144 >> 4096. If time allows, it would be interesting to compare the performance of serial FFTW and cuFFT for different grid sizes. My initial expectation, however, is still that cuFFT is the way to go.
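A comparison across grid sizes could be organised roughly like the harness below. This is only a sketch of the measurement loop, not a real FFTW or cuFFT benchmark: the `fft2d` here is a slow pure-Python radix-2 stand-in that I use so the example runs on its own, and `time_fft2d` and `compare` are hypothetical helper names. The idea is that any callable taking an n x n grid could be dropped into the `backends` dictionary, and that a GPU backend's callable should include the host-to-device and device-to-host copies if transfer time is to be counted.

```python
import cmath
import time


def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2d(grid):
    """2D FFT by the row-column method: transform rows, then columns."""
    rows = [fft1d(row) for row in grid]
    cols = [fft1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back


def time_fft2d(fft_fn, n, repeats=3):
    """Best-of-`repeats` wall-clock time in seconds for one n x n transform."""
    grid = [[complex((i * j) % 7, 0.0) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fft_fn(grid)
        best = min(best, time.perf_counter() - start)
    return best


def compare(backends, sizes):
    """Return {backend name: {n: seconds}} for every backend and grid size."""
    return {name: {n: time_fft2d(fn, n) for n in sizes}
            for name, fn in backends.items()}


# Only the stand-in is timed here; real backends would be added to the dict.
results = compare({"pure-python stand-in": fft2d}, sizes=[16, 32, 64])
for name, timings in results.items():
    for n, seconds in timings.items():
        print(f"{name}: {n}x{n} -> {seconds * 1e3:.2f} ms")
```

Taking the best of a few repeats is a deliberate choice to reduce noise from one-off costs such as warm-up, which matters especially for small grids where the fixed overhead dominates.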