My motion compensation code in lab spends most of its CPU time doing length-8192 iFFTs using FFTW 3.2.2. Specifically, each motion-update event corresponds to a cross-correlation comparison of the input with 20-160 reference vectors. Each of these comparisons is totally independent, so the problem is “embarrassingly parallel”. This was irrelevant when I wrote the code on a uniprocessor Pentium 4, but when that machine began to fail I moved to our sparkling new 8-core Xeon i7. Just one of those cores is already faster than the original hardware, but hey, faster is always better, right?
Looking for the path of maximum laziness, my first idea was to let FFTW handle the parallelization. I replaced my fftw_plan_dft_c2r_1d(n, in, out, FFTW_MEASURE)*, which produces a plan that is used k times, with fftw_plan_many_dft_c2r(1, &n, k, in, NULL, 1, d, out, NULL, 1, n, FFTW_MEASURE)*, where d = n/2 + 1. Then I called fftw_plan_with_nthreads to enable parallelization. This required refactoring the cross-correlations so that all the FFTs could be performed simultaneously, at a certain cost in cache performance, but I figured it would be a small price to pay.
I was wrong. Even just benchmarking the FFTs, the speedup from 1 thread to 4 threads was only 33%. The program actually ran slower with 8 threads than it did with just 1. I tried padding d so that each transform would start on a 16-byte boundary, but nothing helped. My best guess is that FFTW just doesn’t know that it should parallelize across the independent transforms in plans from fftw_plan_many_*. Instead, it seemed to be splitting each of the 160 transforms across all the threads, even when this was counterproductive.**
Having exhausted this approach, I tried the next thing on my list, OpenMP***. This was much easier than expected. All I had to do was make my cross-correlation search thread-safe by switching to the new-array execute function fftw_execute_dft_c2r. After that, it just took one #pragma omp parallel for (and about an hour of fighting with Visual Studio configuration options … it turns out OpenMP only works in Release mode, not Debug). The resulting code is simple, clear, and gives nearly perfect linear speedup at 8 threads.
Eight times faster would be pretty good for two days’ work, especially if I actually needed the code to be any faster than it already was.
*: I only need single precision, so actually these calls are all fftwf_*.
**: This is a wild guess. If it’s true, it might be an undiscovered bug because I’m the only person who’s tried to combine plan_many, complex-to-real, Windows, single-precision mode, and multithreading.
***: Oddly enough, Intel’s own Threading Building Blocks page has a great explanation of why you (if you’re me) should use OpenMP (and not TBB).