Signalsmith FFT

Signalsmith FFT is a small (600-line) C++ header-only FFT implementation.

I wrote it because KISS FFT (the most common one I could find with a commercial-friendly license) was slow, and I thought I could do better. It's not as fast as FFTW (when used properly), but it's pretty good. :)

It only supports some sizes (2N x 1, 3 or 9), and will round up to the closest supported size. It produces permuted results, with a separate de-permutation step.


signalsmith-fft.h - 21kb, 600 lines

You can also view it in Bitbucket.


MIT License - if you need something else, let me know.


Despite its small size, this library:

  • Outperforms KISS FFT at all sample sizes
  • Competes with FFTW's "estimate" configuration at larger sizes

Here are some benchmarks I ran myself on the machines I had available (64-bit, max optimisation).

The (fairly rough) benchmark code is also in Bitbucket, and if you run it yourself (or add it to any other benchmarks!), please let me know.

How to use

auto fft = getFft<double>(32768); // returns a std::shared_ptr<FFT<double>>

fft->fft(buffer); // In-place (recommended)

fft->fft(input, spectrum);
fft->ifft(spectrum, input);

The .fft() method produces permuted output, and .ifft() expects permuted input. There are methods to re-order the computed spectrum:

fft->permute(permuted, unpermuted); // Out-of-place (recommended)
fft->ipermute(unpermuted, permuted);

fft->permute(result); // In-place

Not all sizes are supported. The actual size may be up to 33% larger than requested. The actual size is returned from .size() or .setSize():

int currentSize = fft.size();
int actualSize = fft.setSize(16000); // returns 16384

You can also provide a second argument to setSize()/getFft() for parallel FFTs:

auto fft = getFft<double>(256, 16); // 16 interleaved 256-point FFTs
fft.setSize(512, 8); // 8 interleaved 512-point FFTs

If .setSize() doesn't actually change the size, it doesn't reallocate/recalculate. If it does change, it costs as much as constructing a new object with getFft().

In a multi-threaded environment, each thread should have its own instance, because of temporary memory used for out-of-place/in-place shuffling.