The Design of Signalsmith Stretch
Geraint Luff, Signalsmith Audio Ltd.
A closer look at our open-source C++ pitch-shifting library
Back in November, I presented Four Ways To Write A Pitch-Shifter at ADC'22. My talk's summary promised to finish with the design used in a new open-source polyphonic pitch/time C++ library.
Of course, I spent all my time writing the actual presentation, and didn't start the library (now called Signalsmith Stretch) until a week after ADC finished. 😅
Now that my talk is out on YouTube, it seems like a good moment to take a closer look at that open-source library. This article assumes you've watched the talk - or at least enough of the introduction to be confident you know what it says. There's also an introduction to Fourier/FFT which I had to cut from the talk for time.
Time-frequency observations
I'll start by summarising the fourth method in the talk (which Signalsmith Stretch is based on). We're assembling a series of output spectra (or equivalently, a set of downsampled sub-bands), and for each time-frequency point we need to decide two things:
Amplitude
Deciding the amplitude (or energy) is simple enough. We do an STFT analysis of the input, and measure the energy at an appropriate position, mapped through time and frequency:
If you're into equations, we could write that as:

$$\left|\mathrm{output}(t, f)\right| = \left|\mathrm{input}\big(\mathrm{map}_t(t),\ \mathrm{map}_f(f)\big)\right|$$
Complex phase
To decide the phase, we take measurements of the relative phase in the input compared to nearby points, and attempt to create similar phase relationships in the output.
If we assemble each spectrum by iterating upwards in frequency, we can use the existing output points (backwards in time or downwards in frequency) to make multiple predictions, and blend them together to get an output phase:
Weighted phase averages
Rather than combining the actual phase values, this blending can be done by averaging (or just summing) complex values with the appropriate phases:
We can measure the phase difference from a nearby input point by multiplying one value by the complex conjugate of the other. Writing the input spectrum as $X$:

$$X(t, f)\,\overline{X(t', f')}$$

This complex value has the phase we want, and we can multiply it against a previously-decided output value to get a phase prediction. However, its amplitude also depends on the amplitudes of both input points, which means that when we sum several predictions together, the stronger observations automatically count for more.
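As a minimal sketch of that (with illustrative names, not the library's actual API), assuming `std::complex<float>` spectrum values:

```cpp
#include <complex>
#include <vector>

using Complex = std::complex<float>;

// One phase prediction: the phase-difference between two input points
// (one multiplied by the conjugate of the other), applied to an
// already-decided output value. Its amplitude grows with the input
// amplitudes, so stronger observations automatically count for more.
Complex phasePrediction(Complex input, Complex inputNeighbour, Complex outputNeighbour) {
	return outputNeighbour*input*std::conj(inputNeighbour);
}

// Blend predictions by summing the complex values, then keep only the
// phase, attaching the separately-decided amplitude.
Complex blendPredictions(const std::vector<Complex> &predictions, float amplitude) {
	Complex sum = 0;
	for (auto &p : predictions) sum += p;
	float size = std::abs(sum);
	if (size == 0) return amplitude; // no phase information at all
	return sum*(amplitude/size);
}
```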
Scaling the vertical phase-changes
The talk also covered scaling our phase-change estimates (in both time and frequency) based on the time- and frequency-stretching factors.

Rather than scaling the phase using a bunch of (slow) trigonometric functions, we can change the distance across which we're measuring the phase, in time or frequency. For vertical changes (across frequency), this is nice and simple:
We have the whole input spectrum, so this is straightforward.
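Here's a rough sketch of that scaled-distance measurement, assuming a single input spectrum as `std::vector<std::complex<float>>` (the names are illustrative, and the rounding is a simplification):

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <vector>

using Complex = std::complex<float>;

// Vertical phase-change measured across a scaled distance in the input
// spectrum, instead of scaling the angle itself. `timeFactor` is the
// time-stretch factor, and `inputBin` is the mapped input frequency for
// our output bin.
Complex verticalPhaseChange(const std::vector<Complex> &inputSpectrum, int inputBin, float timeFactor) {
	// One output-bin step corresponds to `timeFactor` input bins; a real
	// implementation might interpolate rather than rounding
	int distance = std::max(1, (int)std::round(timeFactor));
	int lowerBin = std::max(0, inputBin - distance);
	// Phase is the difference between the two points, amplitude is the product
	return inputSpectrum[inputBin]*std::conj(inputSpectrum[lowerBin]);
}
```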
Scaling the horizontal phase-changes
Horizontal phase-changes are more awkward. If we want to avoid slow trigonometric functions, we might also think to scale the distance in a similar way:
Unfortunately, the frequency-map is not linear - it's actually quite uneven. We'll discuss why below, but it means that every frequency's phase-change would be measured across a different time-step.
While this is possible, it would make the implementation more complicated. You'd need to keep a longer history of the inputs (or input spectra) and interpolate between them differently for each frequency.
For Signalsmith Stretch, I just ignored this horizontal phase-change scaling! 😬 However, it does calculate an extra "previous" input spectrum when doing a time-stretch, so the "previous" and "current" spectra are always one STFT interval apart in the input (not affected by the time-stretch). This is the correct distance when the frequency-map is 1:1, which ends up being the case around strong harmonics, as explained below.
Multi-stage predictions
In the talk, I described making predictions forwards in time (like the phase-vocoder) and upwards in frequency (the counterpart or "vase-phocoder" prediction). This is convenient if you're assembling each spectrum upwards in a single sweep.
Something that bugs me with that arrangement is: if you have a strong tone, it will appear in multiple bins/sub-bands, but the sub-bands below that tone have no mechanism to align their phase with the much stronger bands.
There are a ton of options here. One is to include diagonal phase-observations from upwards in frequency but back in time:
Another option is to do an upwards sweep as before, and then a second downwards sweep. The downwards sweep could use the results from the first iteration to take predictions from both up and down in frequency:
Signalsmith Stretch (currently) makes an initial set of predictions using just the horizontal phase-changes, and then a second iteration uses these results for the downwards predictions:
The short vertical steps are always one output bin (with the input distance scaled by the time-stretch factor). The longer vertical step (shown here as 3 output bins) depends on the overlap ratio of the STFT.
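As a sketch of that two-pass structure (illustrative names, not the library's API, and ignoring the distance-scaling described above):

```cpp
#include <complex>
#include <vector>

using Complex = std::complex<float>;
using Spectrum = std::vector<Complex>;

// Two-pass phase assembly. `prevIn`/`currIn` are input spectra one STFT
// interval apart, and `prevOut` is the previous output spectrum.
// Amplitudes are assumed to be decided separately, keeping only the
// phase of each blended sum.
void assemblePhases(const Spectrum &prevIn, const Spectrum &currIn,
		const Spectrum &prevOut, Spectrum &out) {
	size_t bins = out.size();
	Spectrum pass1(bins);
	// Pass 1: horizontal (phase-vocoder style) predictions only
	for (size_t f = 0; f < bins; ++f) {
		pass1[f] = prevOut[f]*currIn[f]*std::conj(prevIn[f]);
	}
	// Pass 2: blend in vertical predictions, using the pass-1 estimates
	// for the ones coming downwards from above
	for (size_t f = 0; f < bins; ++f) {
		Complex sum = pass1[f];
		if (f > 0) sum += out[f - 1]*currIn[f]*std::conj(currIn[f - 1]);
		if (f + 1 < bins) sum += pass1[f + 1]*currIn[f]*std::conj(currIn[f + 1]);
		out[f] = sum; // a real implementation would re-normalise the amplitude
	}
}
```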
Multiple channels
Another thing which wasn't really mentioned in the talk is how to deal with multiple channels. The example code uses the inter-channel phase difference as a third prediction, blending it with the horizontal and vertical ones (with an extra boost for whichever of those was the strongest).
Signalsmith Stretch instead picks the loudest channel, makes a prediction as above, and then exactly copies the inter-channel phase difference from the input to preserve the stereo image.
This isn't a separate step - it's done as part of the second iteration, so that those output bins can be used for upwards predictions.
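A minimal sketch of that per-bin stereo logic (illustrative names, assuming the amplitudes and the loudest channel's output phase are already decided):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using Complex = std::complex<float>;

// Stereo strategy for one time-frequency point: each vector holds one
// complex value per channel.
void copyInterChannelPhase(const std::vector<Complex> &input, std::vector<Complex> &output) {
	// Find the loudest input channel
	size_t loudest = 0;
	for (size_t c = 1; c < input.size(); ++c) {
		if (std::norm(input[c]) > std::norm(input[loudest])) loudest = c;
	}
	for (size_t c = 0; c < input.size(); ++c) {
		if (c == loudest) continue;
		// The inter-channel phase difference from the input...
		Complex diff = input[c]*std::conj(input[loudest]);
		// ...applied to the loudest channel's output phase
		Complex rotated = output[loudest]*diff;
		float size = std::abs(rotated);
		if (size > 0) output[c] = rotated*(std::abs(output[c])/size);
	}
}
```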
Aliasing
When condensing in time or frequency, this approach works pretty well. But when stretching, particularly by larger factors, we get two forms of aliasing:
Frequency aliasing
Let's consider the STFT analysis of a steady tone, from a downsampled sub-band perspective. The sub-bands overlap, so this tone will appear in multiple bands, depending on the exact filtering used for the downsample (a.k.a. the window shape).
If we straightforwardly stretch the spectrum, then this tone would appear in even more bands. But because the sub-bands are downsampled, they have limited bandwidth. Whatever scaling we apply, some of the bands are simply too far away from the tone to be able to properly synthesise it:
So what happens to the energy in those bands? They actually produce aliased copies of the tone:
The higher the pitch-shift, the worse this gets. What can we do?
Non-linear frequency map
It's far from the only solution, but what Signalsmith Stretch does is identify peaks in the spectrum (using a simple heuristic). We then create a non-linear frequency map which is locally 1:1 around any strong harmonics:
This avoids frequency aliasing (at least for harmonic peaks) and is also how we get away with not scaling the horizontal phase-changes properly, since the frequency-stretching factor is locally 1 around the most important frequencies.
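Here's one way such a map could be sketched, as piecewise-linear interpolation between control points with slope-1 segments around each peak (the control-point scheme here is an illustration, not the library's exact heuristic):

```cpp
#include <utility>
#include <vector>

// A piecewise-linear frequency map (output bin -> input bin) which is
// locally 1:1 around spectral peaks. `peaks` holds input peak bins,
// `factor` is the frequency-stretch factor, and `radius` is how far the
// 1:1 region extends either side of each peak.
std::vector<float> frequencyMap(const std::vector<float> &peaks, float factor,
		float radius, int outputBins) {
	// Control points (output bin, input bin), assuming the peaks are far
	// enough apart that the slope-1 segments don't overlap
	std::vector<std::pair<float, float>> points{{0, 0}};
	for (float p : peaks) {
		points.push_back({p*factor - radius, p - radius}); // slope-1 segment
		points.push_back({p*factor + radius, p + radius}); //   around each peak
	}
	points.push_back({(float)outputBins, outputBins/factor});

	// Interpolate between control points for every output bin (a real
	// implementation would also clamp/handle out-of-range input bins)
	std::vector<float> map(outputBins);
	size_t index = 0;
	for (int b = 0; b < outputBins; ++b) {
		while (index + 2 < points.size() && points[index + 1].first <= b) ++index;
		auto &low = points[index], &high = points[index + 1];
		float ratio = (b - low.first)/(high.first - low.first);
		map[b] = low.second + ratio*(high.second - low.second);
	}
	return map;
}
```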
Time-aliasing
There's a similar problem with time-stretching. Because our blocks overlap, a short transient sound will show up in multiple blocks:
When combined with the vertical phase-scaling, these blocks produce distinct, separate transients at time-aliased locations:
If you're curious what that sounds like, here's an 8x stretch of a drum-loop:
Non-linear time
It would be great to have a non-linear time map which bunched up around transients, just as this implementation bunches up around strong frequencies. It would keep transients clearer when stretching, and enable longer time-stretch ratios.
... the hack
For now, there's a hack which limits the vertical phase-scaling to 2x, and starts to randomise slightly for longer stretches. This means that although the transient appears in too many blocks, we don't get such distinct time-aliased copies. Instead, we get a juddery smudge:
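A sketch of what that hack could look like (the exact numbers and randomisation here are invented for illustration, not taken from the library):

```cpp
#include <algorithm>
#include <cstdlib>

// Cap the vertical phase-change scaling at 2x...
float verticalScale(float timeFactor) {
	return std::min(timeFactor, 2.0f);
}
// ...and blend in a little randomness for longer stretches, so the
// time-aliased copies smear instead of repeating cleanly.
float randomPhaseOffset(float timeFactor) {
	if (timeFactor <= 2) return 0;
	float excess = 1 - 2/timeFactor; // 0 at 2x, approaching 1 for long stretches
	float unit = std::rand()/(float)RAND_MAX; // crude uniform 0-1
	return excess*(unit - 0.5f)*6.2831853f; // scaled random phase (radians)
}
```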
It's still wrong, but it's a more familiar type of wrong. Fixing this and doing it properly is at the top of my TODO list, and mostly got delayed because I went off on a tangent thinking about transient-detector design. 😅
Tonality limit
Lower harmonics are more important for pitch-perception than higher ones, with the higher parts mostly contributing the timbre (or rhythm, in the case of percussion).
Signalsmith Stretch includes an option to take advantage of this when pitch-shifting, by using a non-linear frequency map. It's scaled according to the pitch-shifting factor up to some corner frequency, and then it's 1:1 after that:
Since we're not actually frequency-stretching the higher parts of the spectrum, we should get fewer artefacts. I think it also preserves slightly more of the input timbre, since those higher frequencies are closer to their original positions.
The corner frequency we use is the tonality limit multiplied/divided by the square root of the pitch-shift factor, so that the input and output corners sit geometrically either side of the specified limit.
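In code, that map could look something like this (illustrative names, mapping input frequency to output frequency, with `factor` as the pitch-shift ratio):

```cpp
#include <cmath>

// Tonality-limit frequency map: scaled by the pitch-shift factor below
// the corner, and 1:1 (slope 1) above it. Frequencies are in whatever
// units the tonality limit uses.
float mapWithTonalityLimit(float inputFreq, float factor, float tonalityLimit) {
	float corner = tonalityLimit/std::sqrt(factor); // input-side corner
	if (inputFreq < corner) return inputFreq*factor; // scaled below the corner
	return corner*factor + (inputFreq - corner); // 1:1 above it
}
```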
Time + frequency vs. resampling
Polyphonic pitch/time libraries often stretch in either pitch or time (usually time) and then use resampling to adapt that to the other kind of shift. The fact that our library does both separately is a little unusual.
It also prompts the question: if you stretch out in time by 2x, and also down in pitch by an octave, is the result equivalent to playing the input at half-speed?
In the current implementation, no. But it is possible, and I wanted to quickly explain how it could be done.
Observation shapes
One thing I didn't have time to dig into in my talk (and which is a bit too deep to get into here!) is that the time-frequency observations we make have a shape. When we're synthesising the spectrum, the time-frequency points similarly have a shape.
These shapes depend on the block-length and the window function we're using, and are subject to funky time-frequency uncertainty constraints: they can't be too focused in both time and frequency, so you have to compromise.
Making it equivalent to resampling
If we have a matched time-frequency shift, and we wanted our result to be equivalent to a resample, our input observation shapes would need to be appropriately-stretched equivalents to our output synthesis shapes:
You could change these shapes by using a different STFT block size, or you could derive differently-shaped input observations by combining multiple observations (using nearby times to make it flatter, or nearby frequencies to make it taller).
You'd also need to adjust the non-linear frequency/time logic as appropriate, and also decide what to do when the time-/frequency-stretches don't match up exactly.
Conclusion
I haven't picked through the implementation line-by-line, but those are some of the significant design choices in Signalsmith Stretch. If you haven't already, listen to some examples or give it a spin!
And if you have any other questions, get in touch.