
Phase Vocoder Design, Implementation, and Effects

Processing

Matthew Pierce

Final Project Report


Electrical Engineering 484
University of Victoria

Abstract

The following report outlines the functionality of the phase vocoder: a
software tool that facilitates audio signal processing in both the time and
frequency domains. In addition, the report discusses at length a Matlab
implementation of a phase vocoder and various audio effects that can be
achieved with it. This discussion incorporates the obstacles and successes
encountered by the author while building the program, in order to present what
can hopefully be a useful and insightful guide for others. Brief code examples
appear within the report; the full implementation code can be found in the
first online appendix (Appendix A). The sounds created by the
phase-vocoder-based effects are also discussed in the report, and can be
listened to in the second online appendix (Appendix B). Appendix B
additionally contains a few informative diagrams omitted from the report for
brevity. Both appendices are located on the author's website:
http://web.uvic.ca/~mpierce/elec484

1 Introduction

To begin, a note should be made on the distinction between the phase vocoder
itself and the audio-effects processing it facilitates. Often, as is the case
with the implementation discussed throughout this paper, the two parts are not
well separated within the code. This is because, in general, effects
processing is done before the phase vocoder has finished executing. At a high
level, the phase vocoder can be thought of as containing two distinct
operations: analysis and synthesis (or resynthesis). These will both be
discussed at length later on; for now, it is important to note that most
effects processing is done following the analysis stage and prior to the
synthesis stage. Because of this, much of the code involved in creating audio
effects is quite deeply embedded within the phase vocoder code itself. The
presentation of ideas in this report, however, will refrain from discussing
audio effects until after fully discussing the concepts behind the phase
vocoder, thus breaking from the ordering present within the code.

2 Analysis

The first stage of the phase vocoder is the analysis of a digital audio
signal. Analysis, here, is a general term describing the operations undergone
by the signal in order to reach a final result: a complete frequency domain
representation of the signal over time. These operations include splitting the
signal into smaller segments, smoothing these segments using a windowing
function, centering each segment with respect to time, taking the Fast Fourier
Transform (FFT) of each segment (the collection of these per-segment
transforms forms the Short-Time Fourier Transform, or STFT), and storing the
magnitude and phase values of each segment. Each step is further detailed in
the following sections.

2.1 Windowing

The first step in the phase vocoder analysis stage is to split the audio
signal into segments of a predetermined number of samples (known as the window
size). Each element in the segment is then multiplied by a windowing function
(the reason for this will be explained later). There are a number of different
windowing functions that can be used, such as the Hanning and Blackman
windows. The one used in the example program is a raised-cosine window
(equivalent to the Hanning window), given by the following equation for a
window of N samples:

w(n) = (1/2)·(1 − cos(2πn/N)),  n = 1, 2, …, N

The code below shows how to window an audio signal (x) into segments of 2048
samples:

% Window size and analysis hop size variables
win = 2048;
ra_hop = 256;

% Create the raised cosine window
rcos_win = 1/2*(1-cos(2*pi*(1:win)/win));

for n=1:ra_hop:length(x)-win

    % Partition the signal into segments of size win
    x_i = x(n:n+win-1);

    % Piecewise multiply the input segment with the cosine window
    x_rcos = x_i .* rcos_win;

Notice in the above code that the for-loop index n is incremented by a
variable called ra_hop. This value is known as the analysis hop size and is
vitally important to the operation of the phase vocoder. To obtain good audio
quality upon signal resynthesis, the signal must be split into more segments
than simply the original signal length divided by the window size. In order to
avoid distortion, the analysis hop size must be, at most, half of the window
size (in the code it is 1/8th of the window size). Intuitively, we can see
that this will result in more segments being stored and that the segments are
going to overlap one another (i.e. each sample of the input signal will be
present in more than one windowed segment, except for those at the very
beginning and end of the audio vector).
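To make the overlap concrete, the segmentation loop above can be mirrored in a
few lines of NumPy (Python is used here purely for illustration; the report's
own code is Matlab, and the variable names below are this sketch's, not the
author's). With a window of 2048 and a hop of 256, every interior sample lands
in win/ra_hop = 8 different windowed segments:

```python
import numpy as np

win = 2048    # window size, as in the report's example
ra_hop = 256  # analysis hop size (win/8)

x = np.zeros(5 * win)  # stand-in for an audio signal
starts = list(range(0, len(x) - win, ra_hop))

# Count how many segments each input sample belongs to
coverage = np.zeros(len(x))
for n in starts:
    coverage[n:n + win] += 1

# Interior samples are covered by win/ra_hop = 8 overlapping segments;
# only samples near the edges of the signal appear in fewer.
```

This overlap factor is exactly why the hop size, not the window size, controls
how many frames the analysis stage stores.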

2.2 Cyclic Shift

An important detail of phase vocoder implementation that can be easily
overlooked is the fact that each windowed segment must undergo a cyclic shift
(or circular shift) operation before its frequency representation is stored.
The reason for doing the cyclic shift is that taking the FFT of an unshifted
signal, where the time origin is at the leftmost part of the window, can
result in improper phase response values, such as non-zero phases for
centered impulses.
There are two ways to achieve the cyclic shift: one in the time domain and one
in the frequency domain. In the time domain, each waveform sample must remain
within the window but is shifted in time by half of the window size (moving,
for instance, all the left values to the middle). Values beyond the halfway
point are wrapped around to the beginning of the window. The cyclic shift can
be achieved in the frequency domain (i.e. after taking the FFT) by multiplying
the FFT result, X(k), by −1 to the power of k:

X_shifted(k) = (−1)^k · X(k)

Matlab has a quick function for doing the cyclic shift in the time domain, called
circshift(). Our example uses this method for programmatic convenience. The method, as
it appears in the example, is given below. Recall that x_rcos is the windowed segment
that was obtained above:

% Perform a circular shift on the time domain segment
x_rcos_shft = circshift(x_rcos, [0, win/2]);

The second parameter to the circshift() function specifies the shift along
each dimension: here, zero rows and win/2 columns (half the window size).
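The equivalence of the two approaches is easy to check numerically. The NumPy
sketch below (Python for illustration; np.roll plays the role of circshift)
confirms that a half-window cyclic shift in the time domain matches
multiplying the spectrum by (−1)^k:

```python
import numpy as np

N = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

# Time-domain route: cyclic shift by N/2, then FFT
X_time_route = np.fft.fft(np.roll(x, N // 2))

# Frequency-domain route: FFT first, then multiply bin k by (-1)^k
k = np.arange(N)
X_freq_route = np.fft.fft(x) * (-1.0) ** k

# The two spectra agree to machine precision.
```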

2.3 Fast Fourier Transform

The final step towards obtaining the time-frequency representation of an audio signal is
taking the Fast Fourier Transform of each windowed and shifted segment. The
FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT),
described by the following equation:

X(k) = Σ_{n=0}^{N−1} x(n)·e^{−j2πnk/N},  k = 0, 1, …, N−1

This transform gives us the frequency domain representation of the signal.
For our
purposes, we wish to separate the magnitude (absolute value) and phase (angle) values of
the frequency spectrum before we store them. The reason for this will become obvious
later when we see a number of effects that can be achieved by operating on each set of
values individually. The code below completes the picture for the analysis stage of the
phase vocoder:

% Take the fft of the windowed input
Y_rcos_shft = fft(x_rcos_shft);

% Store the magnitude and phase of this fft window in matrices
mag = abs(Y_rcos_shft);
fft_mag_matrix(floor(n/ra_hop)+1, 1:win) = mag;

phase = angle(Y_rcos_shft);
fft_phase_matrix(floor(n/ra_hop)+1, 1:win) = phase;

The fft() function takes the segment obtained from the cyclic shift as its argument and
returns an N-point FFT vector, where N is the length of x_rcos_shft (N can also be
explicitly specified as a second argument to fft(), however, for our purposes, we will
always want the length of the time and frequency domain vectors to be the same during
the analysis phase). The separated magnitude and phase vectors are obtained by taking
the absolute value and the angle of the FFT segment, respectively. The
fft_mag_matrix and fft_phase_matrix variables are our storage matrices, in
which the rows represent time and the columns represent frequency bins
(holding magnitude and phase values, respectively). The index expression
floor(n/ra_hop)+1 computes which segment (a sequential value, starting from 1)
the phase vocoder is currently processing.
Once a magnitude vector and a phase vector have been stored in their
appropriate matrices, we have the complete frequency representation of the
current segment. Once the entire audio signal is processed (the for-loop index
n reaches the length of the audio signal minus one window length), we have the
complete time-frequency representation of the signal and have thus completed
the analysis phase.
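Gathering the steps of the analysis stage, the whole loop can be sketched
compactly as follows. This is a NumPy illustration under the report's
conventions, not the author's Matlab; the helper name analyze and its default
parameters are invented here:

```python
import numpy as np

def analyze(x, win=512, ra_hop=64):
    """Window, cyclic-shift, and FFT each segment, storing the
    magnitudes and phases row by row (rows = time, columns = bins)."""
    rcos_win = 0.5 * (1 - np.cos(2 * np.pi * np.arange(1, win + 1) / win))
    mags, phases = [], []
    for n in range(0, len(x) - win, ra_hop):
        seg = x[n:n + win] * rcos_win          # windowing
        seg = np.roll(seg, win // 2)           # cyclic shift
        Y = np.fft.fft(seg)                    # frequency representation
        mags.append(np.abs(Y))
        phases.append(np.angle(Y))
    return np.array(mags), np.array(phases)

# A bin-centered cosine with 8 cycles per 512-sample window:
# its magnitude rows should peak at bin 8.
x = np.cos(2 * np.pi * 8 * np.arange(2048) / 512)
mag_matrix, phase_matrix = analyze(x)
```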

3 Synthesis

The second and final stage of the phase vocoder is the synthesis (or, more accurately,
resynthesis) of the audio signal in the time domain. Synthesis describes the process
involved in creating a new time domain signal from the frequency domain segments
obtained during the analysis stage. Prior to returning to the time domain, however, there
are a number of special phase calculations that are necessary in order to execute two of
the most common phase vocoder effects: time stretching and pitch shifting. The
algorithms for finding the unwrapped phase and principal argument values,
which will be discussed at length in the next section, do not, by themselves,
create an audio effect. Because of this, and because time stretching is such a
fundamental phase vocoder operation, these algorithms are considered to be
part of the phase vocoder implementation itself, as opposed to an audio-effect
implementation.
Following the initial phase calculations, the remaining steps involved in resynthesizing
the signal include taking the inverse FFT (IFFT), unshifting the new time domain
segment, re-windowing the signal to remove artifacts, and performing overlap-add in
order to recombine each segment at its proper location in time.
The synthesis stage will be presented here in the general case, which makes use of a
resynthesis hop size that is separate (although not necessarily different-valued) from the
analysis hop size. Only certain effects actually make use of the two hop sizes, and in all
other cases the two values will be the same. Both types of effects will be discussed later
in this report.

3.1 Target Phase, Phase Unwrapping, and the Principal Argument

In completing the phase vocoder, the author found that the most difficult section both
from a comprehension and a programmatic standpoint was the implementation of a
number of phase calculations necessary for both the time stretching and pitch shifting
effects (although the operations, in and of themselves, do not alter the signal). Before
showing the example code, it is important to discuss the definition and purpose of each
phase calculation.
The main idea behind each phase operation is the concept of instantaneous
frequency. Mathematically speaking, the instantaneous frequency is the
derivative of a signal's phase. Practically, the instantaneous frequency of
each FFT bin determines which frequency is contained in that bin (in the same
way that the magnitude of each bin determines that frequency's level). The
calculation of a target phase for each bin of each FFT segment (besides the
first) allows us to represent these instantaneous frequencies through time.
Not only does this give us a more in-depth view of our signal's frequency
content, it also allows us to alter the individual frequencies of each bin.
The target phase can be found by using the phase values of the previous phase
vocoder window. The following equation describes the target phase calculation,
where sR_a is the index of the previous window based on the analysis hop size
R_a (i.e. (s+1)R_a is the current window):

φ_t((s+1)R_a, k) = φ(sR_a, k) + R_a·Ω_k

where

Ω_k = 2πk/N,  k = 0, 1, …, N−1
Now, it's important to realize that the phase values being used in the above
equation are not the exact measured phases obtained in the analysis stage. The
phase windows must first undergo an unwrapping stage. This consists of two
steps:

1. Finding the difference between the current window's measured phase and the
previous window's target phase.
2. Obtaining a principal argument vector that brings each phase value within
the range (−π, π].

The first step is a simple subtraction (the only tricky part is storing the
windows properly from one frame to the next, which will be discussed in the
code example). Next, the principal argument calculation is done using the
following formula:

princarg(φ) = mod(φ + π, −2π) + π

where φ is the difference vector obtained from the first step. Once this
vector is calculated, we can get the instantaneous frequency values with the
following:

ω̂(k) = (R_a·Ω_k + Δφ_p(k)) / R_a

where the vector Δφ_p is the one calculated using the principal argument
formula, and Ω_k is the same as above. Now we just need to advance the
previous phase window values by the per-window phase increment (the
instantaneous frequency multiplied by the hop size) to obtain our updated
window values. What this does is give us a new phase slope, which is to say,
an updated (but possibly unchanged) rate of phase change, and hence
instantaneous frequency, for each FFT bin through time.
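As a concrete illustration, the principal-argument formula can be written as a
one-line function. This is a Python sketch (not the report's Matlab); it works
because Matlab's mod and Python's % both take the sign of the divisor, which
the formula relies on:

```python
import math

def princarg(phase):
    """Wrap an arbitrary phase value into the interval (-pi, pi]."""
    return (phase + math.pi) % (-2 * math.pi) + math.pi

# Example: a phase that is 2*pi beyond 0.5 wraps back to 0.5.
```

Feeding the phase-difference vector through this function element by element
yields the wrapped differences used in the instantaneous-frequency
calculation.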
After we have obtained our updated instantaneous frequency values we must ensure that
our original FFT vector is updated as well. This is so that if any changes have been made
to the phase slope (ie. during a time-stretching operation) it is reflected when converting
back to the time domain. To recombine the magnitudes and phases we can utilize the
following relation where y(n, k) is the new FFT segment, X(n, k) is the old segment, and
the phi vector contains the updated phase window:

Programming these phase calculations took a great deal of trial and error to
ensure that each step returned a proper result. The author advises displaying
the values of the phase vectors after each step, and through time, so that the
calculations can be double-checked by hand. The code below follows from the
analysis stage and includes a number of important initialization steps, in
addition to the phase calculations themselves:

% Create the omega vector (the expected phase advance per analysis hop
% for each bin) for unwrapping the phase
omega = 2*pi*ra_hop*(0:win-1)/win;

if(n==1)
    % Set the initial phase window vectors for the first pass
    last_phase_win = phase;
    target_phase = phase + omega;
end

% Phase processing
if(n>1)

    % Create the princarg difference vector argument
    princarg_vector = phase - target_phase;

    % Determine the principal argument (wrap into (-pi, pi]);
    % MATLAB's mod() operates element-wise, so no loop is needed
    unwrapped_phase_diff = mod(princarg_vector + pi, -2*pi) + pi;

    % Get the per-hop phase advance for each bin (the instantaneous
    % frequency in radians/sample multiplied by the analysis hop size)
    inst_freqs = omega + unwrapped_phase_diff;

    % Find the new phase values using the previous window
    new_phase_win = inst_freqs + last_phase_win;

    % Set the updated target_phase and last_phase_win values
    target_phase = phase + omega;
    last_phase_win = new_phase_win;

    % Get the altered frequency domain window segment
    Y_rcos_shft = mag .* exp(j*new_phase_win);

    % Add the altered vector back into the phase matrix
    fft_phase_matrix(floor(n/ra_hop)+1, 1:win) = angle(Y_rcos_shft);

end

Notice that the phase calculations do not begin until after the first pass of
the for-loop (if(n>1)); the first pass merely consists of setting a previous
measured phase window (last_phase_win) and a target phase (target_phase) using
the first measured phases. Updating these last-window values on subsequent
passes can theoretically be done at any point after the calculations have been
completed (and before the end of the for-loop); however, it is advised that
they be handled immediately, so as to keep the code organized into
well-defined sections. Additionally, notice that after the FFT vector is
recalculated using the complex exponential, the new angle values are stored
back into the phase matrix at the current window so that, if the values have
changed, this is properly reflected in any output plots.
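The payoff of these calculations can be checked on a synthetic signal. In the
NumPy sketch below (Python for illustration, with variable names invented
here), two analysis frames of a cosine whose frequency falls between bins are
enough to recover that frequency from the wrapped phase difference:

```python
import numpy as np

N, Ra = 1024, 128
true_bin = 50.37                            # deliberately not bin-centered
w0 = 2 * np.pi * true_bin / N
x = np.cos(w0 * np.arange(N + Ra))

rcos_win = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

def frame_phase(start, k):
    # Window, cyclic-shift, FFT, then read the phase of bin k
    seg = np.roll(x[start:start + N] * rcos_win, N // 2)
    return np.angle(np.fft.fft(seg)[k])

k = 50                                       # the peak magnitude bin
omega = 2 * np.pi * Ra * k / N               # expected phase advance per hop
dphi = frame_phase(Ra, k) - frame_phase(0, k) - omega
dphi = (dphi + np.pi) % (-2 * np.pi) + np.pi     # principal argument
inst_freq = (omega + dphi) / Ra                  # radians per sample
estimated_bin = inst_freq * N / (2 * np.pi)      # ~50.37, not just 50
```

The estimate lands on 50.37, not the integer bin 50, which is exactly the
refinement the phase unwrapping provides.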
3.2 Inverse FFT, Unshifting, and a Second Windowing

The Inverse Fast Fourier Transform performs (as one would expect) the inverse
operation of the Fast Fourier Transform. It is a fast algorithm for computing
the inverse DFT, given by the following equation:

x(n) = (1/N)·Σ_{k=0}^{N−1} X(k)·e^{j2πnk/N},  n = 0, 1, …, N−1

As with the FFT, there is a Matlab function, ifft(), which performs this operation. The
example code is shown below:

% Find the time domain output (still shifted) using ifft
y_rcos_shft = ifft(Y_rcos_shft);

Notice that the argument to the ifft() function is the windowed FFT segment
obtained from the analysis stage (and updated by the phase processing
described above). It should be noted that different
implementations might separate the analysis and synthesis stages into multiple functions.
The author, however, found it simpler to keep both parts within the same program and, in
fact, within the same for-loop structure (this goes for the phase calculations above, as
well). One reason for this was the ability to maintain recently defined variables, such as
the Y_rcos_shft vector used here, without having to re-initialize them within a new
function.
Once the new time domain segment has been obtained, the time axis must be re-aligned
(remember we shifted it by half the window size in order to center the FFT). We also
would like to smooth the output, as artifacts may have crept into our signal due to the
FFT resolution, which is determined by our window size. These two operations are done
using code that mirrors the cosine window and cyclic shift operations done during the
analysis stage:

% "Unshift" the segment
y_rcos = circshift(y_rcos_shft, [0, -win/2]);

% Multiply by the cosine window again for output smoothing
y_i = y_rcos .* rcos_win;

The second argument to the circular shift function can be either +win/2 or
−win/2 due to wraparound, as previously discussed; however, using the negative
value fits better conceptually with the idea of mirroring the analysis stage.

3.3 Overlap-Add

The final operation involved in the synthesis stage (and the phase vocoder itself) is
overlap-add. Overlap-add takes the new time-domain segment and puts it into an
output vector at the proper time index. This index is determined by the resynthesis hop
size and not the analysis hop size. If the two values are equal, the output segment will be
added at the same index that the input segment came from and the length of the complete
output will be the same as the length of the input. If the two values are different, however
(due to a time-stretching), the indexing will be different from the input and the output
length will also vary.
Because the analysis and resynthesis hop sizes will both be, at most, half the
window length, there will always be some overlap between consecutive segments
(as discussed previously). Intuitively, it seems as though adding the
overlapping values together in the output signal would inflate the amplitude
of the waveform. This is where the cosine windowing function comes into play:
the shifted copies of the window (applied once at analysis and once at
synthesis) sum to a constant across the overlap region, so when overlapping
samples are added together the result is the original waveform multiplied by a
fixed gain, which depends only on the window and hop size and can be
normalized out. This greatly simplifies the overlap-add operation, as can be
seen in the example code:

% Set a variable to keep track of the resynthesis hopsize index
% (done prior to the for loop)
m = 1;

% Get the length value
len = length(y_i);

% Overlap add the output
y(m:m+len-1) = y(m:m+len-1) + y_i;

% Increment the resynthesis hopsize index
m = m + rs_hop;

end % End of for loop

The m variable, initialized before the for-loop, is our resynthesis index
(whereas n can be thought of as our analysis index) and uses the resynthesis
hop size variable, rs_hop, as its increment value. Obtaining the len variable
is not actually necessary when the input and output segments are the same
length, but in other cases (e.g. pitch shifting) it will be necessary and
greatly helps to simplify the indexing.
Once the overlap-add is completed, we have finished implementing the phase
vocoder.
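The cancellation between the two window applications can be verified directly.
In the NumPy check below (Python for illustration, not part of the report's
code), windowing each frame twice and overlap-adding at a hop of win/8, the
report's ratio, reproduces every interior sample with a constant gain, which a
real implementation would divide out:

```python
import numpy as np

N, hop = 256, 32                  # hop = N/8, matching the report's ratio
rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(N) / N))

y = np.zeros_like(x)
for n in range(0, len(x) - N, hop):
    y[n:n + N] += x[n:n + N] * w * w   # window applied at analysis AND synthesis

# Away from the signal edges, the shifted squared windows sum to a
# constant (3 for this raised-cosine window at hop N/8), so y is
# just a scaled copy of x.
gain = 3.0
```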
4 Testing

It is important to thoroughly test the phase vocoder implementation, not only
to ensure that it is working correctly, but also to achieve a greater
understanding of what it actually does with different types of signals. We'll
begin testing the system with simple sinusoids of various frequencies and plot
their magnitude and phase spectra in the time-frequency domain. We will also
test the effect of the cyclic shift, in order to gain a greater appreciation
for its purpose. Finally, we'll test the system with a complex audio signal,
in this case a vocal track, to check that the system still works when faced
with more difficult calculations.

4.1 Cosine Waves

This test will use six sinusoid waves of six different frequencies. The frequencies have
been chosen such that all of the following are tested: Waves with an integer number of
samples per cycle (the center frequency is a factor of the sampling rate), waves with a
non-integer number of samples per cycle, waves with many cycles per segment (high
frequency), waves with about one cycle per segment, and waves with a fraction of a cycle
per segment (low frequency). The sinusoids are cosine functions that have the given
center frequency and a sampling rate of 8000Hz. A window size of 2048 was used and
the analysis and resynthesis hop sizes were both 1024 (win/2).
First we will ensure that for each cosine wave, the input and output signals are identical
by plotting both the time and frequency representation of a single windowed segment
(because these are continuously periodic waveforms, we need only look at one segment
for this):

Cosine 1 = 1000Hz (Integer number of samples per cycle, fraction of a cycle per
segment)
Cosine 2 = 976Hz (Non-integer number of samples per cycle, fraction of a cycle per
segment)

Cosine 3 = 2000Hz (Integer number of samples per cycle, about one cycle per segment)

Cosine 4 = 2085Hz (Non-integer number of samples per cycle, about one cycle per
segment)
Cosine 5 = 4000Hz (Integer number of samples per cycle, many cycles per segment)

Cosine 6 = 4013Hz (Non-integer number of samples per cycle, many cycles per segment)

We can see from each plot that the waveform and frequency content match in each case.
Also, we can see how choosing an integer or non-integer number of samples per cycle
can alter the accuracy of the waveform and frequency content.
Next we want to show each cosine wave in the time-frequency domain to get a
better idea of how the magnitudes and phases change over time in each case.
The following plots were created using the surf() function in Matlab, which
creates a 3-dimensional plot of a given matrix.
These plots show the values of the phase vocoder's magnitude matrix for each
waveform, with the rows of the matrix representing the time axis and the
columns representing the frequency bins of the magnitude spectrum:
Cosine 1 (1000Hz)

Cosine 2 (976Hz)

Cosine 3 (2000Hz)
Cosine 4 (2085Hz)

Cosine 5 (4000Hz)

Cosine 6 (4013Hz)
Notice that the waves with non-integer numbers of samples per cycle have
weaker magnitude intensities at the center frequency (represented by color)
than those with integer numbers of samples per cycle. This is because, for
these signals, there are sideband frequencies (hard to see in these plots) to
either side of the center frequency that take up some of this energy.
Understanding why these sidebands occur, however, is easier after adding the
phases to the picture.
The following plots were created using the surf() function with the phase
matrices as inputs. Additionally, those phases not near the center frequency
bins have been removed (set to zero) to make it easier to understand what is
happening (for the curious, the full non-zeroed phase plots can be found in
Appendix B on the author's website):

Cosine 1 (1000Hz)

Cosine 2 (976Hz)
Cosine 3 (2000Hz)

Cosine 4 (2085Hz)

Cosine 5 (4000Hz)
Cosine 6 (4013Hz)

We see now that the phase is constant through time for waves with integer
numbers of samples per cycle (we can disregard the nonlinear sections at the
end of each of these plots, as they are simply created by the windowing
function; remember that there will be no overlapping window segments for the
last hop). For the waves with non-integer numbers of samples per cycle,
however, the phase ramps at a constant rate. This causes the instantaneous
frequency values of certain FFT bins to change with time, so that while the
wave itself has a steady frequency, the phase vocoder's frequency
representation is slightly off, resulting in sidebands. Because this
phenomenon depends on whether the instantaneous frequency content is contained
completely within an FFT bin or not, we use the terms "bin-centered" and "not
bin-centered" to describe it.

4.2 Understanding the Cyclic Shift

To better describe the effect that the cyclic shift operation has on the phase
vocoder, we can compare the plots of signals where the cyclic shift is used to
plots of the same signals with the cyclic shift taken out. First we'll test
using a 1000Hz cosine wave, which, using the cyclic shift, has the magnitude
and phase spectra below:
Now, removing the cyclic shift, we get the following figures:

As we can see, the magnitude spectrum is unchanged; however, a negative
(relative) phase component has been added to the original phase components. To
test further, we use an impulse signal, which has the following frequency
spectra while using the cyclic shift:
As we would expect, there is a magnitude spike right at zero and no significant phase
content. Now, after removing the cyclic shift again, we obtain the following plots:
Once again the magnitude spectrum is left unchanged, while the phase spectrum
has an added negative component. We conclude that this negative component is
the linear phase term that arises when the window is not time-centered prior
to taking the FFT. Intuitively, we can see that this will confuse the phase
vocoder's instantaneous frequency calculations and add noise to the output
signal.

4.3 Testing With Real Audio

As a final test of the flexibility of our phase vocoder implementation, we
will give the phase vocoder a more complex, real audio signal as input (i.e.
one that might be used in real signal-processing situations). In this example
we use the signal Toms_diner.wav, a short vocal track, and plot the entire
input and output signals in the time and frequency (magnitude) domains:

This indicates that, apart from some slight smoothing introduced by the
windowing function, the two signals are the same. Using the wavwrite command,
an output audio file was created (TD_Output.wav, available in Appendix B
online). Listening to the audio file confirms this result as well. The
implementation is therefore robust enough to deal with complex signals and can
be used for real-world audio processing.
5 Effects Processing

A properly implemented phase vocoder can facilitate a number of robust audio
effects through some simple modifications of the phase vocoder signal. Each
effect discussed in this section modifies the signal (more accurately,
individual segments of the signal) immediately following the phase vocoder's
analysis stage and preceding the synthesis stage. Some effects, such as time
stretching, however, have implementation details that spill over into the
phase vocoder implementation itself. Still other effects do not make use of
certain phase vocoder features, and in the specific effects programs located
in Appendix A (on the author's website), these features have been removed for
efficiency.
While this report does not provide a comprehensive list of effects that can be achieved by
utilizing the phase vocoder, it does cover a number of common effects, a few esoteric
effects, and even explores how a phase vocoder might be used for audio compression.

5.1 Time Stretching and Pitch Shifting

Stretching a digital audio signal in time is a relatively simple task. The
basic idea is to take a signal at one sampling rate and resample it to a
different rate (e.g. from 44,100 Hz to 48,000 Hz). This process will either
shorten or lengthen the signal, depending on whether the new rate is higher or
lower. Our phase vocoder will be able to do this if we simply change the
resynthesis hop size to be different (higher or lower) than the analysis hop
size. Looking once again at the overlap-add process, we can see how this will
work:

% Get the length value of the new time domain segment
len = length(y_i);

% Overlap add the output
y(m:m+len-1) = y(m:m+len-1) + y_i;

% Increment the resynthesis hopsize index
m = m + rs_hop;

end % End of for loop

Here, the length of our new time domain segment y_i is still equal to the
window length; however, because our resynthesis index m changes at a different
rate than our analysis index n, each segment is pieced together at a different
time index than where it was taken from the original input signal. Because of
this, the overall length of the signal will be different, even though each
segment is still the same length. We can appreciate, however, that doing this
alone shifts the segments so that they no longer line up in time the way they
were analyzed. This creates the same kind of effect that you would hear if you
were to speed up or slow down a record on a turntable: while the audio is now
a different length, all of the frequency information has been altered as well
(think chipmunks or Darth Vader). The way to counteract this unwanted side
effect has already been partially implemented. The phase processing operations
at the beginning of the phase vocoder's synthesis stage give us the ability to
alter the instantaneous frequencies of each FFT bin. By setting our target
phase to the values appropriate for the different-length signal, we alter the
phase slope, and therefore the frequency content, to match the new signal
length. This restores our initial frequency content while still giving us an
output signal of the specified length.
Since all of the phase slope calculations have already been done, we need only alter one
line of the example code in order to scale the instantaneous frequencies of our new
window (as well as give us altered phase values for our next target phase):

% Scale the per-hop phase advance by the stretching factor
inst_freqs = (omega + unwrapped_phase_diff).*(rs_hop/ra_hop);

Here the unwrapped per-hop phase advance for each bin is multiplied by the
ratio of the resynthesis hop size to the analysis hop size. If the resynthesis
hop size is greater than the analysis hop size (the ratio is greater than 1),
each frame's phase advance grows in proportion to the wider frame spacing,
keeping the per-sample phase slope (and hence the pitch) unchanged while the
overall output signal becomes longer. If the resynthesis hop size is smaller
than the analysis hop size (the ratio is less than 1), the phase advance
shrinks correspondingly and the overall output signal is shorter. One
important implementation note is that we must ensure, when pre-allocating
space for the output vector, to set its length to that of the original input
multiplied by our hop size ratio (our time stretching factor).
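The required output length follows directly from the frame bookkeeping. The
helper below is a Python sketch, not the report's code; the name
stretched_length is invented, and the frame count mirrors a loop of the form
n = 1:ra_hop:length(x)-win:

```python
def stretched_length(input_len, win, ra_hop, rs_hop):
    """Output samples needed when frames taken every ra_hop samples
    are laid back down every rs_hop samples (illustrative sketch)."""
    n_frames = (input_len - win - 1) // ra_hop + 1  # frames the loop yields
    return (n_frames - 1) * rs_hop + win            # last frame start + window
```

Multiplying the input length by the hop ratio gives a rough figure, but the
exact count above is safer, since the final window extends one full window
length past the last frame start.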

Pitch shifting is very similar to time stretching (in fact, we achieved a
simplistic type of pitch shifting earlier when we simply resampled our input
signal without scaling the instantaneous frequencies). A good pitch-shifting
algorithm should be able to maintain the harmonic relationships present within
a complex signal after the pitch has been shifted, and should also leave the
signal's length untouched.
One surprisingly robust algorithm for pitch shifting a signal while
maintaining its original length is to resample each segment in the frequency
domain by simply altering the length of the FFT window before taking the
inverse transform during resynthesis (thus changing the IFFT resolution).
Because the resampling step is done during the IFFT, we don't want to resample
again during overlap-add (essentially un-resampling the segment), so we have
to change our overlap-add index to be our analysis hop size index (n in our
example) rather than the resynthesis index (m). The code below shows the
pertinent alterations to our example (starting right before the IFFT
calculation in the synthesis stage):

% Create a vector with the proper resampling length as given by the
% resampling factor (zero pad if necessary)
if( ra_hop <= rs_hop )

    % Only take the first samples of the segment when the factor
    % is less than 1
    Y_rcos_shft = Y_rcos_shft(1:ceil(win*(ra_hop/rs_hop)));

    % Also set an output window function with the proper length
    out_rcos_win = rcos_win(1:length(Y_rcos_shft));

elseif( ra_hop > rs_hop )

    % Zero pad the segment by pad_len samples when the factor is
    % greater than 1
    pad_len = ceil(win*(ra_hop/rs_hop)) - win;
    Y_rcos_shft = [Y_rcos_shft, zeros(1, pad_len)];

    % In this case, double the window function length
    out_rcos_win = [rcos_win, rcos_win];

end

% Find the time domain output (still shifted) using ifft
y_rcos_shft = ifft(Y_rcos_shft);

% "Unshift" the segment
y_rcos = circshift(y_rcos_shft, [0, -win/2]);

% Multiply by the cosine window again for output smoothing
y_i = y_rcos .* out_rcos_win(1:length(y_rcos));

% Get the length value
len = length(y_i);

% Overlap add the output using the analysis hop size for the
% different length segment, thereby resampling the signal
y(n:n+len-1) = y(n:n+len-1) + y_i;

Notice that because our new output segment is a different length than the input segment,
we have to create a new cosine window variable of the appropriate length for output
windowing. It is also important to note that in this case, unlike previously, our len
variable is going to differ from the original window size, which makes it very useful
for indexing during the overlap-add stage (because we don't know the correct length
ahead of time).
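The spectral resampling step can be re-sketched in Python/NumPy (a sketch mirroring the Matlab snippet above, not the author's code; the function name is my own):

```python
import numpy as np

def resample_segment_spectrally(spectrum, ra_hop, rs_hop):
    """Change a segment's length by altering the IFFT size, as in the
    Matlab snippet: truncate the spectrum when ra_hop/rs_hop <= 1,
    append zeros when it is > 1, then take the inverse FFT."""
    win = len(spectrum)
    out_len = int(np.ceil(win * ra_hop / rs_hop))
    if out_len <= win:
        resized = spectrum[:out_len]                # keep only the first bins
    else:
        pad = np.zeros(out_len - win, dtype=complex)
        resized = np.concatenate([spectrum, pad])   # zero pad the spectrum
    return np.fft.ifft(resized)                     # time segment of new length
```

Overlap-adding these re-lengthened segments at the analysis hop then performs the resampling, exactly as described above.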

The time stretching program (time_stretch.m) was tested using Toms_diner.wav as an
input file. The program was run with a window size of 2048 and an analysis hop size of
256. A sped up output signal with scale factor 0.3 (time_stretch_fast_0.3.wav) and a
slowed down output signal with scale factor 1.7 (time_stretch_slow_1.7.wav) were
created and can be listened to in Appendix B, along with the original audio. The software
can be found in Appendix A.
The pitch shifting program (pitch_shift.m) was tested using moore_guitar.wav as an input
file. The program was run with a window size of 2048 and an analysis hop size of 256.
An output signal shifted up by a factor of 1.5 (pitch_shift_up_1.5.wav) and an output
signal shifted down by a factor of 0.5 (pitch_shift_down_0.5.wav) were created and can
be listened to in Appendix B, along with the original audio. The software can be found in
Appendix A.

5.2 Transient and Stable Component Separation

Separating audio into stable (constant frequency) and transient (changing frequency)
components is a difficult audio effect to achieve. Using a phase vocoder, we can attempt
to determine which frequency bins of each FFT segment are stable through time and
which are transient by looking at the differences between phase bins through time. We
are already calculating these difference vectors, so we simply need to determine whether
the value in each bin of the difference vector is within a stable threshold or not. If it is
outside the threshold we can assume that the instantaneous frequency is changing very
rapidly and is therefore not stable (and vice versa). To make the algorithm more robust,
we can use the following test, which considers the previous two FFT phase
difference vectors in its calculation:

    | phi_i(k) - 2*phi_(i-1)(k) + phi_(i-2)(k) | < df

where phi_i(k) denotes the unwrapped phase difference (instantaneous frequency) in bin k
of the i-th frame. Because our range, df, extends both in front of and behind the target
phase (if you consider the range as an angular phase range on the unit circle), we must
take the absolute value of the left-hand side when testing whether or not to keep the
frequency bin. If we are not keeping the frequency bin, we zero the value for that bin
before reconstructing the FFT segment. Using this strategy we can keep either the stable
or the transient FFT bins, so we simply need to set a flag at the beginning to decide
which component to keep.
The following program snippets show the code involved in stable and transient
component separation:

% Create an angle range value for determining whether to keep freq bins
df = 0.0008;

% Debugging: numzeros = 0;

for n = 1:ra_hop:length(x)-win+1  % main analysis loop (header reconstructed)

    % Get the new instantaneous frequency values
    inst_freqs = (omega + unwrapped_phase_diff)./ra_hop;

    if(n > 1+win) % (Third pass and on)

        % Check the separation flag to see which components to keep
        if(sep_flag == 1)

            % Before reconstructing the signal, dispose of any
            % unstable frequency bins (those with phase difference
            % values not in the df range)
            for k = 1:length(inst_freqs)

                % Calculate the (positive) difference between the
                % current phase window and the previous two windows
                if(abs(inst_freqs(k) - 2*last_inst(k) + two_back(k)) > df)
                    % Debugging: numzeros = numzeros + 1;
                    Y_rcos_shft(k) = 0;
                end
            end
        else

            % Before reconstructing the signal, dispose of any
            % stable frequency bins (those with phase difference
            % values within the df range)
            for k = 1:length(inst_freqs)

                % Calculate the (positive) difference between the
                % current phase window and the previous two windows
                if(abs(inst_freqs(k) - 2*last_inst(k) + two_back(k)) < df)
                    % Debugging: numzeros = numzeros + 1;
                    Y_rcos_shft(k) = 0;
                end
            end
        end
    end

    % Set the updated target_phase and last_inst values as well
    % as the value for the "two-back" phases
    target_phase = phase + omega;
    two_back = last_phase_diff;
    last_inst = inst_freqs;

The biggest obstacle encountered in implementing the component separation algorithm
was finding a reasonable phase range value, df. This is somewhat dependent on other
audio factors, so it may vary from signal to signal. A good way to test how well it is
working is to set a counter variable (see the commented debugging code in the example)
that records how many bins are being set to zero for each operation (stable vs.
transient). Given the signal being separated, the ratio of zeroed bins for stable and
transient frequencies should give a good indication of how appropriate the range value
is. Finally, note that because this algorithm uses the previous two phase vectors in the
range calculation, we should not start zeroing bins until the third for-loop pass (using
regular hop sizes, the amount of signal left unprocessed should be negligible).
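The stability test above can be re-sketched in Python/NumPy as a vectorized mask (a sketch, not the author's Matlab; the function name is my own, and the default df is the report's value):

```python
import numpy as np

def stable_bin_mask(inst, last_inst, two_back, df=0.0008):
    """True for bins whose instantaneous frequency is 'stable': the
    second difference across the last three frames stays within +/- df.
    The df default comes from the report and may need tuning per signal."""
    return np.abs(inst - 2.0 * last_inst + two_back) <= df
```

Keeping the stable components then amounts to zeroing the bins where the mask is False, and vice versa for the transient components.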

The component separation program (com_sep.m) was tested using noisy_flute.wav and
moore_guitar.wav as input files. The program was run with a window size of 2048 and a
hop size of 256. Stable output files (com_sep_stable_flute.wav,
com_sep_stable_guitar.wav) and transient output files (com_sep_trans_flute.wav,
com_sep_trans_guitar.wav) for both input files were created and can be listened to in
Appendix B, along with the original audio. The software can be found in Appendix A.

5.3 Robotization and Whisperization

Robotization and whisperization are simple effects to implement. Both effects take the
phase vector of each FFT segment and replace its values with new ones. In the case of
robotization, all phase values are simply set to zero. Each zero-phase segment becomes a
pulse-like burst, and overlap-adding these bursts produces a pulse train, so the
perceived pitch of the output is actually determined by the hop size (changing the hop
size gives a different robotization frequency). The example code is shown below:

% Take the fft of the windowed input
Y_rcos_shft = fft(x_rcos_shft);

% Store the magnitude of this fft window
mag = abs(Y_rcos_shft);
fft_mag_matrix(floor(n/ra_hop)+1, 1:win) = mag;

% Set all phase values to zero to create the robotization effect
phase = zeros(1, length(Y_rcos_shft));
fft_phase_matrix(floor(n/ra_hop)+1, 1:win) = phase;

% Recombine the fft with the zeroed phase values
Y_rcos_shft = mag .* exp(j*phase);

By effectively removing the natural phase information, we make an audio signal
(especially a vocal signal) sound robotic.

For whisperization, we set all of the phase values to random values within the range
[0, 2π]. This completely removes all useful frequency information, so that only the
magnitude values remain. Because of this, we still hear the amplitude envelope of the
waveform, but there are no distinct pitches, so the result sounds like whispering. The
code for setting the new phase values is shown below:

% Set the phase to random values within the range 0-2pi to create
% the whisperization effect
phase = 2*pi*rand(1, length(Y_rcos_shft));
fft_phase_matrix(floor(n/ra_hop)+1, 1:win) = phase;
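Both phase substitutions can be re-sketched per frame in Python/NumPy (a sketch of the same magnitude/phase recombination, not the author's Matlab; function names are my own):

```python
import numpy as np

def robotize_frame(spectrum):
    # keep the magnitudes, force every phase to zero
    mag = np.abs(spectrum)
    return mag * np.exp(1j * np.zeros(len(spectrum)))

def whisperize_frame(spectrum, rng=None):
    # keep the magnitudes, draw every phase uniformly from [0, 2*pi)
    rng = np.random.default_rng() if rng is None else rng
    phase = 2.0 * np.pi * rng.random(len(spectrum))
    return np.abs(spectrum) * np.exp(1j * phase)
```

In both cases the bin magnitudes are untouched; only the phase vector is replaced, exactly as in the Matlab snippets.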

The robotization program (robotization.m) was tested using claire_oubli_voix.wav as an
input file. The program was run with a window size of 2048 and a hop size of 256. A
robotized output signal (robotization.wav) was created and can be listened to in Appendix
B, along with the original audio. The software can be found in Appendix A.

The whisperization program (whisperization.m) was tested using claire_oubli_flute.wav
as an input file. The program was run with a window size of 64 and a hop size of 8. A
whisperized output signal (whisperization.wav) was created and can be listened to in
Appendix B, along with the original audio. The software can be found in Appendix A.

5.4 Wah-Wah Filter

Because the phase vocoder obtains the frequency domain representation of an entire input
signal, we have the ability to do simplified convolution of the signal. Recall that
convolution in the time domain is equivalent to multiplication in the frequency domain
(a much simpler operation), so any convolution-based filter can be simplified by using a
phase vocoder. We simply need to specify a transfer function H(z) to multiply with the
input signal FFT X(z). A wah-wah filter (a second-order bandpass filter that has an
oscillating center frequency) is described by the following transfer function:

    H(z) = (1/2) * [1 - A(z)]

where A(z) is a second-order allpass filter described by:

    A(z) = (-c + d(1-c)z^-1 + z^-2) / (1 + d(1-c)z^-1 - c*z^-2)

where z is a complex exponential, and c and d are defined:

    c = (tan(pi*fb/fs) - 1) / (tan(pi*fb/fs) + 1)
    d = -cos(2*pi*fc/fs)

where fc is the oscillated center frequency and fb is the frequency bandwidth. Once this
transfer function is calculated, the input FFT segment is piecewise multiplied with it to
obtain the filtered output segment. The Matlab implementation looks like:

% Set depth and speed values for the "wah" excursion and frequency
depth = 0.45;
speed = 0.8;

for n = 1:ra_hop:length(x)-win+1  % main analysis loop (header reconstructed)

    % Create an oscillator for the center frequency
    osc = fc*(1 + depth*cos(2*pi*speed*n/fs));

    % Create the two filter coefficient values (d uses the osc value)
    c = ( tan(pi*fb/fs) - 1 ) / ( tan(pi*fb/fs) + 1 );
    d = -cos(2*pi*osc/fs);

    % Get a complex exponential z value
    w = 2*pi*(0:win-1)/win;
    z = exp(w*j);

    % Set the second order allpass filter transfer function
    A_z = (-c + d*(1-c)*(z.^(-1)) + (z.^(-2))) ./ (1 + d*(1-c)*(z.^(-1)) - c*(z.^(-2)));

    % Set the second order bandpass filter transfer function using A_z
    H_z = (1/2) * (1 - A_z);

    % Filter this frequency domain segment
    Y_rcos_shft = X_rcos_shft .* H_z;

The trickiest parts of the code arise in creating the oscillator for the center frequency
and determining which values should be piecewise multiplied (using the Matlab dot
operator) and which should not (this took some trial and error to get correct). The
oscillator, which replaces the fc value in the calculation of the d coefficient, uses two
values, depth and speed, which correspond to the excursion and frequency of the
oscillations, respectively. Playing around with these variables allows us to get a better
feel for appropriate wah values.
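The transfer-function evaluation can be re-sketched in Python/NumPy (a sketch, not the author's Matlab; the function name is my own, and the coefficient c uses tan(pi*fb/fs) in both numerator and denominator, per the standard allpass formula):

```python
import numpy as np

def wah_transfer(win, fs, fb, fc):
    """Evaluate the bandpass H = (1 - A)/2 on win points of the unit
    circle, where A is the second-order allpass with coefficients c, d."""
    c = (np.tan(np.pi * fb / fs) - 1.0) / (np.tan(np.pi * fb / fs) + 1.0)
    d = -np.cos(2.0 * np.pi * fc / fs)
    w = 2.0 * np.pi * np.arange(win) / win
    z1 = np.exp(-1j * w)                      # z^-1 on the unit circle
    z2 = z1 * z1                              # z^-2
    A = (-c + d * (1 - c) * z1 + z2) / (1.0 + d * (1 - c) * z1 - c * z2)
    return 0.5 * (1.0 - A)
```

A useful sanity check is that |1 - 2H| = |A| = 1 everywhere (the underlying filter really is allpass), and H vanishes at DC, confirming the bandpass shape.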

The wah-wah filter program (wah_filter.m) was tested using flute2.wav and
Toms_diner.wav as input files. The program was run with a window size of 2048 and a
hop size of 256. A faster wah output with higher speed and depth variables
(wah_tom.wav) and a slower wah output with lower speed and depth variables
(wah_flute.wav) were created and can be listened to in Appendix B, along with the
original audio. The software can be found in Appendix A.

5.5 Denoising and Audio Compression

One novel use for the phase vocoder is to treat it as a noise gate. Because noisy parts of
signals will usually have low frequency magnitudes, we can remove them by setting a
threshold value and zeroing all FFT bins with magnitude values below this threshold.
Because we want to set the threshold dynamically based on the maximum frequency of
each segment (because different signals will be recorded at different volumes), we use the
following formula to determine the threshold:

    NT = r / (r - c)

where r is the maximum frequency magnitude for the current FFT segment and c is a
predetermined coefficient that will alter the amount of noise being reduced. The code is
implemented as follows:

% Set a threshold coefficient value for helping determine which
% frequency bins are removed from the output
c = 0.1;

for n = 1:ra_hop:length(x)-win+1  % main analysis loop (header reconstructed)

    % Take the fft of the windowed input
    Y_rcos_shft = fft(x_rcos_shft);

    % Create a noise gate threshold from the maximum magnitude value
    r = max(abs(Y_rcos_shft));
    NT = r/(r-c);

    % Zero frequency bins with magnitudes below the noise gate
    % threshold
    Y_rcos_shft(abs(Y_rcos_shft)<NT) = 0;
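The gate above can be re-sketched per frame in Python/NumPy (a sketch, not the author's Matlab; the function name is my own, and the threshold formula and default c are taken directly from the snippet, which assumes r > c):

```python
import numpy as np

def noise_gate(spectrum, c=0.1):
    """Zero bins below the frame-dependent threshold NT = r/(r - c),
    where r is the frame's maximum magnitude."""
    r = np.max(np.abs(spectrum))
    nt = r / (r - c)
    gated = spectrum.copy()
    gated[np.abs(gated) < nt] = 0   # suppress low-magnitude (noisy) bins
    return gated
```

Because the threshold is recomputed from each frame's maximum, the gate adapts to recordings made at different levels, as noted above.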

One side effect of removing this frequency content is that we are actually reducing the
amount of information needed to represent the output signal. Expanding on this idea, we
can start to think of the phase vocoder as an audio compression tool. Perceptual coding
theory tells us that some frequency content, in the presence of louder frequencies in
the same range, is masked to the human ear. By removing this unperceived frequency
information we can create smaller audio files without losing perceived sound quality. By
modifying the phase vocoder to retain only the strongest N frequency bins (based on
magnitude) of each segment, we can build a rudimentary audio compressor. The code for
achieving this is shown:
% Take the fft of the windowed input
Y_rcos_shft = fft(x_rcos_shft);

% Initialize an array to store the N maximum frequency values
val_array = zeros(1, N);

% Find the N maximum frequency values and store both the value and
% the index at which it was found
for k = 1:N

    % Get the max magnitude index
    [~, max_k] = max(abs(Y_rcos_shft));

    % Store the complex bin value (magnitude and phase), then set the
    % bin at the max index to -1 (temporarily). This allows us to find
    % the "next" max value on the following iteration
    val_array(k) = Y_rcos_shft(max_k);
    Y_rcos_shft(max_k) = -1;

end

% Initialize a max value array index variable
mv_index = 1;

% Restore the values in bins denoted by -1 to their initial (maximal)
% values and set all other frequency bins to zero.
for k = 1:win

    if(Y_rcos_shft(k) == -1)
        % Restore the initial values to the strongest N bins
        Y_rcos_shft(k) = val_array(mv_index);
        mv_index = mv_index + 1;
    else
        % Set all other frequency bins to zero
        Y_rcos_shft(k) = 0;
    end

end

The obvious limitation of this code is its efficiency: the greater the N value, the more
processing is necessary. Small N values (i.e. N = 1) provide a rather interesting audio
effect (it sounds as though someone is tracing the signal), and larger values give us a
feel for how much compression is acceptable for a given signal.
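The keep-strongest-N step can be re-sketched as a vectorized one-liner in Python/NumPy, which also sidesteps the efficiency concern above (a sketch, not the author's Matlab; the function name is my own):

```python
import numpy as np

def keep_strongest_bins(spectrum, n):
    """Zero all but the n largest-magnitude bins, keeping the complex
    values (magnitude and phase) of the survivors intact."""
    out = np.zeros_like(spectrum)
    idx = np.argsort(np.abs(spectrum))[-n:]   # indices of the n strongest bins
    out[idx] = spectrum[idx]
    return out
```

Sorting once per frame replaces the repeated max-and-mark passes of the loop version while producing the same set of retained bins.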

The denoising program (denoising.m) was tested using noisy_flute.wav as an input file.
The program was run with a window size of 2048, a hop size of 256, and a threshold
coefficient of c = 0.1. A denoised output (denoising.wav) was created and can be listened
to in Appendix B, along with the original audio. The software can be found in Appendix
A.

The audio compression program (compression.m) was tested using Toms_diner.wav and
moore_guitar.wav as input files. The program was run with a window size of 2048 and a
hop size of 256. Output signals for N values of 256, 16, 8, and 1 (compression_1.wav,
compression_8.wav, compression_16.wav, compression_256.wav) were created for
Toms_diner.wav and an output signal for an N value of 1 (compression_guitar_1.wav)
was created for moore_guitar.wav. These outputs, along with the original audio can be
listened to in Appendix B. The software can be found in Appendix A.

6 Conclusions

Implementing a phase vocoder can be a daunting task. Dividing the implementation into
distinct conceptual stages helps us keep track of the various components as more and
more steps are added. Having a good guide handy also helps, and working to fully
understand the motivation behind each step goes a long way toward easing the pain of
debugging when things go wrong (which they will). The phase vocoder is one of the most
useful and essential audio processing tools, and understanding its inner workings gives
us a much better understanding of the entire field of digital signal processing.
Although the road may be long and arduous, the end result is worth the journey.

7 References

[1] Zölzer, Udo (ed.), DAFX: Digital Audio Effects, John Wiley & Sons (2002)

[2] De Götzen, Amalia, Bernardini, Nicola, and Arfib, Daniel, "Traditional (?)
Implementations of a Phase Vocoder: The Tricks of the Trade," in Proc. COST G-6
Conference on Digital Audio Effects (DAFX-00), Verona, Italy (2000)

[3] Stanford Exploration Project: Instantaneous frequency,
http://sepwww.stanford.edu/sep/prof/pvi/spec/paper_html/node7.html (1998)
