SwissQual License AG
Allmendweg 8 CH-4528 Zuchwil Switzerland
t +41 32 686 65 65 f +41 32 686 65 66 e info@swissqual.com
www.swissqual.com
Part Number: 12-070-200912-4
SwissQual has made every effort to ensure that eventual instructions contained in the document are adequate and free
of errors and omissions. SwissQual will, if necessary, explain issues which may not be covered by the documents.
SwissQual's liability for any errors in the documents is limited to the correction of errors and the aforementioned advisory
services.
Copyright 2000 - 2012 SwissQual AG. All rights reserved.
No part of this publication may be copied, distributed, transmitted, transcribed, stored in a retrieval system, or translated
into any human or computer language without the prior written permission of SwissQual AG.
Confidential materials.
All information in this document is regarded as commercially valuable, protected and privileged intellectual property, and is
provided under the terms of existing Non-Disclosure Agreements or as commercial-in-confidence material.
When you refer to a SwissQual technology or product, you must acknowledge the respective text or logo trademark
somewhere in your text.
SwissQual, Seven.Five, SQuad, QualiPoc, NetQual, VQuad, Diversity as well as the following logos are
registered trademarks of SwissQual AG.
Diversity Explorer, Diversity Ranger, Diversity Unattended, NiNA+, NiNA, NQAgent, NQComm, NQDI,
NQTM, NQView, NQWeb, QPControl, QPView, QualiPoc Freerider, QualiPoc iQ, QualiPoc Mobile,
QualiPoc Static, QualiWatch-M, QualiWatch-S, SystemInspector, TestManager, VMon, VQuad-HD are
trademarks of SwissQual AG.
SwissQual acknowledges the following trademarks for company names and products:
Adobe, Adobe Acrobat, and Adobe Postscript are trademarks of Adobe Systems Incorporated.
Apple is a trademark of Apple Computer, Inc.
DIMENSION, LATITUDE, and OPTIPLEX are registered trademarks of Dell Inc.
ELEKTROBIT is a registered trademark of Elektrobit Group Plc.
Google is a registered trademark of Google Inc.
Intel, Intel Itanium, Intel Pentium, and Intel Xeon are trademarks or registered trademarks of Intel Corporation.
INTERNET EXPLORER, SMARTPHONE, TABLET are registered trademarks of Microsoft Corporation.
Java is a U.S. trademark of Sun Microsystems, Inc.
Linux is a registered trademark of Linus Torvalds.
Microsoft, Microsoft Windows, Microsoft Windows NT, and Windows Vista are either registered trademarks or
trademarks of Microsoft Corporation in the United States and/or other countries.
NOKIA is a registered trademark of Nokia Corporation.
Oracle is a registered US trademark of Oracle Corporation, Redwood City, California.
SAMSUNG is a registered trademark of Samsung Corporation.
SIERRA WIRELESS is a registered trademark of Sierra Wireless, Inc.
TRIMBLE is a registered trademark of Trimble Navigation Limited.
U-BLOX is a registered trademark of u-blox Holding AG.
UNIX is a registered trademark of The Open Group.
Contents
Voice Quality with ITU-T P.863 POLQA
Conclusion
CONFIDENTIAL MATERIALS
Figures
Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks
Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model
Figure 3: Basic scheme of the main components of P.863 POLQA
Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts
Figure 5: Illustration of assigned signal parts and the optimal path of signal correspondences
Figure 6: Example of an aligned pair of reference and degraded signal
Figure 7: Block-scheme of POLQA as in ITU-T P.863
Figure 8: Application of masking slopes to the Bark spectrum
Figure 9: Consideration of fully and partially masked spectral parts
Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking
Figure 11: Insertion and capturing in a speech test setup
Figure 12: IRS in send and receive direction as specified in ITU-T P.48
Figure 13: P.863 POLQA narrowband main result representation in NQDI
Figure 14: P.863 POLQA narrowband detail result representation in NQDI
Figure 15: P.863 POLQA test selection in NQDI
Figure 16: P.863 POLQA statistical report in MS EXCEL
Figure 17: P.863 POLQA wideband main result representation in NQDI
Figure 18: P.863 POLQA wideband audio bandwidth representation in NQDI
Figure 19: P.863 POLQA and P.862.1 PESQ presentation in NQDI
Figure 20: P.863 POLQA and P.862.1 PESQ presentation in NQDI with signal interruptions
Figure 21: Distribution of predicted MOS scores by P.862.1 PESQ
Figure 22: Distribution of predicted MOS scores by P.863 POLQA
Figure 23: Distribution of predicted MOS scores by P.863 POLQA SWB mode and NB mode
Figure 24: Distribution of predicted MOS scores by P.863 POLQA SWB mode in wideband networks
Tables
Table 1: Improvement in performance of P.863 POLQA to P.862 PESQ
Table 2: Typical predicted MOS-LQ values for common transmission techniques
Table 3: Typical P.863 POLQA scores for common transmission techniques
Table 4: Comparison of P.862.1 PESQ scores to P.863 POLQA in high qualitative UMTS/GSM setups
Table 5: Comparison of P.862.1 PESQ scores to P.863 POLQA in common real field setups
Table 6: Comparison of different speech samples in common real field setups
Table 7: Comparison of the NB and SWB mode of P.863 POLQA in common real field setups
SwissQual has been driving the development of new objective perceptual quality prediction algorithms since
it was founded in 2000. Immediately after SwissQual's foundation, its voice quality predictor SQuad was
developed specifically to meet the requirements of mobile and Voice-over-IP scenarios. SQuad still forms the
backbone of SwissQual's entire voice quality suite to this day. From the beginning it overcame
disadvantages of ITU-T P.862 PESQ in these application areas. To keep up with the latest
advancements in network and processing technologies, SQuad has been continuously maintained and improved
over the years to deliver precise quality scores to the customer.
As early as 2005, ITU-T started a project for the standardization of a new objective voice quality model. This
project, called P.OLQA, was intended to extend the scope of the existing ITU-T P.862 PESQ and to overcome
the disadvantages and known problems of PESQ.
The P.OLQA project was finalized in 2010 by a competition between six candidate models, including the
latest SQuad algorithm. In a detailed analysis based on more than 45000 speech files, the SQuad algorithm
was selected as one winning model passing the challenging thresholds set by ITU-T.
Together with the two other selected models from Opticom and TNO, SQuad was integrated into a Joint
Model POLQA that combines the strengths of the three underlying algorithms and now forms the new ITU-T
P.863 POLQA approved in January 2011.
SwissQual is one of the most active drivers for the development of objective measures in international
standardization bodies. As a consequence SwissQual leads the corresponding working group at ITU-T and
both initiated and set several standards over the last years such as ITU-T P.563 (a no-reference voice quality
measure), ITU-T J.341 (a full-reference measure for HDTV), ETSI TR 102 506 (a method for an Estimation
of Quality per Call) and now the brand-new ITU-T P.863 POLQA.
POLQA is becoming an integral and central part of SwissQual's voice quality analysis suite and will be the
recommended voice quality predictor for both narrowband and wideband speech.
The existing and widely introduced SQuad algorithm remains a part of Diversity and can still be used if
desired. SQuad can still be combined with the previous ITU-T P.862 PESQ. This gives all customers the
possibility to continue their ongoing measurement campaigns and to plan a transition to ITU-T
P.863 POLQA on their own schedule.
speech frames. All of this considerably changes the physical signal, without necessarily affecting its
qualitative perception. The correct rating of these types of signal distortions is a clear shortcoming of PESQ
and is now solved by POLQA.
[Figure 1: Application of a full-reference psycho-acoustic model in telecommunication networks. Labels: high quality speech signal, transmission channel, full reference measurement, copy of high quality speech signal.]
The basic approach is the same as that used by other measures such as SQuad. It compares the
received and potentially degraded signal with an undistorted reference signal. This allows a very detailed
and fine analysis of any kind of difference between the two signals. To take human perception into account, a
model of the listening device (e.g. a handset or a headphone) is applied first. That way, the analysis uses the
signal exactly as it would be heard through such a device.
[Figure 2: Scheme of a full-reference psycho-acoustic motivated speech quality model. The reference signal and the distorted signal each pass a model of the device (i.e. handset) and a psycho-acoustic model (frequency and intensity warping, masking); a distance / similarity stage and a cognitive model then yield MOS-LQO.]
The more important step, however, is the application of a psycho-acoustic model that transforms the signal
into an internal sound representation, taking frequency and intensity warping and masking
effects into account. In this internal sound representation, the differences between the degraded and reference
speech are calculated. These correspond to the differences that would be perceptible in a direct subjective
comparison. Since speech perception and recognition is more than just listening to sound stimuli, a cognitive
model is the last step of the quality prediction. Here individual distortions are weighted according to speech
perception. For example, in the case of the human voice, a listener is more tolerant of certain distortion types as
long as they can be considered natural, even if they differ significantly from the reference signal.
POLQA predicts the voice quality as it is perceived in an ITU-T P.800 subjective listening-only test (LOT).
These tests are the most widely used listening tests in telecommunications. A listener scores the quality of a
presented voice sample on a scale from 1 (bad) to 5 (excellent), the so-called Absolute Category Rating (ACR) scale. The
listener does not compare the signal directly to a reference; he or she compares the signal to an internal reference,
i.e. his or her expectation of how it should sound if it were perfect.
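The way individual ACR votes turn into a mean opinion score can be illustrated with a minimal sketch. The function name and the normal-approximation confidence interval are assumptions made for illustration; they are not taken from ITU-T P.800 or from the SwissQual tools.

```python
import math
import statistics

def mos_with_ci95(votes):
    """Return (MOS, 95% confidence half-width) for ACR votes on the 1..5 scale.

    Hypothetical helper for illustration only.
    """
    if len(votes) < 2:
        raise ValueError("at least two votes are needed for a confidence interval")
    if any(v not in (1, 2, 3, 4, 5) for v in votes):
        raise ValueError("ACR votes must be integers from 1 (bad) to 5 (excellent)")
    mos = statistics.mean(votes)
    # Normal approximation (assumption); a t-quantile is more exact for small panels.
    ci95 = 1.96 * statistics.stdev(votes) / math.sqrt(len(votes))
    return mos, ci95

mos, ci95 = mos_with_ci95([4, 4, 5, 3, 4, 4, 3, 5])
print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```

An objective model such as POLQA aims to predict the MOS value; the confidence interval indicates how uncertain the subjective score itself is.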
[Figure 3: Basic scheme of the main components of P.863 POLQA. Blocks: idealization, space / time alignment, perceptual model (for the ideal and for the degraded output), internal representation of the ideal, internal representation of the output, difference in internal representation, cognitive model, quality.]
Time Alignment
Why does POLQA perform time alignment?
POLQA and other objective measures following the same base structure compare the (spectral) short-term
characteristics of the reference signal and the degraded signal frame by frame. The alignment marks
corresponding sections in both signals. Only this way can the correct frames be compared to each other.
What makes it challenging?
Aligning two signals is simple if the delay between them is constant and the transmission is linear. In that case,
only an offset has to be compensated. Unsynchronized devices (clock drift) are more complicated: they lead
to a constantly increasing or decreasing delay. Here the compensation is not constant, but it at least changes
linearly over time. Even more challenging are processing components that transmit individual
parts of the signal with different delays. These can lead to stretched or compressed speech pauses, but also
to stretched or compressed speech parts. This stretching or compressing can either preserve the
pitch or simply warp the entire signal part.
In all these cases, each individual short frame of the degraded signal (usually 32ms in length) has to be
assigned to a corresponding frame in the reference signal.
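The simplest case described above, a constant delay, can be sketched with a plain cross-correlation. This is an illustration only and not the alignment procedure of P.863; the function name is hypothetical, and the search over the whole signal is an assumption that would be too slow and too fragile for real recordings with variable delay.

```python
import numpy as np

def estimate_constant_delay(reference, degraded):
    """Return the delay of `degraded` relative to `reference`, in samples.

    A positive value means the degraded signal arrives later.
    """
    corr = np.correlate(degraded, reference, mode="full")
    # Index len(reference) - 1 of the full correlation corresponds to zero lag.
    return int(np.argmax(corr)) - (len(reference) - 1)

# Toy example: noise delayed by 160 samples (20 ms at 8 kHz).
rng = np.random.default_rng(0)
ref = rng.standard_normal(2000)
deg = np.concatenate([np.zeros(160), ref])[:2000]
print(estimate_constant_delay(ref, deg))
```

For clock drift or time warping a single offset is no longer sufficient, which is why the delay has to be determined per signal part.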
How can it be done in a robust and fast way?
First, POLQA identifies signal parts where the delay can be assumed to be constant and flags them as
landmarks. These parts can be of different length; in the simplest case one single part covers the entire
signal (if there is a constant delay over the entire file).
[Figure 4: Basic flow of the so-called landmark approach for assigning corresponding signal parts. Labels: REFERENCE, PROCESSED, correspondence with confidence.]
In a second step, the areas between these landmarks are analyzed. To do so, the signal is sub-divided
further and further into a series of smaller parts. Each part has an assigned corresponding part in the other
signal.
Each assigned signal part is given a value that rates the confidence of the assignment. In less confident
areas a wider signal range is analyzed, whereas the assignment correspondences of parts with a high
confidence are considered as fixed.
This approach allows a very efficient and robust search structure since the search range becomes more and
more restricted as more landmarks are set. The result is a kind of matrix with corresponding signal parts and
associated search ranges.
Figure 5: Illustration of assigned signal parts and the optimal path of signal correspondences
A Viterbi-like algorithm then calculates the most likely path through this matrix and fixes the corresponding
signal parts.
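The idea of a most likely path through a matrix of candidate correspondences can be sketched with a small dynamic program. The data layout (a list of candidate delays with confidences per signal part), the linear jump penalty and the function name are assumptions for illustration; the actual cost function of P.863 is not described in this document.

```python
def best_delay_path(candidates, jump_penalty=0.01):
    """Pick one delay per signal part, Viterbi style.

    candidates: one list per signal part, each holding (delay_ms, confidence)
    tuples. The path maximizes summed confidence minus a penalty for delay
    jumps between neighbouring parts (hypothetical cost function).
    """
    # score[j]: best accumulated score of a path ending in candidate j.
    score = [conf for _, conf in candidates[0]]
    back = []
    for prev, cur in zip(candidates, candidates[1:]):
        new_score, pointers = [], []
        for delay, conf in cur:
            options = [score[i] - jump_penalty * abs(delay - prev[i][0])
                       for i in range(len(prev))]
            best = max(range(len(prev)), key=options.__getitem__)
            new_score.append(options[best] + conf)
            pointers.append(best)
        score = new_score
        back.append(pointers)
    # Backtrack from the best final candidate.
    j = max(range(len(score)), key=score.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[k][j][0] for k, j in enumerate(path)]

parts = [
    [(100, 0.9)],              # landmark: one confident candidate
    [(100, 0.5), (300, 0.6)],  # ambiguous part
    [(100, 0.9), (120, 0.2)],
]
print(best_delay_path(parts))
```

In this toy example a greedy choice per part would pick the 300 ms candidate for the middle part; the path search rejects it because the neighbouring parts make such a jump implausible.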
The end result of the time alignment step is a correspondence table with the start and end times of each
signal part and its correspondence in the reference. Parts of the degraded signal with no correspondence in
the reference (i.e. inserted or added parts), as well as parts of the reference signal that are missing in the
degraded signal, are marked as well. The following signal graph illustrates a practical example. The upper
graph shows the (complete) reference signal; the lower graph shows the received and degraded signal.
The green areas denote signal parts assigned with high confidence; the blue ones are those with lower
confidence. The red signal part indicates a part of the reference signal that was lost during transmission and
is no longer present in the degraded signal. Unassigned silent parts (white) are not used for direct
comparison but rather for an analysis of the annoyance of the noise floor in these parts.
Psycho-acoustic model
Like all models that follow the same basic approach as POLQA, the psycho-acoustic model starts
with a global level alignment followed by a frame-wise spectral analysis of overlapping frames. As is usual in
these models, a short-term level scaling is applied as well, and a cosine-based window followed by
an FFT converts the audio signal from the time domain to the spectral domain.
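The frame-wise spectral analysis can be sketched as follows. The 32 ms frame length is mentioned in the time alignment section; the Hann window, the 50% overlap and the function name are assumptions for illustration and not the exact P.863 parameters.

```python
import numpy as np

def frame_power_spectra(signal, fs, frame_ms=32.0, overlap=0.5):
    """Split a signal into overlapping frames and return their power spectra.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = max(1, int(frame_len * (1.0 - overlap)))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)  # one possible cosine-based window
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame)) ** 2
    return spectra

fs = 8000
t = np.arange(fs) / fs
spectra = frame_power_spectra(np.sin(2 * np.pi * 1000 * t), fs)
print(spectra.shape, spectra[0].argmax() * fs / 256)
```

At 8 kHz a 32 ms frame holds 256 samples, so each FFT bin is 31.25 Hz wide and the 1000 Hz test tone falls into bin 32.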
The block scheme of the POLQA psycho-acoustic model is shown in the figure below.
[Figure 7: Block-scheme of POLQA as in ITU-T P.863. Inputs: reference input and degraded output. Processing chain: scaling (reference towards the degraded level, idealization; degraded towards playback level), windowed FFT, frequency warping to pitch scale, frequency response compensation, masking, intensity warping to loudness scale, partial local and global scaling, noise suppression, perceptual subtraction, asymmetry processing, Lp time integration, resulting in the disturbance indicators D and Da (disturbances in speech). Additional indicators: FRQ (frequency response: spectral shaping, band limitation), NOI (noise estimation: stationary and switched noises), RVB (reverb: room reverberations). Cognitive model: combination of individual indicators, training on subjective reference scores, mapping into MOS scale. Result: predicted listening quality MOS-LQO.]
The basic approach of the psycho-acoustic model, which means the use of critical bands and the loudness
compression, looks similar to well-known state-of-the-art models.
However, there are three parts that make P.863 POLQA different from established standards such as
P.862 PESQ.
- Removal or reduction of individual distortion types and their separate consideration
- Idealization of the reference signal
- Sharpened loudness spectra
[Figure 8: Application of masking slopes to the Bark spectrum. Two panels, axes: Sone over Bark.]
The second step consists of analyzing which other parts of the spectrum these masking slopes mask, either fully or
partially (Figure 9):
[Figure 9: Consideration of fully and partially masked spectral parts. Two panels, axes: Sone over Bark; spectral parts labeled unmasked, partially masked and masked.]
In a third step the fully masked spectral parts are removed and the partially masked parts are reduced in
their loudness (Figure 10):
[Figure 10: Calculation of a modified Bark spectrum under consideration of spectral masking. Two panels, axes: Sone over Bark; spectral parts labeled unmasked, partially masked and masked.]
Finally, we get a loudness spectrum that represents the individual spectral parts as they contribute to
perception. This means that fully masked parts are taken out, while partially masked parts are attenuated.
These modified spectra of the reference and the degraded signal are then compared and differences are
considered as perceptible differences. The big advantage of the sharpened approach is the remaining high
resolution of the spectrum. It allows a high spectral resolution in the analysis, as required e.g. for a valid
qualitative assessment of the reproduction of fine spectral structures in upper bands by compression
algorithms.
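The three masking steps can be illustrated with a deliberately simplified toy model. It works on dB-like band levels instead of Sone values, uses one symmetric linear slope, and all numeric parameters and names are invented for illustration; it is not the masking model of P.863.

```python
import numpy as np

def apply_spectral_masking(levels_db, slope_db_per_band=10.0,
                           offset_db=5.0, partial_range_db=6.0, floor_db=0.0):
    """Toy masking: remove fully masked bands, attenuate partially masked ones."""
    levels = np.asarray(levels_db, dtype=float)
    bands = np.arange(len(levels))
    distance = np.abs(bands[:, None] - bands[None, :])
    # Step 1: slopes[i, j] is the masking level that band i casts onto band j.
    slopes = levels[:, None] - offset_db - slope_db_per_band * distance
    np.fill_diagonal(slopes, -np.inf)  # a band does not mask itself
    # Step 2: compare each band with the strongest slope reaching it.
    threshold = slopes.max(axis=0)
    margin = levels - threshold
    # Step 3: attenuate partially masked bands, remove fully masked bands.
    out = levels.copy()
    partially = (margin > 0) & (margin < partial_range_db)
    out[partially] -= partial_range_db - margin[partially]
    out[margin <= 0] = floor_db
    return out

print(apply_spectral_masking([60.0, 48.0, 20.0]))
```

In this example the strong first band leaves the second band only 3 dB above its slope, so that band is attenuated, while the weak third band lies below the slope and is removed.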
New types of speech codecs and codecs not yet used in telecommunications, e.g. audio codecs
Voice Quality Enhancement (VQE) systems, non-linear processing for increasing intelligibility
Re-sampling, time-warping
In addition the P.863 development should extend the scope of P.862 mainly by
Due to the wide scope of P.863, the development and evaluation required a huge amount of test data. Test
data here means speech samples covering this variety of degradations, scored by human listeners in defined
subjective experiments. In the end, a total of 62 subjectively scored data sets [1] containing more than
45000 voice samples were used for the evaluation of P.863 POLQA.
These data sets were used for calculating the prediction performance by means of residual square errors or
correlation coefficients. The residual square error (or, as used in the past, Pearson's correlation coefficient)
is the indicator for the accuracy of the objective measure; it is given by the remaining prediction error relative to the
true scores obtained in the subjective tests.
These values give an overview of the performance in general. However, the numbers actually reached depend
on the construction of the data set and the kind of conditions it contains. There are always test
conditions that a model can predict easily and accurately (e.g. noises, waveform codecs and
so on) and others where the deviation is higher (usually combinations of distortions). The occurrence of such
conditions in a data set has a strong influence on these figures. This is due not only to the objective
prediction method but also to uncertainties of the listeners in the auditory tests.
For the P.863 POLQA evaluation, ITU-T chose a statistical approach that is based on an r.m.s.e.
calculation but takes the uncertainty of the subjectively derived MOS values into account. Based on these
figures, the performance of P.863 POLQA was evaluated in comparison to P.862.1 and P.862.2 PESQ.
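An r.m.s.e. that takes the uncertainty of the subjective scores into account can be sketched as follows. The sketch assumes that prediction errors inside the 95% confidence interval of each subjective MOS are ignored and that a degrees-of-freedom correction `dof` is applied; the exact definition used in the ITU-T evaluation is not given in this document, so both the formula details and the function name are assumptions.

```python
import math

def rmse_star(mos_subjective, mos_predicted, ci95, dof=1):
    """Epsilon-insensitive r.m.s.e. (assumed form of rmse*).

    Only the part of each prediction error that exceeds the confidence
    interval of the subjective score contributes to the result.
    """
    if not (len(mos_subjective) == len(mos_predicted) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    n = len(mos_subjective)
    if n <= dof:
        raise ValueError("need more conditions than degrees of freedom")
    total = 0.0
    for s, p, c in zip(mos_subjective, mos_predicted, ci95):
        err = max(0.0, abs(s - p) - c)  # error beyond the listener uncertainty
        total += err * err
    return math.sqrt(total / (n - dof))

print(round(rmse_star([4.0, 3.0, 2.0], [4.1, 3.5, 1.0], [0.2, 0.2, 0.2]), 3))
```

In the example the first condition contributes nothing because its prediction error of 0.1 lies within the confidence interval of 0.2.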
Table 1: Improvement in performance of P.863 POLQA to P.862 PESQ

rmse*                Classical narrowband exp.   Advanced narrowband exp.   Wideband experiments
P.862.1 'PESQ'       0.157                       0.227                      -
Improvement by       22%                         32%                        -
P.862.2 'PESQ-WB'    -                           -                          0.345
Improvement by       -                           -                          57%
[1] A data set, often also called an experiment or database, is a set of speech files processed or transmitted under different
real field or simulated conditions and scored subjectively. A data set usually consists of about 200 individual speech
samples. The prediction accuracy is calculated by comparing the MOS scores given by the listeners with the
prediction of the objective measure, e.g. P.863 POLQA.
Chapter 2 | Technical Details of POLQA
The so-called classical set of narrowband experiments covers 22 data sets already used in ITU-T for
standardization efforts from the mid-1990s until about 2003. They contain common codec and noise
distortions, mobile channels of the 2nd and 3rd generation, as well as VoIP as it was state of the art at the
turn of the millennium. Even though these databases cover distortions that were already used during the development
of P.862 PESQ, the new method P.863 POLQA shows even higher prediction accuracy here.
The advanced set of narrowband experiments is more focused on the latest coding technologies, frame loss
concealment strategies, noise reduction and of course 3rd and 4th generation mobile as well as the newest
VoIP implementations. This set is based on 15 data sets. The improvement reached with the new method
P.863 POLQA is evident. This set covers a wide range of test conditions of the latest technologies, for which P.863
was designed.
Finally, there was a set of common wideband data as well. It covers 7 different data sets. Here the
improvement over P.862.2 PESQ-WB is extremely high.
[Figure 11: Insertion and capturing in a speech test setup. Labels: reference speech signal, -26 dB ovl (twice), model of microphone, electr. interface (four times), 79 dB(A) SPL, model of handset (twice), copy of reference speech signal, psychoacoustic model.]
Similarly, the sending direction is modeled in this narrow-band setup as well. The source speech signal
is inserted into an electrical interface, either a PSTN or ISDN line or the microphone input of a mobile
device. In reality, the signal has at this point already passed the microphone and some voice processing
components. To emulate this part of the signal path, a model of a typical narrowband microphone is applied. This
is called IRS send, since it models the device in sending direction. It can also be thought of as a weak
telephony band-pass with a quite strong pre-emphasis up to 3 kHz. This makes the speech sound a bit
sharp but gives higher intelligibility in background noise situations [2].
Figure 11 outlines the idea behind a narrow-band test. The modeled sending device allows a direct
electrical coupling to the channel under test and guarantees reproducible results independent of the
microphone actually used.
The frequency responses for the two filters modeling the device are given in Figure 12. It is clearly visible
that there is a bandwidth limitation to the telephony band, although a slightly wider band can pass than just
300 to 3400Hz.
[Figure 12: IRS in send and receive direction as specified in ITU-T P.48. Two panels, one titled "IRS send direction (ITU-T P.48)", each plotting a / dB (about -30 to 10) over f / Hz (0 to 4000).]
While defined level and impedance requirements exist for ISDN and PSTN interfaces and are fulfilled by the
interface devices, for mobile phones only the headset connector is available as a proprietary interface.
SwissQual's connector interface for mobile phones is adjusted for this type of interface. It applies the correct
level, adjusts the frequency response and matches the impedance of each individual phone type, and thus
provides a quasi-standard electrical network termination point even for mobile handsets.
[2] This characteristic is taken from older carbon microphones: the pre-emphasis was meant to compensate the low-pass
characteristic of the inductively loaded analogue lines of that time.
[3] The value of -26dB relates to an overload point of 32767/-32768 as is used in 16bit resolution in the digital signal
domain.
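The relation between the -26 dB level and the 16-bit overload point can be sketched as follows. The sketch uses a plain r.m.s. over the whole signal, which is a simplifying assumption; speech levels are commonly measured as active speech level, which ignores pauses. The function names are hypothetical.

```python
import numpy as np

OVERLOAD = 32768.0  # overload point of 16-bit samples

def level_dbov(samples_int16):
    """R.m.s. level relative to the 16-bit overload point, in dB."""
    x = np.asarray(samples_int16, dtype=np.float64) / OVERLOAD
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(rms)  # undefined for pure digital silence

def scale_to_level(samples_int16, target_dbov=-26.0):
    """Scale 16-bit samples so that their r.m.s. level reaches target_dbov."""
    gain = 10.0 ** ((target_dbov - level_dbov(samples_int16)) / 20.0)
    scaled = np.asarray(samples_int16, dtype=np.float64) * gain
    return np.clip(np.round(scaled), -32768, 32767).astype(np.int16)

fs = 8000
t = np.arange(fs) / fs
tone = np.round(10000 * np.sin(2 * np.pi * 1000 * t)).astype(np.int16)
print(round(level_dbov(tone), 2), round(level_dbov(scale_to_level(tone)), 2))
```

A level of -26 dB leaves enough headroom below the overload point for the peaks of typical speech.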
Chapter 3 | Narrow-band Voice Quality measurements with
P.863 POLQA' in Diversity
Table 2: Typical predicted MOS-LQ values for common transmission techniques

Condition                                            P.862.1        SQuad-LQ 08    P.863
                                                     (narrowband)   (narrowband)   (narrowband)
Linear distortions
Transparent transmission ~40 to ~3800 Hz             4.50           4.50           4.50
Transparent transmission ~180 to ~3500 Hz (G.712)    4.40           4.50           4.30
Transparent transmission ~200 to ~3500 Hz (IRSsend)  4.50           4.50           4.40
Transparent transmission 300 to 3400 Hz (box block)  4.10           4.30           3.60
IRSsend + G.711 (A-Law standard PCM)                 4.40           4.40           4.30
Codec conditions
(row label missing)                                  4.15           4.15           4.20
IRSsend + EFR (real loss-free connection)            4.10           4.15           4.10
(row label missing)                                  3.90           4.00           4.00
(row label missing)                                  3.75           3.90           3.90
(row label missing)                                  3.75           4.00           3.90
(row label missing)                                  3.90           4.00           3.95
(row label missing)                                  3.75           3.90           3.85
(row label missing)                                  3.40           3.70           3.65
ITU-T and 3GPP do not recommend the use of the P.862 family for EVRC-type codecs.
The codecs are used as reference SW implementations. In addition one EFR condition is shown as it
behaves in a real loss-free channel, using a commercial Nokia handset as access device to the network. The
channel was terminated by an ISDN card device running G.711 A-Law.
First, P.863 POLQA gives a very slightly more pessimistic prediction than SQuad08.
However, for practical use cases this absolute difference is negligible. Compared to P.862.1, the higher rates
of AMR match very well, even though the lower rates are scored higher by P.863. In addition, the EVRC type
codecs are scored higher and more realistically by P.863 and especially by SQuad08 than by P.862.1.
P.863 POLQA considers linear distortions and bandwidth limitations in its score. For super-wideband mode
this is obvious: there, a signal is always compared to a super-wideband reference (50 to 14000 Hz). It is
important to note that P.863 POLQA in narrow-band mode considers a full narrow-band signal (~50 to
3800 Hz) as reference. An IRSrcv filter is applied to this signal within P.863 POLQA itself. That means
limitations that narrow this bandwidth will lead to a predicted distortion. With P.863 POLQA, the actual channel
filters and band-pass characteristics in the microphone and loudspeaker path of the mobile phone used are
taken into account more than was the case for P.862 PESQ [5].
SwissQual's SQuad08 also considers linear distortion in narrow-band mode; however, it is less sensitive than
P.863 POLQA and is supposed to be less dependent on the phone actually used and its internal filtering.
SwissQual's speech quality suite offers two methods for predicting listening quality: the known SQuad08
and the new ITU-T P.863 POLQA. Both models may be combined with ITU-T P.862 PESQ as an option.
The entire framework as known from SQuad including the voice samples, the insertion and capturing
procedure and of course all of the additional signal analysis results are used and available for
P.863 POLQA in the same way.
To differentiate P.863 POLQA tests from SQuad and P.862 PESQ, the method actually used is given in
parentheses after the label Listening Quality. For immediate visual feedback, the POLQA logo is shown
right below the predicted MOS score.
[5] Since P.863 POLQA measures the actual spectral loss of the speech signal, the actual impact of band limitations
depends on the spectral power distribution of the speech sample. That means some samples are affected more and others
less by this filtering, depending on their spectral characteristics, e.g. by losing more or fewer high frequency parts.
In addition to the global values for the entire speech sample, graphs illustrate the quality profile over the
sample duration, the signal envelopes as well as the signal gain.
P.863 POLQA is treated as a separate method for listening quality measurements in NQDI. The test
selection tab sheet in NQDI can be used to select individual P.863 POLQA tests.
For reporting, the group of Voice reports in NQDI includes an LQ narrowband statistical report. It reports not
only the P.863 results but also the results of all other algorithms such as SQuad and P.862 PESQ in the
same table. The results for each algorithm are given in a separate column.
For narrow-band mode P.863 POLQA applies an IRS receive filter that emulates a narrow-band handset
(see: Figure 12: IRS in send and receive direction as specified in ITU-T P.48)
Chapter 4 | Wideband Voice Quality measurements with
P.863 POLQA' in Diversity
This is the difference to the narrow-band case. The recorded signal is compared relative to a
super-wideband reference. In the same way, the recorded signal is not post-filtered, which avoids any band
limitation and models a receiving HiFi headphone.
That means that in the case of a full-band audio channel (e.g. a VoIP connection using full audio bandwidth or an
application using MP3 with a sufficient bitrate, as in video or audio streaming), the recorded signal matches
the reference in its bandwidth. In the case of a common wideband or even a narrow-band channel or device, the
bandwidth becomes limited during transmission. When this signal is recorded and compared to the full
reference, the spectral loss is weighted as a degradation.
Of course the exploration of a wideband channel requires also the insertion of a signal with sufficient
bandwidth. To actually feed wideband signals into the channel, new voice samples were recorded. They are
without a perceptual bandwidth limitation and are stored at 32kHz sampling frequency in a separate
reference folder, Speech-Wideband or Speech-Wideband POLQA respectively. As usual, the samples are
constructed out of one sentence spoken by a male and one spoken by a female speaker and have a constant length of 6 s. This
preserves full continuity with the narrowband tests.
For the time being SwissQual provides samples in
German (German pronunciation)
German (Swiss pronunciation)
British English
Italian
Dutch
Each language sample is provided without any pre-filtering (except for a 50 to 14000 Hz band-pass) and is
named, e.g., GE_fm_wide.wav. As specified for wideband devices, the microphone path is considered flat in
the transmission band. This means that no IRS send filter, as required for narrow-band, is applied. The signal remains flat,
without any further band limitation and without any pre-emphasis as in the IRS.
Table 3: Typical P.863 POLQA scores for common transmission techniques in a wideband and a narrowband context
Columns: P.863 in super-wideband (50-14000 Hz); P.863 in narrowband (300-3400 Hz)
Values in extracted order (row labels and column assignment missing): 4.75, 4.3, 3.8, 3.6, 4.5, 3.5, 4.4, 3.0, 3.6, 3.5, 4.3, 3.2, 4.2, 3.0, 3.9, 3.0, 3.9, 2.9, 3.9
It can be seen that the rank order of the systems remains the same regardless of the test scenario. The upper
range of the wideband scale is used only for the high-quality wideband voice samples. The common
narrowband scenarios are compressed into the lower 60% of the scale and thus show a smaller gradient as
well.
In case of optimizing and benchmarking pure narrowband networks and applications, the common
narrowband test application can be used without any problems. The individual systems are more clearly
discriminated due to the wider scale range used.
For optimizing wideband applications and networks and especially for benchmarking of wideband networks
against narrowband ones, a wideband test application is required.
Firstly, the degradations in wideband mode can only be assessed in a wideband test application and
secondly, a wideband signal can only show its better quality against narrowband in wideband mode.
Note: Narrowband MOS-LQ values and wideband MOS-LQ values must never be mixed or directly
compared. They refer to different interpretations of the MOS scale.
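In a post-processing script, the rule in the note can be enforced by carrying the scale mode along with every score. The following is a minimal sketch; the `MosScore` type and the mode labels "NB" and "SWB" are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MosScore:
    value: float  # predicted MOS-LQ
    mode: str     # scale the score refers to, e.g. "NB" or "SWB" (assumed labels)


def mos_difference(a, b):
    """Return a.value - b.value, but only for scores on the same scale."""
    if a.mode != b.mode:
        raise ValueError(
            f"cannot compare {a.mode} and {b.mode} scores: different MOS scales"
        )
    return a.value - b.value
```

With this guard, an accidental comparison of a narrowband score against a super-wideband score fails loudly instead of producing a misleading difference.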
were considered in the huge training set for SQuad and P.863 POLQA.
The main focus of Diversity's wideband test solution is of course the evaluation and benchmarking of
wideband channels in cellular networks.
An additional application area for wideband voice testing in Diversity is video streaming. Video streaming
usually uses audio codecs; these do not have any bandwidth restriction, except in very low bitrate
conditions. Consequently, Speech Wideband as a test case has also been applied to video streaming, starting with
Release 10.2 of Diversity and completed in Release 11.0 with the full support of ITU-T P.863 POLQA.
The application type (highlighted in red) explains the modeled listening situation in detail. In addition, since a
potential bandwidth reduction has a serious impact in a wideband scenario, the actual bandwidth of the
channel is measured and reported as well (highlighted in green). There are three classes:
narrowband (up to ~3800Hz)
wideband (up to 8000Hz)
super-wideband (up to 14000Hz).
The remaining values are the same as usual and well known for SQuad and are visible in narrowband tests
as well. They provide information about the speech level, noise floor, the amount of missed voice and the
gain applied by the channel.
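The three bandwidth classes listed above can be expressed as a simple lookup. This is a sketch only: the band edges are taken from the list, but treating them as hard cut-offs is an assumption, since the actual decision rule used by the measurement is not described here.

```python
# Upper band edges of the reported classes, taken from the list above.
# Using them as hard cut-offs is an assumption made for this sketch.
NARROWBAND_MAX_HZ = 3800
WIDEBAND_MAX_HZ = 8000


def classify_bandwidth(upper_edge_hz):
    """Map a measured upper band edge in Hz to one of the three classes."""
    if upper_edge_hz <= NARROWBAND_MAX_HZ:
        return "narrowband"
    if upper_edge_hz <= WIDEBAND_MAX_HZ:
        return "wideband"
    return "super-wideband"
```

A channel measured up to almost 8000 Hz would be reported as wideband under this rule.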
The tab sheet Speech Details clearly shows the audio bandwidth of the measured audio channel, in this
case a common wideband channel up to almost 8000 Hz (Figure 18).
The lower and upper bounds are marked with blue lines. As is clearly visible, Diversity and ITU-T P.863 make
use of real super-wideband signals. The frequency scale here ends at 16000 Hz; this corresponds to an
internal sampling frequency of 32000 Hz.
One of the most important questions is how P.863 POLQA results relate to previous P.862 PESQ
measurements under real field conditions. Of course, P.862 PESQ and P.863 POLQA are different
algorithms and treat distortions in the signal differently. However, in the end the predicted MOS should
accurately describe the quality of the voice or of the voice channel. This means that in cases where P.862
PESQ delivered accurate predictions, the newer and improved P.863 POLQA should predict almost the
same value. For distortions where P.862 PESQ produced less accurate predictions, P.863 POLQA as
an improved method will predict more accurately and therefore differently from P.862 PESQ.
In real field measurements the channel consists of more than just a codec. Even under perfect radio
conditions there can be other factors that limit the maximum quality. These could be further bandwidth
limitations that are due to the actual device used, or further speech processing steps such as noise and gain
control that are applied in the device or in the network. There might also be trans-coding, i.e. a second
encoding/decoding step, for example in case of mobile-to-mobile connections or in special gateways from
the mobile core to PSTN networks. For these reasons, the MOS scores obtained in a plain codec emulation,
as given in Table 3, are usually only reached in real field cases where the device and the network can be
considered transparent and do not apply further speech signal processing such as noise or gain
control.
A good example of the difference between the two algorithms is the treatment of interruptions and lost
speech. Here P.862 PESQ is suspected of scoring inaccurately and usually too optimistically. In the example,
almost 4% of the original speech was lost; nevertheless P.862 PESQ scores 3.2, while P.863 POLQA
predicts only 2.7, which appears closer to the perceived quality here.
Figure 20: P.863 POLQA and P.862.1 PESQ presentation in NQDI with signal interruptions
When a larger number of quality scores obtained in a drive test is analyzed, the picture remains almost the same.
The following figures are based on a drive test and a collection of data from a European operator. The
speech sample used was American English and each given number is based on a collection of around 100
individual scores.
Note: P.862 PESQ defines the algorithm technically. The actual transformation from the P.862 outcome to a MOS-like scale
is defined in P.862.1. All predicted MOS scores in this document are computed in accordance with P.862 and were
converted to the MOS domain according to P.862.1.
Table 4: Comparison of P.862.1 PESQ scores to P.863 POLQA in high qualitative UMTS/GSM setups
                                 Average             Maximum
                                 P.862.1   P.863     P.862.1   P.863
                                 PESQ      POLQA     PESQ      POLQA
Downlink  UMTS 2100  Device A    3.97      3.97      4.19      4.19
                     Device B    4.04      4.06      4.17      4.17
          GSM 900    Device A    3.78      3.77      4.19      4.18
                     Device B    3.87      3.87      4.17      4.20
Uplink    UMTS 2100  Device A    3.92      3.80      4.13      4.02
                     Device B    4.01      3.83      4.12      4.04
          GSM 900    Device A    3.74      3.60      4.11      4.01
                     Device B    3.78      3.59      4.10      3.99
Looking only at Downlink, which is usually the less critical direction, the PESQ and POLQA averages
differ by at most 0.02 MOS, which is negligible. There are small differences in the
averages between the phones and between the two technologies, GSM and UMTS. But the behavior is the
same for either method, i.e. GSM 900 is scored lower by about 0.2 MOS on average with both methods.
In Uplink the situation is slightly different. Here P.863 POLQA scores slightly lower than PESQ, on average
by 0.15 MOS. This effect has several causes, the main one being the more restricted audio bandwidth
of the microphone path of the mobile device, which is used in Uplink. By contrast, the Downlink
uses the (wider) loudspeaker path of the phone. The former P.862 PESQ compensates for the frequency
response of the channel and therefore mostly ignores that band limitation. P.863 POLQA considers
changes in bandwidth as they are perceived by a user, and consequently a limitation leads to a slightly
lower score here.
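The differences discussed above can be recomputed from the average columns of Table 4. The following is a minimal sketch; the data layout and function names are chosen for this example, and the values are the rounded table entries, so the results are only as precise as the table.

```python
# (direction, technology, device) -> (average P.862.1 PESQ, average P.863 POLQA),
# copied from the "Average" columns of Table 4.
TABLE_4_AVERAGES = {
    ("Downlink", "UMTS 2100", "A"): (3.97, 3.97),
    ("Downlink", "UMTS 2100", "B"): (4.04, 4.06),
    ("Downlink", "GSM 900", "A"): (3.78, 3.77),
    ("Downlink", "GSM 900", "B"): (3.87, 3.87),
    ("Uplink", "UMTS 2100", "A"): (3.92, 3.80),
    ("Uplink", "UMTS 2100", "B"): (4.01, 3.83),
    ("Uplink", "GSM 900", "A"): (3.74, 3.60),
    ("Uplink", "GSM 900", "B"): (3.78, 3.59),
}


def deltas(direction):
    """PESQ minus POLQA for every row of the given direction."""
    return [
        pesq - polqa
        for (row_direction, _tech, _device), (pesq, polqa) in TABLE_4_AVERAGES.items()
        if row_direction == direction
    ]


def mean_delta(direction):
    """Mean of (PESQ - POLQA) over all rows of one direction."""
    values = deltas(direction)
    return sum(values) / len(values)


def max_abs_delta(direction):
    """Largest absolute PESQ/POLQA difference within one direction."""
    return max(abs(value) for value in deltas(direction))
```

With these rounded values, the largest Downlink difference is 0.02 MOS and the mean Uplink difference is 0.1575 MOS, in line with the roughly 0.15 MOS quoted above.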
Besides the average values, the distribution of the predicted values provides information about the measure's
behavior. The following two graphs are based on the downlink scores of Device A in UMTS 2100 as above.
[Histogram: share of scores (0% to 50%) per Listening Quality bin, from 1-1.1 to 4.8-4.9]
Figure 21: Distribution of predicted MOS scores by P.862.1 PESQ (Device A, UMTS, Downlink as in Table 4)
[Histogram: share of scores (0% to 50%) per Listening Quality bin, from 1-1.1 to 4.8-4.9]
Figure 22: Distribution of predicted MOS scores by P.863 POLQA (Device A, UMTS, Downlink as in Table 4)
Both distribution functions are very close and concentrate a wide majority of the scores in the range of 4.0 to
4.2, which corresponds to the best quality in error-free connections. It is logical that a certain quality cannot be
exceeded. It is set by the coding scheme, the channel limits and other voice processing in the path. Even in
undistorted conditions these insert a certain amount of degradation. This defines the upper level that cannot be
exceeded in this setup and causes the steep decline towards higher values on the right-hand side. Usually,
the majority of scores lie in this region, which corresponds to error-free transmission.
In the direction of lower values, the distribution falls off more gradually. Values in this region indicate degradations in
addition to the unavoidable distortions. In cellular networks these problems are usually interruptions (due to
handovers), fallbacks to lower bitrates in the case of AMR (due to bad radio conditions) and frame losses that
were concealed artificially by the AMR decoder. In principle there could be other distortions as well,
e.g. transcoding in the case of special routing or noise bursts coupled into analogue parts of the PSTN.
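A distribution like the ones in Figures 21 and 22 can be derived from a list of MOS scores by counting them in bins of 0.2 MOS. The sketch below assumes that the bins are labeled as on the figure axes ("1-1.1" up to "4.8-4.9") and that scores are given with two decimals; it is an illustration, not the code used to produce the figures.

```python
from collections import Counter

BIN_COUNT = 20  # bins "1-1.1" ... "4.8-4.9", each 0.2 MOS wide


def bin_label(index):
    """Axis label of bin number index, e.g. 15 -> '4-4.1'."""
    low_tenths = 10 + 2 * index
    return f"{low_tenths / 10:g}-{(low_tenths + 1) / 10:g}"


def mos_distribution(scores):
    """Return {bin label: share of scores in percent} for all 20 bins."""
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter()
    for score in scores:
        # Work in integer hundredths to avoid floating-point edge effects.
        hundredths = round(score * 100)
        index = min(max((hundredths - 100) // 20, 0), BIN_COUNT - 1)
        counts[index] += 1
    total = len(scores)
    return {bin_label(i): 100.0 * counts[i] / total for i in range(BIN_COUNT)}
```

A steep drop above the most populated bin and a long tail below it is the typical shape described above.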
Regarding the absolute maximum as shown in Table 4 there is no difference between the phones and the
technologies used, meaning that the reachable quality is identical for both and the slightly differing averages
are caused by individual test conditions e.g. slightly different RF coupling or a few more bad channels in the
averaging process. It should be noted that the reached maximum is the same as obtained by just processing
the same speech sample over an AMR 12.2 kbps codec in offline emulation. This indicates that there are no
further distortions introduced by the phone or constant speech processing components in the network.
Table 5: Comparison of P.862.1 PESQ scores to P.863 POLQA in common real field setups
Downlink                               Average            Maximum            80% Percentile
                                       P.862.1  P.863     P.862.1  P.863     P.862.1  P.863
                                       PESQ     POLQA     PESQ     POLQA     PESQ     POLQA
Network 1  UMTS           Device A     3.97     3.97      4.19     4.19      4.13     4.12
                          Device B     4.04     4.06      4.17     4.17      4.14     4.13
Network 2  UMTS           Device C     3.35     3.42      3.69     3.69      3.54     3.56
Network 3  UMTS           Device D     3.98     3.79      4.10     3.97      4.03     3.87
Network 4  CDMA / EVRC    Device E     3.30     3.29      3.76     3.66      3.62     3.54
Network 5  CDMA / EVRC-B  Device F     3.33     3.44      3.77     3.84      3.56     3.64
For network 2 the situation is different: apart from a device that applies some gain and noise control, the
network here is limited to AMR at 5.9 kbps. This significantly reduces the achievable quality compared to a
network/device combination as in network 1.
Network 3 is somewhat in between networks 1 and 2. It enables AMR at 12.2 kbps, but the handset used
is not as transparent as devices A and B. P.863 POLQA scores these device characteristics lower
than P.862 PESQ.
The two real field CDMA networks are also in the range of networks 2 and 3. The quality is determined by
the coding schemes used. Mainly for EVRC-B, the P.863 POLQA scores are higher than those of P.862 PESQ.
However, aggressive noise and gain control have a strong influence on the achieved scores as well. As a result,
even the maximum scores are lower than what would be expected from plain encoding distortions.
The quality figures achieved in real field measurements of course depend on the RF conditions in
the network. However, it has to be considered that a certain quality cannot be exceeded due to fixed speech
processing components in the channel and in the device. Besides comparing averages, a closer look at
the distribution of the scores and at the values where most of the scores are located gives useful information
about potential reasons for non-perfect quality.
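The three figures reported per setup (average, maximum and 80% percentile) can be derived from a collection of MOS scores as sketched below. The percentile definition used for the tables is not stated, so the nearest-rank method here is an assumption, and the function name is chosen for this example.

```python
def summarize(scores, percentile=80):
    """Return average, maximum and a percentile of a list of MOS scores.

    The percentile uses the nearest-rank method (an assumption; other
    definitions interpolate and give slightly different values).
    """
    if not scores:
        raise ValueError("at least one score is required")
    ordered = sorted(scores)
    # Nearest rank: smallest rank r with r >= percentile/100 * n (integer ceiling).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return {
        "average": sum(ordered) / len(ordered),
        "maximum": ordered[-1],
        "percentile": ordered[rank - 1],
    }
```

A maximum and an 80% percentile that lie close together point to a quality ceiling set by fixed components, while an average well below both points to additional, channel-dependent degradations.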
To get an impression of how large this deviation can be, the following analysis was made. Nine different speech
samples were transmitted consecutively in a phone call during a drive test in a real UMTS network. A total of
30 calls were made. It can be assumed that the distribution of real channel quality was the same for all nine
samples.
The following table shows the averages, the absolute maximum values and the 80% percentiles of the MOS
scores obtained with the nine speech samples. For a better overview, the samples are listed by language.
The test situation is the reference situation as above, i.e. network 1 (European UMTS 2100 MHz, Device B
as a quasi-transparent device and uncritical downlink only).
Table 6: Comparison of different speech samples in common real field setups
Network 1, UMTS, Downlink, Device B
                    Average            Maximum            80% Percentile
                    P.862.1  P.863     P.862.1  P.863     P.862.1  P.863
                    PESQ     POLQA     PESQ     POLQA     PESQ     POLQA
American English    4.04     4.06      4.17     4.17      4.14     4.13
German              4.04     4.07      4.20     4.26      4.18     4.18
Spanish             3.98     4.03      4.21     4.31      4.19     4.25
Greek               3.98     3.96      4.13     4.17      4.09     4.11
Russian             3.87     3.87      4.09     4.09      4.03     4.01
Hungarian           3.93     4.17      4.14     4.39      4.11     4.35
Arabic              3.84     3.85      4.02     4.06      4.00     3.99
Polish              3.82     3.95      4.09     4.32      4.06     4.26
Japanese            3.82     3.96      3.99     4.10      3.92     4.06
In general it can be observed that there is a considerable difference between the samples. The averages and
the maximum values span a range of >0.2 MOS in the case of P.862 PESQ and even >0.3 MOS for P.863
POLQA. There are two reasons for this. First, the individual samples are treated slightly differently by the
voice processing in the channel. They are affected to a greater or lesser extent by e.g. band-pass filtering or compression.
Secondly, P.863 POLQA takes the talker's timbre into account, i.e. the spectral power distribution of the reference and
degraded signal. Since there are differences in the talkers' individual characteristics and in
the actual recording conditions of the reference speech samples, P.863 POLQA scores them slightly differently too.
The situation is largely systematic, i.e. a speech sample that is scored slightly lower will tend towards lower scores
under all realistic test conditions. Therefore, when comparing MOS values from different investigations, the
influence of the speech sample used should not be overlooked. Ideally, results that are to be compared to
each other should be based on the same speech sample or the same selection of samples.
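The spread between the nine speech samples can be recomputed from the Average and Maximum columns of Table 6. The sketch below simply takes the difference between the highest and lowest value per column; the data layout is chosen for this example, and the values are ordered from American English to Japanese as in the table.

```python
# Values copied from Table 6, in table order (American English ... Japanese).
TABLE_6 = {
    "average": {
        "PESQ": [4.04, 4.04, 3.98, 3.98, 3.87, 3.93, 3.84, 3.82, 3.82],
        "POLQA": [4.06, 4.07, 4.03, 3.96, 3.87, 4.17, 3.85, 3.95, 3.96],
    },
    "maximum": {
        "PESQ": [4.17, 4.20, 4.21, 4.13, 4.09, 4.14, 4.02, 4.09, 3.99],
        "POLQA": [4.17, 4.26, 4.31, 4.17, 4.09, 4.39, 4.06, 4.32, 4.10],
    },
}


def spread(values):
    """Range of a column: highest minus lowest score."""
    return max(values) - min(values)


def spreads(table):
    """Return {(statistic, measure): spread} for every column of the table."""
    return {
        (statistic, measure): spread(values)
        for statistic, columns in table.items()
        for measure, values in columns.items()
    }
```

For these values the spread is about 0.22 MOS for P.862.1 PESQ (averages and maxima) and about 0.32 to 0.33 MOS for P.863 POLQA, which matches the ranges stated above.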
Table 7: Comparison of the NB and SWB mode of P.863 POLQA in common real field setups
P.863 'POLQA'                  Average         Maximum         80% Percentile
                               NB     SWB      NB     SWB      NB     SWB
Network 1  UMTS DL  Device A   4.06   3.03     4.17   3.25     4.13   3.17
The achievable quality scores of 3.17 (80% percentile) or 3.25 (absolute maximum in the collection) fit the
simulated value of 3.2 given in Table 3 quite well (which itself is an average over a set of different speech
samples).
In a more detailed analysis the distributions of the two test cases are compared in Figure 23. It can be seen
that the typical shape of MOS distribution is shifted and compressed towards the lower scale end.
[Two histograms, upper graph NB mode and lower graph SWB mode: share of scores (0% to 50%) per Listening Quality bin, from 1-1.1 to 4.8-4.9]
Figure 23: Distribution of predicted MOS scores by P.863 POLQA SWB mode (lower graph) and NB mode (upper
graph) using Device A, UMTS, Downlink as in Table 4
It has to be considered that the reporting of SWB scores will not match the expected figures obtained with
NB in the past. Both values and analysis types must not be mixed or compared to each other. However, it
will just be a question of time until the market deals with quality scores obtained in super-wideband mode
and has adapted to the lower range of quality achievable in narrowband connections.
The most important point for using P.863 POLQA in SWB is of course the evaluation of wideband channels
and networks, both for comparison to each other and to traditional narrowband systems.
Today in 2011, only a few mobile networks are equipped with wideband transmission capabilities, and this is
often restricted to mobile-to-mobile connections in transcoding-free operational mode. On the other hand,
VoIP services have been using wideband and even super-wideband transmission for a long time, and
with P.863 POLQA there is now an appropriate objective measure of speech quality for these systems.
The following example for wideband transmission is based on a collection of speech samples obtained in a
mobile network that was equipped with AMR-WB at most locations.
[Histogram "Listening Quality distribution (P863-SWB 'POLQA')": share of scores (0% to 30%) per Listening Quality bin, from 1-1.1 to 4.8-4.9]
Figure 24: Distribution of predicted MOS scores by P.863 POLQA SWB mode in wideband capable networks
It can clearly be observed that the majority of the quality scores are in a range from 3.7 to 3.9, which
represents the achievable quality for AMR-WB at 12.65 kbps (as shown in Table 3). The lower scores are
partially caused by transmission errors and by a few narrowband connections in this selection. AMR-NB at
12.2 kbps results in a predicted MOS of 3.2 or lower in this dataset.
Conclusion
With P.863 POLQA a new measure for objective speech quality assessment has been standardized,
serving today's and future demands of voice quality testing. P.863 POLQA is embedded in SwissQual's
SQuad application framework, which computes a powerful set of additional information characterizing
the analyzed speech signal.
The actual implementation of P.863 POLQA was thoroughly speed-optimized for a resource-saving
calculation in a fraction of real time on SwissQual's platform. The SQuad framework that provides P.863
POLQA can also compute P.862 PESQ scores in parallel to P.863 POLQA in NB mode, as an option for
customers who are interested in a direct comparison between the two measures.
Of course, the SQuad measurement suite can also be equipped with SQuad08 as a MOS predictor.
SQuad08 is technically compatible with P.863 and serves the same applications as P.863 POLQA.