
arXiv:1605.07116v1 [cs.CV] 23 May 2016

A Formal Evaluation of PSNR as Quality Measurement Parameter for Image Segmentation Algorithms
Fernando A. Fardo, Victor H. Conforto, Francisco C. de Oliveira, Paulo S.
Rodrigues
Centro Universitário da FEI, São Paulo, Brazil

Abstract
Quality evaluation of image segmentation algorithms is still a subject of
debate and research. Currently, there is no generic metric that can be
applied reliably to any algorithm. This article evaluates the PSNR (Peak
Signal-to-Noise Ratio) as a metric, which has been used to evaluate
threshold level selection as well as the number of thresholds in the case of
multi-level segmentation. The results obtained in this study suggest that
the PSNR is not an adequate quality measurement for segmentation
algorithms.
Keywords: Segmentation, threshold, PSNR
1. Introduction
In image processing, segmentation is a set of techniques that separate
regions of a scene based on similarity. There are several techniques
available for this process [10, 4]. Segmentation is usually based on
attributes such as color, brightness, contrast or continuity of pixel
regions. In the particular case of threshold based techniques, one or more
threshold values are determined. Pixels of similar brightness levels are
then grouped as below or above such threshold levels [6].
Fig. 1 shows an example of a scene containing a simple foreground and a
background. Fig. 2 shows its corresponding 256 gray level histogram with
an obtained threshold level t at 118. The resulting image of a threshold
based segmentation algorithm is shown in Fig. 3, where pixels below t are
set to 0. Conversely, pixels of brightness level above t are set to 255. In
this case, pixels labeled as 0 and 255 can be treated as the background and
foreground, respectively.
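The thresholding step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name threshold_image and the sample pixel values are hypothetical, plain Python lists stand in for an image to keep the sketch dependency-free, and pixels exactly equal to t (a case the text does not specify) are assigned to the background.

```python
def threshold_image(image, t, low=0, high=255):
    """Binarize a gray-level image given as a list of rows of ints in 0..255.

    Pixels brighter than t are set to `high`; all others are set to `low`.
    Treating pixels equal to t as background is an assumption of this sketch.
    """
    return [[high if pixel > t else low for pixel in row] for row in image]


# Hypothetical 3x4 image used only for illustration.
image = [
    [12, 40, 200, 230],
    [30, 118, 119, 250],
    [5, 90, 180, 255],
]
mask = threshold_image(image, t=118)
```

With t = 118, the two left columns of this toy image become background (0) and the two right columns become foreground (255).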

Figure 1: Example of an image with foreground and background

Figure 2: Gray level histogram with detected threshold t = 118

Figure 3: Resulting image after threshold based segmentation with t = 118

Such techniques are often used as a pre-processing step in high level
computer vision systems, as they reduce the amount of irrelevant information
by grouping similar pixels into the same region. The objective of
threshold algorithms is to detect the threshold level that most accurately
separates an image into regions of interest. The main problem is that the
quality evaluation of such algorithms lacks an objective parameter and
cannot be determined automatically.
There are many proposals for a generic metric for segmentation algorithms.
Such a metric is often difficult to describe, making an objective evaluation
method potentially unreliable. The evaluation methods can be divided into
two main categories: analytic and empirical [2]. The analytic methods are
based on properties obtained from the segmented image, which can be used
to obtain a quantitative quality measurement. These methods are not very
reliable, as determining the quality of a segmentation based purely on
analytic parameters can be difficult [2]. The empirical methods are based
on the comparison of the resulting segmented image with pre-defined
desirable results determined by human operators, and can be further divided
into two subcategories: goodness methods and discrepancy methods. Goodness
methods use pre-established parameters such as region uniformity or
inter-region contrast. The discrepancy methods rely on the comparison of
the segmentation result with a reference image known as ground truth, which
is established by a human operator [2].
Despite its limitations, the PSNR has been used as an analytic metric by
several authors of threshold based algorithms [3, 7, 1]. As the subject of
this study, we performed experiments to verify whether the PSNR can be used
reliably as an analytic metric for image segmentation.
2. PSNR
The PSNR is a signal processing measurement that compares a given
received or processed signal to its original source signal. This comparison
allows us to quantify how faithful a processed signal is to the original,
and to identify possible noise or distortions in the signal. We can say
that the PSNR represents a direct relationship between a signal before and
after a degradation process.
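Mathematically, the PSNR is described by Equations (1) and (2):

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) = 20 \log_{10}\left(\frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}}\right) \qquad (1)$$

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[I(i,j) - K(i,j)\right]^2 \qquad (2)$$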

where MAX_I is the highest possible value of the signal, and I and K are the
two m × n images being compared. In the case of an 8-bit gray scale image,
MAX_I = 255. As shown in Eq. (1), the PSNR decreases as the MSE (Mean
Squared Error) increases. The final value of the PSNR is given in decibels.
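As a minimal sketch of how these two quantities can be computed, the functions below use the conventional definition PSNR = 10 log10(MAX² / MSE). The function names, the list-of-rows image representation and the toy images are assumptions made for illustration; returning infinity for identical images is also a choice of this sketch, since the MSE is then zero.

```python
import math


def mse(original, processed):
    """Mean squared error between two equally sized gray-level images."""
    m, n = len(original), len(original[0])
    total = 0.0
    for row_i, row_k in zip(original, processed):
        for a, b in zip(row_i, row_k):
            total += (a - b) ** 2
    return total / (m * n)


def psnr(original, processed, max_value=255):
    """PSNR in decibels; infinite when the two images are identical."""
    error = mse(original, processed)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


# Toy 2x2 images: one of the four pixels differs by the full range (255).
original = [[0, 255], [255, 0]]
processed = [[0, 255], [255, 255]]
value = psnr(original, processed)  # equals 10 * log10(4), about 6.02 dB
```

In this toy case the MSE is 255² / 4, so the ratio inside the logarithm is exactly 4.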
The PSNR is generally used to evaluate the quality of transmission and
compression of image or video signals, based on the mean squared error of
the received or processed image in comparison to the source image. However,
it has also been used as an analytic metric for segmentation algorithm
evaluation [3, 7]. In the case of multi-threshold algorithms, it has also
been used as a metric to determine the number of thresholds [1] as well as
their values [13].
3. Objective
The purpose of this paper is to evaluate the PSNR itself as a reliable
analytic method for evaluation of image segmentation algorithms.
4. Methodology
Since we are not trying to evaluate an algorithm but the metric itself,
we cannot rely on an existing study that used the PSNR as an analytic
method for evaluation. Instead, we propose treating ground truth data,
which would normally be used by empirical methods, as the results of a
segmentation algorithm. Then, we use the PSNR as an analytic method to
evaluate such results.
For the experiments, we used the set of images from the Berkeley BSR300
Database [9]. It comprises 300 images containing several types of scenes,
where every image I has its corresponding ground truth image G. The
ground truth is an image containing the contours of the objects in each
scene that volunteers defined as the most relevant ones. Fig. 4 shows an
example of an image (a) of the database and its respective ground truth
image (b).
From each ground truth image G, a region mask G' is obtained, separating
the background from the foreground. The mask was obtained by automatically
filling the closed contours with the white color (255), thus creating
masks with the most relevant regions of interest. After applying a
threshold algorithm to this mask, a binary mask B is obtained. Since
computer vision techniques are strongly inspired by human vision, we can
assume that such binary masks are close to the output of an ideal
segmentation algorithm. Fig. 5 shows an example of a filled ground truth
G' (a) and the corresponding binary mask B (b) after thresholding.

Figure 4: Example of an image from the database (a) and its respective ground truth (b)

Figure 5: Automatically filled ground truth image (a) and obtained binary mask (b)
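The text does not say how the closed contours were filled. One common approach, assumed here purely for illustration, is to flood-fill the background starting from the image border and to mark every pixel that is not reached as foreground. The sketch below assumes contour pixels are nonzero and background pixels are zero, which may not match the encoding of the ground truth files; the function name is hypothetical.

```python
from collections import deque


def fill_closed_contours(contours):
    """Return a mask with 255 on contours and their interiors, 0 elsewhere.

    `contours` is a list of rows; nonzero entries are contour pixels.
    Background is whatever can be reached from the border through zeros
    using 4-connected moves.
    """
    rows, cols = len(contours), len(contours[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every zero-valued pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and contours[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    # Breadth-first flood fill of the background.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and not outside[nr][nc] and contours[nr][nc] == 0):
                outside[nr][nc] = True
                queue.append((nr, nc))
    return [[0 if outside[r][c] else 255 for c in range(cols)]
            for r in range(rows)]


# Toy contour image: a closed square ring with an empty center.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_closed_contours(ring)
```

In the toy example the center pixel is enclosed by the ring, so it is filled together with the contour itself, while the outer border stays at 0.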
To verify the efficacy of the PSNR as an analytic method for image
segmentation, we generated poorly segmented masks from the binary masks
by adding salt and pepper noise. As salt and pepper noise randomly changes
pixels to either 0 or 255, we can use it to simulate a bad segmentation.
The resulting mask B' therefore contains several pixels that are
incorrectly classified as foreground (255) or background (0). Fig. 6 shows
an example of a binary mask B (a) and its corresponding bad segmentation
B' (b).
When used as an analytic method, the PSNR is computed between the resulting
image and the original. Therefore, the PSNR must be calculated between
each original image I and both the corresponding segmentation mask B and
the bad segmentation mask B'.
For each image in the database, the PSNR is calculated between I and B and
between I and B', and the results are stored for later analysis.
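The degradation step can be sketched as follows. The noise density used in the experiments is not stated, so the density parameter, its value in the example, the equal probability of salt and pepper, and the function name are all assumptions of this illustration rather than the procedure actually used.

```python
import random


def salt_and_pepper(mask, density, seed=None):
    """Return a copy of a binary mask with a fraction of pixels randomized.

    Each pixel is replaced, with probability `density`, by 0 (pepper) or
    255 (salt), chosen with equal probability. The input is left untouched.
    """
    rng = random.Random(seed)
    noisy = []
    for row in mask:
        new_row = []
        for pixel in row:
            if rng.random() < density:
                new_row.append(rng.choice((0, 255)))
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


# Toy binary mask: top half background, bottom half foreground.
mask = [[0] * 6 for _ in range(3)] + [[255] * 6 for _ in range(3)]
noisy = salt_and_pepper(mask, density=0.3, seed=1)
```

Because a replaced pixel may receive the value it already had, the fraction of pixels that actually change is lower than the density.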

Figure 6: Binary mask B (a) and bad segmentation mask B' after salt and pepper noise (b)

4.1. Proof
Let P be the set of PSNR results calculated between each binary mask B
and its corresponding image I. Let P' be the set of PSNR results calculated
between each bad segmentation mask B' and its corresponding source image
I. If the PSNR is an adequate analytic method, the average of the PSNR
values in P should be significantly superior to the average of those in P'.
For this paper, this condition is adopted as our main hypothesis.

5. Results and discussion


To test the main hypothesis, we initially proposed the use of Student's
t-test at the 95% confidence level [11] between P and P'. However, this
test requires the variances of the samples to be homogeneous. We therefore
first used Fisher's F test for variance [5] to verify such homogeneity
between P and P'. Figs. 7 and 8 show the probability density for the sets P
and P', respectively. If the results from the F test indicate that the
variances of the sets P and P' are not homogeneous, Student's t-test cannot
be applied. In this case, Welch's t-test should be used instead [12]. These
hypothesis tests were performed using the R language.

Figure 7: Probability density for the set P of PSNR results for good segmentation masks

Figure 8: Probability density for the set P' of PSNR results for bad segmentation masks

5.1. Fisher's F test for variance
As the null hypothesis for the F test, we adopt that the variances of the
sets are homogeneous. As the alternative hypothesis, we adopt that the
variances of the sets are not homogeneous. The results from the F test are
shown in Table 1.

F                     0.4618
df numerator          299
df denominator        299
p value               4.265 × 10^-11
Confidence interval   0.3679506 to 0.5795227
Ratio of variances    0.4617745

Table 1: Results for the F test of variance between P and P'

The p value for the F test is in the region for acceptance of the
alternative hypothesis. Therefore, it is not safe to assume that the
variances of P and P' are homogeneous, and Student's t-test cannot be used
reliably. Welch's t-test is then used to determine whether the difference
between P and P' is statistically significant.
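The F statistic reported above is the ratio of the two sample variances, with n - 1 degrees of freedom for each sample. The following sketch computes only that statistic; the function name and the toy samples are hypothetical and are not the experimental data. Obtaining the p value and the confidence interval also requires the F distribution, which R provides (for example through var.test, although the text does not say which function was used).

```python
import statistics


def variance_ratio(sample_a, sample_b):
    """Return (F, df numerator, df denominator) for two samples.

    F is the ratio of the unbiased sample variances of the two samples.
    """
    f = statistics.variance(sample_a) / statistics.variance(sample_b)
    return f, len(sample_a) - 1, len(sample_b) - 1


# Toy samples for illustration only.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat, df_num, df_den = variance_ratio(sample_a, sample_b)
```

A ratio far from 1, as in Table 1, indicates that one set is considerably more dispersed than the other.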
5.2. Welch's t-test
As the null hypothesis, we adopt that the means of P and P' are equal, that
is, the difference between the means of both sets is zero. As the
alternative hypothesis, we adopt that the mean of P' is superior to the
mean of P. Should the alternative hypothesis be accepted, it would suggest
that the bad segmentation masks were considered better than the ideal
segmentation according to the PSNR metric.
Welch's t-test is then applied at the 95% confidence level between the
sets P and P'. Table 2 shows the results of Welch's t-test.
t statistic           -7.6524
df                    526.607
p value               4.735 × 10^-14
Confidence interval   0.8641351
Mean of P             5.638749
Mean of P'            6.740013

Table 2: Results for Welch's t-test between P and P'

The p value for Welch's t-test is 4.735 × 10^-14, which lies in the region
of rejection of the null hypothesis. We are left with the acceptance of
the alternative hypothesis, which indicates that the PSNR values calculated
from the bad segmentation masks B' are superior to the ones calculated from
the human-obtained masks B.
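For reference, the t statistic and the non-integer degrees of freedom of Welch's test follow from the sample means, variances and sizes, with the degrees of freedom given by the Welch-Satterthwaite approximation. The sketch below shows that computation on toy samples; the function name and the data are hypothetical, and the p value is omitted because it requires the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = mean_diff / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Toy samples for illustration only.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, df = welch_t(sample_a, sample_b)
```

A negative t statistic, as in Table 2, means that the first sample has the lower mean; with P as the first sample, this corresponds to the mean of P' being higher.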

6. Final considerations
We investigated the efficacy of the PSNR as an analytic method for
segmentation algorithms in the same way it has been adopted in the
literature. We used human-created segmentation masks as an ideal reference
of a segmentation algorithm and compared the PSNR values calculated from
these masks to those calculated from artificially inferior segmentation
masks.
To verify whether the PSNR is a good evaluation method, we compared two
sets of PSNR values calculated from good and bad segmentation masks. The
mask generation procedure can produce masks that would not be obtainable
from threshold algorithms, as the values for labels are usually determined
by the values of the calculated thresholds. For example, a foreground
object on a brighter background would have its pixels set to 0 in the
binary mask while the background would be set to 255. However, there is no
rule for what levels each label should be set to, and this could influence
the PSNR as well. Some graph based algorithms even separate regions using
random colors [8]. Results from such algorithms could not be verified with
the PSNR as it stands, since they would change greatly from one execution
to another.
We proposed the use of Welch's t-test to verify whether the difference
between the sets of PSNR values from good and bad segmentation is
significant. Higher PSNR values for good segmentation masks would suggest
that the PSNR is in fact a good analytic method. However, the results from
Welch's t-test suggest exactly the opposite. The PSNR values for the bad
segmentation masks are significantly superior to the ones for good
segmentation masks. Therefore, the PSNR should not be considered an
adequate method for evaluation of segmentation algorithms. However, the
PSNR is still a good method to evaluate discrepancies between images and
could be used to evaluate edge detection algorithms by comparing their
results with ground truth images such as the ones present in the BSR300
database.
Future work could include the verification of multi-threshold algorithms
and the determination of the number of thresholds, as well as the impact of
the label values.
7. Acknowledgment
The authors would like to thank the University of California, Berkeley, for
creating the BSR300 database and making it available.

References
[1] Siddharth Arora, Jayadev Acharya, Amit Verma, and Prasanta K Panigrahi.
Multilevel thresholding for image segmentation through a fast
statistical recursive algorithm. Pattern Recognition Letters,
29(2):119-125, 2008.
[2] Jaime S Cardoso and Luís Corte-Real. Toward a generic evaluation
of image segmentation. Image Processing, IEEE Transactions on,
14(11):1773-1782, 2005.
[3] Yu-Kumg Chen, Fan-Chieh Cheng, and Pohsiang Tsai. A gray-level
clustering reduction algorithm with the least PSNR. Expert Systems
with Applications, 38(8):10183-10187, 2011.
[4] H. Erdmann, G. Wachs-Lopes, C. Gallao, P. M. Ribeiro, and S. P.
Rodrigues. Developments in Medical Image Processing and Computational
Vision, chapter A Study of a Firefly Meta-Heuristics for Multithreshold
Image Segmentation, pages 279-295. Springer International Publishing,
Cham, 2015.
[5] Ronald Aylmer Fisher. The asymptotic approach to Behrens's integral,
with further tables for the d test of significance. Annals of Eugenics,
11(1):141-172, 1941.
[6] Rafael C Gonzalez and Richard E Woods. Digital image processing,
2002.
[7] Ming-Huwi Horng and Ren-Jean Liou. Multilevel minimum cross entropy
threshold selection based on the firefly algorithm. Expert Systems
with Applications, 38(12):14805-14811, 2011.
[8] Qing-Hua Huang, Su-Ying Lee, Long-Zhong Liu, Min-Hua Lu, Lian-Wen
Jin, and An-Hua Li. A robust graph-based segmentation method for
breast tumors in ultrasound images. Ultrasonics, 52(2):266-275, 2012.
[9] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human
segmented natural images and its application to evaluating segmentation
algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf.
Computer Vision, volume 2, pages 416-423, July 2001.
[10] Paulo S. Rodrigues and Gilson A. Giraldi. Improving the non-extensive
medical image segmentation based on Tsallis entropy. Pattern Analysis
and Applications, 14(4):369-379, 2011.
[11] Student. The probable error of a mean. Biometrika, pages 1-25, 1908.
[12] Bernard L Welch. The generalization of 'Student's' problem when several
different population variances are involved. Biometrika, pages 28-35,
1947.
[13] Cao Yun-Fei, Xiao Yong-Hao, Yu Wei-Yu, and Chen Yong-Chang.
Multi-level threshold image segmentation based on PSNR using artificial
bee colony algorithm. China Research Journal of Applied Sciences,
Engineering and Technology Published: January, 15, 2011.
