| w_j), with the product running over i = 1, ..., n. (2)
The joint probability $p(v_1, v_2, \dots, v_n)$ is described by

$$p(v_1, v_2, \dots, v_n) = \sum_{j=1}^{c} p(w_j)\, p(v_1, v_2, \dots, v_n \mid w_j) \qquad (3)$$
Then the posterior probability $p(w_j \mid v_1, v_2, \dots, v_n)$ is transformed into

$$p(w_j \mid v_1, v_2, \dots, v_n) = \frac{p(w_j) \prod_{i=1}^{n} p(v_i \mid w_j)}{\sum_{s=1}^{c} p(w_s) \prod_{i=1}^{n} p(v_i \mid w_s)} \qquad (4)$$
Implementation of Text Classifier
Let $T_j$ be the set of training web pages belonging to the category $w_j$, where $T_j = \{t_{j,1}, t_{j,2}, \dots, t_{j,k_j}\}$ and $k_j$ is the number of web pages in the set $T_j$. The conditional probability of the $i$th word in the vocabulary $(a_1, a_2, \dots, a_n)$ is given by

$$p(a_i \mid w_j) = \frac{1 + \sum_{l=1}^{k_j} h_{l,i}}{n + \sum_{i'=1}^{n} \sum_{l=1}^{k_j} h_{l,i'}} \qquad (5)$$
where $H_l = (h_{l,1}, h_{l,2}, \dots, h_{l,n})$ denotes the histogram vector of the $l$th page in the set $T_j$ over the word vocabulary. The prior probability $p(w_j)$ is estimated by

$$p(w_j) = \frac{k_j}{\sum_{s=1}^{c} k_s} \qquad (6)$$
Thus, for the requested page $T$, the probability $p(w_j \mid T)$ that $T$ belongs to the category $w_j$ is calculated by

$$p(w_j \mid T) = \frac{p(w_j) \prod_{i=1}^{n} p(a_i \mid w_j)^{h_{i,T}/R}}{\sum_{s=1}^{c} p(w_s) \prod_{i=1}^{n} p(a_i \mid w_s)^{h_{i,T}/R}} \qquad (7)$$
where $h_{i,T}$ is the frequency of the $i$th word in the web page $T$ and $R$ is the total number of words extracted from the page. If the probability $p(w_1 \mid T)$ exceeds the threshold $\theta_T$, the page is classified as unauthorized; otherwise it is classified as normal. The value of $\theta_T$ is estimated later using Bayesian theory.
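The estimates in Eqs. (5) to (7) can be sketched in code as follows. This is a minimal illustration rather than the authors' implementation: it assumes add-one (Laplace) smoothing with the vocabulary size n in the denominator, works in log space for numerical stability, and uses made-up histograms. The names `train`, `posterior`, and `THETA_T`, the category labels, and the threshold value 0.5 are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Histogram = Sequence[int]  # word counts over a fixed vocabulary of size n


def train(pages_by_category: Dict[str, List[Histogram]]
          ) -> Tuple[Dict[str, float], Dict[str, List[float]]]:
    """Estimate p(w_j) as in Eq. (6) and log p(a_i | w_j) as in Eq. (5)."""
    total_pages = sum(len(pages) for pages in pages_by_category.values())
    priors: Dict[str, float] = {}
    log_cond: Dict[str, List[float]] = {}
    for category, pages in pages_by_category.items():
        n = len(pages[0])  # vocabulary size
        # Sum each word's count over all k_j training pages of this category.
        word_counts = [sum(page[i] for page in pages) for i in range(n)]
        denom = n + sum(word_counts)  # assumed Laplace-smoothed denominator
        log_cond[category] = [math.log((1 + c) / denom) for c in word_counts]
        priors[category] = len(pages) / total_pages
    return priors, log_cond


def posterior(hist_t: Histogram,
              priors: Dict[str, float],
              log_cond: Dict[str, List[float]]) -> Dict[str, float]:
    """Posterior over categories for page T, with exponents h_{i,T}/R (Eq. 7)."""
    r = sum(hist_t)  # total number of words extracted from the page
    if r == 0:
        return dict(priors)  # no words: fall back to the priors
    scores = {
        cat: math.log(priors[cat])
        + sum(h / r * lp for h, lp in zip(hist_t, log_cond[cat]))
        for cat in priors
    }
    top = max(scores.values())  # subtract the maximum before exponentiating
    unnorm = {cat: math.exp(s - top) for cat, s in scores.items()}
    z = sum(unnorm.values())
    return {cat: u / z for cat, u in unnorm.items()}


# Toy data: a three-word vocabulary and two pages per category (made up).
training = {
    "unauthorized": [[3, 0, 1], [2, 1, 0]],
    "normal": [[0, 3, 1], [0, 2, 2]],
}
priors, log_cond = train(training)
post = posterior([4, 0, 1], priors, log_cond)

THETA_T = 0.5  # hypothetical threshold; the paper estimates it separately
is_unauthorized = post["unauthorized"] > THETA_T
```

Working with logarithms avoids underflow when the vocabulary is large, and the normalization at the end plays the role of the denominator of Eq. (7).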
IMAGE CLASSIFIER
Text content alone is not enough to detect malicious web pages, and relying on it also results in a high false-positive rate. Most adult websites consist predominantly of text and image content. In this paper we use the algorithm of [8], which classifies skin-like pixels with a cascaded AdaBoost algorithm. The classifier builds models in three different color spaces, HSV, YCgCb and YCgCr, as shown in Fig. 2.
Bayesian Classifier
A Bayesian classifier on its own struggles under varying lighting conditions, which may result in a high false-positive rate. The method adopted here detects skin-like pixels by combining the Bayesian classifier with models in the HSV, YCgCb and YCgCr color spaces using the AdaBoost algorithm. We have tested the ratio of skin area to image area. Our method reduces the false-positive rate while maintaining a high true-positive rate.
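As a rough illustration of the skin-area ratio mentioned above, the sketch below assumes that the pixel classifier has already produced a two-dimensional boolean mask (True for skin-like pixels). The function name and the mask layout are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def skin_area_ratio(mask: Sequence[Sequence[bool]]) -> float:
    """Fraction of pixels flagged as skin-like in a 2-D boolean mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0  # empty image: define the ratio as zero
    skin = sum(1 for row in mask for is_skin in row if is_skin)
    return skin / total


# Made-up 2 x 4 mask with three skin-like pixels out of eight.
mask = [
    [True, True, False, False],
    [True, False, False, False],
]
ratio = skin_area_ratio(mask)
```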
If $p(w_i \mid X)$ is the posterior probability of the skin or non-skin category and $C_{ij}$ is the cost of classification, then the pixel is assigned to the category with the least cost. According to naive Bayes theory,

$$\frac{p(X \mid w_1)}{p(X \mid w_2)} > \theta, \quad X \in w_1 \qquad (8)$$

$$\frac{p(X \mid w_1)}{p(X \mid w_2)} \le \theta, \quad X \in w_2 \qquad (9)$$

where $\theta$ is the classification threshold, calculated by

$$\theta = \frac{(C_{ij} - C_{jj})\, p(w_j)}{(C_{ji} - C_{ii})\, p(w_i)} \qquad (10)$$

Here $p(w_i)$ is the prior probability, with $i = 1$ and $j = 2$; $p(X \mid w_i)$ is the likelihood of skin color and $p(X \mid w_j)$ is the likelihood of non-skin color.
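The decision rule of Eqs. (8) to (10) can be sketched as below. This is a hedged illustration: the cost matrix, the priors, and the likelihood values are made up, `cost[a][b]` is assumed to hold $C_{ab}$ with index 1 for skin and 2 for non-skin, and the function names are hypothetical. In practice the likelihoods would come from color histograms or another density estimate.

```python
from typing import Sequence


def threshold(cost: Sequence[Sequence[float]],
              prior_skin: float,
              prior_non_skin: float) -> float:
    """Classification threshold of Eq. (10) with i = 1 (skin), j = 2 (non-skin)."""
    c11, c12 = cost[0]
    c21, c22 = cost[1]
    return (c12 - c22) * prior_non_skin / ((c21 - c11) * prior_skin)


def classify_pixel(lik_skin: float, lik_non_skin: float, theta: float) -> str:
    """Likelihood-ratio test of Eqs. (8)-(9).

    The ratio test is written as a product so that a zero non-skin
    likelihood does not cause a division by zero.
    """
    return "skin" if lik_skin > theta * lik_non_skin else "non-skin"


# Made-up values: zero cost for correct decisions, unit cost for errors.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
theta = threshold(cost, prior_skin=0.3, prior_non_skin=0.7)

label_a = classify_pixel(lik_skin=0.05, lik_non_skin=0.01, theta=theta)
label_b = classify_pixel(lik_skin=0.02, lik_non_skin=0.01, theta=theta)
```

With equal error costs the threshold reduces to the ratio of the priors, so a pixel needs a likelihood ratio above roughly 2.33 here before it is labeled as skin.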
AdaBoost Classifier
Cascaded AdaBoost lets the designer keep adding new stages of weak learners until the error rate drops to a tolerable range. In this method, two training sets are prepared, labeled skin and non-skin. Every sample is given a weight, and the weight of each weak learner is calculated from its classification results over all the samples. After each stage has been trained, the positive training set stays the same for the next stage, but the negative set is reconstructed so that only the incorrectly classified non-skin samples remain. To maintain the total number of samples, the current cascaded classifier is used to classify some candidate negative samples, and the misrecognized ones are added to the training set. As a result, the classifier is able to reduce the FP rate step by step while keeping a high TP rate.
International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 7, July 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 2122
To combine all the strong classifiers, the cascaded AdaBoost is trained with the following procedure.
1. Select values for R_fp, the maximum acceptable false positive rate per stage, and R_det, the minimum acceptable detection rate per stage.
2. Select the target overall false positive rate, F_target.
3. S_P = set of positive examples.
4. S_n = set of negative examples.
5. n_i = the number of features in stage i.
6. F_i = the false positive rate of the current cascaded AdaBoost classifier.
7. D_i = the true positive rate of the current cascaded AdaBoost classifier.
8. Initialize F_0 = 1.0, D_0 = 1.0 and i = 0.
9. While F_i > F_target,
   (1) i ← i + 1;
   (2) n_i = 0, F_i = F_{i-1};
   (3) while F_i > R_fp × F_{i-1},
       i. n_i ← n_i + 1,
       ii. use S_P and S_n to train a classifier with n_i features using AdaBoost,
       iii. evaluate the current cascaded classifier on the validation set to determine F_i and D_i,
       iv. decrease the threshold for the ith classifier until the current cascaded classifier has a detection rate of at least R_det × D_{i-1} (this also affects F_i);
   (4) S_n ← ∅;
   (5) if F_i > F_target, then evaluate the current cascaded detector on the set of non-skin images and put any false detections into the set S_n.
Fig 2 Skin-like detection using AdaBoost
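Two parts of the procedure above can be sketched in a few lines: how a cascade accepts a sample only when every stage accepts it, and how the negative set is rebuilt from the false detections of the current cascade (steps 9(4) and 9(5)). The sketch does not train AdaBoost stages; each stage is a hypothetical callable on a single made-up feature value, and all names are illustrative. The last function only counts how many stages the outer loop would need if every stage exactly met R_fp.

```python
from typing import Callable, List, Sequence

Stage = Callable[[float], bool]  # hypothetical stage: True means "skin-like"


def cascade_predict(stages: Sequence[Stage], x: float) -> bool:
    """A sample is accepted only if every stage of the cascade accepts it."""
    return all(stage(x) for stage in stages)


def rebuild_negatives(stages: Sequence[Stage],
                      non_skin_candidates: Sequence[float]) -> List[float]:
    """Keep only the non-skin candidates that the current cascade still accepts."""
    return [x for x in non_skin_candidates if cascade_predict(stages, x)]


def worst_case_stage_count(r_fp: float, f_target: float) -> int:
    """Stages needed until F_i <= F_target if each stage just meets R_fp."""
    f, i = 1.0, 0
    while f > f_target:
        i += 1
        f *= r_fp
    return i


# Toy cascade of two threshold stages on a made-up scalar feature.
stages = [lambda x: x > 0.2, lambda x: x > 0.5]
new_negatives = rebuild_negatives(stages, [0.1, 0.3, 0.6, 0.9])
stage_count = worst_case_stage_count(r_fp=0.5, f_target=0.01)
```

Because each later stage only sees negatives that earlier stages got wrong, the training effort concentrates on the hardest non-skin samples.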
INTEGRATION OF CLASSIFIERS
The following steps illustrate the fusion algorithm required to implement BLOQUANT.
a. Input the training set, train a text classifier and an image classifier, and then collect similarity measurements from the different classifiers.
b. Partition the interval of similarity measurements into sub-intervals.
c. Estimate the posterior probabilities conditioned on all the sub-intervals for the text classifier.
d. Estimate the posterior probabilities conditioned on all the sub-intervals for the image classifier.
e. For a new testing web page, classify it with the text classifier and the image classifier. If the two classifiers assign it to different categories, locate the sub-interval that the similarity measurement of the web page belongs to and execute step f; otherwise, execute step g.
f. Calculate the decision factor for the testing web page.
g. Return the final classification result to the user or the web browser.
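The steps above can be sketched as follows. The paper does not spell out the decision factor in this section, so the sketch makes a loud assumption: when the two classifiers disagree, it trusts the one whose sub-interval has the higher estimated posterior probability of being correct. All function names, the single cut point at 0.5, and the training scores are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def sub_interval(score: float, edges: Sequence[float]) -> int:
    """Index of the sub-interval containing score; edges are sorted cut points."""
    return bisect_right(edges, score)


def estimate_posteriors(scores: Sequence[float],
                        correct: Sequence[bool],
                        edges: Sequence[float]) -> List[float]:
    """Per sub-interval, the fraction of training pages classified correctly."""
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for score, ok in zip(scores, correct):
        b = sub_interval(score, edges)
        totals[b] += 1
        hits[b] += int(ok)
    # Empty sub-intervals fall back to 0.5, an arbitrary neutral value.
    return [h / t if t else 0.5 for h, t in zip(hits, totals)]


def fuse(text_label: str, text_score: float,
         image_label: str, image_score: float,
         text_post: Sequence[float], image_post: Sequence[float],
         edges: Sequence[float]) -> str:
    """Return the agreed label, or the label of the more reliable classifier."""
    if text_label == image_label:
        return text_label
    p_text = text_post[sub_interval(text_score, edges)]
    p_image = image_post[sub_interval(image_score, edges)]
    # Assumed decision factor: compare the two posteriors (ties favor text).
    return text_label if p_text >= p_image else image_label


edges = [0.5]  # splits [0, 1] into two sub-intervals
text_post = estimate_posteriors([0.1, 0.2, 0.8, 0.9],
                                [False, True, True, True], edges)
image_post = estimate_posteriors([0.3, 0.4, 0.6, 0.7],
                                 [True, True, False, True], edges)

high = fuse("unauthorized", 0.9, "normal", 0.8, text_post, image_post, edges)
low = fuse("unauthorized", 0.2, "normal", 0.3, text_post, image_post, edges)
```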
CONCLUSION
BLOQUANT presents a new framework for content filtering. The system consists of a text classifier, an image classifier, and a fusion algorithm. Based on the textual content, the text classifier classifies a given web page as illegal or normal; it is modeled by the naive Bayes rule. The image classifier, which relies on AdaBoost, efficiently detects the skin-like pixels present in the visual content of the requested page. The matching threshold used in both the text classifier and the image classifier is estimated using a probabilistic model derived from Bayesian theory. A novel data fusion model based on Bayesian theory was developed, and the corresponding integration algorithm is presented. The integration framework enables us to directly incorporate the multiple results produced by different classifiers. Future work will include adding more features to the content representation, such as video and audio streams, and investigating incremental learning models to solve the knowledge-updating problem of the current probabilistic model.
REFERENCES
[1] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in Proc. AAAI Workshop Learn. Text Categor., Madison, WI, Jul. 1998, pp. 41–48.
[2] W. Hu, O. Wu, Z. Chen, and S. Maybank, "Recognition of pornographic web pages by classifying texts and images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1019–1034, Jun. 2007.
[3] Shirali-Shahreza S, Mousavi M E. A new Bayesian classifier for skin detection [C]// Proceedings of 3rd International Conference on Innovative Computing Information and Control (ICICIC '08). Nanjing, China: IEEE, 2008: 18-20.
[4] Bishop C M. Neural networks for pattern recognition [M].
Oxford: Oxford University Press, 1996
[5] Chai D, Phung S L, Bouzerdoum A. A Bayesian skin/non-skin color classifier using non-parametric density estimation [C]// Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03). Nanjing, China: IEEE, 2003: 464-465.
[6] Chai D, Phung S L, Bouzerdoum A. A Bayesian skin/non-skin color classifier using non-parametric density estimation [C]// Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03). Nanjing, China: IEEE, 2003: 465-467.
[7] Jones M J, Rehg J M. Statistical color models with application to skin detection [J]. International Journal of Computer Vision, 2002, 46(1): 81-96.
[8] Wan, L. U. "Skin detection method based on cascaded
AdaBoost classifier." (2012).
[9] D. M. Gavrila, "A Bayesian, exemplar-based approach to hierarchical shape matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 8, pp. 1–14, Aug. 2007.
[10] M. Lalmas, "Dempster–Shafer's theory of evidence applied to structured documents: Modeling uncertainty," in Proc. 20th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, Philadelphia, PA, Jul. 1997, pp. 110–118.
[11] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[12] CBS News, Washington, April 23, 2010, "SEC Staffers Watched Porn as Economy Crashed," http://www.cbsnews.com
[13] Zhang, Haijun, et al. "Textual and visual content-based anti-
phishing: a Bayesian approach." Neural Networks, IEEE
Transactions on 22.10 (2011): 1532-1546.
[14] Liu, Gang, et al. "A WordNet-based Semantic Similarity Measure Enhanced by Internet-based Knowledge." Proceedings of the 23rd International Conference on Software Engineering & Knowledge Engineering (SEKE'2011), Eden Roc Renaissance, Miami Beach, USA, July 7-9, 2011.
[15] Marcial-Basilio, Jorge A., et al. "Detection of pornographic digital images." International journal of computers 2 (2010): 298-305.
[16] Tzeng, Yu-Chang, Kou-Tai Fan, and Kun-Shan Chen. "An
adaptive thresholding multiple classifiers system for
remote sensing image classification." Photogrammetric
Engineering and Remote Sensing 75.6 (2009): 679-687.
[17] Oliveira, V. A., and A. Conci. "Skin Detection using HSV color
space." H. Pedrini, & J. Marques de Carvalho, Workshops
of Sibgrapi. 2009.
[18] Ap-apid, Rigan. "An algorithm for nudity detection." 5th
Philippine Computing Science Congress. 2005.
[19] Cavalin, Paulo R., Robert Sabourin, and Ching Y. Suen.
"Dynamic selection approaches for multiple classifier
systems." Neural Computing and Applications (2011): 1-16.
[20] Bilal, Sara, et al. "Dynamic approach for real-time skin
detection." Journal of Real-Time Image Processing (2012): 1-
15.
[21] Basilio, Jorge Alberto Marcial, et al. "Explicit image detection using YCbCr space color model as skin detection." Proc. of the 2011 American conference on applied mathematics and the 5th WSEAS international conference on Computer engineering and applications. 2011.
[22] Lee, Pui Y., Siu C. Hui, and Alvis Cheuk M. Fong. "Neural
networks for web content filtering." Intelligent Systems,
IEEE 17.5 (2002): 48-57.
[23] Chai D, Phung S L, Bouzerdoum A. A Bayesian skin/non-skin color classifier using non-parametric density estimation [C]// Proceedings of the 2003 International Symposium on Circuits and Systems (ISCAS '03). Nanjing, China: IEEE, 2003: 464-465.
[24] Jones M J, Rehg J M. Statistical color models with application to skin detection [J]. International Journal of Computer Vision, 2002, 46(1): 81-96.
[25] D. M. Gavrila, "A Bayesian, exemplar-based approach to hierarchical shape matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 8, pp. 1–14, Aug. 2007.
[26] M. Lalmas, "Dempster–Shafer's theory of evidence applied to structured documents: Modeling uncertainty," in Proc. 20th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, Philadelphia, PA, Jul. 1997, pp. 110–118.
[27] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[28] CBS News, Washington, April 23, 2010, "SEC Staffers Watched Porn as Economy Crashed," http://www.cbsnews.com