Bloquant: Implementation of Content Filtering - A Collaborative Approach

International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 7, July 2013
ISSN: 2231-2803 http://www.ijcttjournal.org Page 2119

Nivedita VS
Department of Information Technology
PSNA College of Engineering and Technology
Dindigul
Dr. D. Shanthi
Department of Computer Science and Engineering
PSNA College of Engineering and Technology
Dindigul

Abstract: Internet usage has grown rapidly because of the
wealth of information it makes available. Adult sites and illegal
sites are inappropriate for the working environment and expose
companies and organizations to potential legal issues. A
straightforward way to filter such websites is to check the URL
a user requests against a block list of objectionable URLs
maintained on the server. This approach is effective only if the
block-list database is kept up to date. Another method filters
on the content of the website, but it is highly restrictive and
may misclassify web pages, restricting access to sites that are
not actually illegal. The proposed system, BLOQUANT, is a web
content filtering system based on a Bayesian approach. In this
system, web pages are classified according to the frequency of
occurrence of their contents. If the resulting probability for a
page exceeds the threshold, the site is blocked. The system
analyzes both the textual and the visual contents of the web
page. The main feature of this project is the exploration of a
collaborative Bayesian model to estimate the matching threshold.
Index Terms: Bayesian approach, Content filtering,
Threshold, URL filtering
INTRODUCTION
A web content filtering system controls what content a reader
is permitted to access, especially material delivered over
the Internet via the web, e-mail, or other means. Content-filtering
software determines what content will be available and
what content will be blocked. The restrictions can be applied at
various levels. The intention is often to prevent people from
viewing content that the computer's owner(s) or other
authorities consider objectionable. Many companies point
to legal liability, productivity, and bandwidth usage as
concerns that arise when employees view inappropriate
(typically pornographic) web sites, shop online incessantly
before the holidays, or download and play MP3 files throughout
the day.
Web-filtering systems are either client-based or server-based.
A client-based system performs web content filtering solely on
the computer where it is installed, without consulting remote
servers about the nature of the web content that a user tries to
access. A server-based system provides filtering to computers on
the local area network where it is installed. It screens outgoing
web requests, analyzes incoming web pages to determine their
content type, and blocks inappropriate material from reaching
the client's web browser.
This paper aims at implementing a server-based web filtering
system with artificial-intelligence options for blocking, which
provides effective content filtering based on the frequency of
occurrence of the content. The proposed method applies Bayes'
theorem to the page content to obtain a probability, and if this
probability is above the maintained threshold, the page is
blocked from the user.
RELATED WORK
Adult sites and illegal download sites are
inappropriate for the work environment; they present
companies and organizations with potential legal issues.
Pornography in the workplace is an uncomfortable topic for
most businesses. Company policies may already be in place to
let employees know that this type of behavior is not tolerated.
Access to illegal content can be prevented with web
filtering software. There are three major content filtering
approaches: PICS (Platform for Internet Content Selection), URL
blocking, and keyword filtering. In URL
blocking, the number of incorrectly classified nonpornographic
web pages is small compared to keyword filtering.
Systems that rely on keyword filtering tend to perform well on
pornographic web pages, but the percentage of incorrectly
classified nonpornographic web pages can be high. The rate of
false-positive detection with the existing system is 97.2%. These
approaches are less flexible and not self-adaptive. The
proposed approach overcomes these limitations of the existing
systems and is expected to achieve 91.1% true positives and
88.1% true negatives. The paper "Intelligent
Classification for Web Content Filtering" summarizes
the features of 10 popular web-filtering systems and their overall
accuracy results. The URLs of 200 pornographic and 300
nonpornographic web pages are used for evaluating five
representative systems: Cyber Patrol, Cyber Snoop,
CYBERsitter, SurfWatch, and WebChaperone. For the two
systems that employ URL blocking, the number of incorrectly
classified nonpornographic web pages is small compared to
those using keyword filtering.
However, both systems have fairly high occurrences
of incorrectly classified pornographic web pages. This
highlights the problem of keeping the black list up to date.
Systems that rely on keyword filtering tend to perform well on
pornographic web pages, but the percentage of incorrectly
classified nonpornographic web pages can be high. This
highlights the keyword approach's major shortcoming. The
performance of the five web filtering systems is given in Table 1.

TABLE 1 Comparison of Web Filtering Systems

PROPOSED SYSTEM
BLOQUANT is a web content filtering system based on a
Bayesian approach. In this system, web pages are classified
according to their content representations. The system takes into
account both textual and visual contents to detect offensive web
pages. The main feature of this paper is the exploration of a
Bayesian model to estimate the matching threshold, which the
classifier requires to determine the class of a web
page and decide whether it is illegal. In the text
classifier, the naive Bayes rule is used to calculate the
probability that a web page is offensive. The continuous text
classifier outperforms the traditional keyword-statistics-based
classifier. In the image classifier, a cascaded adaptive
boosting (AdaBoost) classifier is used, which combines a
minimum-risk Bayesian classifier with models in
different color spaces: HSV (hue-saturation-value),
YCgCb (brightness-green-blue) and YCgCr (brightness-green-red).
The image classifier outperforms the traditional
skin-region-based image classifier. A Bayesian model is designed
to determine the threshold.
Overview
In this paper, a novel framework for recognizing
unauthorized web pages is proposed. The content of a given
web page is divided into two categories, textual and visual,
each of which is handled by the corresponding
classifier as shown in Fig. 1. The proposed approach contains
the following components.
A text classifier - uses the naive Bayes rule to
handle the text content extracted from a given web
page.
An image classifier - uses the AdaBoost classifier
to handle the pixel-level content of the images from a
given web page.
A Bayesian approach - estimates the threshold
used in the classifiers.
A data integration algorithm - combines the
results from the text classifier and the image
classifier.

Fig 1 Overview of BLOQUANT
TEXT CLASSIFIER
Preprocessing
A web filtering system uses a text classification
approach to identify desirable and undesirable pages by
analyzing their textual content. In this paper, we extract all
main words of the requested web page by first separating
them from the HTML tags. For exact correspondence of textual
content, we suggest using naive word-based extraction.
Given a web page, we then form a histogram vector
$(h_1, h_2, \ldots, h_n)$, where each component represents a term
frequency and $n$ denotes the total number of components in
the vector.
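The preprocessing step above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in BLOQUANT: the class name `TextExtractor`, the helper names, the sample page and the four-word vocabulary are all hypothetical, and skipping `<script>` and `<style>` bodies is an added assumption about what counts as page text.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text outside tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words(html):
    """Separate the text from the HTML tags and return lowercase word tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.chunks).lower())


def histogram_vector(words, vocabulary):
    """Return (h_1, ..., h_n): the term frequency of each vocabulary word."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


# Hypothetical page and vocabulary, for illustration only.
page = ("<html><body><h1>Free casino</h1><p>Casino bonus, free spins!</p>"
        "<script>var x = 1;</script></body></html>")
vocabulary = ["casino", "free", "bonus", "news"]
words = extract_words(page)
print(histogram_vector(words, vocabulary))
```

Words that are not in the vocabulary (here "spins") are simply ignored by the histogram, which matches the fixed-length vector $(h_1, \ldots, h_n)$ described above.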
Bayesian Classifier
In this paper, the Bayes classifier is used to classify
the text content of web pages. In the classifying process, the
Bayes classifier outputs the probabilities that a web page
belongs to each of the categories. These probabilities
can also be regarded as the similarities or dissimilarities
between the given web page and the protected web page. Let
$W = \{w_1, w_2, \ldots, w_j, \ldots, w_c\}$ denote the set of web
page categories, where $c$ is the total number of categories.

[Fig. 1 block labels: Web Page, Pre Processing, Text, Image, Text Classifier, Image Classifier, Result Integration, Illegal?, Web Browser]


In fact, the web filtering problem involves only two categories:
the unauthorized web page category $w_1$ and
the normal web page category $w_2$. Given a variable vector
$(v_1, v_2, \ldots, v_n)$ of a web page, the classifier is employed
to determine the probability $P(w_j \mid v_1, v_2, \ldots, v_n)$ that the
web page belongs to category $w_j$. Applying the Bayes rule,
the posterior probability $P(w_j \mid v_1, v_2, \ldots, v_n)$ is calculated
by

$$P(w_j \mid v_1, v_2, \ldots, v_n) = \frac{P(v_1, v_2, \ldots, v_n \mid w_j)\, P(w_j)}{P(v_1, v_2, \ldots, v_n)} \tag{1}$$

where the prior probability $P(w_j)$ is estimated by the
frequency of the training samples belonging to category $w_j$.
It is difficult to directly estimate the conditional probability
$P(v_1, v_2, \ldots, v_n \mid w_j)$, because the data samples are sparsely
distributed in a high-dimensional space. However, since we
ignore the semantic associations among terms, the naive
Bayes classifier [1], [2] is used to handle the issue. Naive
Bayesian theory assumes that all the components in the
histogram vector are independent of one another. Thus
the conditional probability is represented by

$$P(v_1, v_2, \ldots, v_n \mid w_j) = \prod_{i=1}^{n} P(v_i \mid w_j) \tag{2}$$

The joint probability $P(v_1, v_2, \ldots, v_n)$ is described by

$$P(v_1, v_2, \ldots, v_n) = \sum_{j=1}^{c} P(v_1, v_2, \ldots, v_n \mid w_j)\, P(w_j) \tag{3}$$

where $c$ denotes the number of categories.
Then the posterior probability $P(w_j \mid v_1, v_2, \ldots, v_n)$ is
transformed into

$$P(w_j \mid v_1, v_2, \ldots, v_n) = \frac{P(w_j) \prod_{i=1}^{n} P(v_i \mid w_j)}{\sum_{s=1}^{c} P(w_s) \prod_{i=1}^{n} P(v_i \mid w_s)} \tag{4}$$
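As a worked illustration of this posterior (prior times the product of per-component likelihoods, normalized over the categories), consider the two-category case with $n = 2$ components. All numbers are made up for illustration and are not taken from any experiment: $P(w_1) = 0.3$, $P(w_2) = 0.7$, $P(v_1 \mid w_1) = 0.8$, $P(v_2 \mid w_1) = 0.5$, $P(v_1 \mid w_2) = 0.1$ and $P(v_2 \mid w_2) = 0.2$.

```latex
\begin{aligned}
P(w_1 \mid v_1, v_2)
  &= \frac{P(w_1)\,P(v_1 \mid w_1)\,P(v_2 \mid w_1)}
          {P(w_1)\,P(v_1 \mid w_1)\,P(v_2 \mid w_1) + P(w_2)\,P(v_1 \mid w_2)\,P(v_2 \mid w_2)} \\
  &= \frac{0.3 \cdot 0.8 \cdot 0.5}{0.3 \cdot 0.8 \cdot 0.5 + 0.7 \cdot 0.1 \cdot 0.2}
   = \frac{0.12}{0.134} \approx 0.896
\end{aligned}
```

Although the prior favors the normal category $w_2$, the two likelihoods pull the posterior strongly toward $w_1$, and $P(w_2 \mid v_1, v_2) \approx 0.104$ follows because the two posteriors sum to one.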
Implementation of Text Classifier
Let $T_j$ be the set of training web pages belonging to the
category $w_j$, where $T_j = \{t_{j,1}, t_{j,2}, \ldots, t_{j,k_j}\}$ and $k_j$ is the number
of web pages in the set $T_j$. The conditional probability of the $i$th
word in the vocabulary $(a_1, a_2, \ldots, a_n)$ is given by

$$P(a_i \mid w_j) = \frac{1 + \sum_{l=1}^{k_j} h_{l,i}}{\sum_{m=1}^{n} \sum_{l=1}^{k_j} h_{l,m}} \tag{5}$$

where $H_l = (h_{l,1}, h_{l,2}, \ldots, h_{l,n})$ denotes the histogram
vector of the $l$th page in the set $T_j$ corresponding to the word
vocabulary. The estimated probability $P(w_j)$ is determined by

$$P(w_j) = \frac{k_j}{\sum_{s=1}^{c} k_s} \tag{6}$$

where the sum runs over all $c$ categories.
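The two estimates can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the function name `train` and the toy histograms are hypothetical. One deliberate departure is flagged in the code: the denominator of the word estimate adds the vocabulary size $n$, the usual add-one (Laplace) normalization as in the multinomial model of [1], so that the estimates for each category sum to one.

```python
def train(histograms_by_category):
    """Estimate priors and word likelihoods from training histograms.

    histograms_by_category[j] holds the histogram vectors H_l of the pages in T_j.
    """
    n = len(histograms_by_category[0][0])          # vocabulary size
    total_pages = sum(len(pages) for pages in histograms_by_category)
    priors, conditionals = [], []
    for pages in histograms_by_category:
        # Prior, Eq. (6): share of training pages in this category.
        priors.append(len(pages) / total_pages)
        # Total count of each vocabulary word over the pages of the category.
        word_counts = [sum(h[i] for h in pages) for i in range(n)]
        # ASSUMPTION: "n +" is added here (standard add-one smoothing) so the
        # estimates sum to one; this term is not shown in the formula above.
        denominator = n + sum(word_counts)
        conditionals.append([(1 + c) / denominator for c in word_counts])
    return priors, conditionals


# Toy training data (hypothetical): 3-word vocabulary, 2 categories.
unauthorized_pages = [[2, 1, 0], [1, 1, 0]]
normal_pages = [[0, 0, 3], [0, 1, 2], [0, 0, 1]]
priors, conditionals = train([unauthorized_pages, normal_pages])
print(priors)
print(conditionals)
```

The added 1 in the numerator keeps every likelihood strictly positive, so a word never seen in one category cannot force that category's posterior to zero.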
Thus, for the requested page $T$, the probability $P(w_j \mid T)$ that the
requested page $T$ belongs to the category $w_j$ is calculated by

$$P(w_j \mid T) = \frac{P(w_j) \prod_{i=1}^{n} P(a_i \mid w_j)^{h_{i,T}/R}}{\sum_{s=1}^{c} P(w_s) \prod_{i=1}^{n} P(a_i \mid w_s)^{h_{i,T}/R}} \tag{7}$$

where $h_{i,T}$ represents the frequency of the $i$th word appearing
in the web page $T$, $R$ is the total number of words
extracted from the page, and $c$ is the number of categories. If the
probability $P(w_1 \mid T)$ exceeds the threshold $\theta_T$, the page is
classified as unauthorized; otherwise it
is normal. The value of $\theta_T$ is estimated later using Bayesian
theory.
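A minimal sketch of this classification rule is given below. It works in log space to avoid underflow when the vocabulary is large. The function names, the likelihood values and the threshold of 0.5 are hypothetical; in particular, the paper estimates the threshold with a Bayesian model that is not reproduced here.

```python
import math


def page_posteriors(priors, conditionals, page_hist, total_words=None):
    """P(w_j | T) for every category, each word likelihood raised to h_{i,T} / R."""
    # R defaults to the number of vocabulary words found on the page (assumption).
    R = total_words if total_words is not None else sum(page_hist)
    if R == 0:
        return list(priors)                 # no words: fall back to the priors
    log_scores = [
        math.log(prior) + sum((h / R) * math.log(p) for h, p in zip(page_hist, cond))
        for prior, cond in zip(priors, conditionals)
    ]
    top = max(log_scores)                   # subtract the maximum before exponentiating
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def is_unauthorized(posteriors, theta_t):
    """Index 0 plays the role of w_1, the unauthorized category."""
    return posteriors[0] > theta_t


# Hypothetical model: 2 categories, 3-word vocabulary.
priors = [0.4, 0.6]
conditionals = [[0.5, 0.375, 0.125], [0.1, 0.2, 0.7]]
page_hist = [3, 1, 0]                       # term frequencies h_{i,T} of the requested page
THETA_T = 0.5                               # hypothetical threshold value

posteriors = page_posteriors(priors, conditionals, page_hist)
print(posteriors, is_unauthorized(posteriors, THETA_T))
```

Dividing each term frequency by $R$ makes the score independent of page length, so a long page does not produce a more extreme posterior than a short page with the same word proportions.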
IMAGE CLASSIFIER
Text content alone is not enough to detect malicious web
pages, and relying on it alone results in many false positives.
Most adult websites contain both text and image content.
In this paper we use the algorithm of [8], which classifies
skin-like pixels with a cascaded AdaBoost algorithm. This
classifier uses models in three color spaces, HSV,
YCgCb and YCgCr, as shown in Fig. 2.
Bayesian Classifier
On its own, a Bayesian classifier is sensitive to varying lighting
conditions, which may result in a high false-positive rate. The
method used here detects skin-like pixels
by combining the Bayesian classifier with models in the HSV,
YCgCb and YCgCr color spaces through the AdaBoost algorithm. We
tested the ratio of skin area to image area. The method
reduces the false-positive rate while maintaining a high
true-positive rate.
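The ratio of skin area to image area can be sketched as below. The per-pixel test is only a stand-in for the cascaded classifier: it uses a simple HSV rule whose bounds (hue at most 50 degrees, saturation between 0.23 and 0.68) are illustrative assumptions and are not values from this paper. The function names and the sample pixels are hypothetical as well.

```python
import colorsys


def is_skin_like(rgb):
    """Stand-in pixel test in HSV space; the bounds are illustrative assumptions."""
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return hue * 360.0 <= 50.0 and 0.23 <= saturation <= 0.68


def skin_area_ratio(pixels):
    """Fraction of pixels labeled skin-like; 0.0 for an empty image."""
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if is_skin_like(p)) / len(pixels)


# Four hypothetical RGB pixels: two skin tones, pure blue, white.
pixels = [(224, 172, 138), (255, 205, 170), (0, 0, 255), (255, 255, 255)]
print(skin_area_ratio(pixels))
```

In a full system the predicate would be replaced by the trained cascade, and the resulting ratio would serve as the image-side score passed on to the integration step.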
If $P(w_i \mid X)$ is the posterior probability of the skin or non-skin
category, and the cost of classification is $C_{ij}$, then the pixel is
assigned to the category that has the least cost. According to
naive Bayes theory,

$$\frac{P(X \mid w_1)}{P(X \mid w_2)} > \tau, \quad X \in w_1 \tag{8}$$

$$\frac{P(X \mid w_1)}{P(X \mid w_2)} \le \tau, \quad X \in w_2 \tag{9}$$

where $\tau$ is the threshold of classification, calculated by

$$\tau = \frac{(C_{ij} - C_{jj})\, P(w_j)}{(C_{ji} - C_{ii})\, P(w_i)} \tag{10}$$

where $P(w_i)$ is the prior probability, with $i = 1$ and $j = 2$.
$P(X \mid w_i)$ represents the likelihood of skin color and
$P(X \mid w_j)$ represents the likelihood of non-skin color.
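The decision rule can be illustrated with a short sketch. It assumes the usual minimum-risk convention that the cost entry for row a and column b is the cost of deciding category a when the true category is b; the paper does not state its convention, so this is an assumption. The function names, the cost matrix, the priors and the likelihood values are hypothetical.

```python
def risk_threshold(cost, prior_skin, prior_nonskin):
    """Threshold from the cost matrix and priors, with i = 1 (skin) and j = 2 (non-skin).

    cost[a][b] is assumed to be the cost of deciding category a+1 when the
    true category is b+1 (0-based list indices).
    """
    c11, c12 = cost[0]
    c21, c22 = cost[1]
    return ((c12 - c22) * prior_nonskin) / ((c21 - c11) * prior_skin)


def classify_pixel(likelihood_skin, likelihood_nonskin, tau):
    """Label a pixel skin when P(X|w_1) / P(X|w_2) exceeds tau.

    The ratio test is written as a product so that a zero non-skin
    likelihood does not cause a division by zero.
    """
    return "skin" if likelihood_skin > tau * likelihood_nonskin else "non-skin"


# Hypothetical costs: a false skin detection costs 2, a missed skin pixel costs 1.
cost = [[0.0, 2.0],
        [1.0, 0.0]]
tau = risk_threshold(cost, prior_skin=0.2, prior_nonskin=0.8)
print(tau)
print(classify_pixel(0.09, 0.01, tau))   # likelihood ratio 9
print(classify_pixel(0.05, 0.01, tau))   # likelihood ratio 5
```

Raising the cost of a false skin detection, or lowering the skin prior, raises the threshold and makes the classifier more conservative about labeling a pixel as skin.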
AdaBoost Classifier
Cascaded AdaBoost allows designers to keep adding
new stages of weak learners until the error rate drops
to a tolerable range. In this method, two
training sets are prepared, labeled skin and non-skin. Every sample
is given a weight, which is used to calculate the weight of each weak
learner according to the classification results of all the samples.
After each stage has been trained, the positive training set stays
the same for the next stage, but the negative set is
reconstructed so that only the incorrectly classified
non-skin samples remain. To maintain the total
number of samples, the current cascaded classifier is used to
classify some candidate negative samples, and the
misrecognized ones are added to the training set. As a result,
the classifier lowers the false positive (FP) rate step by
step while keeping a high true positive (TP) rate.
To combine all the strong classifiers, the cascaded
AdaBoost is used following the procedure below.
1. Select values for R_fp, the maximum acceptable
false positive rate per stage, and R_det, the minimum
acceptable detection rate per stage.
2. Select the target overall false positive rate, F_target.
3. S_p = set of positive examples.
4. S_n = set of negative examples.
5. n_i = the number of features in stage i.
6. F_i = the false positive rate of the current cascaded
AdaBoost classifier.
7. D_i = the true positive rate of the current cascaded
AdaBoost classifier.
8. Initialize F_0 = 1.0, D_0 = 1.0 and i = 0.
9. While F_i > F_target:
   (1) i ← i + 1;
   (2) n_i = 0, F_i = F_{i-1};
   (3) while F_i > R_fp × F_{i-1}:
       i. n_i ← n_i + 1;
       ii. use S_p and S_n to train a classifier
       with n_i features using AdaBoost;
       iii. evaluate the current cascaded
       classifier on the validation set to
       determine F_i and D_i;
       iv. decrease the threshold for the ith
       classifier until the current cascaded
       classifier has a detection rate of at
       least R_det × D_{i-1} (this also affects F_i);
   (4) S_n ← ∅;
   (5) if F_i > F_target, then evaluate the current
   cascaded detector on the set of non-skin
   images and put any false detections into the
   set S_n.
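The procedure above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions that are not in the paper: weak learners are single-feature threshold stumps, the training sets double as the validation set, and the caps `max_features` and `max_stages` are added so that the loops always terminate. All names and the toy two-feature samples are hypothetical.

```python
import math


def stump(x, feat, thr, sign):
    """Weak learner: vote `sign` if feature `feat` is at least `thr`, else `-sign`."""
    return sign if x[feat] >= thr else -sign


def train_adaboost(pos, neg, n_features):
    """Discrete AdaBoost selecting `n_features` stumps: [(alpha, feat, thr, sign)]."""
    data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
    weights = [1.0 / len(data)] * len(data)
    candidates = [(f, x[f], s) for x, _ in data for f in range(len(x)) for s in (1, -1)]
    model = []
    for _ in range(n_features):
        def weighted_error(c):
            return sum(w for w, (x, y) in zip(weights, data) if stump(x, *c) != y)
        best = min(candidates, key=weighted_error)
        err = min(max(weighted_error(best), 1e-9), 1 - 1e-9)   # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha,) + best)
        weights = [w * math.exp(-alpha * y * stump(x, *best))
                   for w, (x, y) in zip(weights, data)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def score(model, x):
    return sum(a * stump(x, f, t, s) for a, f, t, s in model)


def accepts(cascade, x):
    """A sample is labeled skin only if every stage accepts it."""
    return all(score(m, x) >= thr for m, thr in cascade)


def rates(cascade, pos, neg):
    """Return (F, D): false positive rate and detection rate of the cascade."""
    f = sum(accepts(cascade, x) for x in neg) / len(neg)
    d = sum(accepts(cascade, x) for x in pos) / len(pos)
    return f, d


def train_cascade(pos, neg_pool, r_fp, r_det, f_target, max_features=5, max_stages=5):
    cascade, neg = [], list(neg_pool)
    f_prev, d_prev = 1.0, 1.0                                  # F_0 = D_0 = 1.0
    while f_prev > f_target and neg and len(cascade) < max_stages:
        n_i, f_i, d_i = 0, f_prev, d_prev
        cascade.append(None)                                   # placeholder for stage i
        while f_i > r_fp * f_prev and n_i < max_features:
            n_i += 1
            model = train_adaboost(pos, neg, n_i)
            # Start at threshold 0 and lower it until D_i >= R_det * D_{i-1}.
            lower = sorted({score(model, x) for x in pos if score(model, x) < 0},
                           reverse=True)
            for thr in [0.0] + lower:
                cascade[-1] = (model, thr)
                f_i, d_i = rates(cascade, pos, neg_pool)
                if d_i >= r_det * d_prev:
                    break
        # Rebuild S_n from the false detections of the current cascade.
        neg = [x for x in neg_pool if accepts(cascade, x)]
        f_prev, d_prev = f_i, d_i
    return cascade, f_prev, d_prev


# Toy two-feature samples (hypothetical): skin needs BOTH features to be high.
skin = [(0.9, 0.9), (0.8, 0.8), (0.7, 0.9)]
non_skin = [(0.1, 0.1), (0.2, 0.9), (0.3, 0.2), (0.9, 0.1)]
cascade, f_rate, d_rate = train_cascade(skin, non_skin, r_fp=0.5, r_det=0.95, f_target=0.0)
print(len(cascade), f_rate, d_rate)
```

On this toy data no single stump separates the classes, so the first stage leaves one false detection and the second stage, trained only on that remaining negative, removes it. This mirrors the way the negative set is rebuilt from the false detections after every stage.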

Fig 2 Skin-like detection using AdaBoost
INTEGRATION OF CLASSIFIERS
The following steps illustrate the fusion algorithm
required in implementing BLOQUANT.
a. Input the training set, train a text classifier and an image
classifier, and then collect similarity measurements from
the different classifiers.
b. Partition the interval of similarity measurements into
sub-intervals.
c. Estimate the posterior probabilities conditioned on all the
sub-intervals for the text classifier.
d. Estimate the posterior probabilities conditioned on all the
sub-intervals for the image classifier.
e. For a new testing web page, classify it into the corresponding
category by using the text classifier and the image classifier. If
the two classifiers assign it to different categories, locate the
sub-interval that the similarity measurement of the web page
belongs to and execute step f; otherwise, execute step g.
f. Calculate the decision factor for the testing web page.
g. Return the final classification results to a user or a web
browser.
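One possible reading of these steps is sketched below. The paper does not define the decision factor, so the rule used here is purely an assumption made for illustration: when the classifiers disagree, the decision goes to the classifier whose sub-interval posterior supports its own vote more strongly. Equal-width sub-intervals of [0, 1], the function names, the thresholds and the training scores are hypothetical too.

```python
def bin_index(score, bins):
    """Index of the equal-width sub-interval of [0, 1] that contains `score`."""
    return min(int(score * bins), bins - 1)


def interval_posteriors(scores, labels, bins):
    """Estimate P(unauthorized | sub-interval) from labeled training scores.

    labels use 1 for unauthorized and 0 for normal.
    """
    hits, totals = [0] * bins, [0] * bins
    for s, y in zip(scores, labels):
        k = bin_index(s, bins)
        totals[k] += 1
        hits[k] += y
    # Empty sub-intervals fall back to 0.5 (no evidence either way).
    return [h / t if t else 0.5 for h, t in zip(hits, totals)]


def fuse(text_score, image_score, text_post, image_post, theta_t, theta_v):
    """Return True when the page is finally classified as unauthorized."""
    text_vote = text_score > theta_t
    image_vote = image_score > theta_v
    if text_vote == image_vote:                 # classifiers agree: return directly
        return text_vote
    # ASSUMED decision factor: trust the classifier whose sub-interval
    # posterior gives more support to its own vote.
    p_text = text_post[bin_index(text_score, len(text_post))]
    p_image = image_post[bin_index(image_score, len(image_post))]
    conf_text = p_text if text_vote else 1.0 - p_text
    conf_image = p_image if image_vote else 1.0 - p_image
    return text_vote if conf_text >= conf_image else image_vote


# Hypothetical training scores and labels for both classifiers, 4 sub-intervals.
text_post = interval_posteriors([0.1, 0.2, 0.4, 0.6, 0.7, 0.9], [0, 0, 0, 1, 0, 1], 4)
image_post = interval_posteriors([0.05, 0.3, 0.35, 0.55, 0.8, 0.95], [0, 0, 1, 1, 1, 1], 4)
print(text_post, image_post)
print(fuse(0.65, 0.2, text_post, image_post, theta_t=0.5, theta_v=0.5))
print(fuse(0.9, 0.3, text_post, image_post, theta_t=0.5, theta_v=0.5))
```

In the first call the text classifier votes unauthorized from a sub-interval where only half of the training pages were unauthorized, while the image classifier votes normal from a sub-interval with no unauthorized training pages, so the image vote prevails. In the second call the roles are reversed and the text vote prevails.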
CONCLUSION
BLOQUANT presents a new framework for content
filtering. The system consists of a text classifier, an
image classifier, and a fusion algorithm. Based on the
textual content, the text classifier classifies a given
web page as illegal or normal.
The text classifier is modeled by the naive Bayes rule. The
image classifier, which relies on AdaBoost, efficiently
detects the skin-like pixels present in the visual content of the
requested page. The matching threshold used in
both the text classifier and the image classifier is estimated
using a probabilistic model derived from Bayesian theory. A
novel data fusion model based on Bayesian theory was
developed, and the corresponding integration algorithm is
presented. The integration framework makes it possible to
directly combine the results produced by the different
classifiers. Future work will include adding more
features to the content representations, such as video
streams and audio, and investigating incremental learning
models to solve the knowledge-updating problem in the
current probabilistic model.
REFERENCES
[1] A. McCallum and K. Nigam, "A comparison of event models
for naive Bayes text classification," in Proc. AAAI Workshop
Learn. Text Categor., Madison, WI, Jul. 1998, pp. 41-48.
[2] W. Hu, O. Wu, Z. Chen, and S. Maybank, "Recognition of
pornographic web pages by classifying texts and images," IEEE
Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1019-1034,
Jun. 2007.
[3] S. Shirali-Shahreza and M. E. Mousavi, "A new Bayesian
classifier for skin detection," in Proc. 3rd International
Conference on Innovative Computing Information and Control
(ICICIC '08), Nanjing, China: IEEE, 2008, pp. 18-20.
[4] C. M. Bishop, Neural Networks for Pattern Recognition.
Oxford: Oxford University Press, 1996.
[5] D. Chai, S. L. Phung, and A. Bouzerdoum, "A Bayesian
skin/non-skin color classifier using non-parametric density
estimation," in Proc. 2003 International Symposium on
Circuits and Systems (ISCAS '03), Nanjing, China: IEEE, 2003,
pp. 464-465.
[7] M. J. Jones and J. M. Rehg, "Statistical color models with
application to skin detection," International Journal of
Computer Vision, vol. 46, no. 1, pp. 81-96, 2002.
[8] L. U. Wan, "Skin detection method based on cascaded
AdaBoost classifier," 2012.
[9] D. M. Gavrila, "A Bayesian, exemplar-based approach to
hierarchical shape matching," IEEE Trans. Pattern Anal. Mach.
Intell., vol. 29, no. 8, pp. 1-14, Aug. 2007.
[10] M. Lalmas, "Dempster-Shafer's theory of evidence applied to
structured documents: Modeling uncertainty," in Proc. 20th
Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval,
Philadelphia, PA, Jul. 1997, pp. 110-118.
[11] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On
combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 20, no. 3, pp. 226-239, Mar. 1998.
[12] CBS News, "SEC Staffers Watched Porn as Economy Crashed,"
Washington, Apr. 23, 2010, http://www.cbsnews.com
[13] H. Zhang et al., "Textual and visual content-based
anti-phishing: a Bayesian approach," IEEE Transactions on
Neural Networks, vol. 22, no. 10, pp. 1532-1546, 2011.
[14] G. Liu et al., "A WordNet-based Semantic Similarity
Measure Enhanced by Internet-based Knowledge," in Proceedings
of the 23rd International Conference on Software Engineering &
Knowledge Engineering (SEKE'2011), Eden Roc Renaissance,
Miami Beach, USA, Jul. 7-9, 2011.
[15] J. A. Marcial-Basilio et al., "Detection of pornographic
digital images," International Journal of Computers, vol. 2,
pp. 298-305, 2010.
[16] Y.-C. Tzeng, K.-T. Fan, and K.-S. Chen, "An
adaptive thresholding multiple classifiers system for
remote sensing image classification," Photogrammetric
Engineering and Remote Sensing, vol. 75, no. 6, pp. 679-687, 2009.
[17] V. A. Oliveira and A. Conci, "Skin Detection using HSV color
space," in H. Pedrini and J. Marques de Carvalho, Workshops
of Sibgrapi, 2009.
[18] R. Ap-apid, "An algorithm for nudity detection," in 5th
Philippine Computing Science Congress, 2005.
[19] P. R. Cavalin, R. Sabourin, and C. Y. Suen,
"Dynamic selection approaches for multiple classifier
systems," Neural Computing and Applications, pp. 1-16, 2011.
[20] S. Bilal et al., "Dynamic approach for real-time skin
detection," Journal of Real-Time Image Processing, pp. 1-15,
2012.
[21] J. A. M. Basilio et al., "Explicit image detection
using YCbCr space color model as skin detection," in Proc. of the
2011 American Conference on Applied Mathematics and
the 5th WSEAS International Conference on Computer
Engineering and Applications, 2011.
[22] P. Y. Lee, S. C. Hui, and A. C. M. Fong, "Neural
networks for web content filtering," IEEE Intelligent Systems,
vol. 17, no. 5, pp. 48-57, 2002.
