
International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 3, Issue 1, January - 2016. ISSN 2348 4853, Impact Factor 1.317

Noise Reduction in Web Pages Using Featured DOM Tree.


Athira Paniker1, Surabhi Panicker2, Preetika Ravkhande3, Nazneen Tamboli4, Prof. Renuka Puntambekar
Dept. of Computer Engineering, MIT Academy Of Engineering, Savitribai Phule Pune University, Pune, India
1adhira.kumar7@gmail.com, 2iamsurabhi24@gmail.com, 3pravkhande@gmail.com, 4nazneentamboli@gmail.com

ABSTRACT
The Web contains a vast amount of data, and web pages contain a large number of information blocks. Apart from the main content blocks, a page usually has blocks such as copyright notices, navigation panels, privacy notices, and advertisements. Such blocks, which are not the main content of a web page, can be called noisy blocks. These items sometimes serve a useful function for human viewers and are important for web site owners. However, they often hamper automated information retrieval and web data mining. It has become a significant challenge for researchers to implement new web mining techniques for gathering information from the Web. Since the Web is a repository of massive information, the amount of data grows rapidly at an exponential rate and often comes along with an excess amount of noise. Such noisy terms reduce the efficiency of feature extraction and the final classification accuracy. Therefore, it becomes critical to clean web pages before mining in order to improve mining results. Our project focuses on identifying and removing local noise in websites or HTML pages so as to improve the overall performance of web mining. We propose a novel idea for the easy detection and removal of local noise using a new tree structure called the featured Document Object Model (DOM) tree. We use a three stage algorithm wherein feature selection is performed in the first phase, the web page is modelled into a featured DOM tree in the second phase, and noise is marked and pruned in the third phase. The output produced is a clean web page.
Index Terms: Feature Extraction, Web Data Mining, Local noise, DOM Tree, Clean Web Page,
Information Retrieval.

I. INTRODUCTION

Analyzing large sets of data and extracting useful information from them is called Data Mining; it can also be defined as mining information from data. An enormous amount of data is available in the field of Information Technology, from which useful data needs to be extracted. Various applications such as market analysis, customer retention, production control, fraud detection, and science exploration make use of this informative data. Extraction of useful information from web content is called Web Content Mining. Topic discovery, extraction of association patterns, clustering of web documents, and classification of web pages are research issues addressed in text mining. A significant amount of work has been done on extracting useful information from images in the field of image processing, whereas comparatively little research has been done on web content mining. The application of data mining methods to detect interesting usage patterns from Web data, in order to understand them and better serve the requirements of Web-based applications, is called Web Usage Mining. Identifying correct and relevant information in web pages has become a difficult task due to the explosive and rapid growth of content on the World Wide Web and the noise within the web pages.
Noises such as advertisements, banners, and privacy notices surround the informative web content. These noises can be categorized as follows. Global Noise: Noises such as mirror sites and duplicated web pages, which are no smaller than a single web page, are called global noise. Local Noise: Noises such as navigation panels,
1 | 2016, IJAFRC All Rights Reserved

www.ijfarc.org


advertisements, links, and banners, which are present within a web page, are called local noise. Our project focuses on the removal of local noise.
II. LITERATURE SURVEY
The Classification based Cleaning method [1] is a simple method used for web page cleaning, i.e. to detect and eliminate specific noise from a web page using pattern classification techniques. The method is semi-automatic and supervised. Noise is detected with the help of a decision tree classifier, a classic machine learning technique used in many research fields. The limitation of this technique is that it can eliminate only certain noise items.
Lin and Ho [2] propose a Segmentation based Cleaning method, a supervised method that categorizes the contents of a web page as distinguished contents and common contents. Distinguished content is the informative content, while common content is the redundant content. The drawback of this method is that it takes into consideration only the data contents.
Bar-Yossef and Rajagopalan [3] propose a Template based Cleaning method, which is automatic and unsupervised. This method considers the template of a web page as noise. A set of web pages is called a cluster; the cluster is passed as input, and the templates are cleaned from it. The method is efficient only for clusters consisting of web pages from different sites.
The SST based Cleaning technique [4] is a partially supervised cleaning technique that combines the Segmentation based and Template based Cleaning techniques, analysing both the layout and the contents. The Site Style Tree is a generalized DOM tree representation that can be used to model HTML and XML web pages, and the structure is useful for detecting and eliminating noise from web pages. However, the content and presentation styles of some web sites with dynamic web pages are not common, so detecting noise in such sites with this technique is difficult. The technique is also less successful in detecting noise patterns that differ from the expected ones.
The Feature Weighting Based Cleaning Method [5] is an improved version of the Site Style Tree. It is an automatic, unsupervised noise cleaning method that uses a tree based approach combining features based on HTML content, visual representation, and tree structure. Elements within each document's tree are merged if their child elements share identical attributes, attribute values, and tag names. The weight of an element is calculated from the number of different presentation styles, and the resulting element weights are used in follow-up tasks such as classification. The efficiency of this approach depends upon the availability of a relatively large number of web documents from a limited number of data sources.
Kao et al. [6] make use of the HITS (Hyperlink Induced Topic Search) algorithm, which evaluates the importance of the hyperlinks present in a web page. Its drawback is that it only rates the web page but does not eliminate noise.
Kao, Ho, and Chen [7] propose InfoDiscoverer, which extracts information from a set of tabular documents within a web site. Its limitation is that it is applicable only to web sites with tabular documents; it is not applicable to web documents of sites that contain no tabular documents.
With the help of page layout features and some heuristic rules, VIPS [8] detects and eliminates noise from a web page at the semantic level. Its drawback is that it is resource intensive.

Thanda Htwe et al. [9] propose a mechanism to detect and eliminate redundant and irrelevant data using Case Based Reasoning. It analyses multiple noise patterns in web pages and uses a back propagation neural network algorithm to match the current noise against expected noise patterns, after which the noise is eliminated from the current page. It takes into consideration only the contents of the web page and not the layout; therefore other noise items such as images and advertisements are not taken into consideration.
Guohua Hu and Qingshan Zhao [10] propose a new tree structure called the Style Tree, which captures the actual contents and the common layouts or presentation styles of pages in a web site. It generates a style tree from the web page provided as input, determines and marks which parts of the style tree are noisy using an information based matching mechanism, deletes the parts marked as noisy, and outputs a clean web page.
The technique proposed in this paper overcomes the drawbacks of all the previously mentioned techniques and provides additional advantages: a DOM tree can be implemented in any programming language, its data structure is easy to modify, and data is easy to extract from it. [11]
III. PROJECT IDEA
Detection and elimination of noise can be implemented as a pre-processing step for web content mining. The objective of this project is to detect and eliminate noisy items from a web page so that complexity is reduced and efficiency is increased during processing. The three phases of the algorithm that we are implementing are featuring, modelling, and pruning. The algorithm combines different weighting approaches.
Featuring is the first phase, in which various standard web page preprocessing techniques are applied, such as tokenization, removal of HTML tags and stop words, and generation of feature sets. A featured set of tokens is generated: the tokens are assigned weights using a standard weighting scheme, and the assigned weights are then normalized using some basic approaches. Features are selected such that their weight is above the threshold value. The threshold value varies dynamically according to the length of the document or the maximum weight of the terms.
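As a sketch of this phase, the following Python fragment tokenizes a page, assigns term-frequency weights (one possible choice for the unspecified standard weighting scheme), normalizes them by the maximum weight, and keeps only tokens above a threshold; the stop-word list and the threshold fraction are illustrative assumptions, not values from the paper:

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}

def featuring(html_text, threshold_ratio=0.3):
    """Return the featured set: tokens whose normalized weight
    exceeds a dynamically chosen threshold."""
    # Remove HTML tags, then tokenize and drop stop words.
    text = re.sub(r"<[^>]+>", " ", html_text)
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)
              if t.lower() not in STOP_WORDS]

    # Assign term-frequency weights, normalized by the maximum count.
    counts = Counter(tokens)
    max_count = max(counts.values())
    weights = {t: c / max_count for t, c in counts.items()}

    # After normalization the maximum weight is 1, so the threshold is a
    # fraction of it; the paper also suggests tying it to document length.
    return {t: w for t, w in weights.items() if w > threshold_ratio}
```

For example, `featuring("<p>web mining web noise web</p>")` keeps `web` (weight 1.0) along with `mining` and `noise` (weight 1/3) at the default threshold.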
Modelling is the next phase, in which the DOM tree of the web page is generated. In the Document Object Model, a document is represented as a tree; a complete web page can easily be reconstructed from its DOM tree, as DOM trees are highly transformable. The DOM tree is a well-defined HTML document model. Some HTML tags do not include a closing tag, and for some of them the closing is implied by the following tag; for instance, an <li> tag may be closed implicitly by the next <li> tag. When we want to analyze a web page, we first check the syntax of the HTML document, since most HTML web pages are not well formed. After this stage we feed the web page into an HTML parser, which rectifies the markup and creates a Document Object Model (DOM) tree. Once the DOM tree has been created, the system splits it into many subtrees depending on the threshold level. Individual web sites adopt varied layout and presentation styles, so the depth of a web page's tree varies with its presentation style. The system must know the maximum level of the DOM tree in order to make a good choice of threshold level; hence, the system traverses the entire DOM tree to obtain the maximum depth of the DOM.
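The maximum-depth traversal can be sketched with Python's standard `html.parser`. This minimal version assumes reasonably well-formed markup (a production system would first repair the page with a full parser, as described above) and skips void tags such as <br> that never receive a closing tag:

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are ignored
# when tracking depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class DepthFinder(HTMLParser):
    """Record the maximum DOM depth of a document, which the system
    needs in order to choose a good threshold level."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

def max_dom_depth(html_text):
    finder = DepthFinder()
    finder.feed(html_text)
    return finder.max_depth
```

For instance, `max_dom_depth("<html><body><div><p>hi</p></div></body></html>")` returns 4.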

Pruning is the last phase of the algorithm. We pick the best-suited threshold level for the training data set by trying various threshold levels. The system then picks a suitable threshold level for the test data set by utilizing these known pairs of values. An estimate of the relationship between the threshold level and the maximum level is derived using linear regression analysis. Regression is a statistical analysis that helps to assess associations between two variables; it is used to determine which of the independent variables are related to the dependent variable and to explore the forms of these relationships. Once the threshold level has been obtained, the system identifies the DOM nodes that fall below the threshold level as noise and discards them. The clean web page is then generated from the DOM tree after the marking and removal of nodes having weight less than the threshold value. [11]
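Assuming the training data yields pairs of (maximum DOM depth, best threshold level), the linear-regression step can be sketched as a plain least-squares fit; the pair values used in the example below are purely illustrative:

```python
def fit_threshold_model(pairs):
    """Ordinary least-squares fit of the best threshold level (y)
    against the maximum DOM depth (x), using pairs observed on
    training pages."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def predict_threshold(max_depth, slope, intercept):
    """Estimate a threshold level for an unseen page from its depth."""
    return slope * max_depth + intercept
```

With hypothetical training pairs [(4, 2), (8, 4), (12, 6)] the fit gives slope 0.5 and intercept 0, so a page of depth 10 would receive threshold level 5.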
The Featured DOM Tree is used for web data mining because its data structure is easy to modify and data is easy to extract from it; therefore noise detection and elimination in the pruning phase becomes easier. A Featured DOM Tree provides a more detailed representation of the web page than a normal DOM tree. Also, the DOM extension is independent of the browser size and text size settings.
The output of the proposed technique is a set of clean web pages, free of any noisy content. This is obtained after a bottom-up traversal of the Featured DOM Tree: during the traversal, all nodes having weight less than the threshold are marked as noise and eliminated. If all the child nodes of a parent node are marked as noise, the parent node itself is eliminated. Hence we get a set of web pages free from noisy content such as copyright notices, navigation panels, privacy notices, and advertisements.
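The bottom-up marking and removal can be sketched as follows; the minimal Node class and its weights are assumed to come from the featuring and modelling phases rather than from any particular library:

```python
class Node:
    """Minimal featured-DOM node: a tag, a weight, and children.
    Weights are assumed to be assigned by the featuring phase."""
    def __init__(self, tag, weight=0.0, children=None):
        self.tag = tag
        self.weight = weight
        self.children = children or []

def prune(node, threshold):
    """Bottom-up traversal: drop leaf nodes whose weight falls below
    the threshold, and drop a parent when all its children were noisy."""
    if not node.children:
        return node if node.weight >= threshold else None
    kept = [c for c in (prune(child, threshold) for child in node.children) if c]
    if not kept:
        return None  # every child was noise, so the parent goes too
    node.children = kept
    return node
```

For example, pruning a body containing a low-weight navigation block and a high-weight paragraph removes the navigation subtree entirely while keeping the paragraph and its ancestors.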
IV. ARCHITECTURE DIAGRAM
The figure illustrates the process involved in noise detection and elimination.

The web page is provided as input to both the Parser and the Featured DOM Tree Generator. The Parser generates tokens with weights assigned to them. The Featured DOM Tree Generator produces a Featured DOM tree, whose data structure is easy to modify. The Featured Set and the Featured DOM Tree are then passed to the Comparator as input so that, on the basis of the threshold value and the featured set, comparison can be made with relevant and related data. On the basis of these comparisons, the Noise detector and

eliminator marks and eliminates noisy and irrelevant data items, thereby modifying the Featured DOM Tree. The Web Page Generator then generates the clean web page from the modified Featured DOM tree.
V. ALGORITHM
The whole process of detecting and eliminating noise is the major challenge faced. The solution can be presented as a strategy devised to combat the problems that appear throughout the process. [11] The steps followed are as follows:
A. Noise reduction()
1. Start.
2. Take a web page as input.
3. Call the feature() function to obtain Featured Set-1 (F_Set) using the featuring technique.
4. Call the modelling() function to obtain Featured Set-2 (F_DOM) using the DOM tree.
5. Input F_Set and F_DOM to the pruning stage.
6. Return.

B. Featuring
1. Input: a web page including noisy items.
2. Apply the Pre-processing() method to the web page for feature set generation.
3. Apply the weight_scheme to the tokens generated through pre-processing to create F_Set_i.
4. Select features having a score above the threshold value.
5. F_Set_i = {F_Set1, F_Set2, ...} is obtained with optimal features and is further used for noise detection and similarity verification.
C. Modelling
1. The HTML document / web page is modelled into a DOM Tree.
2. A Featured DOM Tree is created using optimal feature selection for the individual leaf nodes of the DOM Tree.
3. As a result, the featured sets F_DOM_i = {F_DOM1, F_DOM2, ...} are obtained.

D. Pruning
1. Noise detection is performed on each F_DOM_i based on similarity verification.
2. Minimum Weight Overlapping (MWO) is applied for similarity verification:

   Feature set terms | F_Set1 | F_Set2 | Min (W)
   x1                | W11    | 0      | 0
   x2                | W21    | W22    | min(W21, W22)
   x3                | 0      | W32    | 0
   x4                | W41    | W42    | min(W41, W42)
   Total             | 100    | 100    | MWO = sum of Min (W)

3. F_DOM_i is compared with F_Set_i via MWO, such that features that do not reach the predefined threshold value t are marked as noisy nodes in the tree.


4. Removal of noisy blocks from the DOM Tree is performed by bottom-up traversal: a parent node is marked noisy, and removed during the traversal, if all its child nodes are also marked noisy.
5. Finally, a cleaned web page is returned.
6. End.
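Under the table's semantics, MWO sums, over all feature terms, the smaller of a term's two weights (0 when the term is missing from either set). A minimal sketch, using the below-threshold-is-noise convention from Section III (the paper's exact comparison direction is stated ambiguously):

```python
def mwo(f_set1, f_set2):
    """Minimum Weight Overlapping: for every feature term, take the
    smaller of its two weights (0 if absent from a set) and sum them."""
    terms = set(f_set1) | set(f_set2)
    return sum(min(f_set1.get(t, 0), f_set2.get(t, 0)) for t in terms)

def is_noisy(f_dom, f_set, t):
    """Mark a DOM block noisy when its overlap with the page's featured
    set stays below the threshold t."""
    return mwo(f_dom, f_set) < t
```

Reproducing the table with illustrative weights W11=40, W21=30, W41=30 and W22=20, W32=40, W42=40 gives MWO = 0 + min(30, 20) + 0 + min(30, 40) = 50.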
VI. CONCLUSION
Our project aims at creating an application framework for information retrieval in web data mining, presenting a clean web page free of noise. The user is thus presented with a clean web page without advertisements, images, and similar noisy items.
VII. FUTURE WORK
This paper proposes the novel task of detecting and eliminating local noise from a web page. The proposed three phase algorithm aims at detecting noise, irrelevant data, and redundant data and extracting them from web pages. Further work in this field will enable more efficient noise removal by directly detecting the informative contents instead of detecting and eliminating the noise. For better indexing and web page ranking, the technique can be incorporated into search engines. Accuracy can be further improved by using more efficient methods for feature selection and featured set generation.
VIII. ACKNOWLEDGEMENT
We would like to express our sincere gratitude for the assistance and support of a number of people who helped us. We are thankful to Prof. Renuka Puntambekar, Department of Computer Engineering, MIT Academy of Engineering, our internal guide, for the valuable guidance that she has provided us at various stages throughout the project work. She has been a source of motivation, enabling us to give our best efforts in this project. We are also grateful to Prof. Uma Nagaraj, Head of the Computer Department, MIT Academy of Engineering.
IX. REFERENCES

[1] S. Paek and J. R. Smith, "Detecting Image Purpose in World-Wide Web Documents," January 1998.

[2] S. H. Lin and J. M. Ho, "Discovering Informative Content Blocks from Web Documents," in Proceedings of SIGKDD, 2002.

[3] Z. Bar-Yossef and S. Rajagopalan, "Template Detection via Data Mining and its Applications," in Proceedings of the 11th International World-Wide Web Conference, 2002.

[4] L. Yi, B. Liu, and X. Li, "Eliminating Noisy Information in Web Pages for Data Mining," in Proceedings of the International ACM Conference, 2003.

[5] L. Yi and B. Liu, "Web Page Cleaning for Web Mining through Feature Weighting," in International Joint Conference on Artificial Intelligence, 2003.

[6] Hung-Yu Kao, Ming-Syan Chen, Shian-Hua Lin, and Jan-Ming Ho, "Entropy-Based Link Analysis for Mining Web Informative Structures," in CIKM, 2002.

[7] H. Y. Kao, J. M. Ho, and M. S. Chen, "Wisdom Web Intra-page Informative Structure Mining based on Document Object Model," in IEEE Trans. KDD, 2005.

[8] Cai Deng, Yu Shipeng, and Wen Jirong, "VIPS: A Vision-Based Page Segmentation Algorithm," Microsoft Technical Report, 2003.

[9] Thanda Htwe and Khin Haymar Saw Hla, "Noise Removing from Web Pages Using Neural Network," in ICCAE, 2010.

[10] Guohua Hu and Qingshan Zhao, "Study to Eliminating Noisy Information in Web Pages."

[11] Shine N. Das, Pramod K. Vijayaraghavan, and Midhun Mathew, "Eliminating Noisy Information in Web Pages Using Featured DOM Tree."

