By
Siddu P. Algur
Head, Dept. of Information Science & Engineering S D M College of Engg. & Tech., Dharwad. siddu_p_algur@hotmail.com
CONTENTS
Motivation
Solution
Existing Approaches
New Approach (VSAP Algorithm)
Empirical Evaluation
Experimental Results
Conclusion
Motivation
A huge amount of information is available on the Internet.
Data is distributed across the Internet.
Undesired data is present along with the relevant information.
Data from various sources is required in a local repository for further analysis.
WEB MINING TAXONOMY
Web Mining
    Web Usage Mining
    Web Structure Mining
    Web Content Mining
Web Structure Mining: A typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the Web.
Web Usage Mining: Focuses on techniques that predict user behavior while the user interacts with the Web.
Web Content Mining: Focuses on the content of the web page. It is an automatic process that extracts patterns from web pages and goes beyond keyword extraction.
But retrieving relevant information from the Web is like finding a needle in a haystack.
The Web is highly volatile, distributed and heterogeneous. The Web is a huge chaotic information space without central authority.
MDR Algorithm (Mining Data Records from Web Pages)
DEPTA Algorithm
( Data Extraction using Partial
MDR Algorithm
Data records: a group of similar data records placed in a specific region of the page lies under the same parent in the tag tree.
Mine data regions in the page based on the tag tree and string comparison.
Identify data records from the data regions.
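The string comparison step can be illustrated with a normalized edit distance between the tag strings of two candidate nodes. This is a minimal sketch, not the MDR implementation: the function names, the example tag sequences, and the 0.3 similarity threshold are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def similar(a, b, threshold=0.3):
    """Treat two tag strings as similar when they differ only slightly.

    The threshold value is an assumption for this sketch.
    """
    return normalized_distance(a, b) <= threshold


# Hypothetical tag strings of three sibling subtrees.
record_1 = ["tr", "td", "img", "td", "a", "b"]
record_2 = ["tr", "td", "img", "td", "a", "i"]
navbar = ["ul", "li", "a", "li", "a"]

print(similar(record_1, record_2))  # the two records differ in one tag
print(similar(record_1, navbar))    # the navigation bar shares almost nothing
```

Adjacent nodes whose tag strings are similar are candidates for the same data region; dissimilar neighbours such as the navigation bar are not.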
[Figure: tag tree showing a data region, its 4 generalized nodes, and the data records]
A novel partial tree alignment method is used to align and to extract corresponding data items from the discovered data records and put the data items in a database table.
Constructing the tag tree and performing tree matching adds computation-time overhead.
Fails to identify data records when the page contains only a single record.
VIPS Algorithm
The VIPS algorithm parses the HTML page and detects visual separators in the parse tree.
Each separator receives a weight, which is adjusted according to separator-based constraints.
Finally, the content structure of the page is created by merging visual blocks that are not divided by separators.
[Figure: page layout showing the data region with Data Object 1 (Data Record 1), content links, and the copyright statement]
Visual Structure based approach: VSAP
[Figure: identifying the data region from the co-ordinates of the bounding rectangles of all tags, using the Largest Rectangle Identifier and the Container Identifier]
VSAP Algorithm
Steps:
1. Determine the co-ordinates of all the bounding rectangles.
2. Identify the Data Region:
   a. Identify the Largest Rectangle.
   b. Identify the Container within the Largest Rectangle.
   c. Identify the Data Region containing the data records within that Container.
The parsing and rendering component of the browser, whose function is to parse and render HTML pages, is used to obtain the bounding rectangle of each tag.
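[Figure: web page with the bounding rectangles of its tags and the identified data region]
Largest Rectangle Identifier: obtains the largest bounding rectangle among the children of the BODY tag.
Container Identifier: gets the smallest rectangle whose area is greater than half the area of the largest bounding rectangle.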
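The two identification rules can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: the `Box` class, the function names, and the example page are hypothetical, and each rectangle is reduced to a width and a height because only its area matters for these two rules.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Box:
    """Bounding rectangle of one tag (hypothetical structure for this sketch)."""
    name: str
    width: int
    height: int
    children: List["Box"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def largest_rectangle(body: Box) -> Box:
    """Child of the BODY tag with the largest bounding rectangle."""
    return max(body.children, key=lambda b: b.area)


def descendants(box: Box) -> Iterator[Box]:
    """Yield the box itself and every box nested inside it."""
    yield box
    for child in box.children:
        yield from descendants(child)


def find_container(largest: Box) -> Box:
    """Smallest rectangle whose area exceeds half the area of `largest`."""
    candidates = [b for b in descendants(largest) if b.area > largest.area / 2]
    return min(candidates, key=lambda b: b.area)


# Hypothetical page layout, sizes in pixels.
body = Box("body", 800, 1150, [
    Box("header", 800, 100),
    Box("main", 800, 1000, [
        Box("sidebar", 200, 1000),
        Box("results", 600, 900,
            [Box(f"record {i}", 600, 200) for i in range(1, 5)]),
    ]),
    Box("footer", 800, 50),
])

largest = largest_rectangle(body)
container = find_container(largest)
print(largest.name, container.name)
```

In this example the largest child of the body is `main`, and the container is `results`: it is the smallest box still covering more than half of `main`, while the sidebar and the individual records are too small to qualify.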
[Figure: web page with the container identified]
Filter
Find the average height of the children of the container.
Eliminate children whose height is less than the average height.
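The filter can be sketched in a few lines. This is a minimal illustration assuming each child of the container is given as a (label, height) pair; the function name and the sample heights are hypothetical.

```python
def filter_records(children):
    """Keep the children whose height is at least the average height.

    `children` is a list of (label, height_in_pixels) pairs, a
    representation assumed for this sketch.
    """
    if not children:
        return []
    average = sum(height for _, height in children) / len(children)
    # Children shorter than the average (sort bars, pagers, spacers) are dropped.
    return [(label, height) for label, height in children if height >= average]


# Hypothetical children of a container.
children = [
    ("sort bar", 30),
    ("record 1", 180),
    ("record 2", 200),
    ("record 3", 190),
    ("pager", 40),
]

for label, height in filter_records(children):
    print(label, height)
```

Here the average height is 128 pixels, so the sort bar and the pager are eliminated and the three records remain as the data region.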
[Figure: the container and the resulting data region]
EMPIRICAL EVALUATION
Data Region Identification
    MDR: Dependent on specific tags for identifying data regions.
    VSAP: Identifies data regions independent of specific tags.
Data Record Extraction
    MDR: Identifies data records based on keyword search (e.g., $).
    VSAP: Identifies data records based on the visual structure of the web page.
Overall Time Complexity
    MDR: O(NK), where N is the total number of nodes in the tag tree and K is the maximum number of tag nodes of a generalized node.
    DEPTA: O(k^2), where k is the number of trees.
    VSAP: O(n), where n is the number of tag comparisons made.
Performance Measures
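Recall = Ec / Nt
Precision = Ec / Et
where Ec is the number of records extracted correctly, Nt is the total number of records on the page, and Et is the total number of records extracted.
Recall: the fraction of the data records on the web page that are correctly extracted.
Precision: the correctness of the data records identified.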
EXPERIMENTAL RESULTS
No.  URL                               MDR Cor.  MDR Wr.   VSAP Cor.  VSAP Wr.
1    http://www.tigerdirect.com/       8         36/0      8          0/0
2    http://www.amazon.com/            0         17/25     25         1/0
3    http://www.cooking.com/           17        13/3      20         0/0
4    http://www.ebay.com/              25        30/0      25         0/0
5    http://www.powells.com/           9         47/1      10         5/0
6    http://www.barnesandnoble.com/    10        30/0      10         0/0
7    http://www.pricegrabber.com/      0         0/25      25         0/0
8    http://www.shoebuy.com/           12        12/84     96         0/0
9    http://www.smartbuy.com/          0         15/10     10         0/0
10   (not shown)                       24        38/1      25         3/0
11   (not shown)                       10        12/0      10         0/0
12   (not shown)                       0         6/15      15         0/0
13   http://www.drugstore.com/         15        14/0      15         0/0
14   http://www.bookpool.com/          10        7/0       10         0/0
15   http://www.target.com/            0         0/12      12         1/0
Total                                  140       277/176   316        10/0
Recall                                 33.5%               96.93%
Precision                              44.3%               100%

[Figure: recall and precision of MDR and VSAP]
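The reported recall and precision can be checked from the totals in the results table. The sketch below assumes that each "Wr." pair reads as (records missed, records extracted wrongly); that reading is an assumption, chosen because it reproduces the percentages in the table. The function and variable names are illustrative.

```python
def recall(correct, total_on_page):
    """Recall = Ec / Nt."""
    return correct / total_on_page


def precision(correct, total_extracted):
    """Precision = Ec / Et."""
    return correct / total_extracted


# Totals from the results table: (Cor., first Wr. value, second Wr. value).
# Reading the Wr. pair as (missed, wrongly extracted) is an assumption.
totals = {
    "MDR": (140, 277, 176),
    "VSAP": (316, 10, 0),
}

for name, (correct, missed, wrong) in totals.items():
    r = recall(correct, correct + missed)      # Nt = correct + missed
    p = precision(correct, correct + wrong)    # Et = correct + wrongly extracted
    print(f"{name}: recall {r:.2%}, precision {p:.2%}")
```

Under this reading, VSAP comes out at 96.93% recall and 100% precision, and MDR at 33.57% recall and 44.30% precision, which matches the table's 33.5% and 44.3% up to rounding.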
DATA RECORD
Extraction of data records is based on visual clues.
The height of each record is obtained.
The average height is calculated.
Data records whose height is greater than the average height are extracted.
DATA REGION
DATA RECORDS
A flat data record gives one description of a single entity, whereas a nested data record gives multiple descriptions of a single entity.
Identification of data records is essential to simplify the task of extracting the data items, which many applications need.
The Data Identifier determines the number of data fields in each data record within the data region.
Flat records have fewer data fields than nested records. The number of fields in the nested data records is approximately 40% more than that of the flat records.
In Fig. 1 the number of fields is 12 and in Fig. 2 it is 7. Fig. 2 therefore has 58.3% as many fields as Fig. 1 (7/12); equivalently, Fig. 1 has about 71.4% more fields than Fig. 2 (5/7). Fig. 1 is a nested record and Fig. 2 is a flat record.
Application
VSAP can be used by any application that requires the most relevant information on a web page.
VSAP can provide a platform for applications that need to analyze related data from different sources on the Web.
VSAP can serve as an efficient replacement for MDR, which has already found its place in industry.
Conclusion
Results show that VSAP outperforms the existing MDR algorithm: 96.93% recall and 100% precision, against 33.5% recall and 44.3% precision for MDR.
VSAP is a novel and efficient method of web mining.
Snapshots
[Figures: extraction results by VSAP and by MDR for Amazon.com, Cooking.com, and Tigerdirect.com]
End
Thank You