12/10/2011
Outline
1. Introduction
2. Background
   a) Hierarchical structure
   b) Page-Level Segmentation
3. Segmentation algorithms
4.
5. Query by Segment
6. Performance Analysis
7. Discussions
8.
1. Introduction
The WWW is the largest and most commonly used source of information
Deep querying is gaining importance
Understanding web page semantics improves the user's search experience
Within a web page: identify semantic groups
[Figure: UML class diagram]
2. Background: MedlinePlus
Web page:
a. Relevant content:
   i. Topic headings
b. Irrelevant content:
   i. Navigation bars, header, footer, advertisements
[Diagram labels: common web user; user; semantic query and search (in future)]
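The split between relevant and irrelevant content can be illustrated with a minimal sketch. It is not the segmentation algorithm presented in these slides: it assumes, purely for illustration, that irrelevant regions are marked with `nav`, `header`, `footer`, or `aside` tags and that topic headings are `h1` to `h3` elements. The class `HeadingSegmenter`, the function `segment_page`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags mark irrelevant (boilerplate) regions of a page.
IRRELEVANT_TAGS = {"nav", "header", "footer", "aside"}
# Assumption: topic headings are marked with these tags.
HEADING_TAGS = {"h1", "h2", "h3"}


class HeadingSegmenter(HTMLParser):
    """Collects the text under each topic heading, skipping boilerplate."""

    def __init__(self):
        super().__init__()
        self.segments = {}       # heading text -> list of text chunks
        self._skip_depth = 0     # > 0 while inside an irrelevant region
        self._in_heading = False
        self._current = None     # heading of the segment being filled

    def handle_starttag(self, tag, attrs):
        if tag in IRRELEVANT_TAGS:
            self._skip_depth += 1
        elif tag in HEADING_TAGS and self._skip_depth == 0:
            self._in_heading = True
            self._current = ""

    def handle_endtag(self, tag):
        if tag in IRRELEVANT_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._current = self._current.strip()
            self.segments.setdefault(self._current, [])

    def handle_data(self, data):
        if self._skip_depth > 0:
            return  # text inside navigation bars, headers, footers
        if self._in_heading:
            self._current += data
        elif self._current is not None and data.strip():
            self.segments[self._current].append(data.strip())


def segment_page(html):
    """Return a mapping from topic heading to the text that follows it."""
    parser = HeadingSegmenter()
    parser.feed(html)
    parser.close()
    return {h: " ".join(chunks) for h, chunks in parser.segments.items()}


# Made-up sample page, for illustration only.
SAMPLE = """
<html><body>
<nav>Home | Health Topics</nav>
<h1>Asthma</h1>
<h2>Causes</h2><p>Airway inflammation.</p>
<h2>Symptoms</h2><p>Wheezing.</p><p>Shortness of breath.</p>
<footer>Contact us</footer>
</body></html>
"""

segments = segment_page(SAMPLE)
for heading, text in segments.items():
    print(f"{heading}: {text}")
```

Real pages rarely mark boilerplate this cleanly, which is why the deck surveys template-detection, visual, and graph-theoretic techniques rather than relying on tag names alone.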
3. Segmentation algorithms
Technique                        Reference    Year
Template Detection               [9], [6]     2002, 2007
Dom-Node Recognition             -            -
Visual-DOM based Rendering       [2]          2003
Visual-Heuristics based Method   Proposed     -
Graph-theoretic Method           [3]          2008
Linguistics based Method         [7]          2008
-                                [4], [5]     2010, 2009
Site-Oriented Method             [1]          2011
3(c). Comparison
5. Query by Segment
Query by Segment is realized as Query by Tag (heading), QBT
Based on the content structure (VisHue algorithm): query by attributes
Data set: MedlinePlus medical encyclopedia, 3886 web pages
Target: focused and explicit querying
i. Beneficial to skilled and semi-skilled users
[Diagram labels: DB; Title; Causes; Symptoms; PostCare]
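A minimal sketch of querying by attribute, contrasted with a plain keyword search, might look as follows. It assumes a hypothetical table `pages` with one column per segment (Title, Causes, Symptoms, PostCare); the schema, the helper names `query_by_tag` and `keyword_search`, and the sample rows are illustrative assumptions, not the interface described in the slides.

```python
import sqlite3

# Hypothetical schema: one row per encyclopedia page, one column per segment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (title TEXT, causes TEXT, symptoms TEXT, postcare TEXT)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        ("Page A", "cause text alpha", "fever and cough", "rest"),
        ("Page B", "mentions fever in passing", "rash", "follow-up visit"),
    ],
)

ATTRIBUTES = {"title", "causes", "symptoms", "postcare"}


def query_by_tag(conn, attribute, term):
    """Return titles of pages whose chosen segment contains the term."""
    if attribute not in ATTRIBUTES:
        # Column names cannot be bound as parameters, so check them explicitly.
        raise ValueError(f"unknown attribute: {attribute}")
    sql = f"SELECT title FROM pages WHERE {attribute} LIKE ? ORDER BY title"
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


def keyword_search(conn, term):
    """Return titles of pages where the term occurs in any segment."""
    sql = (
        "SELECT title FROM pages "
        "WHERE title || ' ' || causes || ' ' || symptoms || ' ' || postcare "
        "LIKE ? ORDER BY title"
    )
    return [row[0] for row in conn.execute(sql, (f"%{term}%",))]


by_tag = query_by_tag(conn, "symptoms", "fever")
by_keyword = keyword_search(conn, "fever")
print(by_tag)      # only the page with the term in its Symptoms segment
print(by_keyword)  # every page that mentions the term anywhere
```

Restricting the match to one segment is what makes the query focused and explicit: the keyword search also returns the page that only mentions the term in a different segment.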
6. Performance Analysis
i. Qualitative comparison with traditional keyword search
ii. Query formulation and interpretation
iii. Quantitative performance analysis of the interface
7. Discussions
Content fragments, as perceived by skilled and semi-skilled domain users, are determined by the web page segmentation process
Proposed effort: formulating a generic algorithm based on heuristic design rules and visual features
The QBT interface: query over user-identified segments (attributes)
References
1. A Site Oriented Method for Segmenting Web Pages, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR'11, July 24-28, 2011.
2. Extracting Content Structure for Web Pages based on Visual Representation, Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xi'an, China, April 23-25, 2003, Proceedings (2003), pp. 596-596.
3. Graph-Theoretic Approach to Webpage Segmentation, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
4. A segmentation method for web page analysis using shrinking and dividing, Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
5. Web Page Layout via Visual Segmentation, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
6. Page-level template detection via isotonic smoothing, D. Chakrabarti, R. Kumar, and K. Punera, In 16th WWW, pages 61-70, 2007.
7. A Densitometric Approach to Web Page Segmentation, Christian Kohlschütter, Wolfgang Nejdl, CIKM'08, October 26-30, 2008.
8. HTML Page Analysis Based on Visual Cues, Yudong Yang and HongJiang Zhang, IEEE, 2001.
9. Template Detection via Data Mining and its Applications, Ziv Bar-Yossef, Sridhar Rajagopalan, In Proceedings of WWW'02, May 7-11, 2002, Honolulu, Hawaii, USA.
10. DeSeA: A Page Segmentation based Algorithm for Information Extraction, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
11. Reverse Engineering for Web Data: From Visual to Semantic Structures, Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE'02).
Thank you
Questions