Web Mining
WEB DATA
Web pages
- Content of actual web pages.
Intra-page structures
- HTML or XML code for web pages; anchor tags.
Inter-page structures
- Linkage structure between web pages.
Usage data
- How web pages are accessed by end users.
User profile
- Demographic and registration information obtained about the user.
- Information found in cookies.
Text characteristics
Noisy data
- Example: spelling mistakes and informal spellings
  "r u available ?"
  "Hey whazzzzzz up"
  "Yup"
Information Retrieval
Given:
- A source of textual documents
- A user query (text based)
Find:
- A ranked set of documents relevant to the query
[Figure: a documents source and a text query feed an IR system, which returns matching documents]
Issues in IR:
Meaning of words
- Synonyms: buy / purchase
- Ambiguity: bat (baseball vs. cricket)
Order of words in the query
User dependency for the data
- Direct feedback
- Indirect feedback
Authority of the source
- IBM is more likely to be an authoritative source than my distant cousin.
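The synonym problem shows up even in the simplest ranking scheme. A minimal sketch (the `rank` function is illustrative, not from the slides) that scores documents by how many query terms they contain:

```python
def rank(docs, query):
    """Score each document by the number of query terms it contains;
    return matching documents, best first."""
    q = set(query.lower().split())
    scored = [(sum(t in q for t in d.lower().split()), d) for d in docs]
    return [d for s, d in sorted(scored, reverse=True) if s > 0]

docs = ["buy cheap tickets", "purchase tickets", "weather today"]
print(rank(docs, "buy tickets"))
# "purchase tickets" scores lower than "buy cheap tickets": plain term
# matching misses the buy/purchase synonymy noted above.
```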
Information Extraction(IE)
Itis a type of information retrieval whose
goal is to
automatically extract structured information
from unstructured and/or semi-structured
machine-readable documents.
Subtasks of IE
1. Named entity extraction, which could include:
a) Named entity recognition: recognition of known entity names (for people, organizations, locations), e.g., PERSON located in LOCATION (extracted from the sentence "Amit is in Delhi.")
b) Relationship extraction: identification of relations between entities, such as PERSON works for ORGANIZATION (extracted from the sentence "Amit works for IBM.")
Subtasks of IE (contd.)
2. Semi-structured information extraction:
- Table extraction: finding and extracting tables from documents.
- Comments extraction: extracting comments from the actual content of an article.
3. Terminology extraction: finding the relevant terms for a given corpus.
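As a toy illustration of relationship extraction, a pattern like PERSON works for ORGANIZATION can be matched with a regular expression. This is a deliberately naive sketch; real IE systems use trained named-entity recognizers rather than capitalization patterns:

```python
import re

# Hypothetical pattern: capitalized word + "works for" + capitalized word.
PATTERN = re.compile(r"([A-Z][a-z]+) works for ([A-Z][A-Za-z]+)")

def extract_works_for(text):
    """Return (person, relation, organization) triples found in the text."""
    return [(p, "works_for", o) for p, o in PATTERN.findall(text)]

print(extract_works_for("Amit works for IBM."))
# [('Amit', 'works_for', 'IBM')]
```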
Information Extraction
[Figure: given a documents source and queries (e.g., "What is Information Extraction?", "salary"), an extraction system returns relevant pieces of information per query (Relevant Info 1, 2, 3), which are combined into ranked query results]
[Figure: a spider feeds a documents source into an IR/IE system; given a query, the system returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Text preprocessing
- Syntactic/semantic text analysis
- Features generation (e.g., bag of words)
- Features selection (simple counting, statistics)
- Text/data mining
- Analyzing results
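A minimal sketch of the features-generation step: the bag-of-words representation simply counts token occurrences (the function name here is illustrative):

```python
from collections import Counter

def bag_of_words(text):
    """Features generation by simple counting: token -> frequency."""
    return Counter(text.lower().split())

features = bag_of_words("Web mining mines Web data")
print(features["web"])  # 2
```

Feature selection can then keep, say, only tokens above a frequency threshold.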
Text Preprocessing
Text cleanup
- e.g., remove ads from web pages, normalize text converted from binary formats, deal with tables, figures and formulas.
Tokenization
- Splitting up a string of characters into a set of tokens.
- Need to deal with issues like:
  - Apostrophes, e.g., "John's sick": is it 1 or 2 tokens?
  - Hyphens, e.g., database vs. data-base vs. data base.
  - How should we deal with C++, A/C, :-) ?
Parsing
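One way to handle the tokenization issues above (a sketch; production tokenizers are considerably more involved) is a regular expression that keeps apostrophes and hyphens inside words and lets trailing '+' signs attach, so C++ survives as one token:

```python
import re

# \w+ runs joined by - or ', with optional trailing + signs (for C++).
TOKEN = re.compile(r"\w+(?:[-']\w+)*\+*")

def tokenize(text):
    """Split text into tokens, keeping John's, data-base and C++ whole."""
    return TOKEN.findall(text)

print(tokenize("John's sick"))  # ["John's", 'sick'] -> 1 token for John's
print(tokenize("data-base or data base"))
print(tokenize("C++"))          # ['C++']
```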
[Figure: search engine architecture: a web crawler traverses the Web and feeds an indexer; the resulting indexes (including ad indexes) answer user searches, e.g., a result for "Miele": "Welcome to Miele, the home of the very best appliances and kitchens in the world. www.miele.co.uk/ - 3k - Cached - Similar pages"]
Crawlers
- A robot (spider) is a program that traverses the hypertext structure of the Web.
- It collects information from visited pages.
- The page (or pages) that the crawler starts with are referred to as the seed URLs.
- Crawlers are used to construct indexes for search engines.
Crawlers (contd.)
- Incremental crawler: selectively searches the Web and incrementally updates the index, as opposed to replacing it.
- Focused crawler: visits pages related to a particular subject (topic of interest).
Focused Crawler
- Only visits links from a page if that page is determined to be relevant.
- A major piece of the architecture is a hypertext classifier that associates a relevance score with each document with respect to the crawl topic.
- A distiller is used to determine which pages contain links to many relevant pages. These are called hub pages.
- The hub pages may not contain the relevant information themselves, but they are quite important for facilitating continued searching.
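The crawl loop can be sketched as follows. Here `fetch`, `extract_links` and `relevance` stand in for the page fetcher, link extractor and hypertext classifier; all three are hypothetical callables supplied by the caller:

```python
from collections import deque

def focused_crawl(seeds, fetch, extract_links, relevance,
                  threshold=0.5, max_pages=100):
    """Crawl breadth-first, but only follow links out of pages whose
    relevance score meets the threshold."""
    frontier, visited, relevant = deque(seeds), set(), []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        score = relevance(page)
        if score >= threshold:
            relevant.append((url, score))
            frontier.extend(extract_links(page))  # expand relevant pages only
    return relevant

# Toy web: page "b" is off-topic, so "d" (reachable only via "b") is skipped.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
found = focused_crawl(["a"], fetch=lambda u: u,
                      extract_links=lambda p: graph[p],
                      relevance=lambda p: 0.0 if p == "b" else 1.0)
print([u for u, _ in found])  # ['a', 'c']
```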
Focused Crawler
[Figure: focused crawler architecture (Prentice Hall)]
Intelligent Search Agents

Successors
Page Rank
Assume that the Web consists of only three pages A, B, and C. A links to itself and to B; B links to A and to C; C links to B. Each page divides its importance equally among its out-links (hence the 1/2 weights). Let [a, b, c] be the vector of importances for these three pages.
The equation describing the asymptotic values of these three variables is:

    [a]   [1/2  1/2   0]   [a]
    [b] = [1/2   0    1] . [b]
    [c]   [ 0   1/2   0]   [c]

We can solve equations like this one by starting with the assumption a = b = c = 1 and applying the matrix to the current estimate of these values repeatedly. The first four iterations give the following estimates:
    a =  1    1    5/4   9/8    5/4   ...
    b =  1   3/2    1   11/8  17/16   ...
    c =  1   1/2   3/4   1/2  11/16   ...

In the limit, the solution is a = b = 6/5 and c = 3/5: a and b each have the same importance, and twice the importance of c.
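The iteration above is easy to reproduce in code; a short sketch of the repeated matrix application:

```python
def iterate(M, v, steps):
    """Apply the transition matrix M to the importance vector v repeatedly."""
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
    return v

# Transition matrix of the three-page example (columns = out-links).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
print(iterate(M, [1.0, 1.0, 1.0], 1))   # [1.0, 1.5, 0.5]
print(iterate(M, [1.0, 1.0, 1.0], 50))  # approaches [1.2, 1.2, 0.6] = [6/5, 6/5, 3/5]
```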
Dead Ends
Now suppose C has no out-links at all. The corresponding transition matrix has an all-zero third column:

    [a]   [1/2  1/2   0]   [a]
    [b] = [1/2   0    0] . [b]
    [c]   [ 0   1/2   0]   [c]

Importance leaks out at C on every step, and repeated iteration drives a, b, and c all to 0.
Spider Traps
Assume now once more that the structure of the Web has changed: C links only to itself. The new matrix describing transitions is:

    [a]   [1/2  1/2   0]   [a]
    [b] = [1/2   0    0] . [b]
    [c]   [ 0   1/2   1]   [c]

Once importance reaches C it can never leave, so repeated iteration drains all of the importance into c.
Google Solution
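Google's well-known remedy for dead ends and spider traps is "taxation": with probability beta the surfer follows a link, and with probability 1 - beta jumps to a random page, i.e. v' = beta*M*v + (1 - beta)*e/n. A sketch, using the commonly cited beta = 0.85:

```python
def pagerank(M, beta=0.85, steps=100):
    """v' = beta*M*v + (1-beta)*e/n: link-following plus random teleport."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(steps):
        v = [beta * sum(M[i][j] * v[j] for j in range(n)) + (1 - beta) / n
             for i in range(n)]
    return v

# Spider-trap matrix: C links only to itself.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
print(pagerank(M))  # C still ranks highest, but A and B keep nonzero rank
```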
Kleinberg's hypertext-induced topic selection (HITS) algorithm was also developed for ranking documents based on the link information among a set of documents.
HITS Algorithm
[Figure: example graph with hub and authority scores labeled on the nodes]
Consider a small graph in which nodes 1 and 2 each link to node 3. The adjacency matrix and its transpose are:

    A = [0  0  1]        AT = [0  0  0]
        [0  0  1]             [0  0  0]
        [0  0  0]             [1  1  0]
The authority and hub scores satisfy

    a = AT . h
    h = A . a

and therefore

    a = AT . A . a
    h = A . AT . h
Example contd..

    a = AT . h = [0  0  0]   [1]   [0]
                 [0  0  0] . [1] = [0]
                 [1  1  0]   [1]   [2]
Example contd..
0
h = A. a = 0
0
0
0
0
1
1 . 0
2
0
2
= 2
0
56
28
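The mutual recursion a = AT.h, h = A.a can be iterated directly, with normalization so the scores stay bounded. A sketch using the example's adjacency matrix:

```python
def hits(A, steps=20):
    """Iterate a = AT.h then h = A.a, normalizing by the max each round."""
    n = len(A)
    h = [1.0] * n
    for _ in range(steps):
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # AT.h
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # A.a
        top_a, top_h = max(a) or 1, max(h) or 1
        a = [x / top_a for x in a]
        h = [x / top_h for x in h]
    return a, h

A = [[0, 0, 1],   # node 1 links to node 3
     [0, 0, 1],   # node 2 links to node 3
     [0, 0, 0]]
a, h = hits(A)
print(a)  # [0.0, 0.0, 1.0] -> node 3 is the authority
print(h)  # [1.0, 1.0, 0.0] -> nodes 1 and 2 are hubs
```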
HITS vs PageRank
- PageRank computes a single, query-independent importance score for every page, offline over the whole graph.
- HITS computes two scores per page (hub and authority) and is typically run at query time on a query-specific subset of pages.
Example: fields of a web server log entry
- date (2006-10-04)
- time (04:24:39)
- client-side IP (203.200.95.194)
- server site name (W3SVC195)
- server computer name (NS)
- server IP (69.41.233.13)
- server port (80)
- method (GET)
- client URL (/STUDYCENTRE/StudyCentreSearchPage.asp)
- server status code (200)
- client-side bytes received (568)
- time taken (210)
- user agent (Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;))
- cs_referer (http://www.mcu.ac.in/search_a_study_institute.htm)
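Fields like these follow the W3C extended log format written by IIS. A parsing sketch; the field names below are assumed from the slide's field list, and whitespace-splitting works here because IIS encodes spaces inside fields as '+':

```python
# Assumed field order, taken from the example log entry above.
FIELDS = ["date", "time", "c-ip", "s-sitename", "s-computername", "s-ip",
          "s-port", "cs-method", "cs-uri-stem", "sc-status", "sc-bytes",
          "time-taken", "cs(User-Agent)", "cs(Referer)"]

def parse_log_line(line):
    """Map one whitespace-separated log line onto the field names."""
    return dict(zip(FIELDS, line.split()))

entry = parse_log_line(
    "2006-10-04 04:24:39 203.200.95.194 W3SVC195 NS 69.41.233.13 80 GET "
    "/STUDYCENTRE/StudyCentreSearchPage.asp 200 568 210 "
    "Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) "
    "http://www.mcu.ac.in/search_a_study_institute.htm")
print(entry["cs-method"], entry["sc-status"])  # GET 200
```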
Learn:
- Who is visiting your site
- The path visitors take through your pages
- How much time visitors spend on each page
- The most common starting page
- Where visitors are leaving your site
Web usage mining process:

    Raw server log -> User session file -> Pattern Discovery -> Pattern Analysis -> Interesting Knowledge
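Building the user session file typically relies on a timeout heuristic: requests from the same user separated by more than, say, 30 minutes start a new session. A sketch of that heuristic:

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split one user's (timestamp, url) requests into sessions at gaps
    longer than the timeout."""
    sessions, current = [], []
    for ts, url in sorted(requests):
        if current and ts - current[-1][0] > timeout:
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2006, 10, 4, 4, 24, 39)
reqs = [(t0, "/index.asp"),
        (t0 + timedelta(minutes=5), "/search.asp"),
        (t0 + timedelta(hours=2), "/index.asp")]
print(len(sessionize(reqs)))  # 2 sessions: the 2-hour gap starts a new one
```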
Data transformation
- Into the forms most appropriate for web usage mining.
Summary report

Personalization vs. intelligence
[Figure: content flows from sources through a personalization server to receivers]
User modeling
Basic elements:
- Constructing models that can be used to adapt the system to the user's requirements.
- Different types of requirement: interests (sports and finance news), knowledge level (novice to expert), preferences (no-frame GUI), etc.
- Different types of model: personal vs. generic.
  - User model [PERSONAL]
  - User community [GENERIC]
  - User stereotype [GENERIC]
[Figure: individual users (User 1 to User 5) are grouped into user communities (e.g., Community 1), from which community-level user models are built]
Cleaning:
User identification:
Problems:
Personalized assistants
- The Letizia version of the system searches the Web locally, following outgoing links from the current page.
- The PowerScout version uses a search engine to explore the Web.