Web Mining
To automatically discover and extract information from Web documents/services
(Etzioni, 1996, CACM 39(11))
One Interesting Approach
Estimate the size of the web?
The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment.
Hyper-link information
Access and usage information
Lots of data on user access patterns
Web logs contain the sequence of URLs accessed by users
Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to exploit hyper-links and access patterns

Generate user profiles -> improve customization and provide users with pages and advertisements of interest
Targeted advertising -> Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and e-commerce sites. Internet advertising is probably the "hottest" web mining application today
Fraud -> Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought). If the buying pattern changes significantly, then signal fraud
Network Management
Performance management -> Annual bandwidth demand is increasing ten-fold on average, while annual bandwidth supply is rising only by a factor of three. The result is frequent congestion. During a major event (World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
Fault management -> Analyze alarm and traffic data to carry out root-cause analysis of faults
Applications of web mining
Information retrieval (Search) on the Web
Automated generation of topic hierarchies

Why is Web Information Retrieval Important?
According to most predictions, the majority of human information will be available on the Web in ten years
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining

4 searches were defined that returned 141 web pages.
Coverage – about 40% in 1999
From http://www.searchengineshowdown.com/stats/overlap.shtml
Web Content Mining
Can be thought of as extending the work performed by basic search engines.

Database Approaches
One approach is to build a local knowledge base – model data on the web and integrate them in a way that enables specifically designed query languages to query the data.
Agent-Based Approach
Agents search for relevant information using domain characteristics and user profiles.
Example: a system for extracting a relation from the web, for example a list of all the books referenced on the web. The system is given a set of training examples which are used to search the web for similar documents. Another application of this tool could be to build a relation with the names and addresses of restaurants referenced on the web.
Brin, S. (1998). Extracting patterns and relations from the world wide web. In Int. Workshop on Web and Databases, pages 172–183.

Web Structure Mining
First generation of search engines
Early days: keyword based searches
Keywords: "web mining"
Retrieves documents with "web" and "mining"
synonymy problem
polysemy problem
stop words

Modern search engines
Link structure is very important
Adding a link is a deliberate act
Harder to fool systems using in-links
Modern search engines use link structure as an important source of information
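As a toy illustration of first-generation keyword retrieval (the documents and the query here are made up, not from the slides), a search for "web mining" simply returns pages containing both words, with no defence against synonymy or polysemy:

```python
# Minimal keyword-search sketch. Doc 2 shows the polysemy problem:
# "mining" matches even though it is about coal, not the Web.
docs = {
    1: "web mining discovers knowledge from web data",
    2: "coal mining industry report",
    3: "the world wide web",
}

def keyword_search(query, docs):
    terms = set(query.lower().split())
    # return ids of documents containing every query term
    return [doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())]

print(keyword_search("web mining", docs))  # only doc 1 has both words
```

A query for "www" would miss all three documents even though doc 3 is relevant, which is exactly the synonymy problem noted above.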
Central Question:
What useful information can be derived from the link structure of the web?

Some answers
1. Structure of the Internet
2. Google
3. HITS: Hubs and Authorities
1. The Web Structure
A study was conducted on a graph inferred from two large AltaVista crawls.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.
The study confirmed the hypothesis that the number of in-links and out-links of a page approximately follows a Zipf distribution (a particular case of a power law).
It's a "small world" -> between any two people there is a chain of length only 6!

General Topology
[Bow-tie diagram: IN (44 million pages) -> SCC (56 million) -> OUT (44 million), with Tendrils (44 million) hanging off IN and OUT]

Small World Graphs
High number of relatively small cliques
Small diameter
The Internet (SCC) is a small world graph

Anchor text
Select pages containing w and pages which have in-links with caption w
Provides more accurate descriptions of Web pages
Anchors exist for un-indexable documents (e.g., images)
Font sizes of words in text: words in larger or bolder fonts are assigned higher weights
PageRank
A page is important if many important pages link to it.
Link i→j: i considers j important
the more important i, the more important j becomes
if i has many out-links, its links are less important
Initially, all importances p_i = 1. Iteratively, p_i is refined.
Let OutDegree(i) = number of out-links of page i.
Adjust p_j:
PageRank(j) = p + (1 − p) · Σ_{i→j} PageRank(i) / OutDegree(i)
This is the weighted sum of the importance of the pages referring to page j.
Parameter p is the probability that the surfer gets bored and starts on a new random page; (1 − p) is the probability that the random surfer follows a link on the current page.

3. HITS (Hyperlink-Induced Topic Search)
HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.
Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
Authorities: highly-referenced pages on a topic
Hubs: pages that "point" to authorities
A good authority is pointed to by many good hubs; a good hub points to many good authorities.
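The PageRank update rule can be sketched as a small power-iteration loop. This is an illustrative sketch only: the three-page example graph, the value p=0.15, and the fixed iteration count are assumptions, not from the slides.

```python
# PageRank power-iteration sketch following the slides' formula:
# PageRank(j) = p + (1 - p) * sum over i->j of PageRank(i) / OutDegree(i)
def pagerank(links, p=0.15, iterations=50):
    """links: dict mapping page -> list of pages it links to (assumed format)."""
    pages = set(links) | {j for outs in links.values() for j in outs}
    rank = {page: 1.0 for page in pages}  # initially all importances = 1
    for _ in range(iterations):
        new_rank = {page: p for page in pages}  # p: "bored surfer" restart term
        for i, outs in links.items():
            if not outs:
                continue
            # page i splits its importance evenly over its out-links
            share = (1 - p) * rank[i] / len(outs)
            for j in outs:
                new_rank[j] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# C is linked from both A and B, so it ends up with the highest rank
```

Note how B receives only half of A's importance because A has two out-links, matching the "if i has many out-links, its links are less important" point above.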
HITS: Hubs and Authorities
Steps for discovering hubs and authorities on a specific topic:
Collect a seed set of pages S (returned by a search engine)
Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set (links inside a site are removed)
Iteratively update the hub weight h(p) and authority weight a(p) for each page:
a(p) = Σ_{q→p} h(q)
h(p) = Σ_{p→q} a(q)
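The iterative update step above can be sketched as follows. This is an illustrative sketch: the example graph, the iteration count, and the L2 normalization (needed so the weights do not grow without bound) are assumptions beyond what the slides state.

```python
# HITS iteration sketch: a(p) = sum of h(q) over q->p,
#                        h(p) = sum of a(q) over p->q
def hits(links, iterations=50):
    """links: dict mapping page -> list of pages it links to (assumed format)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # authority weight: sum of hub weights of pages linking to p
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        # hub weight: sum of authority weights of pages p links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # normalize so the weights stay bounded across iterations
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
# a1 is pointed to by all three hubs, so it is the strongest authority
```

The mutual reinforcement is visible in the example: a1 is the best authority because all hubs point to it, and h1/h2 are the best hubs because they point to both authorities.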
Website Usage Analysis
Data Sources
Analog – Web Log File Analyser
http://www.analog.cx/
Gives basic statistics such as:
number of hits
average hits per time period
what are the popular pages in your site
who is visiting your site
what keywords are users searching for to get to you
what is being downloaded

Web Usage Mining Process
[Diagram: Web Server Log Data and Site Data -> Data Preparation (Clean Data) -> Data Mining -> Usage Patterns]
User identification problems:
individual IP addresses are sometimes hidden behind proxy servers
client-side & proxy caching makes server log data less reliable
Standard Solutions/Practices:
user registration – practical?
client-side cookies – not foolproof
cache busting – increases network traffic
Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users

Transaction identification
All of the page references made by a user during a single visit to a site
The size of a transaction can range from a single page reference to all of the page references in the visit
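One such heuristic, combining IP address and browser agent, can be sketched as below. This is an illustrative sketch only: the log records and field names are made up, and real logs would need the temporal information mentioned above as well.

```python
# User-identification heuristic sketch: requests sharing (IP, agent)
# are attributed to the same user, so two browsers behind one proxy
# are counted as two users.
log = [
    {"ip": "1.2.3.4", "agent": "Mozilla/5.0", "url": "/home"},
    {"ip": "1.2.3.4", "agent": "Mozilla/5.0", "url": "/products"},
    {"ip": "1.2.3.4", "agent": "Opera/9.80", "url": "/home"},  # same proxy, other browser
]

def identify_users(log):
    users = {}
    for entry in log:
        key = (entry["ip"], entry["agent"])  # IP + agent as a crude user signature
        users.setdefault(key, []).append(entry["url"])
    return users

print(len(identify_users(log)))  # 2 distinct users behind one IP
```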
Sessionizing
Commonly used approaches:
Time oriented
By total duration of session: not more than 30 minutes
By page stay times (good for short sessions): not more than 10 minutes per page
Navigation oriented (good for short sessions and when timestamps are unreliable)
Referrer is the previous page in the session, or
Referrer is undefined but the request is within 10 secs, or
There is a link from the previous to the current page in the web site

Web Usage Mining
Preprocessing data and adapting existing data mining techniques
For example, association rules do not take into account the order of the page requests
Developing novel data mining models
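The time-oriented page-stay rule above can be sketched as a simple split on inter-request gaps. This is an illustrative sketch: the input format (sorted request timestamps in seconds for one user) is an assumption, and only the 10-minutes-per-page rule is implemented.

```python
# Time-oriented sessionizing sketch: a gap longer than the page-stay
# limit (10 minutes, per the slides) starts a new session.
PAGE_STAY_LIMIT = 10 * 60  # seconds

def sessionize(timestamps):
    """timestamps: sorted list of request times (seconds) for one user."""
    sessions = [[timestamps[0]]] if timestamps else []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > PAGE_STAY_LIMIT:
            sessions.append([curr])    # gap too long: start a new session
        else:
            sessions[-1].append(curr)  # still the same visit
    return sessions

clicks = [0, 120, 300, 5000, 5100]   # two visits, split at the long gap
print(len(sessionize(clicks)))       # 2 sessions
```

The 30-minute total-duration variant would instead compare each timestamp against the first timestamp of the current session.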
Hypertext Probabilistic Grammar Model
[Pipeline diagram: Log Files -> Navigation Sessions -> Hypertext Weighted Grammar -> Hypertext Probabilistic Grammar (N-gram, dynamic model) -> Data Mining Algorithms (BFS, IFE, FG) -> User Navigation Patterns]
Aim: to identify paths with higher probability

Hypertext Weighted Grammar
Example navigation sessions:
A1→A2→A3→A4
A1→A5→A3→A4→A1
A5→A2→A4→A6
A5→A2→A3
A5→A2→A3→A6
A4→A1→A5→A3
A parameter, α, is used when converting the weighted grammar to the corresponding probabilistic grammar:
α=1 – initial probability proportional to the number of page visits
α=0 – initial probability proportional to the number of sessions starting on the page
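A sketch of building the weighted grammar from navigation sessions and of the α-controlled initial probabilities is given below. This is illustrative only: the sessions are a subset of the example above, and modelling α as a linear interpolation between the two extremes is an assumption, since the slides only define the α=0 and α=1 cases.

```python
# Weighted-grammar sketch: count page-to-page transitions, page visits,
# and session starts from navigation sessions.
from collections import Counter

sessions = [
    ["A1", "A2", "A3", "A4"],
    ["A1", "A5", "A3", "A4", "A1"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3"],
]

transitions = Counter()   # weighted grammar: counts of page-to-page moves
visits = Counter()        # total visits per page (alpha=1 case)
starts = Counter()        # sessions starting on each page (alpha=0 case)
for s in sessions:
    starts[s[0]] += 1
    for page in s:
        visits[page] += 1
    for a, b in zip(s, s[1:]):
        transitions[(a, b)] += 1

def initial_probability(page, alpha):
    # alpha=1: proportional to number of page visits
    # alpha=0: proportional to number of sessions starting on the page
    # (linear interpolation between the two is an assumption)
    by_visits = visits[page] / sum(visits.values())
    by_starts = starts[page] / sum(starts.values())
    return alpha * by_visits + (1 - alpha) * by_starts
```

Normalizing each state's outgoing transition counts by its total then yields the probabilistic grammar from which high-probability paths are mined.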
Ongoing Work
Cloning states in order to increase the model accuracy
Conduct a set of experiments to evaluate the usefulness of the model to the end user
Visualisation tool

Applications of the HPG Model
Provide guidelines for the optimisation of a web site structure.
Work as a model of the user's preferences in the creation of adaptive web sites.
Improve search engine technologies by enhancing the random-surfer concept.
Web personal assistant.
Use the model to learn access patterns and predict future accesses: pre-fetch predicted pages to reduce latency; also cache results of popular search engine queries.

Reference
Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
Thank you !!!