
Web Mining

What is Web Mining?

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996, CACM 39(11))

(another definition: mining of data related to the World Wide Web)

Motivation / Opportunity: the WWW is a huge, widely distributed, global information service centre and, therefore, constitutes a rich source for data mining

The Web

- Over 1 billion HTML pages, 15 terabytes
- Wealth of information
  - Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ...
- Diverse media types: text, images, audio, video
- Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3
- Highly dynamic
  - 1 million new pages each day
  - Average page changes in a few weeks
- Graph structure with links between pages
  - Average page has 7-10 links
  - in-links and out-links follow a power-law distribution
- Hundreds of millions of queries per day

Abundance and authority crisis

- Liberal and informal culture of content generation and dissemination
- Redundancy and non-standard form and content
- Millions of qualifying pages for most broad queries
  - Example: java or kayaking
- No authoritative information about the reliability of a site
- Little support for adapting to the background of specific users
How do you suggest we could estimate the size of the web?

One Interesting Approach

- The number of web servers was estimated by sampling and testing random IP addresses and determining the fraction of such tests that successfully located a web server
- The estimate of the average number of pages per server was obtained by crawling a sample of the servers identified in the first experiment

Lawrence, S. and Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740): 107–109.
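The two estimates combine into an overall figure: (fraction of responding IP addresses × size of the address space) × (average pages per server). A minimal sketch of that calculation, where probe_ip and count_pages are hypothetical stand-ins for the actual probing and crawling steps:

import random

def estimate_web_size(probe_ip, count_pages, num_probes=10_000, address_space=2**32):
    # probe_ip(ip) -> bool: does a web server answer on this address? (hypothetical)
    # count_pages(ip) -> int: pages found by crawling that server (hypothetical)
    servers = []
    for _ in range(num_probes):
        ip = ".".join(str(random.randint(0, 255)) for _ in range(4))
        if probe_ip(ip):
            servers.append(ip)
    estimated_servers = (len(servers) / num_probes) * address_space
    sample = servers[:100]                      # crawl only a sample of the servers found
    avg_pages = sum(count_pages(ip) for ip in sample) / max(len(sample), 1)
    return estimated_servers * avg_pages        # estimated total number of web pages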

The Web

- The Web is a huge collection of documents, plus
  - Hyper-link information
  - Access and usage information
- Lots of data on user access patterns
  - Web logs contain the sequence of URLs accessed by users
- Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
  - Exploit hyper-links and access patterns

Applications of web mining

- E-commerce (Infrastructure)
  - Generate user profiles -> improve customization and provide users with pages and advertisements of interest
  - Targeted advertising -> Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites. Internet advertising is probably the “hottest” web mining application today
  - Fraud -> Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought). If the buying pattern changes significantly, signal fraud
- Network management
  - Performance management -> Annual bandwidth demand is increasing ten-fold on average, while annual bandwidth supply is rising only by a factor of three. The result is frequent congestion. During a major event (World Cup), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  - Fault management -> Analyze alarm and traffic data to carry out root cause analysis of faults
Applications of web mining

- Information retrieval (search) on the Web
  - Automated generation of topic hierarchies
  - Web knowledge bases

Why is Web Information Retrieval Important?

- According to most predictions, the majority of human information will be available on the Web in ten years
- Effective information retrieval can aid in
  - Research: Find all papers about web mining
  - Health/Medicine: What could be the reason for symptoms of “yellow eyes”, high fever and frequent vomiting
  - Travel: Find information on the tropical island of St. Lucia
  - Business: Find companies that manufacture digital signal processors
  - Entertainment: Find all movies starring Marilyn Monroe between the years 1960 and 1970
  - Arts: Find all short stories written by Jhumpa Lahiri

Why is Web Information Retrieval Difficult?

- The abundance problem (99% of information is of no interest to 99% of people)
  - Hundreds of irrelevant documents returned in response to a search query
- Limited coverage of the Web (Internet sources hidden behind search interfaces)
  - Largest crawlers cover less than 18% of Web pages
- The Web is extremely dynamic
  - Lots of pages added, removed and changed every day
- Very high dimensionality (thousands of dimensions)
- Limited query interface based on keyword-oriented search
- Limited customization to individual users

Search Engine Relative Size

[Figure: relative sizes of major search engine indexes]
http://www.searchengineshowdown.com/stats/size.shtml
Search Engine Web Coverage Overlap

[Figure: overlap of web coverage among search engines. Four searches were defined that returned 141 web pages. Coverage was about 40% in 1999.]
From http://www.searchengineshowdown.com/stats/overlap.shtml

Web Mining Taxonomy

Web Mining
- Web Content Mining
- Web Structure Mining
- Web Usage Mining

Web Mining Taxonomy

- Web content mining: focuses on techniques for assisting a user in finding documents that meet a certain criterion (text mining)
- Web structure mining: aims at developing techniques to take advantage of the collective judgement of web page quality which is available in the form of hyperlinks
- Web usage mining: focuses on techniques to study user behaviour when navigating the web (also known as Web log mining and clickstream analysis)

Web Content Mining

Examines the content of web pages as well as results of web searching.
Web Content Mining

- Can be thought of as extending the work performed by basic search engines.
- Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users.

Database Approaches

- One approach is to build a local knowledge base: model data on the web and integrate them in a way that enables specifically designed query languages to query the data
  - Store locally abstract characterizations of web pages. A query language makes it possible to query the local repository at several levels of abstraction. As a result of the query the system may have to request pages from the web if more detail is needed
    Zaiane, O. R. and Han, J. (2000). WebML: Querying the world-wide web for resources and knowledge. In Proc. Workshop on Web Information and Data Management, pages 9–12.
  - Build a computer-understandable knowledge base whose content mirrors that of the web and which is created by providing training examples that characterize the wanted document classes
    Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. (1998). Learning to extract symbolic knowledge from the world wide web. In Proc. National Conference on Artificial Intelligence, pages 509–516.

Agent-Based Approach

- Agents search for relevant information using domain characteristics and user profiles
  - A system for extracting a relation from the web, for example, a list of all the books referenced on the web. The system is given a set of training examples which are used to search the web for similar documents. Another application of this tool could be to build a relation with the names and addresses of restaurants referenced on the web
    Brin, S. (1998). Extracting patterns and relations from the world wide web. In Int. Workshop on Web and Databases, pages 172–183.
- Personalized Web agents -> Web agents learn user preferences and discover Web information sources based on these preferences, and those of other individuals with similar interests
  - SiteHelper is a local agent that keeps track of the pages viewed by a given user in previous visits and advises the user on new pages of interest in the next visit
    Ngu, D. S. W. and Wu, X. (1997). SiteHelper: A localized agent that helps incremental exploration of the world wide web. In Proc. WWW Conference, pages 691–700.

Web Structure Mining

Exploiting Hyperlink Structure
First generation of search engines

- Early days: keyword based searches
  - Keywords: “web mining”
  - Retrieves documents containing “web” and “mining”
- Later on: cope with
  - the synonymy problem
  - the polysemy problem
  - stop words
- Common characteristic: only information on the pages themselves is used

Modern search engines

- Link structure is very important
  - Adding a link is a deliberate act
  - Harder to fool systems using in-links
  - A link is a “quality mark”
- Modern search engines use link structure as an important source of information

Central Question

What useful information can be derived from the link structure of the web?

Some answers

1. Structure of the Internet
2. Google
3. HITS: Hubs and Authorities
1. The Web Structure

- A study was conducted on a graph inferred from two large Altavista crawls.
  Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. (2000). Graph structure in the web. In Proc. WWW Conference.
- The study confirmed the hypothesis that the number of in-links and out-links of a page approximately follows a Zipf distribution (a particular case of a power-law)
- If the web is treated as an undirected graph, 90% of the pages form a single connected component
- If the web is treated as a directed graph, four distinct components of similar size are identified

General Topology

[Figure: bow-tie structure of the web: IN (44 million pages), SCC (56 million), OUT (44 million), Tendrils (44 million), plus Tubes and Disconnected components]

- SCC: set of pages that can reach one another
- IN: pages that have a path to the SCC but not from it
- OUT: pages that can be reached from the SCC but cannot reach it
- TENDRILS: pages that can neither reach nor be reached from the SCC
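A minimal sketch of how these components can be identified relative to the giant SCC, using forward and backward reachability; the toy graph and the chosen core page are illustrative assumptions:

from collections import deque

graph = {"a": ["b"], "b": ["c"], "c": ["b", "d"], "d": [], "e": ["a"], "f": []}
reverse = {u: [] for u in graph}
for u, vs in graph.items():
    for v in vs:
        reverse[v].append(u)

def reachable(adj, start):
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

core = "b"                                    # a page assumed to lie in the giant SCC
fwd, bwd = reachable(graph, core), reachable(reverse, core)
scc = fwd & bwd                               # pages that reach and are reached by the core
in_comp = bwd - scc                           # can reach the SCC but not be reached from it
out_comp = fwd - scc                          # reached from the SCC but cannot reach it
rest = set(graph) - scc - in_comp - out_comp  # tendrils, tubes, disconnected pages
print(scc, in_comp, out_comp, rest)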

Some statistics

- A connecting path exists between only about 25% of page pairs, BUT
- If there is a path:
  - Directed: average length < 17
  - Undirected: average length < 7 (!!!)
- It’s a “small world” -> between two people there is a chain of length only 6!
- Small world graphs
  - High number of relatively small cliques
  - Small diameter
- The Internet (SCC) is a small world graph

2. Google

- A search engine that uses link structure to calculate a quality ranking (PageRank) for each page
- Intuition: PageRank can be seen as the probability that a “random surfer” visits a page
  Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117.
- Keywords w entered by user
- Select pages containing w and pages which have in-links with caption w
  - Anchor text
    - Provides more accurate descriptions of Web pages
    - Anchors exist for un-indexable documents (e.g., images)
  - Font sizes of words in text: words in larger or bolder fonts are assigned higher weights
- Rank pages according to importance
PageRank

- PageRank: a page is important if many important pages link to it.
- Link i→j:
  - i considers j important.
  - the more important i is, the more important j becomes.
  - if i has many out-links, its links are less important.
- Initially all importances p_i = 1; iteratively, p_i is refined.
- Let OutDegree(i) = number of out-links of page i
- Adjust p_j:

  PageRank(j) = p + (1 − p) · Σ_{i→j} PageRank(i) / OutDegree(i)

- This is the weighted sum of the importance of the pages referring to page j
- The parameter p is the probability that the surfer gets bored and starts on a new random page
- (1 − p) is the probability that the random surfer follows a link on the current page (see the sketch after the next slide)

3. HITS (Hyperlink-Induced Topic Search)

- HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
  Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632.
- Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
  - Authorities: highly-referenced pages on a topic
  - Hubs: pages that “point” to authorities
- A good authority is pointed to by many good hubs; a good hub points to many good authorities
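A minimal sketch of the PageRank iteration from the previous slide on a toy link graph; the graph, the value of p and the number of iterations are illustrative assumptions:

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
pages = list(out_links)
p = 0.15                                      # probability the surfer "gets bored"
rank = {page: 1.0 for page in pages}          # initially all importances are 1

for _ in range(50):                           # iterate until the ranks stabilise
    new_rank = {}
    for j in pages:
        incoming = sum(rank[i] / len(out_links[i])
                       for i in pages if j in out_links[i])
        # PageRank(j) = p + (1 - p) * sum over i->j of PageRank(i)/OutDegree(i)
        new_rank[j] = p + (1 - p) * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))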

Hubs and Authorities

[Figure: two link diagrams. Page A on the left is an authority (many pages point to it); page A on the right is a hub (it points to many pages)]

HITS

- Steps for discovering hubs and authorities on a specific topic
  - Collect a seed set of pages S (returned by a search engine)
  - Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set (removing links inside a site)
  - Iteratively update the hub weight h(p) and authority weight a(p) for each page:

    a(p) = Σ_{q→p} h(q)
    h(p) = Σ_{p→q} a(q)

  - After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
- Extensions proposed in Clever
  - Assign links different weights based on the relevance of the link anchor text
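A minimal sketch of the hub/authority iteration on a toy base set; the graph is an illustrative assumption, and the L2 normalisation in each round follows Kleinberg's formulation:

from math import sqrt

links = {"h1": ["a1", "a2"], "h2": ["a1", "a3"], "h3": ["a2"],
         "a1": [], "a2": [], "a3": []}
pages = list(links)
hub = {page: 1.0 for page in pages}
auth = {page: 1.0 for page in pages}

for _ in range(20):
    # a(p) = sum of h(q) over pages q that link to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # h(p) = sum of a(q) over pages q that p links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    norm_a = sqrt(sum(v * v for v in auth.values())) or 1.0
    norm_h = sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / norm_a for p, v in auth.items()}
    hub = {p: v / norm_h for p, v in hub.items()}

print(max(auth, key=auth.get), max(hub, key=hub.get))   # best authority, best hub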
Applications of HITS

- Search engine querying
- Finding web communities
- Finding related pages
- Populating categories in web directories
- Citation analysis

Web Usage Mining

analyzing user web navigation

Web Usage Mining

- Pages contain information
- Links are “roads”
- How do people navigate over the Internet?
- ⇒ Web usage mining (clickstream analysis)
- Information on navigation paths is available in log files.
- Logs can be examined from either a client or a server perspective.

Website Usage Analysis

- Why analyze Website usage?
  - Knowledge about how visitors use a Website could
    - Provide guidelines for web site reorganization; help prevent disorientation
    - Help designers place important information where the visitors look for it
    - Support pre-fetching and caching of web pages
    - Provide an adaptive Website (personalization)
- Questions which could be answered
  - What are the differences in usage and access patterns among users?
  - Which user behaviours change over time?
  - How do usage patterns change with quality of service (slow/fast)?
  - What is the distribution of network traffic over time?
[Figure-only slides: Website Usage Analysis; Data Sources]

Data Sources

- Server level collection: the server stores data regarding requests performed by clients, so the data generally concern just one source;
- Client level collection: the client itself sends information regarding the user's behaviour to a repository (can be implemented by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities);
- Proxy level collection: information is stored at the proxy side, so the Web data concern several Websites, but only users whose Web clients pass through the proxy.

An Example of a Web Server Log

[Figure: sample web server log entries]
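Server logs of this kind typically follow the Common Log Format; a minimal sketch of parsing one entry (the log line shown is a hypothetical example):

import re

line = '192.168.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
pattern = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')
host, timestamp, method, url, status, size = pattern.match(line).groups()
print(host, timestamp, method, url, status, size)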
Analog – Web Log File Analyser
http://www.analog.cx/

- Gives basic statistics such as
  - number of hits
  - average hits per time period
  - what are the popular pages in your site
  - who is visiting your site
  - what keywords are users searching for to get to you
  - what is being downloaded

Web Usage Mining Process

[Diagram: Web Server Log Data and Site Data → Data Preparation → Clean Data → Data Mining → Usage Patterns]

Data Preparation

- Data cleaning
  - By checking the suffix of the URL name: for example, remove all log entries with filename suffixes such as gif, jpeg, etc. (see the sketch below)
- User identification
  - If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
  - Other heuristics involve using a combination of IP address, machine name, browser agent, and temporal information to identify users
- Transaction identification
  - All of the page references made by a user during a single visit to a site
  - The size of a transaction can range from a single page reference to all of the page references

Sessionizing

- Main questions:
  - how to identify unique users
  - how to identify/define a user transaction
- Problems:
  - user ids are often suppressed due to security concerns
  - individual IP addresses are sometimes hidden behind proxy servers
  - client-side & proxy caching makes server log data less reliable
- Standard solutions/practices:
  - user registration – practical?
  - client-side cookies – not foolproof
  - cache busting – increases network traffic
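A minimal sketch of the suffix-based cleaning step above; the suffix list is an assumption and would be adjusted for the site being analysed:

IGNORE_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js", ".ico")

def clean(entries):
    # entries: list of dicts with at least a "url" key, e.g. produced by log parsing
    return [e for e in entries
            if not e["url"].lower().split("?")[0].endswith(IGNORE_SUFFIXES)]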
Sessionizing

- Commonly used approaches
  - Time oriented
    - By total duration of session: not more than 30 minutes (see the sketch below)
    - By page stay times (good for short sessions): not more than 10 minutes per page
  - Navigation oriented (good for short sessions and when timestamps are unreliable)
    - Referrer is the previous page in the session, or
    - Referrer is undefined but the request is within 10 secs, or
    - There is a link from the previous to the current page in the web site
- The task of identifying the sequence of requests from a user is not trivial - see Berendt et al., Measuring the Accuracy of Sessionizers for Web Usage Analysis, SIAM-DM01

Web Usage Mining

- Preprocessing data and adapting existing data mining techniques
  - For example, association rules do not take into account the order of the page requests
- Developing novel data mining models
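A minimal sketch of the time-oriented heuristic above (30-minute total session duration); requests are assumed to be already grouped per user and sorted by time:

SESSION_LIMIT = 30 * 60     # seconds

def sessionize(requests):
    # requests: list of (timestamp_in_seconds, url) for one user, in time order
    sessions, current = [], []
    for ts, url in requests:
        if current and ts - current[0][0] > SESSION_LIMIT:
            sessions.append(current)     # total duration exceeded: start a new session
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions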

An Example of Preprocessing Data and Adapting Existing Data Mining Techniques

- Chen, M.-S., Park, J. S., and Yu, P. S. (1998). Efficient data mining for traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 10(2): 209–221.
- The log data are converted into a tree, from which a set of maximal forward references is inferred. The maximal forward references are then processed by existing association rule techniques. Two algorithms are given to mine for the rules, which in this context consist of large itemsets with the additional restriction that the references must be consecutive in a transaction. (see the sketch below)

Mining Navigation Patterns

- Each session induces a user trail through the site
- A trail is a sequence of web pages followed by a user during a session, ordered by time of access.
- A pattern in this context is a frequent trail.
- Co-occurrence of web pages is important, e.g. shopping-basket and checkout.
- Use a Markov chain model to model the user navigation records inferred from the log data: the Hypertext Probabilistic Grammar.
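A minimal sketch of deriving maximal forward references from a single session: a revisit to a page already on the current path ends the forward path and backtracks to that page.

def maximal_forward_references(session):
    # session: list of page ids in time order, e.g. ["A", "B", "C", "B", "D"]
    path, mfrs, moving_forward = [], [], True
    for page in session:
        if page in path:                          # backward reference: the user went back
            if moving_forward and len(path) > 1:
                mfrs.append(list(path))           # the path so far is a maximal forward reference
            path = path[:path.index(page) + 1]    # truncate back to the revisited page
            moving_forward = False
        else:                                     # forward reference: extend the path
            path.append(page)
            moving_forward = True
    if moving_forward and len(path) > 1:
        mfrs.append(list(path))                   # emit the final forward path
    return mfrs

# e.g. ["A", "B", "C", "B", "D"] -> [["A", "B", "C"], ["A", "B", "D"]]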
Hypertext Probabilistic Grammar Model

[Diagram: Log Files → Navigation Sessions → Hypertext Weighted Grammar → Hypertext Probabilistic Grammar → Data Mining Algorithms (BFS, IFE, FG) → User Navigation Patterns; extensions: Ngram, Dynamic model]

Goal: to identify paths with higher probability

Hypertext Weighted Grammar

Example navigation sessions:
A1→A2→A3→A4
A1→A5→A3→A4→A1
A5→A2→A4→A6
A5→A2→A3
A5→A2→A3→A6
A4→A1→A5→A3

A parameter, α, is used when converting the weighted grammar to the corresponding probabilistic grammar:
- α=1 – initial probability proportional to the number of page visits
- α=0 – initial probability proportional to the number of sessions starting on a page
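A minimal sketch, an approximation rather than the original implementation, of building a first-order weighted grammar from the example sessions and using α to mix the two choices of initial probability described above:

from collections import Counter, defaultdict

sessions = [["A1", "A2", "A3", "A4"],
            ["A1", "A5", "A3", "A4", "A1"],
            ["A5", "A2", "A4", "A6"],
            ["A5", "A2", "A3"],
            ["A5", "A2", "A3", "A6"],
            ["A4", "A1", "A5", "A3"]]

visits = Counter(p for s in sessions for p in s)        # total page visits
starts = Counter(s[0] for s in sessions)                # sessions starting on a page
links = defaultdict(Counter)                            # observed transition counts
for s in sessions:
    for a, b in zip(s, s[1:]):
        links[a][b] += 1

def initial_prob(page, alpha):
    # alpha = 1: proportional to page visits; alpha = 0: proportional to session starts
    w = alpha * visits[page] + (1 - alpha) * starts[page]
    total = alpha * sum(visits.values()) + (1 - alpha) * sum(starts.values())
    return w / total

def transition_prob(a, b):
    return links[a][b] / sum(links[a].values())

print(initial_prob("A5", alpha=0.5), transition_prob("A2", "A3"))   # P(A3 | A2) = 3/4 here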

Ngram model

- We make use of the Ngram concept in order to improve the model's accuracy in representing user sessions.
- The Ngram model assumes that only the previous n−1 visited pages have a direct effect on the probability of the next page chosen.
- A state corresponds to a navigation trail with n−1 pages
- A chi-square test is used to assess the order of the model (in most cases N=3 is enough)
- Experiments have shown that the number of states is manageable

Ngram model

Example navigation sessions:
A1→A2→A3→A4
A1→A5→A3→A4→A1
A5→A2→A4→A6
A5→A2→A3
A5→A2→A3→A6
A4→A1→A5→A3
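A minimal sketch of the N=3 case on the sessions above: each state is the trail of the previous n−1 = 2 pages, and the next page depends only on that state.

from collections import Counter, defaultdict

sessions = [["A1", "A2", "A3", "A4"],
            ["A1", "A5", "A3", "A4", "A1"],
            ["A5", "A2", "A4", "A6"],
            ["A5", "A2", "A3"],
            ["A5", "A2", "A3", "A6"],
            ["A4", "A1", "A5", "A3"]]

counts = defaultdict(Counter)
for s in sessions:
    for i in range(len(s) - 2):
        state = (s[i], s[i + 1])            # the previous two pages form the state
        counts[state][s[i + 2]] += 1        # count the page that followed

def next_page_prob(state, page):
    return counts[state][page] / sum(counts[state].values())

print(next_page_prob(("A5", "A2"), "A3"))   # -> 2/3 with the sessions above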
Ongoing Work

- Cloning states in order to increase the model accuracy

Applications of the HPG Model

- Provide guidelines for the optimisation of a web site's structure.
- Work as a model of the user's preferences in the creation of adaptive web sites.
- Improve search engine technologies by enhancing the random surfer concept.
- Web personal assistant.
- Visualisation tool.
- Use the model to learn access patterns and predict future accesses. Pre-fetch predicted pages to reduce latency.
- Also cache results of popular search engine queries.

Future Work

- Conduct a set of experiments to evaluate the usefulness of the model to the end user.
  - Individual user
  - Web site owner
- Incorporate the categories that users are navigating through so we may better understand their activities.
  - E.g. what type of book is the user interested in; this may be used for recommendation.
- Devise methods to compare the precision of different order models.

References

- Data Mining: Introductory and Advanced Topics, Margaret Dunham (Prentice Hall, 2002)
- Mining the Web: Discovering Knowledge from Hypertext Data, Soumen Chakrabarti (Morgan Kaufmann Publishers)
Thank you !!!