Supercomputing 2002
Yair Even-Zohar
Automated Learning Group National Center for Supercomputing Applications University of Illinois evenzoha@ncsa.uiuc.edu
BOISE, Idaho (CNN) -- Cooler weather Monday in the U.S. Pacific Northwest may help more than 27,000 firefighters in their marathon battle against dozens of wildfires. Ten new large fires were reported overnight into Monday, bringing to 40 the number of large fires ablaze, said officials at the National Interagency Fire Center. They said between 300,000 and 400,000 acres are aflame. The good news is that five large fires were contained Sunday. So far, about 2.8 million acres of forest have burned this year, making 2001 an average year for fires. Center officials said 241 fires were reported Sunday into Monday but 96 percent were contained or extinguished in "initial attacks" by firefighters. A large fire is defined as a fire burning uncontained and extending over 100 acres or more.
Strict definition
Information that not even the writer knows, e.g., discovering a new method for hair growth that is described as a side effect of a different procedure
Lenient definition
Rediscover the information that the author encoded in the text, e.g., automatically extracting a product's name from a web page.
Outline
Text characteristics
Text mining process
Learning methods
Marketing: Discover distinct groups of potential buyers according to a user's text-based profile
e.g., Amazon
e.g., www.flipdog.com
Information Retrieval
Information Extraction
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
Information Retrieval
Given:
Documents source
IR System
Find:
A set (ranked) of documents that are relevant to the query
Ranked Documents
meaning of words
hot dog stand in the amusement park
hot amusement stand in the dog park
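The two phrases above contain exactly the same words, so a retrieval system that ignores word order cannot tell them apart. A minimal sketch of that effect, assuming a simple whitespace tokenizer (the helper name `bag_of_words` is hypothetical and not from the talk):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (order is discarded)."""
    return Counter(text.lower().split())

a = "hot dog stand in the amusement park"
b = "hot amusement stand in the dog park"

# Different strings, identical word counts.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)  # True
```

Any representation that is meant to separate the two phrases needs to keep some ordering information, for example word pairs.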
Given:
Find:
Sentences with relevant information
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output in a predetermined format
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
Incident Date: 19 Apr 89
Incident Type: Bombing
Perpetrator Individual ID: urban guerillas
...
alg | Automated Learning Group
Query 2
(E.g. salary)
Extraction System
Ranked Documents
Book/CD/Video stores (e.g., Amazon)
Restaurant information (e.g., Zagats)
Car prices (e.g., Carpoint)
People who ski also frequently break their legs.
Restaurants that serve seafood in California are likely to be outside San Francisco.
Web
Spider
Documents source
Query
IR / IE System
Ranked Documents
meaning of words
order of words in the query
user dependency for the data
authority of the source
What is Clustering ?
Given:
Documents source
Similarity measure
Clustering System
Find:
[Figure: documents (Doc) grouped into clusters]
Outline
Text characteristics
Text mining process
Learning methods
High dimensionality
Several input modes
Dependency
Ambiguity
Noisy data
Not well structured text
Text characteristics
Efficiency consideration
Over 2,000,000,000 web pages
Almost all publications are also in electronic form
e.g., Web mining: information about the user is generated from semantics, browsing patterns, and outside knowledge bases.
Text characteristics
Dependency
Ambiguity
Word ambiguity
Pronouns (he, she)
Synonyms (buy, purchase)
Semantic ambiguity
The king saw the rabbit with his glasses. (8 meanings)
Text characteristics
Chat rooms
r u available ?
Hey whazzzzzz up
Speech
Outline
Text characteristics
Text mining process
Learning methods
Text preprocessing
Features Generation
Features Selection
Text/Data Mining
Analyzing results
Parsing
Generates a parse tree (graph) for each sentence
Each sentence is a stand-alone graph
Bag of words: e.g., Lord of the rings → {the, Lord, rings, of}
Highly efficient
Makes learning far simpler and easier
Order of words is not that important for certain applications
Stemming: e.g., flying, flew → fly
Reduce dimensionality
Stop words: The most common words are unlikely to help text mining
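The preprocessing ideas above (treating a text as an unordered set of words and dropping very common words) can be sketched as follows. The stop-word list is a small hypothetical sample rather than a standard list, and the function name `preprocess` is an assumption made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "a", "an", "in", "is", "to", "and"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in stop_words}

features = preprocess("Lord of the rings")
print(sorted(features))  # ['lord', 'rings']
```

Stemming is left out on purpose: mapping an irregular form such as "flew" to "fly" needs a dictionary or a dedicated stemmer, which this sketch does not attempt.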
<book>
  <title> The making of a bad movie </title>
  <author>
    <name> Scooby-Doo </name>
    <affiliation> Cartoons </affiliation>
  </author>
</book>
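When the text carries markup like the fragment above, the tags themselves can supply features. A minimal sketch using Python's standard `xml.etree.ElementTree`, where the dictionary layout of `record` is an assumption chosen for illustration:

```python
import xml.etree.ElementTree as ET

doc = """<book>
  <title> The making of a bad movie </title>
  <author>
    <name> Scooby-Doo </name>
    <affiliation> Cartoons </affiliation>
  </author>
</book>"""

root = ET.fromstring(doc)

# Each tag path becomes a named field; surrounding whitespace is stripped.
record = {
    "title": root.findtext("title").strip(),
    "author": root.findtext("author/name").strip(),
    "affiliation": root.findtext("author/affiliation").strip(),
}
print(record["title"])  # The making of a bad movie
```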
Feature selection
Reduce dimensionality
Irrelevant features
do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface
do core datatype package hi complete week independence update application unfortunate transformation month handle action different right table previous due use insistence add member list group keep ncsa interface d2k call modules sub-interface
core datatype package complete independence application transformation handle different table previous add list interface call sub-interface
Each record contains a set of features (attributes), and the true class (label)
Find: a model for the class as a function of the values of the features
Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
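The train/test procedure described above can be sketched in a few lines. The records, the 30% test fraction, and the stand-in `model` are all hypothetical; in practice the model would be learned from the training set.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Hypothetical records: (features, true class).
records = [({"english": i % 2 == 0}, "Yes" if i % 2 == 0 else "No") for i in range(10)]
train, test = train_test_split(records)

# Stand-in for a learned classifier.
model = lambda features: "Yes" if features["english"] else "No"
acc = accuracy(model, test)
print(acc)  # 1.0
```

The test records are never shown to the learner, so the measured accuracy estimates how the model behaves on previously unseen records.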
Documents in one cluster are more similar to one another
Documents in separate clusters are less similar to one another
Goal:
Similarity Measures:
Euclidean Distance if attributes are continuous
Other Problem-specific Measures, e.g., how many words are common in these documents
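Both kinds of measure mentioned above are short to write down. This is a minimal sketch, assuming documents are either numeric vectors of equal length or plain strings split on whitespace; the function names are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Distance between two equal-length numeric vectors (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def common_words(doc_a, doc_b):
    """Number of distinct words the two documents share (larger means more similar)."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

d = euclidean_distance([0.0, 3.0], [4.0, 0.0])
shared = common_words("Italian football fans were cheering", "An English football fan")
print(d, shared)  # 5.0 1
```

Note that the two measures point in opposite directions: a distance shrinks as documents become more alike, while a shared-word count grows.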
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning: The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
e.g., classifying a football document as a basketball document is not as bad as classifying it as a crime document.
Outline
Text characteristics
Text mining process
Learning methods
Classification
Clustering
Classification: An Example
[Figure: a Training Set of records (attributes such as Income, class Hooligan) → Learn Classifier → Model, which is then applied to a Test Set]
Ex# 1 to 8, each labeled Hooligan Yes or No:
1 An English football fan
2 During a game in Italy
3 England has been beating France
4 Italian football fans were cheering
5 An average USA salesman earns 75K
6 The game in London was horrific
7 Manchester City is likely to win the championship
8 Rome is taking the lead in the football league
[Figure: Training Set → Learn Classifier → Model, which is then applied to a Test Set]
Classification Techniques
Instance-Based Methods
Decision trees
Neural networks
Bayesian classification
Instance-based Methods
Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance
The k-NN returns the most common value among the k nearest training examples
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: positive (+) and negative (_) training examples scattered around a query point (?)]
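The k-NN rule described above (store the examples, then vote among the k closest ones at query time) can be sketched as follows. The training points and labels are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Return the most common label among the k training points nearest to query."""
    neighbors = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training examples: (point, label).
training = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.5), "+"),
    ((6.0, 6.0), "-"), ((6.5, 7.0), "-"), ((7.0, 6.5), "-"),
]

label = knn_classify((2.0, 2.0), training, k=3)
print(label)  # +
```

No model is built in advance; all the work happens when a new instance arrives, which is the lazy evaluation mentioned earlier.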
Classification Techniques
Instance-Based Methods
Decision trees
Neural networks
Bayesian classification
Splitting Attributes
Training table: Ex# 1 to 10 with columns Country, Marital Status, Income, and Hooligan
[Figure: decision tree. The root tests English; a MarSt node separates Single, Divorced from Married (Married → NO); an Income node splits at 80K (> 80K → NO, < 80K → YES)]
The splitting attribute at a node is determined based on a specific Attribute selection algorithm
Splitting Attributes
Training examples, Ex# 1 to 8:
1 An English football fan
2 During a game in Italy
3 England has been beating France
4 Italian football fans were cheering
5 An average USA salesman earns 75K
6 The game in London was horrific
7 Manchester City is likely to win the championship
8 Rome is taking the lead in the football league
Hooligan column (Ex# 1 to 8): Yes, Yes, Yes, No, No, Yes, Yes, Yes
[Figure: decision tree whose root tests the attribute English]
The splitting attribute at a node is determined based on a specific Attribute selection algorithm
Classification by DT Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
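The slides do not say which attribute selection algorithm is used to pick the splitting attribute. One common choice is information gain, sketched here under that assumption; the four training rows and their attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical rows, not the table from the slides.
rows = [
    {"english": "yes", "married": "no", "hooligan": "yes"},
    {"english": "yes", "married": "yes", "hooligan": "yes"},
    {"english": "no", "married": "no", "hooligan": "no"},
    {"english": "no", "married": "yes", "hooligan": "no"},
]

best = max(["english", "married"], key=lambda a: information_gain(rows, a, "hooligan"))
print(best)  # english
```

In this toy data, splitting on `english` separates the classes completely (gain of 1 bit), while `married` tells nothing about the class (gain of 0), so `english` would become the test at the root node.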
Classification Techniques
Instance-Based Methods
Decision trees
Neural networks
Bayesian classification
[Figure: perceptron. Input vector over the words as, ball, cool, FC, soccer, football; a weights vector (values shown include 1, 1.2, -0.2, -2); output node Hooligan]
The n-dimensional input vector is used to classify by means of multiplication and a function mapping
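The "multiplication and a function mapping" above can be read as a dot product followed by a threshold. A minimal sketch under that reading: the vocabulary is taken from the slide, while the weights, the threshold of 1.0, and the helper names are hypothetical.

```python
def perceptron_output(weights, inputs, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

vocabulary = ["as", "ball", "cool", "FC", "soccer", "football"]
weights = [0.0, 0.7, -0.4, 1.2, 0.9, 1.2]  # hypothetical values, one per word

def to_vector(text):
    """Binary input vector: 1 where the vocabulary word occurs in the text."""
    tokens = set(text.split())
    return [1 if word in tokens else 0 for word in vocabulary]

prediction = perceptron_output(weights, to_vector("FC football fans"), threshold=1.0)
print(prediction)  # 1
```

Here the weighted sum is 1.2 + 1.2 = 2.4, which exceeds the threshold, so the example is assigned to the positive class.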
[Figure: single-layer network. Input nodes for the words of an example message feed a target node NO SPAM; a second panel shows target nodes Class 1, Class 2, Class 3 connected by weighted edges to the input nodes as, ball, cool, FC, soccer, football]
Network of threshold gates
Target nodes represent class labels
Input nodes represent the relations (features) in the example
(On the order of 10^5 input features for many target nodes.)
A Multi-layer Perceptron
[Figure: multi-layer perceptron. Input vector (as, ball, cool, FC, soccer, football) → weights vector → hidden nodes → weights vector → threshold → output node Hooligan; edge weights are shown as numbers]
Neural Networks
Advantages
Prediction accuracy is generally high
Robust, works when training examples contain errors
Fast evaluation of the learned target function
Easy to compute using parallel processors
Disadvantages
Long training time
Difficult to understand the learned function (weights)
Difficult to incorporate domain knowledge
Classification Techniques
Instance-Based Methods
Decision trees
Neural networks
Bayesian classification
Bayesian Classification
Idea: assign to example X the class label C such that P(C|X) is maximal
Standard:
Provide a standard of optimal decision making against which other methods can be measured
In a simpler form, provide a baseline against which other methods can be measured
Estimating Probabilities
Bayes theorem:
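$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$
P(X) is constant for all classes, so it suffices to maximize:
$$P(C \mid X) \propto P(X \mid C)\,P(C)$$
Naive assumption (features are conditionally independent given the class):
$$P(x_1, \ldots, x_k \mid C) = P(x_1 \mid C) \cdots P(x_k \mid C)$$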
P(x_i|C) is estimated as the relative frequency of examples in class C that have value x_i for the feature
Computationally easy!
The independence assumption:
makes computation possible
yields optimal classifiers when satisfied
but is seldom satisfied in practice, as attributes (variables) are often correlated
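The estimation scheme above can be sketched directly: count how often each feature occurs per class, then multiply the class prior by the per-feature relative frequencies. The training examples are hypothetical, only features present in the example are multiplied in, and no smoothing is applied, so an unseen feature drives a class score to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (set_of_features, class_label). Returns the raw counts."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        feature_counts[label].update(features)
    return class_counts, feature_counts

def classify(features, class_counts, feature_counts):
    """Score each class by P(C) times the product of P(x_i|C); return the best class."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total                          # P(C)
        for f in features:
            score *= feature_counts[label][f] / n  # P(x_i|C) as a relative frequency
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical training examples: (words in the document, Hooligan label).
examples = [
    ({"football", "fan"}, "Yes"),
    ({"football", "game"}, "Yes"),
    ({"salesman", "earns"}, "No"),
    ({"game", "salesman"}, "No"),
]
class_counts, feature_counts = train_naive_bayes(examples)
predicted, scores = classify({"football"}, class_counts, feature_counts)
print(predicted, scores)  # Yes {'Yes': 0.5, 'No': 0.0}
```

P(X) never appears in the code because it is the same for every class and does not change which class scores highest.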
One way to relax this assumption: Bayesian networks, which combine Bayesian reasoning with causal relationships between features
Clustering Techniques
Partitioning Methods
Hierarchical Methods
Partitioning Algorithms
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen partitioning criterion
Globally optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster
[Figure: two k-means scatter plots, both axes running from 0 to 10]
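The k-means heuristic named above alternates two steps: assign every point to its nearest center, then move each center to the mean of its points. A minimal sketch, assuming 2-D points, a fixed number of iterations, and initial centers sampled from the data (all hypothetical choices; `math.dist` requires Python 3.8 or later):

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Return (centers, clusters) after a fixed number of assign/update rounds."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, k=2)
groups = sorted(sorted(c) for c in clusters)
print(groups)
```

The result is a local optimum that depends on the initial centers; on data this well separated every initialization reaches the same two groups, but that is not guaranteed in general.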
Clustering Techniques
Partitioning Methods
Hierarchical Methods
Hierarchical Clustering
Agglomerative:
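Start with each document being a single cluster. Eventually all documents belong to the same cluster.
Divisive:
Start with all documents belonging to the same cluster. Eventually each document forms a cluster on its own.
Does not require the number of clusters k in advance
Needs a termination condition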
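[Figure: hierarchical clustering of a, b, c, d, e. Agglomerative (Step 0 to Step 4): a and b merge into ab, d and e into de, c and de into cde, then ab and cde into abcde. Divisive runs the same steps in reverse (Step 4 to Step 0)]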
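The agglomerative direction can be sketched as repeatedly merging the two closest clusters until one remains. This sketch assumes single-link distance (the closest pair of members) and places the five items a to e on a line at hypothetical coordinates chosen for illustration.

```python
def agglomerative(items, distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[item] for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest members of the two clusters.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical 1-D positions for the five items.
position = {"a": 0.0, "b": 0.5, "c": 5.0, "d": 8.0, "e": 9.0}
merges = agglomerative(list(position), lambda p, q: abs(position[p] - position[q]))
history = ["".join(m) for m in merges]
print(history)  # ['ab', 'de', 'cde', 'abcde']
```

Stopping the loop early, for example once a distance threshold is exceeded, plays the role of the termination condition and yields however many clusters remain at that point.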
Demo
Summary
http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
One may play with any phase of the text mining process
Summary
Many other scientific and statistical text mining methods have been developed but are not covered in this talk.
http://www.cs.utexas.edu/users/pebronia/text-mining/
http://filebox.vt.edu/users/wfan/text_mining.html
Data Mining: Concepts and Techniques / J. Han & M. Kamber
Machine Learning / T. Mitchell