Text Mining

Introduction to Text Mining
Supercomputing 2002
Yair Even-Zohar
Automated Learning Group, National Center for Supercomputing Applications, University of Illinois
evenzoha@ncsa.uiuc.edu
Problem: Document Categorization
BOISE, Idaho (CNN) -- Cooler weather Monday in the U.S. Pacific Northwest may help more than 27,000 firefighters in their marathon battle against dozens of wildfires. Ten new large fires were reported overnight into Monday, bringing to 40 the number of large fires ablaze, said officials at the National Interagency Fire Center. They said between 300,000 and 400,000 acres are aflame. The good news is that five large fires were contained Sunday. So far, about 2.8 million acres of forest have burned this year, making 2001 an average year for fires. Center officials said 241 fires were reported Sunday into Monday but 96 percent were contained or extinguished in "initial attacks" by firefighters. A large fire is defined as a fire burning uncontained and extending over 100 acres or more.
Politics Economic xxxxxxx USA World xxx
alg | Automated Learning Group
Problem: Document Clustering

The long list of demands that President Bush laid out in a purposeful speech set a very tough standard for avoiding war.
Defiant in the face of criticism, Prime Minister Ariel Sharon today praised an attack that killed 13 Palestinians.
Scrambling to save their proposed merger, EchoStar Communications and Hughes Electronics urged the F.C.C. to delay a decision on the $26 billion deal.
Stocks rose this morning after a speech by President Bush helped rein in rampant war fears and lured skittish investors back into the market.
Signs emerged that the S.E.C. is deeply fractured over the search for candidates to fill the new board that will oversee the accounting profession.
Text Mining Definition

Many definitions in the literature
The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.
An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.
Text Mining Definition
What is previously unknown information?
Strict definition
Information that not even the writer knows.
e.g., discovering a new method for hair growth that is described as a side effect of a different procedure
Lenient definition
Rediscover the information that the author encoded in the text.
e.g., automatically extracting a product's name from a web page
Outline
Text mining applications
Text characteristics
Text mining process
Learning methods
Text Mining Applications
Marketing: Discover distinct groups of potential buyers according to a user's text-based profile
e.g., Amazon
Industry: Identify groups of competitors' web pages
e.g., competing products and their prices
Job seeking: Identify parameters in searching for jobs
e.g., www.flipdog.com
Text Mining Methods
Information Retrieval
Indexing and retrieval of textual documents
Information Extraction
Extraction of partial knowledge in the text
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
Generating collections of similar text documents
Information Retrieval
Given:
A source of textual documents
A user query (text based)
Find:
A ranked set of documents that are relevant to the query
(Figure: Documents source + Query (e.g., Spam / Text) -> IR System -> Ranked Documents)
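A minimal sketch of ranked retrieval, assuming a simple TF-IDF overlap score between the query and each document. The scoring formula, the `rank` function, and the toy documents are illustrative assumptions, not the IR system from the talk.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def rank(query, documents):
    """Return (score, document index) pairs, best match first."""
    doc_counts = [Counter(tokenize(d)) for d in documents]
    n = len(documents)
    scores = []
    for i, counts in enumerate(doc_counts):
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: in how many documents the term occurs.
            df = sum(1 for c in doc_counts if term in c)
            if df:
                # Term frequency times a smoothed inverse document frequency.
                score += counts[term] * math.log((1 + n) / df)
        scores.append((score, i))
    return sorted(scores, key=lambda s: (-s[0], s[1]))


docs = [
    "wildfires burn forest acres",
    "stocks rose after the speech",
    "firefighters battle wildfires in idaho",
]
ranking = rank("idaho wildfires", docs)
print([i for _, i in ranking])
```

The rarer query term ("idaho") contributes more to the score than the common one ("wildfires"), which is what pushes the third document to the top.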
Intelligent Information Retrieval
Meaning of words
Synonyms: buy / purchase
Ambiguity: bat (baseball vs. mammal)
Order of words in the query
hot dog stand in the amusement park
hot amusement stand in the dog park
User dependency for the data
direct feedback
indirect feedback
Authority of the source
IBM is more likely to be an authoritative source than my distant cousin
What is Information Extraction?
Given:
A source of textual documents
A well-defined, limited query (text based)
Find:
Sentences with relevant information
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output in a predetermined format
Information Extraction: Example
Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
Incident Date: 19 Apr 89
Incident Type: Bombing
Perpetrator Individual ID: urban guerillas
Human Target Name: Roberto Garcia Alvarado
...
What is Information Extraction?

(Figure: a documents source and several queries, e.g., Query 1 (job title) and Query 2 (salary), feed an Extraction System; the query results (ranked documents) are combined into Relevant Info 1, Relevant Info 2, Relevant Info 3.)
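As a rough illustration of extracting fields such as job title and salary into a predetermined format, the sketch below uses one regular expression per field. The patterns, the field names, and the sample advertisement are hypothetical; a real extraction system would need far more robust rules or a learned model.

```python
import re

# Hypothetical patterns, one per "query" (output field).
PATTERNS = {
    "job_title": re.compile(r"(?:hiring|seeking) an? ([A-Za-z ]+?)(?: with|,|\.)"),
    "salary": re.compile(r"\$\s?(\d[\d,]*)"),
}


def extract(text):
    """Fill a fixed template; fields that are not found stay None."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record


ad = "We are hiring a data analyst with SQL experience. Salary: $55,000 per year."
record = extract(ad)
print(record)
```

Everything in the text that matches no pattern is ignored, which mirrors the "ignore non-relevant information" requirement.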
Why Mine the Web?
Enormous wealth of textual information on the Web.
Book/CD/Video stores (e.g., Amazon)
Restaurant information (e.g., Zagats)
Car prices (e.g., Carpoint)
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by users
Possible to retrieve previously unknown information

People who ski also frequently break their legs.
Restaurants that serve seafood in California are likely to be outside San Francisco.
Mining the Web
(Figure: Web -> Spider -> Documents source; Documents source + Query -> IR / IE System -> Ranked Documents: 1. Doc1, 2. Doc2, 3. Doc3, ...)
Unique Features of the Web
The Web is a huge collection of documents where many contain:

Hyper-link information
Access and usage information
The Web is very dynamic
Web pages are constantly being generated (and removed)
Challenge: Develop new Web mining algorithms to . . .
Exploit hyper-links and access patterns.
Be adaptable to its documents source
Intelligent Web Search
Combine the intelligent IR tools
meaning of words
order of words in the query
user dependency for the data
authority of the source
With the unique web features
retrieve hyper-link information
utilize hyper-links as input
What is Clustering?
Given:
A source of textual documents
A similarity measure
e.g., how many words are common in these documents
Find:
Several clusters of documents that are relevant to each other
(Figure: Documents source + Similarity measure -> Clustering System -> clusters of documents)
Outline
Text characteristics: Outline
Large textual data base
High dimensionality
Several input modes
Dependency
Ambiguity
Noisy data
Not well structured text
Large textual data base
Efficiency consideration
over 2,000,000,000 web pages
almost all publications are also in electronic form
High dimensionality (Sparse input)
Consider each word/phrase as a dimension
Several input modes
e.g., Web mining: information about the user is generated from semantics, browsing patterns, and an outside knowledge base.
Dependency
Relevant information is a complex conjunction of words/phrases
e.g., document categorization, pronoun disambiguation
Ambiguity
Word ambiguity
Pronouns (he, she, ...)
Synonyms (buy, purchase)
Semantic ambiguity
The king saw the rabbit with his glasses. (8 meanings)
Noisy data Not well structured text
Example: Spelling mistakes
Chat rooms
"r u available?"
"Hey whazzzzzz up"
Speech
Outline
Text mining process
Text mining process
Text preprocessing
Syntactic/Semantic text analysis
Features Generation
Bag of words
Features Selection
Simple counting
Statistics
Text/Data Mining
Classification: supervised learning
Clustering: unsupervised learning
Analyzing results
Syntactic / Semantic text analysis

Part of speech (POS) tagging
Find the corresponding POS for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
~98% accurate
Word sense disambiguation
Context based or proximity based
Very accurate
Parsing
Generates a parse tree (graph) for each sentence
Each sentence is a stand-alone graph
Feature Generation: Bag of words
Text document is represented by the words it contains (and their occurrences)
e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
Highly efficient
Makes learning far simpler and easier
Order of words is not that important for certain applications
Stemming: identifies a word by its root
e.g., flying, flew -> fly
Reduces dimensionality
Stop words: The most common words are unlikely to help text mining
e.g., "the", "a", "an", "you"
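The three ideas above (bag of words, stemming, stop words) can be sketched in a few lines. The stop-word set is the one listed on the slide; the stemmer is a toy suffix stripper with a hand-written table for irregular forms, an assumption made for illustration rather than a real stemming algorithm such as Porter's.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "you"}
IRREGULAR = {"flew": "fly"}  # toy lookup for irregular forms (assumption)


def stem(word):
    """Very rough stemmer: irregular lookup, then strip one common suffix."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Map a document to word counts, ignoring order and stop words."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return Counter(stem(t) for t in tokens if t and t not in STOP_WORDS)


print(bag_of_words("Lord of the rings"))
print(bag_of_words("flying, flew fly"))
```

Word order is discarded entirely: "Lord of the rings" and "rings of the Lord" produce the same bag.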
Feature Generation: D2K Example

Original text:
Hi, Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now. 1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application. 2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.
After stop-word removal:
hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.
After stemming:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation subinterface call reversibletransformation
Feature Generation: XML

Current keyword-oriented search engines cannot handle rich queries like
"Find all books authored by Scooby-Doo."
XML: Extensible Markup Language
XML documents have a nested structure in which each element is associated with a tag.
Tags describe the semantics of elements.
<book>
  <title> The making of a bad movie </title>
  <author>
    <name> Scooby-Doo </name>
    <affiliation> Cartoons </affiliation>
  </author>
</book>
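Because the tags carry semantics, the query "find all books authored by Scooby-Doo" becomes a structural lookup rather than a keyword match. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<library>` wrapper element and the second book are assumptions added only so that the query has something to filter out.

```python
import xml.etree.ElementTree as ET

XML = """
<library>
  <book>
    <title>The making of a bad movie</title>
    <author><name>Scooby-Doo</name><affiliation>Cartoons</affiliation></author>
  </book>
  <book>
    <title>Another title</title>
    <author><name>Someone Else</name><affiliation>Unknown</affiliation></author>
  </book>
</library>
"""


def books_by(author, xml_text):
    """Return the titles of all <book> elements whose author/name matches."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title").strip()
        for book in root.iter("book")
        if (book.findtext("author/name") or "").strip() == author
    ]


titles = books_by("Scooby-Doo", XML)
print(titles)
```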
Feature selection
Reduce dimensionality
Learners have difficulty addressing tasks with high dimensionality
Irrelevant features
Not all features help!

e.g., the existence of a noun in a news article is unlikely to help classify it as politics or sport
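One simple counting approach to feature selection is to filter words by document frequency: drop words that occur in too few documents (too rare to generalize) or in almost all of them (no discriminating power). The thresholds `min_df` and `max_df_ratio` below are hypothetical parameters chosen for the toy data, not values from the talk.

```python
from collections import Counter


def select_features(documents, min_df=2, max_df_ratio=0.9):
    """Keep words found in at least min_df documents and in at most
    max_df_ratio of all documents."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))  # count each word once per document
    n = len(documents)
    return sorted(w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio)


docs = [
    "the game was horrific",
    "the football game",
    "the salesman earns money",
    "the football league",
]
features = select_features(docs)
print(features)
```

Here "the" is dropped because it appears in every document, and words seen only once are dropped as too rare.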
Feature selection: D2K Example I

Before selection:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation
After selection:
hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface
Feature selection: D2K Example II

Before selection:
hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation
After the first selection step:
hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface
After the second selection step:
hi week update unfortunate month action right due insistence member group ncsa d2k modules core datatype package complete independence application transformation handle different table previous add list interface call sub-interface
Text Mining: Classification definition
Given: a collection of labeled records (training set)
Each record contains a set of features (attributes) and the true class (label)
Find: a model for the class as a function of the values of the features
Goal: previously unseen records should be assigned a class as accurately as possible
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Text Mining: Clustering definition
Given: a set of documents and a similarity measure among documents
Find: clusters such that:
Documents in one cluster are more similar to one another
Documents in separate clusters are less similar to one another
Goal:
Finding a correct set of documents
Similarity Measures:
Euclidean distance if attributes are continuous
Other problem-specific measures
e.g., how many words are common in these documents
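Both kinds of measure mentioned above fit in a few lines. This is a minimal sketch: `euclidean_distance` works on numeric attribute vectors, and `common_words` counts the distinct words two documents share; the function names and sample inputs are illustrative.

```python
import math


def euclidean_distance(x, y):
    """Distance between two equal-length numeric vectors (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def common_words(doc_a, doc_b):
    """Number of distinct words shared by two documents (larger = more similar)."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


d = euclidean_distance([0.0, 3.0], [4.0, 0.0])
shared = common_words("stocks rose this morning", "stocks fell this afternoon")
print(d, shared)
```

Note the opposite orientation: a distance shrinks as documents get more alike, while a shared-word count grows.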
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Evaluation: What Is Good Classification?
Correct classification: The known label of a test sample is identical to the class predicted by the classification model
Accuracy ratio: the percentage of test set samples that are correctly classified by the model
A distance measure between classes can be used
e.g., classifying a football document as a basketball document is not as bad as classifying it as crime.
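The two evaluation ideas can be contrasted on a toy test set: plain accuracy treats every mistake alike, while a class distance penalizes far-off confusions more. The distance values in `CLASS_DISTANCE` are made-up numbers for illustration only.

```python
def accuracy(true_labels, predicted):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)


# Hypothetical distances between classes (symmetric; 0 for identical classes).
CLASS_DISTANCE = {
    ("football", "basketball"): 1,
    ("football", "crime"): 3,
    ("basketball", "crime"): 3,
}


def class_distance(a, b):
    if a == b:
        return 0
    return CLASS_DISTANCE.get((a, b), CLASS_DISTANCE.get((b, a)))


def mean_error_distance(true_labels, predicted):
    """Average class distance between known and predicted labels."""
    total = sum(class_distance(t, p) for t, p in zip(true_labels, predicted))
    return total / len(true_labels)


truth = ["football", "football", "crime", "basketball"]
predicted = ["football", "basketball", "crime", "crime"]
print(accuracy(truth, predicted), mean_error_distance(truth, predicted))
```

Both mistakes count equally against accuracy, but the basketball-as-crime error contributes three times as much to the mean error distance as the football-as-basketball one.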
Evaluation: What Is Good Clustering?
Good clustering method: produces high quality clusters with . . .
high intra-class similarity
low inter-class similarity
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Outline
Classification
Clustering
Classification: An Example
(Figure: a training set of 10 labeled records with attributes Country, Marital Status and Income and class Hooligan is used to learn a classifier (model); the model is then applied to a test set of records whose Hooligan value is unknown (?).)
Text Classification: An Example
Training Set
Ex# | Text | Hooligan
1 | An English football fan | Yes
2 | During a game in Italy | Yes
3 | England has been beating France | Yes
4 | Italian football fans were cheering | No
5 | An average USA salesman earns 75K | No
6 | The game in London was horrific | Yes
7 | Manchester city is likely to win the championship | Yes
8 | Rome is taking the lead in the football league | Yes
Test Set
Text | Hooligan
A Danish football fan | ?
Turkey is playing vs. France. The Turkish fans | ?
(Figure: Training Set -> Learn Classifier -> Model; the Model is applied to the Test Set)
Classification Techniques
Instance-Based Methods
Decision trees
Neural networks
Bayesian classification
Instance-based Methods
Instance-based (memory based) learning
Store training examples and delay the processing (lazy evaluation) until a new instance must be classified
k-nearest neighbor approach

Instances (Examples) are represented as points in a Euclidean space
Text Examples in Euclidean Space
The English football fan is a hooligan. . .
Similar to his English equivalent, the Italian football fan is a hooligan. . .
(Figure: the two sentences plotted as points in a space whose axes are word features such as "football" and "Italian".)
K-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance
The k-NN returns the most common value among the k nearest training examples
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
(Figure: positive (+) and negative (_) training examples scattered around a query point (?).)
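A minimal k-nearest-neighbor sketch under the assumptions stated above: instances are points in Euclidean space and the prediction is the most common label among the k closest training examples. The training points are invented, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(query, examples, k=3):
    """examples is a list of (point, label) pairs; returns the majority label
    among the k training points nearest to query."""
    nearest = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
    ((6.0, 6.0), "-"), ((7.0, 5.5), "-"), ((6.5, 7.0), "-"),
]
print(knn_classify((2.0, 2.0), training, k=3))
print(knn_classify((6.0, 5.0), training, k=1))
```

No model is built ahead of time; all the work happens at query time, which is the "lazy evaluation" of instance-based learning.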
Decision trees
Decision Tree: An Example
(Figure: the training set of 10 records (Country, Marital Status, Income, Hooligan) next to a decision tree whose splitting attributes are English (Yes / No), MarSt (Single, Divorced / Married) and Income (> 80K / < 80K), with leaves YES and NO.)
The splitting attribute at a node is determined by a specific attribute selection algorithm
Decision Tree: A Text Example
Ex# | Text | Hooligan
1 | An English football fan | Yes
2 | During a game in Italy | Yes
3 | England has been beating France | Yes
4 | Italian football fans were cheering | No
5 | An average USA salesman earns 75K | No
6 | The game in London was horrific | Yes
7 | Manchester city is likely to win the championship | Yes
8 | Rome is taking the lead in the football league | Yes
(Figure: a decision tree whose splitting attributes are English (Yes / No), MarSt (Single, Divorced / Married) and Income (> 80K / < 80K), with leaves YES and NO.)
The splitting attribute at a node is determined by a specific attribute selection algorithm
Classification by DT Induction
Decision tree

A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
Tree construction
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
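Classifying an unknown sample means walking from the root to a leaf, following the branch that matches each tested attribute. The sketch below shows only that walk, not tree construction or pruning. The tree is hypothetical: it reuses the attribute names from the example slides, but which leaf hangs under which branch is an assumption.

```python
# Hypothetical subtree: leaf placement is an assumption, not taken from the talk.
INCOME_NODE = {"attribute": "income_over_80k", "branches": {True: "NO", False: "YES"}}

TREE = {
    "attribute": "english",
    "branches": {
        True: "YES",
        False: {
            "attribute": "marital_status",
            "branches": {
                "Married": "NO",
                "Single": INCOME_NODE,
                "Divorced": INCOME_NODE,
            },
        },
    },
}


def classify(sample, node):
    """Internal nodes are dicts that test one attribute; leaves are class labels."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]  # test the attribute
        node = node["branches"][outcome]     # follow the matching branch
    return node


print(classify({"english": False, "marital_status": "Single", "income_over_80k": False}, TREE))
```

Only the attributes on the path actually taken are inspected, so a sample that reaches a leaf early does not need the remaining attributes.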
A Single Layer Perceptron

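(Figure: an input vector of word features (as, ball, cool, pain, Spain, foot, break, FC, soccer, football) is combined with a weights vector and compared against a threshold to produce the output Hooligan. Values shown on the slide: 0.9, -0.4, -0.8, 3, 0.7, 1, 1.2, -0.2, -2.)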
The n-dimensional input vector is used to classify by means of multiplication and a function mapping
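In code, "multiplication and a function mapping" amounts to a weighted sum followed by a threshold test. The vocabulary, weights, and threshold below are hypothetical values picked for illustration; they are not a faithful reading of the slide's diagram.

```python
# Hypothetical vocabulary, weights and threshold (not taken from the talk).
VOCAB = ["ball", "cool", "pain", "foot", "soccer", "football"]
WEIGHTS = [0.9, -0.4, -0.8, 0.7, 1.0, 1.2]
THRESHOLD = 2.0


def encode(text):
    """Binary input vector: 1 if the vocabulary word occurs in the text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in VOCAB]


def perceptron_predict(inputs, weights, threshold):
    """Fire (True) when the weighted sum of the inputs reaches the threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return activation >= threshold


print(perceptron_predict(encode("soccer football ball"), WEIGHTS, THRESHOLD))
print(perceptron_predict(encode("cool pain"), WEIGHTS, THRESHOLD))
```

Training, which adjusts the weights from labeled examples, is left out of this sketch.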
Perceptron: D2K Example
hi week update unfortunate month much action right due insistence member group ncsa d2k modules core datatype package complete d2k application handle different table previous module add list keep subinterface call
NO SPAM
One vs. All Classifier
(Figure: three target nodes (Class 1, Class 2, Class 3) connected by weighted links (0.9, -0.4, -0.8, 0, 0.7, 1.2, -0.2, -2) to the input features as, ball, cool, FC, soccer, football.)
One vs. All Classifier
Network of threshold gates
Target nodes represent class labels
Input nodes represent the relations (features) in the example
(On the order of 10^5 input features for many target nodes.)
An example is positive to one network and negative to the others (depends on the algorithm)
Allocation of nodes (features) and links is data-driven (a link between feature i and target j is created only when i was active with target j)
A Multi-layer Perceptron
(Figure: an input vector (as, ball, cool, FC, soccer, football) feeds hidden nodes through a first weights vector; the hidden nodes feed the output node Hooligan through a second weights vector and a threshold.)
Training using back-propagation
Neural Networks
Advantages
Prediction accuracy is generally high
Robust, works when training examples contain errors
Fast evaluation of the learned target function
Easy to compute using parallel processors
Disadvantages
Long training time
Difficult to understand the learned function (weights)
Difficult to incorporate domain knowledge
Bayesian Classification
The classification problem may be formalized using probabilities:

P(C|X) = prob. that the example is of class C
e.g. P(Hooligan | English, fan, married)
Idea: assign to example X the class label C such that P(C|X) is maximal
Bayesian Classification: Why?
Probabilistic learning: Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct
Prior knowledge: can be combined with observed data
Standard:
Provides a standard of optimal decision making against which other methods can be measured
In a simpler form, provides a baseline against which other methods can be measured
Estimating Probabilities
Bayes' theorem:
P(C|X) = P(X|C)P(C) / P(X)
P(X) is constant for all classes
Therefore estimate P(C|X) such that:
P(C|X) ∝ P(X|C)P(C)
P(C) = relative frequency of class C samples
Problem: computing P(X|C) is infeasible!
X is likely to be an example we have never seen before
Naïve Bayesian Classification
Naïve assumption: feature independence
P(x1,…,xk|C) = P(x1|C) · … · P(xk|C)
P(xi|C) is estimated as the relative frequency of examples having value xi as feature in class C
Computationally easy!
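A small sketch of the estimates above: P(C) and P(xi|C) are plain relative frequencies, and the predicted class maximizes P(C) multiplied by the product of the P(xi|C). Two simplifications are assumptions of this sketch: only features present in the example are multiplied in, and no smoothing is applied, so a feature never seen with a class drives that class's score to zero. The training examples are invented.

```python
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (set_of_features, label). Returns the raw counts."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    for features, label in examples:
        feature_counts[label].update(features)
    return class_counts, feature_counts


def classify(features, class_counts, feature_counts):
    """Pick the class C maximizing P(C) * product of P(x_i | C)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                           # P(C)
        for f in features:
            score *= feature_counts[label][f] / n   # P(x_i | C), relative frequency
        if score > best_score:
            best_label, best_score = label, score
    return best_label


examples = [
    ({"english", "fan"}, "yes"),
    ({"english", "married"}, "yes"),
    ({"italian", "fan"}, "no"),
    ({"married", "salesman"}, "no"),
]
class_counts, feature_counts = train(examples)
print(classify({"english", "fan"}, class_counts, feature_counts))
```

For {"english", "fan"} the "yes" score is 0.5 x 1.0 x 0.5 = 0.25, while the "no" score is zero because "english" never occurs with "no".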
The Independence Hypothesis
makes computation possible
yields optimal classifiers when satisfied
but is seldom satisfied in practice, as attributes (variables) are often correlated
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between features
Clustering Techniques
Partitioning Methods
Hierarchical Methods
Partitioning Algorithms
Partitioning method: Construct a partition of n documents into a set of k clusters
Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster
The K-means Clustering Method
The k-means algorithm is implemented in 4 steps:
1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
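The four steps translate almost line by line into code. This sketch makes a few assumptions of its own: the initial partition is a deterministic round-robin assignment (object i goes to cluster i mod k), an emptied cluster keeps its previous centroid, and `max_iter` caps the loop. It uses `math.dist`, available from Python 3.8.

```python
import math


def k_means(points, k, max_iter=100):
    # Step 1: partition objects into k nonempty subsets (round-robin, an assumption).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changed, otherwise repeat from step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
assignment, centroids = k_means(points, k=2)
print(assignment)
```

The result depends on the initial partition; k-means is a heuristic and can settle in a local optimum, which is why it is contrasted with exhaustive enumeration.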
The K-means Clustering: Example

(Figure: four scatter plots, each on axes from 0 to 10, illustrating the k-means steps.)
Clustering Techniques
Partitioning Methods
Hierarchical Methods
Hierarchical Clustering
Agglomerative:
Start with each document being a single cluster.
Eventually all documents belong to the same cluster.
Divisive:
Start with all documents belonging to the same cluster.
Eventually each node forms a cluster on its own.
Does not require the number of clusters k in advance
Needs a termination condition
The final state in both agglomerative and divisive clustering is of no use.
Hierarchical Clustering: Example

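(Figure: objects a, b, c, d, e. Agglomerative clustering, Step 0 to Step 4, merges a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde. Divisive clustering, Step 0 to Step 4, runs in the opposite direction, from abcde down to the single objects.)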
A Dendrogram: Hierarchical Clustering

Dendrogram: Decomposes data objects into several levels of nested partitioning (tree of clusters). Clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
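A sketch of agglomerative clustering that records its merge history, which is the information a dendrogram displays. Single-link distance (the closest pair of members) is an assumption of this sketch; the talk does not name a linkage criterion. The one-dimensional positions for a to e are invented.

```python
def agglomerative(items, distance):
    """Repeatedly merge the two closest clusters (single link) until one
    cluster remains; return the merged clusters in order of creation."""
    clusters = [(item,) for item in items]  # start: every item is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges


# Hypothetical one-dimensional positions for five objects.
positions = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 8.5}
merges = agglomerative("abcde", lambda x, y: abs(positions[x] - positions[y]))
print(merges)
```

Stopping the loop early, for example once a chosen number of clusters remains, corresponds to cutting the dendrogram at that level.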
Demo
Summary
Text is tricky to process, but acceptable results are easily achieved

There exist several text mining systems
e.g., D2K - Data to Knowledge
http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
Additional Intelligence can be integrated with text mining
One may play with any phase of the text mining process
Summary
There are many other scientific and statistical text mining methods developed but not covered in this talk.
http://www.cs.utexas.edu/users/pebronia/text-mining/
http://filebox.vt.edu/users/wfan/text_mining.html
Also, it is important to study the theoretical foundations of data mining.
Data Mining: Concepts and Techniques / J. Han & M. Kamber
Machine Learning / T. Mitchell
