
Introduction to Text Mining

Supercomputing 2002

Yair Even-Zohar
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
evenzoha@ncsa.uiuc.edu

Problem: Document Categorization

BOISE, Idaho (CNN) -- Cooler weather Monday in the U.S. Pacific Northwest may help more than 27,000 firefighters in their marathon battle against dozens of wildfires. Ten new large fires were reported overnight into Monday, bringing to 40 the number of large fires ablaze, said officials at the National Interagency Fire Center. They said between 300,000 and 400,000 acres are aflame. The good news is that five large fires were contained Sunday. So far, about 2.8 million acres of forest have burned this year, making 2001 an average year for fires. Center officials said 241 fires were reported Sunday into Monday but 96 percent were contained or extinguished in "initial attacks" by firefighters. A large fire is defined as a fire burning uncontained and extending over 100 acres or more.

Politics Economic xxxxxxx USA World xxx

alg | Automated Learning Group

Problem: Document Clustering


The long list of demands that President Bush laid out in a purposeful speech set a very tough standard for avoiding war.

Defiant in the face of criticism, Prime Minister Ariel Sharon today praised an attack that killed 13 Palestinians.

Scrambling to save their proposed merger, EchoStar Communications and Hughes Electronics urged the F.C.C. to delay a decision on the $26 billion deal.

Stocks rose this morning after a speech by President Bush helped rein in rampant war fears and lured skittish investors back into the market.

Signs emerged that the S.E.C. is deeply fractured over the search for candidates to fill the new board that will oversee the accounting profession.


Text Mining Definition


Many definitions in the literature
The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.
An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.


Text Mining Definition

What is previously unknown information?

Strict definition
Information that not even the writer knows.
e.g., discovering a new method for hair growth that is described as a side effect of a different procedure

Lenient definition
Rediscovering the information that the author encoded in the text.
e.g., automatically extracting a product's name from a web page.


Outline

Text mining applications

Text characteristics

Text mining process

Learning methods


Text Mining Applications

Marketing: Discover distinct groups of potential buyers according to a text-based user profile

e.g., Amazon

Industry: Identifying groups of competitors' web pages

e.g., competing products and their prices

Job seeking: Identify parameters in searching for jobs

e.g., www.flipdog.com


Text Mining Methods

Information Retrieval

Indexing and retrieval of textual documents

Information Extraction

Extraction of partial knowledge in the text

Web Mining

Indexing and retrieval of textual documents and extraction of partial knowledge using the web

Clustering

Generating collections of similar text documents


Information Retrieval
Given:

A source of textual documents
A user query (text based), e.g., Spam / Text

Find:

A ranked set of documents that are relevant to the query

[Diagram: Documents source and Query feed the IR System, which outputs Ranked Documents]


Intelligent Information Retrieval

Meaning of words

Synonyms: buy / purchase
Ambiguity: bat (baseball vs. mammal)

Order of words in the query

hot dog stand in the amusement park
hot amusement stand in the dog park

User dependency for the data

direct feedback
indirect feedback

Authority of the source

IBM is more likely to be an authoritative source than my distant cousin


What is Information Extraction?

Given:

A source of textual documents
A well defined limited query (text based)

Find:

Sentences with relevant information
Extract the relevant information and ignore non-relevant information (important!)
Link related information and output in a predetermined format


Information Extraction: Example

Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. According to the police and Garcia Alvarado's driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.

Incident Date: 19 Apr 89
Incident Type: Bombing
Perpetrator Individual ID: urban guerillas
Human Target Name: Roberto Garcia Alvarado

...

What is Information Extraction?


[Diagram: Documents source, Query 1 (e.g., job title), and Query 2 (e.g., salary) feed the Extraction System; Combine Query Results then yields Relevant Info 1, Relevant Info 2, Relevant Info 3, and Ranked Documents]

Why Mine the Web?

Enormous wealth of textual information on the Web.

Book/CD/Video stores (e.g., Amazon)
Restaurant information (e.g., Zagats)
Car prices (e.g., Carpoint)

Lots of data on user access patterns

Web logs contain sequences of URLs accessed by users

Possible to retrieve previously unknown information

People who ski also frequently break their leg.
Restaurants that serve seafood in California are likely to be outside San Francisco


Mining the Web

[Diagram: a Spider crawls the Web to build the Documents source; the Documents source and a Query feed the IR / IE System, which outputs Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]

Unique Features of the Web

The Web is a huge collection of documents, many of which contain:

Hyper-link information
Access and usage information

The Web is very dynamic

Web pages are constantly being generated (and removed)

Challenge: Develop new Web mining algorithms to . . .

Exploit hyper-links and access patterns.
Be adaptable to the documents source


Intelligent Web Search

Combine the intelligent IR tools


meaning of words
order of words in the query
user dependency for the data
authority of the source

With the unique web features

retrieve Hyper-link information
utilize Hyper-link as input


What is Clustering?

Given:

A source of textual documents
Similarity measure
e.g., how many words are common in these documents

Find:

Several clusters of documents that are relevant to each other

[Diagram: Documents source and Similarity measure feed the Clustering System, which outputs groups of Docs]



Text characteristics: Outline

Large textual data base

High dimensionality

Several input modes

Dependency

Ambiguity

Noisy data

Not well structured text


Text characteristics

Large textual data base

Efficiency consideration
over 2,000,000,000 web pages
almost all publications are also in electronic form

High dimensionality (Sparse input)

Consider each word/phrase as a dimension

Several input modes

e.g., Web mining: information about the user is generated by semantics, browsing patterns, and an outside knowledge base.


Text characteristics

Dependency

Relevant information is a complex conjunction of words/phrases
e.g., document categorization, pronoun disambiguation.

Ambiguity

Word ambiguity
Pronouns (he, she)
Synonyms (buy, purchase)

Semantic ambiguity
The king saw the rabbit with his glasses. (8 meanings)


Text characteristics

Noisy data

Example: Spelling mistakes

Not well structured text

Chat rooms
"r u available ?"
"Hey whazzzzzz up"

Speech


Text mining process

Text preprocessing
Syntactic/Semantic text analysis

Features Generation
Bag of words

Features Selection
Simple counting
Statistics

Text/Data Mining
Classification: Supervised learning
Clustering: Unsupervised learning

Analyzing results


Syntactic / Semantic text analysis


Part Of Speech (POS) tagging

Find the corresponding POS for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
~98% accurate.

Word sense disambiguation

Context based or proximity based
Very accurate

Parsing

Generates a parse tree (graph) for each sentence
Each sentence is a stand alone graph


Feature Generation: Bag of words

Text document is represented by the words it contains (and their occurrences)

e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}
Highly efficient
Makes learning far simpler and easier
Order of words is not that important for certain applications

Stemming: identifies a word by its root

e.g., flying, flew -> fly
Reduce dimensionality

Stop words: The most common words are unlikely to help text mining

e.g., "the", "a", "an", "you"
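A minimal sketch of the bag-of-words representation with stop-word removal, in Python. The stop-word set is a hypothetical short list ("of" is added to the slide's examples), and stemming is left out because it needs a stemmer that the standard library does not provide.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real systems use much longer lists.
STOP_WORDS = {"the", "a", "an", "you", "of"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Count word occurrences, ignoring case, punctuation, and stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

print(bag_of_words("Lord of the Rings"))
```

Word order is discarded: "Lord of the Rings" and "Rings of the Lord" produce the same bag.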


Feature Generation: D2K Example


Original message:

Hi, Here is your weekly update (that unfortunately hasn't gone out in about a month). Not much action here right now. 1) Due to the unwavering insistence of a member of the group, the ncsa.d2k.modules.core.datatype package is now completely independent of the d2k application. 2) Transformations are now handled differently in Tables. Previously, transformations were done using a TransformationModule. That module could then be added to a list that an ExampleTable kept. Now, there is an interface called Transformation and a sub-interface called ReversibleTransformation.

After stop-word removal:

hi, weekly update (that unfortunately gone out month). much action here right now. 1) due unwavering insistence member group, ncsa.d2k.modules.core.datatype package now completely independent d2k application. 2) transformations now handled differently tables. previously, transformations done using transformationmodule. module added list exampletable kept. now, interface called transformation sub-interface called reversibletransformation.

After stemming:

hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules core datatype package now complete independence d2k application 2 transformation now handle different table previous transformation do use transformationmodule module add list exampletable keep now interface call transformation subinterface call reversibletransformation


Feature Generation: XML


Current keyword-oriented search engines cannot handle rich queries like

Find all books authored by Scooby-Doo.

XML: Extensible Markup Language


XML documents have a nested structure in which each element is associated with a tag.
Tags describe the semantics of elements.

<book>
  <title> The making of a bad movie </title>
  <author>
    <name> Scooby-Doo </name>
    <affiliation> Cartoons </affiliation>
  </author>
</book>
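A short sketch of how tagged structure supports the "books authored by Scooby-Doo" query, using Python's standard `xml.etree.ElementTree`. The `<library>` wrapper and the second book are hypothetical additions so that the query has something to filter out.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: the slide's book plus one invented entry.
SAMPLE = """
<library>
  <book>
    <title> The making of a bad movie </title>
    <author>
      <name> Scooby-Doo </name>
      <affiliation> Cartoons </affiliation>
    </author>
  </book>
  <book>
    <title> Another title </title>
    <author>
      <name> Someone Else </name>
    </author>
  </book>
</library>
"""

def books_by_author(xml_text, author_name):
    """Return the titles of all <book> elements with a matching <author>/<name>."""
    root = ET.fromstring(xml_text)
    titles = []
    for book in root.iter("book"):
        names = [n.text.strip() for n in book.findall("./author/name") if n.text]
        if author_name in names:
            titles.append((book.findtext("title") or "").strip())
    return titles

print(books_by_author(SAMPLE, "Scooby-Doo"))
```

A keyword search for "Scooby-Doo" would also match pages that merely mention the name; the tag path restricts the match to the author field.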

Feature selection

Reduce dimensionality

Learners have difficulty addressing tasks with high dimensionality

Irrelevant features

Not all features help!


e.g., the existence of a noun in a news article is unlikely to help classify it as politics or sport


Feature selection: D2K Example I


Before feature selection:

hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

After feature selection:

hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface


Feature selection: D2K Example II


Before feature selection:

hi week update unfortunate go out month much action here right now 1 due unwaver insistence member group ncsa d2k modules do core datatype package complete independence application 2 transformation handle different table previous use transformationmodule add list exampletable keep interface call sub-interface reversibletransformation

After a first selection pass:

hi week update unfortunate go out month much action here right now due insistence member group ncsa d2k modules do core datatype package complete independence application transformation handle different table previous use add list keep interface call sub-interface

After a second selection pass:

hi week update unfortunate month action right due insistence member group ncsa d2k modules core datatype package complete independence application transformation handle different table previous add list interface call sub-interface

Text Mining: Classification definition

Given: a collection of labeled records (training set)

Each record contains a set of features (attributes), and the true class (label)

Find: a model for the class as a function of the values of the features

Goal: previously unseen records should be assigned a class as accurately as possible

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.


Text Mining: Clustering definition

Given: a set of documents and a similarity measure among documents

Find: clusters such that:

Documents in one cluster are more similar to one another
Documents in separate clusters are less similar to one another

Goal:

Finding a correct set of documents

Similarity Measures:

Euclidean Distance if attributes are continuous
Other problem-specific measures, e.g., how many words are common in these documents
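The two similarity measures named here can be sketched in a few lines of Python. Splitting on whitespace and lowercasing are simplifying assumptions; the function names are illustrative only.

```python
import math

def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words shared by two documents."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

def euclidean_distance(x, y):
    """Distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(common_words("The game in London", "a game in Italy"))
print(euclidean_distance([0, 0], [3, 4]))
```

Note the opposite directions: a larger word overlap means more similar, while a larger Euclidean distance means less similar.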


Supervised vs. Unsupervised Learning

Supervised learning (classification)


Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Evaluation: What Is Good Classification?

Correct classification: The known label of a test sample is identical to the class predicted by the classification model

Accuracy ratio: the percentage of test set samples that are correctly classified by the model

A distance measure between classes can be used

e.g., classifying a football document as a basketball document is not as bad as classifying it as crime.
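As an illustration, both evaluation ideas can be computed as follows. The class distances are hypothetical values chosen only to make football and basketball closer to each other than to crime.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def mean_distance_cost(true_labels, predicted_labels, class_distance):
    """Average class distance between known and predicted labels (0 when correct)."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += class_distance[frozenset((t, p))]
    return total / len(true_labels)

# Hypothetical distances between classes (assumption, not from the talk).
CLASS_DISTANCE = {
    frozenset(("football", "basketball")): 0.2,
    frozenset(("football", "crime")): 1.0,
    frozenset(("basketball", "crime")): 1.0,
}

truth = ["football", "crime", "football", "basketball"]
predicted = ["football", "crime", "basketball", "basketball"]
print(accuracy(truth, predicted))
print(mean_distance_cost(truth, predicted, CLASS_DISTANCE))
```

Two classifiers with the same accuracy can differ in distance cost: mistaking football for crime is charged five times more here than mistaking it for basketball.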


Evaluation: What Is Good Clustering?

Good clustering method: produces high quality clusters with . . .

high intra-class similarity
low inter-class similarity

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns


Outline

Text mining applications

Text characteristics

Text mining process

Learning methods

Classification
Clustering


Classification: An Example

[Table: a Training Set of 10 labeled examples with columns Ex#, Country, Marital Status, Income, and Hooligan (Yes/No), and a Test Set with the same columns whose Hooligan values are unknown (?)]

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set


Text Classification: An Example

Training Set

Ex#  Text                                               Hooligan
1    An English football fan                            Yes
2    During a game in Italy                             Yes
3    England has been beating France                    Yes
4    Italian football fans were cheering                No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Test Set

Text                                               Hooligan
A Danish football fan                              ?
Turkey is playing vs. France. The Turkish fans     ?

Training Set -> Learn Classifier -> Model

Classification Techniques

Instance-Based Methods

Decision trees

Neural networks

Bayesian classification


Instance-based Methods

Instance-based (memory based) learning

Store training examples and delay the processing (lazy evaluation) until a new instance must be classified

k-nearest neighbor approach


Instances (Examples) are represented as points in a Euclidean space


Text Examples in Euclidean Space

The English football fan is a hooligan. . .

Similar to his English equivalent, the Italian football fan is a hooligan. . .

[Figure: the two example texts plotted as points in a space whose axes are the words "football" and "Italian"]


K-Nearest Neighbor Algorithm

All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance
The k-NN returns the most common value among the k nearest training examples
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: a query point "?" surrounded by training examples labeled "+" and "_"]
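A minimal k-NN sketch in Python, assuming each example is a point of word counts (here two hypothetical axes, "football" and "Italian") with a "+" or "-" label. The training points are invented for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Return the most common label among the k training points nearest to query."""
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: (point, label).
EXAMPLES = [
    ((2, 0), "+"), ((3, 1), "+"), ((1, 0), "+"),
    ((0, 2), "-"), ((1, 3), "-"), ((0, 3), "-"),
]

print(knn_classify((2, 1), EXAMPLES, k=3))
print(knn_classify((0, 2.5), EXAMPLES, k=3))
```

No model is built in advance: all the work happens at query time, which is the "lazy evaluation" property of instance-based methods.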


Decision Tree: An Example

[Table: the Training Set of 10 examples with columns Ex#, Country, Marital Status, Income, and Hooligan]

Splitting Attributes

English?
  Yes: YES
  No: MarSt?
    Single, Divorced: Income?
      > 80K: NO
      < 80K: YES
    Married: NO

The splitting attribute at a node is determined based on a specific Attribute selection algorithm


Decision Tree: A Text Example

Ex#  Text                                               Hooligan
1    An English football fan                            Yes
2    During a game in Italy                             Yes
3    England has been beating France                    Yes
4    Italian football fans were cheering                No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Splitting Attributes

English?
  Yes: YES
  No: MarSt?
    Single, Divorced: Income?
      > 80K: NO
      < 80K: YES
    Married: NO

The splitting attribute at a node is determined based on a specific Attribute selection algorithm


Classification by DT Induction

Decision tree

A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution

Decision tree generation consists of two phases:

Tree construction
Tree pruning: identify and remove branches that reflect noise or outliers

Use of decision tree: Classifying an unknown sample

Test the attribute values of the sample against the decision tree
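Classifying a sample with a finished tree amounts to following one path of tests from root to leaf. The sketch below hard-codes the example tree from the earlier slides (English, then marital status, then income); the function name and argument format are hypothetical, and treating an income of exactly 80K as "Yes" is an assumption, since the tree only labels the > 80K and < 80K branches.

```python
def classify_hooligan(english, marital_status, income_k):
    """Walk the example decision tree; income_k is income in thousands."""
    if english:                      # root test: English?
        return "Yes"
    if marital_status == "Married":  # second test: marital status
        return "No"
    # Single or Divorced: final test on income.
    # Exactly 80K is not covered by the tree; it falls to "Yes" here (assumption).
    return "No" if income_k > 80 else "Yes"

print(classify_hooligan(True, "Single", 125))
print(classify_hooligan(False, "Divorced", 95))
print(classify_hooligan(False, "Single", 50))
```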


A Single Layer Perceptron


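[Figure: single layer perceptron. The input vector has one entry per word (as, ball, cool, pain, Spain, foot, break, FC, soccer, football); a weights vector (values shown in the figure: 0.9, -0.4, -0.8, 3, 0.7, 1, 1.2, -0.2, -2) connects the inputs to a threshold unit whose output is Hooligan]

The n-dimensional input vector is used to classify by means of multiplication and a function mapping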
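A small sketch of the classification step only (no training): multiply the input vector by the weights, sum, and map the sum through a threshold function. The vocabulary, weights, and threshold below are hypothetical and do not reproduce the figure.

```python
# Hypothetical vocabulary, weights, and threshold, for illustration only.
VOCABULARY = ["ball", "pain", "foot", "break"]
WEIGHTS = [0.9, 1.2, 0.7, -0.4]
THRESHOLD = 2.0

def to_input_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [words.count(v) for v in vocabulary]

def perceptron_predict(weights, inputs, threshold):
    """Fire (True) when the weighted sum of the inputs reaches the threshold."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return activation >= threshold

print(perceptron_predict(WEIGHTS, to_input_vector("foot ball pain"), THRESHOLD))
print(perceptron_predict(WEIGHTS, to_input_vector("break"), THRESHOLD))
```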


Perceptron: D2K Example

hi week update unfortunate month much action right due insistence member group ncsa d2k modules core datatype package complete d2k application handle different table previous module add list keep subinterface call

NO SPAM


One vs. All Classifier

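[Figure: one vs. all network with three target nodes (Class 1, Class 2, Class 3) linked by weights (0.9, -0.4, -0.8, 0, 0.7, 1.2, -0.2, -2) to word input nodes (as, ball, cool, pain, Spain, foot, break, FC, soccer, football)]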


One vs. All Classifier

Network of threshold gates
Target nodes represent class labels
Input nodes represent the relations (features) in the example
(Order of 10^5 input features for many target nodes.)

An example is positive to one network and negative to the others (depends on the algorithm)

Allocations of nodes (features) and links are data-driven (a link between feature i and target j is created only when i was active with target j)
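A rough sketch of one-vs-all prediction with sparse, data-driven links: each class stores weights only for the features it has been linked to, and missing links count as zero. Picking the highest-scoring class is a simplifying assumption in place of per-class threshold gates, and all class names and weights are hypothetical.

```python
def one_vs_all_predict(class_weights, active_features):
    """Score every class on the active features and return the best-scoring label."""
    scores = {
        label: sum(weights.get(f, 0.0) for f in active_features)
        for label, weights in class_weights.items()
    }
    return max(scores, key=scores.get)

# Hypothetical sparse weights: a feature absent from a class has no link to it.
CLASS_WEIGHTS = {
    "sports": {"ball": 0.9, "soccer": 1.2},
    "health": {"pain": 1.0, "break": 0.7},
    "travel": {"Spain": 1.1},
}

print(one_vs_all_predict(CLASS_WEIGHTS, ["ball", "soccer", "Spain"]))
print(one_vs_all_predict(CLASS_WEIGHTS, ["pain", "break"]))
```

Storing only the existing links keeps each target node small even when the full feature space is very large.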


A Multi-layer Perceptron
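[Figure: multi-layer perceptron. The input vector (as, ball, cool, pain, Spain, foot, break, FC, soccer, football) is connected by a weights vector (0.9, -0.4, -0.8, 0, 0.7, 1.2, -0.2, -2) to hidden nodes, which are connected by a second weights vector (-0.9, 1.2) to a threshold unit (2.5) whose output is Hooligan]

Training using back-propagation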


Neural Networks

Advantages

Prediction accuracy is generally high
Robust, works when training examples contain errors
Fast evaluation of the learned target function
Easy to compute using parallel processors

Disadvantages

Long training time
Difficult to understand the learned function (weights)
Difficult to incorporate domain knowledge


Bayesian Classification

The classification problem may be formalized using probabilities:


P(C|X) = prob. that the example is of class C

e.g. P(Hooligan | English, fan, married)

Idea: assign to example X the class label C such that P(C|X) is maximal


Bayesian Classification: Why?

Probabilistic learning: Calculating explicit probabilities for hypotheses is among the most practical approaches to certain types of learning problems

Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct

Prior knowledge: can be combined with observed data

Standard:

Provides a standard of optimal decision making against which other methods can be measured
In a simpler form, provides a baseline against which other methods can be measured


Estimating Probabilities

Bayes theorem:
P(C|X) = P(X|C)P(C) / P(X)

P(X) is constant for all classes
Therefore choose the class C that maximizes:
P(C|X) ∝ P(X|C)P(C)

P(C) = relative freq of class C samples

Problem: computing P(X|C) is infeasible!

X is likely to be an example we have never seen before


Naïve Bayesian Classification

Naïve assumption: feature independence

P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
P(xi|C) is estimated as the relative frequency of examples having value xi as feature in class C
Computationally easy!
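A minimal sketch of this estimate in Python: P(C) and each P(xi|C) are relative frequencies counted from the training examples, and the predicted class maximizes their product. The training records are hypothetical, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, label). Returns the raw counts."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)  # label -> Counter of (position, value)
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[label][(i, value)] += 1
    return class_counts, feature_counts

def predict(features, class_counts, feature_counts):
    """Return the class C maximizing P(C) * product of P(xi|C)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                                    # P(C)
        for i, value in enumerate(features):
            score *= feature_counts[label][(i, value)] / n   # P(xi|C)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training records: (country, marital status) -> Hooligan.
TRAINING = [
    (("England", "Single"), "Yes"), (("England", "Married"), "Yes"),
    (("England", "Single"), "Yes"), (("Italy", "Married"), "No"),
    (("USA", "Married"), "No"), (("Italy", "Single"), "No"),
]
counts = train_naive_bayes(TRAINING)
print(predict(("England", "Married"), *counts))
print(predict(("Italy", "Single"), *counts))
```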


The Independence Hypothesis

makes computation possible
yields optimal classifiers when satisfied
but is seldom satisfied in practice, as attributes (variables) are often correlated

Attempts to overcome this limitation:

Bayesian networks, which combine Bayesian reasoning with causal relationships between features


Clustering Techniques

Partitioning Methods

Hierarchical Methods


Partitioning Algorithms

Partitioning method: Construct a partition of n documents into a set of k clusters

Given: a set of documents and the number k
Find: a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means: Each cluster is represented by the center of the cluster


The K-means Clustering Method

k-means algorithm is implemented in 4 steps:


1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
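The four steps can be sketched in Python as follows. The round-robin initial partition, the `max_iter` cap, and the handling of clusters that become empty are assumptions added to make the sketch runnable; the sample points are hypothetical.

```python
import math

def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    """Return a cluster index (0..k-1) for each point."""
    # Step 1: initial partition into k nonempty subsets.
    # Round-robin assignment is an assumption; it needs len(points) >= k.
    assignment = [i % k for i in range(len(points))]
    for _ in range(max_iter):
        # Step 2: centroid of each cluster in the current partition.
        centers = {}
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # a cluster can become empty; it is skipped if so
                centers[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(centers, key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment

POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(k_means(POINTS, 2))
```

Because k-means is a heuristic, a different initial partition can converge to a different (possibly worse) result.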


The K-means Clustering: Example


[Figure: four scatter plots, each with both axes running from 0 to 10]


Hierarchical Clustering

Agglomerative:

Start with each document being a single cluster.
Eventually all documents belong to the same cluster.

Divisive:

Start with all documents belonging to the same cluster.
Eventually each node forms a cluster on its own.

Does not require the number of clusters k in advance
Needs a termination condition

The final state in both agglomerative and divisive clustering is of no use.
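A sketch of the agglomerative direction in Python. Two choices here are assumptions: cluster distance is single-link (distance between the closest pair of members), and the termination condition is a target number of clusters.

```python
import math

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until only num_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance: closest pair of members (assumption).
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical one-dimensional points.
POINTS_1D = [(0,), (1,), (5,), (6,), (20,)]
print(agglomerative(POINTS_1D, 3))
print(agglomerative(POINTS_1D, 2))
```

Letting the loop run down to a single cluster reproduces the uninformative final state mentioned above, which is why a termination condition is needed.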


Hierarchical Clustering: Example


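[Figure: agglomerative clustering runs from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; divisive clustering runs the same steps in the opposite direction, from abcde (Step 0) back to the single objects (Step 4)]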


A Dendrogram: Hierarchical Clustering


Dendrogram: Decomposes data objects into several levels of nested partitioning (tree of clusters). A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.


Demo


Summary

Text is tricky to process, but acceptable results are easily achieved


There exist several text mining systems

e.g., D2K - Data to Knowledge

http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/

Additional Intelligence can be integrated with text mining

One may play with any phase of the text mining process


Summary

There are many other scientific and statistical text mining methods developed but not covered in this talk.

http://www.cs.utexas.edu/users/pebronia/text-mining/
http://filebox.vt.edu/users/wfan/text_mining.html

Also, it is important to study theoretical foundations of data mining.

Data Mining: Concepts and Techniques / J. Han & M. Kamber
Machine Learning / T. Mitchell
