Tan,Steinbach, Kumar
4/18/2004
What is Data?
An attribute is a property or
characteristic of an object
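Attributes are the columns; objects are the rows.

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes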
Data
Definition 1: An attribute is a property or
characteristic of an object that may vary; either
from one object to another or from one time to
another.
Definition 2: A measurement scale is a rule
(function) that associates a numerical or symbolic
value with an attribute of an object.
The process of Measurement is the application of a
measurement scale to associate a value with a
particular attribute of a specific object.
Attribute Values
Measurement of Length
The way you measure an attribute may not match the attribute's properties.
Types of Attributes
Nominal
Ordinal
Interval
Ratio
Distinctness: = ≠
Order: < >
Addition: + -
Multiplication: * /
Types of Attributes
Nominal and ordinal attributes are collectively
referred to as Categorical or Qualitative attributes.
As the name suggests, qualitative attributes, such as
employee ID, lack most of the properties of
numbers.
The remaining two types of attributes, Interval and
Ratio, are collectively referred to as Quantitative or
Numeric Attributes.
Quantitative attributes are represented by numbers
and have most of the properties of numbers.
Note: Quantitative attributes can be integer-valued or
continuous.
Attribute Type | Description | Examples | Operations
Nominal | | | mode, entropy, contingency correlation, χ² test
Ordinal | | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal | |
Ordinal | | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio | new_value = a * old_value |
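The interval and ratio transformations above can be illustrated with a small Python sketch. The function names and the sample temperatures and lengths are illustrative assumptions, not part of the slides. It suggests that a transformation of the form a * old_value + b keeps ratios of differences but not ratios of values, while a * old_value keeps ratios of values.

```python
def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new_value = a * old_value + b
    return 1.8 * c + 32.0


def meters_to_feet(m):
    # Ratio-scale transformation: new_value = a * old_value
    return 3.28084 * m


# Hypothetical temperatures in Celsius and their Fahrenheit equivalents.
c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = (celsius_to_fahrenheit(c) for c in (c1, c2, c3))

# Ratios of values are not preserved on an interval scale.
ratio_celsius = c2 / c1
ratio_fahrenheit = f2 / f1

# Ratios of differences are preserved on an interval scale.
diff_ratio_celsius = (c3 - c2) / (c2 - c1)
diff_ratio_fahrenheit = (f3 - f2) / (f2 - f1)

# On a ratio scale, ratios of values survive a change of unit.
ratio_meters = 6.0 / 3.0
ratio_feet = meters_to_feet(6.0) / meters_to_feet(3.0)

print(ratio_celsius, ratio_fahrenheit)
print(diff_ratio_celsius, diff_ratio_fahrenheit)
print(ratio_meters, ratio_feet)
```

This is why statements such as "twice as warm" are not meaningful for Celsius or Fahrenheit values, whereas "twice as long" is meaningful for lengths.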
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection of
documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite
number of digits.
Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes: For asymmetric attributes, only presence (a non-zero value) is regarded as important.
Binary attributes where only non-zero values are important are called asymmetric binary attributes.
Example: If each object is a student and each attribute records the number of credits for a course, then the
resulting data set will consist of asymmetric discrete or continuous attributes.
Record
    Data Matrix
    Document Data
    Transaction Data
Graph
Ordered
    Spatial Data
    Temporal Data
    Sequential Data
    Genetic Sequence Data
Dimensionality: The dimensionality of a data set is the number of attributes that the objects
in the data set possess.
Sparsity
Only Presence Counts: For some data sets, such as those with asymmetric features, most
attributes of an object have values of 0; in many cases fewer than 1% of the entries are nonzero.
In practical terms, sparsity is an advantage because usually only the non-zero values need to be
stored and manipulated. This results in significant savings with respect to computation time and
storage.
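As a minimal sketch of why sparsity saves storage and computation, the Python below keeps only the non-zero entries of a row in a dictionary and computes a dot product over those entries alone. The helper names and the two sample rows are hypothetical; real systems typically use dedicated sparse-matrix formats instead.

```python
def to_sparse(row):
    """Keep only the non-zero entries as {column_index: value}."""
    return {j: v for j, v in enumerate(row) if v != 0}


def sparse_dot(a, b):
    """Dot product that only visits the non-zero entries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter dictionary
    return sum(v * b[j] for j, v in a.items() if j in b)


# Hypothetical rows in which most attribute values are 0.
dense_1 = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
dense_2 = [0, 2, 4, 0, 0, 0, 0, 0, 0, 5]

s1, s2 = to_sparse(dense_1), to_sparse(dense_2)
print(s1)                  # {2: 3, 6: 1}
print(sparse_dot(s1, s2))  # 12
```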
Resolution
It is frequently possible to obtain data at different levels of resolution, and often the properties of
the data are different at different resolutions.
Example: The surface of the Earth seems very uneven at a resolution of a few meters, but is
relatively smooth at a resolution of tens of kilometers.
Record Data
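Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes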
This type of data is called Market Basket Data because the items in each record are the
products in a person's "market basket."
Data Matrix
Projection of x load | Projection of y load | Distance | Load | Thickness
10.23 | 5.27 | 15.22 | 2.7 | 1.2
12.65 | 6.25 | 16.22 | 2.2 | 1.1
Document Data
Document-term matrix
Columns (terms): team, coach, play, ball, score, game, win, lost, timeout, season
Rows (documents): Document 1, Document 2, Document 3
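A document-term representation of this kind can be sketched in Python as follows. The vocabulary is the term list from the slide; the document texts are made up for illustration, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter

# Term list taken from the slide.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocab):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


# Hypothetical documents; the slide's actual counts are not reproduced here.
documents = {
    "Document 1": "team coach play ball score game win team season",
    "Document 2": "coach timeout lost game game",
}

for name, text in documents.items():
    print(name, term_vector(text, vocabulary))
```

Each document becomes one row of the matrix, and each component records the number of times the corresponding term occurs in that document.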
Transaction Data
TID | Items
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
The data objects are mapped to nodes of the graph, while the
relationships among objects are captured by the links between objects
and link properties, such as direction and weight.
Data with Objects That Are Graphs: If objects have structure, i.e., the
objects contain sub-objects that have relationships, then such objects are
frequently represented as graphs.
e.g.: The structure of chemical compounds can be represented by a
graph, where the nodes are atoms and the links between nodes are
chemical bonds.
e.g.: HTML links between web pages, such as links to pages titled Data Mining,
Graph Partitioning, Parallel Solution of Sparse Linear System of Equations, and
N-Body Computation and Dense Linear System Solvers.
Chemical Data
Ordered Data
[Figure: a sequence, with a callout marking an element of the sequence]
Consider a retail transaction data set that also stores the time at which
the transaction took place.
This time information makes it possible to find patterns such as "candy
sales peak before Halloween." A time can also be associated with
each attribute.
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Time Series Data: Time series data is a special type of sequential data in which each
record is a time series, i.e., a series of measurements taken over time.
e.g.: A financial data set might contain objects that are time series of the daily prices of
various stocks.
e.g.: Consider a time series of the average monthly temperature for Minneapolis
during the years 1982 to 1994. When working with
temporal data, it is important to consider temporal autocorrelation; i.e., if two
measurements are close in time, then the values of those measurements are often
very similar.
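Temporal autocorrelation can be quantified with a lagged autocorrelation coefficient. The sketch below uses one common estimator (lagged covariance divided by the overall sum of squares); the monthly temperatures are invented for illustration and are not the Minneapolis data.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    total = sum((x - mean) ** 2 for x in series)
    lagged = sum((series[t] - mean) * (series[t + lag] - mean)
                 for t in range(n - lag))
    return lagged / total


# Hypothetical average monthly temperatures for two years (degrees Celsius).
one_year = [-10, -7, 0, 8, 15, 20, 23, 21, 16, 9, 1, -7]
temperatures = one_year * 2

# Neighbouring months tend to be similar (positive autocorrelation),
# while months half a year apart tend to be opposite (negative).
print(round(autocorrelation(temperatures, lag=1), 2))
print(round(autocorrelation(temperatures, lag=6), 2))
```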
Ordered Data
Spatial Data: Some objects have spatial attributes, such as positions or areas,
as well as other types of attributes.
e.g.: Weather data (precipitation, temperature, pressure) that is collected for a
variety of geographical locations.
An important aspect of spatial data is spatial autocorrelation; i.e., objects that
are physically close tend to be similar in other ways as well. Thus, two points
on the Earth that are close to each other usually have similar values for
temperature and rainfall.
Spatio-Temporal Data: e.g., average monthly temperature of land and ocean.
Data Quality
Data mining focuses on
the detection and correction of data quality problems, and
the use of algorithms that can tolerate poor data quality.
Examples of data quality problems:
missing values
duplicate data
Even if all the data is present and "looks fine," there may be inconsistencies, e.g., a person
has a height of 2 meters but weighs only 2 kilograms.
Measurement and Data Collection Errors: The term measurement error refers to any
problem resulting from the measurement process. A common problem is that the value
recorded differs from the true value.
For continuous attributes, the numerical difference of the measured and true value is
called the error.
The term data collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object.
e.g.: A study of animals of a certain species might include animals of a related species
that are similar in appearance to the species of interest. Both measurement errors and
data collection errors can be either systematic or random.
Noise
Outliers
Missing Values
Consider an address field, where both a zip code and city are
listed, but the specified zip code area is not contained in that city.
Duplicate Data
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Purpose
Data reduction
Change of scale
Sampling
Types of Sampling
Stratified sampling
Split the data into several partitions; then draw random samples
from each partition
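Stratified sampling as described above can be sketched in Python as follows. The function name, the fixed per-partition sample size, and the sample records are assumptions made for illustration; other variants draw from each partition in proportion to its size.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_stratum, seed=0):
    """Split records into partitions by key, then sample from each one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = min(per_stratum, len(members))  # small partitions give what they have
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical records: (marital status, record id).
statuses = ["Single"] * 6 + ["Married"] * 3 + ["Divorced"] * 1
records = [(status, i) for i, status in enumerate(statuses)]

sample = stratified_sample(records, key=lambda r: r[0], per_stratum=2)
print(sample)
```

Unlike simple random sampling, every partition is represented in the result, including the rare "Divorced" group.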
Sample Size
[Figure: the same data set sampled at 8000 points, 2000 points, and 500 points]
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the
space that it occupies
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data
mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise
Techniques
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Redundant features
duplicate much or all of the information contained in
one or more other attributes
Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
contain no information that is useful for the data
mining task at hand
Example: students' ID is often irrelevant to the task of
predicting students' GPA
Brute-force approach:
Try all possible feature subsets as input to the data mining algorithm
Embedded approaches:
Filter approaches:
Wrapper approaches:
Use the data mining algorithm as a black box to find best subset of attributes
Feature Creation
combining features
Introduction to Data Mining
Fourier Transform
Wavelet Transform
[Figure panels: Data, Equal frequency, K-means]
Attribute Transformation
Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data
objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Euclidean Distance
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
Euclidean Distance
point | x | y
p1 | 0 | 2
p2 | 2 | 0
p3 | 3 | 1
p4 | 5 | 1

 | p1 | p2 | p3 | p4
p1 | 0 | 2.828 | 3.162 | 5.099
p2 | 2.828 | 0 | 1.414 | 3.162
p3 | 3.162 | 1.414 | 0 | 2
p4 | 5.099 | 3.162 | 2 | 0
Distance Matrix
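The distance matrix for the four points p1 to p4 can be recomputed with a few lines of Python. This is a minimal sketch; the function and variable names are illustrative.

```python
import math

# The four points from the slide, as (x, y) pairs.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


names = list(points)
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names]
          for a in names]

for name, row in zip(names, matrix):
    print(name, row)
```

The matrix is symmetric with zeros on the diagonal, so only the entries above (or below) the diagonal need to be computed in practice.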
Minkowski Distance
$$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r}$$
r = 2. Euclidean distance
Minkowski Distance
point | x | y
p1 | 0 | 2
p2 | 2 | 0
p3 | 3 | 1
p4 | 5 | 1

L1 | p1 | p2 | p3 | p4
p1 | 0 | 4 | 4 | 6
p2 | 4 | 0 | 2 | 4
p3 | 4 | 2 | 0 | 2
p4 | 6 | 4 | 2 | 0

L2 | p1 | p2 | p3 | p4
p1 | 0 | 2.828 | 3.162 | 5.099
p2 | 2.828 | 0 | 1.414 | 3.162
p3 | 3.162 | 1.414 | 0 | 2
p4 | 5.099 | 3.162 | 2 | 0

L∞ | p1 | p2 | p3 | p4
p1 | 0 | 2 | 3 | 5
p2 | 2 | 0 | 1 | 3
p3 | 3 | 1 | 0 | 2
p4 | 5 | 3 | 2 | 0
Distance Matrix
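The three matrices can be reproduced with one Minkowski function, sketched below in Python. Treating r = infinity as the maximum coordinate difference is the usual limiting case; the names are illustrative.

```python
import math

# The four points from the slide, as (x, y) pairs.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def minkowski(p, q, r):
    """Minkowski distance; r = math.inf gives the maximum difference."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == math.inf:
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


def distance_matrix(r):
    names = list(points)
    return [[round(minkowski(points[a], points[b], r), 3) for b in names]
            for a in names]


for label, r in (("L1", 1), ("L2", 2), ("Linf", math.inf)):
    print(label)
    for row in distance_matrix(r):
        print(row)
```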
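Mahalanobis Distance
$$\mathrm{mahalanobis}(p, q) = (p - q)\, \Sigma^{-1} (p - q)^{T}$$
where $\Sigma$ is the covariance matrix of the input data $X$:
$$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$$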
Mahalanobis Distance
Covariance Matrix:
$$\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$$
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
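The two values above can be checked with a short Python sketch for the 2-by-2 case. It uses the form without a square root, which is what reproduces the values 5 and 4; the helper names are illustrative.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis(p, q, cov):
    """(p - q) * inverse(cov) * (p - q)^T, with no square root taken."""
    inv = inverse_2x2(cov)
    diff = [pk - qk for pk, qk in zip(p, q)]
    # Row vector times matrix, then dot product with the difference.
    tmp = [sum(diff[i] * inv[i][j] for i in range(2)) for j in range(2)]
    return sum(tmp[j] * diff[j] for j in range(2))


cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

print(round(mahalanobis(A, B, cov), 6))  # 5.0
print(round(mahalanobis(A, C, cov), 6))  # 4.0
```

Although C is farther from A than B is in Euclidean terms, the difference A - C lies along the direction in which the data varies most, so its Mahalanobis distance is smaller.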
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
2.
s(p, q) = s(q, p)
Cosine Similarity
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
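To finish the example, the dot product is divided by the product of the two vector lengths, which is the standard definition of cosine similarity. A minimal Python sketch, with illustrative names:

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


print(dot(d1, d2))                          # 5
print(round(math.sqrt(dot(d1, d1)), 3))     # 6.481
print(round(math.sqrt(dot(d2, d2)), 3))     # 2.449
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```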
Correlation
Correlation measures the linear relationship
between objects
To compute correlation, we standardize data
objects, p and q, and then take their dot product
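$$p'_k = \frac{p_k - \mathrm{mean}(p)}{\mathrm{std}(p)}, \qquad q'_k = \frac{q_k - \mathrm{mean}(q)}{\mathrm{std}(q)}$$
$$\mathrm{correlation}(p, q) = p' \cdot q'$$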
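The standardize-then-dot-product recipe can be sketched in Python as below. Two details are assumptions on my part: the sample standard deviation (dividing by n - 1) is used, and the dot product is divided by n - 1 so that the result falls in the range -1 to 1. The sample vectors are hypothetical.

```python
import math


def standardize(x):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return [(v - mean) / std for v in x]


def correlation(p, q):
    """Dot product of the standardized vectors, scaled by n - 1."""
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)


# Hypothetical data objects.
p = [1.0, 2.0, 3.0, 4.0]
print(round(correlation(p, [2.0, 4.0, 6.0, 8.0]), 6))  # perfectly linear, increasing
print(round(correlation(p, [8.0, 6.0, 4.0, 2.0]), 6))  # perfectly linear, decreasing
print(round(correlation(p, [1.0, 3.0, 2.0, 4.0]), 6))  # weaker linear relationship
```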
Scatter plots showing the similarity from -1 to 1.
Density
Examples:
Euclidean density
Probability density
Graph-based density
THANK YOU