
Distance or Similarity Measures

Many data mining and analytics tasks involve comparing objects and determining their similarities (or dissimilarities):
  Clustering
  Nearest-neighbor search, classification, and prediction
  Correlation analysis

Many of today's real-world applications rely on the computation of similarities or distances among objects:
  Personalization
  Recommender systems
  Document categorization
  Information retrieval

Similarity and Dissimilarity


Similarity
  Numerical measure of how alike two data objects are
  Value is higher when objects are more alike
  Often falls in the range [0,1]

Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
  Minimum dissimilarity is often 0
  Upper limit varies

Proximity refers to either a similarity or a dissimilarity

Distance or Similarity Measures


Measuring Distance
  To group similar items, we need a way to measure the distance between objects (e.g., records)
  This often requires representing objects as feature vectors

An Employee DB:

  ID  Gender  Age  Salary
  1   F       27    19,000
  2   M       51    64,000
  3   M       52   100,000
  4   F       33    55,000
  5   M       45    45,000

Feature vector corresponding to Employee 2: <M, 51, 64000.0>

Term Frequencies for Documents:

        T1  T2  T3  T4  T5  T6
  Doc1   0   4   0   0   0   2
  Doc2   3   1   4   3   1   2
  Doc3   3   0   0   0   3   0
  Doc4   0   1   0   3   0   0
  Doc5   2   2   2   3   1   4

Feature vector corresponding to Document 4: <0, 1, 0, 3, 0, 0>

Distance or Similarity Measures


Properties of Distance Measures:
  For all objects A and B: dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
  For any object A: dist(A, A) = 0

Representation of objects as vectors:
  Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data
  The vector representation allows us to compute distance or similarity between pairs of items using standard vector operations, e.g.,
    Cosine of the angle between vectors
    Manhattan distance
    Euclidean distance
    Hamming distance

Data Matrix and Distance Matrix


Data matrix
  Conceptual representation of a table
    Cols = features; rows = data objects
  n data points with p dimensions
  Each row in the matrix is the vector representation of a data object

    | x11 ... x1f ... x1p |
    | ... ... ... ... ... |
    | xi1 ... xif ... xip |
    | ... ... ... ... ... |
    | xn1 ... xnf ... xnp |

Distance (or Similarity) Matrix
  n data points, but indicates only the pairwise distance (or similarity)
  A triangular matrix
  Symmetric

    |   0                         |
    | d(2,1)   0                  |
    | d(3,1)  d(3,2)   0          |
    |   :       :      :          |
    | d(n,1)  d(n,2)  ...  ...  0 |

Proximity Measure for Nominal Attributes


If object attributes are all nominal (categorical), then matching-based proximity measures are used to compare objects
  A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)

Method 1: Simple matching

  d(i, j) = (p - m) / p

  where m = # of matches and p = total # of variables

Method 2: Convert to standard spreadsheet format
  For each attribute A, create M binary attributes for the M nominal states of A
  Then use standard vector-based similarity or distance metrics (see the sketch below)
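
A minimal Python sketch of both methods (the function names and the color/size attributes are illustrative, not from the slides):

```python
def simple_matching_distance(obj_i, obj_j):
    """Method 1: d(i, j) = (p - m) / p, where m is the number of
    matching attribute values and p is the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: represent a nominal attribute with M states as M
    binary attributes, one per state."""
    return [1 if value == s else 0 for s in states]

# Two objects described by (color, size):
print(simple_matching_distance(("red", "small"), ("red", "large")))  # 0.5
print(one_hot("red", ["red", "yellow", "blue", "green"]))            # [1, 0, 0, 0]
```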

Proximity Measure for Binary Attributes


A contingency table for binary data:

                   Object j
                  1     0   | sum
  Object i   1    a     b   | a+b
             0    c     d   | c+d
            sum  a+c   b+d  | p

Simple matching coefficient:

  d(i, j) = (b + c) / (a + b + c + d)

Proximity Measure for Binary Attributes


Example

Let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Here the attributes are treated as asymmetric binary, so the negative matches d are ignored:

  d(i, j) = (b + c) / (a + b + c)
Common Distance Measures for Numeric Data


Consider two vectors X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>
  Rows in the data matrix

Common distance measures:

  Manhattan distance:  dist(X, Y) = Σi |xi - yi|

  Euclidean distance:  dist(X, Y) = sqrt( Σi (xi - yi)² )

Distance can be defined as a dual of a similarity measure:

  dist(X, Y) = 1 - sim(X, Y)
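
A minimal sketch of the two distances in plain Python (the sample vectors are illustrative):

```python
import math

def manhattan(x, y):
    """Manhattan distance: sum of the absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

x, y = [0, 3, 4, 5], [7, 6, 3, -1]
print(manhattan(x, y))  # 17
print(euclidean(x, y))  # ~9.75
```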

Data Normalization for Numeric Data


A database can contain any number of numeric attributes
An attribute with a larger range, or noise, can distort the distances between samples
  For example: an Income attribute can dominate the distance compared to Weight and Age attributes
The objective of normalization is to convert all numeric attributes so that their values fall within a small specified range, such as 0 to 1.0
Normalization is particularly useful for clustering and distance-based algorithms such as k-nearest neighbor

  Attr1  Attr2  Attr3  Attr4
   12     25     33     34
   22     24     32     36
   26     25     35     31
  212     23     36     36

Data Normalization
min-max normalization

  v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

Example: Suppose that the minimum and maximum values for the attribute income are 12,000 and 98,000. By min-max normalization, a value of 73,600 for income is transformed to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
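
A small sketch reproducing the income example (the function name is illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example from the slide:
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```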

Data Normalization
z-score normalization
  This method of normalization is useful when the actual minimum and maximum of an attribute are unknown, or when outliers dominate the min-max normalization

  v' = (v - meanA) / stand_devA

Decimal scaling normalization

  v' = v / 10^j

  where j is the smallest integer such that max(|v'|) < 1
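
A sketch of both methods; the digit-counting shortcut for j and the sample values are assumptions, not from the slides:

```python
import statistics

def z_score_normalize(values):
    """v' = (v - mean_A) / stand_dev_A for every value of the attribute."""
    mean_a = statistics.mean(values)
    std_a = statistics.stdev(values)
    return [(v - mean_a) / std_a for v in values]

def decimal_scaling_normalize(values):
    """v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1.
    Here j is obtained by counting the digits of the largest magnitude."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

# Illustrative values:
print(decimal_scaling_normalize([-986, 917]))  # [-0.986, 0.917]
```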

Example: Data Matrix and Distance Matrix


[Figure: a data matrix together with its Manhattan and Euclidean distance matrices]

Vector-Based Similarity Measures


In some situations, distance measures provide a skewed view of the data
  E.g., when the data is very sparse and 0s in the vectors are not significant
In such cases, vector-based similarity measures are typically used

Consider X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>

Most common measure: Cosine similarity
  Dot product of two vectors:

    sim(X, Y) = X · Y = Σi xi · yi
Cosine Similarity = normalized dot product
  The norm of a vector X is:

    ||X|| = sqrt( Σi xi² )

  The cosine similarity is:

    sim(X, Y) = (X · Y) / (||X|| · ||Y||) = Σi (xi · yi) / ( sqrt(Σi xi²) · sqrt(Σi yi²) )
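
A minimal Python sketch of the normalized dot product (names are illustrative):

```python
import math

def cosine_similarity(x, y):
    """Normalized dot product: (X . Y) / (||X|| * ||Y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```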

Example Application: Information Retrieval


Documents are represented as bags of words
A vector is an array of floating-point values (or binary values in the case of bitmaps)
  Has direction and magnitude
Each vector has a component for every term in the collection (most are sparse)

Example (Vector Space Model)

Q: gold silver truck


D1: Shipment of gold damaged in a fire
D2: Delivery of silver arrived in a silver truck
D3: Shipment of gold arrived in a truck

The idf for the terms in the three documents is given below:

Example (Vector Space Model)


Weight the items of the vectors using the tf weighting scheme: each cell contains the raw term frequency of the term within the document


Documents & Query in n-dimensional Space

Documents are represented as vectors in the term space
  Typically, values in each dimension correspond to the frequency of the corresponding term in the document
Queries are represented as vectors in the same vector space
Cosine similarity between the query and documents is often used to rank retrieved documents

Example: Similarities among Documents

Consider the following document-term matrix


        T1  T2  T3  T4  T5  T6  T7  T8
  Doc1   0   4   0   0   0   2   1   3
  Doc2   3   1   4   3   1   2   0   1
  Doc3   3   0   0   0   3   0   3   0
  Doc4   0   1   0   3   0   0   2   0
  Doc5   2   2   2   3   1   4   0   2

Dot-Product(Doc2, Doc4) = <3,1,4,3,1,2,0,1> · <0,1,0,3,0,0,2,0>
                        = 0 + 1 + 0 + 9 + 0 + 0 + 0 + 0 = 10
Norm(Doc2) = SQRT(9+1+16+9+1+4+0+1) = 6.4
Norm(Doc4) = SQRT(0+1+0+9+0+0+4+0) = 3.74
Cosine(Doc2, Doc4) = 10 / (6.4 * 3.74) = 0.42
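
The same numbers can be checked with a few lines of Python:

```python
import math

doc2 = [3, 1, 4, 3, 1, 2, 0, 1]
doc4 = [0, 1, 0, 3, 0, 0, 2, 0]

dot = sum(a * b for a, b in zip(doc2, doc4))   # 10
norm2 = math.sqrt(sum(a * a for a in doc2))    # sqrt(41) ~ 6.40
norm4 = math.sqrt(sum(b * b for b in doc4))    # sqrt(14) ~ 3.74
print(round(dot / (norm2 * norm4), 2))         # 0.42
```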


Rank Correlation as Similarity


In cases where we want to analyze the correlation between two features, or where there is high mean variance across data objects (e.g., movie or product ratings), a rank correlation coefficient is often a better option
  Spearman's rank correlation coefficient
  Kendall's tau rank correlation coefficient

Often used in recommender systems based on Collaborative Filtering


Rank Correlation as Similarity


Products rated: Dell, Apple, Samsung, Acer, HP, Sony

User1 ratings:
  Dell (8/10), Apple (5/10), Samsung (9/10), Acer (7/10), HP (4/10), Sony (3/10)

User2 ratings:
  Dell (1.7/10), Apple (1/10), Samsung (2.0/10), Acer (1.5/10), HP (0.5/10), Sony (0.3/10)

User3 ratings:
  Dell (8/10), Apple (5/10), Samsung (9/10), Acer (4/10), HP (6/10), Sony (7/10)

Rank Correlation as Similarity


Which user is most similar to User1?
  User1 and User2
  User1 and User3

Rank Correlation as Similarity


(Example)

Find the correlation between student IQ and hours of TV watched per week

  r_s = 1 - (6 Σi di²) / (n (n² - 1))

  di is the difference between the ranks of the two attributes for sample i
  n = total number of samples
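
A minimal sketch, assuming rank 1 is assigned to the largest value and that there are no ties (the formula above does not handle ties):

```python
def to_ranks(values):
    """Rank values from 1 (largest) to n (smallest); ties are not handled."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(values_x, values_y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = to_ranks(values_x), to_ranks(values_y)
    n = len(rx)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# User1 vs User2 from the ratings slide: identical orderings, so r_s = 1.0
user1 = [8, 5, 9, 7, 4, 3]
user2 = [1.7, 1.0, 2.0, 1.5, 0.5, 0.3]
print(spearman_rho(user1, user2))  # 1.0
```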

Rank Correlation as Similarity


(Example)

Rank Correlation as Similarity


(Example)

Range: [-1, +1]
  Close to +1 indicates positive correlation
  Close to -1 indicates negative correlation

Dynamic Time Warping

[Berndt, Clifford, 1994]


Allows acceleration-deceleration of signals along the time dimension

Basic idea
  Consider X = x1, x2, ..., xn and Y = y1, y2, ..., yn
  We are allowed to extend each sequence by repeating elements
  The Euclidean distance is then calculated between the extended sequences
  Matrix M, where mij = d(xi, yj)

Example

Euclidean distance vs DTW

How to Calculate DTW


Steps
Distance table calculation
Calculating shortest path in the table

Example:
Series1 = {2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0,
5.0, 2.0, 4.0, 5.0, 5.0 }
Series2 = {4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0,
3.0, 4.0, 5.0, 5.0 }

How to Calculate DTW


Distance Table Calculation
  Distance matrix computation:

    dist(i,j) = dist(si, qj) + min{ dist(i-1,j-1), dist(i,j-1), dist(i-1,j) }
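
A minimal sketch of the table-filling step, assuming absolute difference as the local cost (the slides do not say which local cost is used; squared difference is also common):

```python
def dtw_distance(s, q, d=lambda a, b: abs(a - b)):
    """Fill the cumulative table with
    dist(i,j) = d(s[i], q[j]) + min(dist(i-1,j-1), dist(i,j-1), dist(i-1,j))
    and return the value in the final cell."""
    n, m = len(s), len(q)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = d(s[i - 1], q[j - 1]) + min(
                dist[i - 1][j - 1], dist[i][j - 1], dist[i - 1][j])
    return dist[n][m]

series1 = [2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0, 5.0, 2.0, 4.0, 5.0, 5.0]
series2 = [4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0, 3.0, 4.0, 5.0, 5.0]
print(dtw_distance(series1, series2))
```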

Example 1

       q1     q2     q3     q4     q5     q6     q7     q8     q9
  s1   3.76   2.02   6.35  16.8    3.20   3.39   4.75   0.96   0.02
  s2   8.07   5.38  11.70  25.10   7.24   7.51   9.49   3.53   1.08
  s3   1.64   0.58   3.46  11.90   1.28   1.39   2.31   0.10   0.27
  s4   1.08   2.43   0.21   1.28   1.42   1.30   0.64   4.00   8.07
  s5   2.86   4.88   1.23   0.23   3.39   3.20   2.10   7.02  12.18
  s6   0.00   0.31   0.29   4.54   0.04   0.02   0.04   1.00   3.39
  s7   0.06   0.59   0.11   3.69   0.16   0.12   0.00   1.46   4.20
  s8   1.88   3.57   0.62   0.64   2.31   2.16   1.28   5.43  10.05
  s9   1.25   2.69   0.29   1.10   1.61   1.49   0.77   4.33   8.53

Matrix of the pairwise distances for element si with qj

Example 1
s1
s2
s3
s4
s5
s6
s7
s8
s9
q1 3.76 11.83 13.47 14.55 17.41 17.41 17.47 19.35 20.60
q2 5.78 9.14 9.72 12.15 17.03 17.34 17.93 21.04 22.04
q3 12.13 17.48 12.60 9.93 11.16 11.45 11.56 12.18 12.47
q4 29.02 37.23 24.50 11.21 10.16 14.70 15.14 12.20 13.28
q5 32.22 36.26 25.78 12.63 13.55 10.20 10.36 12.67 13.81
q6 35.61 39.73 27.17 13.93 15.83 10.22 10.32 12.48 13.97
q7 40.36 45.10 29.48 14.57 16.03 10.26 10.22 11.50 12.27
q8 41.32 43.89 29.58 18.57 21.59 11.26 11.68 15.65 15.83
q9 41.34 42.40 29.85 26.64 30.75 14.65 15.46 21.73 24.18

Matrix computed with Dynamic Programming based on:

  dist(i,j) = dist(si, qj) + min{ dist(i-1,j-1), dist(i,j-1), dist(i-1,j) }

Example 2
Example:
Series1 = {2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0,
5.0, 2.0, 4.0, 5.0, 5.0 }
Series2 = {4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0,
3.0, 4.0, 5.0, 5.0 }

Example 2

Matrix computed with Dynamic Programming based on:

  dist(i,j) = dist(si, qj) + min{ dist(i-1,j-1), dist(i,j-1), dist(i-1,j) }

Visualizing Data
Visualization of data with inherent 2D semantics was done even before the advent of computers
With computers, novel techniques have been developed and existing techniques extended
Objective: go beyond the 2D page to draw the viewer's mind into multi-dimensional (MD) spaces

Goals of Visualization
Explorative Analysis
starting point: data without hypotheses about the data
process: interactive, usually undirected search for
structures, trends, etc.
result: visualization of the data, which provides hypotheses
about the data

Hypothesis Analysis
starting point: hypotheses about the data
process: goal-oriented examination of the hypotheses
result: visualization of the data, which allows the
confirmation or rejection of the hypotheses

Presentation
starting point: facts to be presented are fixed a priori
process: choice of an appropriate presentation technique
result: high-quality visualization of the data presenting the
facts

Data Visualization

A cube grows a dimension by doubling itself
Increasing dimensions expand the physical limits of the display, to the point of bursting out of the assigned space in 4D

Data Visualization Techniques


Various techniques exist for multidimensional, multivariate data analysis:

  Parallel Coordinates
  Icon-based techniques
  Graph-based techniques
  Pixel-oriented techniques

Parallel Coordinates
N equidistant axes, parallel to one of the screen axes, correspond to the attributes
The axes are scaled to the [minimum, maximum] range of the corresponding attribute
Every data item corresponds to a polygonal line which intersects each axis at the point corresponding to the item's value for that attribute (a plotting sketch follows)
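
A minimal plotting sketch using pandas' built-in helper; the data reuses the Attr1-Attr4 sample from the normalization slide, and the group column is a hypothetical label used only for coloring. Note that pandas plots raw values on one shared scale rather than rescaling each axis to [min, max] as described above, so attributes are often normalized first:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Each row is a data object, each column an attribute.
df = pd.DataFrame({
    "Attr1": [12, 22, 26, 212],
    "Attr2": [25, 24, 25, 23],
    "Attr3": [33, 32, 35, 36],
    "Attr4": [34, 36, 31, 36],
    "group": ["a", "a", "b", "b"],  # hypothetical labels for coloring
})
parallel_coordinates(df, "group")
plt.show()
```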

Too many data points?

Query Dependent Coloring

Icon-based Techniques
Basic idea: visualization of the data values as features of icons

Examples:
  Chernoff Faces [Che73, Tuf83]
  Stick Figures [Pic70, PG88]
  Shape Coding [Bed90]
  Color Icons [Lev91, KK94]
  TileBars [Hea95]: use of small icons representing the relevance feature vectors in document retrieval

Chernoff-faces

Chernoff-faces

Stick Figures
Visualization of multidimensional data using stick figure icons
Two attributes of the data are mapped to the display axes, and the remaining attributes are mapped to the angle and/or length of the limbs
Texture patterns in the visualization show certain data characteristics

Stick Figures

Shape Coding

Shape Coding

Color Icons

Color Icon Example

Pixel-Oriented Techniques
Basic Idea
  Each attribute value is represented by one colored pixel
  The value ranges of the attributes are mapped to a fixed colormap
  The attribute values for each attribute are presented in separate subwindows

Pixel-Oriented Techniques

Pixel-Oriented Techniques
Question: How to arrange the pixels on the screen?

Query Dependent
  Visualizes data in the context of a specific user query, giving users feedback on their queries and directing their search
  Instead of directly mapping attribute values to colors, distances of attribute values to the query are mapped to colors

Pixel-Oriented Techniques: All Attributes
