
Distance or Similarity Measures

Many data mining and analytics tasks involve comparing objects and determining their similarities (or dissimilarities):
  Clustering
  Nearest-neighbor search, classification, and prediction
  Correlation analysis

Many of today's real-world applications rely on the computation of similarities or distances among objects:
  Personalization
  Recommender systems
  Document categorization
  Information retrieval

Similarity and Dissimilarity


Similarity
  Numerical measure of how alike two data objects are
  Value is higher when objects are more alike
  Often falls in the range [0,1]

Dissimilarity (e.g., distance)
  Numerical measure of how different two data objects are
  Lower when objects are more alike
  Minimum dissimilarity is often 0
  Upper limit varies

Proximity refers to either a similarity or a dissimilarity

Distance or Similarity Measures


Measuring Distance
  To group similar items, we need a way to measure the distance between objects (e.g., records)
  This often requires representing objects as feature vectors

An Employee DB:

  ID  Gender  Age  Salary
  1   F       27    19,000
  2   M       51    64,000
  3   M       52   100,000
  4   F       33    55,000
  5   M       45    45,000

Feature vector corresponding to Employee 2: <M, 51, 64000.0>

Term Frequencies for Documents:

        T1  T2  T3  T4  T5  T6
  Doc1   0   4   0   0   0   2
  Doc2   3   1   4   3   1   2
  Doc3   3   0   0   0   3   0
  Doc4   0   1   0   3   0   0
  Doc5   2   2   2   3   1   4

Feature vector corresponding to Document 4: <0, 1, 0, 3, 0, 0>

Distance or Similarity Measures


Properties of Distance Measures:
  For all objects A and B: dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
  For any object A: dist(A, A) = 0

Representation of objects as vectors:
  Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data
  The vector representation allows us to compute distance or similarity between pairs of items using standard vector operations, e.g.,
    Cosine of the angle between vectors
    Manhattan distance
    Euclidean distance
    Hamming distance

Data Matrix and Distance Matrix


Data matrix
  Conceptual representation of a table
    Cols = features; rows = data objects
  n data points with p dimensions
  Each row in the matrix is the vector representation of a data object

    | x11 ... x1f ... x1p |
    | ... ... ... ... ... |
    | xi1 ... xif ... xip |
    | ... ... ... ... ... |
    | xn1 ... xnf ... xnp |

Distance (or Similarity) Matrix
  n data points, but indicates only the pairwise distance (or similarity)
  A triangular matrix
  Symmetric

    |   0                         |
    | d(2,1)   0                  |
    | d(3,1)  d(3,2)   0          |
    |   :       :      :          |
    | d(n,1)  d(n,2)  ...  ...  0 |

Proximity Measure for Nominal Attributes


If object attributes are all nominal (categorical), then matching-based proximity measures are used to compare objects
  A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)

Method 1: Simple matching

  d(i, j) = (p - m) / p

  where m = # of matches and p = total # of variables

Method 2: Convert to standard spreadsheet format
  For each attribute A, create M binary attributes for the M nominal states of A
  Then use standard vector-based similarity or distance metrics (see the sketch below)
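
A minimal Python sketch of both methods (the function names and the color/size attributes are illustrative, not from the slides):

```python
def simple_matching_distance(obj_i, obj_j):
    """Method 1: d(i, j) = (p - m) / p, where m is the number of
    matching attribute values and p is the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: represent a nominal attribute with M states as M
    binary attributes, one per state."""
    return [1 if value == s else 0 for s in states]

# Two objects described by (color, size):
print(simple_matching_distance(("red", "small"), ("red", "large")))  # 0.5
print(one_hot("red", ["red", "yellow", "blue", "green"]))            # [1, 0, 0, 0]
```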

Proximity Measure for Binary Attributes


A contingency table for binary data:

                   Object j
                  1     0   | sum
  Object i   1    a     b   | a+b
             0    c     d   | c+d
            sum  a+c   b+d  | p

Simple matching coefficient:

  d(i, j) = (b + c) / (a + b + c + d)

Proximity Measure for Binary Attributes


Example

Let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Here the attributes are treated as asymmetric binary, so the negative matches d are ignored:

  d(i, j) = (b + c) / (a + b + c)
Common Distance Measures for Numeric Data


Consider two vectors X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>
  Rows in the data matrix

Common distance measures:

  Manhattan distance:  dist(X, Y) = Σi |xi - yi|

  Euclidean distance:  dist(X, Y) = sqrt( Σi (xi - yi)² )

Distance can be defined as a dual of a similarity measure:

  dist(X, Y) = 1 - sim(X, Y)
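
A minimal sketch of the two distances in plain Python (the sample vectors are illustrative):

```python
import math

def manhattan(x, y):
    """Manhattan distance: sum of the absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

x, y = [0, 3, 4, 5], [7, 6, 3, -1]
print(manhattan(x, y))  # 17
print(euclidean(x, y))  # ~9.75
```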

Data Normalization for Numeric Data


A database can contain any number of numeric attributes
An attribute with a larger range, or noise, can distort the distances between samples
  For example: an Income attribute can dominate the distance compared to Weight and Age attributes
The objective of normalization is to convert all numeric attributes so that their values fall within a small specified range, such as 0 to 1.0
Normalization is particularly useful for clustering and distance-based algorithms such as k-nearest neighbor

  Attr1  Attr2  Attr3  Attr4
   12     25     33     34
   22     24     32     36
   26     25     35     31
  212     23     36     36

Data Normalization
min-max normalization

  v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

Example: Suppose that the minimum and maximum values for the attribute income are 12,000 and 98,000. By min-max normalization, a value of 73,600 for income is transformed to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
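
A small sketch reproducing the income example (the function name is illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# The income example from the slide:
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```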

Data Normalization
z-score normalization
  This method of normalization is useful when the actual minimum and maximum of an attribute are unknown, or when outliers dominate the min-max normalization

  v' = (v - meanA) / stand_devA

Decimal scaling normalization

  v' = v / 10^j

  where j is the smallest integer such that max(|v'|) < 1
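
A sketch of both methods; the digit-counting shortcut for j and the sample values are assumptions, not from the slides:

```python
import statistics

def z_score_normalize(values):
    """v' = (v - mean_A) / stand_dev_A for every value of the attribute."""
    mean_a = statistics.mean(values)
    std_a = statistics.stdev(values)
    return [(v - mean_a) / std_a for v in values]

def decimal_scaling_normalize(values):
    """v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1.
    Here j is obtained by counting the digits of the largest magnitude."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

# Illustrative values:
print(decimal_scaling_normalize([-986, 917]))  # [-0.986, 0.917]
```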

Example: Data Matrix and Distance Matrix


[Figure: a data matrix together with its Manhattan and Euclidean distance matrices]

Vector-Based Similarity Measures


In some situations, distance measures provide a skewed view of the data
  E.g., when the data is very sparse and 0s in the vectors are not significant
In such cases, vector-based similarity measures are typically used

Consider X = <x1, x2, ..., xn> and Y = <y1, y2, ..., yn>

Most common measure: Cosine similarity
  Dot product of two vectors:

    sim(X, Y) = X · Y = Σi xi · yi
Cosine Similarity = normalized dot product
  The norm of a vector X is:

    ||X|| = sqrt( Σi xi² )

  The cosine similarity is:

    sim(X, Y) = (X · Y) / (||X|| · ||Y||) = Σi (xi · yi) / ( sqrt(Σi xi²) · sqrt(Σi yi²) )
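
A minimal Python sketch of the normalized dot product (names are illustrative):

```python
import math

def cosine_similarity(x, y):
    """Normalized dot product: (X . Y) / (||X|| * ||Y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```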

Example Application: Information Retrieval


Documents are represented as bags of words
A vector is an array of floating-point values (or binary values in the case of bitmaps)
  Has direction and magnitude
Each vector has a component for every term in the collection (most are sparse)

Example (Vector Space Model)

Q: gold silver truck


D1: Shipment of gold damaged in a fire
D2: Delivery of silver arrived in a silver truck
D3: Shipment of gold arrived in a truck

The idf for the terms in the three documents is given below:

Example (Vector Space Model)


Weight the items of the vectors using the tf weighting scheme: each cell contains the raw term frequency of the term within the document


Documents & Query in n-dimensional Space

Documents are represented as vectors in the term space
  Typically, values in each dimension correspond to the frequency of the corresponding term in the document
Queries are represented as vectors in the same vector space
Cosine similarity between the query and documents is often used to rank retrieved documents

Example: Similarities among Documents

Consider the following document-term matrix


        T1  T2  T3  T4  T5  T6  T7  T8
  Doc1   0   4   0   0   0   2   1   3
  Doc2   3   1   4   3   1   2   0   1
  Doc3   3   0   0   0   3   0   3   0
  Doc4   0   1   0   3   0   0   2   0
  Doc5   2   2   2   3   1   4   0   2

Dot-Product(Doc2, Doc4) = <3,1,4,3,1,2,0,1> · <0,1,0,3,0,0,2,0>
                        = 0 + 1 + 0 + 9 + 0 + 0 + 0 + 0 = 10
Norm(Doc2) = SQRT(9+1+16+9+1+4+0+1) = 6.4
Norm(Doc4) = SQRT(0+1+0+9+0+0+4+0) = 3.74
Cosine(Doc2, Doc4) = 10 / (6.4 * 3.74) = 0.42
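
The same numbers can be checked with a few lines of Python:

```python
import math

doc2 = [3, 1, 4, 3, 1, 2, 0, 1]
doc4 = [0, 1, 0, 3, 0, 0, 2, 0]

dot = sum(a * b for a, b in zip(doc2, doc4))   # 10
norm2 = math.sqrt(sum(a * a for a in doc2))    # sqrt(41) ~ 6.40
norm4 = math.sqrt(sum(b * b for b in doc4))    # sqrt(14) ~ 3.74
print(round(dot / (norm2 * norm4), 2))         # 0.42
```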


Rank Correlation as Similarity


In cases where we want to analyze the correlation between two features, or where there is high mean variance across data objects (e.g., movie or product ratings), a rank correlation coefficient is often a better option
  Spearman's rank correlation coefficient
  Kendall's tau rank correlation coefficient

Often used in recommender systems based on Collaborative Filtering


Rank Correlation as Similarity


Products rated: Dell, Apple, Samsung, Acer, HP, Sony

User1 ratings:
  Dell (8/10), Apple (5/10), Samsung (9/10), Acer (7/10), HP (4/10), Sony (3/10)

User2 ratings:
  Dell (1.7/10), Apple (1/10), Samsung (2.0/10), Acer (1.5/10), HP (0.5/10), Sony (0.3/10)

User3 ratings:
  Dell (8/10), Apple (5/10), Samsung (9/10), Acer (4/10), HP (6/10), Sony (7/10)

Rank Correlation as Similarity


Which user is most similar to User1?
  User1 and User2
  User1 and User3

Rank Correlation as Similarity


(Example)

Find the correlation between student IQ and hours of TV watched per week

  r_s = 1 - (6 Σi di²) / (n (n² - 1))

  di is the difference between the ranks of the two attributes for sample i
  n = total number of samples
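
A minimal sketch, assuming rank 1 is assigned to the largest value and that there are no ties (the formula above does not handle ties):

```python
def to_ranks(values):
    """Rank values from 1 (largest) to n (smallest); ties are not handled."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(values_x, values_y):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    rx, ry = to_ranks(values_x), to_ranks(values_y)
    n = len(rx)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# User1 vs User2 from the ratings slide: identical orderings, so r_s = 1.0
user1 = [8, 5, 9, 7, 4, 3]
user2 = [1.7, 1.0, 2.0, 1.5, 0.5, 0.3]
print(spearman_rho(user1, user2))  # 1.0
```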

Rank Correlation as Similarity


(Example)

Rank Correlation as Similarity


(Example)

Range: [-1, +1]
  Close to +1 indicates positive correlation
  Close to -1 indicates negative correlation

Dynamic Time Warping

[Berndt, Clifford, 1994]


Allows acceleration-deceleration of signals along the time dimension

Basic idea
  Consider X = x1, x2, ..., xn and Y = y1, y2, ..., yn
  We are allowed to extend each sequence by repeating elements
  The Euclidean distance is then calculated between the extended sequences
  Matrix M, where mij = d(xi, yj)

Example

Euclidean distance vs DTW

How to Calculate DTW


Steps
Distance table calculation
Calculating shortest path in the table

Example:
Series1 = {2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0,
5.0, 2.0, 4.0, 5.0, 5.0 }
Series2 = {4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0,
3.0, 4.0, 5.0, 5.0 }

How to Calculate DTW


Distance Table Calculation
  Distance matrix computation:

    dist(i,j) = dist(si, qj) + min{ dist(i-1,j-1), dist(i,j-1), dist(i-1,j) }
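
A minimal sketch of the table-filling step, assuming absolute difference as the local cost (the slides do not say which local cost is used; squared difference is also common):

```python
def dtw_distance(s, q, d=lambda a, b: abs(a - b)):
    """Fill the cumulative table with
    dist(i,j) = d(s[i], q[j]) + min(dist(i-1,j-1), dist(i,j-1), dist(i-1,j))
    and return the value in the final cell."""
    n, m = len(s), len(q)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = d(s[i - 1], q[j - 1]) + min(
                dist[i - 1][j - 1], dist[i][j - 1], dist[i - 1][j])
    return dist[n][m]

series1 = [2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0, 5.0, 2.0, 4.0, 5.0, 5.0]
series2 = [4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0, 3.0, 4.0, 5.0, 5.0]
print(dtw_distance(series1, series2))
```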

Example 1

       q1     q2     q3     q4     q5     q6     q7     q8     q9
  s1   3.76   2.02   6.35  16.8    3.20   3.39   4.75   0.96   0.02
  s2   8.07   5.38  11.70  25.10   7.24   7.51   9.49   3.53   1.08
  s3   1.64   0.58   3.46  11.90   1.28   1.39   2.31   0.10   0.27
  s4   1.08   2.43   0.21   1.28   1.42   1.30   0.64   4.00   8.07
  s5   2.86   4.88   1.23   0.23   3.39   3.20   2.10   7.02  12.18
  s6   0.00   0.31   0.29   4.54   0.04   0.02   0.04   1.00   3.39
  s7   0.06   0.59   0.11   3.69   0.16   0.12   0.00   1.46   4.20
  s8   1.88   3.57   0.62   0.64   2.31   2.16   1.28   5.43  10.05
  s9   1.25   2.69   0.29   1.10   1.61   1.49   0.77   4.33   8.53

Matrix of the pairwise distances for element si with qj

Example 1
s1
s2
s3
s4
s5
s6
s7
s8
s9
q1 3.76 11.83 13.47 14.55 17.41 17.41 17.47 19.35 20.60
q2 5.78 9.14 9.72 12.15 17.03 17.34 17.93 21.04 22.04
q3 12.13 17.48 12.60 9.93 11.16 11.45 11.56 12.18 12.47
q4 29.02 37.23 24.50 11.21 10.16 14.70 15.14 12.20 13.28
q5 32.22 36.26 25.78 12.63 13.55 10.20 10.36 12.67 13.81
q6 35.61 39.73 27.17 13.93 15.83 10.22 10.32 12.48 13.97
q7 40.36 45.10 29.48 14.57 16.03 10.26 10.22 11.50 12.27
q8 41.32 43.89 29.58 18.57 21.59 11.26 11.68 15.65 15.83
q9 41.34 42.40 29.85 26.64 30.75 14.65 15.46 21.73 24.18

Matrix computed with Dynamic Programming based on:

  dist(i,j) = dist(si, qj) + min{ dist(i-1,j-1), dist(i,j-1), dist(i-1,j) }

Example 2
Example:
Series1 = {2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0,
5.0, 2.0, 4.0, 5.0, 5.0 }
Series2 = {4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0,
3.0, 4.0, 5.0, 5.0 }

Example 2

Matrix computed with Dynamic Programming based on:

  dist(i,j) = dist(si, qj) + min{ dist(i-1,j-1), dist(i,j-1), dist(i-1,j) }

Visualizing Data
Visualization of data with inherent 2D semantics was done even before the advent of computers
With computers, novel techniques have been developed and existing techniques extended
Objective: go beyond the 2D page to draw the viewer's mind into multi-dimensional (MD) spaces

Goals of Visualization
Explorative Analysis
starting point: data without hypotheses about the data
process: interactive, usually undirected search for
structures, trends, etc.
result: visualization of the data, which provides hypotheses
about the data

Hypothesis Analysis
starting point: hypotheses about the data
process: goal-oriented examination of the hypotheses
result: visualization of the data, which allows the
confirmation or rejection of the hypotheses

Presentation
starting point: facts to be presented are fixed a priori
process: choice of an appropriate presentation technique
result: high-quality visualization of the data presenting the
facts

Data Visualization

A cube grows a dimension by doubling itself
Increasing dimensions expand the physical limits of the display, to the point of bursting out of the assigned space in 4D

Data Visualization Techniques


Various techniques exist for multidimensional, multivariate data analysis:

  Parallel Coordinates
  Icon-based techniques
  Graph-based techniques
  Pixel-oriented techniques

Parallel Coordinates
N equidistant axes, parallel to one of the screen axes, correspond to the attributes
The axes are scaled to the [minimum, maximum] range of the corresponding attribute
Every data item corresponds to a polygonal line which intersects each axis at the point corresponding to the item's value for that attribute (a plotting sketch follows)
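
A minimal plotting sketch using pandas' built-in helper; the data reuses the Attr1-Attr4 sample from the normalization slide, and the group column is a hypothetical label used only for coloring. Note that pandas plots raw values on one shared scale rather than rescaling each axis to [min, max] as described above, so attributes are often normalized first:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Each row is a data object, each column an attribute.
df = pd.DataFrame({
    "Attr1": [12, 22, 26, 212],
    "Attr2": [25, 24, 25, 23],
    "Attr3": [33, 32, 35, 36],
    "Attr4": [34, 36, 31, 36],
    "group": ["a", "a", "b", "b"],  # hypothetical labels for coloring
})
parallel_coordinates(df, "group")
plt.show()
```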

Too many data points?

Query Dependent Coloring

Icon-based Techniques
Basic idea: visualization of the data values as features of icons

Examples:
  Chernoff Faces [Che73, Tuf83]
  Stick Figures [Pic70, PG88]
  Shape Coding [Bed90]
  Color Icons [Lev91, KK94]
  TileBars [Hea95]: use of small icons representing the relevance feature vectors in document retrieval

Chernoff-faces

Chernoff-faces

Stick Figures
Visualization of multidimensional data using stick figure icons
Two attributes of the data are mapped to the display axes, and the remaining attributes are mapped to the angle and/or length of the limbs
Texture patterns in the visualization show certain data characteristics

Stick Figures

Shape Coding

Shape Coding

Color Icons

Color Icon Example

Pixel-Oriented Techniques
Basic Idea
  Each attribute value is represented by one colored pixel
  The value ranges of the attributes are mapped to a fixed colormap
  The attribute values for each attribute are presented in separate subwindows

Pixel-Oriented Techniques

Pixel-Oriented Techniques
Question: How to arrange the pixels on the screen?

Query Dependent
  Visualizes data in the context of a specific user query, giving users feedback on their queries and directing their search
  Instead of directly mapping attribute values to colors, distances of attribute values to the query are mapped to colors

Pixel-Oriented Techniques: All Attributes
