Many data mining and analytics tasks involve comparing objects and
determining their similarities (or dissimilarities):
Clustering
Nearest-neighbor search, classification, and prediction
Correlation analysis
An Employee DB

ID  Gender  Age  Salary
1   F       27    19,000
2   M       51    64,000
3   M       52   100,000
4   F       33    55,000
5   M       45    45,000

      T1  T2  T3  T4  T5  T6
Doc1   0   4   0   0   0   2
Doc2   3   1   4   3   1   2
Doc3   3   0   0   0   3   0
Doc4   0   1   0   3   0   0
Doc5   2   2   2   3   1   4
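
Data matrix (n objects, p attributes):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]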
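Dissimilarity matrix:
\[
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
\]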
Nominal attributes (m: number of matches, p: total number of attributes):
\[ d(i, j) = \frac{p - m}{p} \]
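Contingency table for binary attributes:

                 Object j
                 1      0      sum
Object i   1     a      b      a+b
           0     c      d      c+d
           sum   a+c    b+d    p

Symmetric binary attributes:
\[ d(i, j) = \frac{b + c}{a + b + c + d} \]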
\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
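Asymmetric binary attributes (a, b, c, d counted as in the contingency table; negative matches d are ignored):
\[ d(i, j) = \frac{b + c}{a + b + c} \]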
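
The two binary dissimilarity formulas can be checked with a short sketch. The 0/1 vectors for jack, mary, and jim below are hypothetical: the slides give only the resulting counts, so the vectors were chosen to reproduce those counts (for example a = 2, b = 0, c = 1 for jack and mary).

```python
def binary_counts(x, y):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors."""
    pairs = list(zip(x, y))
    a = sum(1 for xi, yi in pairs if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in pairs if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in pairs if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in pairs if xi == 0 and yi == 0)
    return a, b, c, d


def symmetric_dissim(x, y):
    """(b + c) / (a + b + c + d): every attribute counts."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def asymmetric_dissim(x, y):
    """(b + c) / (a + b + c): 0/0 matches are ignored.

    Raises ZeroDivisionError if the two vectors share no 1 at all.
    """
    a, b, c, _ = binary_counts(x, y)
    return (b + c) / (a + b + c)


# Hypothetical attribute vectors (assumed, not taken from the slides).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_dissim(jack, mary), 2))  # 0.33
print(round(asymmetric_dissim(jack, jim), 2))   # 0.67
print(round(asymmetric_dissim(jim, mary), 2))   # 0.75
```

The three printed values agree with the worked example, which suggests the example uses the asymmetric form.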
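Euclidean distance:
\[ dist(X, Y) = \sqrt{\sum_i (x_i - y_i)^2} \]
Distance derived from a similarity measure:
\[ dist(X, Y) = 1 - sim(X, Y) \]
\[ sim(X, Y) = \frac{\sum_i (x_i \times y_i)}{\sqrt{\sum_i x_i^2 \times \sum_i y_i^2}} \]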
Data Normalization
min-max normalization
\[ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A \]
Example: Suppose that the minimum and maximum values for the attribute
income are 12,000 and 98,000. By min-max normalization to the range
[0.0, 1.0], a value of 73,600 for income is transformed to
((73,600 - 12,000) / (98,000 - 12,000)) (1.0 - 0) + 0 = 0.716
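
As a quick check of the income example, here is a minimal sketch of min-max normalization. The function name and the default target range [0.0, 1.0] are illustrative choices, not part of the slides.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example: min 12,000, max 98,000, value 73,600.
print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
```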
Data Normalization
z-score normalization
This method of normalization is useful when the
actual minimum and maximum of an attribute
are unknown, or when outliers dominate the min-max
normalization.
\[ v' = \frac{v - mean_A}{stand\_dev_A} \]
Decimal scaling normalization
\[ v' = \frac{v}{10^j} \]
where j is the smallest integer such that max(|v'|) < 1
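
The sketch below applies z-score and decimal scaling normalization to the Salary column of the Employee DB. Two assumptions are made here: the population standard deviation is used (the slides do not say which variant is meant), and the exponent j for decimal scaling is taken to be the smallest integer that brings every scaled value below 1 in absolute value.

```python
from statistics import mean, pstdev


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population, not sample, deviation
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


salaries = [19_000, 64_000, 100_000, 55_000, 45_000]

scaled, j = decimal_scaling(salaries)
print(j)       # 6
print(scaled)
print([round(z, 2) for z in z_score(salaries)])
```

Because the largest salary is exactly 100,000, dividing by 10^5 would give 1.0, which is not below 1, so j comes out as 6.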
When the data is very sparse and the 0s in the vectors are not
significant, vector-based similarity measures are typically used
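\[ X = \langle x_1, x_2, \ldots, x_n \rangle \]
\[ Y = \langle y_1, y_2, \ldots, y_n \rangle \]
Most common measure: Cosine similarity
Dot product of two vectors:
\[ sim(X, Y) = X \cdot Y = \sum_i x_i \times y_i \]
Norm of a vector:
\[ \|X\| = \sqrt{\sum_i x_i^2} \]
Cosine similarity (normalized dot product):
\[ sim(X, Y) = \frac{X \cdot Y}{\|X\| \times \|Y\|} = \frac{\sum_i (x_i \times y_i)}{\sqrt{\sum_i x_i^2} \times \sqrt{\sum_i y_i^2}} \]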
The idf for the terms in the three documents is given below:

Example: document-term matrix

      T1  T2  T3  T4  T5  T6  T7  T8
Doc1   0   4   0   0   0   2   1   3
Doc2   3   1   4   3   1   2   0   1
Doc3   3   0   0   0   3   0   3   0
Doc4   0   1   0   3   0   0   2   0
Doc5   2   2   2   3   1   4   0   2

Dot-Product(Doc2,Doc4) = <3,1,4,3,1,2,0,1> * <0,1,0,3,0,0,2,0>
                       = 0 + 1 + 0 + 9 + 0 + 0 + 0 + 0 = 10
Norm(Doc2) = SQRT(9+1+16+9+1+4+0+1) = 6.4
Norm(Doc4) = SQRT(0+1+0+9+0+0+4+0) = 3.74
Cosine(Doc2,Doc4) = 10 / (6.4 * 3.74) = 0.42
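
The Doc2/Doc4 calculation can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper names are illustrative.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def norm(x):
    """Euclidean norm (length) of a vector."""
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(x, y) / (norm(x) * norm(y))


doc2 = [3, 1, 4, 3, 1, 2, 0, 1]
doc4 = [0, 1, 0, 3, 0, 0, 2, 0]

print(dot(doc2, doc4))               # 10
print(round(norm(doc2), 2))          # 6.4
print(round(norm(doc4), 2))          # 3.74
print(round(cosine(doc2, doc4), 2))  # 0.42
```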

User1 Ratings:
Dell (8/10), Apple (5/10), Samsung (9/10),
Acer (7/10), HP (4/10), Sony (3/10)
User2 Ratings:
Dell (1.7/10), Apple (1/10), Samsung (2.0/10),
Acer (1.5/10), HP (0.5/10), Sony (0.3/10)
User3 Ratings:
Dell (8/10), Apple (5/10), Samsung (9/10),
Acer (4/10), HP (6/10), Sony (7/10)
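
The slides do not say which measure these ratings are meant to illustrate. As one possible reading, the sketch below compares the three users with the two measures introduced earlier, Euclidean distance and cosine similarity. The vector order (Dell, Apple, Samsung, Acer, HP, Sony) and the helper names are choices made for this example.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def cosine(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)


# Ratings in the order Dell, Apple, Samsung, Acer, HP, Sony.
user1 = [8, 5, 9, 7, 4, 3]
user2 = [1.7, 1, 2.0, 1.5, 0.5, 0.3]
user3 = [8, 5, 9, 4, 6, 7]

for name, other in (("User2", user2), ("User3", user3)):
    print(name,
          "euclidean:", round(euclidean(user1, other), 2),
          "cosine:", round(cosine(user1, other), 3))
```

By Euclidean distance User3 is closer to User1, because User2 rates everything on a much lower scale. By cosine similarity User2 is closer, because User2's ratings follow nearly the same pattern as User1's. Which answer is more useful depends on whether the rating scale itself is considered meaningful.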
Example:
Series1 = {2.0, 3.0, 7.0, 7.0, 8.0, 8.0, 6.0, 2.0, 5.0, 2.0, 4.0, 5.0, 5.0}
Series2 = {4.0, 7.0, 7.0, 8.0, 2.0, 2.0, 6.0, 5.0, 3.0, 4.0, 5.0, 5.0}
Example 1

       s1     s2     s3     s4     s5     s6     s7     s8     s9
q1    3.76   8.07   1.64   1.08   2.86   0.00   0.06   1.88   1.25
q2    2.02   5.38   0.58   2.43   4.88   0.31   0.59   3.57   2.69
q3    6.35  11.70   3.46   0.21   1.23   0.29   0.11   0.62   0.29
q4   16.8   25.10  11.90   1.28   0.23   4.54   3.69   0.64   1.10
q5    3.20   7.24   1.28   1.42   3.39   0.04   0.16   2.31   1.61
q6    3.39   7.51   1.39   1.30   3.20   0.02   0.12   2.16   1.49
q7    4.75   9.49   2.31   0.64   2.10   0.04   0.00   1.28   0.77
q8    0.96   3.53   0.10   4.00   7.02   1.00   1.46   5.43   4.33
q9    0.02   1.08   0.27   8.07  12.18   3.39   4.20  10.05   8.53
Example 1

       s1     s2     s3     s4     s5     s6     s7     s8     s9
q1    3.76  11.83  13.47  14.55  17.41  17.41  17.47  19.35  20.60
q2    5.78   9.14   9.72  12.15  17.03  17.34  17.93  21.04  22.04
q3   12.13  17.48  12.60   9.93  11.16  11.45  11.56  12.18  12.47
q4   29.02  37.23  24.50  11.21  10.16  14.70  15.14  12.20  13.28
q5   32.22  36.26  25.78  12.63  13.55  10.20  10.36  12.67  13.81
q6   35.61  39.73  27.17  13.93  15.83  10.22  10.32  12.48  13.97
q7   40.36  45.10  29.48  14.57  16.03  10.26  10.22  11.50  12.27
q8   41.32  43.89  29.58  18.57  21.59  11.26  11.68  15.65  15.83
q9   41.34  42.40  29.85  26.64  30.75  14.65  15.46  21.73  24.18
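
The second Example 1 table appears to be a cumulative version of the first: each cell adds its own distance to the smallest of its left, upper, and upper-left neighbors. That is the usual dynamic time warping recurrence, although the slides do not name the method, so treat this reading as an assumption. The sketch below rebuilds the cumulative table from the pairwise distances as printed.

```python
def accumulate(cost):
    """Cumulative cost: cell value plus the cheapest of its three predecessors."""
    n, m = len(cost), len(cost[0])
    acc = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0
            elif i == 0:
                best = acc[0][j - 1]      # first row: only a left neighbor
            elif j == 0:
                best = acc[i - 1][0]      # first column: only an upper neighbor
            else:
                best = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = cost[i][j] + best
    return acc


# Pairwise distances d(q_i, s_j), rows q1..q9, columns s1..s9, as printed.
local = [
    [3.76, 8.07, 1.64, 1.08, 2.86, 0.00, 0.06, 1.88, 1.25],
    [2.02, 5.38, 0.58, 2.43, 4.88, 0.31, 0.59, 3.57, 2.69],
    [6.35, 11.70, 3.46, 0.21, 1.23, 0.29, 0.11, 0.62, 0.29],
    [16.8, 25.10, 11.90, 1.28, 0.23, 4.54, 3.69, 0.64, 1.10],
    [3.20, 7.24, 1.28, 1.42, 3.39, 0.04, 0.16, 2.31, 1.61],
    [3.39, 7.51, 1.39, 1.30, 3.20, 0.02, 0.12, 2.16, 1.49],
    [4.75, 9.49, 2.31, 0.64, 2.10, 0.04, 0.00, 1.28, 0.77],
    [0.96, 3.53, 0.10, 4.00, 7.02, 1.00, 1.46, 5.43, 4.33],
    [0.02, 1.08, 0.27, 8.07, 12.18, 3.39, 4.20, 10.05, 8.53],
]

acc = accumulate(local)
print(round(acc[0][1], 2))    # 11.83  (q1, s2)
print(round(acc[1][1], 2))    # 9.14   (q2, s2)
print(round(acc[-1][-1], 2))  # 24.18  (q9, s9)
```

With the distances as printed, a few cells in the first two columns from q4 downward come out 0.09 lower than in the table (for example 28.93 instead of 29.02 at q4, s1). That would be consistent with d(q4, s1) having been 16.89 rather than 16.8. The final value at (q9, s9) does not depend on those cells.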
Example 2 (Series1 and Series2 as given above)
Visualizing Data
Visualization of data with inherent 2D
semantics was done even before the advent of
computers
With computers, novel techniques were developed
and existing techniques extended
Objective: Go beyond the 2D page to draw the
viewer's mind into multi-dimensional (MD)
spaces
Goals of Visualization
Explorative Analysis
starting point: data without hypotheses about the data
process: interactive, usually undirected search for
structures, trends, etc.
result: visualization of the data, which provides hypotheses
about the data
Hypothesis Analysis
starting point: hypotheses about the data
process: goal-oriented examination of the hypotheses
result: visualization of the data, which allows the
confirmation or rejection of the hypotheses
Presentation
starting point: facts to be presented are fixed a priori
process: choice of an appropriate presentation technique
result: high-quality visualization of the data presenting the
facts
Data Visualization
Parallel Coordinates
Icon-based techniques
Graph-based techniques
Pixel-oriented techniques
Parallel Coordinates
N equidistant axes which are parallel to one of the
screen axes and correspond to the attributes
The axes are scaled to the [minimum, maximum] range of the corresponding attribute
Every data item corresponds to a polygonal line
which intersects each of the axes at the point
which corresponds to the value for the attribute
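
The mapping from data items to polygonal lines can be sketched without any plotting library. In this illustration the axes sit at x = 0, 1, 2, ... and each axis runs from 0.0 (attribute minimum) to 1.0 (attribute maximum). The function name and the use of the Age and Salary columns from the Employee DB are choices made for the example.

```python
def polylines(rows):
    """Return, for each data item, its vertices (axis index, scaled height)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    lines = []
    for row in rows:
        line = []
        for k, value in enumerate(row):
            span = highs[k] - lows[k]
            # Scale to the [minimum, maximum] range of attribute k;
            # a constant attribute is drawn at mid-height.
            height = (value - lows[k]) / span if span else 0.5
            line.append((k, height))
        lines.append(line)
    return lines


# (Age, Salary) for the five employees.
rows = [(27, 19_000), (51, 64_000), (52, 100_000), (33, 55_000), (45, 45_000)]

lines = polylines(rows)
for line in lines:
    print(line)
```

Each printed list is one polygonal line; connecting its points in order with straight segments gives the parallel-coordinates drawing of that employee.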
Icon-based Techniques
Basic Idea: Visualization of the data values
as features of icons
Examples
Chernoff-faces
Stick Figures
Visualization of the multidimensional data using
stick figure icons
Two attributes of the data are mapped to the
display axes and the remaining attributes are
mapped to the angle and/or length of the limbs
Texture patterns in the visualization show certain
data characteristics
Shape Coding
Color Icons
Pixel-Oriented Techniques
Basic Idea
each attribute value is represented by one colored pixel
the value ranges of the attributes are mapped to a fixed
colormap
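
A minimal sketch of the basic idea: each attribute value is scaled to its value range and used to pick one entry from a fixed colormap. The five-color RGB map below is hypothetical; the slides do not specify a colormap.

```python
# Hypothetical fixed colormap, from low values (blue) to high values (red).
COLORMAP = [(0, 0, 255), (0, 255, 255), (0, 255, 0), (255, 255, 0), (255, 0, 0)]


def value_to_color(value, low, high, colormap=COLORMAP):
    """Pick the colormap entry for a value within the range [low, high]."""
    if high == low:
        return colormap[0]
    t = (value - low) / (high - low)
    index = min(int(t * len(colormap)), len(colormap) - 1)
    return colormap[index]


# One pixel per attribute value, here the Age column of the Employee DB.
ages = [27, 51, 52, 33, 45]
pixels = [value_to_color(a, min(ages), max(ages)) for a in ages]
print(pixels)
```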
Question: How should the pixels be arranged on the
screen?
Query Dependent
Visualizes data in the context of a specific user
query, giving users feedback on their queries and
directing their search
Instead of directly mapping attribute values to
colors, distances of attribute values to the query
are mapped to colors
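
For the query-dependent variant, the color is chosen from the distance between each attribute value and the query value rather than from the value itself. The sketch below assumes a single numeric attribute, absolute difference as the distance, and a hypothetical four-step colormap running from "yellow" (closest) to "black" (farthest).

```python
# Hypothetical colormap: close to the query -> yellow, far away -> black.
COLORMAP = ["yellow", "green", "blue", "black"]


def distance_colors(values, query, colormap=COLORMAP):
    """Color each value by its distance to the query value."""
    distances = [abs(v - query) for v in values]
    max_distance = max(distances) or 1  # avoid dividing by zero
    colors = []
    for d in distances:
        index = min(int(d / max_distance * len(colormap)), len(colormap) - 1)
        colors.append(colormap[index])
    return colors


# Query "Age = 45" against the Age column of the Employee DB.
ages = [27, 51, 52, 33, 45]
colors = distance_colors(ages, query=45)
print(colors)  # ['black', 'green', 'green', 'blue', 'yellow']
```

The exact match (age 45) gets the "closest" color, and the other items shade away from it as their distance to the query grows.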