
Data Attributes Needed for Data Mining


Objective

- Recognize attributes of data needed for data mining
Types of Data

- Data that consists of a collection of records, each of which consists of a fixed set of attributes

     Categorical  Categorical     Continuous      Class
Tid  Refund       Marital Status  Taxable Income  Cheat
1    Yes          Single          125K            No
2    No           Married         100K            No
3    No           Single          70K             No
4    Yes          Married         120K            No
5    No           Divorced        95K             Yes
6    No           Married         60K             No
7    Yes          Divorced        220K            No
8    No           Single          85K             Yes
9    No           Married         75K             No
10   No           Single          90K             Yes
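
As a rough illustration of record data, the table above can be held as a list of records that all share one fixed set of attributes. This is only a sketch: the `Record` class and its field names are assumptions made for this example, and incomes are stored as plain integers in thousands.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Record:
    """One row of the table; the class and field names are illustrative."""
    tid: int
    refund: str           # categorical
    marital_status: str   # categorical
    taxable_income: int   # continuous, in thousands (125 means 125K)
    cheat: str            # class label


records = [
    Record(1, "Yes", "Single", 125, "No"),
    Record(2, "No", "Married", 100, "No"),
    Record(3, "No", "Single", 70, "No"),
    Record(4, "Yes", "Married", 120, "No"),
    Record(5, "No", "Divorced", 95, "Yes"),
    Record(6, "No", "Married", 60, "No"),
    Record(7, "Yes", "Divorced", 220, "No"),
    Record(8, "No", "Single", 85, "Yes"),
    Record(9, "No", "Married", 75, "No"),
    Record(10, "No", "Single", 90, "Yes"),
]

# Every record exposes the same fixed set of attributes.
attribute_names = [f.name for f in fields(Record)]

# Example query: the Tids of records whose class label is "Yes".
cheaters = [r.tid for r in records if r.cheat == "Yes"]
```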
Data Matrix

- If data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space
  - each dimension represents a distinct attribute
- Data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Data Matrix

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
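
A minimal sketch of the data matrix idea, using the two rows shown above: each row is one object and each column one attribute. The column names are shortened labels invented for this example, and the Euclidean distance is just one possible way to illustrate treating the rows as points in a multi-dimensional space.

```python
import math

# Column labels are illustrative abbreviations of the table headers.
columns = ["proj_x_load", "proj_y_load", "distance", "load", "thickness"]

# m by n matrix: one row per data object, one column per attribute.
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(matrix)       # number of objects (rows)
n = len(matrix[0])    # number of attributes (columns)

# Each row is a point in n-dimensional space, so a distance between
# two objects is well defined (Euclidean distance chosen here).
dist = math.dist(matrix[0], matrix[1])
```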
Document Data

            Team  Coach  Play  Ball  Score  Game  Win  Lost  Timeout  Season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
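
One way such a row could be produced is by counting how often each vocabulary term occurs in a document. The sketch below assumes simple whitespace tokenization, and the sample sentence is hypothetical, not one of the documents in the table.

```python
from collections import Counter

# Vocabulary taken from the column headers of the table above.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocab=vocabulary):
    """Return the count of each vocabulary term in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    return [counts[term] for term in vocab]


# Hypothetical document text, used only for illustration.
doc = "The coach called a timeout and the team went on to win the game"
vector = term_vector(doc)
```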
Transaction Data

- A special type of record data, where
  - each record (transaction) involves a set of items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diapers, Milk
4    Beer, Bread, Diapers, Milk
5    Coke, Diapers, Milk
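
Since each transaction is a set of items, one natural sketch stores the table above as a mapping from TID to a Python set. The variable names and the two example queries are illustrative choices, not part of the original slides.

```python
from collections import Counter

# TID -> set of items, copied from the table above.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diapers", "Milk"},
    4: {"Beer", "Bread", "Diapers", "Milk"},
    5: {"Coke", "Diapers", "Milk"},
}

# How many transactions contain each item.
item_counts = Counter(
    item for items in transactions.values() for item in items
)

# TIDs of transactions that contain both Beer and Diapers.
beer_and_diapers = sorted(
    tid for tid, items in transactions.items()
    if {"Beer", "Diapers"} <= items
)
```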
Graph Data

- Generic graph and HTML Links

[Figure: generic graph]
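
Graph data such as HTML links can be sketched as an adjacency list, where each node maps to the nodes it links to. The page names below are hypothetical and do not correspond to the figure on the slide.

```python
# Hypothetical pages: each key links to the pages in its list.
links = {
    "index.html": ["about.html", "papers.html"],
    "about.html": ["index.html"],
    "papers.html": ["index.html", "about.html"],
}


def in_degree(graph):
    """Count how many links point at each node."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


incoming = in_degree(links)
```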
Chemical Data

- Benzene Molecule: C6H6
Ordered Data

- Sequences of transactions
Ordered Data

- Spatio-Temporal Data

[Figure: Average Monthly Temperature of land and ocean, January]

Data Quality

- What kinds of data quality problems?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems:
  - Noise and outliers
  - Missing values
  - Duplicate data
Noise

- Noise refers to modification of original values
Outliers

- Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
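
The slide does not prescribe a detection method. As one common heuristic, the sketch below flags values that fall more than 1.5 interquartile ranges outside the quartiles. The data are the Taxable Income values (in thousands) from the Types of Data table; the function name and the factor `k` are assumptions of this example.

```python
import statistics

# Taxable Income values (in thousands) from the Types of Data table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(incomes)
```

Other rules (for example, a z-score threshold) could flag a different set of values; the choice depends on the data and the analysis.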
Missing Values

- Reasons for missing values
  - Information is not collected
  - Attributes may not be applicable to all cases
- Handling missing values
  - Eliminate Data Objects
  - Estimate Missing Values
  - Ignore the Missing Value During Analysis
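
The three handling options above might look as follows on a small table. The rows are hypothetical, `None` marks a missing value, and filling with the column mean is only one of many possible estimates.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"income": 125, "age": 30},
    {"income": None, "age": 45},
    {"income": 70, "age": None},
    {"income": 120, "age": 52},
]


def column_mean(data, key):
    """Ignore missing values: average only the values that are present."""
    return mean(r[key] for r in data if r[key] is not None)


# 1. Eliminate data objects that have any missing value.
complete = [r for r in rows if None not in r.values()]

# 2. Estimate missing values (here: replace with the column mean).
imputed = [
    {k: (column_mean(rows, k) if v is None else v) for k, v in r.items()}
    for r in rows
]

# 3. Ignore the missing value during analysis: rows stay as they are,
#    and each computation simply skips the gaps.
income_mean = column_mean(rows, "income")
```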
Duplicate Data

- Data set may include data objects that are duplicates, or almost duplicates, of one another
  - Major issue when merging data from heterogeneous sources
- Examples:
  - Same person with multiple email addresses
- Data cleaning
  - Process of dealing with duplicate data issues
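
A minimal sketch of the "same person with multiple email addresses" case: records are grouped under a normalized name so that near-duplicates collapse into one entry. The people and addresses are made up, and matching on a lowercased, whitespace-collapsed name is a deliberately naive assumption; real data cleaning usually needs stronger matching rules.

```python
from collections import defaultdict

# Hypothetical records merged from two sources.
people = [
    {"name": "Ana Silva", "email": "ana@example.com"},
    {"name": "ana  silva", "email": "a.silva@example.org"},
    {"name": "Bo Chen", "email": "bo@example.com"},
]


def normalize(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def merge_duplicates(records):
    """Group email addresses under one normalized name per person."""
    grouped = defaultdict(set)
    for record in records:
        grouped[normalize(record["name"])].add(record["email"])
    return {name: sorted(emails) for name, emails in grouped.items()}


merged = merge_duplicates(people)
```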
