Informatics: Lecture 6 - Processing Information

Introduction
We have no shortage of data about almost
anything of interest
A well designed database can make that data
easy to access
SQL can be used to make simple interrogations of
the data
A huge amount of useful information lies
hidden, however: hence the need for data mining
Introduction
So in this lecture we will look at the elements
of data mining
We will begin however by looking at simple
ways in which our original data may be
processed so that the more complex stages
later on are not compromised
Processing data
Regardless of the source of the data we can
encounter a number of issues:
Errors - some data is wrong due to a fault or a
simple transcription error
Outliers - some data is very different to the
rest; can be significant if true
Calibration - the data may need to be
converted to a physical quantity to check
Processing data
Test artifact - it is sometimes possible to
include an object in the data collection whose
properties are well known; we can then
check what has been recorded
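As a rough illustration of spotting outliers, the sketch below flags readings that sit far from the median, using the modified z-score (based on the median absolute deviation). The function name, the sample readings and the 3.5 cut-off are assumptions chosen for illustration, not part of the lecture. A flagged value still needs checking by hand, since an outlier can be significant if true.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The 0.6745 factor scales the median absolute deviation (MAD) so that
    the score is comparable to a standard z-score for normal data.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical: no usable spread.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0]
print(flag_outliers(readings))  # [35.0]
```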
Processing data
With data that begins as analogue, especially
audio and video, there are a number of
processing methods that can be used to prepare
the data for later stages:
Stretch - if the data can range from 0-100 but
we only record 0-20, we can stretch the data
to use the whole range
Equalise - we can modify a range of 20-60 to
use 0-100
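Both examples above amount to a linear rescaling from one range to another. A minimal sketch, assuming the recorded range is known in advance (the function name and its parameters are illustrative, not from the lecture):

```python
def stretch(values, src_min, src_max, dst_min=0.0, dst_max=100.0):
    """Linearly map values from [src_min, src_max] onto [dst_min, dst_max]."""
    if src_max == src_min:
        raise ValueError("source range must have non-zero width")
    scale = (dst_max - dst_min) / (src_max - src_min)
    return [dst_min + (v - src_min) * scale for v in values]


# Data recorded in 0-20, stretched to use the whole 0-100 range.
print(stretch([0, 5, 10, 20], 0, 20))   # [0.0, 25.0, 50.0, 100.0]

# Data recorded in 20-60, remapped to 0-100.
print(stretch([20, 40, 60], 20, 60))    # [0.0, 50.0, 100.0]
```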
Processing data
Filtering
Low pass filter - removes hiss and noise
High pass filter - removes rumble and hum
Band pass - selective filtering
Averaging - to smooth noisy data and prevent
data spikes
Enhancements - a huge range in images for
deblur, distortion and feature extraction
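Averaging can be sketched as a moving average, which also acts as a very simple low pass filter. The window size and the sample data below are assumptions for illustration only.

```python
def moving_average(values, window=3):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# A single spike of 40 in otherwise steady data is spread out and reduced.
print(moving_average([10, 10, 40, 10, 10], window=3))  # [20.0, 20.0, 20.0]
```

A wider window smooths more strongly but also blurs genuine short-lived changes, so the choice depends on the data.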
Examples
What is data mining?

The non-trivial extraction of implicit,
previously unknown and potentially useful
knowledge from data
KDD - a process of Knowledge Discovery in
Databases
Associated areas are Statistics, SQL, Machine
Learning, AI and Expert Systems
Knowledge is power
Remember the hierarchy that we aspire to work
through:

Data - facts and figures; accuracy important
Information - organised data for analysis
Knowledge - interpretation to inform action
Application areas
Insurance - claim analysis and risk
Medical - diagnosis and preventative medicine
Banking - identifying fraud
Marketing - new customers and sales
Science - human genome project
Security - identify behaviours
Business intelligence - trends and threats
Scope of data mining

Data mining can try to use data in a variety of
ways using sophisticated mathematical
techniques:
Classification
Estimation
Clustering
Association
Classification
Use data to predict the category of an object,
e.g. someone to lend money to, or perhaps
arrest, or perhaps someone who will make a
certain kind of purchase etc.
The result of a classification problem can be a
decision tree which shows how a new object
can be classified on the basis of the existing
data
Classification
Data
age   cartype     risk
23    saloon      low
30    sports      low
36    saloon      low
25    hatchback   high
30    saloon      low
23    hatchback   high
30    hatchback   low
25    sports      high
18    saloon      low
Decision tree
Age > 25: low risk
Age <= 25: Car Type
    saloon: low risk
    sports, hatchback: high risk
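The decision tree above can be written directly as nested tests. This is only a sketch of how the finished tree classifies a new object, not of how the tree is learned from the data; the function name is an assumption.

```python
def classify_risk(age, car_type):
    """Classify a driver using the decision tree from the slide."""
    if age > 25:
        return "low"
    # Age <= 25: the outcome depends on the car type.
    if car_type == "saloon":
        return "low"
    return "high"  # sports or hatchback


# The nine records from the slide's data table.
DATA = [
    (23, "saloon", "low"), (30, "sports", "low"), (36, "saloon", "low"),
    (25, "hatchback", "high"), (30, "saloon", "low"), (23, "hatchback", "high"),
    (30, "hatchback", "low"), (25, "sports", "high"), (18, "saloon", "low"),
]

# The tree reproduces the recorded risk for every row.
print(all(classify_risk(age, car) == risk for age, car, risk in DATA))  # True
```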
Estimation
Similar to classification in that a model is
created
The model allows the output of a continuous
variable to be predicted
The model could be a mathematical function
to predict a value or could be a theorem
which then also predicts a value or perhaps
even a behaviour.
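One of the simplest mathematical functions used for estimation is a straight line fitted by least squares. The sketch below is an assumption about what such a model might look like (the lecture does not name a specific method), and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Estimate the continuous output for a new input x."""
    return slope * x + intercept


# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(predict(10, slope, intercept))  # 21.0
```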
Clustering
Can we analyse the data for a set of objects
and identify sub-groups and their membership?
We may know the sub-groups and some
existing members and want to know what
data helps identify which cluster a new object
will belong to.
Clustering
Reproduced from Adriaans and Zantinge
Clustering - K-means example
The general idea of a clustering technique is to divide
the population into partitions
Starts with an initial random selection of K partitions
Then points are moved into each partition using a
centroid calculation and a similarity measure in an
iterative process until the final set of clusters stabilises
The final set is then evaluated
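The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: squared Euclidean distance as the similarity measure, K randomly chosen points as the initial centroids, and a fixed random seed for repeatability. The function names and sample points are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial random selection
    assignment = None
    for _ in range(max_iter):
        # Move each point to the partition with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters have stabilised
        assignment = new_assignment
        # Recalculate each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, assignment = k_means(points, k=2)
print(assignment)
```

Evaluating the final set (for example by the total distance of points to their centroids) is left out of this sketch.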
Association
Seeking co-occurrences of groups of data
items in a data set
Association can be in time, i.e. a sequential
pattern
Can be very popular with retailers to target
advertising for related purchases and for store
layouts
Association rules
Rules are of the form X => Y
where X and Y are distinct (non-overlapping) sets of items
Importance of a rule described by its
support and its confidence
Support: % of transactions containing X
and Y
Confidence: % of transactions with X that
also contain Y
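Written as formulas, these two definitions might look as follows. The symbol T for the set of all transactions is an assumption introduced here for illustration.

```latex
\mathrm{support}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}{|T|}
\qquad
\mathrm{confidence}(X \Rightarrow Y)
  = \frac{\left|\{\, t \in T : X \cup Y \subseteq t \,\}\right|}
         {\left|\{\, t \in T : X \subseteq t \,\}\right|}
```

Support is symmetric in X and Y, whereas confidence generally is not, because the denominator depends on which side of the rule comes first.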
Association rules
Figure: Venn diagram of all transactions, showing the transactions with X, the transactions with Y, and their overlap (transactions with X and Y)
Support of X=>Y = Support of Y=>X = 3/10 = 30%
Confidence of X=>Y = 3/4 = 75%
Confidence of Y=>X = 3/5 = 60%
Association rules example

Transaction   Items bought
1             milk, eggs, tea
2             butter, milk, sugar, tea
3             biscuits, sugar, eggs
4             tea, coffee, eggs
5             coffee, chocolate, sugar

Rule                      Support   Confidence
Milk => Eggs              20%       50%
Eggs => Tea               40%       66.7%
sugar => {butter, milk}   20%       33.3%
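The figures in the example can be checked with a short sketch that applies the definitions of support and confidence to the five transactions. The function names are assumptions for illustration.

```python
TRANSACTIONS = [
    {"milk", "eggs", "tea"},
    {"butter", "milk", "sugar", "tea"},
    {"biscuits", "sugar", "eggs"},
    {"tea", "coffee", "eggs"},
    {"coffee", "chocolate", "sugar"},
]


def support(x, y, transactions):
    """Fraction of all transactions containing every item of X and Y."""
    both = x | y
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    both = x | y
    return sum(1 for t in with_x if both <= t) / len(with_x)


RULES = [
    ({"milk"}, {"eggs"}),
    ({"eggs"}, {"tea"}),
    ({"sugar"}, {"butter", "milk"}),
]

for x, y in RULES:
    s = support(x, y, TRANSACTIONS)
    c = confidence(x, y, TRANSACTIONS)
    print(f"{sorted(x)} => {sorted(y)}: support {s:.1%}, confidence {c:.1%}")
```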
Association - issues
Number of rules grows exponentially with number
of items
User to specify
Minimum Support (e.g. 10%) and
Minimum Confidence (e.g. 70%) levels
Which rules are interesting - define interesting
Negative rules can also be interesting
e.g. buying crisps => do not buy cream (in 70% of cases)
but absence implies millions of useless rules!
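To get a feel for the exponential growth, the sketch below counts every possible rule X => Y over n items, where X and Y are non-empty and share no items. A counting argument (each item goes in X, in Y, or in neither, minus the cases where X or Y is empty) gives 3^n - 2^(n+1) + 1 rules. The function name is an assumption, and the brute-force count is only practical for small n.

```python
from itertools import combinations


def count_rules(n_items):
    """Count rules X => Y with X, Y non-empty and disjoint, by enumeration."""
    items = range(n_items)
    count = 0
    for size_x in range(1, n_items):
        for x in combinations(items, size_x):
            remaining = [i for i in items if i not in x]
            # Y can be any non-empty subset of the items not used in X.
            for size_y in range(1, len(remaining) + 1):
                count += sum(1 for _ in combinations(remaining, size_y))
    return count


for n in (2, 3, 5, 8):
    print(n, count_rules(n), 3 ** n - 2 ** (n + 1) + 1)
```

This is why minimum support and minimum confidence levels are needed in practice: they prune most candidate rules before they are ever reported.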
Hierarchies
Items are grouped
e.g. pen, pencil are writing tools
Can have different rules for groups than for
individual items
e.g., strong positive association between
crisps and biscuits, but negative
associations lower in hierarchy
Use hierarchies to define interesting
e.g. rules across groups can be more
interesting than rules within groups
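A hierarchy can be sketched as a mapping from each item to its group, so that support can be measured at either level. Only the pen and pencil grouping comes from the slide; the transactions and function names below are hypothetical.

```python
# Item -> group mapping; items without an entry stay as they are.
HIERARCHY = {"pen": "writing tools", "pencil": "writing tools"}

# Hypothetical transactions, for illustration only.
TRANSACTIONS = [
    {"pen", "paper"},
    {"pencil", "paper"},
    {"pen"},
    {"paper"},
]


def generalise(transaction, hierarchy):
    """Replace each item by its group, where the hierarchy defines one."""
    return {hierarchy.get(item, item) for item in transaction}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


grouped = [generalise(t, HIERARCHY) for t in TRANSACTIONS]

print(support({"pen", "paper"}, TRANSACTIONS))       # 0.25
print(support({"pencil", "paper"}, TRANSACTIONS))    # 0.25
print(support({"writing tools", "paper"}, grouped))  # 0.5
```

In this made-up data the group-level pattern has twice the support of either item-level pattern, which illustrates how rules can differ between levels of the hierarchy.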
Hierarchies
Figure: hierarchy diagram with nodes Crisps, Biscuits, C and X, with associations between them marked +ve or -ve
Process
Input data from repository
Data pre-processing (cleansing, quality)
Mining patterns
Data post-processing
Output patterns
Redrawn from Du, p14
Pre-processing
We need to understand the
data that we are using - type
and quality
This will inform the mining
technique to be used
Data visualisation can also
inform the mining process
Target
Precise, inaccurate, biased
Precise, accurate, unbiased
Imprecise, inaccurate, biased
Imprecise, accurate, unbiased
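One way to put numbers on the target picture, assuming the true value is known (for example from a test artifact), is to treat bias as the mean error and precision as the spread of repeated measurements. The function name and the readings below are hypothetical.

```python
from statistics import mean, pstdev


def bias_and_spread(measurements, true_value):
    """Return (bias, spread): mean error and population standard deviation."""
    bias = mean(measurements) - true_value
    spread = pstdev(measurements)
    return bias, spread


# Tightly grouped but off target: precise, inaccurate, biased.
print(bias_and_spread([12.0, 12.1, 11.9], true_value=10.0))

# Widely scattered but centred on target: imprecise, accurate, unbiased.
print(bias_and_spread([8.0, 12.0, 9.0, 11.0], true_value=10.0))
```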
DM vs. Query Tools

If you know what you want, use SQL (the database
query language)
SQL finds data under known constraints
SQL cannot readily find hidden knowledge
DM finds hidden nuggets
DM can find interesting patterns, irregularities and
optimal clusters
DM can use repeated SQL queries
DM gives more possibilities
DM requires a good foundation in the data
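As an illustration of SQL finding data under known constraints, the sketch below loads the car insurance records from the classification example into an in-memory SQLite database and asks a question we already know how to phrase. The table name `policy` and the function name are assumptions.

```python
import sqlite3

# The nine records from the classification example.
ROWS = [
    (23, "saloon", "low"), (30, "sports", "low"), (36, "saloon", "low"),
    (25, "hatchback", "high"), (30, "saloon", "low"), (23, "hatchback", "high"),
    (30, "hatchback", "low"), (25, "sports", "high"), (18, "saloon", "low"),
]


def high_risk_young_drivers(rows):
    """Return (age, cartype) for high-risk records with age <= 25."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE policy (age INTEGER, cartype TEXT, risk TEXT)"
        )
        conn.executemany("INSERT INTO policy VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT age, cartype FROM policy "
            "WHERE age <= 25 AND risk = 'high' "
            "ORDER BY age, cartype"
        )
        return cursor.fetchall()
    finally:
        conn.close()


print(high_risk_young_drivers(ROWS))
# [(23, 'hatchback'), (25, 'hatchback'), (25, 'sports')]
```

The query only answers the question it was given. Discovering that age and car type are the attributes worth asking about is the part that data mining addresses.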
Reading
Hongbo Du (generally online resource)
Adriaans and Zantinge (a small book)
Witten & Frank (the WEKA software)
Christopher Westphal: Data Mining for Intelligence, Fraud, & Criminal Detection: Advanced Analytics & Information Sharing Technologies
Marcus Maloof: Machine Learning and Data Mining for Computer Security (e-book on Dawsonera)
