Copyright © 1998
ACSys
About Us
Outline
Database
Machine Learning
Visualisation
Link Analysis
– association rules, sequential patterns, time sequences
Predictive Modelling
– tree induction, neural nets, regression
Database Segmentation
– clustering, k-means
Deviation Detection
– visualisation, statistics
Sales/Marketing
– Provide better customer service
– Improve cross-selling opportunities (beer and nappies)
– Increase direct mail response rates
Customer Retention
– Identify patterns of defection
– Predict likely defections
Mt Stromlo Observatory
Some Research
Virtual Environments
Feature Selection
Outline
– History
– Motivation
Drivers
– Focus on the customer, competition, and data assets
Enablers
– Increased data hoarding
– Cheaper and faster hardware
Research Community
– KDD Workshops 1989, 1991, 1993, 1994
– KDD Conference annually since 1995
– KDD Journal since 1997
– ACM SIGKDD http://www.acm.org/sigkdd
Commercially
– Research: IBM, Amex, NAB, AT&T, HIC, NRMA
– Services: ACSys, IBM, MIP, NCR, Magnify
– Tools: TMC, IBM, ISL, SGI, SAS, Magnify
Information overload
Internet navigation
Link Analysis
links between individuals rather than characterising the whole dataset
Example
Transaction   Items
12345         A B C
12346         A C
12347         A D
12348         B E F
Input Parameters: confidence = 50%; support = 50%
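The support and confidence of candidate rules can be computed directly on this example. A minimal sketch in Python (the `support` and `confidence` helpers are my own names; the transactions are taken from the table above):

```python
# Transactions from the example table above
transactions = [
    {"A", "B", "C"},   # 12345
    {"A", "C"},        # 12346
    {"A", "D"},        # 12347
    {"B", "E", "F"},   # 12348
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """confidence(A => C) = support(A union C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

# The rule A => C meets the 50% support and 50% confidence thresholds:
print(support({"A", "C"}))        # 0.5
print(confidence({"A"}, {"C"}))   # ≈ 0.667
```

With these thresholds, A ⇒ C qualifies: 2 of the 4 transactions contain both A and C, and 2 of the 3 transactions containing A also contain C.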
Typical Application
Millions of transactions
s = s(A ⇒ C) = 50% — the support of the rule A ⇒ C in the example (2 of the 4 transactions contain both A and C)
Algorithm Outline
HIC Example
Pseudo Algorithm
Classification: C5.0
Basic motivation:
– A dataset contains a certain amount of information
– A random dataset has high entropy
– Work towards reducing the amount of entropy in the data
– Alternatively, increase the amount of information exhibited by the data
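The entropy in question is the Shannon entropy of the class distribution. A minimal illustration (the label values are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Y"] * 8))         # 0.0 -- a pure dataset carries no uncertainty
print(entropy(["Y", "N"] * 4))    # 1.0 -- a 50/50 mix has maximal entropy
```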
Algorithm
Select the best split S* from the candidate set S
Discriminating Descriptions
categorical attributes
– define a disjoint cell for each possible value: sex = “male”
– can be grouped: transport ∈ {car, bike}
continuous attributes
– define many possible binary partitions
– Split A < 24 and A ≥ 24
– Or split A < 28 and A ≥ 28
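For a continuous attribute, candidate binary partitions are typically taken at thresholds between observed values. A small sketch (using midpoints between consecutive distinct values, which is one common convention, not necessarily the one used here):

```python
def candidate_splits(values):
    """Thresholds t for binary partitions A < t vs A >= t, taken at
    midpoints between consecutive distinct observed values."""
    vs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(vs, vs[1:])]

print(candidate_splits([28, 24, 30, 24]))   # [26.0, 29.0]
```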
Information Measure
info(T) = − Σ_{j=1}^{m} p_j log(p_j)
is the amount of information needed to identify the class of an object in T

The expected information requirement is then the weighted sum:
info_x(T) = Σ_{i=1}^{n} (|T_i| / |T|) · info(T_i)
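Both measures can be computed directly from class counts. A minimal sketch (the example split into two cells is hypothetical):

```python
import math
from collections import Counter

def info(labels):
    """info(T) = -sum_j p_j log2(p_j): bits needed to identify the class."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_x(partition):
    """info_x(T) = sum_i |T_i|/|T| * info(T_i): expected information
    requirement after splitting T into cells T_i."""
    total = sum(len(cell) for cell in partition)
    return sum(len(cell) / total * info(cell) for cell in partition)

T = ["Y", "Y", "Y", "Y", "N", "N", "N", "N"]
cells = [["Y", "Y", "Y", "N"], ["Y", "N", "N", "N"]]
print(info(T))                             # 1.0
print(round(info(T) - info_x(cells), 3))   # 0.189 -- the information gain
```

The split that maximises the difference info(T) − info_x(T) is the one the greedy algorithm selects.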
End Result
        A
     b / \ c
      E   Y
 <63 / \ >=63
    Y   N
Types of Parallelism
Example: ScalParC
Data Structures
– Attribute Lists: separate lists of all attributes, distributed across processors
– Node Table: stores node information for each record id
– Count Matrices: stored for each attribute, for all nodes at a given level
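One way to picture these structures in code. This is a hypothetical toy sketch (record values, attribute names, and the round-robin distribution are my own choices; ScalParC itself is a parallel C/MPI implementation that differs in detail):

```python
from collections import Counter

# Hypothetical toy records: (record_id, age, class_label)
records = [(0, 21, "Y"), (1, 35, "N"), (2, 28, "Y"), (3, 44, "N")]
n_procs = 2

# Attribute list for "age": (value, record_id) pairs, sorted by value,
# then dealt round-robin across processors.
age_list = sorted((age, rid) for rid, age, _ in records)
per_proc = [age_list[i::n_procs] for i in range(n_procs)]

# Node table: maps each record id to the tree node it currently sits in.
node_table = {rid: 0 for rid, _, _ in records}   # all records start at the root

# Count matrix for "age" at the root: class counts on each side of a
# candidate split, built by scanning the attribute list.
threshold = 30
labels = {rid: lbl for rid, _, lbl in records}
counts = {"below": Counter(), "above": Counter()}
for value, rid in age_list:
    side = "below" if value < threshold else "above"
    counts[side][labels[rid]] += 1

print(counts["below"])   # Counter({'Y': 2})
print(counts["above"])   # Counter({'N': 2})
```

The count matrices are what each processor exchanges to evaluate splits without moving the attribute lists themselves.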
Pruning
Error-Based Pruning
Issues
Classification Rules
        A
     b / \ c
      E   Y
 <63 / \ >=63
    Y   N

– A = c ⇒ Y
– A = b ∧ E < 63 ⇒ Y
– A = b ∧ E ≥ 63 ⇒ N
– Rule Pruning: perhaps E ≥ 63 ⇒ N
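Read off the tree, the rules amount to a simple classifier. A sketch using the attribute names from the tree (the test values are hypothetical):

```python
def classify(A, E):
    """Apply the classification rules read off the decision tree."""
    if A == "c":
        return "Y"              # A = c => Y
    if A == "b" and E < 63:
        return "Y"              # A = b and E < 63 => Y
    return "N"                  # A = b and E >= 63 => N

print(classify("c", 70))   # Y
print(classify("b", 40))   # Y
print(classify("b", 70))   # N
```

Rule pruning would simplify the last rule by dropping the A = b condition, keeping only E ≥ 63 ⇒ N.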
Pros
– Greedy search ⇒ fast execution
– High dimensionality not a problem
– Selects important variables
– Creates symbolic descriptions
Cons
– Search space is huge
– Interaction terms not considered
– Axis-parallel tests only (A = v)
Recent Research
Bagging
– Sample with replacement from the training set
– Build multiple decision trees from different samples
– Use a voting method to classify new objects
Boosting
– Build multiple trees from all training data
– Maintain a weight for each instance in the training set that reflects its importance
– Use a voting method to classify new objects
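The bagging steps above can be sketched as follows (the `build_tree` callback and toy training data are hypothetical stand-ins for a real tree inducer such as C5.0):

```python
import random
from collections import Counter

def bag(train, build_tree, n_trees=10, seed=0):
    """Bagging: draw bootstrap samples (with replacement), build one
    classifier per sample, and classify new objects by majority vote."""
    rng = random.Random(seed)
    trees = [build_tree([rng.choice(train) for _ in train])
             for _ in range(n_trees)]
    def classify(x):
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]
    return classify

# Toy "tree" inducer: always predicts its sample's majority class.
def build_tree(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

train = [(i, "Y") for i in range(8)] + [(i, "N") for i in range(2)]
model = bag(train, build_tree)
print(model(None))   # Y
```

Boosting differs in that every tree sees all the training data, with per-instance weights updated between rounds to emphasise misclassified instances.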