Similarity Measures for Multi-valued Attributes for Database Clustering
Tae-wan Ryu and Christoph F. Eick
Department of Computer Science
University of Houston
Talk Organization
Database Clustering
Problems of Database Clustering
Extended Data Sets
Similarity Measures for Sets and Bags
An Architecture for Database Clustering
Summary and Conclusion
General KDD Steps
Data preparation
Research Goal
Our approach
Partition the database into groups of similar objects using cluster
analysis
Find commonalities that objects belonging to each group share
using genetic programming
Database Summary Generation Steps and Example
Steps:
Database Clustering: clusters (groups of similar objects)
Summary Generation: summaries describing the commonalities within each group
Example groups: Young, White color, Retired
An Example Schema Diagram
Marriage
Key Problems
How to support a user's viewpoint, including attribute selection
Data model discrepancy between storage format and the input
format that clustering algorithms assume
How to cope with structural information, especially 1:n and n:m
relationships
Input Format for Data Mining Algorithms
(a) An example of a Personal relational database; (b) a joined table from the Person and
Purchase relations.
ptype (payment type): 1 for cash, 2 for credit, and 3 for check; the cardinality ratio is 1:n.
Existing Approaches
Problems
User has to make a critical decision (e.g., which aggregate
function to use?)
Valuable related information may be lost.
Extended Data Sets
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   42   F    2      100     Mall
Andy   42   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null
<Current approach>
<Proposed approach>
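The difference between the flat joined table and an extended data set can be sketched in Python. This is a minimal illustration, not the authors' generator: the function name `to_extended` and the choice to store each multi-valued attribute as a list (a bag) are assumptions made for this example. The rows are the ones from the joined table above.

```python
from collections import OrderedDict

# Rows of the joined Person/Purchase table:
# (name, age, sex, ptype, amount, location)
JOINED = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 42, "F", 2, 100, "Mall"),
    ("Andy", 42, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]


def to_extended(rows):
    """Collapse the rows of each person into one object with bag-valued attributes."""
    objects = OrderedDict()
    for name, age, sex, ptype, amount, location in rows:
        obj = objects.setdefault(
            name,
            {"name": name, "age": age, "sex": sex,
             "ptype": [], "amount": [], "location": []},
        )
        if ptype is not None:  # a person without purchases keeps empty bags
            obj["ptype"].append(ptype)
            obj["amount"].append(amount)
            obj["location"].append(location)
    return list(objects.values())


EXTENDED = to_extended(JOINED)
```

Each person now appears exactly once, so a clustering algorithm sees four objects instead of seven rows, and no aggregate function has to be chosen up front.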
Related Work
[Figure: Database (d1, d2, ..., dn); user's interests and objectives; extended data set generator; extended data sets]
A Unified Similarity Measure for Clustering Extended Data Sets
Group Similarity Measures
Mixed types: qualitative and quantitative attributes.
Group similarity is computed by averaging the inter-object similarity over all pairs of
objects in which the two objects belong to different groups.
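The averaging rule for group similarity can be sketched as follows. This is an illustrative sketch only: the names `group_similarity` and `numeric_sim`, the fixed value range of 400, and the convention that an empty group yields 0 are assumptions, not part of the talk.

```python
from itertools import product


def group_similarity(group_a, group_b, sim):
    """Average sim(x, y) over all pairs with x in group_a and y in group_b."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        return 0.0  # assumption: similarity involving an empty group is 0
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


def numeric_sim(x, y, value_range=400.0):
    """Hypothetical inter-object similarity for numbers, scaled into [0, 1]."""
    return 1.0 - abs(x - y) / value_range


# Two bags of purchase amounts, compared value by value.
RESULT = group_similarity([400, 70, 200], [100, 100], numeric_sim)
```

Because every cross-group pair contributes, duplicate values in a bag count as often as they occur, which is what distinguishes a bag from a set here.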
A Framework for Mixed Type Similarity Measures for Extended Data Sets
Gower's similarity measure for data sets with mixed types:
\[
S(a, b) = \frac{\sum_{i=1}^{m} w_i \, s_i(a_i, b_i)}{\sum_{i=1}^{m} w_i}
\]
\[
S(a, b) = \frac{\sum_{i=1}^{l} w_i \, s_l(a_i, b_i) + \sum_{j=1}^{q} w_j \, s_q(a_j, b_j)}{\sum_{i=1}^{l} w_i + \sum_{j=1}^{q} w_j}
\]
where m = l + q, and s_l(a, b) and s_q(a, b) are the similarity functions for
qualitative and quantitative attributes, respectively.
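A weighted mixed-type similarity of this form can be sketched in Python. The per-attribute functions below are assumptions chosen for illustration (exact match for qualitative attributes, range-normalized distance for quantitative ones); the talk does not specify them, and the names `s_l`, `make_s_q`, and `gower` are hypothetical.

```python
def s_l(a, b):
    """Assumed qualitative similarity: 1 on an exact match, otherwise 0."""
    return 1.0 if a == b else 0.0


def make_s_q(value_range):
    """Build an assumed quantitative similarity scaled by the attribute's range."""
    def s_q(a, b):
        return 1.0 - abs(a - b) / value_range
    return s_q


def gower(a, b, sims, weights):
    """Weighted average of per-attribute similarities, S(a, b)."""
    numerator = sum(w * s(x, y) for w, s, x, y in zip(weights, sims, a, b))
    return numerator / sum(weights)


# Objects are (age, sex); the age range 67 - 35 = 32 comes from the sample table.
SIMS = [make_s_q(32.0), s_l]
SCORE = gower((43, "M"), (42, "F"), SIMS, weights=[1.0, 1.0])
```

With equal weights the score is the plain mean of the attribute similarities; raising a weight lets the user emphasize the attributes that matter for their viewpoint.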
Clustering Algorithms for Extended Data Sets
Nearest-neighbor clustering
DBSCAN
Leader algorithm
Hierarchical clustering
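Any of these algorithms only needs a pairwise similarity function, which is why they can work on extended data sets. As one example, the leader algorithm can be sketched as below. This follows the commonly described single-pass form of the algorithm, not necessarily the variant used in the talk; the function name, the threshold, and the age-based similarity are assumptions for illustration.

```python
def leader_clustering(objects, sim, threshold):
    """Single pass: join the first leader that is similar enough, else become a leader."""
    leaders = []
    clusters = []
    for obj in objects:
        for k, leader in enumerate(leaders):
            if sim(obj, leader) >= threshold:
                clusters[k].append(obj)
                break
        else:
            # No existing leader is similar enough: start a new cluster.
            leaders.append(obj)
            clusters.append([obj])
    return clusters


def age_sim(a, b):
    """Hypothetical similarity on ages, scaled by an assumed range of 32."""
    return 1.0 - abs(a - b) / 32.0


CLUSTERS = leader_clustering([43, 42, 67, 35], age_sim, threshold=0.8)
```

The result depends on the order in which objects are scanned, which is a known property of the leader algorithm.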
Database Clustering Environment
[Figure: components of the database clustering environment: DBMS; Data Extraction Tool; User Interface; Similarity Measure Tool; Clustering Tool; library of clustering algorithms; library of similarity measures. Data items shown: extended data set; similarity measure; type and weight information; default choice and domain information; a set of clusters.]
A More Detailed Tool Architecture
[Figure: DBMS; query and query result; pre-processor; extended data set; translator; flat file; processed data; our data mining tools; other data mining tools; user's interests and objectives; form.]
Join form:
Database name
Relationship definitions
Data set of interest
Selected attributes
Other information
A Join Template Form
Begin-spec
Database-name: DB;
Link-definitions: Link-list;
Begin-join
Dataset-of-interest: Dsetintrest;
Selected-attributes: Attr-list;
Objective-attributes: Obj-attr-list;
Extended-data-set: E;
End-join
End-spec
An Example of the Interface of the Extended Data Set Generation Tool
Begin-spec
DB-name: Company
Link-definitions:
superv(Employee.ssn, Employee.superssn),
husband(Employee.ssn, Marriage.hssn),
wife(Employee.ssn,Marriage.wssn),
ehusband(Marriage.hssn, Employee.ssn),
ewife(Marriage.wssn, Employee.ssn),
works_on(Employee.ssn, Works_on.essn),
project(Works_on.pno, Project.pnum),
works_for(Employee.dno, Department.dnum),
works_loc(Department.dnum, Dept_loc.dnum)
Begin-join
Dataset-of-interest: Employee
Selected-attributes: ssn, sex, salary,
superv.salary, wife.ewife.salary,
works_on.hours, works_on.project.pname,
works_for.works_loc.dloc
Objective-attributes: ssn
Output-data-set: E1
End-join
End-spec
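How a dotted attribute path such as works_on.project.pname could be resolved against the link definitions is sketched below. This is a hypothetical reading of the specification, not the tool's actual implementation: it assumes each link runs from its first argument to its second, and the names `LINKS` and `resolve` are introduced only for this example.

```python
# Link definitions copied from the specification above: name -> (from, to).
LINKS = {
    "superv": ("Employee.ssn", "Employee.superssn"),
    "husband": ("Employee.ssn", "Marriage.hssn"),
    "wife": ("Employee.ssn", "Marriage.wssn"),
    "ehusband": ("Marriage.hssn", "Employee.ssn"),
    "ewife": ("Marriage.wssn", "Employee.ssn"),
    "works_on": ("Employee.ssn", "Works_on.essn"),
    "project": ("Works_on.pno", "Project.pnum"),
    "works_for": ("Employee.dno", "Department.dnum"),
    "works_loc": ("Department.dnum", "Dept_loc.dnum"),
}


def resolve(path, start, links):
    """Follow the link names in a dotted path.

    Returns (join steps, final relation, attribute name).
    """
    *hops, attr = path.split(".")
    table, joins = start, []
    for hop in hops:
        src, dst = links[hop]
        if src.split(".")[0] != table:
            raise ValueError(f"link {hop!r} does not start at {table}")
        joins.append((src, dst))
        table = dst.split(".")[0]  # continue from the relation the link leads to
    return joins, table, attr


JOINS, TABLE, ATTR = resolve("works_on.project.pname", "Employee", LINKS)
```

Under this reading, wife.ewife.salary walks from Employee to Marriage and back to Employee, and paths that cross a 1:n link such as works_on yield multi-valued attributes in the extended data set.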
Algorithm to Generate Extended Data Sets
[Figure: DB; DBMS; query set; query result; Interface (user input); KB (domain knowledge); GP engine; select, evaluate, return; system input]
n: the number of generations
m: the population size
Evolution Process
n: the number of generations
m: the population size