Social Networks and Data Mining

SUSHIL KULKARNI
JAI-HIND COLLEGE
sushiltry@yahoo.co.in
Social Networks : Example
Technology used
What is Data Mining?
DM Process & Example
DM Queries
DM Tasks and Methods
Relation & Data Warehouse
What is ETL ?
Data Preprocessing
What is a Network?
[Figure: a network of nodes joined by links]
Web Definition : A set of nodes, points, or locations
connected by means of data, voice, and video
communications for the purpose of exchange.
Social Networks
 A social network is
a social structure of
people, related
(directly or indirectly)
to each other through
a common relation or
interest
Social Network Analysis
 Social network analysis [SNA] is the mapping and measuring
of relationships and flows between people, groups,
organizations, computers or other information/knowledge
processing entities.
 The nodes in the network are the people and groups while the
links show relationships or flows between the nodes.
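As a rough illustration of nodes and links, the Python sketch below stores a small network as an adjacency mapping and measures two simple things: how many direct links each node has, and how many links separate one node from every other. The network itself (`links`, with people A to E) is a hypothetical toy example, not data from this deck, and `degrees_of_separation` is an illustrative helper name.

```python
from collections import deque

# Hypothetical toy network (illustrative only): each key is a node (person),
# each value is the set of nodes it is directly linked to.
links = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}


def degrees_of_separation(graph, start):
    """Breadth-first search: links on the shortest path from start to each node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Number of direct links per node.
link_counts = {node: len(neighbours) for node, neighbours in links.items()}
# Shortest link distance from A to every reachable node.
distances = degrees_of_separation(links, "A")
```

In this toy network D has the most direct links, and E is three links away from A.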
A shift in approach: from ‘synthesis’ to
‘analysis’
Problems
• High cost of manual surveys
• Survey bias
- Perceptions of individuals may be incorrect
• Logistics
- Organizations are now spread across several countries.
[Figure: shift in approach. Synthesis: employee surveys give cognitive networks for A, B and C, which are combined into a social network. Analysis: electronic communication (email analysis, web logs) gives the social network and the cognitive network.]
Technology
Various technologies that help in creating
Social Networks are:
 Email
 Blogs
 Social Networking Software like Orkut,
Facebook, Flickr etc.
SOCIAL NETWORK:
Profile & Platforms
USENET
SOCIAL NETWORK:
Profile & Platforms
Social Community
SOCIAL NETWORK: Growth
SOCIAL NETWORK : Growth Rate
Technology :
 What is Your Network?
- When your connections invite their connections, your
Network starts to grow.
- Your Network is your connections, their connections, and
so on out from you at the center.
 How do you classify users?

- Your Network contains professionals out to “three degrees”,
that is, friends-of-friends-of-friends. If each person had 10
connections (and some have many more) then the third degree
alone would contain up to 10 × 10 × 10 = 1,000 professionals,
and all three degrees together up to 1,110.
 How do you see who is in your Network?

Facebook lets you see your network as one large group of
searchable professional profiles.
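The growth of a network by degrees can be checked with a few lines of Python. This is a minimal sketch under an idealised assumption: every person contributes exactly the same number of new connections and nobody is counted twice. The function name `network_size_by_degree` is illustrative only.

```python
def network_size_by_degree(connections_per_person, degrees):
    """Upper bound on people at each degree, assuming no shared connections."""
    return [connections_per_person ** d for d in range(1, degrees + 1)]


# 10 connections per person, out to three degrees.
per_degree = network_size_by_degree(10, 3)   # friends, friends-of-friends, ...
total = sum(per_degree)                      # everyone within three degrees
```

With 10 connections per person, the three degrees hold 10, 100 and 1,000 people, 1,110 in total. A fourth degree alone would add 10,000.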
SOCIAL NETWORK: Visualization
[Figure: ME at the center, linked to five FRIENDs]
ON ANY SOCIAL NETWORK
FRIEND: Name, Gender, Age, Birth date/Home town, School attended, Interests/Hobbies, Photos, Friends, Activities, Audio clips, Video clips
YOU: Name, Gender, Age, School attended, Interests/Hobbies, Photos, Friends, Activities, Audio clips, Video clips
ON ANY SOCIAL NETWORK
After making a friend, I am able to access his/her friends, audio clips and videos, and to share information.
A friend may be from any remote site.
[Figure: the FRIEND and YOU profiles, each listing Name, Gender, Age, School attended, Interests/Hobbies, Photos, Friends, Activities, Audio clips, Video clips]
SOCIAL NETWORK : Visualization
Between friends: How many of them ?
Male vs. Female Young vs. Old
Thin vs. Fat

Between friends: Relationships
Thick Friends Just Friends

Between friends: Likes
Coffee Chocolate
Friends Friends
HOW MANY OF MADHURI DIXIT’S FRIENDS LIKE ?
HOW MANY OF PRASHANT DAMLE’S FRIENDS LIKE ?
FRIENDS OF A FRIENDS OF A FRIEND
SHOULD KNOW
 How many friends use a social network
regularly?
 How many friends send messages
frequently?
 What is the mood of your friend list?
 How many friends are vegetarian?
 How many friends are close to or far from
you?
 How many friends studied or are studying in
your school?
FRIENDS OF A FRIENDS OF A FRIEND
SHOULD KNOW
INTERESTING PATTERNS
FROM UNKNOWN DATA
DEFINE DATA MINING
Data Mining is:
The analysis of (often large) observational
data sets to find unsuspected
relationships and to summarize the data
in novel ways that are both
understandable and useful to the data
owner.
THUS : DATA MINING
 Methods for exploring and modeling
relationships in large amount of data
 Finding hidden information in a database
 Fit data to a model

Data Mining Process
 Understand the Domain
- Understand the particulars of the business
or scientific problem
 Create a Data set
- Understand structure, size, and format
of data
- Select the interesting attributes
- Data cleaning and preprocessing
Data Mining Process
 Choose the data mining task and the
specific algorithm
- Understand capabilities and limitations of
algorithms that may be relevant to the
problem
 Interpret the results, and possibly return to
bullet 2 (Create a Data set)
EXAMPLE
 Understand social networks.
 Grow connections.
 Choose appropriate built-in methods to
find hidden information.
Example :E-mail Communication
 A sends an e-mail to B
 With Cc to C
 And Bcc to D
 C forwards this e-mail to E
[Figure: nodes A, B, C, D and E showing the flow of the e-mail]
 From analyzing the header, we can infer
 A and D know that A, B, C and D know about this e-mail
 B and C know that A, B and C know about this e-mail
 C also knows that E knows about this e-mail
 D also knows that B and C do not know that it knows about
this e-mail; and that A knows this fact
 E knows that A, B and C exchanged this e-mail; and that
neither A nor B know that it knows about it
 and so on and so forth …
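The inferences above can be reproduced mechanically from the header fields. The Python below is a minimal sketch, assuming that To and Cc recipients see only the From/To/Cc fields, that a Bcc recipient sees those fields plus itself, and that a forwarded copy still shows the original visible header. The function name `known_recipients` is hypothetical.

```python
def known_recipients(sender, to, cc, bcc):
    """Map each party to the set of parties it can tell received the message."""
    visible = {sender, *to, *cc}              # named in the From/To/Cc fields
    knowledge = {sender: visible | set(bcc)}  # the sender knows every recipient
    for person in (*to, *cc):
        knowledge[person] = set(visible)      # Bcc recipients stay hidden
    for person in bcc:
        knowledge[person] = visible | {person}
    return knowledge


# A sends to B, with Cc to C and Bcc to D.
k = known_recipients("A", to=["B"], cc=["C"], bcc=["D"])

# C forwards the e-mail to E: E sees the original visible header,
# and C is the only earlier party who knows that E has it.
k["E"] = k["C"] | {"E"}
k["C"] = k["C"] | {"E"}
```

Each set lists who a given party knows to be aware of the e-mail; for example, `k["B"]` contains A, B and C but not D or E.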
DB VS DM PROCESSING
          Database                       Data Mining
Query     Well defined; SQL              Poorly defined; no precise query language
Data      Operational data               Not operational data
Output    Precise; subset of database    Fuzzy; not a subset of database
QUERY EXAMPLES
Database
– Find all credit applicants with first name of Sane.
– Identify customers who have purchased
more than Rs.10,000 in the last month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor
credit risks. (classification)
– Identify customers with similar buying
habits. (Clustering)
– Find all items which are frequently
purchased with milk. (association rules)
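The contrast between the two kinds of query can be sketched in Python. The baskets below are hypothetical data invented for illustration. The first query has an exact answer that is a subset of the stored records; the second asks what tends to be bought with milk and produces a derived measure (here, the fraction of milk baskets that contain each other item) rather than a subset.

```python
from collections import Counter

# Hypothetical market-basket data (illustrative only).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

# Database-style query: precise question, answer is a subset of the data.
milk_baskets = [t for t in transactions if "milk" in t]

# Data-mining-style question: which items appear together with milk, and how often?
co_counts = Counter(item for t in milk_baskets for item in t if item != "milk")
confidence = {item: n / len(milk_baskets) for item, n in co_counts.items()}
```

Real association rule algorithms also apply minimum support and confidence thresholds; this sketch only shows the counting step.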
ARE ALL THE ‘DISCOVERED’
PATTERNS INTERESTING?
 Interestingness measures:
A pattern is interesting if it is easily
understood by humans, valid on new or
test data with some degree of certainty,
potentially useful, novel, or validates
some hypothesis that a user seeks to
confirm
DATA MINING DEVELOPMENT
 Similarity Measures
 Hierarchical Clustering
 IR Systems
 Imprecise Queries
 Textual Data
 Web Search Engines
 Relational Data Model
 SQL
 Association Rule Algorithms
 Data Warehousing
 Scalability Techniques
 Bayes Theorem
 Regression Analysis
 EM Algorithm
 K-Means Clustering
 Time Series Analysis
 Algorithm Design Techniques
 Algorithm Analysis
 Data Structures
 Neural Networks
 Decision Tree Algorithms
RELATION (r)
 D1, D2, …, Dn are domains
 Relation r is a subset of the Cartesian
product D1 × D2 × … × Dn
r ⊆ D1 × D2 × … × Dn
EXAMPLE : r
D1 = {Ram, Shyam}, D2 = {24, 34}
D1 × D2 = {(Ram, 24), (Ram, 34),
(Shyam, 24), (Shyam, 34)}
r is a subset of D1 × D2
r = {(Ram, 24), (Shyam, 34)}
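The same example can be checked in Python, as a small sketch using sets for the domains and tuples for the rows. The variable names mirror the slide; `cartesian` and `is_relation` are illustrative names.

```python
from itertools import product

D1 = {"Ram", "Shyam"}
D2 = {24, 34}

# Cartesian product D1 x D2: every possible (name, age) pair.
cartesian = set(product(D1, D2))

# The relation r keeps only some of those pairs.
r = {("Ram", 24), ("Shyam", 34)}

# r is a relation over D1 and D2 exactly when it is a subset of the product.
is_relation = r <= cartesian
```

The product has four pairs, and r contains two of them.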

RELATION is TABLE
[Table: Employee, with a NAME column whose first value is Ram]
TUPLES OR ROWS : t
 Instance of the relation is a tuple or row
 Notation :
t = < a(1), a(2), a(3), …, a(n) >, where
a(i) ∈ A(i) for i = 1, …, n
 Example: t = < Ram, 24 >
RELATION (r)
R    A1    A2    A3    …   Ak    …   An
     a11   a21   a31   …   ak1   …   an1
     a12   a22   a32   …   ak2   …   an2
     …     …     …     …   …     …   …
t    a1i   a2i   a3i   …   aki   …   ani
     …     …     …     …   …     …   …
     a1m   a2m   a3m   …   akm   …   anm
aki : value of the k th attribute of R in the i th tuple t
WHAT IS
DATA WAREHOUSE ?
Subject-oriented:
customers, patients, students,
products, time.
Integrated: Gathered CENTRALLY from
1. several internal systems of records
2. sources external to the organization
WHAT IS
DATA WAREHOUSE ?
 Time - variant:
Used to study trends and changes.
 Non - updatable:
Cannot be updated by end users.

BIG PICTURE
The ETL Process
 Capture
 Scrub or data cleansing
 Transform
 Load and Index
ETL = Extract, Transform, and Load

Steps in data reconciliation
Capture = extract…obtaining a snapshot of a
chosen subset of the source data for loading
into the data warehouse
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Scrub = cleanse…uses pattern
recognition and AI techniques to
upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform = convert data from
format of operational system to
format of data warehouse
Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
Load/Index = place transformed
data into the warehouse and
create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
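The four ETL steps can be sketched end to end in Python. Everything here is hypothetical: the source rows, the field names, and the functions `capture`, `scrub`, `transform` and `load` are invented for illustration and use plain dictionaries instead of a real warehouse. The sketch shows a static extract, simple cleansing (trimming, consistent capitalisation, duplicate removal), a field-level conversion with record-level aggregation, and a refresh-mode load.

```python
# Hypothetical rows from an operational system (illustrative only).
source = [
    {"id": 1, "name": " ram ", "city": "Mumbai", "amount": "1200"},
    {"id": 2, "name": "Shyam", "city": "mumbai", "amount": "800"},
    {"id": 2, "name": "Shyam", "city": "mumbai", "amount": "800"},  # duplicate
]


def capture(rows):
    """Static extract: copy a snapshot of the chosen source rows."""
    return [dict(row) for row in rows]


def scrub(rows):
    """Cleanse: trim and normalise text fields, drop rows with a repeated id."""
    seen, clean = set(), []
    for row in rows:
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        row["name"] = row["name"].strip().title()
        row["city"] = row["city"].strip().title()
        clean.append(row)
    return clean


def transform(rows):
    """Convert amount from text to int (field-level), then total per city."""
    totals = {}
    for row in rows:
        totals[row["city"]] = totals.get(row["city"], 0) + int(row["amount"])
    return totals


def load(warehouse, totals):
    """Refresh mode: bulk rewrite of the target table."""
    warehouse.clear()
    warehouse.update(totals)
    return warehouse


warehouse = load({}, transform(scrub(capture(source))))
```

An update-mode load would instead write only the rows that changed since the previous run.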
DIRTY DATA
Data in the real world is dirty:
– incomplete: lacking attribute values,
lacking certain attributes of interest, or
containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies
in codes or names
WHY DATA
PREPROCESSING?
No quality data, no quality mining results!
Quality decisions must be based on
quality data
Data warehouse needs consistent
integration of quality data
Required for Data Mining!
Why can Data be
Incomplete?
 Attributes of interest are not available
(e.g., customer information for sales
transaction data)
 Data were not considered important at
the time of transactions, so they were
not recorded!
Why can Data be
Incomplete?
 Data not recorded because of
misunderstanding or malfunctions
 Data may have been recorded and later
deleted!
 Missing/unknown values for some data

Why can Data be
Noisy / Inconsistent ?
 Faulty instruments for data collection
 Human or computer errors
 Errors in data transmission
 Technology limitations (e.g., sensor data come
at a faster rate than they can be processed)
Why can Data be
Noisy / Inconsistent ?
 Inconsistencies in naming conventions or
data codes (e.g., 2/5/2002 could be 2 May
2002 or 5 Feb 2002)
 Duplicate tuples (records received twice)
should also be removed
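The date ambiguity mentioned above is easy to demonstrate. In this short Python sketch the same string is parsed under two different conventions using the standard `datetime.strptime`; the variable names are illustrative.

```python
from datetime import datetime

raw = "2/5/2002"

# Day first, then month: 2 May 2002.
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Month first, then day: 5 February 2002.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
```

Both parses succeed, so the inconsistency cannot be detected from the value alone; the convention of each source has to be known before integration.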
Major Tasks in Data
Preprocessing
outliers=exceptions!
Data cleaning
– Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies
Data integration
– Integration of multiple databases or files
Data transformation
– Normalization and aggregation
Major Tasks in Data
Preprocessing
Data reduction
– Obtains reduced representation in volume
but produces the same or similar
analytical results
Data discretization
– Part of data reduction but with particular
importance, especially for numerical data
Forms of data preprocessing
DATA CLEANING
Data cleaning tasks

- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
HOW TO HANDLE MISSING
DATA?
 Ignore the tuple: usually done when class
label is missing (assuming the task is
classification); not effective when the
percentage of missing values per attribute
varies considerably.
 Fill in the missing value manually: tedious +
infeasible?
HOW TO HANDLE MISSING
DATA?
 Use a global constant to fill in the missing value:
e.g., “unknown”, a new class?!
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging
to the same class to fill in the missing value:
smarter
 Use the most probable value to fill in the missing
value: inference-based such as Bayesian formula
or decision tree
HOW TO HANDLE MISSING
DATA?
Age   Income   Team      Gender
23    24,200   Red Sox   M
39    ?        Yankees   F
45    45,390   ?         F
Fill missing values using aggregate functions (e.g.,
average) or probabilistic estimates on global value
distribution
E.g., for the missing income, put the average income, or the most
probable income based on the fact that the person is
39 years old
E.g., for the missing team, put the most frequent team
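As a minimal sketch, the Python below fills the missing income in the three-row table in two ways: with the mean over all rows, and with the mean over rows of the same class. Treating Gender as the class is an assumption made only for this illustration, and `attribute_mean` is a hypothetical helper name.

```python
from statistics import mean

# The three rows from the table; None marks a missing value.
rows = [
    {"age": 23, "income": 24200, "team": "Red Sox", "gender": "M"},
    {"age": 39, "income": None, "team": "Yankees", "gender": "F"},
    {"age": 45, "income": 45390, "team": None, "gender": "F"},
]


def attribute_mean(rows, attr, where=lambda row: True):
    """Mean of the known values of attr among rows accepted by where."""
    values = [r[attr] for r in rows if r[attr] is not None and where(r)]
    return mean(values)


# Attribute mean over all rows with a known income.
global_mean = attribute_mean(rows, "income")

# Attribute mean over rows in the same class as the incomplete row (gender F).
same_class_mean = attribute_mean(rows, "income", where=lambda r: r["gender"] == "F")
```

The two strategies give different fills for the 39-year-old (34,795 versus 45,390), which is why the class-based mean is described as the smarter choice when a class label is available. For the missing team, Red Sox and Yankees each appear once in these rows, so the most frequent team is not uniquely defined on such a small sample.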
HOW TO HANDLE NOISY DATA?
Discretization
The process of partitioning continuous
variables into categories is called
discretization.
Discretization : Smoothing techniques
Binning method:
- first sort data and partition into (equi-depth) bins
- then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
Clustering
- detect and remove outliers
Discretization : Smoothing techniques
Combined computer and human inspection

- computer detects suspicious values,
which are then checked by humans
Regression
- smooth by fitting the data into regression
functions
SIMPLE DISCRETISATION
METHODS: BINNING
Equal-width (distance) partitioning:
- It divides the range into N intervals of equal size:
uniform grid
- if A and B are the lowest and highest values of the
attribute, the width of intervals will be:
W = (B - A) / N
- The most straightforward
- But outliers may dominate presentation
- Skewed data is not handled well.
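As a minimal sketch of the formula W = (B - A) / N, the Python below splits a list of values into N equal-width bins. The function name `equal_width_bins` and the choice to place the maximum value in the last bin are assumptions made for illustration; the ages are the sample values used in this deck's binning examples.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)   # A and B
    width = (high - low) / n_bins          # assumes the values are not all equal
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Clamp so that the maximum value lands in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
width, bins = equal_width_bins(ages, 3)
```

Here W is 28 / 3, a little over 9.3, and the bins hold 2, 4 and 3 values. A single extreme value would stretch the range and crowd most of the data into a few bins, which is the outlier problem noted above.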
SIMPLE DISCRETISATION
METHODS: BINNING
Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each
containing approximately the same number of
samples
- Good data scaling – good handling of
skewed data
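Equal-depth partitioning can be sketched in Python by sorting the values and cutting the sorted list into chunks of a fixed size. The function name `equal_depth_bins` is illustrative, and the ages are the sample values used in this deck's binning examples; when the number of values is not a multiple of the depth, the last bin in this sketch is simply shorter.

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
bins = equal_depth_bins(ages, 3)
```

Each bin holds three values regardless of how the ages are spread, which is why this method copes better with skewed data than equal-width binning.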
BINNING : EXAMPLE
 Binning is applied to each individual feature
(attribute)
 Set of values can then be discretized by replacing
each value in the bin by the bin mean, bin median, or bin
boundaries.
 Example: Set of values of attribute Age:
0, 4, 12, 16, 16, 18, 23, 26, 28
EXAMPLE: EQUI-WIDTH BINNING
Example : Set of values of attribute Age:
0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin width = 10
Bin #   Bin Elements        Bin Boundaries
1       {0, 4}              (-∞, 10)
2       {12, 16, 16, 18}    [10, 20)
3       {23, 26, 28}        [20, +∞)

EXAMPLE: EQUI-DEPTH BINNING
Example : Set of values of attribute Age:
0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin depth = 3
Bin #   Bin Elements      Bin Boundaries
1       {0, 4, 12}        (-∞, 14)
2       {16, 16, 18}      [14, 21)
3       {23, 26, 28}      [21, +∞)

SMOOTHING USING BINNING
METHODS
 Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
 Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
 Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
 Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
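The smoothing steps above can be reproduced with a short Python sketch. The function names are illustrative, the bin means are rounded to whole rupees to match the slide, and ties in boundary smoothing go to the lower boundary, which is an assumption (no tie occurs in this data).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

# Equi-depth partition: three bins of four values each.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]


def smooth_by_means(bin_values):
    """Replace every value by the bin mean, rounded to the nearest integer."""
    m = round(sum(bin_values) / len(bin_values))
    return [m] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace every value by the closer of the bin's minimum and maximum."""
    low, high = bin_values[0], bin_values[-1]
    return [low if v - low <= high - v else high for v in bin_values]


by_means = [smooth_by_means(b) for b in bins]
by_boundaries = [smooth_by_boundaries(b) for b in bins]
```

The exact means of the second and third bins are 22.75 and 29.25, which round to the 23 and 29 shown above.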
SIMPLE DISCRETISATION
METHODS: BINNING
Example: customer ages
[Figure: histograms of the number of values per bin]
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
THANK YOU !
Any Questions?
SUSHIL KULKARNI
sushiltry@yahoo.co.in
