
Tutorial on

Mining Heterogeneous
Information Networks
Rokia Missaoui
LARIM
Université du Québec en Outaouais, Canada
http://w3.uqo.ca/missaoui

Tutorial Mining Heterogeneous Information Networks EGC2013 - Toulouse 1

Acknowledgement

I am grateful to Professor Jiawei Han, who kindly
allowed me to use his slides on mining
heterogeneous information networks.
This presentation is a slight adaptation of his
presentation material
(see http://www.cs.uiuc.edu/~hanj/ and the
following key reference for more details).

Internet Map

Tutorial Mining Heterogeneous Information Networks EGC2013 - Toulouse 2/17/2013

Outline
Introduction
Network representation and kinds of networks
Why mining heterogeneous information networks (HINs)?
Research work of Han's team on mining HINs
Combining ranking with clustering
Combining ranking with classification
Meta-path-based exploration of information networks
Role discovery and evolution analysis
Other contributions
Our current research on HINs
Conclusion
References

Introduction

Social networks
A social structure of nodes (e.g., individuals or
organizations) that are related to each other by various
ties such as friendship, affinity, and collaboration
Typical social networks
Social bookmarking (Del.icio.us)
Friendship networks (Facebook, Myspace)
Professional networks (LinkedIn)
Media sharing (Flickr, YouTube)
Folksonomy: collaborative tagging using three entities:
users, resources and tags

Introduction
Information network analysis (Sun & Han, 2012)
Database as an information network: entities and relationships
Focus on heterogeneous information networks since they
contain rich and inter-related semantics
Data mining (DM) techniques: clustering, classification, ranking,
similarity search, link prediction, trends and evolution analysis
Construction of semantically rich networks by exploring links
among node types through DM techniques
Many topics still need to be explored
Main reference

Network Representations
A network/graph: G = (V, E), where V: vertices/nodes, E:
edges/links

Adjacency matrix:
$A_{ij} = 1$ if there is an edge between vertices $i$ and $j$; 0 otherwise
Weighted graph:
Edges having weight (strength), usually a real number
Directed network (directed graph): if each edge has a direction
Labeled graph:
Edges have a label (e.g., creation date)
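
As a minimal sketch of these representations, a small graph can be stored as an edge list and converted into an adjacency matrix. The vertex names and weights below are made up for illustration; they do not come from the slides.

```python
# Hypothetical example graph: vertex names and weights are illustrative only.
vertices = ["a", "b", "c"]
# Edges as (source, target, weight) triples.
edges = [("a", "b", 1.0), ("b", "c", 2.5), ("c", "a", 0.5)]

def adjacency_matrix(vertices, edges, weighted=False, directed=True):
    """Return A with A[i][j] = 1 (or the edge weight) if there is an edge i -> j, else 0."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    A = [[0] * n for _ in range(n)]
    for src, dst, w in edges:
        i, j = index[src], index[dst]
        A[i][j] = w if weighted else 1
        if not directed:
            A[j][i] = A[i][j]  # an undirected edge is stored in both directions
    return A

A = adjacency_matrix(vertices, edges)                   # directed, unweighted
W = adjacency_matrix(vertices, edges, weighted=True)    # directed, weighted
U = adjacency_matrix(vertices, edges, directed=False)   # undirected view
```

Edge labels (such as a creation date) could be kept in a separate dictionary keyed by the (source, target) pair; that is one possible design, not something the slides prescribe.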

Information Network (IN)
A network where each node represents an entity (e.g., actor in a
social network) and each link a relationship between entities
Each node/link may have attributes, labels, and weights
Links may carry rich semantic information
Homogeneous networks
Single object type and single link type (one-mode data)
Web: a collection of linked Web pages
Heterogeneous or multi-typed networks
Multiple object and link types
Medical network: patients, doctors, diseases, treatments
Bibliographic network: publications, authors, venues

Information Network
Information network (Sun & Han, 2012)

Heterogeneous information network
When the number of object types or link types is more than 1

Homogeneous vs. Heterogeneous Networks

Figure panels: Co-author Network, Conference-Author Network

Mining Information Networks
Homogeneous networks can often be derived from their original
heterogeneous networks
E.g., co-author networks can be derived from author-paper-conference
networks by projection on authors only
Paper citation networks can be derived from a complete
bibliographic network with papers and citations projected
Heterogeneous INs (HINs) carry richer information than their
corresponding projected homogeneous networks
Typed HINs vs. non-typed HINs (i.e., not distinguishing different
types of nodes)
Typed nodes and links imply a more structured IN, and thus
often lead to more informative discovery
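
The projection described above can be sketched as follows. The paper records are hypothetical, and the function name `project_on_authors` is introduced here only for illustration. Note how the venue of each paper is discarded by the projection, which is the information loss the slide refers to.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bibliographic records: paper -> (venue, authors).
papers = {
    "p1": ("KDD", ["Ann", "Bob"]),
    "p2": ("KDD", ["Ann", "Bob", "Cat"]),
    "p3": ("VLDB", ["Cat", "Dan"]),
}

def project_on_authors(papers):
    """Project the author-paper-conference network on authors only.

    Returns a Counter mapping each (author, author) pair to the number
    of papers the two authors share. Venues are dropped in the process.
    """
    coauthor = Counter()
    for _venue, authors in papers.values():
        for a, b in combinations(sorted(set(authors)), 2):
            coauthor[(a, b)] += 1
    return coauthor

coauthor = project_on_authors(papers)
for pair, weight in sorted(coauthor.items()):
    print(pair, weight)
```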

Why Mining INs?
Information networks are everywhere!
Biological networks
Bibliographic networks: DBLP, ArXiv, PubMed, etc.
Social networks: Facebook > 100 million active users
World Wide Web (WWW): > 3 billion nodes, > 50 billion edges
Cyber-physical networks

Figure panels: the Web network, co-author network, yeast protein interaction network, social network sites

What Can Be Mined from Heterogeneous Networks?
DBLP: A Computer Science bibliographic database
> 1.8M papers, > 0.7M authors, > 10K venues, > 70K terms
(appearing more than once)

A sample publication record in DBLP

DBLP Network | Mining Functions
How are CS research areas structured? | Clustering
Who are the leading researchers on Web search? | Ranking
Who are the peer researchers of Jure Leskovec? | Similarity Search
Whom will Christos Faloutsos collaborate with in the future? | Relationship Prediction
Will an author publish a paper in KDD, and when? | Relationship Prediction with Time
Which types of relationships are most influential for an author to decide her topics? | Relation Strength Learning

Bipartite Graphs
$G = (V_1 \cup V_2,\ E \subseteq V_1 \times V_2)$
Incidence matrix $A$ where cell $A_{ij} = 1$ if there exists
a link between $i$ and $j$, 0 otherwise
One can convert two-mode network data into
two one-mode data sets, but with information loss
$A A^T$ gives the number of nodes in $V_2$ co-linked by both
the row and the column in $V_1$. E.g., two authors have
papers in both AAAI and ICML
$A^T A$ gives the number of nodes in $V_1$ which are linked
to both the row and the column in $V_2$. E.g., Jack and
Tracy have papers in one conference (SDM)
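
A small sketch of the two products, under stated assumptions: rows of the incidence matrix are taken to be venues (V1) and columns authors (V2), and the links are invented so that they reproduce the two examples on the slide. The authors Mike and Lucy are hypothetical.

```python
# Assumed layout: rows = venues (V1), columns = authors (V2).
venues = ["AAAI", "ICML", "SDM"]
authors = ["Jack", "Tracy", "Mike", "Lucy"]  # Mike and Lucy are hypothetical

# A[i][j] = 1 if author j has a paper in venue i (illustrative links).
A = [
    [0, 0, 1, 1],  # AAAI
    [0, 0, 1, 1],  # ICML
    [1, 1, 0, 0],  # SDM
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

At = transpose(A)
venue_by_venue = matmul(A, At)     # A A^T: authors shared by two venues
author_by_author = matmul(At, A)   # A^T A: venues shared by two authors

# Two authors have papers in both AAAI and ICML.
print(venue_by_venue[0][1])
# Jack and Tracy share one conference (SDM).
print(author_by_author[0][1])
```

Both one-mode matrices only keep counts, so the identity of the shared nodes is lost; this is the information loss mentioned above.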

Clustering and Ranking: Two Critical Functions
Clustering: not distinguishing objects in each cluster?
Ranking: comparing apples and oranges?
A better solution: integrating clustering with ranking
[Figure: objects A to J shown clustered, ranked globally (1-10), and ranked within each of two clusters (1-5)]

RankClus: Integrating Clustering with Ranking

Simple solution: project the bi-typed network into a
homogeneous conference network?
Information-loss projection!

A New Methodology: RankClus
Ranking as the feature of the cluster
Ranking is conditional on a specific cluster
E.g., VLDB's rank in Theory vs. its rank in the DB area
The distributions of ranking scores over objects are different in
each cluster
Clustering and ranking are mutually enhanced
Better clustering: rank distributions for clusters are more
distinguishable from each other
Better ranking: a better metric for objects is learned from the
ranking
Not every object should be treated equally in clustering!

Simple Ranking vs. Authority Ranking
Simple Ranking
Proportional to # of publications of an author / a conference
Considers only immediate neighborhood in the network

What about an author publishing 100
papers in very weak conferences?

Authority Ranking:
More sophisticated rank rules are needed
Propagate the ranking scores in the network over different
types

Rules for Authority Ranking
Rule 1: Highly ranked authors publish many papers in highly
ranked conferences

Rule 2: Highly ranked conferences attract many papers from
many highly ranked authors

Rule 3: The rank of an author is enhanced if he or she co-authors
with many highly ranked authors
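
The contrast between simple ranking and Rules 1 and 2 can be sketched with a mutual-reinforcement iteration on a conference-by-author count matrix. This is a simplified illustration, not the exact RankClus procedure: Rule 3 (co-author propagation) is left out, and the matrix `W` and the author names are hypothetical.

```python
# Hypothetical publication counts: W[c][a] = papers by author a in conference c.
conferences = ["TopConf", "WeakConf"]
authors = ["Ann", "Bob", "Cat", "Dan"]
W = [
    [4, 4, 4, 0],  # TopConf
    [0, 0, 1, 6],  # WeakConf
]

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def simple_ranking(W):
    """Rank authors proportionally to their number of publications."""
    n_auth = len(W[0])
    return normalize([sum(row[a] for row in W) for a in range(n_auth)])

def authority_ranking(W, iterations=100):
    """Alternate Rule 2 (conference scores) and Rule 1 (author scores)."""
    n_conf, n_auth = len(W), len(W[0])
    author_rank = [1.0 / n_auth] * n_auth
    conf_rank = [1.0 / n_conf] * n_conf
    for _ in range(iterations):
        # Rule 2: a conference is ranked high if highly ranked authors publish there.
        conf_rank = normalize(
            [sum(W[c][a] * author_rank[a] for a in range(n_auth)) for c in range(n_conf)]
        )
        # Rule 1: an author is ranked high if he or she publishes in highly ranked conferences.
        author_rank = normalize(
            [sum(W[c][a] * conf_rank[c] for c in range(n_conf)) for a in range(n_auth)]
        )
    return author_rank, conf_rank

simple = simple_ranking(W)
authority, conf_rank = authority_ranking(W)
print(dict(zip(authors, simple)))
print(dict(zip(authors, authority)))
```

In this toy data Dan has the most papers, so simple ranking puts him first, while the propagated scores put him last because his papers are all in the lower-ranked conference.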

RankClus: Algorithm Framework
Initialization
  Randomly partition
Repeat
  Ranking
    Ranking objects in each sub-network induced from each cluster
  Generating new measure space
    Estimate mixture model coefficients for each target object
  Adjusting cluster
Until stable
[Figure labels: Sub-Network, Ranking, Clustering]

Step-by-Step Running Case Illustration

Initially, ranking distributions are mixed together: two clusters of
objects are mixed together, but preserve similarity somehow
Improved a little: two clusters are almost well separated
Improved significantly: well separated
Stable

RankClus: Clustering & Ranking CS Conferences

Top-10 conferences in 5 clusters using RankClus in DBLP (when k = 15)

RankClus outperforms spectral clustering [Shi and Malik, 2000]
algorithms on projected homogeneous networks

Time Complexity: Linear in # of Links
At each iteration, |E|: edges in network, m: number of target
objects, K: number of clusters
Ranking for sparse network
~O(|E|)
Mixture model estimation
~O(K|E| + mK)
Cluster adjustment
~O(mK^2)
In all, linear w.r.t. |E|
~O(K|E|)
Note: SimRank will be at least quadratic at each iteration since it
evaluates distance between every pair in the network

NetClus [KDD'09]: Beyond Bi-Typed Networks
Beyond bi-typed information network
A star network schema [richer information]
Split a network into different layers
Each represented by a network cluster

Multi-Typed Networks Lead to Better Results
The network cluster for database area: Conferences, Authors,
and Terms
Better clustering and ranking than RankClus

NetClus vs. RankClus: 16% higher accuracy on conference
clustering

NetClus: Database System Cluster

Terms
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707

Conferences
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849

Authors
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185

Ranking authors in XML

Interesting Results from Other Domains

RankCompete: Organize your photo album automatically!

Rank treatments for AIDS from MEDLINE

From RankClus to RankClass
RankClus [EDBT'09]: Clustering and ranking working together
No training, no available class labels, no expert knowledge
RankClass [KDD'11]: Integration of ranking and classification
Ranking: informative understanding & summary of each
class
Class membership is critical information when ranking
objects
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects
within each class

Classification Generates Good Ranking Results
DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent rankings within each class
List objects with the highest confidence measure belonging to conf. & terms

Top-5 ranked conferences
Database | Data Mining | AI | IR
VLDB | KDD | IJCAI | SIGIR
SIGMOD | SDM | AAAI | ECIR
ICDE | ICDM | ICML | CIKM
PODS | PKDD | CVPR | WWW
EDBT | PAKDD | ECML | WSDM

Top-5 ranked terms
Database | Data Mining | AI | IR
data | mining | learning | retrieval
database | data | knowledge | information
query | clustering | reasoning | web
system | classification | logic | search
xml | frequent | cognition | text

Similarity Search: Find Similar Objects in Networks
DBLP
Who are the most similar to Christos Faloutsos?
IMDB
Which movies are the most similar to "Little Miss
Sunshine"?
E-Commerce
Which products are the most similar to Kindle?

How to systematically answer these questions?

Study similarity search in heterogeneous networks

Y. Sun, J. Han, X. Yan, P. S. Yu, and Tianyi Wu, "PathSim:
Meta Path-Based Top-K Similarity Search in
Heterogeneous Information Networks", VLDB'11

Network Schema and Meta-Path
Network schema
Meta-level description of a network

Meta-Path
Meta-level description of a path between two objects
A path on network schema
Denotes an existing or concatenated relation between two
object types
Path instances: Jim-P1-Ann, Mike-P2-Ann, Mike-P3-Bob
Meta-path
Relation: Co-authorship (describes the type of relationships)
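
To make the link between path instances and a meta-path concrete, the sketch below counts the path instances that follow a given meta-path from a start object. It reuses the three instances from the slide (Jim-P1-Ann, Mike-P2-Ann, Mike-P3-Bob); the data layout and the function name `count_path_instances` are assumptions made for this example.

```python
from collections import defaultdict

# Author -> papers, taken from the path instances on the slide.
author_paper = {
    "Jim": ["P1"],
    "Ann": ["P1", "P2"],
    "Mike": ["P2", "P3"],
    "Bob": ["P3"],
}

# Build the reverse relation Paper -> authors.
paper_author = defaultdict(list)
for author, paper_list in author_paper.items():
    for paper in paper_list:
        paper_author[paper].append(author)

# Typed links: (source type, target type) -> {source object: [target objects]}.
links = {
    ("Author", "Paper"): author_paper,
    ("Paper", "Author"): dict(paper_author),
}

def count_path_instances(links, meta_path, start):
    """Count path instances following meta_path from start, per end object."""
    counts = {start: 1}
    for src_type, dst_type in zip(meta_path, meta_path[1:]):
        step = links[(src_type, dst_type)]
        next_counts = defaultdict(int)
        for obj, n in counts.items():
            for neighbor in step.get(obj, []):
                next_counts[neighbor] += n
        counts = dict(next_counts)
    return counts

apa = ["Author", "Paper", "Author"]  # the co-authorship meta-path
print(count_path_instances(links, apa, "Mike"))
```

The count for the start object itself (here Mike to Mike) is the number of his papers, since each paper yields one round trip along Author-Paper-Author.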

Different Meta-Paths Tell Different Semantics
Who are most similar to Christos Faloutsos?

Meta-Path: Author-Paper-Author
Christos's students or close collaborators
Meta-Path: Author-Paper-Venue-Paper-Author
Work on similar topics and have similar reputation

One Meta-Path Is Better Than Others
Which pictures are most similar to [image]?

Network schema: Image, Tag, Group, User

Meta-Path: Image-Tag-Image
Evaluate the similarity between images according to their linked tags
Meta-Path: Image-Tag-Image-Group-Image-Tag-Image
Evaluate the similarity between images according to tags and groups

PathSim: Similarity in Terms of "Peers"
Why peers?
Strongly connected, while similar visibility
E.g., Amazon Kindle, B&N Nook, Sony Reader, Kobo eReader

In addition to meta-path
Need to consider similarity measures

Existing Similarity Measures

Note: P-PageRank and SimRank do not
distinguish object type and relationship type

Only PathSim Can Find Peers
Limitations of Existing Measures
Random walk (RW): Favors highly visible objects
objects with large degrees
Pairwise random walk (PRW): Favors "pure" objects
objects with highly skewed distribution in their in-links or out-links
PathSim
Favors "peers": objects with strong connectivity and similar
visibility under the given meta-path
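
As a sketch of why PathSim favors peers, the code below compares it with a two-step random walk under the symmetric meta-path Author-Venue-Author. For a symmetric meta-path, PathSim is defined in the PathSim paper (Sun et al., VLDB'11) as s(x, y) = 2 |paths x~y| / (|paths x~x| + |paths y~y|). The publication counts are illustrative numbers chosen for this example, and the function names are assumptions.

```python
# Illustrative author-by-venue publication counts (not real DBLP data).
venues = ["SIGMOD", "VLDB", "ICDE", "KDD"]
counts = {
    "Mike": [2, 1, 0, 0],
    "Jim": [50, 20, 0, 0],
    "Mary": [2, 0, 1, 0],
    "Bob": [2, 1, 0, 0],
    "Ann": [0, 0, 1, 1],
}

def path_count(x, y):
    """Number of Author-Venue-Author path instances between x and y."""
    return sum(a * b for a, b in zip(counts[x], counts[y]))

def pathsim(x, y):
    """PathSim: path count normalized by the visibility of both objects."""
    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0

def random_walk(x, y):
    """Probability that a walk author -> venue -> author starting at x ends at y."""
    out_degree = sum(counts[x])
    prob = 0.0
    for v in range(len(venues)):
        venue_degree = sum(row[v] for row in counts.values())
        if out_degree and venue_degree:
            prob += (counts[x][v] / out_degree) * (counts[y][v] / venue_degree)
    return prob

for other in ["Jim", "Mary", "Bob", "Ann"]:
    print(other, round(pathsim("Mike", other), 4), round(random_walk("Mike", other), 4))
```

With these numbers the random walk from Mike lands mostly on Jim, the most visible author, whereas PathSim gives the top score to Bob, whose publication profile matches Mike's.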

Comparing Similarity Measures in DBLP Data
Which venues are most
similar to DASFAA?

Favor highly visible objects

Which venues are most
similar to SIGMOD?

Are these tiny forums most similar to SIGMOD?

Find Academic Peers by PathSim

Meta-Path: Author-Paper-Venue-Paper-Author

Anhai Doan: CS, Wisconsin; Database area; PhD: 2002
Jignesh Patel: CS, Wisconsin; Database area; PhD: 1998
Amol Deshpande: CS, Maryland; Database area; PhD: 2004
Jun Yang: CS, Duke; Database area; PhD: 2001

PathPredict: Meta-Path-Based Relationship Prediction
Wide applications
Whom should I collaborate with?
Which paper should I cite for this topic?
Whom else should I follow on Twitter?
Will Ann buy the book "Steve Jobs"?
Will Bob click the ad on hotel?

Relationship Prediction vs. Link Prediction
Link prediction in homogeneous networks [Liben-Nowell and
Kleinberg, 2003; Hasan et al., 2006]
E.g., friendship prediction

Relationship prediction in heterogeneous networks
Different types of relationships need different prediction
models
Different connection paths need to be treated separately!
Meta-path-based approach to define topological features.

Why Prediction Using Heterogeneous Info Networks?

Meta-Path-Based Co-authorship Prediction in DBLP

Co-authorship prediction problem
Whether two authors are going to collaborate for the first time
Co-authorship encoded in meta-path
Author-Paper-Author
Topological features encoded in meta-paths

Table columns: Meta-Path, Semantic Meaning
Meta-paths between authors under length 4

Case Study: Predicting Concrete Co-Authors
High quality predictive power for such a difficult task

Using data in T0 = [1989; 1995] and
T1 = [1996; 2002]
Predict new co-author relationship
in T2 = [2003; 2009]

When Will It Happen?
From "whether" to "when"
"Whether": Will Jim rent the movie "Avatar" in Netflix?
Within 1 month? 3 months? 1 year? Need to build
different models!
"When": When will Jim rent the movie "Avatar"?
What is the probability Jim will rent "Avatar" within 2
months?
By when will Jim rent "Avatar" with 90% probability?
What is the expected time it will take for Jim to rent
"Avatar"?
Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It
Happen? Relationship Prediction in Heterogeneous Information
Networks", WSDM'12, Feb. 2012
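
Once a distribution over the relationship building time is available, the three "when" questions reduce to a cumulative probability, a quantile, and a mean. The sketch below assumes, purely for illustration, an exponential distribution with a hypothetical mean of 6 months; the slides do not specify this model or this value.

```python
import math

# Assumption for illustration only: the time (in months) until Jim rents
# the movie is exponentially distributed with a hypothetical mean of 6 months.
MEAN_MONTHS = 6.0

def prob_within(t, mean=MEAN_MONTHS):
    """P(T <= t): probability that the relationship forms within t months."""
    return 1.0 - math.exp(-t / mean)

def time_for_probability(p, mean=MEAN_MONTHS):
    """Smallest t such that P(T <= t) = p (the p-quantile)."""
    return -mean * math.log(1.0 - p)

def expected_time(mean=MEAN_MONTHS):
    """E[T] for the exponential distribution is its mean parameter."""
    return mean

print(prob_within(2))              # probability of renting within 2 months
print(time_for_probability(0.9))   # by when, with 90% probability
print(expected_time())             # expected time in months
```

A single fitted distribution answers all three questions at once, which is the advantage over training a separate "whether" model for each time window.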

Role Discovery in Networks: Why Does It Matter?
Army communication
network (imaginary)

Automatically infer: Commander, Captain, Soldier

Role Discovery
Objective: Extract semantic meaning from plain links to finely
model and better organize information networks
Challenges
Latent semantic knowledge
Interdependency
Scalability
Opportunity
Human intuition
Realistic constraint
Cross-check with collective intelligence
Methodology: propagate simple intuitive rules and constraints
over the whole network

Discovery of Advisor-Advisee
Relationships in DBLP Network
Input: DBLP research publication network
Output: Potential advising relationship and its ranking (r, [st, ed])
Ref. C. Wang, J. Han, et al., "Mining Advisor-Advisee
Relationships from Research Publication Networks", SIGKDD 2010
[Figure: input temporal collaboration network (1999-2004) among Ada, Bob, Jerry, Ying, and Smith; output relationship analysis with ranked candidates such as (0.9, [/, 1998]) and (0.8, [1999, 2000]); visualized chronological hierarchies]

Mining Evolution and Dynamics in HINs
Many networks come with time information
E.g., according to paper publication year, DBLP networks can form network
sequences
Motivation: Model evolution of communities in heterogeneous network
Automatically detect the best number of communities in each timestamp
Model the smoothness between communities of adjacent timestamps
Model the evolution structure explicitly
Birth, death, split
EvoNetClus: Modeling evolution of dynamic heterogeneous networks
Co-evolution within a community
heterogeneous multi-typed objects/links
Discovery of evolution structures among different communities
Y. Sun, et al., "Studying Co-Evolution of Multi-Typed Objects in Dynamic
Heterogeneous Information Networks", MLG'10

Evolution: Idea Illustration
From network sequences to evolutionary communities

Related Work
Backstrom & Leskovec (2011)
Supervised random walks for link prediction and
recommendation in HINs
Dong et al. (2012)
A ranking factor graph model (RFG) for predicting links in
HINs
Tang et al. (2008, 2012)
Community evolution in HINs (called multi-mode networks)
using a clustering method on evolving networks
Davis et al. (2011)
Link prediction in HINs using an extension to the Adamic/Adar
measure and exploiting classification

Our Current Work
Goal
Mining heterogeneous information networks
Approach
Exploit the potential and theoretical basis of formal concept
analysis (FCA) and two of its extensions to manage
multidimensionality and heterogeneity in networks
Use and adapt a set of findings on concept pruning,
core/peripheral node identification, network partitioning
(e.g., biclustering and triclustering), and taxonomy-based mining
to better analyze large networks and extract rich patterns
such as groups and association rules

Our Current Work
Networks under study
Analysis of a HIN with an interaction network together with
an affiliation one for link prediction and recommendation
Generalize the approach to an arbitrary HIN
Analysis of a network with tridimensional and even
n-dimensional data using triadic (and later on polyadic)
concept analysis
Exploration of our previous work on formal concept analysis
(e.g., implications with negation, attribute/object
generalization, operations on lattices) to detect richer and
user-oriented patterns in HINs

Our Current Work
Link prediction and recommendation
Add a new node Ni and at least one link in a HIN with two types
of nodes and links
Use formal concept analysis together with concept (cluster)
pruning and weighting to suggest a set of links to be added
between the new node and existing nodes

[Figure labels: Links to recommend, Alan, New node]

Our Current Work
Networks with tridimensional data
Objects with attributes under conditions
E.g., events (1..5), researchers (P, N, R, K, S) and roles (a, b, c, d)
a: speaker (at a given event), b: organizer
c: author, d: PC member
E.g., Researcher K attends Event 2 with two roles: author and PC
member

Our Current Work
Triadic concept analysis (Lehmann & Wille, 1995)

Fritz Lehmann and R. Wille. A Triadic Approach to Formal Concept Analysis. In
ICCS, p. 32-43, 1995.

Our Current Work

E.g., the triadic concept (345, RK, ab) means that Events 3, 4 & 5
attract Researchers R & K with roles a and b
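
A triadic concept is a maximal "box" inside a ternary relation: every (event, researcher, role) combination of its three components belongs to the relation, and no component can be enlarged without breaking that property. The sketch below checks this on a hypothetical relation; the actual context table from the slides was an image and is not reproduced here, so the triples are assumptions that only agree with the two examples given in the text.

```python
from itertools import product

EVENTS = {1, 2, 3, 4, 5}
RESEARCHERS = {"P", "N", "R", "K", "S"}
ROLES = {"a", "b", "c", "d"}  # a: speaker, b: organizer, c: author, d: PC member

# Hypothetical ternary relation (event, researcher, role).
relation = set(product({3, 4, 5}, {"R", "K"}, {"a", "b"}))
relation |= {(2, "K", "c"), (2, "K", "d"), (1, "P", "a")}

def is_box(events, researchers, roles):
    """True if every combination of the three components is in the relation."""
    return all(t in relation for t in product(events, researchers, roles))

def is_triadic_concept(events, researchers, roles):
    """True if the box is in the relation and cannot be extended in any component."""
    if not is_box(events, researchers, roles):
        return False
    can_grow = (
        any(is_box(events | {e}, researchers, roles) for e in EVENTS - events)
        or any(is_box(events, researchers | {r}, roles) for r in RESEARCHERS - researchers)
        or any(is_box(events, researchers, roles | {x}) for x in ROLES - roles)
    )
    return not can_grow

print(is_triadic_concept({3, 4, 5}, {"R", "K"}, {"a", "b"}))  # maximal box
print(is_triadic_concept({3, 4}, {"R", "K"}, {"a", "b"}))     # a box, but Event 5 can be added
```

Checking single-element extensions is enough for maximality, because any strictly larger box contains a one-element extension of the current one.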

Our Current Work
Triadic association rules
E.g., any role (e.g., speaker) played by S is also played by P
$N \xrightarrow{ad} P$: whenever researcher N attends events as
a speaker and PC member, P does so

Rokia Missaoui & Léonard Kwuida. Mining Triadic Association Rules from Ternary Relations.
ICFCA 2011, p. 204-218.

Conclusion (Sun & Han, 2012)
Rich knowledge can be mined from information networks
What is the magic?
Heterogeneous, structured information networks!
Clustering, ranking and classification: Integrated clustering,
ranking and classification: RankClus, RankClass, etc.
Meta-path-based similarity search and relationship prediction
Role discovery and evolutionary analysis
Knowledge is power, but knowledge is hidden in massive links!
Mining heterogeneous information networks: Much more to be
explored!!

Future Research (Sun & Han, 2012)
Discovering ontology and structure in information networks
Discovering and mining hidden information networks
Mining information networks formed by structured data linking
with unstructured data (text, multimedia and Web)
Mining cyber-physical networks (networks formed by dynamic
sensors, image/video cameras, with information networks)
Enhancing the power of knowledge discovery by transforming
massive unstructured data: Incremental information extraction,
role discovery, multi-dimensional structured info-net
Mining noisy, uncertain, untrustworthy massive data sets by
information network analysis approach
Turning Wikipedia and/or Web into structured or semi-structured
databases by heterogeneous information network analysis

References: Books on Network Analysis
A.-L. Barabasi. Linked: How Everything Is Connected to Everything Else and What It Means. Plume,
2003.
M. Buchanan. Nexus: Small Worlds and the Groundbreaking Theory of Networks. W. W. Norton &
Company, 2003.
D. J. Cook and L. B. Holder. Mining Graph Data. John Wiley & Sons, 2007.
S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann,
2003.
A. Degenne and M. Forse. Introducing Social Networks. Sage Publications, 1999.
P. J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis.
Cambridge University Press, 2005.
J. Davies, D. Fensel, and F. van Harmelen. Towards the Semantic Web: Ontology-Driven
Knowledge Management. John Wiley & Sons, 2003.
D. Fensel, W. Wahlster, H. Lieberman, and J. Hendler. Spinning the Semantic Web: Bringing the
World Wide Web to Its Full Potential. MIT Press, 2002.
L. Getoor and B. Taskar (eds.). Introduction to Statistical Relational Learning. MIT Press, 2007.
B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
J. P. Scott. Social Network Analysis: A Handbook. Sage Publications, 2005.
D. J. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton & Company, 2003.
D. J. Watts. Small Worlds: The Dynamics of Networks between Order and Randomness. Princeton
University Press, 2003.
S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge
University Press, 1994.

References: Some Overview Papers
T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, May
2001.
C. Cooper and A. Frieze. A general model of web graphs. Algorithms, 22, 2003.
S. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM
Comput. Surv., 38, 2006.
T. Dietterich, P. Domingos, L. Getoor, S. Muggleton, and P. Tadepalli. Structured
machine learning: The next ten years. Machine Learning, 73, 2008.
S. Dumais and H. Chen. Hierarchical classification of web content. SIGIR'00.
S. Dzeroski. Multirelational data mining: An introduction. ACM SIGKDD Explorations,
July 2003.
L. Getoor. Link mining: a new data mining challenge. SIGKDD Explorations, 5:84-89,
2003.
L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of
relational structure. ICML'01.
D. Jensen and J. Neville. Data mining in networks. In Papers of the Symp. Dynamic
Social Network Modeling and Analysis, National Academy Press, 2002.
T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD
Explorations, 5, 2003.

References: Some Influential Papers
A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins,
and J. L. Wiener. Graph structure in the web. Computer Networks, 33, 2000.
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine.
WWW'98.
S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D.
Gibson, and J. M. Kleinberg. Mining the web's link structure. COMPUTER, 32, 1999.
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet
topology. ACM SIGCOMM'99.
M. Girvan and M. E. J. Newman. Community structure in social and biological networks.
In Proc. Natl. Acad. Sci. USA 99, 2002.
B. A. Huberman and L. A. Adamic. Growth dynamics of world-wide web. Nature,
399:131, 1999.
G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. KDD'02.
D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a
social network. KDD'03.
J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a
graph: Measurements, models, and methods. COCOON'99.
J. M. Kleinberg. Small world phenomena and the dynamics of information. NIPS'01.
R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal.
Stochastic models for the web graph. FOCS'00.
M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45,
2003.

References: Clustering and Ranking (1)
E. Airoldi, D. Blei, S. Fienberg and E. Xing, "Mixed Membership Stochastic Blockmodels",
JMLR'08.
Liangliang Cao, Andrey Del Pozo, Xin Jin, Jiebo Luo, Jiawei Han, and Thomas S. Huang,
"RankCompete: Simultaneous Ranking and Clustering of Web Photos", WWW'10.
G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity", KDD'02.
Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han, "Community
Outliers and their Efficient Detection in Information Networks", KDD'10.
M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in
networks", Physical Review E, 2004.
M. E. J. Newman and M. Girvan, "Fast algorithm for detecting community structure in
networks", Physical Review E, 2004.
J. Shi and J. Malik, "Normalized cuts and image segmentation", CVPR'97.
Yizhou Sun, Yintao Yu, and Jiawei Han, "Ranking-Based Clustering of Heterogeneous
Information Networks with Star Network Schema", KDD'09.
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu,
"RankClus: Integrating Clustering with Ranking for Heterogeneous Information
Network Analysis", EDBT'09.

References: Clustering and Ranking (2)
Yizhou Sun, Jiawei Han, Jing Gao, and Yintao Yu, "iTopicModel: Information Network-Integrated
Topic Modeling", ICDM'09.
Yizhou Sun, Charu C. Aggarwal, and Jiawei Han, "Relation Strength-Aware Clustering of
Heterogeneous Information Networks with Incomplete Attributes", PVLDB 5(5), 2012.
A. Wu, M. Garland, and J. Han, "Mining scale-free networks using geodesic clustering", KDD'04.
Z. Wu and R. Leahy, "An optimal graph theoretic approach to data clustering: Theory and its
application to image segmentation", IEEE Trans. Pattern Anal. Mach. Intell., 1993.
X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for
networks", KDD'07.
Xiaoxin Yin, Jiawei Han, Philip S. Yu, "LinkClus: Efficient Clustering via Heterogeneous Semantic
Links", VLDB'06.
Yintao Yu, Cindy X. Lin, Yizhou Sun, Chen Chen, Jiawei Han, Binbin Liao, Tianyi Wu, ChengXiang
Zhai, Duo Zhang, and Bo Zhao, "iNextCube: Information Network-Enhanced Text Cube", VLDB'09
(demo).
X. Yin, J. Han, and P. S. Yu, "Cross-relational clustering with user's guidance", KDD'05.

References: Network Classification (1)
A. Appice, M. Ceci, and D. Malerba, "Mining model trees: A multi-relational approach",
ILP'03.
Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, and Jiawei Han, "Bipartite Graph-based
Consensus Maximization among Supervised and Unsupervised Models", NIPS'09.
L. Getoor, N. Friedman, D. Koller and B. Taskar, "Learning Probabilistic Models of Link
Structure", JMLR'02.
L. Getoor, E. Segal, B. Taskar and D. Koller, "Probabilistic Models of Text and Link
Structure for Hypertext Classification", IJCAI WS Text Learning: Beyond Classification,
2001.
L. Getoor, N. Friedman, D. Koller, and A. Pfeffer, "Learning Probabilistic Relational
Models", chapter in Relational Data Mining, eds. S. Dzeroski and N. Lavrac, 2001.
M. Ji, Y. Sun, M. Danilevsky, J. Han, and J. Gao, "Graph-based classification on
heterogeneous information networks", ECMLPKDD'10.
M. Ji, J. Han, and M. Danilevsky, "Ranking-based Classification of Heterogeneous
Information Networks", KDD'11.
Q. Lu and L. Getoor, "Link-based classification", ICML'03.
D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks",
CIKM'03.

References: Network Classification (2)
J. Neville, B. Gallagher, and T. Eliassi-Rad, "Evaluating statistical tests for within-network
classifiers of relational data", ICDM'09.
J. Neville, D. Jensen, L. Friedland, and M. Hay, "Learning relational probability trees",
KDD'03.
Jennifer Neville, David Jensen, "Relational Dependency Networks", JMLR'07.
M. Szummer and T. Jaakkola, "Partially labeled classification with Markov random walks",
In NIPS, volume 14, 2001.
M. J. Rattigan, M. Maier, and D. Jensen, "Graph clustering with network structure
indices", ICML'07.
P. Sen, G. M. Namata, M. Galileo, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad,
"Collective classification in network data", AI Magazine, 29, 2008.
B. Taskar, E. Segal, and D. Koller, "Probabilistic classification and clustering in relational
data", IJCAI'01.
B. Taskar, P. Abbeel, M. F. Wong, and D. Koller, "Relational Markov Networks", chapter
in L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, 2007.
X. Yin, J. Han, J. Yang, and P. S. Yu, "CrossMine: Efficient Classification across Multiple
Database Relations", ICDE'04.
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with local and
global consistency", In NIPS 16, Vancouver, Canada, 2004.
X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label
propagation", Technical Report, 2002.

References: Social Network Analysis
B. Aleman-Meza, M. Nagarajan, C. Ramakrishnan, L. Ding, P. Kolari, A. P. Sheth, I. B.
Arpinar, A. Joshi, and T. Finin. Semantic analytics on social networks: experiences in
addressing the problem of conflict of interest detection. WWW'06.
R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu. Mining newsgroups using networks
arising from social behavior. WWW'03.
P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. WWW'04.
D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community mining from multi-relational
networks. PKDD'05.
P. Domingos. Mining social networks for viral marketing. IEEE Intelligent Systems, 20,
2005.
P. Domingos and M. Richardson. Mining the network value of customers. KDD'01.
P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web
community portals: A top-down, compositional, and incremental approach. VLDB'07.
G. Flake, S. Lawrence, C. L. Giles, and F. Coetzee. Self-organization and identification of
web communities. IEEE Computer, 35, 2002.
J. Kubica, A. Moore, and J. Schneider. Tractable group detection on large link data sets.
ICDM'03.

References: Link and Relationship Prediction
V. Leroy, B. B. Cambazoglu, and F. Bonchi, "Cold start link prediction", KDD'10.
D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social
networks", CIKM'03.
R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla, "New perspectives and
methods in link prediction", KDD'10.
Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal and Jiawei Han,
"Co-Author Relationship Prediction in Heterogeneous Bibliographic
Networks", ASONAM'11.
Yizhou Sun, Jiawei Han, Charu C. Aggarwal, and Nitesh V. Chawla, "When Will
It Happen? Relationship Prediction in Heterogeneous Information
Networks", WSDM'12.
B. Taskar, M. F. Wong, P. Abbeel, and D. Koller, "Link prediction in relational
data", NIPS'03.
Xiao Yu, Quanquan Gu, Mianwei Zhou, and Jiawei Han, "Citation Prediction in
Heterogeneous Bibliographic Networks", SDM'12.

References: Role Discovery and Summarization
D. Archambault, T. Munzner, and D. Auber, "Topolayout: Multilevel graph layout by topological
features", IEEE Trans. Vis. Comput. Graph, 2007.
Xin Jin, Jiebo Luo, Jie Yu, Gang Wang, Dhiraj Joshi, and Jiawei Han, "iRIN: Image Retrieval in
Image-Rich Information Networks", WWW'10 (demo paper).
Lu Liu, Feida Zhu, Chen Chen, Xifeng Yan, Jiawei Han, Philip Yu, and Shiqiang Yang, "Mining
Diversity on Networks", DASFAA'10.
Y. Tian, R. A. Hankins, and J. M. Patel, "Efficient aggregation for graph summarization", SIGMOD'08.
Chi Wang, Jiawei Han, Yuntao Jia, Jie Tang, Duo Zhang, Yintao Yu, and Jingyi Guo, "Mining
Advisor-Advisee Relationships from Research Publication Networks", KDD'10.
Zhijun Yin, Manish Gupta, Tim Weninger and Jiawei Han, "LINKREC: A Unified Framework for Link
Recommendation with User Attributes and Graph Structure", WWW'10.
Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han, "Graph Cube: On Warehousing and OLAP
Multidimensional Networks", SIGMOD'11.
Peixiang Zhao and Jiawei Han, "On Graph Query Optimization in Large Networks", Proc. 2010 Int.
Conf. on Very Large Data Bases (VLDB'10), Singapore, Sept. 2010.

References: Network Evolution
L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in
large social networks: Membership, growth, and evolution. KDD'06.
M.-S. Kim and J. Han. A particle-and-density based evolutionary clustering
method for dynamic networks. VLDB'09.
J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification
laws, shrinking diameters and possible explanations. KDD'05.
Yizhou Sun, Jie Tang, Jiawei Han, Manish Gupta, Bo Zhao. Community
Evolution Detection in Dynamic Heterogeneous Information Networks. KDD
MLG'10.
