INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY

ISBN: 378 - 26 - 138420 - 5

INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT

www.iaetsd.in

Similarity Search in Information Networks using Meta-Path Based between Objects

Abstract: Real-world physical and abstract data objects are interconnected, forming enormous interconnected networks. By structuring these data objects and the interactions between them into multiple types, such networks become semi-structured heterogeneous information networks. Therefore, the quality analysis of large heterogeneous information networks poses new challenges. In the current system, a generalized flow-based method is introduced for measuring relationships on Wikipedia by reflecting all three concepts: distance, connectivity, and co-citation. With the current approach we measure only the relationships between objects rather than their similarities. To address these problems, we introduce a novel meta-path-based similarity search approach for heterogeneous information networks. This framework supports similarity search and other mining tasks on the network structure.

Index terms: similarity search, information network, meta-path based, clustering.

I. INTRODUCTION

Similarity search, which aims at locating the most relevant information for a query in large collections of datasets, has been widely studied in many applications. For example, in spatial databases, people are interested in finding the k nearest neighbors for a given spatial object. Object similarity is also one of the most primitive concepts for object clustering and many other data mining functions.

In a similar context, it is critical to provide effective similarity search functions in information networks, to find similar entities for a given entity. In a network of tagged images such as Flickr, a user may be interested in searching for the most similar pictures for a given picture. In an e-commerce system, a user would be interested in searching for the most similar products for a given product. Different from attribute-based similarity search, links play an important role for similarity search in information networks, especially when the full information about attributes for objects is difficult to obtain.

There are a few studies leveraging link information in networks for similarity search, but most of these studies focus on homogeneous or bipartite networks, such as P-PageRank and SimRank. These similarity measures disregard the subtlety of different types among objects and links. Adoption of such measures to heterogeneous networks has significant drawbacks: even if we just want to compare objects of the same type, going through link paths of different types leads to rather
different semantic meanings, and it makes little sense to mix them up and measure the similarity without distinguishing their semantics.

To systematically distinguish the semantics among paths connecting two objects, we introduce a meta-path-based similarity framework for objects of the same type in a heterogeneous network. A meta-path is a sequence of relations between object types, which defines a new composite relation between its starting type and ending type. The meta-path framework provides a powerful mechanism for a user to select appropriate similarity semantics, by choosing a proper meta-path or by learning it from a set of training examples of similar objects.

We present the meta-path-based similarity framework and relate it to two well-known existing link-based similarity functions for homogeneous information networks. We define a novel similarity measure, PathSim, that is able to find peer objects that are not only strongly connected with each other but also share similar visibility in the network. Moreover, we propose an efficient algorithm to support online top-k queries for such similarity search.

II. A META-PATH BASED SIMILARITY MEASURE

The similarity between two objects in a link-based similarity function is determined by how the objects are connected in a network, which can be described using paths. In a heterogeneous information network, due to the heterogeneity of the types of links, the ways to connect two objects can be much more diverse. The schema connections represent different relationships between authors, each having a different semantic meaning.

Now the question is: given an arbitrary heterogeneous information network, is there any way to systematically identify all the possible connection types between two object types? In order to do so, we propose two important concepts in the following.

a) Network Schema And Meta-Path

First, given a complex heterogeneous information network, it is necessary to provide its meta-level description for better understanding the network. Therefore, we propose the concept of network schema to describe the meta structure of a network.

The concept of network schema is similar to that of the Entity Relationship model in database systems, but it captures only the entity types and their binary relations, without considering the attributes of each entity type. A network schema serves as a template for a network, and tells how many types of objects there are in the network and where the possible links exist.

b) Bibliographic Schema and Meta-Path

For the bibliographic network schema, where an explicitly shows the direction of a relation.

III. META-PATH BASED SIMILARITY FRAMEWORK

Given a user-specified meta-path, several similarity measures can be defined for a pair of objects, according to the path instances
between them following the meta-path. Several straightforward measures are listed in the following.

Path count: the number of path instances between objects.

Random walk: s(x, y) is the probability of the random walk that starts from x and ends with y following meta-path P, which is the sum of the probabilities of all the path instances.

Pairwise random walk: for a meta-path P that can be decomposed into two shorter meta-paths with the same length, s(x, y) is the pairwise random walk probability starting from objects x and y and reaching the same middle object.

In general, we can define a meta-path-based similarity framework for two objects x and y.

Note that P-PageRank and SimRank, two well-known network similarity functions, are weighted combinations of the random walk measure and the pairwise random walk measure, respectively, over meta-paths with different lengths in homogeneous networks. In order to use P-PageRank and SimRank in heterogeneous information networks, the meta-paths of interest must be specified.

a) A Novel Similarity Measure

Several similarity measures have been presented, but they are biased toward either highly visible objects or highly concentrated objects and cannot capture the semantics of peer similarity. However, in many scenarios, finding similar objects in networks means finding similar peers, such as finding similar authors based on their fields and reputation, finding similar actors based on their movie styles and productivity, and finding similar products.

This motivated us to propose a new meta-path-based similarity measure, called PathSim, that captures the subtlety of peer similarity. The insight behind it is that two similar peer objects should not only be strongly connected, but also share comparable visibility. As the peer relation should be symmetric, we confine PathSim to symmetric meta-paths. The calculation of PathSim between any two objects of the same type given a certain meta-path involves matrix multiplication.

In this paper, we only consider meta-paths in the round-trip form, to guarantee their symmetry and therefore the symmetry of the PathSim measure.

Properties of PathSim

1. Symmetric.
2. Self-maximum.
3. Balance of visibility.

Using meta-path-based similarity, we can define similarity between two objects given any round-trip meta-path.

As primary eigenvectors can be used as authority ranking of objects, the similarity between two objects under an infinite meta-path can be viewed as a measure defined on their rankings. Two objects with more similar ranking scores will have higher similarity. In the next section we discuss online query processing for a single meta-path.
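As an illustration of the three straightforward measures listed in Section III, the following minimal Python sketch computes path count, random walk, and pairwise random walk for a round-trip meta-path of the form author-venue-author. The toy matrix AUTHOR_VENUE, the function names, and the formulation as products of the relation matrix and its row-normalized form are assumptions made for this example; they are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*m)]


def row_normalize(m):
    """Turn each row into transition probabilities (all-zero rows stay zero)."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


# Hypothetical toy data: 3 authors x 2 venues, entry = number of papers.
AUTHOR_VENUE = [
    [2, 0],
    [1, 1],
    [0, 3],
]


def path_count(w):
    """Number of author-venue-author path instances for every author pair."""
    return matmul(w, transpose(w))


def random_walk(w):
    """Probability of walking from x to y along author-venue-author."""
    return matmul(row_normalize(w), row_normalize(transpose(w)))


def pairwise_random_walk(w):
    """Probability that walkers from x and y meet at the same venue."""
    u = row_normalize(w)
    return matmul(u, transpose(u))


if __name__ == "__main__":
    print(path_count(AUTHOR_VENUE))
    print(random_walk(AUTHOR_VENUE))
    print(pairwise_random_walk(AUTHOR_VENUE))
```

In this sketch, path count and pairwise random walk are symmetric in x and y, while the random walk measure in general is not, because the two directions are normalized by different row sums.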

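The PathSim measure of Section III can be sketched in a few lines of Python. The text above does not reproduce the formula, so the normalization used here, s(x, y) = 2 M[x][y] / (M[x][x] + M[y][y]) with M the path-count matrix of a round-trip meta-path, is an assumption based on the commonly cited definition of PathSim. The toy data and all names are hypothetical.

```python
def commuting_matrix(w):
    """Path-count matrix M = W * W^T for the round-trip meta-path built from W."""
    return [[sum(a * b for a, b in zip(row_x, row_y)) for row_y in w]
            for row_x in w]


def pathsim(m, x, y):
    """Assumed PathSim normalization: 2 * M[x][y] / (M[x][x] + M[y][y])."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical author x venue paper counts.
# Author 0 and author 2 publish at a similar scale; author 1 is far more visible.
W = [
    [2, 1],
    [50, 20],
    [2, 0],
]

if __name__ == "__main__":
    m = commuting_matrix(W)
    print("path count 0-1:", m[0][1], " path count 0-2:", m[0][2])
    print("PathSim 0-1:", pathsim(m, 0, 1), " PathSim 0-2:", pathsim(m, 0, 2))
```

Under these assumptions the sketch reflects the three listed properties: the score is symmetric, an object's similarity to itself is 1, and the highly visible author 1 has a far larger path count to author 0 than author 2 does, yet receives a much lower PathSim score.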

IV. QUERY PROCESSING FOR SINGLE META-PATH

Compared with P-PageRank and SimRank, the calculation of PathSim is much more efficient, as it is a local graph measure. But it still involves expensive matrix multiplication operations for top-k search functions, as we need to calculate the similarity between a query and every object of the same type in the network. One possible solution is to materialize all the meta-paths.

In order to support fast online query processing for large-scale networks, we propose a methodology that partially materializes short-length meta-paths and then concatenates them online to derive longer meta-path-based similarity. First, a baseline method is proposed, which computes the similarity between query object x and all the candidate objects y of the same type. Next, a co-clustering-based pruning method is proposed, which prunes candidate objects that are not promising according to their similarity upper bounds. Both algorithms return exact top-k results for the given query.

a) Baseline

Suppose we know the relation matrix for the meta-path and the diagonal vector. In order to get the top-k objects with the highest similarity for the query, we need to compute the probability of objects. The straightforward baseline is: (1) first apply vector-matrix multiplication; (2) calculate the probability of objects; (3) sort the probability of objects and return the top-k list in the final step. When the matrix is large, the vector-matrix computation will be too time consuming to check every possible object. This will be much more efficient than pairwise computation between the query and all the objects of that type. We call the baseline concatenation algorithm PathSim-baseline.

The PathSim-baseline algorithm is still time consuming if the candidate set is large. The time complexity of computing PathSim for each candidate is O(d) on average and O(m) in the worst case. We now propose a co-clustering-based top-k concatenation algorithm, by which non-promising target objects are dynamically filtered out to reduce the search space.

b) Co-Clustering-Based Pruning

In the baseline algorithm, the computational costs involve two factors. First, the more candidates to check, the more time the algorithm will take; second, for each candidate, the dot product of the query vector and the candidate vector will involve at most m operations, where m is the vector length. Based on this intuition, we propose a co-clustering-based path concatenation method, which first generates co-clusters of two types of objects for the partial relation matrix, then stores the necessary statistics for each of the blocks corresponding to different co-cluster pairs, and then uses the block statistics to prune the search space. For clarity, we call the clusters of one type target clusters, since their objects are the targets for the query, and the clusters of the other type feature clusters, since their objects serve as features to calculate the similarity between the query and the target objects. By partitioning into different target clusters, if a whole target cluster is not similar to
the query, then all the objects in the target cluster are likely not in the final top-k list and can be pruned. By partitioning into different feature clusters, cheaper calculations on the dimension-reduced query vector and candidate vectors can be used to derive the similarity upper bounds. PathSim-Pruning can significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality.

c) Multiple Meta-Paths Combination

In the previous section, we presented algorithms for similarity search using a single meta-path. Now, we present a solution to combine multiple meta-paths. The reason why we need to combine several meta-paths is that each meta-path provides a unique angle to view the similarity between objects, and the ground truth may be caused by different factors. Some useful guidance for the weight assignment includes: longer meta-paths utilize more remote relationships and thus should be assigned a smaller weight, as in P-PageRank and SimRank; and meta-paths with more important relationships should be assigned a higher weight. For automatically determining the weights, users could provide training examples of similar objects to learn the weights of different meta-paths using a learning algorithm.

V. EXPECTED RESULTS

To show the effectiveness of the PathSim measure and the efficiency of the proposed algorithms, we use the bibliographic network extracted from DBLP and the Flickr network in the experiments.

The PathSim pruning algorithm significantly improves the query processing speed compared with the baseline algorithm, without affecting the search quality.

For additional case studies, we construct a Flickr network from a subset of the Flickr data, which contains four types of objects: images, users, tags, and groups. We aim to show that our algorithms improve similarity search between objects based on the potentiality and correlation between objects.

VI. CONCLUSION

In this paper we introduced a novel meta-path-based similarity search, using a baseline algorithm and a co-clustering-based pruning algorithm, to improve similarity search based on the strengths and relationships between objects.
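The three baseline steps of Section IV(a) can be sketched as follows. This is a minimal illustration, assuming that the materialized relation matrix W covers the first half of a round-trip meta-path, that the diagonal vector holds the self path counts M[y][y], and that the score being ranked is the commonly cited PathSim normalization; the function names and toy data are hypothetical.

```python
import heapq


def diagonal_vector(w):
    """Precompute M[y][y] = W[y] . W[y] for every object y."""
    return [sum(v * v for v in row) for row in w]


def pathsim_baseline_topk(w, diag, x, k):
    """Return the k objects most similar to query x as (object, score) pairs."""
    query = w[x]
    # (1) Vector-matrix multiplication: path counts between x and every object.
    counts = [sum(q * v for q, v in zip(query, row)) for row in w]
    # (2) Score every candidate that shares at least one path with the query.
    scored = [(2.0 * c / (diag[x] + diag[y]), y)
              for y, c in enumerate(counts) if y != x and c > 0]
    # (3) Sort by score and keep the top k.
    return [(y, s) for s, y in heapq.nlargest(k, scored)]


# Hypothetical author x venue paper counts.
W = [
    [2, 1],
    [50, 20],
    [2, 0],
    [0, 3],
]

if __name__ == "__main__":
    diag = diagonal_vector(W)
    for obj, score in pathsim_baseline_topk(W, diag, x=0, k=2):
        print(obj, round(score, 4))
```

Storing the diagonal vector once avoids recomputing each object's self path count for every query, which is the reason the baseline description assumes it is known in advance.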

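The pruning idea of Section IV(b) can be illustrated with a simplified sketch. It is not the paper's block-statistics method: it uses target clusters only, omits the feature-cluster dimension reduction, and assumes non-negative relation matrix entries and the commonly cited PathSim normalization. For each target cluster it stores the column-wise maximum vector and the smallest diagonal value, which together give an upper bound on the score of every member. The cluster assignment, names, and data are hypothetical.

```python
import heapq


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cluster_stats(w, diag, clusters):
    """Per cluster: column-wise maximum vector and minimum diagonal value."""
    stats = []
    for members in clusters:
        col_max = [max(w[y][j] for y in members) for j in range(len(w[0]))]
        stats.append((col_max, min(diag[y] for y in members)))
    return stats


def pruned_topk(w, diag, clusters, x, k):
    """Exact top-k for query x; returns ((object, score) list, candidates checked)."""
    bounds = []
    for cid, (col_max, min_diag) in enumerate(cluster_stats(w, diag, clusters)):
        denom = diag[x] + min_diag
        # No member of the cluster can score above this value.
        bounds.append((2.0 * dot(w[x], col_max) / denom if denom else 0.0, cid))
    bounds.sort(reverse=True)

    best, checked = [], 0  # best is a min-heap of (score, object)
    for upper, cid in bounds:
        if len(best) == k and upper <= best[0][0]:
            break  # this cluster and all later ones cannot improve the result
        for y in clusters[cid]:
            if y == x:
                continue
            checked += 1
            denom = diag[x] + diag[y]
            score = 2.0 * dot(w[x], w[y]) / denom if denom else 0.0
            if len(best) < k:
                heapq.heappush(best, (score, y))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, y))
    return [(y, s) for s, y in sorted(best, reverse=True)], checked


W = [[2, 1], [50, 20], [2, 0], [0, 3], [0, 40]]
DIAG = [dot(row, row) for row in W]
CLUSTERS = [[0, 2, 3], [1, 4]]  # hypothetical target clusters

if __name__ == "__main__":
    print(pruned_topk(W, DIAG, CLUSTERS, x=0, k=1))
```

Because the bound never underestimates a member's score, skipping a cluster whose bound does not exceed the current k-th best score leaves the top-k result exact (up to ties), which is consistent with the claim that pruning does not affect search quality.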

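The combination of multiple meta-paths described in Section IV(c) can be sketched as a weighted sum of per-meta-path similarity matrices. The text does not spell out the combination rule, so the linear form, the geometric length decay used to give longer meta-paths smaller weights, and every name and value below are assumptions for illustration only.

```python
def length_decay_weights(lengths, decay=0.5):
    """Hypothetical weighting: weight proportional to decay ** length, normalized to sum to 1."""
    raw = [decay ** length for length in lengths]
    total = sum(raw)
    return [r / total for r in raw]


def combine_similarities(sims, weights):
    """Weighted sum of similarity matrices, one matrix per meta-path."""
    if len(sims) != len(weights):
        raise ValueError("need exactly one weight per meta-path")
    n = len(sims[0])
    return [[sum(wt * s[i][j] for wt, s in zip(weights, sims))
             for j in range(n)] for i in range(n)]


# Hypothetical similarity matrices for two objects under two meta-paths.
SIM_SHORT = [[1.0, 0.5], [0.5, 1.0]]  # meta-path of length 2
SIM_LONG = [[1.0, 0.1], [0.1, 1.0]]   # meta-path of length 4

if __name__ == "__main__":
    weights = length_decay_weights([2, 4])
    print(weights)
    print(combine_similarities([SIM_SHORT, SIM_LONG], weights))
```

If training examples of similar objects are available, the fixed decay weights in this sketch would be replaced by learned ones, as the section suggests.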