INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
www.iaetsd.in
ISBN: 378 - 26 - 138420 - 5

Similarity Search in Information Networks using Meta-Path Based between Objects

Abstract: Real-world physical and abstract data objects are interconnected, forming enormous, interconnected networks. By structuring these data objects and the interactions between them into multiple types, such networks become semi-structured heterogeneous information networks. The quality analysis of large heterogeneous information networks therefore poses new challenges. In the current system, a generalized flow-based method is introduced for measuring relationships on Wikipedia by reflecting three concepts: distance, connectivity and co-citation. The current approach measures only the relationships between objects rather than their similarity. To address this problem, we introduce a novel meta-path based similarity search approach for heterogeneous information networks. Under this framework, similarity search and other mining tasks can make use of the network structure.

Index terms: similarity search, information network, meta-path based, clustering.

I. INTRODUCTION

Similarity search, which aims at locating the most relevant information for a query in large collections of datasets, has been widely studied in many applications. For example, in spatial databases, people are interested in finding the k nearest neighbors of a given spatial object. Object similarity is also one of the most primitive concepts for object clustering and many other data mining functions.

In a similar context, it is critical to provide effective search functions in information networks, to find similar entities for a given entity. In a network of tagged images such as Flickr, a user may be interested in searching for the pictures most similar to a given picture. In an e-commerce system, a user would be interested in searching for the products most similar to a given product. Different from attribute-based similarity search, links play an important role in similarity search in information networks, especially when full information about the attributes of objects is difficult to obtain.

There are a few studies leveraging link information in networks for similarity search, but most of these studies focus on homogeneous or bipartite networks, with measures such as P-PageRank and SimRank. These similarity measures disregard the subtlety of different types among objects and links. Adopting such measures in heterogeneous networks has significant drawbacks: even if we just want to compare objects of the same type, going through link paths of different types leads to rather
different semantic meanings, and it makes little sense to mix them up and measure the similarity without distinguishing their semantics.

To systematically distinguish the semantics among paths connecting two objects, we introduce a meta-path based similarity framework for objects of the same type in a heterogeneous network. A meta-path is a sequence of relations between object types, which defines a new composite relation between its starting type and ending type. The meta-path framework provides a powerful mechanism for a user to select appropriate similarity semantics, by choosing a proper meta-path or learning it from a set of training examples of similar objects.

We introduce the meta-path based similarity framework and relate it to two well-known existing link-based similarity functions for homogeneous information networks. We define a novel similarity measure, PathSim, that is able to find peer objects that are not only strongly connected with each other but also share similar visibility in the network. Moreover, we propose an efficient algorithm to support online top-k queries for such similarity search.

II. A META-PATH BASED SIMILARITY MEASURE

The similarity between two objects in a link-based similarity function is determined by how the objects are connected in a network, which can be described using paths. In a heterogeneous information network, due to the heterogeneity of the types of links, the ways to connect two objects can be much more diverse. For example, in a bibliographic network the schema connections represent different relationships between authors, each with a different semantic meaning.

The question now is: given an arbitrary heterogeneous information network, is there a way to systematically identify all the possible connection types between two object types? To do so, we propose two important concepts in the following.

a) Network Schema And Meta-Path

First, given a complex heterogeneous information network, it is necessary to provide its meta-level description for a better understanding of the network. Therefore, we propose the concept of a network schema to describe the meta structure of a network.

The concept of a network schema is similar to that of the Entity Relationship model in database systems, but it captures only the entity types and their binary relations, without considering the attributes of each entity type. A network schema serves as a template for a network: it tells how many types of objects there are in the network and where the possible links exist.

b) Bibliographic Schema and Meta-Path

For the bibliographic network schema, an arrow explicitly shows the direction of a relation.

III. META-PATH BASED SIMILARITY FRAMEWORK

Given a user-specified meta-path, several similarity measures can be defined for a pair of objects according to the path instances
between them following the meta-path. Several straightforward measures are listed in the following.

Path count: s(x, y) is the number of path instances between objects x and y following meta-path P.

Random walk: s(x, y) is the probability of the random walk that starts from x and ends at y following meta-path P, which is the sum of the probabilities of all the path instances.

Pairwise random walk: for a meta-path P that can be decomposed into two shorter meta-paths of the same length, s(x, y) is the pairwise random walk probability of starting from objects x and y and reaching the same middle object.

In general, we can define a meta-path based similarity framework for two objects x and y. Note that P-PageRank and SimRank, two well-known network similarity functions, are weighted combinations of the random walk measure and the pairwise random walk measure, respectively, over meta-paths with different lengths in homogeneous networks. To use P-PageRank and SimRank in heterogeneous information networks, the meta-paths to follow have to be specified.

a) A Novel Similarity Measure

Several similarity measures have been presented, but they are partial to either highly visible objects or highly concentrated objects and cannot capture the semantics of peer similarity. However, in many scenarios, finding similar objects in networks means finding similar peers, such as finding similar authors based on their fields and reputation, finding similar actors based on their movie styles and productivity, and finding similar products.

This motivated us to propose a new meta-path based similarity measure, called PathSim, that captures the subtlety of peer similarity. The insight behind it is that two similar peer objects should not only be strongly connected, but also share comparable visibility. As the peer relation should be symmetric, we confine PathSim to symmetric meta-paths. The calculation of PathSim between any two objects of the same type given a certain meta-path involves matrix multiplication.

In this paper, we only consider meta-paths in the round-trip form, to guarantee their symmetry and therefore the symmetry of the PathSim measure.

Properties of PathSim
1. Symmetric.
2. Self-maximum.
3. Balance of visibility.

Meta-path based similarity lets us define the similarity between two objects given any round-trip meta-path. As primary eigenvectors can be used as an authority ranking of objects, the similarity between two objects under an infinitely long meta-path can be viewed as a measure defined on their rankings: two objects with more similar ranking scores will have higher similarity. In the next section we discuss online query processing for a single meta-path.
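
As a rough illustration of the measures in this section, the following NumPy sketch computes path count, random walk and PathSim for a round-trip meta-path. It is a minimal sketch under assumptions that the text above does not state: the matrix W, the author and venue interpretation and all function names are hypothetical, and PathSim is taken to be s(x, y) = 2 M(x, y) / (M(x, x) + M(y, y)), where M = W W^T holds the path counts of the round-trip meta-path.

```python
import numpy as np

# Hypothetical toy data: W[a, v] = number of papers author a published in venue v.
W = np.array([
    [2, 1, 0, 0],    # author 0
    [50, 20, 0, 0],  # author 1: same venues as author 0, far more papers
    [2, 0, 1, 0],    # author 2
    [2, 1, 0, 0],    # author 3: same profile as author 0
    [0, 0, 1, 1],    # author 4: no venue in common with author 0
], dtype=float)


def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    totals = m.sum(axis=1, keepdims=True)
    return np.divide(m, totals, out=np.zeros_like(m), where=totals > 0)


def path_count(w):
    """Number of path instances of the round-trip meta-path: M = W W^T."""
    return w @ w.T


def random_walk(w):
    """Probability of walking from x to y along the round-trip meta-path."""
    return row_normalize(w) @ row_normalize(w.T)


def pathsim(w):
    """Assumed PathSim definition: 2 M[x, y] / (M[x, x] + M[y, y])."""
    m = path_count(w)
    diag = np.diag(m)
    denom = diag[:, None] + diag[None, :]
    return np.divide(2.0 * m, denom, out=np.zeros_like(m), where=denom > 0)


print(path_count(W)[0])   # path counts from author 0 to every author
print(random_walk(W)[0])  # random walk probabilities from author 0
print(pathsim(W)[0])      # PathSim scores from author 0
```

On this toy data, path count ranks the prolific author 1 as closest to author 0, while PathSim prefers author 3, whose visibility matches that of author 0. This is the kind of behaviour the balance-of-visibility property refers to.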
IV. QUERY PROCESSING FOR SINGLE META-PATH

Compared with P-PageRank and SimRank, the calculation of PathSim is much more efficient, as it is a local graph measure. However, it still involves expensive matrix multiplication operations for top-k search functions, as we need to calculate the similarity between the query and every object of the same type in the network. One possible solution is to materialize all the meta-paths.

In order to support fast online query processing for large-scale networks, we propose a methodology that partially materializes short meta-paths and then concatenates them online to derive similarity based on longer meta-paths. First, a baseline method is proposed, which computes the similarity between the query object x and all the candidate objects y of the same type. Next, a co-clustering based pruning method is proposed, which prunes candidate objects that are not promising according to their similarity upper bounds. Both algorithms return exact top-k results for the given query.

a) Baseline

Suppose we know the relation matrix for the meta-path and its diagonal vector. To get the top-k objects with the highest similarity to the query, we need to compute the similarity of each object to the query. The straightforward baseline is to (1) apply vector-matrix multiplication, (2) calculate the similarity of the objects, and (3) sort the objects by similarity and return the top-k list. When the matrix is large, the vector-matrix computation is too time consuming to check every possible object. Checking only a set of candidate objects is much more efficient than pairwise computation between the query and all the objects of that type. We call this baseline concatenation algorithm PathSim-baseline.

The PathSim-baseline algorithm is still time consuming if the candidate set is large. The time complexity of computing PathSim for each candidate is O(d) on average and O(m) in the worst case. We now propose a co-clustering based top-k concatenation algorithm, by which non-promising target objects are dynamically filtered out to reduce the search space.

b) Co-Clustering-Based Pruning

In the baseline algorithm, the computational cost involves two factors. First, the more candidates there are to check, the more time the algorithm takes; second, for each candidate, the dot product of the query vector and the candidate vector involves at most m operations, where m is the vector length. Based on this intuition, we propose a co-clustering-based path concatenation method, which first generates co-clusters of the two types of objects for the partial relation matrix, then stores the necessary statistics for each of the blocks corresponding to different co-cluster pairs, and then uses the block statistics to prune the search space. For clarity, we call the clusters of one type target clusters, since their objects are the targets of the query, and the clusters of the other type feature clusters, since their objects serve as features for calculating the similarity between the query and the target objects. By partitioning the objects into different target clusters, if a whole target cluster is not similar to
the query, then all the objects in the target cluster are unlikely to be in the final top-k list and can be pruned. By partitioning the objects into different feature clusters, cheaper calculations on the dimension-reduced query vector and candidate vectors can be used to derive the similarity upper bounds. PathSim-pruning can significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality.

c) Multiple Meta-Paths Combination

In the previous section, we presented algorithms for similarity search using a single meta-path. Now, we present a solution to combine multiple meta-paths. The reason we need to combine several meta-paths is that each meta-path provides a unique angle from which to view the similarity between objects, and the ground truth may be the result of different factors. Some useful guidance for the weight assignment: longer meta-paths utilize more remote relationships and thus should be assigned smaller weights, as in P-PageRank and SimRank, and meta-paths with more important relationships should be assigned higher weights. To determine the weights automatically, users could provide training examples of similar objects, from which the weights of different meta-paths are learned using a learning algorithm.

V. EXPECTED RESULTS

To show the effectiveness of the PathSim measure and the efficiency of the proposed algorithms, we use the bibliographic network extracted from DBLP and the network extracted from Flickr in the experiments.

The PathSim algorithm significantly improves the query processing speed compared with the baseline algorithm, without affecting the search quality.

For additional case studies, we construct a Flickr network from a subset of the Flickr data, which contains four types of objects: images, users, tags, and groups. We aim to show that our algorithms improve similarity search between objects based on the potentiality and correlation between objects.

VI. CONCLUSION

In this paper we introduced a novel meta-path based similarity search, using a baseline algorithm and a co-clustering based pruning algorithm to improve similarity search based on the strengths and relationships between objects.
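
The three baseline steps of Section IV a) and the weighted combination of Section IV c) can be sketched in NumPy as follows. This is only an illustrative sketch under assumptions: the function names, the toy matrix W and the weights are hypothetical, the candidate set is taken to be the objects that share at least one feature with the query, and PathSim is assumed to be s(x, y) = 2 M(x, y) / (M(x, x) + M(y, y)) with M = W W^T. Co-clustering-based pruning is not shown.

```python
import numpy as np


def pathsim_topk(half, diag, query, k):
    """Baseline top-k search for one round-trip meta-path.

    half  -- materialized relation matrix of the short meta-path
             (rows: target objects, columns: feature objects)
    diag  -- precomputed diagonal of half @ half.T
    query -- row index of the query object
    """
    features = np.flatnonzero(half[query])
    # Candidates share at least one feature with the query; all other
    # objects have similarity 0 and are never touched.
    candidates = np.flatnonzero(half[:, features].sum(axis=1))
    # (1) vector-matrix multiplication, restricted to the candidates
    counts = half[np.ix_(candidates, features)] @ half[query, features]
    # (2) similarity of each candidate (assumed PathSim definition)
    scores = 2.0 * counts / (diag[query] + diag[candidates])
    # (3) sort by similarity and keep the top-k list
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(candidates[i]), float(scores[i])) for i in order]


def combined_scores(halves, weights, query):
    """Weighted sum of PathSim scores over several round-trip meta-paths."""
    total = np.zeros(halves[0].shape[0])
    for half, weight in zip(halves, weights):
        counts = half @ half[query]
        diag = np.einsum("ij,ij->i", half, half)  # diagonal of half @ half.T
        denom = diag + diag[query]
        total += weight * np.divide(2.0 * counts, denom,
                                    out=np.zeros_like(total), where=denom > 0)
    return total


# Hypothetical toy data: W[a, v] = papers of author a in venue v.
W = np.array([
    [2, 1, 0, 0],
    [50, 20, 0, 0],
    [2, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 0, 1, 1],
], dtype=float)
DIAG = np.einsum("ij,ij->i", W, W)

print(pathsim_topk(W, DIAG, query=0, k=3))
print(combined_scores([W, W], [0.5, 0.5], query=0))
```

The query object itself appears in the result with score 1, in line with the self-maximum property; a caller that wants only other objects can drop it from the list.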
