
Future Generation Computer Systems 26 (2010) 1409–1417


A hybrid collaborative filtering recommendation mechanism for P2P networks


Zhaobin Liu a, Wenyu Qu a,*, Haitao Li b, Changsheng Xie c

a School of Information Science and Technology, Dalian Maritime University, Dalian, 116026, PR China
b Institute for Photogrammetry and Remote Sensing, Chinese Academy of Surveying and Mapping, Beijing, 100039, PR China
c Wuhan National Lab for Optoelectronics (WNLO), Huazhong University of Science and Technology, Wuhan, 430074, PR China

Article info

Article history:
Received 30 November 2009
Received in revised form 6 April 2010
Accepted 16 April 2010
Available online 2 May 2010

Keywords:
Collaborative filtering
Recommendation
Sparse matrix
Eigenvalue matrix
Peer-to-peer (P2P) networks

Abstract
With the increasing number of commerce facilities using peer-to-peer (P2P) networks, challenges exist in recommending interesting or useful products and services to a particular customer. Collaborative filtering (CF) is one of the most successful techniques for recommending items (such as music, movies and web sites) that are likely to be of interest to users. However, conventional collaborative filtering encounters a number of challenges to its recommendation accuracy. One of the most important challenges is the sparsity inherent in the rating data. Another important challenge is that existing CF methods mainly consider either user-based or item-based ratings, but not both. In this paper, a P2P-based hybrid collaborative filtering mechanism that combines user-based and item attribute-based ratings is considered. We take advantage of the inherent item attributes to construct a Boolean matrix to predict the blank elements of a sparse user-item matrix. Furthermore, a hybrid collaborative filtering (HCF) algorithm is presented to improve the predictive accuracy. Case studies and experimental results illustrate that our approaches not only help predict the unrated blank data of a sparse matrix but also improve the prediction accuracy as expected.
© 2010 Elsevier B.V. All rights reserved.

1. Introduction
In recent years, peer-to-peer (P2P) file-sharing networks have
become a popular new way to exchange resources, information
and services across a large number of autonomous peers [1,2].
Examples of P2P file sharing systems are: Gnutella, BitTorrent
and P2P Music streaming systems like iTunes [3]. These systems
enable users to form communities for sharing different types of
files. However, due to the explosive growth of the volume of
information, such as in the web, users should be able to make
choices without knowing all of the alternatives [4,5]. Moreover,
both the users and the data are distributed and dynamically
changing, which makes it difficult to filter, search and localize
the available content within the P2P network [6].
This rapid growth and the associated challenges have motivated the development of recommendation systems. Collaborative filtering (CF) [7] is a personalized recommendation technique that has been very promising both in research and industry. CF leverages the usage history of groups of similar users in order to make recommendations to a target user [8,9]. Nowadays, CF technology has gradually been implemented in various applications (e.g. Netflix, TiVo, Google News and Amazon [10,11]).

* Corresponding author.
E-mail addresses: zhbliu@gmail.com (Z. Liu), eunice.qu@gmail.com, buyhorse@gmail.com (W. Qu), lhtao@casm.ac.cn (H. Li), cs_xie@mail.hust.edu.cn (C. Xie).
0167-739X/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.future.2010.04.002

However, regardless of its success in many application settings,
conventional collaborative filtering encounters a number of
limitations which influence its recommendation accuracy. One of
the most important limitations may be the data sparsity problem.
The underlying assumption of CF is that the active user will prefer
to choose those items which similar users prefer [12]. In many
real collaborative filtering systems there are many potentially
recommendable items and many user profiles, but a typical user
may have rated only a tiny percentage of these items [13]. In
other words, the user-item matrix is a sparse matrix populated
primarily with blanks. Another important limitation is that
existing CF methods consider mainly user-based or item-based
ratings respectively, and this is a major issue that limits the quality
of recommendations and the applicability of collaborative filtering
in general.
To alleviate the sparsity problem, our first contribution in this paper is to bring in both user-based and item attribute-based ratings. We take advantage of the inherent item attributes to construct a Boolean matrix, and use it in a new item similarity computing mechanism to predict the blank elements in the sparse user-item matrix. By comparing the Euclidean distance between two items, we obtain a novel blank element prediction approach based on the similarity of items. Case studies demonstrate that our methodology contributes to effectively predicting the blank elements of a sparse matrix. Test results also show that the filling-in accuracy is acceptable and reasonable.
Our second contribution is that a Hybrid collaborative filtering
(HCF) mechanism is presented to improve the predictive accuracy.
In this paper, we pre-classify the user clusters based on the P2P
user attributes, that is, selecting the most similar users into the
same cluster. Then we apply the k-means clustering algorithm
searching within the similar users instead of a whole database.
Our experiment results illustrate that Hybrid collaborative filtering
achieves reasonable quality performance.
The remainder of the manuscript is organized as follows. The problem statement and related works are reviewed in Section 2. Section 3 gives the theoretical analysis for unrated rating prediction and proposes a new filling-in methodology for a sparse user-item matrix, together with a case study and an analysis of its results. In Section 4, a hybrid collaborative filtering mechanism is presented to improve the predictive accuracy. Our experimental results are analyzed in Section 5. Conclusions and future work are given in Section 6.
2. Problem statement and related works
Many schemes have been proposed for efficient peer-to-peer
recommendation. In this section, we review the major research
works on the limitations of existing sparsity and clustering
algorithm in P2P Collaborative Filtering technology.
Before problem formulation, we first introduce some notations and abbreviations used in this paper. Given a set of users $U = \{user_1, user_2, \ldots, user_m\}$ and a set of items $T = \{item_1, item_2, \ldots, item_n\}$, the user-item rating matrix is represented as a $U \times T$ (that is, $m \times n$) matrix $R = (R_{i,j})$. Each value in this matrix is either a real number within a range (from 1 to 5) or a special symbol denoting an unrated entry, and $R_{i,j}$ denotes the rating of $user_i$ for $item_j$.
With the growing adoption of recommendation systems, many researchers and experts are focusing on the CF problem and have made great progress. The number of users and items in major P2P network e-commerce systems is very large [14]. Even very popular items may have been rated by only a few of the users in the database [15]. Moreover, new items may cause a cold-start problem: they cannot be recommended until they have been rated by a substantial number of users. Many approaches have been proposed to alleviate the sparsity problem [16–18]. The basic CF method simply assigns an average rating to the unrated elements of a sparse user-item matrix, which has been shown to perform poorly in terms of prediction accuracy. Dimensionality reduction approaches [19–21] address the sparsity problem by removing unrepresentative or insignificant users or items from the user-item matrix. Unfortunately, potentially useful information might be lost during this reduction process. One approach combines CF with a content-based method to solve the sparsity problem [22,23]. Most studies using this approach have demonstrated improvement in quality. However, this approach requires additional information about the items. Another solution is to use either user-based or item-based collaborative filtering to predict the blank unrated data. Nevertheless, filling unrated data through only user-based or only item-based approaches may ignore valuable information that would make the prediction more accurate.
In 1967, MacQueen presented the k-means algorithm, which assigns each node to the cluster whose center (also called centroid) is nearest. The algorithm steps are [24]: (1) Choose the number of clusters, k. (2) Randomly generate k clusters and determine the cluster centers, or directly generate k random nodes as cluster centers. (3) Assign each node to the nearest cluster center. (4) Recompute the new cluster centers. (5) Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed). Because the resulting clusters depend on the initial random assignments, one disadvantage is that the algorithm may not yield the same result with each run [25]. Another disadvantage is that searching for neighbors across the whole P2P network may degrade performance. The traditional k-means algorithm for collaborative filtering is mainly applied to item-based or user-based ratings separately. For this reason, in [26] the authors propose a hybrid predictive algorithm with smoothing that considers both user aspects and item aspects. In [27] eSciGrid was presented to take into account the physical distance between peers and the amount of traffic carried by each node. Wang et al. present a unified probabilistic model for collaborative filtering that uses Parzen-window density estimation to acquire the probabilities of the proposed unified relevance model [28,29]. However, these approaches may not be well suited to P2P network application scenarios, or they ignore real-time performance while finding closer neighbors.
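The five k-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in the paper: the function name `kmeans`, the optional `init` argument (explicit starting centers instead of random ones) and the rule for empty clusters are all choices made for this example.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples.

    init -- optional list of k starting centers (an assumption of this
            sketch); if omitted, k random nodes are used as in step (2).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step (3): assign each node to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step (5): stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (4): recompute centers; an empty cluster keeps its old center.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(labels)
```

With random starting centers the labels can differ from run to run, which is the run-to-run variability noted in [25]; passing `init` makes the result reproducible.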
3. Sparsity limitation solution
In this section we present a method for alleviating the sparsity
challenge in collaborative filtering based on the item attributes
Boolean matrix.
3.1. Collaborative filtering based on item attributes
The first step is to compose the eigenvalue matrix of items as follows. Each item can be divided into several dimensions, and each dimension is an attribute of the item. At the same time, each attribute has its initial eigenvalue. To simplify the problem, we use Boolean variables (0 or 1) to construct the eigenvalue matrix: a 1 indicates that the item has that attribute, and a 0 indicates that it does not. In order to determine which items are similar, we need to define a similarity function. We use the Euclidean distance to calculate the degree of similarity among items,

$$d(item_i, item_j) = \sqrt{\sum_{m=1}^{n} \left(p_{item_i}^{(m)} - p_{item_j}^{(m)}\right)^2} \qquad (1)$$

where $d(item_i, item_j)$ is the Euclidean distance between $item_i$ and $item_j$, $p_{item_i}^{(m)}$ is the $m$th index value of $item_i$, and $p_{item_j}^{(m)}$ is the $m$th index value of $item_j$.
Then the similarity between $item_i$ and $item_j$ can be defined as

$$sim_s(item_i, item_j) = \frac{1}{1 + d(item_i, item_j)} \qquad (2)$$

For all blank (unrated) entries, $R_{(user, item)}$ can be generated according to

$$R_i = \frac{\sum_{j=1}^{n} R_j \, sim_s(item_i, item_j)}{\sum_{j=1}^{n} sim_s(item_i, item_j)} \qquad (3)$$

where $R_i$ is the predicted rating of the user for $item_i$, $R_j$ is the rating of the user for $item_j$, and $sim_s(item_i, item_j)$ is the similarity between $item_i$ and $item_j$.
Eqs. (1)–(3) give the formulas for computing the similarity between two items and the resulting predicted rating. In other words, we can predict all the unrated entries to fill in the sparse user-item matrix.
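As a rough sketch of how Eqs. (1)–(3) fit together, the following Python functions compute the distance, the similarity and the similarity-weighted prediction for one user. The function names and the three sample attribute vectors are hypothetical and chosen only for illustration; they are not taken from the paper's tables. The sum in `predict` runs over the items the user has actually rated, which is an assumption about how Eq. (3) is applied.

```python
import math

def distance(a, b):
    """Eq. (1): Euclidean distance between two Boolean attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Eq. (2): 1 / (1 + distance); identical vectors give similarity 1."""
    return 1.0 / (1.0 + distance(a, b))

def predict(target, rated):
    """Eq. (3): similarity-weighted mean of one user's known ratings.

    target -- attribute vector of the unrated item
    rated  -- non-empty list of (attribute_vector, rating) pairs for the
              items this user has rated
    """
    weights = [similarity(target, vec) for vec, _ in rated]
    return sum(w * r for w, (_, r) in zip(weights, rated)) / sum(weights)

# Hypothetical 4-attribute items (illustrative values only).
item_a = (1, 0, 1, 0)
item_b = (1, 0, 0, 0)   # differs from item_a in one attribute
item_c = (0, 1, 0, 1)   # differs from item_a in all four attributes

predicted = predict(item_a, [(item_b, 4), (item_c, 2)])
print(round(predicted, 2))
```

Because the closer item (`item_b`) carries the larger weight, the prediction is pulled toward its rating of 4 rather than sitting at the plain average of 3.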
3.2. Case study
We use a typical movie recommender system as a specific example of collaborative filtering. Generally, we assume there are m users and n items (movies), which gives the user-item matrix shown in Table 1. Each movie is regarded as an item, and each item often has detailed information on primitive concept levels. If a rating exists, the element of the matrix is the user's rating of the item; otherwise the element is blank, which means that there has been no such rating. In this paper, each movie item contains four attributes: genre, language, release year and country. Each attribute has several primitive values. For instance, genre contains {action, adventure, mystery, drama, documentary, romance, comedy, etc.}. Language contains {Chinese, Cantonese, English, Japanese, Korean, etc.}. Release year contains {less than two years, less than five years, more than five years}. Country contains {Chinese Mainland, HK & Taiwan, Occident, Japan, Korea, etc.}.

Table 1
The user-item matrix (m × n) before data prediction (blank entries are unrated). Rows: user_1, user_2, user_3, user_m; columns: item_1, item_2, item_3, item_n.
[Table body not recoverable from the source text.]

Each user of the system expresses an opinion about the movies he or she likes or dislikes by giving a score. The opinion of a customer is expressed as one of five ratings (from 1 to 5). All of these ratings are captured in the user-item matrix. In this matrix, users are in the rows and items (movies) are in the columns. Each cell contains the user's rating of that item, as shown in Table 1. For example, 1 star expresses that the user feels the movie is awful and 5 stars expresses that the user feels it is excellent. From Table 1, we can see that $user_1$ rates the movie $item_1$ 5 stars, and $user_2$ rates the same movie 3 stars.
Obviously, most of the elements in the user-item matrix are unrated blank data. To improve the accuracy of filling-in, we first construct the Boolean matrix in terms of item attributes. In this specific instance, $item_1$ is "Crouching Tiger, Hidden Dragon", $item_2$ is "Tomb Raider", $item_3$ is "The Lord of the Rings" and $item_4$ is "Mr and Mrs Smith". As a result, we can get the eigenvalue matrix of each item (as shown in Table 2(a–d)).
To compute the predicted rating of $user_1$ for $item_2$, according to Eq. (1), we first calculate the Euclidean distance between $item_2$ and $item_1$, $item_3$ and $item_4$ respectively.
$$d(item_2, item_1) = \sqrt{7} = 2.646, \qquad d(item_2, item_3) = 1, \qquad d(item_2, item_4) = 2.$$

Table 2
Eigenvalue matrix of items: (a) $item_1$, (b) $item_2$, (c) $item_3$, (d) $item_4$. Each matrix has one row per attribute (genre, language, year, country) and Boolean (0/1) entries.
[Table body not recoverable from the source text.]

Fig. 1. The Euclidean distance between $item_2$ and the others.

Hence, from Fig. 1 we can see that $item_3$ is the closest to $item_2$.
In order to compute the similarity between $item_2$ and $item_1$, $item_3$ and $item_4$ respectively, according to Eq. (2) we get:
$sim_s(item_2, item_1) = 0.27$.
$sim_s(item_2, item_3) = 0.5$.
$sim_s(item_2, item_4) = 0.33$.
As a result, in terms of Eq. (3), the predicted rating of $user_1$ for $item_2$ is:

$$\frac{5 \times 0.27 + 4 \times 0.5 + 5 \times 0.33}{0.27 + 0.5 + 0.33} = 4.54 \approx 5 \text{ (after rounding to the nearest integer)}.$$
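The arithmetic of this case study can be re-traced with a short Python check. The ratings 5, 4 and 5 and the three distances are the values used in the case study above; the variable names are chosen for this sketch, and the similarities are rounded to two decimals as in the text.

```python
ratings = [5, 4, 5]                # user_1's ratings of item_1, item_3, item_4
distances = [7 ** 0.5, 1.0, 2.0]   # d(item_2, item_j) for j = 1, 3, 4

# Eq. (2), rounded to two decimals as in the case study.
sims = [round(1 / (1 + d), 2) for d in distances]

# Eq. (3): similarity-weighted mean of the known ratings.
prediction = sum(r * s for r, s in zip(ratings, sims)) / sum(sims)

print(sims)
print(f"{prediction:.4f}", round(prediction))
```

The quotient is 5.0 / 1.1 = 4.5454..., which the case study reports as 4.54 and which rounds to 5 as the nearest integer.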

Similarly, we can obtain the rest of the unrated data in the sparse user-item matrix. To assess the prediction accuracy of our methodology, we also compute predictions for the rated data. After filling in all the elements, the new user-item matrix (m × n) is shown in Table 3.

Table 3
The user-item matrix (m × n) after data prediction.

          item_1   item_2   item_3   item_n
user_1    4.48     4.54     5        4.22
user_2    4        3.64     3.64     3.58
user_3    3.5      4        3.41     3.45
user_m    3.64     3.35     4        3.47

As for the rated elements in the user-item matrix, Fig. 2 compares the predicted data with the initial data in order to assess the prediction accuracy of our method. The result after rounding to the nearest integer is shown in Fig. 3. The comparison indicates that all the predictions are close to the initial rating data, and the rounded results also match our expectations.
4. A hybrid collaborative filtering mechanism
There are a number of papers on the technical aspects of P2P collaborative filtering clustering algorithms. However, the large number of users and items in a P2P network can increase the time needed to find the closer neighbors. It can also increase the space complexity of the clustering algorithm. On the other hand, in a large-scale P2P network, the number of active users may affect network congestion. Therefore, determining the right network scale is very important for the operation of P2P collaborative filtering clustering algorithms.

Fig. 2. Comparison of initial rating and after prediction (before rounding). [Vertical axis: R(user,item); horizontal axis: the rated entries R(u1,i1), R(u1,i3), R(u1,i4), R(u2,i1), R(u2,i2), R(u3,i2), R(u3,i4), R(u4,i2), R(u4,i3); series: initial, prediction.]

Fig. 3. Comparison of initial rating and after prediction (after rounding). [Same axes and series as Fig. 2.]

Considering P2P scalability and clustering efficiency, in this
paper, the P2P users may be classified into different groups
(clusters) with respect to the user personality features. In other
words, a collection of users are similar within clusters and are
dissimilar to the users belonging to other clusters. Our hybrid
collaborative filtering algorithm based on k-means will search for
neighbors within the similar user cluster instead of searching the
whole user space. As a result, it can not only reduce the algorithm
complexity but also improve the prediction accuracy because it
takes the preferences of users into account.
4.1. A quantitative approach for P2P user attributes
Before proposing our new hybrid collaborative filtering algorithm, the personality features of P2P users should be expressed quantitatively. Generally, when a new customer registers with a P2P network, we can obtain a user profile, such as age, gender, career, character and preferences, which is usually stored in a database. Obviously, age is a numerical value and gender is a binary value (for example, 0 means male and 1 means female). Educational background can be divided into elementary school, middle school, bachelor, master and Ph.D., which are described by 1 to 5 respectively. To quantify profession and character, we adopt a hierarchical tree.
The quantifying stage for the profession and character of P2P users is as follows. In his theory of career choice, psychologist John L. Holland created the Holland Codes to measure an individual's type and match it with a list of career choices [30]. In this section, we use the Holland Codes to classify careers into six types: Realistic, Investigative, Artistic, Social, Enterprising and Conventional. Based on the Holland Codes classification, we propose the profession hierarchical tree for P2P networks.
As shown in Fig. 4, every profession may also contain several
sub-categories. We set serial number tags for each layer from the
top-down for the profession tree. The profession type is 0, the
realistic type is 1, the investigative type is 2, the artistic type is 3,
the social type is 4, the enterprising type is 5 and the conventional
type is 6. Additionally for the next level, the technical operation is
011, the operator is 0111, the manual operation is 012, locksmith
is 0121, carpenter is 0122 and so forth. Each layer can be deduced
in the same manner which is set with the number tags 1, 2, 3 etc.
from the top down. As a result we can quantify the professional
information registered on the website by P2P users.
As with the profession hierarchical tree, we can also build a
character hierarchical tree for P2P users. Fig. 5 illustrates the
character categories. We can see that character has been partitioned into
humorous, enthusiastic, serious, slow-witted, hot inside cold
outside, cold inside hot outside and so on.
Putting all of the above together, we can quantify the users' information when they register for the P2P network service. For instance, consider user A (gender: male; age: 28; educational qualification: college; profession: pianist; character trait: lively type); the quantified result of this personal information is {0, 28, 3, 0311, 0121}. Similarly, for user B (gender: female; age: 21; educational background: college; profession: dancer; character trait: lively type), the quantified result is {1, 21, 3, 0322, 0121}. In this way, all of the users' registered information can be quantified according to gender, age, educational qualification, profession tree and character tree. Therefore, we can use our hybrid collaborative filtering k-means clustering algorithm to classify the users, sorting similar users into one cluster and searching for the nearest neighbors within this user cluster, which shortens the clustering time and improves the recommendation precision.

Fig. 4. A quantitative approach for P2P users' professions.

Fig. 5. A quantitative approach for P2P users' character traits.
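The quantification of a registered profile can be sketched as a simple table lookup. The lookup tables below are hypothetical and contain only the codes quoted in this section; a real system would load the complete profession and character trees. The tree codes are kept as strings so that leading zeros survive, which is a choice of this sketch, and "College" in the worked examples is assumed to correspond to level 3 (bachelor) of the education scale.

```python
# Only the codes quoted in the text are included (illustrative, incomplete).
GENDER = {"male": 0, "female": 1}
EDUCATION = {"elementary school": 1, "middle school": 2, "bachelor": 3,
             "master": 4, "phd": 5}
PROFESSION = {"pianist": "0311", "dancer": "0322",
              "locksmith": "0121", "carpenter": "0122"}
CHARACTER = {"lively": "0121"}

def quantify(gender, age, education, profession, character):
    """Map a registered profile to (gender, age, education, profession, character)."""
    return (GENDER[gender], age, EDUCATION[education],
            PROFESSION[profession], CHARACTER[character])

user_a = quantify("male", 28, "bachelor", "pianist", "lively")
user_b = quantify("female", 21, "bachelor", "dancer", "lively")
print(user_a)
print(user_b)
```

How the tree codes are turned into numbers for a Euclidean distance is not specified in the text, so the sketch stops at the encoded tuple.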
4.2. Algorithm implementation description
Using the quantified P2P user features described above and considering the personalities of P2P customers, we propose the following hybrid collaborative filtering k-means algorithm for a P2P network.
Fig. 6 depicts the operation of our new hybrid collaborative filtering algorithm. The nodes represent the P2P users, and the edges between nodes represent distances, which are calculated as Euclidean distances. A connected graph N = (V, {E}) is therefore a representation of a set of nodes and the edges that join them, where V is the set of all users and E is the set of all edges between the nodes. The initial state is an unconnected graph T = (V, {}), in which there are no edges and the number of nodes is n. Every node forms a connected vector by itself. If the nodes associated with the minimum-cost edge belong to the sub-vector of T, this edge is put into T; otherwise this edge is removed and the next minimum-cost edge is selected. Proceeding by analogy, once some connected nodes form a loop, all nodes associated with this loop are added to the same cluster M and, at the same time, removed from the set T. The above procedure is repeated until all nodes are allocated to k clusters. After recalculating the centroid of each cluster as the new centroid, we then apply the conventional k-means algorithm to finish the operation. The algorithm stops when all the distances become less than the initialized threshold.

Fig. 6. A hybrid collaborative filtering algorithm.

A motivating example is illustrated in Fig. 7. Suppose there are
6 users in a P2P network. As shown in Fig. 7(a), two nodes (user
v1 and v3) with the minimum distance are connected by an edge.
The rest can be done in the same manner until the state Fig. 7(f).
We can see that all the nodes compose two loops. They are separated
into two clusters, as shown in Fig. 7(g) and (h) respectively. In other
words, elements in the same cluster are similar in some sense. The
average distance of each cluster will be treated as the new centroid,
then the classical k-means algorithm is executed.
Through this similar-user clustering method, customers with similar attributes or behaviors are gathered into the same cluster. Determining the nearest neighbors becomes more effective and precise because the clustering algorithm is performed only within each cluster vector. This also conforms to the real-time requirements of a recommendation system. Regarding the new-user problem in collaborative filtering, theoretically speaking, users with similar information do not differ greatly in their interests. Therefore, we can recommend the average-scored item from the same cluster to the new user. Incidentally, this resolves the cold-start problem effectively.
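One possible reading of the cold-start rule is sketched below: average each item's score over the users already in the new user's cluster and recommend the items with the highest averages. The function name, the dictionary-based data layout and the tie-breaking by item name are assumptions of this sketch; the text does not specify how the average-scored item is selected.

```python
def cold_start_recommend(cluster_ratings, top_n=1):
    """Recommend items by their mean rating within one user cluster.

    cluster_ratings -- list of {item: rating} dicts, one per cluster member;
                       items a member has not rated are simply absent.
    """
    totals, counts = {}, {}
    for user in cluster_ratings:
        for item, rating in user.items():
            totals[item] = totals.get(item, 0) + rating
            counts[item] = counts.get(item, 0) + 1
    means = {item: totals[item] / counts[item] for item in totals}
    # Highest mean first; ties are broken by item name for determinism.
    ranked = sorted(means, key=lambda item: (-means[item], item))
    return [(item, means[item]) for item in ranked[:top_n]]

# Hypothetical ratings of three users who share the new user's cluster.
cluster = [{"a": 5, "b": 3}, {"a": 4, "c": 2}, {"b": 5}]
top = cold_start_recommend(cluster, top_n=2)
print(top)
```

Only ratings from the same cluster are consulted, so the new user needs no rating history of their own.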
5. Experimental evaluation
5.1. Test dataset
To test the efficiency of our methodology, in this section we experiment with a classical movie-rating dataset: the MovieLens [31] dataset. The MovieLens dataset was collected by the GroupLens group through the MovieLens Web site during the period between September 1997 and April 1998. The basic characteristics of the MovieLens dataset are summarized in Table 4. The dataset contains three sets: Movies.dat, Rating.dat and User.dat. Movies.dat contains 1682 movies (items), including detailed information: movie code, name and type (for example Action, Adventure, Animation, Comedy, Crime, Documentary, Drama, Fantasy, Horror, Romance etc.). User.dat has the features of 943 users, such as user ID, gender, age and profession. Rating.dat contains ratings by 943 users for 1682 movies (items). Each user has rated at least 20 movies. The rating scale takes values from 1 (lowest rating) to 5 (highest rating). As a result, the sparsity of the MovieLens dataset is

$$1 - \frac{100000}{943 \times 1682} = 0.936953 \approx 93.7\%.$$

Table 4
The basic characteristics of the test dataset.

                   MovieLens
Number of users    943
Number of items    1682
Sparsity           93.7%
Rating scale       1–5
Training set       80%
Test set           20%

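The sparsity value follows directly from the dataset counts, as this short check shows. The variable names are chosen for this sketch; the counts (943 users, 1682 items, 100000 ratings) are those stated for the dataset.

```python
users, items, ratings = 943, 1682, 100_000

# Sparsity = share of user-item cells that carry no rating.
sparsity = 1 - ratings / (users * items)

print(f"{sparsity:.1%}")
```

The result agrees with the 93.7% listed in Table 4.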
5.2. Item attributes impact
In order to evaluate how close the predictions are to the experimental data, we report our results using the mean absolute error (MAE) evaluation metric. As its name implies, the mean absolute error is the average of the absolute errors $|p_i - q_i|$, where $p_i$ is a predicted value and $q_i$ is the corresponding true value. For all test datasets, we have

$$MAE = \frac{\sum_{i=1}^{N} |p_i - q_i|}{N} \qquad (4)$$

where $N$ denotes the number of tested ratings. MAE expresses the average absolute deviation of the predictions from the actual data. Note that a smaller value indicates better performance.
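Eq. (4) translates directly into code. The following is a minimal sketch; the function name and the sample values are hypothetical and serve only to illustrate the computation.

```python
def mean_absolute_error(predicted, actual):
    """Eq. (4): average absolute deviation between predictions and true ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(predicted)

# Illustrative values: absolute errors are 0.5, 0.0 and 2.0.
mae = mean_absolute_error([4.5, 3.0, 2.0], [5, 3, 4])
print(round(mae, 4))
```

Since every error enters with equal weight, a single large miss (here the error of 2.0) raises the MAE noticeably even when the other predictions are exact.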
The prediction quality of CF is mainly influenced by two factors: one is the sparsity level of the dataset, and the other is the number of neighbors. Based on these two parameters, we conduct experiments to evaluate our proposed algorithm with respect to each of them.
To evaluate the sensitivity of the traditional CF and item attributes based CF algorithms under diverse rating sparsity levels, we implemented our first experiment, in which the sparsity level takes the values 0.90, 0.84, 0.80, 0.75 and 0.72 respectively. Looking at the results in Fig. 8, we can see that the MAE of both algorithms declines as the density of the rating matrix increases. The item attributes based CF algorithm also has lower MAE values than the traditional CF algorithm over the whole range of sparsity levels. Moreover, as sparsity increases, the gap between them becomes larger. That is, the sparser the rating matrix is, the greater the MAE advantage of the item attributes based CF algorithm. The reason is that using the item attributes to predict the unrated entries of a sparse matrix enriches the user-item matrix and results in more accurate prediction.
Our second experiment in this section was designed to evaluate the effect of the number of neighbors on MAE performance. For all experiments, the sparsity level was set to 0.8. Fig. 9 compares the results of the traditional CF algorithm and the item attributes based CF algorithm when the number of neighbors ranges from 5 to 40 in steps of 5.
Both of the algorithms are linearly proportional to the number of neighbors. The figure also shows that item attributes based CF has a smaller MAE than traditional CF in all cases. This means that our algorithm has better accuracy under the same sparsity level conditions. Furthermore, when the number of neighbors is small, the gap between the two algorithms is larger than in the other situations. As the number of neighbors increases, the difference changes smoothly. Thus, the item attributes based CF algorithm has better performance than traditional CF.

Fig. 7. Demonstration of the solution procedure.

5.3. Hybrid collaborative filtering clustering algorithm

Table 5
The number of neighbors found by the conventional CF algorithm.

Num. of clusters   User ID                           Avg. (%)
                   16    121   317   608   912
2                  11    11    11    10    11        90
3                  11    10    10    9     10        83.3
4                  10    9     9     8     9         75
5                  10    9     8     8     8         71.7

Table 6
The number of neighbors found by the HCF algorithm.

Num. of clusters   User ID                           Avg. (%)
                   16    121   317   608   912
2                  12    12    11    12    11        96.7
3                  12    11    10    11    11        91.7
4                  11    10    9     9     10        81.7
5                  10    9     8     8     8         71.7


To demonstrate the hybrid collaborative filtering clustering algorithm, we split the MovieLens dataset into two sets by randomly selecting 20% of the dataset as the test set and using the remaining 80% as the training set. The training set is used to make predictions, while the test set is used to measure prediction accuracy.
We consider two test methodologies. First, we use the test dataset to compare the performance of the traditional CF algorithm and our hybrid algorithm, reporting how many neighbors each finds in the smallest user space. Second, we compare the MAE evaluation metric of the different algorithms for different numbers of neighbors.
Without loss of generality, we select five users at random: 16, 121, 317, 608 and 912. We set the threshold value for the closest neighbors to 12 and set the number of clusters to 2, 3, 4 and 5 respectively. In our hybrid algorithm, each active user searches for the closest neighbors only within its own cluster. The result of the conventional CF algorithm is shown in Table 5.
From Table 5 we can see that, when the number of clusters is 2, the traditional CF algorithm can find 90% of the neighbors in 60.12% of the user space. When the number of clusters is 3, it can find 83.3% of the neighbors in 35.21% of the user space. When the number of clusters is 4, it can find 75% of the neighbors in 29.34% of the user space. When the number of clusters is 5, it can find 71.7% of the neighbors in 23.46% of the user space. In summary, on average it can find 79.98% of the neighbors in 37% of the user space.
Table 6 shows that with the hybrid collaborative filtering
clustering algorithm, when the number of clusters is 2, the hybrid
algorithm can find 96.7% neighbors in the 59.21% user space. When
the number of clusters is 3, it can find 91.7% neighbors in the
34% user space. When the number of clusters is 4, it can find
81.7% neighbors in the 27.21% user space. When the number of
clusters is 5, it can find 71.7% neighbors in the 23.72% user space. In
summary, it can find 85.35% neighbors in the 36% user space. Thus
it is clear that the hybrid algorithm can find more neighbors in less
user space than a traditional CF algorithm. In addition, it also can

1416

Z. Liu et al. / Future Generation Computer Systems 26 (2010) 14091417

0.95
0.9

MAE

0.85

_
Item based CF

Traditional CF

0.8
0.75
0.7
0.65
0.6

0.9

0.84

0.8

0.75

0.72

Sparsity level
Fig. 8. MAE impact of different CF for different sparsity levels.

[Plot: MAE versus number of neighbors; legend: Traditional CF, Item based CF.]
Fig. 9. MAE impact of different CF for a different number of neighbors.

[Plot: MAE versus number of neighbors; legend: Traditional CF, K-means based CF, Hybrid CF.]
Fig. 10. MAE comparison of different CF for a different number of neighbors.

improve the efficiency and precision while searching the closest
neighbors.
To show how sensitive the recommendation performance is to the
number of neighbors, we plot the MAE against the number of
neighbors in Fig. 10. As the number of neighbors increases from 5
to 40 in steps of 5, all algorithms show a similar trend: the MAE
decreases, because more information is available for prediction.
Another observation is that the hybrid algorithm has a lower MAE
than the traditional CF and item-based CF algorithms. The reason is
that the hybrid collaborative filtering algorithm searches for
similar users only within the cluster sets, which still gives a
reasonably good prediction estimate.
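A sweep of this kind can be sketched as follows. The predictor here is an assumption: a common user-based form (Pearson similarity over co-rated items, mean-centered weighted average over the k most similar users), which is not necessarily the exact predictor used in the experiments. The names `mae`, `predict` and `mae_by_k` and the toy data are hypothetical. Held-out test ratings are expected to be zeroed in `train`.

```python
import numpy as np

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

def predict(train, user, item, k):
    """Predict train[user, item] from the k most similar users who rated item."""
    train = np.asarray(train, dtype=float)
    rated = train > 0                      # 0 marks an unrated entry
    mean_u = train[user, rated[user]].mean() if rated[user].any() else 0.0
    scored = []
    for v in range(train.shape[0]):
        if v == user or not rated[v, item]:
            continue
        common = rated[user] & rated[v]    # items rated by both users
        if common.sum() < 2:
            continue
        da = train[user, common] - train[user, common].mean()
        db = train[v, common] - train[v, common].mean()
        denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
        if denom > 0:
            sim = float((da * db).sum() / denom)
            if sim > 0:                    # keep positively correlated users only
                scored.append((sim, v))
    scored.sort(key=lambda t: -t[0])       # stable sort, most similar first
    top = scored[:k]
    if not top:
        return float(mean_u)               # fall back to the user's mean rating
    norm = sum(s for s, _ in top)
    dev = sum(s * (train[v, item] - train[v, rated[v]].mean()) for s, v in top)
    return float(mean_u + dev / norm)

def mae_by_k(train, test, ks):
    """MAE over held-out (user, item, rating) triples for each neighbor count."""
    actual = [r for _, _, r in test]
    return {k: mae([predict(train, u, i, k) for u, i, _ in test], actual)
            for k in ks}

# Hypothetical toy data: the rating of user 0 for item 2 (true value 4) is held out.
train = np.array([[5, 3, 0],
                  [5, 3, 4],
                  [4, 2, 5]], dtype=float)
print(mae_by_k(train, [(0, 2, 4.0)], ks=[1, 2]))
```

On a real data set the sweep would use `ks=range(5, 45, 5)` to match the range plotted in Fig. 10.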

6. Conclusions and future works


Collaborative filtering is employed to implement the recommendation system for P2P network services; it predicts information of
interest to an active user from the rating data of a set of similar users
or items. In this study, to address the sparse user-item matrix problem, we proposed a novel mechanism to fill in the unrated entries of the sparse matrix. We considered both the user-based
rating and the item attribute-based eigenvalue matrix to compute the
item similarity. Moreover, a Hybrid Collaborative Filtering (HCF)
framework for the attribute-based mechanism, extending the
traditional CF algorithm, is proposed to improve the predictive accuracy. We describe an effective mechanism for finding similar
users with similar purchasing motives. Case studies and experimental results illustrate that our approach is a feasible technique
for recommendation in a P2P network. The prediction of our hybrid
mechanism depends mainly on the user similarity in P2P networks.
In future work, we intend to deal with fraudulent behavior, anonymity and privacy problems under P2P network
conditions.
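The idea of filling unrated entries from a Boolean item-attribute matrix can be illustrated with a short sketch. It is only an approximation under stated assumptions: it uses Jaccard similarity between Boolean attribute vectors and a similarity-weighted average of the user's own ratings, whereas the similarity in this paper also combines user-based ratings. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def jaccard_item_similarity(attrs):
    """Pairwise Jaccard similarity between rows of a Boolean item-attribute matrix."""
    attrs = np.asarray(attrs, dtype=bool)
    inter = (attrs[:, None, :] & attrs[None, :, :]).sum(axis=2)
    union = (attrs[:, None, :] | attrs[None, :, :]).sum(axis=2)
    out = np.zeros(inter.shape, dtype=float)
    # Items with no attributes at all get similarity 0 instead of 0/0.
    return np.divide(inter, union, out=out, where=union > 0)

def fill_blanks(ratings, attrs):
    """Fill unrated (0) entries of a user-item matrix.

    Each blank is replaced by the average of the user's existing ratings,
    weighted by the attribute similarity between the blank item and each
    rated item. Blanks with no similar rated item are left at 0.
    """
    ratings = np.asarray(ratings, dtype=float)
    sim = jaccard_item_similarity(attrs)
    filled = ratings.copy()
    for u, i in zip(*np.where(ratings == 0)):
        rated = ratings[u] > 0
        weights = sim[i, rated]
        if weights.sum() > 0:
            filled[u, i] = weights @ ratings[u, rated] / weights.sum()
    return filled

# Hypothetical toy data: 3 items described by 3 Boolean attributes.
attrs = [[1, 1, 0],
         [1, 0, 1],
         [0, 0, 1]]
ratings = [[5, 0, 2]]          # one user; item 1 is unrated
print(fill_blanks(ratings, attrs))
```

Because the attribute matrix is dense even when the rating matrix is sparse, a blank can be estimated even for a user who shares no rated items with anyone else.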
Acknowledgements
This work has been partially supported by the National
Natural Science Foundation of China (Grant No. 90818002,
60973115 and 60933002), Ph.D. Programs Foundation of Ministry
of Education of China (Grant No. 20070151020), National Basic
Research Program of China (973 Program) under the Grant No.
2006CB701303 and Hi-Tech Research and Development Program
of China (863 Program) under the Grant No. 2007AA12Z151 and
2009AA01A402.
References
[1] Amund Tveit, Peer-to-peer based recommendations for mobile commerce, in:
Proceedings of the 1st International Workshop on Mobile Commerce, 2001,
pp. 2629.
[2] Fuyong Yuan, Jian Liu, Chunxia Yin, Yulian Zhang, Nan Shen, A novel collaborative filtering mechanism for product recommendation in P2P networks,
in: Third International IEEE Conference on Signal-Image Technologies and
Internet-Based System, 2007, pp. 254261.
[3] S. Eyheramendy, D. Lewis, D. Madigan, On the naive bayes model for text
categorization, in: Proc. of Artificial Intelligence and Statistics, 2003.
[4] Giancarlo Ruffo, Rossano Schifanella, A peer-to-peer recommender system
based on spontaneous affinities, ACM Transactions on Internet Technology 9
(1) (2009) Article 4.
[5] Keqiu Li, Hong Shen, Francis Y.L. Chin, Si-Qing Zheng, Optimal methods
for coordinated enroute web caching for tree networks, ACM Transactions
Internet Technology 5 (3) (2005) 480507.
[6] Jun Wang, Johan Pouwelse, Reginald L. Lagendijk, Marcel J.T. Reinders,
Distributed collaborative filtering for peer-to-peer file sharing systems, in:
Proceedings of the 2006 ACM Symposium on Applied Computing, 2006,
pp. 10261030.
[7] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, J. Riedl, GroupLens: an open
architecture for collaborative filtering of netnews, in: Proceedings of ACM
Conference on Computer Supported Cooperative Work, 1994.
[8] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, J. Riedl, Grouplens:
applying collaborative filtering to usenet news, Communications of the ACM
40 (3) (1997) 7787.
[9] B. Smyth, P. Cotter, Personalized electronic programme guides, Artificial
Intelligence Magazine 21 (2) (2001).
[10] Greg Linden, Brent Smith, Jeremy York, Amazon.com recommendations: itemto-item collaborative, in: IEEE Internet Computing, vol. 7, IEEE Computer
Society, 2003, pp. 7680.
[11] Keqiu Li, Hong Shen, Francis Y.L. Chin, Weishi Zhang, Multimedia object
placement for transparent data replication, IEEE Transactions on Parallel and
Distributed System 18 (2) (2007) 212224.
[12] Hao Ma, Irwin King, Michael R. Lyu, Effective missing data prediction for
collaborative filtering, in: Proceedings of the 30th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval,
2007, pp. 3946.
[13] Derry O Sullivan, David Wilson, Barry Smyth, Preserving recommender
accuracy and diversity in sparse datasets, in: FLAIRS Conference 2003,
pp. 139143.
[14] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item
collaborative filtering, IEEE Internet Computing (January) (2003).
[15] Manos Papagelis, Dimitris Plexousakis, Themistoklis Kutsuras, Alleviating the
sparsity problem of collaborative filtering using trust inferences, in: iTrust
International Conference 2005, in: LNCS, vol. 3477, 2005, pp. 224239.
[16] Arnaud De Bruyn, C. Lee Giles, David M. Pennock, Offering collaborativelike recommendations when data is sparse: the case of attraction-weighted
information filtering, in: International Conference on Adaptive Hypermedia
and Adaptive Web-based Systems, in: Lecture Notes in Computer Science,
vol. 3137, 2004, pp. 393396.
[17] Alexandrin Popescul, Lyle H. Ungar, David M. Pennock, Steve Lawrence, Probabilistic models for unified collaborative and content-based recommendation
in sparse-data environments, in: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, 2001, pp. 437444.
[18] Jun Wang, Arjen P. de Vries, Marcel J.T. Reinders, Unified relevance models for
rating prediction in collaborative filtering, ACM Transactions on Information
Systems 26 (3) (2008) 142. Article 16.
[19] K. Goldbergh, T. Roeder, D. Gupta, C. Perkins, Eigentaste: a constant time
collaborative filtering algorithm, Information Retrieval 4 (2) (2001) 133151.

1417

[20] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing
by latent semantic analysis, Journal of the American Society for Information
Science 41 (6) (1990).
[21] T. Hofmann, Collaborative filtering via Gaussian probabilistic latent semantic
analysis, in: Proc. of the 26th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 2003.
[22] M. Balabanovic, Y. Shoham, Fab: content-based, collaborative recommendation, Communications of the ACM 40 (1997) 6672.
[23] C.-N. Ziegler, G. Lausen, L. Schmidt-Thieme, Taxonomy-driven computation of
product recommendations, in: Proceedings of the Thirteenth ACM Conference
on Information and Knowledge Management, 2004.
[24] J.B. MacQueen, Some methods for classification and analysis of multivariate
observations, in: Proceedings of 5-th Berkeley Symposium on Mathematical
Statistics and Probability, vol. 1, University of California Press, Berkeley,
pp. 281297.
[25] http://en.wikipedia.org/wiki/Clusteranalysis.
[26] Rong Hu, Yansheng Lu, A hybrid user and item-based collaborative filtering
with smoothing on sparse data, in: 16th International Conference on Artificial
Reality and Telexistence, 2006, pp. 184189, doi:10.1109/ICAT.2006.12.
[27] Marc Snchez-Artigas, Pedro Garca-Lpez, eSciGrid: A P2P-based e-science
Grid for scalable and efficient data sharing, Future Generation Computer
Systems 26 (5) (2010) 704719.
[28] J. Wang, A.P. De Vries, M.J.T. Reinders, A useritem relevance model for
logbased collaborative filtering, in: Proceedings of the European Conference
on IR Research, Springer, London, 2006, pp. 3748.
[29] J. Wang, A.P. De Vries, M.J.T. Reinders, Unifying user-based and itembased collaborative filtering approaches by similarity fusion, in: Proceedings
of the 29th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, ACM Press, New York, 2006,
pp. 501508.
[30] http://www.absoluteastronomy.com/topics/Holland_Codes.
[31] Grouplens, EachMovil, datadet, MovieLens, 2003. http://www.grouplens.org/.

Zhaobin Liu is an Associate Professor in the School of
Information Science and Technology, Dalian Maritime
University, China. He received his Ph.D. in Computer
Science from Huazhong University of Science and Technology in China in 2004. His research areas include Parallel/Distributed/Cloud computing, File and Storage I/O
Systems, Peer-to-Peer Computing, Multi-core Systems,
Performance Evaluation and Modeling, Computer Networks, Embedded Systems, and Trusted Computing. He
has more than 40 publications in international conferences or journals, and has successfully coordinated several
research projects funded by various funding agencies across China.
Wenyu Qu is a Professor at the School of Information
and Technology, Dalian Maritime University, China. She
received her bachelor's and master's degrees from Dalian
University of Technology, China in 1994 and 1997, and her
doctorate degree from Japan Advanced Institute of Science
and Technology in 2006. She was a lecturer at Dalian
University of Technology from 1997 to 2003. Wenyu Qu's
research interests include mobile agent-based technology,
distributed computing, computer networks, and grid
computing. Wenyu Qu has published more than 50
technical papers in international journals and conferences.
She is on the committee board for a couple of international conferences.
Haitao Li received the Ph.D. degree from Huazhong
University of Science and Technology, China, majoring
in Pattern Recognition and Artificial Intelligence. He is
currently an associate researcher at the Key Laboratory of Geoinformatics of State Bureau of Surveying and Mapping,
Chinese Academy of Surveying and Mapping, and a
member of the National Standardization Technical Committee
for Geomatics (SAC/TC230). His research interests include
photogrammetry and remote sensing, pattern recognition,
and high performance processing for remote sensing
imagery.
Changsheng Xie received the B.S. and M.S. degrees in
computer science from Huazhong University of Science
and Technology (HUST), China, in 1982 and 1998, respectively. Presently, he is a professor in the Department of
Computer Engineering at Huazhong University of Science
and Technology. He is also the director of the Data Storage Systems Laboratory of HUST and the deputy director
of the Wuhan National Laboratory for Optoelectronics. His
research interests include computer architecture, disk I/O
system, networked data storage system, and digital media
technology. He is the vice chair of the expert committee of
Storage Networking Industry Association (SNIA), China.
