
Using Folksonomy to Improve the Performance of Blog Ranking

Jia-Zen Fan





Abstract
Using PageRank to rank search results on the web has been widely adopted as a
reliable method; however, the results are not always satisfying. Many studies have
found that there are too few interlinks between blogposts, so PageRank is unable to
recommend novel, highly related blogposts that are only weakly connected to the
rest of the web. Moreover, PageRank lacks topic discovery: it favors generally
valuable blogposts but does nothing for the blogposts that are actually relevant to
the query topic.
We attempt to present a better ranking method that addresses these problems.
Moreover, we compare the reliability of recent topic-discovery page ranking methods
with Folksonomy, since both can be used to generate a common topic relation. This
thesis describes this work.


Keywords: Folksonomy, PageRank, Blog



Acknowledgements



Contents
Abstract ...................................................................................................... I
Acknowledgements ................................................................................. II
Contents .................................................................................................. III
List of Figures ......................................................................................... VI
List of Tables ........................................................................................ VIII
Chapter 1 Introduction .............................................................................1
1.1 What is the motivation of this research? .......................................................... 1
1.2 What kinds of problems to be solved? ............................................................. 3
1.3 Why are the problems significant? ................................................................... 4
1.4 Solutions .......................................................................................................... 9
1.5 Contributions .................................................................................................. 10
Chapter 2 Related Works ....................................................................... 11
2.1 General description of PageRank ................................................................... 11
2.1.1 Current research status and challenges ............................................... 15
2.1.2 Various approaches of PageRank ........................................................ 16
2.1.3 Industry Product of Blog Search ......................................................... 18
2.2 Comparison of various approaches with our approach .................................. 19

2.2.1 Strength, Weakness ............................................................................. 19
2.2.2 Opportunity, Threat ............................................................................. 22
Chapter 3 Method and Solutions ...........................................................24
3.1 Definition, axiom, theorem ............................................................................ 24
3.1.1 Folksonomy ......................................................................................... 24
3.1.2 Topic Importance and Blogpost Importance ....................................... 26
3.2 Problem Model ............................................................................................... 33
3.2.1 Web Surfing Model ............................................................................. 33
3.2.2 Topic Surfing Model on Folksonomy ................................................. 36
3.3 Algorithm ....................................................................................................... 40
3.3.1 Procedure of Blog Search ................................................................... 40
3.3.2 Folksonomy BlogRank Calculating ..................................... 43
Chapter 4 System Implementation ........................................................46
4.1 Implementation environment ......................................................................... 46
4.1.1 Hardware and software platforms ....................................................... 46
4.1.2 Implementation languages and tools ................................................... 47
4.2 System architecture ........................................................................................ 48
4.2.1 High-level system design and analysis ............................................... 48
4.2.2 Low-level system design and analysis ................................................ 50

4.2.2.1 Web Application ............................................................................... 50
4.2.2.2 Backend Application ........................................................................ 54
4.2.2.3 Database ........................................................................................... 56
4.3 System demo .................................................................................................. 57
Chapter 5 Experiment and Discussion .................................................60
5.1 Experiment design and setup ......................................................................... 60
5.1.1 Experiment scenario ............................................................................ 60
5.1.2 Roles, hardware, software, and network requirements setup .............. 61
5.2 Quantitative evaluation .................................................................................. 62
5.2.1 Effectiveness ....................................................................................... 62
5.2.2 Precision .............................................................................................. 64
5.2.3 Results and lesson learned .................................................................. 67
Chapter 6 Conclusion and Future Work ..............................................68
References .............................................................................................69

List of Figures
Figure 1. Google Blog Search for Europe Travel ........................................ 4
Figure 2. Funp Blog Search for Europe Travel ........................................... 6
Figure 3. the few-links problem of PageRank ............................................. 14
Figure 4. A tag cloud in delicious ................................................................ 25
Figure 5. A bookmark in delicious ............................................................... 28
Figure 6. An example of topic surfing model. ............................................. 37
Figure 7. Procedure of Blog Search ............................................................. 40
Figure 8. Pseudocode of Post Match Algorithm .......................................... 41
Figure 9. Pseudocode of Importance Calculation Algorithm ....................... 43
Figure 10. Pseudocode of Rank Calculation Algorithm ............................... 45
Figure 11. The architecture of our system. .................................................. 48
Figure 12. Flow of httpRequest in AJAX .................................................... 51
Figure 13. Class Diagram of Web Application ............................................ 52
Figure 14. Code of function get_rank() ....................................................... 53
Figure 15. Class Diagram of Backend Application ..................................... 54
Figure 16. Fields of Database ...................................................................... 56
Figure 17. System Screenshot of Web Application ...................................... 57
Figure 18. System Screenshot of FRCrawler ............................................... 58

Figure 19. System Screenshot of BlogInfo Crawler .................................... 59
Figure 20. Screenshot of Search Result with Satisfaction Question ............ 66


List of Tables
Table 1. Top 10 results for the keywords Europe Travel in Google Blog Search ......................... 5
Table 2. Top 10 results for the keywords Europe Travel in Funp Blog Search ............................ 7
Table 3. Strength and Weakness of our approach and other various approaches ......... 19
Table 4. Strength and Weakness of our approach and Topic-sensitive PageRank ....... 22
Table 5. Opportunity and Threat of our approach and other various approaches ........ 23
Table 6. Statistics of the blogpost-topic Graph ............................................................ 62
Table 7. Statistics of the blogpost Graph ..................................................................... 63
Table 8. Examples of SSI score ................................................................................... 65
Table 9. Average SI Score ............................................................................................ 67

Chapter 1 Introduction

1.1 What is the motivation of this research?
A blogpost is a web page on a blog that contains an article and is usually open to
comments left by visitors. Compared with a general web page, which may offer any
kind of service, a blogpost focuses on providing richer content and more recent
information. Along with the massive growth of the web, the quality of both web
pages and blogposts has become more and more important to readers; there are more
and more unimportant or harmful pages, such as spam, that greatly depreciate the
quality of search. Therefore, measuring the importance and relevance of web pages
has become increasingly important, and it remained a bottleneck of content analysis
until the appearance of PageRank.
Search quality has been greatly improved by PageRank [52], which ranks the
importance of web pages by how often they are referred to by other pages and how
important those referring pages are. Pages score each other through the links between
them, and more important pages rank higher because they receive more links from
important pages. This measure originates from calculating the citation scores of
academic papers.
Under PageRank, websites with a more densely linked cluster of pages rank
higher on average. For example, in contrast to wiki pages, which are published by
linking before editing, blog posts, in which people record their thoughts conveniently
and which usually contain far fewer inter-links, are much harder to rank near the front
of a Google search result than wiki pages. Broder [5] found that the web is weakly
connected and aperiodic; moreover, Kritikopoulos's statistics show that there are only
0.27 hyperlinks per blogpost on average [39]. Therefore, PageRank performs poorly
on most blogposts, especially the latest blogposts that have no inlinks yet.
This weakness can be mitigated by measuring topic relevance. Much research has
addressed measuring the topic relevance of web pages efficiently, such as TSPR
(Topic-sensitive PageRank) and the HillTop algorithm; however, the performance is
still unsatisfactory for blogpost search, owing to the limited accuracy of the topic
classification these algorithms rely on.
To overcome the above problems, we propose a more precise ranking method that
makes good use of Folksonomy: it evaluates blogposts both by how well a topic fits a
post and by how important the post is within that topic, and scores them according to
the social consensus.






1.2 What kinds of problems to be solved?
Blog search, as a part of web search, can in principle use any web search method,
but it is limited by the fact that a blogpost is essentially a web page with few
hyperlinks. This leaves many blogposts unrankable and reduces search performance
for methods that rank by link importance, such as PageRank. Furthermore, PageRank
is not safe, because its scores are easily manipulated through hyperlinks modified by
bloggers; for example, they can add links to others in return for some benefit, without
considering the importance of the linked content.
Moreover, PageRank itself does not consider the impact of the topic of the search
keywords on the rank. This allows the scores to be calculated in advance, but it also
means the rank lacks topic judgment. PageRank ranks posts for the general case rather
than by their importance to different topics. Once a post obtains a high score, it will
always rank high for every topic, even topics it rarely discusses.
These weaknesses indicate that it is necessary to find relationships other than
hyperlinks that can improve the ranking method and that are not easy to manipulate.
For these reasons, we explore the topic relationships between posts and, at the same
time, ensure the objectivity and fairness of topic judgment to prevent manipulation.


1.3 Why are the problems significant?
PageRank brings a convenient and well-performing method to search, but not a
perfect one. When we observe the results of PageRank applied to blog search, its
shortcomings emerge more obviously. We illustrate our observation with the
following example.

Figure 1. Google Blog Search for Europe Travel
Figure 1 shows a screenshot taken at 11:00 a.m. on April 18, 2008, when we
entered "Europe" and "travel" as keywords to search for Chinese blogposts on the
Google Blog Search site, which uses PageRank as its main ranking and offers two
secondary options: rank by time or rank by relevance. We chose the latter to remove
the influence of time on the search results, and we list the top 10 results in Table 1:

Table 1. Top 10 results for the keywords Europe Travel in Google Blog Search

Rank | Blogpost / URL | Gist of content
1 | http://blog.xuite.net/windshape/smile/16675406 | Europe travel book advertisement
2 | http://www.eurotravel.idv.tw/forum/read.php?tid=13474 | A hot news item about a Taiwanese cable car
3 | http://www.eurotravel.tw/forum/read.php?tid=13469 | Question about Europe travel
4 | http://www.backpackers.com.tw/forum/showthread.php?t=78474 | Question about Europe travel
5 | http://www.travelrich.com.tw/adredirect.aspx?P_Class=10&Table_id=5263&CheckID=E6D6BFA9-B19B-47FE-82A8-91D19671C5CE | Europe travel news
6 | http://blog.pixnet.net/sallysoup/post/16602518 | Europe travel note
7 | http://www.wretch.cc/blog/tina1025168&article_id=10407366 | Europe travel note
8 | http://jason-ontheroad.blogspot.com/2008/04/blog-post.html | Europe travel comments
9 | http://blog.yam.com/anika/article/14766902 | Europe travel comments
10 | KITARO, http://jc2007106.blog.sohu.com/83545258.html | A history of a Japanese musician
Observing the top 10 results, there are clearly two posts whose content differs
greatly from "Europe Travel": the second and the tenth. The former is a news item
that does not discuss Europe at all but comes from the blog of a European travel
agency, and the latter introduces a Japanese celebrity's life and says very little about
his experience of travelling in Europe. Both of them pass the term match that is
usually used with PageRank, which filters posts by counting the appearances of the
keywords in the content and title of the post and in the title of the blog site, discards
posts whose counts fall below a threshold, and leaves the rest to be treated as
important posts by PageRank. This result shows that PageRank cannot dilute the
problem that unrelated or weakly related posts are hard to filter out thoroughly by
term match; the problem could at least be reduced by ranking irrelevant posts towards
the bottom of the list so that they do not confuse users.
To compare with the performance of PageRank, we searched for the same
keywords, at the same time as our Google Blog Search query, in the blog search
interface of a folksonomy site named Funp.

Figure 2. Funp Blog Search for Europe Travel

Table 2. Top 10 results for the keywords Europe Travel in Funp Blog Search

Rank | Blogpost / URL | Gist of content
1 | http://blog.pixnet.net/Barbie99/post/15385995 | Europe travel info (hotel comparison)
2 | http://blog.pixnet.net/Barbie99/post/14928100 | Europe travel note
3 | http://blog.liontravel.com/MALUCHUNCHUN/post/2133/7219 | Europe travel note
4 | http://blog.liontravel.com/52099lu/post/1604/7199 | Europe travel info (hotel introduction)
5 | OiaLena's house, http://blog.pixnet.net/rolla/post/13301594 | Europe travel note
6 | Hallstatt Pension Sarstein, http://blog.pixnet.net/trevia/post/11622170 | Europe travel info (hotel introduction)
7 | Four Seasons, Milan, http://living-style.blogspot.com/2007/09/four-seasons-milan.html | Europe travel note
8 | http://blog.xuite.net/wm_dsc/music/14349212 | Europe travel advertisement
9 | http://blog.pixnet.net/trevia/post/9965852 | Europe travel note
10 | http://basil.idv.tw/blog/?p=1039 | Europe travel note
As shown in Figure 2, we looked for the posts we wanted through the Funp blog
search interface, which searches over tags. Likewise, we entered the keywords
"Europe" and "travel", chose the rank-by-relevance option, and list the top 10 results
in Table 2.
Compared with Table 1, we found that the results here are almost all relevant to
the keywords, and higher-quality articles easily appear near the front. This is because
Funp uses folksonomy to filter the posts and the number of recommenders to rank
them. Taking folksonomy filtering as an alternative to term match lets posts with
richer and more important content be discovered; however, it reduces the recall of
the search results if there is no good way to match tags and keywords.
As mentioned above, we believe the problems left by term match can be
alleviated by a good ranking method. Therefore, in our research we do not address
the filtering problem in detail; instead, we concentrate on using folksonomy to rank
the search results. Moreover, we combine the advantages of folksonomy and
PageRank to make the ranking more precise.




1.4 Solutions
It was pointed out in the previous sections that it is hard to calculate the importance
of a blogpost from its sparse hyperlinks; furthermore, we have discussed that it is
important to analyze the importance of a post according to the topic of its content.
Therefore, we consider both the importance of a topic within the content of a post and
the importance of the post within a topic, both derived from folksonomy, and then
rank by these importances.
Folksonomy, in which classification is defined not by any standard taxonomy but
by the participants using whatever words they think of, creates a complex
classification structure but captures topics more objectively and in more detail by
collecting the biases of many people. Through the Internet, anyone is free to classify
a resource by topic, each using their own cognition to decide which topics the
resource belongs to. At the same time, the folksonomy site aggregates all the views
of the resource and publishes them on the Internet, so people are free to second the
views of others, in part, by classifying with the same topics. Finally, the outcome is a
classification based on common consensus. A search engine that can acquire such a
social consensus from the Internet benefits, because the consensus helps it optimize
its results to be in line with public awareness.
We can use folksonomy to rank blogposts by the importance of a topic, which can
be read from the aggregated classification: once users wish to find, through a search
engine, the posts they used to classify on a folksonomy site, they are very likely to use
the topic, in their own words, as the search keyword. Considering the case of
searching with only one keyword, we define the probability that a topic is selected as
the keyword as the importance of the topic. On the other hand, not every post is
equally important for every search, so we focus on the posts classified under a topic
and take the expected probability of selecting each of them as the importance of the
post for a search on that topic.
Taking the two importances above as the user's selection preferences, we can
obtain the probability distribution, and hence the mathematical expectation, of post
selection, and we take this expectation as the post score to rank the search results.

1.5 Contributions
We explore relationships outside the hyperlinks to reduce the impact of weak
connectivity on ranking performance. The post-topic relation generated from
folksonomy gives blogposts that are not linked by many hyperlinks a chance to be
ranked higher if they have high importance within a topic.
Furthermore, folksonomy can reach a correct topic classification through the
so-called social consensus. Our contribution is to study the importance of a blogpost
to a topic, and to show that a ranking that considers topic importance can improve
users' satisfaction with the ranking of blog search results.


Chapter 2 Related Works

2.1 General description of PageRank
It is hard for people to use complete, precise terms to search for something
unknown, so they use a few terms as broad conditions when searching the web,
which brings back a wide range of results. For such results, people find it hard to
choose what they really want unless the results are ranked by their degree of fitness
to the search terms.
In the past, search engines ranked results only by content analysis of the page,
such as term match, which calculates the compatibility between the search keywords
and the content of web pages; however, this method performs well in neither
precision nor recall. Precision is reduced because a proportion of the results have
little relevance to the search keywords, and recall is reduced by the inability to find
related pages that do not contain the keywords. These shortcomings have many
causes, including the inadequacy of the relevance calculation and the fact that the
implicit requests that searchers do not express in their search conditions still need to
be discovered.
Observing the behavior of users on the web, Internet researchers found that the
pages users incline to visit are reflected in the hyperlinks, and that attention
concentrates on certain authoritative pages: high-prestige pages that are regarded as
high-quality and helpful and are frequently linked by other pages [13].
PageRank [52] assumes that prestige is conveyed by a hyperlink, and that this
prestige between pages can also be used to predict the prestige a user assigns to a
page. PageRank is like a vote among web pages to evaluate each other: through their
links, pages vote for the pages they link to. The more links point to a web page, the
more votes it receives, which means the linking pages give it a higher degree of
support.
PageRank is a link-based ranking algorithm that derives importance from link
connectivity. It can also be read as an expectation formula that calculates the expected
number of visits to each page after a user surfs the web an unlimited number of times
with a fixed probability of surfing to each page. It assumes two surfing models with
different probabilities when a user surfs the web:
Link surfing model: users surf to the next page according to the hyperlinks on the
page they have just read; for example, they are likely to surf to pages B, C or D after
reading page A if and only if page A contains hyperlinks to pages B, C and D.
Direct surfing model: users surf the web randomly, which means there is the same
probability of surfing from any page to any other; in other words, the probability
never changes no matter which page the user has just read.
The expected visit counts of all pages are defined as a vector $X$ as follows:

$$X_{n+1} = d\,L^{T} X_n + (1-d)\,\frac{1}{N}\,\mathbf{1}$$

$X_n$ is a column vector containing the expected number of reads of each page after
the $n$-th round of surfing. $L$ is the probability distribution over pages of surfing to
another page under the link surfing model. $N$ is the total number of pages, so under
the direct surfing model any page is surfed to with probability $1/N$. The variable $d$
is a damping factor, an impact coefficient on the term it accompanies. The larger the
damping factor (no more than 1.0), the more often the searcher is assumed to follow
the results computed by the link surfing term; in other words, the higher the
coefficient, the more the ranking relies on the link structure. Otherwise, the searcher
clicks results randomly, i.e., with probability $1/N$. In Google Search, this coefficient
is around 0.85. From this formula we obtain the rank scores by letting
$X_{n+1} = X_n$ and solving for $X_n$, provided there is an $n$ large enough for the
equation to be satisfied.
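To make the iteration concrete, the following sketch (a minimal illustration in Python with NumPy; it is not the implementation used in this thesis, and the 3-page link matrix is hypothetical) computes the vector by power iteration until X stops changing.

import numpy as np

def pagerank(L, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for X_{n+1} = d * L^T X_n + (1 - d)/N."""
    N = L.shape[0]
    X = np.full(N, 1.0 / N)                 # start from the uniform distribution
    for _ in range(max_iter):
        X_next = d * L.T @ X + (1.0 - d) / N
        if np.abs(X_next - X).sum() < tol:  # X_{n+1} ~ X_n: converged
            return X_next
        X = X_next
    return X

# Hypothetical 3-page link graph: row i holds the surfing probabilities
# from page i to every other page (each row sums to 1).
L = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(pagerank(L))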
This formula needs all the hyperlinks of every page on the web, which is a very
arduous task. Google solves this problem in a brute-force way, using a web spider (or
web crawler) to automate it: the spider scans a page for hyperlinks and retrieves the
pages behind those links repeatedly, until no unvisited link remains on the retrieved
pages.
PageRank has been challenged on several problems:
PageRank is ineffective when the connectivity among pages is sparse. As shown in
Figure 3, when there are few links in the web graph, the scores calculated by
PageRank are close to the scores obtained by considering only the direct surfing part
of PageRank; in other words, most pages receive the same low score.

Figure 3. the few-links problem of PageRank
Another problem PageRank faces is that it is a general-purpose ranking; however,
different search topics should produce different rankings, and this shortcoming
results from the lack of any relevance between the rank and the search keywords.
Furthermore, some links are unrelated to the content, such as exchanged
advertising links, which seriously affect the reliability of the ranking and may even
be used as a Google Bomb (or Google Wash) [51][54][70] to increase the ranking of
specific web pages.
The first two problems are highlighted in blog search, as presented in Section 1.2:
studies of blogs point out that there are few links among blogposts, and, excluding
the links from the owner's own blog, a blogpost is nearly a plain text file. This
greatly reduces the performance of such link-based ranking algorithms on blog
search.
2.1.1 Current research status and challenges
In this section, we discuss various adaptations in current research, and then point
out some important issues and challenges of PageRank. To break through, or avoid,
the impact of the foregoing problems on ranking precision, many studies have tried to
identify the search topic and to create a topic-aware PageRank. These efforts can be
divided into two approaches: using term match, or using a manually built
classification database to match the query with the page.
The advantage of the former is that any page can be matched this way, whereas
the latter fails to match pages outside the classification database. However, because
it lacks a systematic classification, term match cannot easily capture the meaning of
the words on a page, nor is this much improved by applying natural language
processing techniques, which yield very limited gains in precision. Besides the fact
that a topic may be implied by a page without appearing as words in it, meaningless
words inserted by malicious authors to match hot topics also lower the precision.
Going further, there is another way to increase precision, quite different from
those mentioned above, which strives for precision in personalized search by
providing different search results to different users. Such methods collect the history
of each user's search behavior, such as the keywords they have requested or the pages
they have selected, and then find rules relating the search results to the behavior of
each user.

2.1.2 Various approaches of PageRank
TF-IDF (Term Frequency - Inverse Document Frequency) [35][57][58][59]
TF-IDF is the most common term match method; it assesses the importance of a
search keyword to a web page. Term frequency is the frequency with which the
keyword appears in the text of a page, and inverse document frequency is the inverse
of the proportion of all pages that contain the keyword. The score evaluating the
relevance between a page and a search keyword is the product of the two frequencies;
it is higher for pages that contain the keyword more often, in particular when the
keyword appears in few other pages.
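As a small illustration (not part of the thesis system; the three documents and the naive whitespace tokenization are made up), the following Python sketch computes a TF-IDF score of one keyword for each document in a collection.

import math
from collections import Counter

def tf_idf(keyword, docs):
    """Return a TF-IDF score of the keyword for every document in docs."""
    keyword = keyword.lower()
    tokenized = [d.lower().split() for d in docs]
    # document frequency: number of documents containing the keyword
    df = sum(1 for tokens in tokenized if keyword in tokens)
    idf = math.log(len(docs) / df) if df else 0.0
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)[keyword] / len(tokens)   # term frequency
        scores.append(tf * idf)
    return scores

docs = ["europe travel notes from a backpacker",
        "travel europe by train travel cheap",
        "a japanese musician and his concerts"]
print(tf_idf("travel", docs))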
Kurland-Lee Method [41]
Kurland and Lee established implicit links by computing the content similarity of
pages with data mining techniques [63]; however, this is very time-consuming, and it
cannot reach topics whose relevant words appear rarely, or not at all, in the text.

HillTop [6]
HillTop raises the rank of pages that are more strongly connected to expert pages.
An expert score is computed by matching the query keywords against each expert
page, where expert pages are selected as pages containing more than a given number
of hyperlinks; a target score is then computed for a page linked by at least two expert
pages, as the sum over those expert pages of the product of the expert score and the
proportion of the expert page's hyperlinks that point to the target.
Collaborative Filtering [21][38][55][60]
Collaborative filtering records the keywords requested by searchers and groups
searchers who use similar keywords. The expectation is that a searcher will be
interested in the pages that interested others in the same group, so the more users in a
group have clicked a page, the higher the score given to it for search requests by
members of that group. Although this is a good way to personalize search, it does not
perform well for new users who have not yet been assigned to any group.
TSPR (Topic-sensitive PageRank) [26][27]
Topic-sensitive PageRank improves the search results by splitting PageRank into
multiple dimensions according to the top-level directories of the Open Directory
Project, giving each search keyword a vector of relevance values over all the
dimensions according to a word database such as CIRCA, and calculating the overall
relevance of each page to the search keywords, where each page's vector is given by
its PageRank score in the single dimension corresponding to the top-level directory
of the page. However, the Open Directory Project classifies web pages by human
judgment. Although a standard taxonomy (e.g., the ACM classification) makes the
classification more organized and the weight calculation easier, such subjective
human judgment may introduce bias, so the classification process must be kept under
reliable supervision.
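The following sketch (a simplified, hypothetical illustration of the idea rather than the published TSPR implementation; the topic PageRank values and query relevances are invented) shows the core combination step: per-topic PageRank vectors are mixed according to the query's estimated topic relevance.

import numpy as np

def tspr_score(topic_pagerank, query_topic_relevance):
    """Combine per-topic PageRank vectors with the query's topic relevance.

    topic_pagerank:        array of shape (num_topics, num_pages),
                           row c = PageRank computed within topic c.
    query_topic_relevance: array of shape (num_topics,), the estimated
                           relevance of each topic to the query.
    Returns one score per page: sum_c P(c | query) * PR_c(page).
    """
    w = query_topic_relevance / query_topic_relevance.sum()   # normalize
    return w @ topic_pagerank

# Hypothetical numbers: 2 topics, 3 pages.
topic_pr = np.array([[0.5, 0.3, 0.2],    # topic "travel"
                     [0.1, 0.2, 0.7]])   # topic "music"
query_rel = np.array([0.9, 0.1])         # query is mostly about travel
print(tspr_score(topic_pr, query_rel))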
2.1.3 Industry Product of Blog Search
Google Blog Search [73]
In 2005, Google Inc. introduced Google Blog Search, which extends the Google
web search technology to a blog search service. Google Blog Search uses the same
web crawler as Google web search to fetch pages, separates blogposts from other
pages by whether they carry an RSS feed, and uses PageRank as its ranking
algorithm. Google provides the Google Blog API as an open development tool so
that other web authors can use its search services, including blog search.
Funp Blog Search [72]
Funp is a Web 2.0 commercial web site established in 2007 by graduates of
Taiwanese institutes, which provides a platform for sharing, recommending and
tagging web media, such as web pages, videos and blogposts, together with a search
engine. Funp Blog Search is one of its services; it recommends posts with a score
based on the number of recommenders and the similarity of tags. Posts on Funp are
also submitted by users, so the post collection has higher quality but a smaller size.


2.2 Comparison of various approaches with our
approach
In this section, we use a SWOT analysis to identify the strengths and weaknesses
of our approach, as well as its opportunities and threats. In general, the different kinds
of importance that different web search technologies stress lead them to perform
differently, and also give them different protection against, and exposure to,
malicious manipulation aimed at raising rankings.
2.2.1 Strength, Weakness
We do not take up personalized search in our research, so we compare the
strengths and weaknesses of the search technologies of Section 2.1, classified by the
kind of importance they use, leaving collaborative filtering aside, as shown in Table 3:
Table 3. Strength and Weakness of our approach and other various approaches

Importance (Algorithm) | Strength | Weakness
Link Importance (PageRank, HITS) | Improves the quality of search results by ranking more heavily linked pages higher. | Insensitive to topic; pages not linked by others cannot be ranked; unfavorable to new pages.
Content Importance (TF-IDF, Kurland-Lee) | Improves the quality of search results by ranking pages containing more keywords higher; all pages can be ranked. | Insensitive to implied topics; hard to ensure the quality of the page content.
Link Importance + Content Importance (HillTop) | Sensitive to both topic and link importance. | Insensitive to implied topics.
Link Importance + Topic Importance (TSPR, Our Approach) | Sensitive to both topic and link importance; keyword-related topics not contained in the page are sensed. | Pages whose topics have not been judged by readers cannot be ranked; a large amount of additional data is needed.
The benefit brought by raising link importance is like finding a research paper with
a high number of citations. When all links are added for normal purposes, the quality
of the posts in the search result, as well as of the posts citing them, will be good.
However, considering this point alone is not thorough, because most searches
concern particular topics derived from the keywords entered, and it is necessary to
analyze whether the blogposts satisfy the topic needed.
Term match analyzes content importance to find the posts containing the most
keywords. Content importance is text-related, so it is usually analyzed with natural
language processing. It is hard to identify high-quality pages by analyzing content
importance, because grammar varies and users can write a high-quality post without
obeying any grammar.
Although considering both link importance and content importance gives better
results, the importance of a post for a topic, which is implicit in the content, still
cannot be discovered in this way.
The main difference between topic importance and content importance is that
topic importance relies on a large record of human topical judgments. Although this
increases the space and time complexity of the calculation (inevitably, more reads of
the additional information are needed), by exploiting human judgments about the
topics of keywords and posts as far as possible, it bypasses the bottleneck of
analyzing the keyword-topic-post relation with natural language processing.
Both our approach and Topic-sensitive PageRank are search methods based on
topic importance; they differ in two ways:
First, Topic-sensitive PageRank uses only the few topics given by the top-level
directories of the Open Directory Project, which may be insufficient to match the
keywords or posts against and may therefore miss the posts the request actually asks
for. Our approach maximizes the number of topics available to cover the posts and
the keywords. Although this produces an even greater amount of data, enlarging the
range of topic discovery improves the precision of the search.
Second, Topic-sensitive PageRank uses the CIRCA database to match the search
keywords with their topics and the ODP (Open Directory Project) to match each post
with its topic; our approach uses only the database of a folksonomy site to match
topics with both posts and keywords. From the information of a folksonomy site we
cannot get post topics evaluated objectively by experts; however, we can collect rich,
diverse, bias-including topics of posts, which makes the strengthened range of topic
discovery possible and gives people whose views differ from the objective opinions
of experts the opportunity to find posts as well, so satisfaction with the search results
becomes higher.
Table 4 compares the strengths and weaknesses of our approach and
Topic-sensitive PageRank.
Table 4. Strength and Weakness of our approach and Topic-sensitive PageRank

Approach | Strength | Weakness
Our approach | The set of topics is large and diverse, which makes the precision of search higher; posts recommended by readers can be discovered. | Higher time complexity of computation; posts not classified on the folksonomy site cannot be scored.
TSPR | The topic of a post is objective; lower time complexity of computation. | The number of topics is too small and not diverse enough, which reduces precision; posts not classified in the ODP cannot be scored.

2.2.2 Opportunity, Threat
In past research on search and ranking, little attention has been given to combining
folksonomy with search. The opportunity for our method is that using folksonomy to
rank posts improves the precision of the search results; on the other hand, unlike
PageRank or HillTop, by taking advantage of such a democratic way of classifying,
manipulation by the authors of authoritative pages can be avoided.
However, this method is greatly affected by the quality of the folksonomy
information. There are two threats to our method: on one side, the fewer the users
participating in classifying a blogpost, the lower the accuracy of its topic
classification; on the other side, the fewer the kinds of tags, the lower the accuracy of
matching keywords to topics.
Table 5. Opportunity and Threat of our approach and other various approaches

Approach | Opportunity | Threat
Our approach | Malicious manipulation of the rank is difficult; an optimized ranking method for blogposts. | A lack of folksonomy information leads to matching errors between keywords and topics.
PageRank | Ranking by using social intelligence. | Link-dependent; the rank can easily be manipulated through hyperlinks.
TSPR | Search results classified by experts. | Too few kinds of topics leads to matching errors between keywords and topics.
HillTop | Finds authorities more easily. | Cannot guarantee the fairness of the important links; manipulation of an important post greatly affects the rank.


Chapter 3 Method and Solutions
Blog ranking is a particular case of page ranking in which a blog contains too few
links for PageRank to compute a meaningful score. In this chapter, we propose a
method for blog ranking that heightens ranking precision by ranking blog posts
according to topic importance evaluated from folksonomy.
3.1 Definition, axiom, theorem
Before we use topic importance to rank blogposts, we investigate what kinds of
topic importance can be derived for a blogpost from folksonomy. In Section 3.1.1 we
first introduce the outline and basic properties of folksonomy, and in the following
section we define the two kinds of importance we found, that is, the importance of a
topic within a blogpost and the importance of a blogpost within a topic.
3.1.1 Folksonomy
Folksonomy [1][11][12][22], also known as collaborative tagging, social
classification, social indexing or social tagging, is an open classification scheme on
the Internet in which any resource can be classified by any user by freely assigning
any tags as categories. This way of classifying is like a vote, in which the outcome of
the classification depends on the majority consensus.
A tag, serving both as a category and as a topic, is limited to a single term as its
name. Tags play an important role in social media [34][61][64][65]; they are attached
to a social resource after it is uploaded by an author or read by a visitor, so a tag is
not only a good reference for readers who have not yet read the resource, but also a
basis for information processing done by machines, such as resource classification
and exploration [4][8][43].
Folksonomy allows a user to describe a topic of a resource with a simple word as
a tag, and represents the classification of a resource over topics as a tag cloud, as in
Figure 4, which shows the tag cloud of a web page on a folksonomy site. The font
size of each tag is a simplified visualization of its tag times, i.e., the number of people
participating in classifying the resource who think the tag describes it.

Figure 4. A tag cloud in delicious
In folksonomy, tags, users, and resources together form a ternary relation, from
which any projected self-relation or binary relation has its own meaning. For
example, a user-user self-relation can define a social network in which two users are
related if both of them put the same tag on the same resource, and a tag-user relation
on a resource is just like a bookmark [42]. Such relations are used in our method to
calculate the topic importance of a blogpost and the blogpost importance within a
topic, which we define in the next section.
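As a small illustration of such projections (a hypothetical sketch with made-up tagging data, not the data model used later in this thesis), the following Python code stores the ternary relation as (user, post, tag) triples and projects out both a bookmark and the user-user co-tagging relation described above.

from itertools import combinations
from collections import defaultdict

# Ternary folksonomy relation as (user, post, tag) triples (hypothetical data).
triples = [
    ("u1", "postA", "travel"), ("u2", "postA", "travel"),
    ("u2", "postA", "europe"), ("u1", "postB", "travel"),
    ("u3", "postB", "travel"),
]

# Projection 1: the tag-user relation of one post, i.e. its bookmark.
bookmark = defaultdict(set)
for user, post, tag in triples:
    if post == "postA":
        bookmark[tag].add(user)

# Projection 2: the user-user relation: users related because they put
# the same tag on the same resource.
taggers_of = defaultdict(set)
for user, post, tag in triples:
    taggers_of[(post, tag)].add(user)
related_users = {pair for users in taggers_of.values()
                 for pair in combinations(sorted(users), 2)}

print(dict(bookmark))    # {'travel': {'u1', 'u2'}, 'europe': {'u2'}}
print(related_users)     # {('u1', 'u2'), ('u1', 'u3')}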

On the other hand, when using folksonomy as a basis for calculating such
relations, we must pay attention to some of its properties. For example, according to
research on the folksonomy site delicious [71], the distribution of the tag cloud of a
web page becomes stable after it has been tagged by more than 100 people [22]. This
means we must filter out blogposts with too few tag times. Note that we do not filter
by the usage count of a tag, because we want to gather the biases of as many people
as possible in the search.
3.1.2 Topic Importance and Blogpost Importance
When users search for a post with a web search engine, they first enter some
keywords as topics, and then select a post from those returned for the topics.
Similarly, when users visit a folksonomy site, they first select a tag as the topic of
interest, and then pick a post from the posts under that tag.
Based on users' behavior of picking posts on a folksonomy site, we wish to
predict their demands on search results. For a search, an important topic is a topic
that people are likely to be concerned with, and an important post is a post that people
are likely to choose. The higher the importance of a post, the higher it should be
ranked in the search results and the more easily it is clicked.
Before we define the topic importance within a blogpost and the blogpost
importance within a topic, we first clearly define the scope of blogposts and topics in
our discussion:

Blogpost: A blogpost is a post published by its author on a blog site, with a
permalink (permanent link) through which its content can be browsed by entering
the URL directly. A blogpost is a web page, but not vice versa.
Topic: We regard each tag in the bookmark of a blogpost as one of its topics.
Since larger tag times means more people agree on using the tag to describe the
blogpost, the tag times can serve as a degree of relevance to the topic.
Besides, in folksonomy, a blogpost together with the set of all topics tagged on it
is called a bookmark:
Bookmark: In this paper, the term bookmark refers specifically to the bookmark
of a blogpost on the folksonomy site. A bookmark, which contains the tag cloud of a
post as shown in Figure 5, can be represented by the following mathematical model:

$$Bookmark(p) = (T, U, R(p))$$

where $p$ is the blogpost, $T$ is the set of distinct tags in the bookmark, and $U$ is
the set of all users who have ever tagged blogpost $p$. Any user can create a tag from
any word without spaces and attach any tag to the post. We assume that every tagger
of a post is also a visitor of it; in other words, users must read a post before tagging
it. We thus obtain the $|T| \times |U|$ relation matrix $R(p)$, whose entry for a tag and
a user is 1 if that user put that tag on post $p$, and 0 otherwise.
We can get the tag times of a blogpost in a bookmark as the row sums of $R(p)$,
that is, by multiplying $R(p)$ with a $|U| \times 1$ column vector of ones:

$$TagTimes(p) = R(p)\,\mathbf{1}_{|U| \times 1}$$

This $|T| \times 1$ vector can be used to display the tag cloud of a bookmark.

Figure 5. A bookmark in delicious
Conversely, taking the column sums of $R(p)$ by multiplying it on the left with a
$1 \times |T|$ row vector of ones gives the number of distinct tags each user used on
the post:

$$UserTimes(p) = \mathbf{1}_{1 \times |T|}\,R(p)$$
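As a small worked illustration of the two products (hypothetical data, written with NumPy rather than the thesis's own code), the bookmark matrix below has one row per tag and one column per user; the row sums give the tag cloud counts and the column sums give each user's tag count.

import numpy as np

# R(p): rows = tags (travel, Europe, bike), columns = 4 users;
# R[t, u] = 1 if user u put tag t on post p.
R = np.array([[0, 1, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 0]])

tag_times = R @ np.ones(R.shape[1])    # TagTimes(p): one count per tag
user_times = np.ones(R.shape[0]) @ R   # UserTimes(p): tags used by each user

print(tag_times)   # [3. 3. 3.]  -> sizes of the tag cloud entries
print(user_times)  # [1. 3. 3. 2.]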


Tag Bookmark: Furthermore, in order to facilitate the explanation of our
algorithms, we additionally define the tag bookmark, which collects all usages of a
tag $t$ on all blogposts by all users. It is the $|P| \times |U|$ matrix

$$Q(t) = \begin{bmatrix} R_t(1) \\ R_t(2) \\ \vdots \\ R_t(|P|) \end{bmatrix}$$

where $R_t(i)$ denotes the $t$-th row of $R(i)$.
Similarly, multiplying $Q(t)$ on the left by a $1 \times |P|$ row vector of ones
gives the number of times each user has used tag $t$:

$$UserTimes(t) = \mathbf{1}_{1 \times |P|}\,Q(t)$$


Then, before defining the importances, we assume the following facts hold:
Assumption 1: when users wish to find, through the search engine, posts they
have tagged on the folksonomy site, they will use the tags they have used as the
search keywords.
Assumption 2: when users use a tag they have used on the folksonomy site as the
search keyword, the posts they wish to find are exactly the posts they have tagged
with that keyword.

Thus, we can use these assumptions and definitions to mathematize the two
importances:
Topic Importance: the importance of a topic within a post, i.e., the probability that
the topic is selected when searching for that specific post.
We observe all the users involved in the classification of a given blogpost. By
Assumption 1, each user has an equal opportunity to select any of the tags they have
used on the blogpost, so for a post $p$ the probability that a topic is selected is equal
to the probability that a randomly selected user selects this topic. We obtain the
probabilities of all topics by multiplying the column-normalized matrix $R(p)$ with a
$|U| \times 1$ column vector whose elements are the inverse of the number of
non-zero columns of $R(p)$:

$$f_p = CN(R(p)) \cdot \frac{1}{NZC(R(p))}\,\mathbf{1}_{|U| \times 1}$$

The function $CN$ is the column-normalizing operator on a matrix; it returns a
matrix of the same size in which each element is divided by the sum of its column
(all-zero columns are left as zero). $CN$ can be written as an entrywise (direct)
product of two matrices of the same size:

$$CN(R(p)) = D(p) \circ R(p), \qquad D(p)_{tu} = \frac{1}{usertimes_u(p)}$$

where $D(p)$ is the $|T| \times |U|$ matrix whose every row repeats the reciprocals of
$UserTimes(p)$. $NZC(R(p))$ is the number of nonzero columns of $R(p)$, that is,
the number of users who classified post $p$.
Assuming there are $|P|$ blogposts and $|T|$ topics (tags), we collect the column
vectors $f_p$ into the probability matrix of selecting each topic for each blogpost:

$$F = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1|P|} \\ f_{21} & f_{22} & \cdots & f_{2|P|} \\ \vdots & \vdots & \ddots & \vdots \\ f_{|T|1} & f_{|T|2} & \cdots & f_{|T||P|} \end{bmatrix}_{|T| \times |P|}$$

Each element $f_{ij}$ of $F$ represents the topic importance of topic $i$ in blogpost
$j$. We define $F$ as the topic importance matrix.
Blogpost Importance: the importance of a blogpost under a topic, i.e., the
probability that the blogpost is selected when searching for that specific topic.
By Assumption 2, each user has an equal opportunity to select any of the blogposts
they have tagged with the topic, so the probability that a random user selects each
blogpost is:

$$f'_t = CN(Q(t)) \cdot \frac{1}{NZC(Q(t))}\,\mathbf{1}_{|U| \times 1}$$

The formula means that we obtain $f'_t$ simply by replacing $R(p)$ with $Q(t)$.
We therefore obtain the probability matrix of selecting each blogpost for each topic:

$$F' = \begin{bmatrix} f'_{11} & f'_{12} & \cdots & f'_{1|T|} \\ f'_{21} & f'_{22} & \cdots & f'_{2|T|} \\ \vdots & \vdots & \ddots & \vdots \\ f'_{|P|1} & f'_{|P|2} & \cdots & f'_{|P||T|} \end{bmatrix}_{|P| \times |T|}$$

Each element $f'_{ij}$ of $F'$ represents the importance of blogpost $i$ under topic
$j$. We define $F'$ as the blogpost importance matrix.
Having defined the two importances, we now discuss the model that uses them to
solve the real problem.
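To make the two definitions concrete, the following sketch (an illustrative NumPy implementation under our reading of the definitions above, with a small made-up example of two posts, three users and two tags; it is not the thesis's production code) builds F and F' from the bookmark matrices R(p).

import numpy as np

def column_normalize(M):
    """CN: divide each column by its sum; all-zero columns are left as zero."""
    sums = M.sum(axis=0, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M, dtype=float), where=sums != 0)

def importance(M):
    """CN(M) * (1/NZC(M)) * 1: average over the users (columns) that participated."""
    nzc = np.count_nonzero(M.sum(axis=0))
    return column_normalize(M) @ np.full(M.shape[1], 1.0 / nzc)

def build_F_and_Fprime(R_list):
    """R_list[p] is the |T| x |U| bookmark matrix of post p."""
    F = np.column_stack([importance(R) for R in R_list])              # |T| x |P|
    num_tags = R_list[0].shape[0]
    Q = [np.vstack([R[t] for R in R_list]) for t in range(num_tags)]  # tag bookmarks
    F_prime = np.column_stack([importance(Qt) for Qt in Q])           # |P| x |T|
    return F, F_prime

# Hypothetical bookmarks: 2 tags x 3 users per post.
R_1 = np.array([[1, 1, 0], [0, 1, 1]])
R_2 = np.array([[0, 1, 1], [0, 0, 0]])
F, F_prime = build_F_and_Fprime([R_1, R_2])
print(F)        # column p = topic distribution of post p
print(F_prime)  # column t = post distribution of tag t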




3.2 Problem Model
The web surfing model, which is our major theoretical basis, is introduced in
Section 3.2.1. Based on this theory, we discuss in Section 3.2.2 how we use the two
importances to rank the search results.
3.2.1 Web Surfing Model
Web surfing is a model which assumes that users make a sequence of decisions to
proceed to another page, continuing as long as the value of the current page exceeds
some threshold; it yields the probability distribution of the number of pages, or depth,
that a user visits within a web site. Simply speaking, it is undirected web browsing:
while surfing, the user simply selects interesting links as they appear, obtaining new
information without any firm goal regarding the target information.
The web surfing model can be used to predict the probability that a user reads a
page. Calculating the expected number of visits to a page, starting the surf from any
page with the probability distribution of a web surfing model, researchers observed
that the higher the expectation, the higher the click rate the page tends to attract in
search results.
This model is designed partly to relieve the difficulty of finding relevant pages,
which is due to the impossibility of cataloguing an exponentially growing amount of
information in ways that anticipate users' needs [33][50][56]. Another reason for the
success of the web surfing model is the balkanization of the Internet structure [68].
The web surfing model makes it easier to seek regularities in user patterns and thus to
address this fragmentation problem by designing an effective and efficient
classification scheme.
In mathematics, a Markov chain [44][45][46][47] is used to represent the model,
which can be written as the matrix equation

$$X_{n+1} = P\,X_n$$

where $X_n$ is the set of all random variables at the $n$-th stage and $X_{n+1}$ is
the stage following $X_n$. $P$, a Markov transition matrix and a square matrix, is
formed from the probability distribution of each random variable changing into the
others. The equation has to satisfy the following:
For the equation to hold at every stage $n$, the probability distribution of
changing from any state to the next must never change.
$P$ is a stationary (equilibrium) distribution [7][9][16][34]: each random variable
has no possibility other than changing to one of the random variables, so the
probabilities leaving each state sum to 1.
Furthermore, in order to make the web surfing model predictable in expectation,
the random variables of the Markov chain must converge after some stage; that is,
there must be a constant $k$ such that for $n \ge k$, $X_{n+1} = X_n$. The
Perron-Frobenius theorem [5][20][23][30][47] gives the conditions on the matrix $P$
under which the Markov chain converges:
$P$ is a primitive matrix [67]. This means that a path exists from any point to any
other in the surfing model, that is, every page can certainly be reached no matter
which page the surf starts from.
$P$ is an irreducible matrix [22]. Therefore no dangling node can exist in the
surfing model; there is no page or group of pages from which the user can no longer
surf on to other web pages.
The web surfing model can also be extended beyond a single probability
distribution. We use Bayesian probability [2][3] to combine two or more of the above
probability distributions:

$$X_{n+1} = P(X)\,X_n$$

where

$$P(X) = P(A_1)\,P(X|A_1) + P(A_2)\,P(X|A_2) + \cdots + P(A_n)\,P(X|A_n)$$

and

$$P(A_1) + P(A_2) + \cdots + P(A_n) = 1$$

$P(A_i)$ is the probability that the $i$-th probability distribution occurs, and
$P(X|A_i)$ is the probability distribution that applies when it occurs.
The web surfing model was first realized by PageRank, as described in Section
2.1, which uses a link surfing model together with a direct surfing model. The link
surfing model is a Markov chain but does not necessarily converge; the direct surfing
model does. Therefore PageRank converges with the probability distribution that
combines the two models [52].
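A quick numeric check of this combination (a hypothetical NumPy sketch, not from the thesis) is that a convex combination of stochastic transition matrices is itself stochastic, so the combined chain still has a stationary distribution that iteration can find.

import numpy as np

# Two hypothetical column-stochastic transition matrices over 3 states
# (each column sums to 1): a "link surfing" part and a "direct surfing" part.
P_link = np.array([[0.0, 0.5, 1.0],
                   [0.5, 0.0, 0.0],
                   [0.5, 0.5, 0.0]])
P_direct = np.full((3, 3), 1.0 / 3.0)

# Bayesian combination P(X) = P(A1) P(X|A1) + P(A2) P(X|A2), weights summing to 1.
P = 0.85 * P_link + 0.15 * P_direct
print(P.sum(axis=0))        # every column still sums to 1

# Iterate X_{n+1} = P X_n until it stops changing: the stationary distribution.
X = np.full(3, 1.0 / 3.0)
for _ in range(200):
    X = P @ X
print(X, X.sum())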
3.2.2 Topic Surfing Model on Folksonomy
As mentioned in Section 3.1.2, when users visit a folksonomy site they first choose
a topic of concern and then find the posts they want in the list of posts under the
chosen topic. We design the topic surfing model, a web surfing model that follows
the user behavior described above, and obtain its probability distribution as a sum of
products of two independent events.
We use an example to illustrate how to calculate the probability distribution of the
topic surfing model. Assume there are three blogposts {A, B, C}, four users, and three
tags {travel, Europe, bike} on the folksonomy site. Using the definitions of Section
3.1.2, we can express the bookmark of each blogpost as follows:

$$R(A) = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix},\quad R(B) = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix},\quad R(C) = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

where the rows correspond to the tags travel, Europe and bike, and the columns to
the four users.


We then have the topic importance matrix $F$ and the blogpost importance matrix
$F'$ as follows:

$$F = \begin{bmatrix} 7/24 & 5/6 & 1/2 \\ 7/24 & 0 & 1/2 \\ 10/24 & 1/6 & 0 \end{bmatrix},\qquad F' = \begin{bmatrix} 7/18 & 5/6 & 5/6 \\ 7/18 & 0 & 1/6 \\ 2/9 & 1/6 & 0 \end{bmatrix}$$

where the columns of $F$ are the posts A, B, C (its rows are the tags travel, Europe,
bike), and the columns of $F'$ are the tags travel, Europe, bike (its rows are the posts
A, B, C).
Figure 6 illustrates this probability distribution:

Figure 6. An example of topic surfing model.

The left boxes in Figure 6 represent the post currently being read, the right boxes
the posts to be read next, and the circles the topics; the number on each edge is the
probability of passing through that edge, so the probability of surfing from one of the
left posts to one of the right posts is the total probability over all paths between them.
Each path must pass through exactly one topic and therefore has length 2, so the
probability of a path is the product of the probability of going from a post to a topic
and from that topic to a post. For example, there are two paths from blogpost A to C:
the one through the topic travel has probability
$\frac{7}{24} \times \frac{2}{9} = \frac{7}{108}$, and the one through the topic
Europe has probability $\frac{7}{24} \times \frac{1}{6} = \frac{7}{144}$, so the
total probability is $\frac{49}{432}$.
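The arithmetic can be checked mechanically; this small sketch (illustrative only) sums the two path probabilities with exact fractions.

from fractions import Fraction as Fr

via_travel = Fr(7, 24) * Fr(2, 9)    # path A -> travel -> C
via_europe = Fr(7, 24) * Fr(1, 6)    # path A -> Europe -> C
print(via_travel, via_europe, via_travel + via_europe)   # 7/108 7/144 49/432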
Therefore, the probability distribution over posts in the topic surfing model,
obtained through the topic importance and the blogpost importance, forms the
ranking formula:

$$X_{n+1} = F'\,F\,X_n$$

In our ranking formula, the original probability distribution of PageRank (defined in
Section 2.1) is retained and combined with the probability distribution of our model
by the Bayesian probability theorem, giving:

$$X_{n+1} = \alpha\,F'F\,X_n + \beta\,L^{T} X_n + \gamma\,\frac{1}{N}\,\mathbf{1}$$

where

$$\alpha + \beta + \gamma = 1$$

$L$ and $N$, as described in Section 2.1, are the matrix of probabilities of surfing
through the hyperlinks of the blogposts and the total number of blogposts;
$\alpha$, $\beta$, $\gamma$ are also called damping factors, that is, the probabilities
that users choose each way of surfing the web. Such a Markov chain is convergent, so
we can calculate the expectations of the posts being selected as the rank scores by
letting $X_{n+1} = X_n$ and solving the matrix equation. In Section 3.3, we describe
an efficient way to calculate it.
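Putting the pieces together, the following sketch (an illustrative implementation under the assumptions above; the link matrix L and the damping weights are made up, and it is not the thesis's production code) iterates the combined formula until convergence and returns the scores used for ranking.

import numpy as np

def folksonomy_blogrank(F, F_prime, L, alpha=0.6, beta=0.25, gamma=0.15,
                        tol=1e-12, max_iter=1000):
    """Iterate X_{n+1} = alpha*F'F*X_n + beta*L^T*X_n + gamma/N until X stops changing."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    N = L.shape[0]                      # number of blogposts
    topic_step = F_prime @ F            # |P| x |P| transition through topics
    X = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        X_next = alpha * topic_step @ X + beta * L.T @ X + gamma / N
        if np.abs(X_next - X).sum() < tol:
            break
        X = X_next
    return X_next

# Inputs for the three-post example of this section, plus a made-up
# row-stochastic link matrix L between the posts.
F = np.array([[7/24, 5/6, 1/2], [7/24, 0, 1/2], [10/24, 1/6, 0]])
F_prime = np.array([[7/18, 5/6, 5/6], [7/18, 0, 1/6], [2/9, 1/6, 0]])
L = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
scores = folksonomy_blogrank(F, F_prime, L)
print(np.argsort(-scores))   # post indices in rank order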


3.3 Algorithm
3.3.1 Procedure of Blog Search
(Figure 7 diagram: in the pre-processing process, the link graph and the topic graph are used to compute PageRank and Folksonomy BlogRank, which are combined into the ranks; in the query-time process, the keywords go through post match to produce the matched posts, which are then ranked into the search result.)

Figure 7. Procedure of Blog Search
As shown in Figure 7, the process of blog search is divided into two parts: rank
calculation as a pre-processing process and querying as a query-time process. The
pre-processing process runs independently of queries and updates its data regularly,
re-calculating the rank scores of the blogposts whenever it finds new blogposts on the
Internet or changes in the classification of blogposts on the folksonomy site. The
scores are obtained by combining PageRank and Folksonomy BlogRank as in Section
3.2.2, which requires combining the probability distributions of the link surfing and
topic surfing models. To shorten the calculation, we obtain the PageRank scores from
Google in advance, saving the matrix computation time of the PageRank part, so the
total score is a linear combination of the PageRank score and the Folksonomy
BlogRank score, whose algorithm is introduced in Section 3.3.2.
The query-time process lets users enter keywords as a search request and responds
with the result posts ranked by the scores calculated in the pre-processing process.
That is, when the search keywords arrive, the query-time process first uses the post
match algorithm to match all possible blogposts to the keywords, and then ranks these
posts in descending order of score. The algorithm is shown in Figure 8:

Figure 8. Pseudocode of Post Match Algorithm

The algorithm takes as input a set of keywords of size s from the user request and a set of tags of size T collected from the folksonomy site. First, the tags that are similar to one of the keywords are selected. Then, the blogposts tagged with at least one of the selected tags are matched, which requires querying a table collected from the folksonomy site; this table is three-dimensional, formed by tag, user, and post. To perform the matching efficiently and conveniently, we use extra space to store a two-dimensional table, so that we can quickly query whether a tag is attached to a post and vice versa.
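Since the pseudocode of Figure 8 is not reproduced in this text, the following Java sketch only illustrates the idea described above; the case-insensitive substring similarity test and the map names are our own placeholder choices, not the thesis's exact matching rule.

```java
import java.util.*;

public class PostMatch {

    /**
     * Returns the posts tagged with at least one tag similar to one of the keywords.
     * tagToPosts plays the role of the two-dimensional tag-post lookup table
     * described above; the substring test stands in for the tag similarity check.
     */
    static Set<String> matchPosts(List<String> keywords, Map<String, Set<String>> tagToPosts) {
        Set<String> matched = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : tagToPosts.entrySet()) {
            String tag = entry.getKey().toLowerCase();
            for (String keyword : keywords) {
                if (tag.contains(keyword.toLowerCase())) {
                    matched.addAll(entry.getValue());   // posts carrying a similar tag
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> tagToPosts = new HashMap<>();
        tagToPosts.put("europe-travel", new HashSet<>(Arrays.asList("postA", "postB")));
        tagToPosts.put("music", new HashSet<>(Collections.singletonList("postC")));
        // Prints the posts matched for the keyword "travel": postA and postB.
        System.out.println(matchPosts(Collections.singletonList("travel"), tagToPosts));
    }
}
```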

3.3.2 Folksonomy BlogRank Calculating

Figure 9. Pseudocode of Importance Calculation Algorithm
In this section, we implement the ranking formula of section 3.2.2 on the machine. Using a web spider, we obtain the link relation L between blogposts and the bookmark of each post as R(p). Following the findings of Golder and Huberman [22], we omit the bookmarks of posts whose number of participating taggers (the people who have attached at least one tag to the bookmark to classify the post) is fewer than 100, and use the remaining bookmarks to calculate the topic importance matrix and the blogpost importance matrix, as in the algorithm in Figure 9.
As defined in section 3.1.3, topic importance and blogpost importance can be calculated with two operations, column-normalizing and nonzero-column counting, on R(p) and Q(t). The folksonomy information we can obtain is a user-post-tag ternary relation. Such a relation turns out to be a sparse three-dimensional matrix, as observed in section 5.2.2, so we store it as a two-dimensional table F of size k*3. Here k, the number of rows, equals the number of elements with value 1 in the three-dimensional matrix but is far smaller than its full size (user number * post number * tag number); the three columns of each row hold the user, the post, and the tag, and each row records the fact that one tag was attached by one user to one post.
To calculate both the topic importance and the blogpost importance, we have to retrieve from table F the user counts of each topic or post. As shown in Figure 9, B is the table that stores this information. Subsequently, we calculate how many users participate in each topic or post. Finally, the topic importance matrix and the blogpost importance matrix can be calculated as the matrix C, where topics or posts with fewer than 100 participating users are omitted from the calculation to make the results more reliable.
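Since Figure 9 itself is not reproduced here and the exact definitions of R(p) and Q(t) are only given in section 3.1.3, the sketch below is one plausible reading of the counting and normalizing steps on the user-post-tag table: it keeps the tags with at least 100 distinct taggers and column-normalizes the per-tag post counts. The class and method names are our own, and only the per-tag (topic) side is shown; the per-post side is symmetric.

```java
import java.util.*;

public class ImportanceCalculation {

    /** One row of the k*3 folksonomy table: one user tagged one post with one tag. */
    static final class Row {
        final String user, post, tag;
        Row(String user, String post, String tag) {
            this.user = user; this.post = post; this.tag = tag;
        }
    }

    /**
     * For every tag with at least minUsers distinct taggers, returns a
     * column-normalized distribution over posts (tagging count of each post
     * divided by the tag's total tagging count).
     */
    static Map<String, Map<String, Double>> topicImportance(List<Row> rows, int minUsers) {
        Map<String, Set<String>> usersPerTag = new HashMap<>();
        Map<String, Map<String, Integer>> postCountPerTag = new HashMap<>();
        for (Row r : rows) {
            usersPerTag.computeIfAbsent(r.tag, t -> new HashSet<>()).add(r.user);
            postCountPerTag.computeIfAbsent(r.tag, t -> new HashMap<>())
                           .merge(r.post, 1, Integer::sum);
        }
        Map<String, Map<String, Double>> importance = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postCountPerTag.entrySet()) {
            if (usersPerTag.get(e.getKey()).size() < minUsers) continue;  // 100-user threshold
            double total = 0.0;
            for (int c : e.getValue().values()) total += c;
            Map<String, Double> column = new HashMap<>();
            for (Map.Entry<String, Integer> pc : e.getValue().entrySet())
                column.put(pc.getKey(), pc.getValue() / total);           // normalized column
            importance.put(e.getKey(), column);
        }
        return importance;
    }
}
```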
As shown in figure 10, multiplying the two importance matrices yields the matrix R needed to solve for A in the matrix equation A = RA. Several solution methods are available, such as the Power Method or an eigenvector solution. We choose the latter because the product matrix is non-sparse (see section 5.2.2), which makes the Power Method less efficient and causes it to spend more time converging to a stable state. After solving the equation to obtain the Folksonomy Rank scores, the final ranking score of each blogpost can be calculated by combining it with the corresponding PageRank score.

Figure 10. Pseudocode of Rank Calculation Algorithm
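As an illustration of the eigenvector route mentioned above, the sketch below finds the stationary vector of a column-stochastic matrix R by solving (I - R)A = 0 together with the constraint that the entries of A sum to 1, using Gaussian elimination. This is our own minimal formulation of an eigenvector solution for eigenvalue 1, not the exact procedure of Figure 10.

```java
public class EigenvectorSolution {

    /**
     * Solves A = R * A for an irreducible column-stochastic matrix R by replacing
     * the last equation of (I - R) A = 0 with sum(A) = 1 and running Gaussian
     * elimination with partial pivoting on the augmented system.
     */
    static double[] stationaryVector(double[][] r) {
        int n = r.length;
        double[][] m = new double[n][n + 1];             // augmented matrix [I - R | b]
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = (i == j ? 1.0 : 0.0) - r[i][j];
        for (int j = 0; j < n; j++) m[n - 1][j] = 1.0;    // last row: sum of entries = 1
        m[n - 1][n] = 1.0;

        for (int col = 0; col < n; col++) {               // forward elimination
            int pivot = col;
            for (int row = col + 1; row < n; row++)
                if (Math.abs(m[row][col]) > Math.abs(m[pivot][col])) pivot = row;
            double[] tmp = m[col]; m[col] = m[pivot]; m[pivot] = tmp;
            for (int row = col + 1; row < n; row++) {
                double factor = m[row][col] / m[col][col];
                for (int k = col; k <= n; k++) m[row][k] -= factor * m[col][k];
            }
        }
        double[] a = new double[n];                       // back substitution
        for (int row = n - 1; row >= 0; row--) {
            double sum = m[row][n];
            for (int k = row + 1; k < n; k++) sum -= m[row][k] * a[k];
            a[row] = sum / m[row][row];
        }
        return a;
    }
}
```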
Complexity Analysis
The highest time complexity in the whole Rank Calculation Algorithm occurs both in the product of F' and F and in solving the matrix equation: the former is O(P^2 * T), determined by the complexity of matrix multiplication, while the latter is O(P^3). The total complexity of the algorithms is therefore O(P^2 * max(T, P)).
Performance Analysis
We believe that our algorithms produce better search results. We have carried out a series of experiments to evaluate the performance, and the results are presented in Chapter 5.


Chapter 4 System Implementation

4.1 Implementation environment

4.1.1 Hardware and software platforms
Our system is a Web application. Our resources and the environment are as
follows:
1. Hardware: IBM Personal Computer
2. Test OS: Windows XP
3. Network: WWW
The software platform constructed on the IBM computer provides a search interface that returns ranked results. In addition, a backend program crawls the folksonomy information from Delicious and pre-computes the rank of each post:
1. Web Server and Database: AppServ 2.5.8 (Apache 2.2 + PHP 5 + MySQL)
2. Backend application: Java VM
The client browsers that can be used to browse the search interface include Microsoft Windows Internet Explorer and Firefox, etc. Note that the browser has to support executing JavaScript.

4.1.2 Implementation languages and tools
The structure of our system is like that of a general search engine. The web user interface is developed with AJAX (Asynchronous JavaScript and XML) technology and the PHP language. AJAX provides a good interaction mode between the user and the server and speeds up information transmission on the web. PHP is a popular dynamic HTML language, which allows us to calculate the ranks of blogposts hidden in the back-end.
To collect the information for the back-end database, we use Java SE to develop our web crawler and blogpost ranking calculator. The web crawler crawls a large number of blogposts from the WWW and the folksonomy site.







4.2 System architecture

4.2.1 High-level system design and analysis

Figure 11. The architecture of our system.
The dataset of our system, the blogpost information, comes from two sources: the folksonomy site, which provides the blogpost collection with URLs and bookmarks used to compute the Folksonomy BlogRank score, and Google, which provides the PageRank of each blogpost.
The blogpost information, as the basis of our ranking, can be divided into URL and folksonomy information on the one hand and rank information on the other. The former is used for blogpost filtering and addressing; the latter is the additional calculation that combines the Folksonomy BlogRank score and the PageRank score for ranking.
The procedure of our server is divided into three layers:
User Interface: The main function of our system is to let users query for blogposts. When a user enters keywords as search criteria, the request is submitted by the User Interface to the Query Time Process, which returns the ranked list of posts; the User Interface then presents the returned result to the user.
Query Time Process: The query operation is divided into filtering and ranking. The purpose of filtering is to satisfy the user's demand for complete results, so it only needs to remove posts that are far from the search keywords; the purpose of ranking is the opposite, keeping the most satisfying posts at the front of the result list. The methods of filtering and ranking are described in detail in sections 3.2 and 3.3.
Data Pre-processing Process: The Data Pre-processing Process collects blogposts and computes their scores. It is executed periodically to update the latest blog and folksonomy information and to re-calculate the scores of blogposts:

The Bookmark Crawler updates bookmark changes, that is, changes in the tagging counts, for the blogposts kept in the system, and then identifies and saves, under the corresponding tags, the blogposts not yet included in the blog and folksonomy information database.
The PageRank Query updates the PageRank of each blogpost, which is obtained from Google rather than computed by our system; as a result, we save space and computing time, and through Google's larger database the blogposts obtain more accurate PageRank data.
The Rank Calculator recalculates the scores after each update of blogpost information; the method is shown in section 3.2.
4.2.2 Low-level system design and analysis
Considering processing-time performance and functionality, and the need for client, server, and back-end execution, our server implementation is divided into two major parts, introduced in the following sub-sections. Section 4.2.2.1 introduces the web application program, including the Web UI and the Query Time Process. Section 4.2.2.2 highlights the back-end program for the Data Pre-processing Process. The last sub-section presents the structure of our database.
4.2.2.1 Web Application
The web application develops the Web UI with AJAX [19], an extension of JavaScript that strengthens its ability to communicate with the server. JavaScript is a dynamic web-page language whose characteristic is that it provides code executed on the client side, making the page interact with the user like a client program. However, the weakness of JavaScript is that it cannot interact with the server by itself. AJAX keeps the advantage of a dynamic web-page language, namely distributing most of the computation to the client, but interacts with the server through the httpRequest call, which can ask for a hypertext document created by an active-page technology such as PHP, ASP, or JSP. At the same time, AJAX requests the hypertext in XML format, which lets the server return data that is extensible and readable by any heterogeneous computer, and lets the client use DOM techniques to update the hypertext partially and so achieve asynchronous presentation. In summary, AJAX is a web technique combining dynamic and active pages. The interaction through httpRequest is shown in Figure 12.

Figure 12. Flow of httpRequest in AJAX

The user interface is implemented in HTML with JavaScript and provides both Folksonomy BlogRank Search and Google Blog Search for comparison. Therefore, two AJAX httpRequests are sent at the same time. We use the Google AJAX API to access Google Blog Search, and an AJAX call-and-receive function together with PHP to implement the client side of Folksonomy BlogRank Search.

Figure 13. Class Diagram of Web Application
As shown in Figure 13, when the index page of our system is loaded, search.js is loaded and its JavaScript functions are executed, including OnLoad, MyCompleteHandler, View_result, which is in charge of obtaining the search results from Google Blog Search and displaying them on the page, and get_rank, which is our AJAX call-and-receive function shown in Figure 14. rankSearch.php implements the pseudocode of Figure 8 in section 3.3.1, calling getTagList, getTop10Post, and printXML in turn. The first two functions filter the tags and posts from the database, while printXML returns the results to the client in XML format. When the server returns the XML file to the function get_rank, get_rank calls the function listFolkResult to present the search results on the interface.


Figure 14. Code of function get_rank()







4.2.2.2 Backend Application
We used Java as the development language for the backend operation of the Data Pre-processing Process. Java is a good network programming language, with many convenient APIs and safe system calls for network transmission and related tasks.

Figure 15. Class Diagram of Backend Application
Figure 15 shows the six major classes of our backend application program.

Class BlogCrawlerWorkbench, the interface of the crawler, runs Class FRCrawler and Class PageInfoCrawler as implementations of the interface Crawler, which requires two functions, shouldVisit and visit, that determine respectively whether a page should be crawled and how to process the crawled page.
Class FRCrawler crawls the entire subtree of a folksonomy site such as Delicious for the topic classification by tag, used to compute the Folksonomy BlogRank score, and Class PageInfoCrawler crawls the information of the blogposts discovered by the former.
Class Link is a data-structure class holding the information of a web page, such as its URL, XTree, or content; it is saved before shouldVisit so that shouldVisit can decide whether to process this Link.
Class Page stores the XTree for a Link; by reading nodes and attributes from Page through a DOM-like API, the crawler can handle the page.
Class TagLink is a specialized Class Link that stores only the information we need from the folksonomy site, namely the tag, user, and post information, rather than an entire Page. After being captured by Class FolkCrawler, each TagLink is stored into the database.

4.2.2.3 Database

Figure 16. Fields of Database
There are five tables in the database, as shown in Figure 16. Since folksonomy is a tripartite relation over user, tag, and post, our database creates tables not only for these three entities but also for the relation itself, which is stored in three fields, because storing it as a multidimensional matrix would waste a great deal of space. In addition, we build a pid_tid table so that the post match algorithm can index all posts tagged with the given tags.

4.3 System demo

We have developed a web application for blog search, as shown in Figure 17. When a user enters keywords to search, the system returns results ranked by both ranking algorithms.

Figure 17. System Screenshot of Web Application
Figure 18 shows our backend application program using the web crawler to crawl the pages containing tag information on the folksonomy site. As shown in the bottom-left corner of the figure, the information is captured link by link: a text icon means a page visited successfully, a green point means a page currently being visited, and a red cross means the page could not be visited. After crawling all visitable pages, the application saves all the tagged blogposts retrieved from the pages.


Figure 18. System Screenshot of FRCrawler
After capturing all the tagged blogposts from the folksonomy site, we then crawl these posts by URL to obtain detailed information such as title, author, published date, etc. The left side of Figure 19 shows the system preparing to crawl. Once all the pages to be crawled have been entered, one per line, into the Start URLs text area and the start button is pressed, the page information crawler runs as shown on the right of Figure 19.


Figure 19. System Screenshot of BlogInfo Crawler







Chapter 5 Experiment and Discussion
In this chapter, we present experimental results to demonstrate our system's performance. We first describe our experiment design with a scenario, and then discuss the system's performance through quantitative evaluation and qualitative evaluation. To show that our approach can rank most of the blogposts, we analyze the information used for ranking; to show that the algorithms and system we designed achieve higher precision in the search results, we show that our approach obtains a better appreciation from a group of users.
5.1 Experiment design and setup
5.1.1 Experiment scenario
Blog search is widely used on the WWW. We give a scenario to describe the application of blog search.
Scenario
Tony is a college student who is used to taking advantage of the Internet as a study aid. One day, he wanted to read some articles discussing Google, so he went to a blog search engine to search. Although many posts could be found, most of them were news items or articles without deeper discussion of the topic; interesting and high-quality posts were still hard to find. Finally, he learned about our system and performed the same search through it. Our system returned to Tony the posts that a large number of web users consider most relevant to the topic, so he could read posts with rich and novel content for his study.
5.1.2 Roles, hardware, software, and network
requirements setup
To experiment on the effectiveness and precision of our method, we use our system together with the hardware and software of our experiment environment, which are used to obtain data for analysis and to add functions that record information about user behavior for statistics.
In the precision experiments, we take web users without specific identification as the subjects. We find people who regularly access the WWW via web search engines and ask them to use our system; through it, they obtain search results produced by different kinds of search technology. Over a span of several days, we compute their hit rates on the search results, investigate how they feel about the results, and listen to their opinions.




5.2 Quantitative evaluation
5.2.1 Effectiveness
Effectiveness here means the degree to which the score of a post is influenced by the other posts. If a post is not influenced by any other post, we call it an isolated point. As shown in Figure 3 in section 2.1, if most of the posts are isolated points, the ranking is no longer useful. Furthermore, the more posts a post's score can be influenced by, the more credible the prestige the post obtains through other posts.
In this section we measure the effectiveness of the ranking method by investigating the density of the matrix of the probability distribution of the topic surfing model used in the Folksonomy BlogRank formula, that is, the ratio of non-zero elements in the matrix. The higher the density, the more blogposts can be ranked by our method.

Table 6. Statistics of the blogpost-topic Graph
                  | Nodes: blogpost | Nodes: topic | Edges: blogpost importance links | Edges: topic importance links
Number            | 6,122           | 35,510       | 114,358                          | 140,696
Density of Matrix | -               | -            | 0.0000526                        | 0.0000647

Table 7. Statistics of the blogpost Graph
                  | Nodes: blogpost | Nodes: topic | Edges: topic links between blogposts
Number            | 6,122           | 35,510       | 16,390,406
Density of Matrix | -               | -            | 0.437
As shown in Table 6, both the topic importance matrix and the blogpost importance matrix are sparse, so a topic or a blogpost has few candidates to choose from on average. In other words, according to the densities of the matrices, only about 1.868 topics (35,510 * 0.0000526) can be chosen from each post, and only about 0.395 posts (6,122 * 0.0000647) can be chosen from each topic, on average.
However, looking at the statistical results in Table 7, about 2,675 posts (6,122 * 0.437) can be chosen by a user from each post. This result is far higher than the statistics of Kritikopoulos, where only 0.27 posts on average could be chosen when surfing through hyperlinks. Therefore, the few-link problem can be greatly improved.




5.2.2 Precision
We validate our search results by improving the Success Index metric [40], which records the user's actions and the order of hits on the result list when reading posts; the Success Index reflecting the user's response is then calculated as follows:

$$SI = \frac{1}{n} \sum_{t=1}^{n} \frac{n - t + 1}{n \cdot d_t}$$

Values of SI (Success Index) range from 0 to 1, where n is the total number of hits by a user and $d_t$ is the rank of the t-th post hit. The earlier a highly ranked post is hit, the greater the reward to the SI value; for example, when a user hits the posts (clicks on the ranked result list) in the order 2-1-3-4 or 4-3-1-2 (each number being the rank of the hit post), the former SI value (about 0.340) is better than the latter (0.281).
The Success Index metric has the advantage that no satisfaction vote from the users is needed to obtain the rate of successful search; its disadvantage, on the contrary, is that it cannot cope with an inconsistency between the order of hits by the subjects and the true ranking in their minds. Therefore, we still need to measure the degree of satisfaction and order the posts by satisfaction, and we modify the formula to become:

$$SSI = \frac{1}{n} \sum_{t=1}^{n} \frac{rs_t}{n \cdot d_t}$$
We call the new score the Success Satisfaction Index (SSI) score. The only difference from SI is that the hit order t, which is no longer considered, is replaced by $rs_t$, the order of the hit post when the posts are ordered by satisfaction. The satisfaction order runs from low to high, and each post receives an $rs_t$ starting from 1 and increasing progressively, order by order. When the satisfaction of one post equals that of a preceding post, its $rs_t$ is the same as that post's. As an exception, when the satisfaction of a post is 0, then $rs_t = 0$.
Take Table 8 for example. The order of satisfaction in case 1 is 3 < 2 = 4 < 1 and one satisfaction is 0, so $rs_t$ is 0, 1, 1, 2 in this order. In case 2 there is no 0 satisfaction, so the order is 4 < 3 < 2 < 1 and $rs_t$ is 1, 2, 3, 4 in order. The SSI scores of both cases are shown in the last column.
Table 8. Examples of SSI score
       |              | Rank 1 | Rank 2 | Rank 3 | Rank 4 | SSI score
Case 1 | Satisfaction | 10     | 8      | 0      | 8      | 0.302
Case 2 | Satisfaction | 10     | 6      | 2      | 1      | 0.401
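To make the two metrics concrete, the sketch below computes SI from a sequence of hit ranks and SSI from satisfaction orders, using the formulas as reconstructed above; the $rs_t$ values are assumed to be already assigned according to the tie and zero rules described in the text.

```java
public class SuccessIndex {

    /** SI = (1/n) * sum over hits t of (n - t + 1) / (n * d_t), where d_t is the rank of the t-th hit. */
    static double si(int[] hitRanks) {
        int n = hitRanks.length;
        double sum = 0.0;
        for (int t = 1; t <= n; t++)
            sum += (double) (n - t + 1) / (n * hitRanks[t - 1]);
        return sum / n;
    }

    /** SSI = (1/n) * sum over posts t of rs_t / (n * d_t). */
    static double ssi(int[] rs, int[] ranks) {
        int n = ranks.length;
        double sum = 0.0;
        for (int t = 0; t < n; t++)
            sum += (double) rs[t] / (n * ranks[t]);
        return sum / n;
    }

    public static void main(String[] args) {
        // Hitting the results in rank order 4-3-1-2 gives about 0.281, as in the text.
        System.out.println(si(new int[]{4, 3, 1, 2}));
        // Case 2 of Table 8: rs = 4, 3, 2, 1 for the posts ranked 1..4 gives about 0.401.
        System.out.println(ssi(new int[]{4, 3, 2, 1}, new int[]{1, 2, 3, 4}));
    }
}
```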
We gathered statistics on the usage of the many subjects invited to use our system to search for blogposts from April 2008 to June 2008. While they were using the system, in order to eliminate any bias toward a particular ranking algorithm, the search interface of the experiment differed from the original system we developed: after a user's query request, the system returned only one result list, ranked by a method selected fairly and randomly from among Folksonomy BlogRank, PageRank, and Random-rank. This design constitutes a double-blind test, in which neither the subjects nor we know which method was selected, either before or after the request; we only obtain the statistics of the number of requests and the average SSI of each method.
As shown in figure 20, for each search the experiment system asked the user, under each post, whether that post was what he expected, and also asked him to score the post from 0 to 9 according to his degree of interest, from low to high. The system then recorded the score together with the number and rank of the post, the search keywords, and the search algorithm used that time. We asked the subjects to score all of the top 10 posts of the search result, so that the average SSI score could be used to compare the algorithms.

Figure 20. Screenshot of Search Result with Satisfaction Question
Table 9 reveals that our method has a higher average SSI, and the T-test values of both PageRank and Random rank (ranking randomly) against our method are far smaller than the significance level, which clearly shows that our Folksonomy BlogRank raises the users' satisfaction and the precision of the search results.
Total subjects: 72
Total search requests with satisfaction answers: 521
Significance level in T-test: 0.01
Table 9. Average SSI Score
Ranking Algorithm   | Number of Queries | Average Satisfaction Success Index | T-test (with Folksonomy BlogRank)
PageRank            | 203               | 0.395                              | 0.008
Random              | 180               | 0.148                              | 0.001
Folksonomy BlogRank | 138               | 0.503                              | -

5.2.3 Results and lesson learned
We can take the average SSI score calculated in section 5.2.2 as the damping factor in the Folksonomy BlogRank formula. Besides, observing the result in section 5.2.1, we can easily see that the number of topics used is far larger than the number of blogposts tagged on the folksonomy site, which shows that matching the keywords against the tags in order to find the posts that match the keywords is time-consuming. We can improve this by clustering the tags.




Chapter 6 Conclusion and Future Work
Our study proposes a topic surfing model, which calculates rank scores from the probability distribution of a user surfing randomly on a folksonomy site, to improve the precision of ranking in blog search. We designed the Folksonomy BlogRank algorithm and carried out experiments to evaluate it. Through the experiments, we found that not only the precision but also the few-link problem can be improved by our algorithm.
However, the method requires more time and space to calculate the ranks than PageRank. The seriousness of the higher time complexity is mitigated by the fact that the method need not be executed frequently; the waste of space, however, increases not only the execution time but also the load on the database. Our observation in the matrix density experiment suggests that we may reduce the number of topics that need to be matched against the keywords. Furthermore, we may reduce the number of tags in all calculations, because some of the tags are synonyms.
In our related work, we mentioned another issue in blog search, namely personalized search, which is not discussed in our study. We also pointed out in the same chapter that folksonomy can achieve a diverse and bias-including classification, and it would be helpful to personalized search if we analyzed which posts are required by people with different biases under such a classification. From this point of view, improving our method to satisfy personalized search will be an interesting study in the future.

References
[1] Bateman, S., Brooks, C., & McCalla, G. (2006). Collaborative Tagging
Approaches for Ontological Metadata in Adaptive E-Learning Systems. The
Proceedings of the Fourth International Workshop on Applications of Semantic
Web Technologies for E-Learning, 3-12.
[2] Bayes, T. (1763/1958). Studies in the History of Probability and Statistics:
IX. Thomas Bayes' Essay Towards Solving a Problem in the Doctrine of
Chances. Biometrika, 45, 296-315.
[3] Bayes, T. (1763). An essay towards solving a Problem in the Doctrine of
Chances. Philosophical Transactions of the Royal Society, 53, 370-418.
[4] Berendt, B., & Hanser, C. (2007). Tags are not Metadata, but Just More
Content to Some People. ICWSM.
[5] Berman, A., & Plemmons, R. J. (1979). Nonnegative Matrices in the
Mathematical Sciences. Academic Press, 2
[6] Bharat, K., & Mihaila, G. A. (2000). Hilltop: A search engine based on
expert documents. In the WWW9 Conference, Amsterdam, 15-19.
[7] Breiman, L. (1968). Probability. Addison-Wesley.
[8] Brooks, C. H., & Montanez, N. (2005). An analysis of the effectiveness of
tagging in blogs. Proceedings of the 2005 AAAI Spring Symposium on
Computational Approaches to Analyzing Weblogs.
[9] Booth, T. L. (1967). Sequential Machines and Automata Theory. John Wiley
and Sons.
[10] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R.,
Tomkins, A., & Wiener, J. (2000). Graph Structure in the Web. In the 9th
International World Wide Web Conference.
[11] Cattuto, C., Loreto, V., & Pietronero, L. (2006). Collaborative Tagging and
Semiotic Dynamics. Proceedings of the 2nd Workshop on Scripting for the
Semantic Web.
[12] Cattuto, C., Loreto, V., & Pietronero, L. (2006). Semiotic dynamics in online
social communities. The European Physical Journal C, 33-37.
[13] Chakrabarti, S. (2002). Mining the Web: Discovering Knowledge from
Hypertext Data. Morgan-Kaufmann Publishers.
[14] Chi, Y. L. (2006). Applying Knowledge Acquisition and Knowledge
Representation Synergy to Construct Ontology Conceptual Structures. Journal
of Information Management 13(2), 193-215.
[15] Damianos, L. E., Cuomo, D., Griffith, J., Hirst, D. M., & Smallwood, J.
(2007). Exploring the Adoption, Utility, and Social Influences of Social

Bookmarking in a Corporate Environment. Proceedings of the 40th Hawaii
International Conference on System Sciences.
[16] Doob, J. L. (1953). Stochastic Processes. John Wiley and Sons.
[17] Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., & Tomkins,
A. (2006). Visualizing tags over time. 15th International World Wide Web
Conference, 193-202.
[18] Fujimura, K., Inoue, T., & Sugisaki, M. (2005). The eigenrumor algorithm
for ranking blogs. In 2nd Workshop on the Weblogging Ecosystem, at WWW
2005.
[19] Garrett, J. (2005). Ajax: A New Approach to Web Applications.
http://www.adaptivepath.com/publications/essays/archives/000385.php
[20] Godsil, C., & Royle, G. (2001). Algebraic Graph Theory. Springer, 8.
[21] Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using
collaborative filtering to weave an information tapestry. Communications of the
ACM, 35(12), 61-70.
[22] Golder, S., & Huberman, B. A. (2006). Usage patterns of collaborative
tagging systems. Journal of Information Science 32, 198-208.
[23] Gradshteyn, I. S., & Ryzhik, I. M. (2007). Tables of Integrals, Series, and
Products, Academic Press. 1103-2000.
[24] Graham, A. (1987). Nonnegative Matrices and Applicable Topics in Linear
Algebra. John Wiley&Sons.
[25] Hassan-Montero, Y., & Herrero-Solana, V. (2006). Improving tag-clouds as
visual information retrieval interfaces. International Conference on
Multidisciplinary Information Sciences and Technologies, 25-28.
[26] Haveliwala, T. H. (2002). Topic-sensitive PageRank. Proceedings of the
11th international conference on World Wide Web. 517-526.
[27] Haveliwala, T. H. (2003). Topic-Sensitive PageRank: A Context-Sensitive
Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and
Data Engineering, 15(4), 784-796.
[28] Hayes, C., Avesani, P., & Veeramachaneni, S. (2006). An analysis of
bloggers and topics for a blog recommender system. Workshop on Web Mining
(7).
[29] Hayes, C., & Avesani, P. (2007). Using Tags and Clustering to Identify
Topic-Relevant Blogs. http://www.icwsm.org/papers/2--Hayes-Avesani.pdf.
[30] Hayes, C., Avesani, P., & Veeramachaneni, S. (2007). An analysis of the use
of tags in a blog recommender system. The International Joint Conference on
Artificial Intelligence, 2772-2777.

[31] Horn, R. A., & Johnson, C.R. (1990). Matrix Analysis. Cambridge
University Press
[32] Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006). BibSonomy: A
Social Bookmark and Publication Sharing System. In A. de Moor, S. Polovina,
and H. Delugach, editors, Proceedings of the Conceptual Structures Tool
Interoperability Workshop at the 14th International Conference on Conceptual
Structures.
[33] Huberman, B. A., Pirolli, P. L. T., Pitkow, J. E., & Lukose R. M. (1998).
Strong regularities in world wide web surfing. Science, 280.
[34] Johnson, S. (2005). Everything Bad is Good for You: How Today's Popular
Culture Is Actually Making Us Smarter. Journal of Popular Culture, 39(6).
1104-1106.
[35] Jones, K. S. (1972). A statistical interpretation of term specificity and its
application in retrieval. Journal of Documentation, 28(1), 11-21.
[36] Kemeny, J. G., Mirkil, H., Snell, J. L., & Thompson, G. L. (1959). Finite
Mathematical Structures. Prentice-Hall.
[37] Kipp, M., & Campbell, G. (2006). Patterns and inconsistencies in
collaborative tagging systems: An examination of tagging practices.
Proceedings of the ASIST Annual Meeting.
[38] Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., &
Riedl, J. (1997).GroupLens: Applying Collaborative Filtering to Usenet News.
Communications of the ACM, 40(3), 77-87.
[39] Kritikopoulos, A., Sideri, M., & Varlamis, I. (2006). BlogRank: ranking
weblogs based on connectivity and similarity features. In Proceedings of the
2nd international Workshop on Advanced Architectures and Algorithms For
internet Delivery and Applications (Pisa, Italy, October 10 - 10, 2006).
AAA-IDEA '06, vol. 198. ACM, New York, NY, 8.
[40] Kritikopoulos A., Sideri M., Varlamis I. (2007). Success Index: Measuring
the efficiency of search engines using implicit user feedback. In the 11th
Pan-Hellenic Conference on Informatics, Special Session on Web Search and
Mining.
[41] Kurland, O. & Lee, L. (2005). PageRank without hyperlinks: structural
re-ranking using links induced by language models. In Proceedings of the 28th
Annual international ACM SIGIR Conference on Research and Development in
information Retrieval (Salvador, Brazil, August 15 - 19, 2005). SIGIR '05.
ACM, New York, NY, 306-313.
[42] Lambiotte, R., & Ausloos M. (2005). Collaborative tagging as a tripartite
network. http://arxiv.org/abs/cs.DS/0512090.

[43] Macgregor, G., & McCulloch, E. (2006). Collaborative tagging as a
knowledge organisation and resource discovery tool. Library Review (55),
291-300.
[44] Markov, A. A. (1971). Extension of the limit theorems of probability theory
to a sum of variables connected in a chain. R. Howard. Dynamic Probabilistic
Systems, 1. John Wiley and Sons.
[45] Markov, A. A. (1906). Rasprostranenie zakona bol'shih chisel na velichiny,
zavisyaschie drug ot druga. Izvestiya Fiziko-matematicheskogo obschestva pri
Kazanskom universitete, 2-ya seriya, 15. 135-156.
[46] Meyn, S. P., & Tweedie, R. L. (1993). Markov Chains and Stochastic
Stability. Cambridge University Press.
[47] Meyn, S. P. (2007). Control Techniques for Complex Networks. Cambridge
University Press.
[48] Millen, D. R., Feinberg, J., Kerr, B. (2006), Dogear: Social Bookmarking in
the Enterprise. Proceedings of the SIGCHI conference on Human Factors in
computing systems, 111-120.
[49] Minc, H. (1988). Nonnegative matrices, John Wiley&Sons
[50] Newfield, D., Sethi, B. S., & Ryall, K. (1998). Scratchpad: mechanisms for
better navigation in directed Web searching. Proceedings of the 11th annual
ACM symposium on User interface software and technology, 1-8.
[51] Orlowski, A. (2003). Anti-war slogan coined, repurposed and
Googlewashed ... in 42 days.
http://www.theregister.co.uk/2003/04/03/antiwar_slogan_coined_repurposed/.
[52] Page, L., Brin, S., Motwani R., & Winograd, T. (1998). The PageRank
Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries
Working Paper.
[53] Paolillo, J. C., & Penumarthy S. (2007). The Social Structure of Tagging
Internet Video on del.icio.us. Proceedings of the 40th Hawaii International
Conference on System Sciences.
[54] Price, G. (2005). Google and Google Bombing Now Included New Oxford
American Dictionary. http://blog.searchenginewatch.com/blog/050516-184202.
[55] Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994).
GroupLens: an open architecture for collaborative filtering of netnews.
Proceedings of the 1994 ACM conference on Computer supported cooperative
work, 175-186.
[56] Richardson, M. & Domingos, P. (2002). The intelligent surfer: Probabilistic
combination of link and content information in pagerank. In Advances in
Neural Information Processing Systems 14.

[57] Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic
text retrieval. Information Processing & Management, 24(5), 513-523.
[58] Salton, G., Fox, E. A., & Wu, H. (1983). Extended Boolean information
retrieval. Communications of the ACM, 26(11), 1022-1036.
[59] Salton, G., & McGill, M. J. (1983). Introduction to modern information
retrieval. McGraw-Hill.
[60] Sarwar, B., Karypis, G., Konstan, J., & Reidl, J. (2001). Item-based
collaborative filtering recommendation algorithms, Proceedings of the 10th
international conference on World Wide Web, 285-295.
[61] Scoble, R., & Israel, S. (2006). Naked Conversations: How Blogs are
Changing the Way Businesses Talk with Customers. Wiley & Sons.
[62] Sifry, D. (2006). State of the blogosphere: Part 1 on blogosphere growth,
from http://technorati.com/weblog/2006/04/96.html.
[63] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its
application to retrieval. Journal of Documentation, 28(1), 11-20.
[64] Surowiecki, J. (2005). The Wisdom of Crowds. American Journal of Physics,
75(2). 190-192.
[65] Tapscott, D., & Williams, A. D. (2006). Wikinomics, How Mass
Collaboration Changes Everything. Portfolio Hardcover.
[66] Trant, J. (2006). Exploring the potential for social tagging and folksonomy in
art museums: Proof of concept. New Review of Hypermedia and Multimedia
12(1), 83-105.
[67] Tseng, B., Tatemura, J., & Wu., Y. (2005). Tomographic clustering to
visualize blog communities as mountain views. In WWW 2005 Workshop on
the Weblogging Ecosystem.
[68] van Alstyne, M. (1996). Could the Internet Balkanize. Science, 274(5292) ,
1479-1480.
[69] Varga, R. S. (1962). Matrix Iterative Analysis. Englewood Cliffs, NJ:
Prentice-Hall.
[70] Zeller, T. Jr. (2006). A New Campaign Tactic: Manipulating Google Data.
The New York Times, 26 October 2006. 20.
Web Pages:
[71] Delicious Website in Delicious from http://del.icio.us
[72] Funp Blog Search website in Funp from http://funp.com/blog/search.php
[73] Google Blog Search website in Google from http://blogsearch.google.com.tw
