Figure 1: Proposed Dataflow
On the other hand, the proposed framework provides complete transparency to the user. Users are aware of the information sent to the WSE. Most importantly, as it is done through a proxy server, the information sent to the server can be checked by the user. Moreover, the proxy server can share the information with negligible overhead.

3. PROPOSED FRAMEWORK
In almost all organizations, users have access to the Internet through a proxy server. A proxy server receives URL requests from users and in turn submits the requests to the actual target Web server. For a Web search engine, a proxy server therefore has the capacity to add the additional information required by the engine to optimize the search results. Figure 1 shows an abstract-level data flow of the proposed framework. Broadly, the proposed framework has three modules, namely the proxy server, the information miner and the client-side activities database.

The client-side activities database stores the following information about the activities performed by individual users: the list of visited URLs and their contents, the list of non-Web documents read by the user on the desktop and their contents, the queries requested and the corresponding clicks, and the time at which each activity is performed. We design and implement two customized tools, i.e., a modified proxy server and a time tracker, to gather the above information. We discuss these tools in Sections 3.1 and 3.2.

The information miner is the module which processes the database and extracts information which may be useful to WSEs to determine what the user wants and to customize the search results accordingly. In this paper, for experimental understanding, we focus on two kinds of information only: (a) the class distribution of the documents visited by the user in recent time and (b) informative terms representing the documents visited by the user in recent time. However, the framework can be extended to share other detailed information as well. We discuss the procedure in Section 5.3.

The modified proxy server is the module which receives search requests from users and adds (if the user so wishes) additional information such as the class distribution and the informative term list. This module interacts with the information miner module to get the required information. The details of the implementation are presented in Section 3.1.

Figure 2: Design and implementation of the modified proxy server and time tracker

3.1 Modified Proxy Server
The proxy server is designed based on the protocols defined in RFC 2616 and implemented using the socket programming concepts of Java. The proxy server can be configured to listen on any available port and can be used by different client machines at the same time. In order to use it as a proxy server, one has to change the proxy setting, i.e. the IP and port, of the Web browser to the IP and port of the machine where the proxy server is running. The proxy server uses persistent TCP connections between the proxy and the host and target, and to speed up the processing of requests it also accepts multiple requests at a time and uses the pipelining concept; pipelined requests are served in the same order in which they are received.

Since all the inward and outward traffic of the Web browser goes through the proxy server, it stores the URL and referrer information along with the client machine's IP address and port at request time, and the Web page content, encoding type, etc. at response time. As several unnecessary URLs are dynamically generated by the requested page, such as ad URLs, the proposed proxy filters them out by checking the response content type. In the current implementation, we consider only text/html/pdf Web content.

3.1.1 Why a Proxy Server
A user's Web activities can also be tracked from the Web browser's history database. This method has two drawbacks: 1) we found that the history database gets updated with a delay of 1-4 minutes; 2) the visited URLs need to be crawled once again for their Web content, which leads to bandwidth wastage. A customized proxy server overcomes both drawbacks.

3.2 Time Tracker
The desktop activities time tracker is one of the most crucial modules. It not only captures the documents that the user opens on the desktop, but also keeps track of each window's active lifetime. It enables us to monitor how long a user spends reading a particular document. It was built on top of a UNIX utility tool called wmctrl. The time tracker continuously gets the current active application name and window title from wmctrl; based on this information it keeps checking the time the user spends on a particular document. In our implementation we tracked the time the user spends on Web pages and on PDF and Word documents, as these are the main sources of information. The model assumes that if a window is active for more than 30 seconds, it has been read by the user.

4. MONITORING USER'S ACTIVITIES
What is the class label of the document that the user wants to visit? What are the terms that can represent the user's interest? Does the user submit a query while reading a document, while writing or reading an e-mail, or while exploring a social network? Such information can play an important role in determining the user's search goal. This paper focuses on extracting the above information using client-side data. Before discussing it in detail, we first present a few definitions that are used in this study.

4.1 Definitions

4.1.1 Document Representation
We use the vector space model to represent documents [4].

4.1.3 Kullback-Leibler divergence
Given two probability densities pi and pj, the distance between pi and pj can be defined by the Kullback-Leibler divergence as follows:

    KLD(pi || pj) = Σ pi log(pi / pj)    (2)

KLD is a non-symmetric measure of the difference between two probability distributions; it is also known as relative entropy in information theory. The KLD between pi and pj is zero if pi = pj; a larger KLD indicates a higher cross-entropy4 between pi and pj. Considering a collection of documents, a high KLD between the probability distribution of a term in a local set of documents and the probability distribution of the term in the entire set of documents indicates that the term is relatively frequent in the local set in contrast to the entire collection. KLD is effectively used to determine popular terms in a local set of documents, which is often necessary for local-analysis-based query expansion [1]. Similarly, we also use KLD to extract popular terms.

4.1.4 χ2
The χ2 (CHI) statistic is defined by the following expression in [7]:

    χ2(t, c) = N × (AD − CB)^2 / ((A + C) × (B + D) × (A + B) × (C + D))

where N is the number of documents, A is the number of documents of class c containing the term t, B is the number of documents of other classes (not c) containing t, C is the number of documents of class c not containing the term t, and D is the number of documents of other classes not containing t. It measures the lack of independence between t and c and is comparable to the χ2 distribution with one degree of freedom. The χ2 statistic is known to be unreliable for low-frequency terms [3]. The commonly used global goodness estimation functions are the maximum and mean functions, i.e.,

    χ2(f) = max_ci χ2(f, ci)

or

    χ2(f) = Σi Pr(ci) χ2(f, ci)

4.2 What does the proxy server add to the search request

4.2.1 Dominant document class
This module analyses the distribution of the categories of the documents or Web pages visited by the user in the last n minutes. Such category information is important for customizing search results; for instance, the WSE can return the documents belonging to the dominant class alone.

4.2.2 Informative terms
Assuming that the user has visited k documents in the last n minutes, this module extracts informative terms which can represent the user's search context. KL-divergence is a commonly used mechanism for determining informative terms in query expansion with local analysis [6]. We also use KL-divergence to determine informative terms from the k documents because of this conceptual similarity.

4.2.3 Statistical Analyzer
We can also analyze information such as how long the user spends on different activities such as email checking,

4 http://en.wikipedia.org/wiki/Cross_entropy
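The Time Tracker's accounting rule from Section 3.2 (a window active for more than 30 seconds counts as read) can be sketched as follows. This is a minimal illustration, not the authors' code: the class and method names are our own, and we assume the wmctrl polling loop has already produced parallel arrays of poll timestamps and active-window titles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TimeTracker {
    static final long READ_THRESHOLD_SECS = 30; // active > 30 s => "read"

    // ts[i] is the poll time (in seconds) at which titles[i] was the active
    // window; that window is assumed active until the next sample arrives.
    static Map<String, Long> activeSeconds(long[] ts, String[] titles) {
        Map<String, Long> acc = new LinkedHashMap<>();
        for (int i = 0; i + 1 < ts.length; i++) {
            acc.merge(titles[i], ts[i + 1] - ts[i], Long::sum);
        }
        return acc;
    }

    // A document counts as read once its window has been active long enough.
    static boolean isRead(Map<String, Long> acc, String title) {
        return acc.getOrDefault(title, 0L) > READ_THRESHOLD_SECS;
    }
}
```

In the real tool the samples would come from repeatedly invoking wmctrl; here the accounting step is isolated so the 30-second threshold is easy to see.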
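The KLD-based informative-term selection of Sections 4.1.3 and 4.2.2 can be sketched as below. We assume per-term relative frequencies have already been computed for the local set (the k recently read documents) and for the whole collection; the class and method names are illustrative, and the small floor used for terms missing from the global map is our own smoothing assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class KldTerms {
    // Pointwise KLD contribution of a term (Eq. 2):
    // p_local(t) * log(p_local(t) / p_global(t)).
    static double kld(double pLocal, double pGlobal) {
        if (pLocal <= 0.0 || pGlobal <= 0.0) return 0.0;
        return pLocal * Math.log(pLocal / pGlobal);
    }

    // Rank terms by how much more frequent they are in the local set than in
    // the whole collection; the top-n terms are taken as informative terms.
    static List<String> topTerms(Map<String, Double> local,
                                 Map<String, Double> global, int n) {
        List<String> terms = new ArrayList<>(local.keySet());
        terms.sort((a, b) -> Double.compare(
                kld(local.get(b), global.getOrDefault(b, 1e-9)),
                kld(local.get(a), global.getOrDefault(a, 1e-9))));
        return terms.subList(0, Math.min(n, terms.size()));
    }
}
```

A term that is equally frequent locally and globally scores zero, so common stop-word-like terms fall to the bottom of the ranking.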
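The χ2 computation of Section 4.1.4, together with the maximum global goodness function, translates directly into code. This is a minimal sketch with illustrative names; each contingency table holds the counts A, B, C, D defined above, and N is their sum.

```java
class ChiSquare {
    // chi^2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), Section 4.1.4,
    // where N = A + B + C + D is the total number of documents.
    static double chi2(long a, long b, long c, long d) {
        double n = a + b + c + d;
        double diff = (double) a * d - (double) c * b;
        double den = (double) (a + c) * (b + d) * (a + b) * (c + d);
        return den == 0.0 ? 0.0 : n * diff * diff / den;
    }

    // Global goodness via the maximum function: chi2(t) = max_i chi2(t, c_i).
    // Each row of tables holds {A, B, C, D} for one class.
    static double chi2Max(long[][] tables) {
        double best = 0.0;
        for (long[] t : tables) best = Math.max(best, chi2(t[0], t[1], t[2], t[3]));
        return best;
    }
}
```

The counts are accumulated in doubles before multiplying so that large collections do not overflow the intermediate products.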
Table 1: Data Set Statistics
Parameter                Count
No. of Users             13
No. of Days              3 Weeks
No. of Search Queries    744
No. of Desktop Clicks    586
No. of Search Clicks     170
Avg. Words/Query         4.24

Table 2: Desktop & Search Doc Click Count
Time Span     #Desktop Clicks   #Search Clicks
5 Minutes     4.35              1.34
10 Minutes    8.00              1.36
15 Minutes    10.52             1.44
20 Minutes    11.03             1.43
25 Minutes    12.45             1.41
30 Minutes    13.17             1.47
35 Minutes    13.24             1.45
40 Minutes    13.93             1.45
45 Minutes    14.38             1.45
50 Minutes    14.24             1.43
55 Minutes    13.72             1.53
60 Minutes    13.81             1.53

Figure 3: Distribution of non-zero cosine similarity between client desktop activities and Web search activities (y-axis: cosine similarity with KLD features, 0-1; x-axis: query instances)
Table 2 further shows the average number of documents a user explores before submitting a query, over different time spans. It clearly shows that the document count converges at around 15 minutes. Surprisingly, users often open the same documents between queries, or ask different queries while reading a document, which results in repeated opens of the same document.

We use the boolean VSM (see Section 4.1.1) to represent the documents and the top KLD terms to build the vocabulary set. Figure 3 shows the average cosine similarity between the search documents and the documents read within a time span of 30 minutes. We observe that around 65.71% of the total query instances have non-zero cosine similarity. It clearly indicates that a large number of query instances are
influenced by the user's activities performed before submitting the query. We also verify that the overhead incurred by the proxy in both bandwidth and execution time is negligible.

Table 4: Time Overhead
Time to get data
Min    2 ms
Avg    2.23 ms
Max    10 ms

Figure: Percentage of matching dominant class over different time spans

7. REFERENCES
[1] C. Carpineto, R. de Mori, G. Romano, and B. Bigi. An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27, 2001.
[2] J. M. Carroll and M. B. Rosson. Paradox of the active user. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, pages 80–111, 1987.
[3] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.
[4] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[5] C. Grimes, D. Tang, and D. M. Russell. Query logs alone are not enough. In Proceedings of the Workshop on Query Log Analysis, 2007.
[6] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In SIGIR '96: Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11, 1996.
[7] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 412–420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[8] G. Zhu and G. Mishne. Mining rich session context to improve web search. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 1037–1046, 2009.

APPENDIX
A. CLASSIFIER
We design and implement a Naive Bayes text classifier and build it with a Chi-square feature selection mechanism using DMOZ6 datasets. We use NB for its efficiency and scalability, and Chi-square because it is one of the most effective feature selection mechanisms for text classification [7]. In our estimation, we assume that all classes are equally likely; otherwise Pr(ck|di) is often biased toward the class with more examples, i.e., when Pr(ck) >> Πj Pr(dij|ck). As the denominator is independent of the class, we ignore it.

6 http://www.dmoz.org/
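The equal-prior scoring described above can be sketched as follows. This is an illustration of the decision rule only, not the authors' implementation: the logProb map (class to per-term log-likelihoods) and the smoothing floor for unseen terms are hypothetical inputs standing in for a trained model.

```java
import java.util.List;
import java.util.Map;

class NbClassifier {
    // With equal class priors and the class-independent denominator dropped,
    // the predicted class is argmax_k sum_j log Pr(w_j | c_k). Summing log
    // probabilities instead of multiplying avoids floating-point underflow.
    static String classify(Map<String, Map<String, Double>> logProb,
                           List<String> docTerms, double unseenLogProb) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> cls : logProb.entrySet()) {
            double score = 0.0; // equal priors: the log Pr(c_k) term is constant
            for (String t : docTerms)
                score += cls.getValue().getOrDefault(t, unseenLogProb);
            if (score > bestScore) { bestScore = score; best = cls.getKey(); }
        }
        return best;
    }
}
```

Because the prior term Pr(ck) is constant across classes under the equal-likelihood assumption, dropping it (like the denominator) does not change the argmax.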