Project Report PWS

PROJECT REPORT
ON
SUPPORTING PRIVACY PROTECTION IN PERSONALIZED WEB SEARCH
1. PROBLEM DEFINITION
Our work spans several important domains of computer science, viz. knowledge and
data engineering, artificial intelligence (including data mining), expert systems,
decision support systems, and information retrieval systems.
Most commercial search engines return roughly the same results for the same query,
regardless of the user's actual interest, i.e. what the user really wants. As a result,
they also return some irrelevant data. Our method helps the user get results that are
relevant to his query. Consider the following example:
If a user submits the query "bank" to a search engine, a normal search engine
returns results about blood banks, money banks, and river banks without considering
which of these the user is actually interested in. The user then has to state his
interest manually (explicit feedback), which takes extra effort. Our personalized
search helps the user get relevant results without this extra manual effort. Our
approach captures both positive and negative preferences in a user profile, which is
then employed by a clustering algorithm. We use the SPY NB algorithm to create the
user profile, followed by an agglomerative clustering algorithm.
2. LITERATURE SURVEY
Searching is one of the most important tasks in computer science. If searching
takes a long time, it results in high cost and lowers system performance.
Present searching techniques do not take user interest into account. Most of them
show results related to the latest data present in their primary memory.
For example, consider the song "kolavary-di", which was a heavily searched song on
Google. Even if a user is interested in "cola" information related to the Coca-Cola
Company, he will still be shown results for the kolavary-di song, because present
search engines base their results on previous user searches.
This drawback has been overcome in the Concept Based User Profiles project.
User interest is taken into consideration by creating a user profile for each query.
Two algorithms are used:
1. Suffix (SPY NB algorithm)
2. Clustering (Agglomerative clustering algorithm)
The suffix algorithm removes articles from the search query entered by the user and
converts it into keyword form, which can be compared with the keywords in the
database.
The clustering algorithm forms clusters of the pages to be shown to the user,
favouring pages that have higher weights.
During profile creation, the system assigns weights to the pages according to the
user's interest: +1 for positive and +0 for negative interest.
2.1 Related work:

Existing strategies create a single profile per user, but a conflict occurs when the
user's interest varies for the same query. E.g., for the query "bank", a user who is
interested in banking exams may be slightly interested in money bank accounts and
not at all interested in blood banks. To resolve such conflicts we use negative
preferences, which give a fine-grained separation between the results the user is
interested in and those he is not. Consider the following two aspects:
2.1.1) Document-Based method:
These methods aim at capturing users' clicking and browsing behaviours. They work
with clickthrough data from the user, i.e. the documents the user has clicked on.
Clickthrough data in search engines can be thought of as triplets (q, r, c)
where,
q = query
r = ranking
c = set of links clicked by user.
Table 2.1.1 illustrates this with an example: the user asked the query "support
vector machine", received the ranking shown in Table 2.1.1, and then clicked on the
links ranked 1, 3, and 7. Every query corresponds to one triplet.
Table 2.1.1 Document Based Method.
1. Kernel Machines
http://svm.first.gmd.de/
2. Support Vector Machine
http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine
http://ais.gmd.de/~thorsten/svm_light/
4. An Introduction to Support Vector Machines
http://www.support-vector.net/
5. Support Vector Machine and Kernel Methods References
http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES@JISCMAIL.AC.UK
http://www.jiscmail.ac.uk/lists/SUPPORT-VECTOR-MACHINES.html
7. Lucent Technologies: SVM demo applet
http://svm.research.bell-labs.com/SVT/SVMsvt.html
8. Royal Holloway Support Vector Machine
http://svm.dcs.rhbnc.ac.uk/
9. Support Vector Machine - The Software
http://www.support-vector.net/software.html
Table 2.1.1 shows the ranking presented for the query "support vector machine".
The links the user clicked on (ranks 1, 3, and 7) are marked in bold in the original
table. This strategy for extracting preference feedback is summarized in the
following algorithm.
Table 2.1.2 Algorithm 1: Extracting Preference Feedback from Clickthrough

Procedure:
For a ranking (link_1, link_2, link_3, ...) and a set C containing the ranks of the
clicked-on links, extract a preference example
$\mathrm{link}_i <_{r^*} \mathrm{link}_j$
for all pairs $1 \le j < i$, with $i \in C$ and $j \notin C$.

Joachims' method assumes that a user scans the search result list from top to
bottom. If a user has skipped a document d_i at rank i before clicking on document
d_j at rank j, it is assumed that he/she must have scanned document d_i and decided
to skip it. Thus, we can conclude that the user prefers document d_j to document d_i
(i.e., d_j <_r d_i, where r is the user's preference order of the documents in the
search result list) [2].
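The extraction rule of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation; the function name extract_preferences and the (i, j) pair representation are assumptions made for the example.

```python
def extract_preferences(clicked_ranks):
    """Return (i, j) pairs meaning "link_i is preferred over link_j".

    A pair is produced for every clicked rank i and every rank j above it
    that was not clicked (1 <= j < i, i in C, j not in C).
    """
    clicked = set(clicked_ranks)
    preferences = []
    for i in sorted(clicked):
        for j in range(1, i):
            if j not in clicked:
                preferences.append((i, j))
    return preferences


# Clicks on ranks 1, 3 and 7, as in the example of Table 2.1.1.
print(extract_preferences({1, 3, 7}))
```

For the clicks on ranks 1, 3, and 7, the skipped links above rank 3 and rank 7 each yield one preference pair, while the click on rank 1 yields none because nothing was skipped above it.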
2.1.2) Concept-based methods:
These methods aim at capturing users' conceptual needs from their browsed
documents and search histories. User profiles are used to represent users' interests
and to infer their intentions for new queries. In this approach, a user profile consists
of a set of categories and, for each category, a set of terms (keywords) with weights.
Each category represents a user interest in that category. The weight of a term in a
category reflects the significance of the term in representing the user's interest in
that category.
Table 2.2 (a)
Doc\Term | apple | recipe | pudding | football | soccer | fifa
D1 | 1 | 0 | 0 | 0 | 0 | 0
D2 | 0.58 | 0.58 | 0.58 | 0 | 0 | 0
D3 | 0 | 0 | 0 | 1 | 0 | 0
D4 | 0 | 0 | 0 | 0.58 | 0.58 | 0.58
(a) Document-Term matrix DT
Table 2.2 (b)
Doc\Category | COOKING | SOCCER
D1 | 1 | 0
D2 | 1 | 0
D3 | 0 | 1
D4 | 0 | 1
(b) Document-Category matrix DC
Matrix Representation of User Search History and User Profile:

We use matrices to represent user search histories and user profiles. Table 2.2
shows an example of the matrix representations of a search history and a profile for
a particular user, who is interested in the categories COOKING and SOCCER. This
user's search history is represented by two matrices, DT (Table 2.2(a)) and DC
(Table 2.2(b)) [3].
i) DT = document-term matrix, constructed from the user queries and the relevant
documents.
ii) DC = document-category matrix, constructed from the relationships between the
categories and the documents.
iii) A user profile is represented by a category-term matrix M (Table 2.2(c)).
In this example, D1, D2, ... are documents; lowercase words such as football and
apple are terms; uppercase words such as SOCCER and COOKING are categories.
Table 2.2 (c)
Cate\Term | apple | recipe | pudding | football | soccer | fifa
COOKING | 1 | 0.37 | 0.37 | 0 | 0 | 0
SOCCER | 0 | 0 | 0 | 1 | 0.37 | 0.37
(c) Category-Term matrix M represents a user profile
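The report does not state how M is derived from DT and DC. One construction that reproduces the 0.37 entries of Table 2.2(c) is to sum the term vectors of the documents in each category and divide each row by its largest value. The sketch below assumes that construction; the function name and the matrix values as laid out here are part of the assumption.

```python
# Document-term matrix DT (rows D1..D4; columns apple, recipe, pudding,
# football, soccer, fifa) and document-category matrix DC (columns
# COOKING, SOCCER). The layout is an assumed reading of Table 2.2.
DT = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.58, 0.58, 0.58, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.58, 0.58, 0.58],
]
DC = [[1, 0], [1, 0], [0, 1], [0, 1]]


def category_term_matrix(dt, dc):
    """Sum document rows per category, then scale each row by its maximum."""
    n_docs, n_terms, n_cats = len(dt), len(dt[0]), len(dc[0])
    profile = []
    for c in range(n_cats):
        row = [sum(dc[d][c] * dt[d][t] for d in range(n_docs))
               for t in range(n_terms)]
        peak = max(row)
        profile.append([round(v / peak, 2) if peak else 0.0 for v in row])
    return profile


M = category_term_matrix(DT, DC)
for name, row in zip(["COOKING", "SOCCER"], M):
    print(name, row)
```

Under this assumption the term recipe gets weight 0.58 / 1.58, which rounds to 0.37, in the COOKING row.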
3. SOFTWARE REQUIREMENT SPECIFICATION

3.1 INTRODUCTION:
Personalized search is an important research area that aims to resolve the
ambiguity of query terms. To increase the relevance of search results, personalized
search engines create user profiles to capture the users' personal preferences and
thereby identify the actual goal of the input query.
Most personalization methods focus on creating one single profile per user and
apply the same profile to all of the user's queries. We believe that different queries
from a user should be handled differently because a user's preferences may vary
across queries. For example, a user who prefers information about fruit for the query
"orange" may prefer information about Apple Computer for the query "apple".
3.1.1 Project scope:
The proposed system covers important domains of computer science, viz. knowledge
and data engineering, artificial intelligence (including data mining), expert systems,
decision support systems, and information retrieval systems.
We focus on search engine personalization and develop several concept-based
user profiling methods that are based on both positive and negative preferences.
The profiles that capture and utilize both the user's positive and negative
preferences perform the best. Profiles with negative preferences increase the
separation between similar and dissimilar queries.
This separation provides a clear threshold at which an agglomerative clustering
algorithm can terminate, and it improves the overall quality of the resulting query
clusters.
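To illustrate how a similarity threshold can stop agglomerative clustering, here is a minimal sketch. It assumes each query is described by a concept-weight dictionary (negative weights stand for negative preferences), uses cosine similarity, and merges profiles by adding weights; all names, weights, and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse concept-weight vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def merge(u, v):
    """Combine two profiles by adding their concept weights."""
    out = dict(u)
    for k, w in v.items():
        out[k] = out.get(k, 0.0) + w
    return out


def cluster_queries(profiles, threshold):
    """Merge the most similar pair until no pair exceeds the threshold."""
    clusters = [([q], dict(vec)) for q, vec in profiles.items()]
    while len(clusters) > 1:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(clusters[a][1], clusters[b][1])
                if s > best:
                    best, pair = s, (a, b)
        if pair is None:  # nothing similar enough: terminate
            break
        a, b = pair
        merged = (clusters[a][0] + clusters[b][0],
                  merge(clusters[a][1], clusters[b][1]))
        clusters = [c for i, c in enumerate(clusters) if i not in pair]
        clusters.append(merged)
    return [sorted(names) for names, _ in clusters]


profiles = {
    "apple": {"computer": 1.0, "fruit": -0.5},   # negative preference on fruit
    "macbook": {"computer": 0.9, "laptop": 0.4},
    "orange": {"fruit": 1.0, "juice": 0.6},
}
print(cluster_queries(profiles, threshold=0.5))
```

In this toy input the negative weight on fruit pushes the similarity between "apple" and "orange" below zero, so the gap between similar and dissimilar pairs is wide and the threshold is easy to place.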
3.1.2 User Classes and Characteristics:

The concept-based user profile for the search engine is designed with a general
user in mind.
Relevant search results are obtained by considering the user's real area of interest.
3.1.3 Operating Environment:

The project is a web-based application, so a web browser is the minimum
requirement. An Apache Tomcat server is used to run the application.
Minimum platform requirements:
Processor: Pentium 4 or above
Minimum hard disk space: 2 GB
Memory (RAM): 512 MB
Operating system: Windows 2003 Server and all later Windows operating systems
Tomcat Apache 6.8
Jdk1.6
MS Access
3.1.4 Assumptions and Dependencies:

First, relationships between users cannot be mined from the concept-based user
profiles to perform collaborative filtering. Thus users with the same interests cannot
share their profiles.
Second, the existing user profiles cannot be used to predict the intent of unseen
queries, so when a user submits a new query, personalization cannot benefit it.
Third, the concept-based user profiles cannot be integrated into the ranking
algorithms of a search engine, so search results cannot be ranked according to
individual users' interests.
3.2 SYSTEM FEATURES:

3.2.1 Query Text Features:

Users decide which results to examine in more detail by looking at the result
title, URL, and summary.
3.2.2 Title overlap Features:

The overlap between the words in the title and the words in the query is examined.
For example: if the query is "reserve bank", the query words "reserve" and "bank"
are checked against the words in the title of the URL.
3.2.3 Query Length Features:

The number of tokens in the query is counted.
For example: for the query "reserve bank", the number of tokens = 2.
3.2.4 Query Next Overlap Features:

The average fraction of words shared with the next query is computed.
For example: the query "reserve bank" is entered first, followed by the query
"blood bank". The two queries share the word "bank", so the fraction of shared
words is computed for both queries.
3.2.5 Summary overlap Features:

The fraction of words shared by the query and the result summary is computed.
For example: if the query is "reserve bank", the query words "reserve" and "bank"
are checked against the words in the summary of the opened URL page.
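The query text features in 3.2.2 to 3.2.5 can be sketched as small functions. This is an illustrative reading that assumes whitespace tokenization and lowercase matching; the function names, and the choice to average the shared fraction over both queries, are assumptions rather than the project's definitions.

```python
def tokens(text):
    """Lowercase whitespace tokenization (an assumed, simple tokenizer)."""
    return text.lower().split()


def query_length(query):
    """3.2.3: number of tokens in the query."""
    return len(tokens(query))


def overlap_fraction(query, text):
    """3.2.2 / 3.2.5: fraction of query words found in a title or summary."""
    q = set(tokens(query))
    if not q:
        return 0.0
    return len(q & set(tokens(text))) / len(q)


def next_query_overlap(query, next_query):
    """3.2.4: shared-word fraction, averaged over the two queries."""
    a, b = set(tokens(query)), set(tokens(next_query))
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return (shared / len(a) + shared / len(b)) / 2


print(query_length("reserve bank"))
print(overlap_fraction("reserve bank", "Reserve Bank of India"))
print(next_query_overlap("reserve bank", "blood bank"))
```

The title "Reserve Bank of India" is a made-up example; any title or summary string can be passed to overlap_fraction in the same way.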
3.2.6 Browsing Features:
Simple aspects of the user's web page interactions can be captured and
quantified. These features characterize interactions with pages beyond the
results page.
For example: we compute how long users dwell on a page or domain, and the
deviation of dwell time from the expected page dwell time for a query.
These features allow us to model intra-query diversity of page browsing
behavior.
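A dwell-time deviation can be computed as the difference between the observed dwell time and an expected value. The sketch below assumes the expected dwell time is the mean of past dwell times recorded for the same query; that estimator and the function name are assumptions, since the report does not define them.

```python
from statistics import mean


def dwell_time_deviation(observed_seconds, past_dwell_seconds):
    """Observed dwell time minus the mean of past dwell times for the query."""
    expected = mean(past_dwell_seconds)
    return observed_seconds - expected


# A user stays 50 s on a page whose past visits for this query lasted
# 20 s, 30 s and 40 s.
print(dwell_time_deviation(50, [20, 30, 40]))
```

A positive value means the user stayed longer than usual for this query, which may hint at higher interest in the page.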
3.2.7 Clickthrough Features:
Clicks are a special case of user interaction with the search engine.
For example: for a query-URL pair we record the number of clicks on the
result, as well as whether there was a click on a result below or above the
current URL. Derived feature values such as ClickRelativeFrequency and
ClickDeviation are also computed.
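The report names ClickRelativeFrequency and ClickDeviation without defining them. The sketch below assumes that the relative frequency is the share of a query's clicks that went to the URL, and that the deviation is that share minus an expected share supplied by the caller; the log format and function name are also assumptions.

```python
from collections import Counter


def click_features(click_log, query, url, expected_relative_frequency):
    """Compute click features for one query-URL pair.

    click_log is a list of (query, clicked_url) tuples (assumed format).
    """
    clicks = Counter(u for q, u in click_log if q == query)
    total = sum(clicks.values())
    frequency = clicks[url]
    relative = frequency / total if total else 0.0
    return {
        "ClickFrequency": frequency,
        "ClickRelativeFrequency": relative,
        "ClickDeviation": relative - expected_relative_frequency,
    }


log = [("bank", "a"), ("bank", "a"), ("bank", "b"), ("apple", "c")]
print(click_features(log, "bank", "a", expected_relative_frequency=0.5))
```

Here URL "a" received two of the three clicks recorded for the query "bank", so its relative frequency is above the assumed expectation of 0.5.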
3.2.8 Ranking with implicit feedback:

Modern web search engines rank results based on a large number of
features, including content-based features (i.e., how closely a query
matches the text, title, or anchor text of the document) and
query-independent page quality features (e.g., the PageRank of the
document or the domain).
In most cases, automatic (or semi-automatic) methods are developed for
tuning the specific ranking function that combines these feature values.
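One simple form a ranking function that combines feature values can take is a weighted linear sum. The sketch below is only an illustration of that idea: the feature names, the weights, and the linear form itself are assumptions, and in practice the weights would be tuned automatically rather than set by hand.

```python
def score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def rank(results, weights):
    """Order result URLs from highest to lowest combined score."""
    return sorted(results, key=lambda url: score(results[url], weights),
                  reverse=True)


# Hypothetical feature values for two results of one query.
results = {
    "a": {"title_overlap": 1.0, "pagerank": 0.2},
    "b": {"title_overlap": 0.5, "pagerank": 0.9},
}
weights = {"title_overlap": 0.7, "pagerank": 0.3}
print(rank(results, weights))
```

Implicit feedback features such as click or dwell-time values could be added to the same dictionaries and given their own weights.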
3.3 EXTERNAL USER INTERFACES:
3.3.1. Hardware Interfaces:
Processor: Pentium 4 or above
Minimum hard disk space: 40 GB
Memory (RAM): 512 MB
3.3.2. Software interfaces:
Operating system: Windows 2003 Server and all later Windows operating systems
Tomcat Apache 7.1
Jdk1.7
MySQL_5.5
3.4 ANALYSIS MODEL:

3.4.1 Data Flow Diagrams
Data Flow Diagrams (DFDs) are used to show the relationships among the
functions within a system.
Fig 3.4.1.1 DFD 0
Fig 3.4.1.2 DFD 1
3.4.2 Class Diagram:

In design specification, a class diagram can be used to specify the interfaces and
classes that will be implemented in an object-oriented program.
Fig 3.4.2 class diagram
3.4.3 Activity Diagram:

Activity diagrams show the sequences of states that an object goes through, the
events that cause a transition from one state to another, and the actions that result
from a state change.
Fig 3.4.3 Activity diagram
4. SYSTEM DESIGN
4.1 UML DIAGRAMS:

4.1.1 Use Case Diagram:
The use case view models the functionality of the system as perceived by outside
users. A use case is a coherent unit of functionality expressed as a transaction among
actors and the system.
Fig 4.1.1 Use Case Diagram for Concept Based User Profile Search Engine
4.1.2 Sequence Diagram:

A sequence diagram is a graphical view of a scenario that shows object
interaction in a time-based sequence: what happens first, what happens next.
Sequence diagrams establish the roles of objects and help provide essential
information to determine class responsibilities and interfaces.
Fig 4.1.2 Sequence Diagram
4.1.3 Deployment Diagram:

A deployment diagram shows the allocation of processes to processors
in the physical design and architecture of a system.
Fig 4.1.3 Deployment Diagram
4.1.4 Timing Diagram:

Activity | Start date | End date | Duration (days)
Initiate the project | 04/8/16 | 24/8/16 | 21
  Communication | 04/8/16 | 10/8/16 |
  Literature survey | 11/8/16 | 17/8/16 |
  Define scope | 18/8/16 | 19/8/16 |
  Develop SRS | 20/8/16 | 24/8/16 |
Plan the project | 25/8/16 | 5/10/16 | 42
  Design mathematical model | 25/8/16 | 31/8/16 |
  Feasibility Analysis | 01/9/16 | 07/9/16 |
  Develop work breakdown structure | 08/9/16 | 09/9/16 |
  Planning project schedule | 10/9/16 | 14/9/16 |
  Design UML and other diagrams | 15/9/16 | 21/9/16 |
  Design test plan | 22/9/16 | 28/9/16 |
  Design risk management plan | 29/9/16 | 5/10/16 |
Execute the project | 05/01/17 | 29/03/17 | 84
  Build and test basic functional unit | 05/01/17 | 25/01/17 | 21
  Build and test database with login and session maintenance facility | 26/01/17 | 15/02/17 | 21
  Build and test Bluetooth mode | 16/02/17 | 08/03/17 | 21
  Build and test security features | 09/03/17 | 29/03/17 | 21
TABLE 4.1.4.1: GRAPHICAL TIME CHART (1)
[Gantt chart; the bars are not recoverable. Columns: nine weekly columns (weeks I to
IX) starting Aug 4, Aug 11, Aug 18, Aug 25, Sept 1, Sept 8, Sept 15, Sept 22, and
Sept 29. Rows: Initiate the project (Communication, Literature survey, Define scope,
Develop SRS); Plan the project (Design mathematical model, Feasibility Analysis,
Develop work breakdown structure, Planning project schedule, Design UML and other
diagrams, Design test plan, Design risk management plan).]
TABLE 4.1.4.2: GRAPHICAL TIME CHART (2)
[Gantt chart; the bars are not recoverable. Columns: twelve weekly columns (weeks XI
to XXII) running from Jan 5 to Mar 23. Rows: Execute the project (Build and test
basic functional unit, Build and test database with login and session maintenance
facility, Build and test Bluetooth mode, Build and test security features).]
TABLE 4.1.4.3: GRAPHICAL TIME CHART (3)
5. TECHNICAL SPECIFICATION
5.1 ADVANTAGES
1. A relevant search result is obtained in a short period of time.
2. The user's interest is taken into account by creating a user profile for
each query.
3. The same user can issue different queries, each with its own profile.
5.2 DISADVANTAGES
1. User profiles cannot be shared between users, although sharing could help to
reduce time.
2. The existing user profiles cannot be used to predict the intent of unseen
queries.
5.3 APPLICATIONS
1. Text Mining.
3. Search Engine for Business Application.
APPENDIX A: MATHEMATICAL MODEL

CONCEPT BASED USER PROFILE FOR SEARCH ENGINE (MATHEMATICAL MODEL)
Let S be the system for CONCEPT BASED USER PROFILE.
S = {I, F, O}
where
I = {i1, i2, i3, ... | I is the set of inputs to the function set}
F = {f1, f2, f3, ..., fn | F is the set of functions that execute commands}
O = {o1, o2, o3, ... | O is the set of outputs from the function set}
For this system:
I = {Query submitted by the user, ...}
O = {Output of desired query, ...}
F = {Functions implemented to get the output,
SPY-NB algorithm,
Clustering algorithm}
Search engine:
A1: Query provided by the user. E.g.: Apple iphone
A2: Query provided by the user. E.g.: Orange fruit
A3: Wrong or incorrect query submitted
R1: Resulting web snippets provided by the search engine
R2: Error routine in the search engine
6. BIBLIOGRAPHY
[1] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by
incorporating user behavior information," in Proc. of ACM SIGIR Conference, 2006.
[2] E. Agichtein, E. Brill, S. Dumais, and R. Ragno, "Learning user interaction
models for predicting web search result preferences," in Proc. of ACM SIGIR
Conference, 2006.
[3] Appendix: 500 test queries. [Online]. Available:
http://www.cse.ust.hk/dlee/tkde09/Appendix.pdf
[4] R. Baeza-Yates, C. Hurtado, and M. Mendoza, "Query recommendation using
query logs in search engines," vol. 3268, pp. 588-596, 2004.
[5] D. Beeferman and A. Berger, "Agglomerative clustering of a search engine
query log," in Proc. of ACM SIGKDD Conference, 2000.
[6] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G.
Hullender, "Learning to rank using gradient descent," in Proc. of the International
Conference on Machine Learning (ICML), 2005.
[7] K. W. Church, W. Gale, P. Hanks, and D. Hindle, "Using statistics in lexical
analysis," Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon,
1991.
[8] Z. Dou, R. Song, and J.-R. Wen, "A large-scale evaluation and analysis of
personalized search strategies," in Proc. of WWW Conference, 2007.
[9] S. Gauch, J. Chaffee, and A. Pretschner, "Ontology-based personalized search
and browsing," ACM WIAS, vol. 1, no. 3-4, pp. 219-234, 2003.
[10] T. Joachims, "Optimizing search engines using clickthrough data," in Proc. of
ACM SIGKDD Conference, 2002.
[11] K. W.-T. Leung, W. Ng, and D. L. Lee, "Personalized concept-based clustering
of search engine queries," IEEE TKDE, vol. 20, no. 11, 2008.
[12] B. Liu, W. S. Lee, P. S. Yu, and X. Li, "Partially supervised classification of
text documents," in Proc. of the International Conference on Machine Learning
(ICML), 2002.
[13] F. Liu, C. Yu, and W. Meng, "Personalized web search by mapping user
queries to categories," in Proc. of the International Conference on Information
and Knowledge Management (CIKM), 2002.
[14] Magellan. [Online]. Available: http://magellan.mckinley.com/
[15] W. Ng, L. Deng, and D. L. Lee, "Mining user preference using spy voting
for search engine personalization," ACM TOIT, vol. 7, no. 4, 2007.
[16] Open directory project. [Online]. Available: http://www.dmoz.org/
[17] M. Speretta and S. Gauch, "Personalized search based on user search
histories," in Proc. of IEEE/WIC/ACM International Conference on Web
Intelligence, 2005.
[18] Q. Tan, X. Chai, W. Ng, and D. Lee, "Applying co-training to clickthrough data
for search engine adaptation," in Proc. of DASFAA Conference, 2004.
[19] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, "Query clustering using user logs,"
ACM TOIS, vol. 20, no. 1, pp. 59-81, 2002.
[20] Y. Xu, K. Wang, B. Zhang, and Z. Chen, "Privacy-enhancing personalized
web search," in Proc. of WWW Conference, 2007.
