
A PROJECT REPORT

ON

SUPPORTING PRIVACY PROTECTION IN PERSONALIZED WEB SEARCH

1. PROBLEM DEFINITION
Our work deals with several important domains of computer science, viz.
knowledge and data engineering, and artificial intelligence, including data mining,
expert systems, decision support systems, and various information retrieval systems.
Most commercial search engines return roughly the same results for the same
query regardless of the user's actual interest, i.e. what exactly the user wants. They
also return some irrelevant data. Our method will help the user get results relevant to
his query. Consider the following example:
If a user submits the query "bank" to a search engine, a normal search engine
will return results covering blood banks, money banks, and river banks without
considering which of these the user is really interested in. The user then has to state
his interest manually (explicit feedback), which takes extra effort. Our personalized
search will help the user get relevant results without this extra manual effort. Our
approach will capture positive and negative preferences in a user profile, which will
then be employed by a clustering algorithm. We will use the SPY NB algorithm to
create the user profile, followed by an agglomerative clustering algorithm.

2. LITERATURE SURVEY
Searching is one of the important tasks in computer science. If searching takes
a long time, it raises cost and lowers system performance.
Present searching techniques do not take user interest into account. Most of
them show results related to the latest data present in their primary memory.
For example, consider the song "Kolaveri Di", which was the most searched
song on Google. Even if a user is interested in "cola" information related to the
Coca-Cola Company, he will still be shown results for the Kolaveri Di song. This is
because the results of present search engines are based on previous user searches.
This drawback has been overcome in the Concept Based User Profiles project.
User interest is taken into consideration by creating a user profile for each query.
Two algorithms are used:
1. Suffix (SPY NB algorithm)
2. Clustering (Agglomerative clustering algorithm)
The suffix algorithm removes articles from the search query entered by the user
and converts it into keyword form, which can be compared with the keywords in the
database.
The clustering algorithm forms clusters of the pages with higher weights to be
shown to the user.
In profile creation, the system assigns weights to the pages the user is
interested in: +1 for positive and 0 for negative interest.

2.1 Related work:

Existing strategies create a single profile per user, but a conflict occurs when
the user's interest varies for the same query. E.g., a user who is interested in banking
exams for the query "bank" may be slightly interested in accounts at a money bank
and not at all interested in blood banks. To resolve such conflicts, we use negative
preferences to obtain a fine-grained separation between the results the user is
interested in and those he is not. Consider the following two approaches:
2.1.1) Document-based methods:

These methods aim at capturing users' clicking and browsing behaviours. They
work with clickthrough data from the user, i.e. the documents the user has clicked on.
Clickthrough data in search engines can be thought of as triplets (q, r, c),
where
q = query
r = ranking presented to the user
c = set of links clicked by the user.
Table 2.1.1 illustrates this with an example: the user asked the query "support
vector machine", received the ranking shown in the table, and then clicked on the
links ranked 1, 3, and 7. Every query corresponds to one triplet.
Table 2.1.1 Document Based Method.
1. Kernel Machines (clicked)
   http://svm.first.gmd.de/
2. Support Vector Machine
   http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (clicked)
   http://ais.gmd.de/~thorsten/svm_light/
4. An Introduction to Support Vector Machines
   http://www.support-vector.net/
5. Support Vector Machine and Kernel Methods References
   http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES@JISCMAIL.AC.UK
   http://www.jiscmail.ac.uk/lists/SUPPORT-VECTOR-MACHINES.html
7. Lucent Technologies: SVM demo applet (clicked)
   http://svm.research.bell-labs.com/SVT/SVMsvt.html
8. Royal Holloway Support Vector Machine
   http://svm.dcs.rhbnc.ac.uk/
9. Support Vector Machine - The Software
   http://www.support-vector.net/software.html
Table 2.1.1 shows the ranking presented for the query "support vector
machine". The links the user clicked on are marked "(clicked)". This strategy for
extracting preference feedback is summarized in the following algorithm.

Table 2.1.2 Algorithm 1: Extracting Preference Feedback from Clickthrough

Procedure:
For a ranking (link_1, link_2, link_3, ...) and a set C containing the ranks of the
clicked-on links, extract a preference example
    $link_i <_{r^*} link_j$
for all pairs $1 \le j < i$, with $i \in C$ and $j \notin C$.
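
The procedure in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the `Clickthrough` class and the `extract_preferences` function are hypothetical names introduced here, and the placeholder links `link1` to `link9` stand in for the nine results of Table 2.1.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clickthrough:
    """One clickthrough triplet (q, r, c)."""
    query: str
    ranking: tuple       # links in the order shown, rank 1 first
    clicked: frozenset   # 1-based ranks of the clicked links


def extract_preferences(record):
    """Return (preferred_rank, skipped_rank) pairs.

    A pair (i, j) is emitted when link i was clicked and link j,
    ranked above it (j < i), was not clicked.
    """
    prefs = []
    for i in sorted(record.clicked):
        for j in range(1, i):
            if j not in record.clicked:
                prefs.append((i, j))
    return prefs


# Hypothetical record mirroring the example: ranks 1, 3, and 7 clicked.
record = Clickthrough(
    query="support vector machine",
    ranking=tuple(f"link{k}" for k in range(1, 10)),
    clicked=frozenset({1, 3, 7}),
)
preferences = extract_preferences(record)
print(preferences)
```

For the example query, the sketch yields one pair for the click at rank 3 (over the skipped rank 2) and four pairs for the click at rank 7 (over the skipped ranks 2, 4, 5, and 6). The click at rank 1 yields nothing, since no result was ranked above it.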


Joachims' method assumes that a user scans the search result list from
top to bottom. If a user has skipped a document di at rank i before clicking on
document dj at rank j, it is assumed that he/she must have scanned document di and
decided to skip it. Thus, we can conclude that the user prefers document dj to
document di (i.e., dj <r di, where r is the user's preference order of the documents in
the search result list).[2]
2.1.2) Concept-based methods:
These methods aim at capturing users' conceptual needs from their browsed
documents and search histories. User profiles are used to represent users' interests and
to infer their intentions for new queries. In this approach, a user profile consists of a
set of categories and, for each category, a set of terms (keywords) with weights. Each
category represents a user interest in that category. The weight of a term in a category
reflects the significance of the term in representing the user's interest in that category.
Table 2.2 (a) Document-Term matrix DT

Doc\Term   apple   recipe   pudding   football   soccer   fifa
D1         1       0        0         0          0        0
D2         0.58    0.58     0.58      0          0        0
D3         0       0        0         1          0        0
D4         0       0        0         0.58       0.58     0.58

Table 2.2 (b) Document-Category matrix DC

Doc\Category   COOKING   SOCCER
D1             1         0
D2             1         0
D3             0         1
D4             0         1

Matrix Representation of User Search History and User Profile:


We use matrices to represent user search histories and user profiles. Table 2.2
shows an example of the matrix representations of a search history and a profile for a
particular user who is interested in the categories COOKING and SOCCER. This
user's search history is represented by two matrices, DT (Table 2.2(a)) and DC (Table
2.2(b)) [3].
i) DT = document-term matrix, constructed from the user queries and the
relevant documents.
ii) DC = document-category matrix, constructed from the relationships
between the categories and the documents.
iii) A user profile is represented by a category-term matrix M (Table 2.2(c)).
In this example, D1, D2, ... are documents; lowercase words such as
football, apple, ... are terms; uppercase words such as SOCCER, COOKING, ...
are categories.
Table 2.2 (c) Category-Term matrix M represents a user profile

Cate\Term   apple   recipe   pudding   football   soccer   fifa
COOKING             0.37     0.37      0          0
SOCCER                                            0.37     0.37
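
To make the relationship between the three matrices concrete, the sketch below derives a category-term profile from DT and DC by averaging the term vectors of the documents in each category. This centroid rule is an assumption chosen for illustration. The text does not state how M is learned, so the weights it produces are not expected to match the values printed in the category-term table. The function name `centroid_profile` is hypothetical.

```python
TERMS = ["apple", "recipe", "pudding", "football", "soccer", "fifa"]
CATEGORIES = ["COOKING", "SOCCER"]

# Document-term matrix DT (rows D1..D4).
DT = [
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.58, 0.58, 0.58, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.58, 0.58, 0.58],
]
# Document-category matrix DC (rows D1..D4, columns COOKING, SOCCER).
DC = [[1, 0], [1, 0], [0, 1], [0, 1]]


def centroid_profile(dt, dc):
    """Average the term vectors of the documents in each category.

    Assumption: a simple centroid rule, used only for illustration.
    """
    n_cats, n_terms = len(dc[0]), len(dt[0])
    profile = []
    for c in range(n_cats):
        members = [dt[d] for d in range(len(dt)) if dc[d][c] == 1]
        row = [
            sum(doc[t] for doc in members) / len(members) if members else 0.0
            for t in range(n_terms)
        ]
        profile.append(row)
    return profile


M = centroid_profile(DT, DC)
for name, row in zip(CATEGORIES, M):
    print(name, dict(zip(TERMS, (round(w, 2) for w in row))))
```

Whatever learning rule is used, the shape of the result is the same: one row per category and one column per term, with non-zero weights only for the terms that occur in that category's documents.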

3. SOFTWARE REQUIREMENT SPECIFICATION


3.1 INTRODUCTION:

Personalized search is an important research area that aims to resolve the
ambiguity of query terms. To increase the relevance of search results, personalized
search engines create user profiles to capture the users' personal preferences and
thereby identify the actual goal of the input query.
Most personalization methods have focused on creating one single profile per
user and applying the same profile to all of the user's queries. We believe that different
queries from a user should be handled differently because a user's preferences may
vary across queries. For example, a user who prefers information about fruit for the
query "orange" may prefer information about Apple Computer for the query
"apple".
3.1.1 Project scope:
The proposed system covers important domains of computer science, viz.
knowledge and data engineering, and artificial intelligence, including data mining,
expert systems, decision support systems, and various information retrieval systems.
We focus on search engine personalization and develop several concept-based
user profiling methods that are based on both positive and negative preferences.
Profiles that capture and utilize both the user's positive and negative
preferences perform best. Profiles with negative preferences can increase the
separation between similar and dissimilar queries.
This separation provides a clear threshold at which an agglomerative clustering
algorithm can terminate, and it improves the overall quality of the resulting query
clusters.
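
The role of the threshold can be illustrated with a small agglomerative clustering sketch. It assumes that each query is represented by a concept-weight vector (negative weights standing for negative preferences), that similarity is measured by cosine, and that merged clusters simply sum their weights. All of these choices, the example queries, their weights, and the threshold of 0.3 are hypothetical and are not taken from the project.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse concept-weight vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def merge(u, v):
    """Combine two profiles by summing their concept weights."""
    out = dict(u)
    for c, w in v.items():
        out[c] = out.get(c, 0.0) + w
    return out


def agglomerative_clusters(profiles, threshold):
    """Merge the most similar pair until no pair reaches the threshold."""
    clusters = [([q], vec) for q, vec in profiles.items()]
    while len(clusters) > 1:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(clusters[a][1], clusters[b][1])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = pair
        merged = (clusters[a][0] + clusters[b][0],
                  merge(clusters[a][1], clusters[b][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(queries) for queries, _ in clusters]


# Hypothetical per-query profiles; negative weights are negative preferences.
profiles = {
    "apple iphone": {"phone": 1.0, "fruit": -0.5},
    "apple mac": {"computer": 1.0, "phone": 0.4, "fruit": -0.5},
    "orange fruit": {"fruit": 1.0, "phone": -0.3},
    "apple pie": {"fruit": 0.8, "recipe": 0.6},
}
clusters = agglomerative_clusters(profiles, threshold=0.3)
print(clusters)
```

In this toy data the negative weights push the similarity between the device-related and fruit-related queries below zero, so any threshold between zero and the within-group similarities stops the merging at two clusters.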

3.1.2 User Classes and Characteristics:

The concept-based user profile for a search engine is designed with a general
user in view.
Relevant search results can be obtained by considering the user's real area of
interest.

3.1.3 Operating Environment:

The project is a web-based application, so a web browser is the minimum
requirement. The Apache Tomcat server is used to run this web-based application.
Minimum platform requirements:

Processor: Pentium 4 or above
Minimum hard disk space: 2 GB
Memory (RAM): 512 MB
Operating system: Windows Server 2003 and all later Windows operating systems
Apache Tomcat 6
JDK 1.6
MS Access

3.1.4 Assumptions and Dependencies:

First, relationships between users cannot be mined from the concept-based user
profiles to perform collaborative filtering. Thus, users with the same interests cannot
share their profiles.
Second, the existing user profiles cannot be used to predict the intent of unseen
queries, so when a user submits a new query, personalization cannot benefit it.
Third, the concept-based user profiles cannot be integrated into the ranking
algorithms of a search engine, so search results cannot be ranked according to
individual users' interests.

3.2 SYSTEM FEATURES:

3.2.1 Query Text Features:

Users decide which results to examine in more detail by looking at
the result title, URL, and summary.

3.2.2 Title Overlap Features:

The overlap between the words in the title and those in the query is examined.
For example: suppose the query is "reserve bank"; then the query words
"reserve" and "bank" are checked against the words in the title of the
result URL.
3.2.3 Query Length Features:

The number of tokens in the query is counted.
For example: if the query is "reserve bank", then the number of
tokens = 2.

3.2.4 Query Next Overlap Features:

The average fraction of words shared with the next query is computed.
For example: the query "reserve bank" is entered first, and then the query
"blood bank" is entered. Here the fraction of words shared by the two
queries (the common word "bank") is computed.

3.2.5 Summary Overlap Features:

The fraction of words shared by the query and the result summary
is computed.
For example: suppose the query is "reserve bank"; then the query words
"reserve" and "bank" are checked against the words in the
summary of the opened URL page.
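
The query-text features described in 3.2.2 to 3.2.5 can be sketched as follows. This is an illustrative sketch only: the whitespace tokenizer, the function names, and the sample title and summary strings are assumptions, and overlap is taken here to mean the fraction of distinct query words found in the other text.

```python
def tokens(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def overlap_fraction(query, text):
    """Fraction of distinct query words that also occur in `text`."""
    q = set(tokens(query))
    return len(q & set(tokens(text))) / len(q) if q else 0.0


def query_text_features(query, next_query, title, summary):
    """Compute the four query-text features for one result."""
    return {
        "query_length": len(tokens(query)),
        "title_overlap": overlap_fraction(query, title),
        "summary_overlap": overlap_fraction(query, summary),
        "query_next_overlap": overlap_fraction(query, next_query),
    }


# Hypothetical result title and summary for the "reserve bank" example.
features = query_text_features(
    query="reserve bank",
    next_query="blood bank",
    title="Reserve Bank of India",
    summary="Official site of the central bank",
)
print(features)
```

With these inputs the query has two tokens, both appear in the title, and only "bank" is shared with the summary and with the next query.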
3.2.6 Browsing Features:

Simple aspects of the user's interactions with web pages can be captured and
quantified. These features are used to characterize interactions with pages
beyond the results page.
For example: we compute how long users dwell on a page or domain,
and the deviation of the dwell time from the expected page dwell time for a query.
These features allow us to model the intra-query diversity of page browsing
behavior.
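
A dwell-time feature of this kind might be computed as in the sketch below. It assumes that the expected dwell time for a query is the mean of previously observed dwell times for that query. That definition, the function name `dwell_features`, and the sample numbers are assumptions made for illustration.

```python
from statistics import mean


def dwell_features(dwell_seconds, query_dwell_history):
    """Dwell time on a page and its deviation from the query's expected dwell.

    Assumption: the expected dwell is the mean of earlier dwell times
    recorded for the same query.
    """
    expected = mean(query_dwell_history) if query_dwell_history else 0.0
    return {
        "dwell_time": dwell_seconds,
        "expected_dwell": expected,
        "dwell_deviation": dwell_seconds - expected,
    }


# Hypothetical dwell times (in seconds) seen earlier for the same query.
history = [30.0, 50.0, 40.0]
page = dwell_features(70.0, history)
print(page)
```

A positive deviation indicates that the user stayed on the page longer than is typical for the query, which can be read as a sign of interest.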
3.2.7 Clickthrough Features:

Clicks are a special case of user interaction with the search engine.
For example: for a query-URL pair we provide the number of clicks for
the result, as well as whether there was a click on a result below or above the
current URL. Derived feature values such as ClickRelativeFrequency
and ClickDeviation are also computed.
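
The clickthrough features can be sketched as below. The exact definitions are not given in the text, so the sketch assumes that the relative frequency is the share of a query's clicks that fall on the result, and that the deviation is the difference between that share and a background click share for the same rank. The function name and all numbers are hypothetical.

```python
def click_features(clicks_by_rank, rank, expected_rel_freq):
    """Click features for the result at `rank` (1-based) of one query.

    clicks_by_rank: observed click counts per rank for this query.
    expected_rel_freq: assumed background click share per rank.
    """
    total = sum(clicks_by_rank)
    count = clicks_by_rank[rank - 1]
    rel = count / total if total else 0.0
    return {
        "click_frequency": count,
        "click_relative_frequency": rel,
        "click_deviation": rel - expected_rel_freq[rank - 1],
        "click_above": any(c > 0 for c in clicks_by_rank[:rank - 1]),
        "click_below": any(c > 0 for c in clicks_by_rank[rank:]),
    }


# Hypothetical click counts for four results and a background distribution.
clicks = [10, 0, 30, 0]
background = [0.5, 0.25, 0.125, 0.125]
third = click_features(clicks, rank=3, expected_rel_freq=background)
print(third)
```

Comparing against a background distribution compensates for the tendency of users to click top-ranked results regardless of relevance, so a result that attracts far more clicks than its position predicts stands out.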

3.2.8 Ranking with Implicit Feedback:

Modern web search engines rank results based on a large number
of features, including content-based features (i.e., how closely a query
matches the text, title, or anchor text of the document) and
query-independent page quality features (e.g., the PageRank of the
document or the domain).
In most cases, automatic (or semiautomatic) methods are
developed for tuning the specific ranking function that combines
these feature values.

3.3 EXTERNAL USER INTERFACES:
3.3.1. Hardware Interfaces:

Processor: Pentium 4 or above
Minimum hard disk space: 40 GB
Memory (RAM): 512 MB

3.3.2. Software Interfaces:

Operating system: Windows Server 2003 and all later Windows operating systems
Apache Tomcat 7
JDK 1.7
MySQL 5.5

3.4 ANALYSIS MODEL:


3.4.1 Data Flow Diagrams
Data Flow Diagrams (DFDs) are used to show the relationships among the
functions within a system.

Fig 3.4.1.1 DFD 0

Fig 3.4.1.2 DFD 1

3.4.2 Class Diagram:


In a design specification, a class diagram can be used to specify the interfaces
and classes that will be implemented in an object-oriented program.

Fig 3.4.2 class diagram


3.4.3 Activity Diagram:


Activity diagrams show the sequence of states that an object goes through, the
events that cause a transition from one state to another, and the actions that result
from each transition.

Fig 3.4.3 Activity diagram

4. SYSTEM DESIGN

4.1 UML DIAGRAMS:


4.1.1 Use Case Diagram:

The use case view models functionality of the system as perceived by outside
users. A use case is a coherent unit of functionality expressed as a transaction among
actors and the system.

Fig 4.1.1 Use Case Diagram for Concept Based User Profile Search Engine


4.1.2 Sequence Diagram:


A sequence diagram is a graphical view of a scenario that shows object
interaction in a time-based sequence: what happens first, and what happens next.
Sequence diagrams establish the roles of objects and help provide essential
information to determine class responsibilities and interfaces.

Fig 4.1.2 Sequence Diagram


4.1.3 Deployment Diagram:


A deployment diagram shows the allocation of processes to processors
in the physical design and architecture of a system.

Fig 4.1.3 Deployment Diagram


4.1.4 Timing Diagram:


Activity                                        Start date   End date   Duration (days)

Initiate the project                            04/8/16      24/8/16    21
  Communication                                 04/8/16      10/8/16
  Literature survey                             11/8/16      17/8/16
  Define scope                                  18/8/16      19/8/16
  Develop SRS                                   20/8/16      24/8/16
Plan the project                                25/8/16      5/10/16    42
  Design mathematical model                     25/8/16      31/8/16
  Feasibility Analysis                          01/9/16      07/9/16
  Develop work breakdown structure              08/9/16      09/9/16
  Planning project schedule                     10/9/16      14/9/16
  Design UML and other diagrams                 15/9/16      21/9/16
  Design test plan                              22/9/16      28/9/16
  Design risk management plan                   29/9/16      5/10/16
Execute the project                             05/01/17     29/03/17   84
  Build and test basic functional unit          05/01/17     25/01/17   21
  Build and test database with login and
  session maintenance facility                  26/01/17     15/02/17   21
  Build and test Bluetooth mode                 16/02/17     08/03/17   21
  Build and test security features              09/03/17     29/03/17   21

TABLE 4.1.4.1: GRAPHICAL TIME CHART (1)

[Gantt chart: weekly columns starting Aug 4, Aug 11, Aug 18, Aug 25, Sept 1,
Sept 8, Sept 15, Sept 22, and Sept 29; rows for "Initiate the project"
(Communication, Literature survey, Define scope, Develop SRS) and "Plan the
project" (Design mathematical model, Feasibility Analysis, Develop work
breakdown structure, Planning project schedule, Design UML and other diagrams,
Design test plan, Design risk management plan)]

TABLE 4.1.4.2: GRAPHICAL TIME CHART (2)


[Gantt chart: twelve weekly columns from Jan 5 to Mar 23; rows for "Execute the
project" (Build and test basic functional unit, Build and test database with
login and session maintenance facility, Build and test Bluetooth mode, Build
and test security features)]

TABLE 4.1.4.3: GRAPHICAL TIME CHART (3)

5. TECHNICAL SPECIFICATION

5.1 ADVANTAGES
1. Relevant search results can be obtained in a short period of time.
2. The user's interest is taken into account by creating a user profile for
each query.
3. The same user can have different queries, each with its own profile.

5.2 DISADVANTAGES
1. User profiles cannot be shared, although sharing could help to reduce search time.
2. The existing user profiles cannot be used to predict the intent of unseen
queries.

5.3 APPLICATIONS
1. Text Mining.
3. Search Engine for Business Application.


APPENDIX A: MATHEMATICAL MODEL

CONCEPT BASED USER PROFILE FOR SEARCH ENGINE (MATHEMATICAL MODEL)
Let S be the system for the CONCEPT BASED USER PROFILE:
S = {I, F, O}
where
I = {i1, i2, i3, ...} is the set of inputs to the function set,
F = {f1, f2, f3, ..., fn} is the set of functions that execute commands,
O = {o1, o2, o3, ...} is the set of outputs from the function set.
For this system:
I = {Query submitted by the user, ...}
O = {Output of desired query, ...}
F = {Functions implemented to get the output,
SPY-NB algorithm,
Clustering algorithm}

Search engine:

[Diagram: inputs A1, A2, A3 and results R1, R2]

A1: Query provided by the user. Eg: Apple iphone
A2: Query provided by the user. Eg: Orange fruit
A3: Wrong or incorrect query submitted
R1: Resulting web snippets provided by the search engine
R2: Error routine in the search engine

6. BIBLIOGRAPHY
[1] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by
incorporating user behavior information," in Proc. of ACM SIGIR Conference,
2006.
[2] E. Agichtein, E. Brill, S. Dumais, and R. Ragno, "Learning user interaction
models for predicting web search result preferences," in Proc. of ACM SIGIR
Conference, 2006.
[3] Appendix: 500 test queries. [Online]. Available:
http://www.cse.ust.hk/dlee/tkde09/Appendix.pdf
[4] R. Baeza-Yates, C. Hurtado, and M. Mendoza, "Query recommendation using
query logs in search engines," vol. 3268, pp. 588-596, 2004.
[5] D. Beeferman and A. Berger, "Agglomerative clustering of a search
engine query log," in Proc. of ACM SIGKDD Conference, 2000.
[6] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G.
Hullender, "Learning to rank using gradient descent," in Proc. of the
International Conference on Machine Learning (ICML), 2005.
[7] K. W. Church, W. Gale, P. Hanks, and D. Hindle, "Using statistics in
lexical analysis," Lexical Acquisition: Exploiting On-Line Resources to
Build a Lexicon, 1991.
[8] Z. Dou, R. Song, and J.-R. Wen, "A large-scale evaluation and analysis of
personalized search strategies," in Proc. of WWW Conference, 2007.
[9] S. Gauch, J. Chaffee, and A. Pretschner, "Ontology-based personalized
search and browsing," ACM WIAS, vol. 1, no. 3-4, pp. 219-234, 2003.
[10] T. Joachims, "Optimizing search engines using clickthrough data," in
Proc. of ACM SIGKDD Conference, 2002.
[11] K. W.-T. Leung, W. Ng, and D. L. Lee, "Personalized concept-based clustering
of search engine queries," IEEE TKDE, vol. 20, no. 11, 2008.
[12] B. Liu, W. S. Lee, P. S. Yu, and X. Li, "Partially supervised classification of
text documents," in Proc. of the International Conference on Machine Learning
(ICML), 2002.
[13] F. Liu, C. Yu, and W. Meng, "Personalized web search by mapping user
queries to categories," in Proc. of the International Conference on Information
and Knowledge Management (CIKM), 2002.
[14] Magellan. [Online]. Available: http://magellan.mckinley.com/
[15] W. Ng, L. Deng, and D. L. Lee, "Mining user preference using spy voting
for search engine personalization," ACM TOIT, vol. 7, no. 4, 2007.
[16] Open directory project. [Online]. Available: http://www.dmoz.org/
[17] M. Speretta and S. Gauch, "Personalized search based on user search
histories," in Proc. of IEEE/WIC/ACM International Conference on Web
Intelligence, 2005.
[18] Q. Tan, X. Chai, W. Ng, and D. Lee, "Applying co-training to clickthrough data
for search engine adaptation," in Proc. of DASFAA Conference, 2004.
[19] J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, "Query clustering using user logs,"
ACM TOIS, vol. 20, no. 1, pp. 59-81, 2002.
[20] Y. Xu, K. Wang, B. Zhang, and Z. Chen, "Privacy-enhancing personalized
web search," in Proc. of WWW Conference, 2007.