KNOWLEDGE AND DATA ENGG

1.SLICING: A NEW APPROACH TO PRIVACY PRESERVING DATA PUBLISHING
Several anonymization techniques, such as generalization and bucketization, have
been designed for privacy preserving microdata publishing. Recent work has shown
that generalization loses a considerable amount of information, especially for
high-dimensional data. Bucketization, on the other hand, does not prevent membership
disclosure and does not apply for data that do not have a clear separation between
quasi-identifying attributes and sensitive attributes. In this paper, we present a
novel technique called slicing, which partitions the data both horizontally and
vertically. We show that slicing preserves better data utility than generalization
and can be used for membership disclosure protection. Another important
advantage of slicing is that it can handle high-dimensional data. We show how slicing
can be used for attribute disclosure protection and develop an efficient algorithm
for computing the sliced data that obey the ℓ-diversity requirement. Our workload
experiments confirm that slicing preserves better utility than generalization and is
more effective than bucketization in workloads involving the sensitive attribute.
Our experiments also demonstrate that slicing can be used to prevent membership
disclosure.
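
The core slicing operation is simple enough to prototype. The sketch below is illustrative only (the column grouping, bucket size, and toy microdata are assumptions, not the authors' implementation): attributes are partitioned into columns, tuples into buckets, and each column's values are independently permuted within every bucket so that cross-column associations are broken.

import random

def slice_table(rows, attribute_groups, bucket_size):
    """Toy slicing: vertical partition into attribute groups (columns),
    horizontal partition into buckets, then a random permutation of each
    column's values within every bucket."""
    sliced = []
    for start in range(0, len(rows), bucket_size):        # horizontal partition
        bucket = rows[start:start + bucket_size]
        permuted = [dict() for _ in bucket]
        for group in attribute_groups:                    # vertical partition
            values = [tuple(r[a] for a in group) for r in bucket]
            random.shuffle(values)                         # break cross-column links
            for rec, val in zip(permuted, values):
                rec.update(dict(zip(group, val)))
        sliced.extend(permuted)
    return sliced

# Hypothetical microdata: {Age, Zipcode} kept together, Disease in its own column.
rows = [
    {"Age": 22, "Zipcode": "47906", "Disease": "flu"},
    {"Age": 22, "Zipcode": "47906", "Disease": "dyspepsia"},
    {"Age": 33, "Zipcode": "47905", "Disease": "flu"},
    {"Age": 52, "Zipcode": "47905", "Disease": "bronchitis"},
]
print(slice_table(rows, [("Age", "Zipcode"), ("Disease",)], bucket_size=2))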

2.RANKING MODEL ADAPTATION FOR DOMAIN-SPECIFIC SEARCH
With the explosive emergence of vertical search domains, applying the broad-based
ranking model directly to different domains is no longer desirable due to domain
differences, while building a unique ranking model for each domain is both laborious
for labeling data and time consuming for training models. In this paper, we address
these difficulties by proposing a regularization-based algorithm called ranking
adaptation SVM (RA-SVM), through which we can adapt an existing ranking model
to a new domain, so that the amount of labeled data and the training cost are
reduced while the performance is still guaranteed. Our algorithm only requires the
predictions from the existing ranking models, rather than their internal
representations or the data from auxiliary domains. In addition, we assume that
documents similar in the domain-specific feature space should have consistent
rankings, and add some constraints to control the margin and slack variables of
RA-SVM adaptively. Finally, a ranking adaptability measurement is proposed to
quantitatively estimate whether an existing ranking model can be adapted to a new
domain. Experiments performed over LETOR and two large-scale data sets crawled
from a commercial search engine demonstrate the applicability of the proposed
ranking adaptation algorithms and the ranking adaptability measurement.
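
A concrete way to picture such regularization-based adaptation is the following pairwise ranking-SVM objective (an illustrative formulation under assumed notation; the exact RA-SVM objective in the paper may differ in its details):

\[
\min_{f,\ \xi \ge 0}\ \frac{1-\delta}{2}\,\lVert f\rVert^{2} \;+\; \frac{\delta}{2}\,\lVert f - f^{a}\rVert^{2} \;+\; C\sum_{(i,j)}\xi_{ij}
\quad\text{s.t.}\quad f(x_i) - f(x_j) \ge 1 - \xi_{ij}\ \text{for every target-domain pair } x_i \succ x_j,
\]

where $f^{a}$ is the existing broad-based ranking model, $\delta \in [0,1]$ controls how strongly the adapted model is pulled toward it, and $C$ weights violations of the pairwise ranking constraints. When such a problem is solved in the dual, only the predictions $f^{a}(x)$ on the target-domain documents are needed, which matches the abstract's claim that the auxiliary model's internal representation is not required.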

3.Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
Preparing a data set for analysis is generally the most time consuming task in a data
mining project, requiring many complex SQL queries, joining tables, and aggregating
columns. Existing SQL aggregations have limitations when used to prepare data sets because
they return one column per aggregated group. In general, a significant manual
effort is required to build data sets, where a horizontal layout is required. We
propose simple, yet powerful, methods to generate SQL code to return aggregated
columns in a horizontal tabular layout, returning a set of numbers instead of one
number per row. This new class of functions is called horizontal aggregations.
Horizontal aggregations build data sets with a horizontal denormalized layout (e.g.,
point-dimension, observation-variable, instance-feature), which is the standard
layout required by most data mining algorithms. We propose three fundamental
methods to evaluate horizontal aggregations: CASE: Exploiting the programming
CASE construct; SPJ: Based on standard relational algebra operators (SPJ
queries); PIVOT: Using the PIVOT operator, which is offered by some DBMSs.
Experiments with large tables compare the proposed query evaluation methods. Our
CASE method has similar speed to the PIVOT operator and it is much faster than
the SPJ method. In general, the CASE and PIVOT methods exhibit linear
scalability, whereas the SPJ method does not.
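
As a concrete illustration of the CASE method, the small generator below (a sketch with hypothetical table and column names, not the paper's code generator) emits one SUM(CASE ...) expression per pivot value, so each group comes back as a single row with one aggregated column per value, i.e., the horizontal layout described above.

def horizontal_sum_sql(table, group_col, pivot_col, pivot_values, value_col):
    """Generate a CASE-based horizontal aggregation query:
    one row per group, one aggregated column per pivot value."""
    cases = ",\n       ".join(
        f"SUM(CASE WHEN {pivot_col} = '{v}' THEN {value_col} ELSE 0 END) AS {value_col}_{v}"
        for v in pivot_values
    )
    return (f"SELECT {group_col},\n       {cases}\n"
            f"FROM {table}\nGROUP BY {group_col};")

# Hypothetical transactions table: one aggregated amount column per month, one row per store.
print(horizontal_sum_sql("sales", "store_id", "month", ["jan", "feb", "mar"], "amount"))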

4.CLUSTERING WITH MULTIVIEWPOINT-BASED SIMILARITY MEASURE
All clustering methods have to assume some cluster relationship among the data
objects that they are applied on. Similarity between a pair of objects can be
defined either explicitly or implicitly. In this paper, we introduce a novel
multiviewpoint-based similarity measure and two related clustering methods. The
major difference between a traditional dissimilarity/similarity measure and ours is
that the former uses only a single viewpoint, which is the origin, while the latter
utilizes many different viewpoints, which are objects assumed to not be in the same
cluster with the two objects being measured. Using multiple viewpoints, a more
informative assessment of similarity could be achieved. Theoretical analysis and
empirical study are conducted to support this claim. Two criterion functions for
document clustering are proposed based on this new measure. We compare them
with several well-known clustering algorithms that use other popular similarity
measures on various document collections to verify the advantages of our proposal.
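
To make the idea concrete, the following sketch (assumed notation; a plain inner-product form that may differ in detail from the paper's measure) scores a document pair by averaging how similar they look from every viewpoint assumed to lie outside their cluster.

import numpy as np

def multiviewpoint_similarity(di, dj, viewpoints):
    """Average, over all viewpoints dh assumed to lie outside the cluster of
    di and dj, of the inner product of (di - dh) and (dj - dh)."""
    sims = [np.dot(di - dh, dj - dh) for dh in viewpoints]
    return float(np.mean(sims))

# Toy unit-length document vectors (e.g., tf-idf); viewpoints come from other clusters.
di = np.array([0.8, 0.6, 0.0])
dj = np.array([0.6, 0.8, 0.0])
viewpoints = [np.array([0.0, 0.0, 1.0]), np.array([0.1, 0.0, 0.995])]
print(multiviewpoint_similarity(di, dj, viewpoints))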

5.OUTSOURCED SIMILARITY SEARCH ON METRIC DATA ASSETS


This paper considers a cloud computing setting in which similarity querying of
metric data is outsourced to a service provider. The data is to be revealed only to
trusted users, not to the service provider or anyone else. Users query the server
for the most similar data objects to a query example. Outsourcing offers the data
owner scalability and a low initial investment. The need for privacy may be due to
the data being sensitive (e.g., in medicine), valuable (e.g., in astronomy), or
otherwise confidential. Given this setting, the paper presents techniques that
transform the data prior to supplying it to the service provider for similarity
queries on the transformed data. Our techniques provide interesting trade-offs
between query cost and accuracy. They are then further extended to offer an
intuitive privacy guarantee. Empirical studies with real data demonstrate that the
techniques are capable of offering privacy while enabling efficient and accurate
processing of similarity queries.

6.TSCAN: A CONTENT ANATOMY APPROACH TO TEMPORAL TOPIC SUMMARIZATION
A topic is defined as a seminal event or activity along with all directly related
events and activities. It is represented by a chronological sequence of documents
published by different authors on the Internet. In this study, we define a task
called topic anatomy, which summarizes and associates the core parts of a topic
temporally so that readers can understand the content easily. The proposed topic
anatomy model, called TSCAN, derives the major themes of a topic from the
eigenvectors of a temporal block association matrix. Then, the significant events of
the themes and their summaries are extracted by examining the constitution of the
eigenvectors. Finally, the extracted events are associated through their temporal
closeness and context similarity to form an evolution graph of the topic.
Experiments based on the official TDT4 corpus demonstrate that the generated
temporal summaries present the storylines of topics in a comprehensible form.
Moreover, in terms of content coverage, coherence, and consistency, the summaries
are superior to those derived by existing summarization methods, as judged against
human-composed reference summaries.
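
A minimal sketch of the theme-extraction step, under assumptions (a term-by-block matrix built from the documents in each time block; the event extraction and evolution graph are omitted): the leading eigenvectors of the block association matrix are taken as the topic's major themes.

import numpy as np

def extract_themes(term_block_matrix, num_themes):
    """term_block_matrix: terms x temporal blocks (e.g., tf weights per block).
    Build the block association matrix and return its leading eigenvectors,
    each interpreted as the activation of one theme across the blocks."""
    A = term_block_matrix.T @ term_block_matrix      # block-by-block association
    eigvals, eigvecs = np.linalg.eigh(A)             # symmetric, so eigh is safe
    order = np.argsort(eigvals)[::-1]                # strongest associations first
    return eigvecs[:, order[:num_themes]]            # one column per theme

# Toy example: 5 terms observed over 4 temporal blocks, 2 themes requested.
M = np.abs(np.random.default_rng(0).normal(size=(5, 4)))
print(extract_themes(M, 2).shape)   # (4, 2)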

7.A LINK-BASED CLUSTER ENSEMBLE APPROACH FOR CATEGORICAL DATA CLUSTERING
Although attempts have been made to solve the problem of clustering categorical
data via cluster ensembles, with the results being competitive to conventional
algorithms, it is observed that these techniques unfortunately generate a final data
partition based on incomplete information. The underlying ensemble-information
matrix presents only cluster-data point relations, with many entries being left
unknown. The paper presents an analysis that suggests this problem degrades the
quality of the clustering result, and it presents a new link-based approach, which
improves the conventional matrix by discovering unknown entries through similarity
between clusters in an ensemble. In particular, an efficient link-based algorithm is
proposed for the underlying similarity assessment. Afterward, to obtain the final
clustering result, a graph partitioning technique is applied to a weighted bipartite
graph that is formulated from the refined matrix. Experimental results on multiple
real data sets suggest that the proposed link-based method almost always
outperforms both conventional clustering algorithms for categorical data and
well-known cluster ensemble techniques.

8.EFFICIENT EXTENDED BOOLEAN RETRIEVAL 2012


Extended Boolean retrieval (EBR) models were proposed nearly three decades ago,
but have had little practical impact, despite their significant advantages compared
to either ranked keyword or pure Boolean retrieval. In particular, EBR models
produce meaningful rankings; their query model allows the representation of
complex concepts in an and-or format; and they are scrutable, in that the score
assigned to a document depends solely on the content of that document, unaffected
by any collection statistics or other external factors. These characteristics make
EBR models attractive in domains typified by medical and legal searching, where the
emphasis is on iterative development of reproducible complex queries of dozens or
even hundreds of terms. However, EBR is much more computationally expensive
than the alternatives. We consider the implementation of the p-norm approach to
EBR, and demonstrate that ideas used in the max-score and wand exact
optimization techniques for ranked keyword retrieval can be adapted to allow
selective bypass of documents via a low-cost screening process for this and similar
retrieval models. We also propose term-independent bounds that are able to
further reduce the number of score calculations for short, simple queries under the
extended Boolean retrieval model. Together, these methods yield an overall saving
from 50 to 80 percent of the evaluation cost on test queries drawn from biomedical
search.
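
For reference, the unweighted p-norm scoring functions at the heart of this model are easy to state; the sketch below (a simplified form without term weights) evaluates an and-or query bottom-up over per-term document scores in [0, 1].

def p_norm_or(scores, p):
    """Extended Boolean OR: high if any operand score is high."""
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def p_norm_and(scores, p):
    """Extended Boolean AND: penalizes operands whose scores are far from 1."""
    return 1.0 - (sum((1.0 - s) ** p for s in scores) / len(scores)) ** (1.0 / p)

# Example: (t1 OR t2) AND t3 with p = 2 for a document with the given term scores.
term_scores = {"t1": 0.9, "t2": 0.1, "t3": 0.7}
inner = p_norm_or([term_scores["t1"], term_scores["t2"]], p=2)
print(p_norm_and([inner, term_scores["t3"]], p=2))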

9.EFFECTIVE PATTERN DISCOVERY FOR TEXT MINING


Many data mining techniques have been proposed for mining useful patterns in text
documents. However, how to effectively use and update discovered patterns is still
an open research issue, especially in the domain of text mining. Since most existing
text mining methods adopted term-based approaches, they all suffer from the
problems of polysemy and synonymy. Over the years, people have often held the
hypothesis that pattern (or phrase)-based approaches should perform better than
the term-based ones, but many experiments do not support this hypothesis. This
paper presents an innovative and effective pattern discovery technique which
includes the processes of pattern deploying and pattern evolving, to improve the
effectiveness of using and updating discovered patterns for finding relevant and
interesting information. Substantial experiments on RCV1 data collection and TREC
topics demonstrate that the proposed solution achieves encouraging performance.


10.INCREMENTAL INFORMATION EXTRACTION USING RELATIONAL DATABASES
Information extraction systems are traditionally implemented as a pipeline of
special-purpose processing modules targeting the extraction of a particular kind of
information. A major drawback of such an approach is that whenever a new
extraction goal emerges or a module is improved, extraction has to be reapplied
from scratch to the entire text corpus even though only a small part of the corpus
might be affected. In this paper, we describe a novel approach for information
extraction in which extraction needs are expressed in the form of database
queries, which are evaluated and optimized by database systems. Using database
queries for information extraction enables generic extraction and minimizes
reprocessing of data by performing incremental extraction to identify which part
of the data is affected by the change of components or goals. Furthermore, our
approach provides automated query generation components so that casual users do
not have to learn the query language in order to perform extraction. To
demonstrate the feasibility of our incremental extraction approach, we performed
experiments to highlight two important aspects of an information extraction
system: efficiency and quality of extraction results. Our experiments show that in
the event of deployment of a new module, our incremental extraction approach
reduces the processing time by 89.64 percent as compared to a traditional pipeline
approach. By applying our methods to a corpus of 17 million biomedical abstracts,
our experiments show that the query performance is efficient for real-time
applications. Our experiments also revealed that our approach achieves high quality
extraction results.

11.A FRAMEWORK FOR LEARNING COMPREHENSIBLE THEORIES IN XML DOCUMENT CLASSIFICATION
XML has become the universal data format for a wide variety of information
systems. The large number of XML documents existing on the web and in other
information storage systems makes classification an important task. As a typical
type of semistructured data, XML documents have both structures and contents.
Traditional text learning techniques are not very suitable for XML document
classification as structures are not considered. This paper presents a novel
complete framework for XML document classification. We first present a
knowledge representation method for XML documents which is based on a typed
higher order logic formalism. With this representation method, an XML document is
represented as a higher order logic term where both its contents and structures
are captured. We then present a decision-tree learning algorithm driven by
precision/recall breakeven point (PRDT) for the XML classification problem which
can produce comprehensible theories. Finally, a semi-supervised learning algorithm
is given which is based on the PRDT algorithm and the cotraining framework.
Experimental results demonstrate that our framework is able to achieve good
performance in both supervised and semi-supervised learning with the bonus of
producing comprehensible learning theories.

12.EVALUATING PATH QUERIES OVER FREQUENTLY UPDATED ROUTE COLLECTIONS
The recent advances in the infrastructure of Geographic Information Systems
(GIS), and the proliferation of GPS technology, have resulted in the abundance of
geodata in the form of sequences of points of interest (POIs), waypoints, etc. We
refer to sets of such sequences as route collections. In this work, we consider path
queries on frequently updated route collections: given a route collection and two
points ns and nt, a path query returns a path, i.e., a sequence of points, that
connects ns to nt. We introduce two path query evaluation paradigms that enjoy the
benefits of search algorithms (i.e., fast index maintenance) while utilizing
transitivity information to terminate the search sooner. Efficient indexing schemes
and appropriate updating procedures are introduced. An extensive experimental
evaluation verifies the advantages of our methods compared to conventional
graph-based search.

13.DATA MINING FOR XML QUERY-ANSWERING SUPPORT 2012


Extracting information from semistructured documents is a very hard task, and is
going to become more and more critical as the amount of digital information
available on the Internet grows. Indeed, documents are often so large that the
data set returned as answer to a query may be too big to convey interpretable
knowledge. In this paper, we describe an approach based on Tree-Based Association
Rules (TARs): mined rules, which provide approximate, intensional information on
both the structure and the contents of Extensible Markup Language (XML)
documents, and can be stored in XML format as well. This mined knowledge is later
used to provide: 1) a concise idea (the gist) of both the structure and the content of
the XML document and 2) quick, approximate answers to queries. In this paper, we
focus on the second feature. A prototype system and experimental results
demonstrate the effectiveness of the approach.

14.A MODEL OF DATA WAREHOUSING PROCESS MATURITY 2012


Even though data warehousing (DW) requires huge investments, the data warehouse
market is experiencing incredible growth. However, a large number of DW
initiatives end up as failures. In this paper, we argue that the maturity of a data
warehousing process (DWP) could significantly mitigate such large-scale failures
and ensure the delivery of consistent, high-quality, single-version-of-truth data in
a timely manner. However, unlike software development, the assessment of DWP
maturity has not yet been tackled in a systematic way. In light of the critical
importance of data as a corporate resource, we believe that the need for a
maturity model for DWP could not be greater. In this paper, we describe the design
and development of a five-level DWP maturity model (DWP-M) over a period of
three years. A unique aspect of this model is that it covers processes in both data
warehouse development and operations. Over 20 key DW executives from 13
different corporations were involved in the model development process. The final
model was evaluated by a panel of experts; the results strongly validate the
functionality, productivity, and usability of the model. We present the initial and
final DWP-M model versions, along with illustrations of several key process areas at
different levels of maturity.

15.MODEL-BASED METHOD FOR PROJECTIVE CLUSTERING 2012


Clustering high-dimensional data is a major challenge due to the curse of
dimensionality. To solve this problem, projective clustering has been defined as an
extension to traditional clustering that attempts to find projected clusters in
subsets of the dimensions of a data space. In this paper, a probability model is first
proposed to describe projected clusters in high-dimensional data space. Then, we
present a model-based algorithm for fuzzy projective clustering that discovers
clusters with overlapping boundaries in various projected subspaces. The suitability
of the proposal is demonstrated in an empirical study conducted with synthetic data
sets and some widely used real-world data sets.


16.EXTRACTING REPRESENTATIVE INFORMATION TO ENHANCE FLEXIBLE DATA QUERIES
Extracting representative information is of great interest in data queries and web
applications nowadays, where approximate match between attribute values/records
is an important issue in the extraction process. This paper proposes an approach for
extracting representative tuples from data classes under an extended possibility-based
data model, and introduces a measure (namely, relation compactness)
based upon information entropy to reflect the degree that a relation is compact in
light of information redundancy. Theoretical analysis and data experiments show
that the approach has desirable properties that: 1) the set of representative tuples
has high degrees of compactness (less redundancy) and coverage (rich content); 2)
it provides a way to obtain data query outcomes of different sizes in a flexible
manner according to user preference; and 3) the approach is also meaningful and
applicable to web search applications.

17.A FUZZY APPROACH FOR MULTITYPE RELATIONAL DATA CLUSTERING
Mining interrelated data among multiple types of objects or entities is important in
many real-world applications. Despite extensive study on fuzzy clustering of vector
space data, very limited exploration has been made on fuzzy clustering of relational
data that involve several object types. In this paper, we propose a new fuzzy
clustering approach for multitype relational data (FC-MR). In FC-MR, different
types of objects are clustered simultaneously. An object is assigned a large
membership with respect to a cluster if its related objects in this cluster have high
rankings. In each cluster, an object tends to have a high ranking if its related
objects have large memberships in this cluster. The FC-MR approach is formulated
to deal with multitype relational data with various structures. The objective
function of FC-MR is locally optimized by an efficient iterative algorithm, which
updates the fuzzy membership matrix and the ranking matrix of one type at a time
while keeping those of other types constant. We also discuss the simplified FC-MR
for multitype relational data with two special structures, namely, star-structure
and extended star-structure. Experimental studies are conducted on benchmark
document datasets to illustrate how the proposed approach can be applied flexibly
under different scenarios in real-world applications. The experimental results
demonstrate the feasibility and effectiveness of the new approach compared with
existing ones.

18.EFFICIENT COMPUTATION OF RANGE AGGREGATES AGAINST UNCERTAIN LOCATION-BASED QUERIES
In many applications, including location-based services, queries may not be precise.
In this paper, we study the problem of efficiently computing range aggregates in a
multidimensional space when the query location is uncertain. Specifically, for a
query point Q whose location is uncertain and a set S of points in a multidimensional
space, we want to calculate the aggregate (e.g., count, average, and sum) over the
subset S' of S such that, for each p ∈ S', Q lies within a given distance of p with at
least a given probability. We propose novel, efficient techniques to solve the problem
following the filtering-and-verification paradigm. In particular, two novel filtering
techniques are proposed to effectively and efficiently remove data points from
verification. Our comprehensive experiments based on both real and synthetic data
demonstrate the efficiency and scalability of our techniques.
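
As a back-of-the-envelope picture of the verification step (a naive Monte Carlo sketch, not the paper's filtering techniques; the probability threshold, distance, and query samples are assumed inputs), a data point contributes to the count only if the sampled query locations fall within the given distance of it sufficiently often.

import numpy as np

def probabilistic_range_count(query_samples, points, distance, threshold):
    """query_samples: possible locations of the uncertain query point Q.
    A data point p is counted if Pr(dist(Q, p) <= distance) >= threshold,
    estimated by the fraction of samples that lie within the distance."""
    count = 0
    for p in points:
        d = np.linalg.norm(query_samples - p, axis=1)
        if np.mean(d <= distance) >= threshold:
            count += 1
    return count

# Toy 2D example: 1,000 samples of Q around (0, 0), three candidate points.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(1000, 2))
pts = np.array([[0.2, 0.1], [1.5, 1.5], [0.0, 0.6]])
print(probabilistic_range_count(samples, pts, distance=1.0, threshold=0.9))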

19.EFFICIENT FUZZY TYPE-AHEAD SEARCH IN XML DATA 2012


In a traditional keyword-search system over XML data, a user composes a keyword
query, submits it to the system, and retrieves relevant answers. In the case where
the user has limited knowledge about the data, often the user feels left in the
dark when issuing queries, and has to use a try-and-see approach for finding
information. In this paper, we study fuzzy type-ahead search in XML data, a new
information-access paradigm in which the system searches XML data on the fly as
the user types in query keywords. It allows users to explore data as they type, even
in the presence of minor errors in their keywords. Our proposed method has the
following features: 1) Search as you type: It extends Autocomplete by supporting
queries with multiple keywords in XML data. 2) Fuzzy: It can find high-quality
answers that have keywords matching query keywords approximately. 3) Efficient:
Our effective index structures and searching algorithms can achieve a very high
interactive speed. We study research challenges in this new search framework. We
propose effective index structures and top-k algorithms to achieve a high
interactive speed. We examine effective ranking functions and early termination
techniques to progressively identify the top-k relevant answers. We have
implemented our method on real data sets, and the experimental results show that
our method achieves high search efficiency and result quality.
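
The fuzzy part of type-ahead search can be pictured with a small edit-distance check on prefixes (a simplified sketch; the paper's index structures and top-k algorithms avoid this brute-force scan): as the user types, every indexed word whose prefix lies within a small edit distance of the typed keyword is treated as a match.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_prefix_matches(typed, vocabulary, tau=1):
    """Words whose same-length prefix is within edit distance tau of the typed keyword."""
    return [w for w in vocabulary if edit_distance(typed, w[: len(typed)]) <= tau]

print(fuzzy_prefix_matches("datab", ["database", "databse", "datamining", "xml"]))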

20.FOOTPRINT: DETECTING SYBIL ATTACKS IN URBAN VEHICULAR NETWORKS
In urban vehicular networks, where privacy, especially the location privacy of
anonymous vehicles, is of great concern, anonymous verification of vehicles is
indispensable. Consequently, an attacker who succeeds in forging multiple hostile
identities can easily launch a Sybil attack, gaining a disproportionately large
influence. In this paper, we propose a novel Sybil attack detection mechanism,
Footprint, using the trajectories of vehicles for identification while still preserving
their location privacy. More specifically, when a vehicle approaches a road-side unit
(RSU), it actively demands an authorized message from the RSU as the proof of
the appearance time at this RSU. We design a location-hidden authorized message
generation scheme for two objectives: first, RSU signatures on messages are signer
ambiguous so that the RSU location information is concealed from the resulting
authorized message; second, two authorized messages signed by the same RSU
within the same given period of time (temporarily linkable) are recognizable so that
they can be used for identification. With the temporal limitation on the linkability
of two authorized messages, authorized messages used for long-term identification
are prohibited. With this scheme, vehicles can generate a location-hidden
trajectory for location-privacy-preserved identification by collecting a consecutive
series of authorized messages. Utilizing social relationships among trajectories
according to the similarity definition of two trajectories, Footprint can recognize
and therefore dismiss communities of Sybil trajectories. Rigorous security
analysis and extensive trace-driven simulations demonstrate the efficacy of
Footprint.

21.MINING WEB GRAPHS FOR RECOMMENDATIONS 2012


With the exponential explosion of content generated on the Web, recommendation
techniques have become increasingly indispensable. Innumerable
different kinds of recommendations are made on the Web every day, including
movie, music, image, and book recommendations, query suggestions, tag
recommendations, etc. No matter what types of data sources are used for the
recommendations, essentially these data sources can be modeled in the form of
various types of graphs. In this paper, aiming at providing a general framework on
mining Web graphs for recommendations, (1) we first propose a novel diffusion
method which propagates similarities between different nodes and generates
recommendations; (2) then we illustrate how to generalize different
recommendation problems into our graph diffusion framework. The proposed
framework can be utilized in many recommendation tasks on the World Wide Web,
including query suggestions, tag recommendations, expert finding, image
recommendations, image annotations, etc. The experimental analysis on large data
sets shows the promising future of our work.
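
A generic similarity-diffusion step on such a graph can be sketched as follows (an illustrative random-walk-style propagation under assumed normalization and damping, not necessarily the paper's exact diffusion process): relevance mass starts at the query node, repeatedly spreads to neighbors, and the highest-scoring other nodes become recommendations.

import numpy as np

def diffuse(adjacency, seed_index, alpha=0.85, iters=50):
    """Propagate relevance from a seed node over a graph.
    adjacency: symmetric numpy array; alpha: fraction of mass that keeps diffusing."""
    n = adjacency.shape[0]
    col_sums = adjacency.sum(axis=0)
    P = adjacency / np.where(col_sums == 0, 1, col_sums)   # column-normalized transitions
    s0 = np.zeros(n)
    s0[seed_index] = 1.0
    s = s0.copy()
    for _ in range(iters):
        s = alpha * P @ s + (1 - alpha) * s0                # diffuse, then restart at seed
    return s

# Toy query-suggestion graph: node 0 is the issued query; 1-3 are related queries/tags.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
scores = diffuse(A, seed_index=0)
print(np.argsort(scores)[::-1])   # ranking of nodes by diffused relevance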

22.MULTIPLE EXPOSURE FUSION FOR HIGH DYNAMIC RANGE IMAGE ACQUISITION
A multiple-exposure fusion method to enhance the dynamic range of an image is proposed.
The construction of high dynamic range images (HDRIs) is performed by combining
multiple images taken with different exposures and estimating the irradiance value
for each pixel. This is a common process for HDRI acquisition. During this process,
displacements of the images caused by object movements often yield motion blur
and ghosting artifacts. To address the problem, this paper presents an efficient
and accurate multiple exposure fusion technique for the HDRI acquisition. Our
method simultaneously estimates displacements and occlusion and saturation
regions by using maximum a posteriori estimation and constructs motion-blur-free
HDRIs. We also propose a new weighting scheme for the multiple image fusion. We
demonstrate that our HDRI acquisition algorithm is accurate, even for images with
large motion.
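
The irradiance-estimation step common to HDRI acquisition can be summarized in a few lines (a standard weighted-fusion sketch assuming a linear camera response and aligned inputs; the paper's joint estimation of displacements, occlusion, and saturation via MAP is not shown).

import numpy as np

def fuse_exposures(images, exposure_times):
    """Estimate per-pixel irradiance from aligned images taken at different exposures.
    images: list of float arrays in [0, 1]; exposure_times: matching shutter times.
    A hat-shaped weight downplays under- and over-exposed pixels."""
    num = np.zeros_like(images[0])
    den = np.zeros_like(images[0])
    for img, t in zip(images, exposure_times):
        w = 1.0 - np.abs(2.0 * img - 1.0)      # highest weight at mid-gray
        num += w * (img / t)                    # each exposure's irradiance estimate
        den += w
    return num / np.maximum(den, 1e-8)

# Toy example: three 2x2 exposures of the same (static) scene.
imgs = [np.clip(np.array([[0.02, 0.2], [0.5, 0.9]]) * s, 0, 1) for s in (0.5, 1.0, 2.0)]
print(fuse_exposures(imgs, [0.01, 0.02, 0.04]))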

23.CLUSTERING WITH MULTIVIEWPOINT-BASED SIMILARITY MEASURE
All clustering methods have to assume some cluster relationship among the data
objects that they are applied on. Similarity between a pair of objects can be
defined either explicitly or implicitly. In this paper, we introduce a novel
multiviewpoint-based similarity measure and two related clustering methods. The
major difference between a traditional dissimilarity/similarity measure and ours is
that the former uses only a single viewpoint, which is the origin, while the latter
utilizes many different viewpoints, which are objects assumed to not be in the same
cluster with the two objects being measured. Using multiple viewpoints, a more
informative assessment of similarity could be achieved. Theoretical analysis and
empirical study are conducted to support this claim. Two criterion functions for
document clustering are proposed based on this new measure. We compare them
with several well-known clustering algorithms that use other popular similarity
measures on various document collections to verify the advantages of our proposal.

24.IEEE 2012: QUERY PLANNING FOR CONTINUOUS AGGREGATION QUERIES OVER A NETWORK OF DATA AGGREGATORS
Continuous queries are used to monitor changes to time varying data and to provide
results useful for online decision making. Typically a user desires to obtain the value
of some aggregation function over distributed data items, for example, to know the
value of a portfolio for a client, or the average of temperatures sensed by a set of
sensors. In these queries, a client specifies a coherency requirement as part of the
query. We present a low-cost, scalable technique to answer continuous aggregation
queries using a network of aggregators of dynamic data items. In such a network of
data aggregators, each data aggregator serves a set of data items at specific
coherencies. Just as various fragments of a dynamic webpage are served by one or
more nodes of a content distribution network, our technique involves decomposing a
client query into subqueries and executing subqueries on judiciously chosen data
aggregators with their individual subquery incoherency bounds. We provide a
technique for obtaining the optimal set of subqueries with their incoherency bounds
that satisfies the client query's coherency requirement with the least number of refresh
messages sent from aggregators to the client. For estimating the number of
refresh messages, we build a query cost model which can be used to estimate the
number of messages required to satisfy the client specified incoherency bound.
Performance results using real-world traces show that our cost-based query
planning leads to queries being executed using less than one third the number of
messages required by existing schemes.

25.A FRAMEWORK FOR PERSONAL MOBILE COMMERCE PATTERN MINING AND PREDICTION
Due to a wide range of potential applications, research on mobile commerce has
received a lot of interest from both industry and academia. Among these, one of the
active topic areas is the mining and prediction of users' mobile
commerce behaviors such as their movements and purchase transactions. In this
paper, we propose a novel framework, called Mobile Commerce Explorer (MCE), for
mining and prediction of mobile users' movements and purchase transactions under
the context of mobile commerce. The MCE framework consists of three major
components: 1) Similarity Inference Model (SIM) for measuring the similarities
among stores and items, which are two basic mobile commerce entities considered
in this paper; 2) Personal Mobile Commerce Pattern Mine (PMCP-Mine) algorithm for
efficient discovery of mobile users' Personal Mobile Commerce Patterns (PMCPs);
and 3) Mobile Commerce Behavior Predictor (MCBP) for prediction of possible
mobile user behaviors. To the best of our knowledge, this is the first work that facilitates
mining and prediction of mobile users' commerce behaviors in order to recommend
stores and items previously unknown to a user. We perform an extensive
experimental evaluation by simulation and show that our proposals produce
excellent results.

26.PRIVACY PRESERVING DECISION TREE LEARNING USING UNREALIZED DATA SETS

Privacy preservation is important for machine learning and data mining, but
measures designed to protect private information often result in a trade-off:
reduced utility of the training samples. This paper introduces a privacy preserving
approach that can be applied to decision tree learning, without concomitant loss of
accuracy. It describes an approach to the preservation of the privacy of collected
data samples in cases where information from the sample database has been
partially lost. This approach converts the original sample data sets into a group of
unreal data sets, from which the original samples cannot be reconstructed without
the entire group of unreal data sets. Meanwhile, an accurate decision tree can be
built directly from those unreal data sets. This novel approach can be applied
directly to the data storage as soon as the first sample is collected. The approach
is compatible with other privacy preserving approaches, such as cryptography, for
extra protection.

27.REVISITING DEFENSES AGAINST LARGE-SCALE ONLINE PASSWORD GUESSING ATTACKS
Brute force and dictionary attacks on password-only remote login services are now
widespread and ever increasing. Enabling convenient login for legitimate users while
preventing such attacks is a difficult problem. Automated Turing Tests (ATTs)
continue to be an effective, easy-to-deploy approach to identify automated
malicious login attempts with reasonable cost of inconvenience to users. In this
paper, we discuss the inadequacy of existing and proposed login protocols designed
to address large-scale online dictionary attacks (e.g., from a botnet of hundreds of
thousands of nodes). We propose a new Password Guessing Resistant Protocol
(PGRP), derived upon revisiting prior proposals designed to restrict such attacks.
While PGRP limits the total number of login attempts from unknown remote hosts
to as low as a single attempt per username, legitimate users in most cases (e.g.,
when attempts are made from known, frequently-used machines) can make several
failed login attempts before being challenged with an ATT. We analyze the
performance of PGRP with two real-world data sets and find it more promising than
existing proposals.

28.COMBINING TAG AND VALUE SIMILARITY FOR DATA EXTRACTION AND ALIGNMENT
Web databases generate query result pages based on a user's query. Automatically
extracting the data from these query result pages is very important for many
applications, such as data integration, which need to cooperate with multiple web
databases. We present a novel data extraction and alignment method called CTVS
that combines both tag and value similarity. CTVS automatically extracts data from
query result pages by first identifying and segmenting the query result records
(QRRs) in the query result pages and then aligning the segmented QRRs into a
table, in which the data values from the same attribute are put into the same
column. Specifically, we propose new techniques to handle the case when the QRRs
are not contiguous, which may be due to the presence of auxiliary information, such
as a comment, recommendation or advertisement, and for handling any nested
structure that may exist in the QRRs. We also design a new record alignment
algorithm that aligns the attributes in a record, first pairwise and then holistically,
by combining the tag and data value similarity information. Experimental results
show that CTVS achieves high precision and outperforms existing state-of-the-art
data extraction methods.

29.PROTECTING SENSITIVE LABELS IN SOCIAL NETWORK DATA ANONYMIZATION
Privacy is one of the major concerns when publishing or sharing social network data
for social science research and business analysis. Recently, researchers have
developed privacy models similar to k-anonymity to prevent node re-identification
through structure information. However, even when these privacy
models are enforced, an attacker may still be able to infer one's private
information if a group of nodes largely share the same sensitive labels (i.e.,
attributes). In other words, the label-node relationship is not well protected by
pure structure anonymization methods. Furthermore, existing approaches,
which rely on edge editing or node clustering and merging, may significantly
alter key graph properties. In this paper, we define a k-degree-l-diversity
anonymity model that considers the protection of structural
information as well as sensitive labels of individuals. We further propose a
novel anonymization methodology based on adding noise nodes. We develop
several algorithms to add noise nodes into the original graph with the
consideration of introducing the least distortion to graph properties. Most
importantly, we provide a rigorous analysis of the theoretical upper bound on
the number of noise nodes added and their impacts on important graph
properties. We conduct extensive experiments to evaluate the effectiveness of
the proposed technique.

30.DDD: A New Ensemble Approach for Dealing with Concept Drift


Online learning algorithms often have to operate in the presence of concept drifts.
A recent study revealed that different diversity levels in an ensemble of learning
machines are required in order to maintain high generalization on both old and new
concepts. Inspired by this study and based on a further study of diversity with
different strategies to deal with drifts, we propose a new online ensemble learning
approach called Diversity for Dealing with Drifts (DDD). DDD maintains ensembles
with different diversity levels and is able to attain better accuracy than other
approaches. Furthermore, it is very robust, outperforming other drift handling
approaches in terms of accuracy when there are false positive drift detections. In
all the experimental comparisons we have carried out, DDD always performed at
least as well as other drift handling approaches under various conditions, with very
few exceptions.

31.Multiparty Access Control for Online Social Networks: Model and Mechanisms

Online social networks (OSNs) have experienced tremendous growth
in recent years and become a de facto portal for hundreds of millions
of Internet users. These OSNs offer attractive means for digital
social interactions and information sharing, but also raise a number of
security and privacy issues. While OSNs allow users to restrict access
to shared data, they currently do not provide any mechanism to
enforce privacy concerns over data associated with multiple users. To
this end, we propose an approach to enable the protection of shared
data associated with multiple users in OSNs. We formulate an access
control model to capture the essence of multiparty authorization
requirements, along with a multiparty policy specification scheme and
a policy enforcement mechanism. In addition, we present a logical
representation of our access control model which allows us to leverage
the features of existing logic solvers to perform various analysis
tasks on our model. We also discuss a proof-of-concept prototype of
our approach as part of an application in Facebook and provide a system
evaluation and usability study of our method.
32.Publishing Search Logs: A Comparative Study of Privacy Guarantees
Search engine companies collect the database of intentions, the histories of their
users' search queries. These search logs are a gold mine for researchers. Search
engine companies, however, are wary of publishing search logs in order not to
disclose sensitive information. In this paper, we analyze algorithms for publishing
frequent keywords, queries, and clicks of a search log. We first show how methods
that achieve variants of k-anonymity are vulnerable to active attacks. We then
demonstrate that the stronger guarantee ensured by ε-differential privacy
unfortunately does not provide any utility for this problem. We then propose an
algorithm ZEALOUS and show how to set its parameters to achieve (ε, δ)-probabilistic
privacy. We also contrast our analysis of ZEALOUS with an analysis by
Korolova et al. [17] that achieves (ε′, δ′)-indistinguishability. Our paper concludes
with a large experimental study using real applications where we compare ZEALOUS
and previous work that achieves k-anonymity in search log publishing. Our results
show that ZEALOUS yields comparable utility to k-anonymity while at the same
time achieving much stronger privacy guarantees.
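
A simplified sketch of such a two-threshold publication mechanism follows (thresholds, noise scale, and the toy counts are illustrative, and the sketch omits the per-user contribution limits the privacy analysis relies on): rare items are dropped outright, Laplace noise is added to the surviving counts, and only items whose noisy count clears a second threshold are published.

import numpy as np

def publish_frequent_items(counts, tau1, tau2, noise_scale, seed=0):
    """Two-threshold publication in the spirit of ZEALOUS (simplified sketch):
    drop items below a first threshold, add Laplace noise to the surviving
    counts, and publish only items whose noisy count clears a second threshold."""
    rng = np.random.default_rng(seed)
    published = {}
    for item, count in counts.items():
        if count < tau1:                       # first threshold on the true count
            continue
        noisy = count + rng.laplace(0.0, noise_scale)
        if noisy >= tau2:                      # second threshold on the noisy count
            published[item] = noisy
    return published

# Hypothetical keyword counts from a search log (thresholds and noise scale are illustrative).
counts = {"weather": 5400, "flu symptoms": 230, "rare disease xyz": 3}
print(publish_frequent_items(counts, tau1=50, tau2=120, noise_scale=20.0))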

33.Scalable Learning of Collective Behavior

The study of collective behavior aims to understand how individuals behave in a social
networking environment. Oceans of data generated by social media like Facebook,
Twitter, Flickr, and YouTube present opportunities and challenges to study
collective behavior on a large scale. In this work, we aim to learn to predict
collective behavior in social media. In particular, given information about some
individuals, how can we infer the behavior of unobserved individuals in the same
network? A social-dimension-based approach has been shown effective in
addressing the heterogeneity of connections presented in social media. However,
the networks in social media are normally of colossal size, involving hundreds of
thousands of actors. The scale of these networks entails scalable learning of
models for collective behavior prediction. To address the scalability issue, we
propose an edge-centric clustering scheme to extract sparse social dimensions.
With sparse social dimensions, the proposed approach can efficiently handle
networks of millions of actors while demonstrating a comparable prediction
performance to other non-scalable methods.
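
A toy sketch of the edge-centric idea (assumed representation and a bare-bones k-means; the paper's method scales to millions of actors and feeds the resulting features to a discriminative classifier, which is omitted here): edges are clustered by their node-incidence vectors, and each node receives one sparse feature per edge cluster.

import numpy as np

def sparse_social_dimensions(edges, num_nodes, k, iters=10, seed=0):
    """Edge-centric clustering sketch: represent each edge by its node-incidence
    vector, cluster edges with a few k-means iterations, then give every node one
    feature per cluster counting its incident edges in that cluster."""
    rng = np.random.default_rng(seed)
    X = np.zeros((len(edges), num_nodes))
    for e, (u, v) in enumerate(edges):
        X[e, u] = X[e, v] = 1.0                       # incidence representation of an edge
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    dims = np.zeros((num_nodes, k))
    for (u, v), c in zip(edges, labels):
        dims[u, c] += 1                               # a node inherits its edges' clusters
        dims[v, c] += 1
    return dims                                       # rows are sparse social-dimension features

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(sparse_social_dimensions(edges, num_nodes=6, k=2))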

34.Slicing: A New Approach for Privacy Preserving Data Publishing


Several anonymization techniques, such as generalization and bucketization, have
been designed for privacy preserving microdata publishing. Recent work has shown
that generalization loses a considerable amount of information, especially for
high-dimensional data. Bucketization, on the other hand, does not prevent membership
disclosure and does not apply for data that do not have a clear separation between
quasi-identifying attributes and sensitive attributes. In this paper, we present a
novel technique called slicing, which partitions the data both horizontally and
vertically. We show that slicing preserves better data utility than generalization
and can be used for membership disclosure protection. Another important
advantage of slicing is that it can handle high-dimensional data. We show how slicing
can be used for attribute disclosure protection and develop an efficient algorithm
for computing the sliced data that obey the ℓ-diversity requirement. Our workload
experiments confirm that slicing preserves better utility than generalization and is
more effective than bucketization in workloads involving the sensitive attribute.
Our experiments also demonstrate that slicing can be used to prevent membership
disclosure.

35.Pointcut Rejuvenation: Recovering Pointcut Expressions in Evolving Aspect-Oriented Software
Pointcut fragility is a well-documented problem in Aspect-Oriented Programming;
changes to the base code can lead to join points incorrectly falling in or out of the
scope of pointcuts. In this paper, we present an automated approach that limits
fragility problems by providing mechanical assistance in pointcut maintenance. The
approach is based on harnessing arbitrarily deep structural commonalities between
program elements corresponding to join points selected by a pointcut. The
extracted patterns are then applied to later versions to offer suggestions of new
join points that may require inclusion. To illustrate that the motivation behind our
proposal is well founded, we first empirically establish that join points captured by
a single pointcut typically portray a significant amount of unique structural
commonality by analyzing patterns extracted from 23 AspectJ programs. Then, we
demonstrate the usefulness of our technique by rejuvenating pointcuts in multiple
versions of three of these programs. The results show that our parameterized
heuristic algorithm was able to accurately and automatically infer the majority of
new join points in subsequent software versions that were not captured by the
original pointcuts.
