Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
13 Years of Experience | Automated Services | 24/7 Help Desk Support | Experienced & Expert Developers | Advanced Technologies & Tools | Legitimate Member of all Journals | 1,50,000 Successful Records in all Languages | More than 12 Branches in Tamilnadu, Kerala & Karnataka | Ticketing & Appointment Systems | Individual Care for every Student | Around 250 Developers & 20 Researchers
227-230 Church Road, Anna Nagar, Madurai 625020. 0452-4390702, 4392702, + 91-9944793398. info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam, Chennai - 600034. 044-42072702, +91-9600354638, chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy 620001. 0431-4002234, + 91-9790464324. trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore 641002 0422- 4377758, +91-9677751577. coimbatore@elysiumtechnologies.com
1st Floor, A.R. IT Park, Rasi Color Scan Building, Ramanathapuram - 623501. 04567-223225, +91-9677704922. ramnad@elysiumtechnologies.com
Plot No: 4, C Colony, P&T Extension, Perumal puram, Tirunelveli - 627007. 0462-2532104, +91-9677733255. tirunelveli@elysiumtechnologies.com
74, 2nd Floor, K.V.K Complex, Upstairs Krishna Sweets, Mettur Road, Opp. Bus Stand, Erode - 638011. 0424-4030055, +91-9677748477. erode@elysiumtechnologies.com
No: 88, First Floor, S.V. Patel Salai, Pondicherry - 605001. 0413-4200640, +91-9677704822. pondy@elysiumtechnologies.com
TNHB A-Block, D.No. 10, Opp: Hotel Ganesh, Near Bus Stand, Salem - 636007. 0427-4042220, +91-9894444716. salem@elysiumtechnologies.com
ETPL DM-016 A Generalized Flow-Based Method for Analysis of Implicit Relationships on Wikipedia
Abstract: We focus on measuring relationships between pairs of objects in Wikipedia, whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist in Wikipedia: an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and cocitation. We propose a new method using a generalized maximum flow which reflects all three factors and does not underestimate objects having high degree. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain how mining elucidatory objects opens a novel way to deeply understand a relationship.
ETPL DM-017 A Proxy-Based Approach to Continuous Location-Based Spatial Queries in Mobile Environments
Abstract: Caching valid regions of spatial queries at mobile clients is effective in reducing the number of queries submitted by mobile clients and the query load on the server. However, mobile clients suffer from longer waiting times while the server computes valid regions. We propose in this paper a proxy-based approach to continuous nearest-neighbor (NN) and window queries. The proxy creates estimated valid regions (EVRs) for mobile clients by exploiting the spatial and temporal locality of spatial queries. For NN queries, we devise two new algorithms to accelerate EVR growth, leading the proxy to build effective EVRs even when the cache size is small. On the other hand, we propose to represent the EVRs of window queries in the form of vectors, called estimated window vectors (EWVs), to achieve larger estimated valid regions. This novel representation and the associated
to control the messages posted on their own private space and to prevent unwanted content from being displayed. Up to now, OSNs have provided little support for this requirement. To fill the gap, in this paper we propose a system allowing OSN users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system that allows users to customize the filtering criteria to be applied to their walls, and a machine-learning-based soft classifier that automatically labels messages in support of content-based filtering.
ETPL DM-020
Abstract: In this paper, we propose a new type of Dictionary-based Entity Recognition Problem,
named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but its many redundancies make the AME process inefficient and deteriorate the performance of real-world applications using the extracted substrings. The AML problem targets locating non-overlapped substrings, which are a better approximation of the true matched substrings, without generating overlapped redundancies. In order to perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune.
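The non-overlap requirement in AML can be illustrated with a naive greedy baseline: scan left to right, always take the longest dictionary entry matching at the current position, and skip past it. This sketch uses exact matching only; the paper's contribution, approximate matching with P-Prune-style candidate pruning, is not reproduced here, and the function name is illustrative.

```python
def locate_members(text, dictionary):
    """Greedy left-to-right localization of non-overlapping dictionary
    substrings (exact matching only; a toy baseline, not P-Prune)."""
    # Prefer longer entries at each position to avoid fragmenting matches.
    entries = sorted(dictionary, key=len, reverse=True)
    hits, i = [], 0
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                hits.append((i, entry))
                i += len(entry)          # skip past the match: no overlaps
                break
        else:
            i += 1                       # no entry starts here; advance
    return hits

print(locate_members("data mining and text mining",
                     {"data mining", "mining", "text"}))
# → [(0, 'data mining'), (16, 'text'), (21, 'mining')]
```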
We study the problem of privacy-preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm which is based on sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction.
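A minimal sketch of the clustering idea behind such anonymization, under a much cruder grouping rule than the sequential-clustering (Sq) algorithm above (grouping by ascending degree is an invented stand-in): nodes are packed into clusters of at least k members, and only cluster sizes and cluster-level edge counts are published, so individual links stay hidden inside clusters.

```python
def cluster_anonymize(adj, k):
    """Toy k-anonymization of a graph by clustering: publish only cluster
    memberships and cluster-level edge counts (a drastic simplification,
    not the paper's Sq algorithm)."""
    nodes = sorted(adj, key=lambda v: len(adj[v]))      # ascending degree
    clusters = [nodes[i:i + k]
                for i in range(0, len(nodes) - len(nodes) % k, k)]
    if len(nodes) % k:                  # fold the remainder into the last cluster
        clusters[-1].extend(nodes[-(len(nodes) % k):])
    label = {v: c for c, cl in enumerate(clusters) for v in cl}
    counts = {}
    for v, nbrs in adj.items():
        for u in nbrs:
            if v < u:                   # count each undirected edge once
                key = tuple(sorted((label[v], label[u])))
                counts[key] = counts.get(key, 0) + 1
    return clusters, counts
```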
ETPL DM-022 Clustering Large Probabilistic Graphs
Abstract: We study the problem of clustering probabilistic graphs. Similar to the problem of clustering
standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
ETPL DM-023
Abstract: Detecting the intrinsic loop structures of a data manifold is a necessary first step for the proper
employment of manifold learning techniques, and is of fundamental importance in discovering the essential representational features underlying data that lie on a loopy manifold. An effective strategy is proposed to solve this problem in this study. In line with our intuition, a formal definition of a loop residing on a manifold is first given. Based on this definition, theoretical properties of loopy manifolds are rigorously derived. In particular, a necessary and sufficient condition for detecting
Abstract: This paper introduces a platform for online Sensitivity Analysis (SA) that is applicable in
large scale real-time data acquisition (DAQ) systems. Here, we use the term real-time in the context of a system that has to respond to externally generated input stimuli within a finite and specified period. Complex industrial systems such as manufacturing, healthcare, transport, and finance require high-quality information on which to base timely responses to events occurring in their volatile environments. The motivation for the proposed EventTracker platform is the assumption that modern industrial systems are able to capture data in real-time and have the necessary technological flexibility to adjust to changing system requirements. The flexibility to adapt can only be assured if data is succinctly interpreted and translated into corrective actions in a timely manner. An important factor that facilitates data interpretation and information modeling is an appreciation of the effect system inputs have on each output at the time of occurrence. Many existing sensitivity analysis methods appear to hamper efficient and timely analysis due to a reliance on historical data, or sluggishness in providing a timely solution that would be of use in real-time applications. This inefficiency is further compounded by computational limitations and the complexity of some existing models. In dealing with real-time event-driven systems, the underpinning logic of the proposed method is based on the assumption that in the vast majority of cases changes in input variables will trigger events. Every single or combination of events could subsequently result in a change to the system state. The proposed event tracking sensitivity analysis method describes variables and the system state as a collection of events. The higher the numeric occurrence of an input variable at the trigger level during an event monitoring interval, the greater is its impact on the final analysis of the system state.
Experiments were designed to compare the proposed event tracking sensitivity analysis method with a comparable method (that of Entropy). An improvement of 10 percent in computational efficiency without loss of accuracy was observed. The comparison also showed that the time taken to perform the sensitivity analysis was 0.5 percent of that required when using the comparable Entropy-based method.
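The trigger-counting idea above can be sketched in a few lines, under the strong simplifying assumption that "impact" is just the normalized count of threshold crossings per input variable during a monitoring interval; variable names and thresholds below are invented for illustration, and this is not the EventTracker implementation.

```python
def event_sensitivity(samples, thresholds):
    """Sketch of trigger-level sensitivity analysis: the more often an input
    variable reaches its trigger threshold during the monitoring interval,
    the larger its estimated impact on the system state."""
    counts = {name: 0 for name in thresholds}
    for sample in samples:                          # one dict of readings per tick
        for name, level in thresholds.items():
            if sample.get(name, 0) >= level:        # trigger-level occurrence
                counts[name] += 1
    total = sum(counts.values()) or 1               # avoid division by zero
    return {name: c / total for name, c in counts.items()}
```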
ETPL DM-025 Fast Activity Detection: Indexing for Temporal Stochastic Automaton-Based
Abstract: Today, numerous applications require the ability to monitor a continuous stream of fine-grained data for the occurrence of certain high-level activities. A number of computerized systems, including ATM networks, web servers, and intrusion detection systems, systematically track every
issues in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in labeling effort. There has therefore been increasing interest both in active discovery: to identify new classes quickly, and active learning: to train classifiers with minimal supervision. These goals occur together in practice and are intrinsically related because examples of each class are required to train a classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data mining for rare classes in new domains makes inefficient use of human supervision. Developing active learning algorithms to optimise both rare class discovery and classification simultaneously is challenging because discovery and classification have conflicting requirements in query criteria. In this paper, we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them by adapting query criteria online; and a classifier combination algorithm that switches generative and discriminative classifiers as learning progresses. Extensive evaluation on a batch of standard UCI and vision data sets demonstrates the superiority of this approach over existing methods.
ETPL DM-027
clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or execution time. Halite's strengths are that it is fast and scalable, while still giving highly accurate results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in time and space with regard to the data size and dimensionality, and the dimensionality of the clusters' subspaces; 2) Usability: it is deterministic, robust to noise, does not take the number of clusters as an input parameter, and detects clusters in subspaces generated by original axes or by their linear combinations, including space rotation; 3) Effectiveness: it is accurate, providing results with equal or better quality compared to top related works; and 4) Generality: it includes a soft clustering approach. Experiments on synthetic data ranging from five to 30 axes and up to 1 million points were performed. Halite was on average at least 12 times faster than seven representative works, and always presented highly accurate results. On real data, Halite was at least 11 times faster than others, improving their accuracy by up to 35 percent. Finally, we report experiments in a real scenario where soft clustering is desirable.
ETPL DM-028
Abstract: We introduce the problem of k-pattern set mining, concerned with finding a set of k related
patterns under constraints. This contrasts with regular pattern mining, where one searches for many individual patterns. The k-pattern set mining problem is a very general problem that can be instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning, redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and at the global level, that is, on the overall pattern set. Building general solvers for the pattern set mining problem remains a challenge. Here, we investigate to what extent constraint programming (CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to constraints currently available in CP. This allows us to investigate a large number of settings within a unified framework and to gain insight into the possibilities and limitations of these solvers. This is important as it allows us to create guidelines on how to model new problems successfully and how to model existing problems more efficiently. It also opens the way for other solver technologies.
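The local/global constraint split can be illustrated by brute force, assuming one concrete local constraint (minimum support per pattern) and one concrete global constraint (the k patterns jointly cover every transaction); a CP solver as proposed in the paper would prune this search rather than materialize it, and both helper names are illustrative.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support, max_len=2):
    """All itemsets up to max_len meeting the local min-support constraint."""
    items = sorted({i for t in transactions for i in t})
    out = []
    for n in range(1, max_len + 1):
        for c in combinations(items, n):
            p = frozenset(c)
            if sum(p <= t for t in transactions) >= min_support:
                out.append(p)
    return out

def k_pattern_sets(transactions, k, min_support):
    """Brute-force k-pattern set mining: keep size-k sets of frequent
    patterns that jointly cover every transaction (global constraint)."""
    pats = frequent_patterns(transactions, min_support)
    return [set(ps) for ps in combinations(pats, k)
            if all(any(p <= t for p in ps) for t in transactions)]
```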
ETPL DM-029
Abstract: The World Wide Web includes semantic relations of numerous types that exist among
different entities. Extracting the relations that exist between two entities is an important step in various Web-related tasks such as information retrieval (IR), information extraction, and social network extraction. A supervised relation extraction system that is trained to extract a particular relation type (source relation) might not accurately extract a new type of relation (target relation) for which it has not been trained. However, it is costly to create training data manually for every new relation type that one might want to extract. We propose a method to adapt an existing relation
We propose a novel method for automatic annotation, indexing and annotation-based retrieval of images. The new method, that we call Markovian Semantic Indexing (MSI), is presented in the context of an online image retrieval system. Assuming such a system, the users' queries are used to construct an Aggregate Markov Chain (AMC) through which the relevance between the keywords seen by the system is defined. The users' queries are also used to automatically annotate the images. A stochastic distance between images, based on their annotation and the keyword relevance captured in the AMC, is then introduced. Geometric interpretations of the proposed distance are provided and its relation to a clustering in the keyword space is investigated. By means of a new measure of Markovian state similarity, the mean first cross passage time (CPT), optimality properties of the proposed distance are proved. Images are modeled as points in a vector space and their similarity is measured with MSI. The new method is shown to possess certain theoretical advantages and also to achieve better Precision versus Recall results when compared to Latent Semantic Indexing (LSI) and probabilistic Latent Semantic Indexing (pLSI) methods in Annotation-Based Image Retrieval (ABIR) tasks.
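The aggregate Markov chain at the core of MSI can be approximated very roughly: count transitions between consecutive keywords in user queries, then normalize each row into transition probabilities read as keyword-to-keyword relevance. This sketch stops there; the paper's stochastic image distance, CPT-based similarity measure, and optimality results are not reproduced, and the query format is assumed.

```python
from collections import defaultdict

def keyword_relevance(query_logs):
    """Build a toy aggregate Markov chain over keywords: count transitions
    between consecutive keywords of each query and normalize rows into
    transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for query in query_logs:                    # each query: list of keywords
        for a, b in zip(query, query[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nbrs.values()) for b, n in nbrs.items()}
            for a, nbrs in counts.items()}
```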
ETPL DM-031
Abstract: Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain
billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also furnished with tremendous amounts of product-related images. In addition, images in such social networks are also accompanied by annotations, comments, and other information, thus forming heterogeneous image-rich information networks. In this paper, we introduce the concept of (heterogeneous) image-rich information network and the problem of how to perform information
query character by character. We study how to support search-as-you-type on data residing in a relational DBMS. We focus on how to support this type of search using the native database language, SQL. A main challenge is how to leverage existing database functionalities to meet the high-performance requirement of achieving an interactive speed. We study how to use auxiliary indexes stored as tables to increase search performance. We present solutions for both single-keyword queries and multikeyword queries, and develop novel techniques for fuzzy search using SQL by allowing mismatches between query keywords and answers. We present techniques to answer first-N queries and discuss how to support updates efficiently. Experiments on large, real data sets show that our techniques enable a DBMS on a commodity computer to support search-as-you-type on tables with millions of records.
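A toy version of the auxiliary-index idea, assuming SQLite and invented table names: every prefix of every searchable value is materialized into an indexed table, so each keystroke becomes a single SQL equality lookup. The paper's actual techniques (fuzzy matching, multikeyword queries, first-N answers, updates) go well beyond this sketch.

```python
import sqlite3

# Auxiliary prefix table stored alongside the data, so search-as-you-type
# runs entirely in SQL. Schema and names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE prefixes (prefix TEXT, id INTEGER)")
db.execute("CREATE INDEX idx_prefix ON prefixes (prefix)")

for pid, name in [(1, "anderson"), (2, "andrews"), (3, "baker")]:
    db.execute("INSERT INTO people VALUES (?, ?)", (pid, name))
    for n in range(1, len(name) + 1):       # index every prefix of the name
        db.execute("INSERT INTO prefixes VALUES (?, ?)", (name[:n], pid))

def as_you_type(q):
    rows = db.execute(
        "SELECT p.name FROM prefixes x JOIN people p ON p.id = x.id "
        "WHERE x.prefix = ? ORDER BY p.name", (q,))
    return [r[0] for r in rows]

print(as_you_type("and"))   # ['anderson', 'andrews']
```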
ETPL DM-033
Abstract: Pruning achieves the dual goal of reducing the complexity of the final hypothesis for improved comprehensibility, and improving its predictive accuracy by minimizing the overfitting due to noisy data. This paper presents a new hybrid pruning technique for rule induction, as well as an incremental post-pruning technique based on a misclassification tolerance. Although both have been designed for RULES-7, the latter is also applicable to any rule induction algorithm in general. A thorough empirical evaluation reveals that the proposed techniques enable RULES-7 to outperform other state-of-the-art classification techniques. The improved classifier is also more accurate and up to two orders of magnitude faster than before.
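Misclassification-tolerance post-pruning can be sketched generically, independent of RULES-7: drop a rule condition whenever doing so admits no more than a tolerated number of extra misclassifications. The rule and example encodings below are assumptions made for illustration, not the paper's representation.

```python
def postprune(rule, examples, tolerance):
    """Drop conditions from a rule (list of (attribute, value) tests
    predicting class True) while the extra misclassifications stay
    within `tolerance`. A generic sketch of the idea only."""
    def errors(conds):
        covered = [ex for ex in examples
                   if all(ex[a] == v for a, v in conds)]
        return sum(1 for ex in covered if not ex["class"])

    pruned = list(rule)
    for cond in list(rule):
        trial = [c for c in pruned if c != cond]
        if errors(trial) - errors(pruned) <= tolerance:
            pruned = trial              # condition adds little: drop it
    return pruned
```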
ETPL DM-034 λ-diverse nearest neighbors browsing for multidimensional data
Abstract: Traditional search methods try to obtain the most relevant information and rank it according
to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of applications since results very similar to each other cannot capture all aspects of the queried topic. In this paper, we focus on the lambda-diverse k-nearest neighbor search problem on spatial and
than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study, kappa is calculated from the 2×2 correct/wrong contingency matrix. We derive a lower bound on kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments with real data show that there is unoccupied feasible space on the diagram corresponding to (hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the ensemble accuracy.
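Both diagram axes can be computed directly from the 2×2 contingency counts. The kappa formula below is the standard chance-corrected agreement applied to correct/wrong decisions, which is one common reading of this setup; it is a sketch, not the paper's derivation.

```python
def pairwise_kappa(a, b, c, d):
    """Kappa between two ensemble members from their 2x2 correct/wrong
    contingency counts: a = both correct, b = only the first correct,
    c = only the second correct, d = both wrong."""
    n = a + b + c + d
    observed = (a + d) / n                                       # agreement seen
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (observed - expected) / (1 - expected)

def averaged_error(a, b, c, d):
    """Mean individual error of the pair: the y-axis of the kappa-error diagram."""
    n = a + b + c + d
    return 1 - ((a + b) + (a + c)) / (2 * n)
```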
ETPL DM-036 A New Algorithm for Inferring User Search Goals with Feedback Sessions
Abstract: For a broad-topic and ambiguous query, different users may have different search goals
when they submit it to a search engine. The inference and analysis of user search goals can be very useful in improving search engine relevance and user experience. In this paper, we propose a novel approach to infer user search goals by analyzing search engine query logs. First, we propose a framework to discover different user search goals for a query by clustering the proposed feedback sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Second, we propose a novel approach to generate pseudo-documents to better represent the feedback sessions for clustering. Finally, we propose a new criterion, Classified Average Precision (CAP), to evaluate the performance of inferring user search goals. Experimental results are presented using user click-through logs from a commercial search engine to validate the effectiveness of our proposed methods.
ETPL DM-037
search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine-processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
ETPL DM-038 Building a Scalable Database-Driven Reverse Dictionary
Abstract: In this paper, we describe the design and implementation of a reverse dictionary. Unlike a
traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes a user input phrase describing the desired concept and returns a set of candidate words that satisfy the input phrase. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. We present a set of algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and the response-time performance of our implementation. Our experimental results show that our approach can provide significant improvements in performance at scale without sacrificing the quality of the result. Our experiments comparing the quality of our approach to that of currently available reverse dictionaries show that our approach can provide significantly higher quality than either of the other currently available implementations.
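The core of a reverse dictionary can be sketched as an inverted index from definition terms to headwords, ranking candidate words by how many terms of the input phrase their definitions share. The scalable, database-driven design described above goes far beyond this toy version; the sample entries are invented.

```python
from collections import defaultdict

def build_reverse_index(dictionary):
    """Invert word -> definition into definition-term -> {words}."""
    index = defaultdict(set)
    for word, definition in dictionary.items():
        for term in definition.lower().split():
            index[term].add(word)
    return index

def reverse_lookup(index, phrase, top=3):
    """Rank headwords by the number of phrase terms their definitions share."""
    scores = defaultdict(int)
    for term in phrase.lower().split():
        for word in index.get(term, ()):
            scores[word] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked][:top]
```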
ETPL DM-039
Abstract: Frequent itemset mining is a widely used exploratory technique that focuses on discovering
recurrent correlations among data. The steadfast evolution of markets and business environments prompts the need for data mining algorithms to discover significant correlation changes, in order to reactively suit product and service provision to customer needs. Change mining, in the context of frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets from one time period to another. The discovery of frequent generalized itemsets, i.e., itemsets that 1) frequently occur in the source data and 2) provide a high-level abstraction of the mined knowledge, raises new challenges in the analysis of itemsets that become rare, and thus are no longer extracted, from a certain point onward. This paper proposes a novel kind of dynamic pattern, namely the History GENeralized Pattern (HIGEN), which represents the evolution of an itemset in consecutive time periods by reporting information about its frequent generalizations characterized by minimal redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period.
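Change mining over frequent itemsets can be illustrated by mining each time period separately and diffing the results. The HIGEN's extra step, climbing a taxonomy to report a minimally generalized ancestor that remains frequent, is omitted from this sketch, and the exhaustive miner below is only workable for tiny data.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Exhaustively enumerate itemsets meeting min_support (toy-sized data only)."""
    items = sorted({i for t in transactions for i in t})
    freq = set()
    for n in range(1, len(items) + 1):
        for c in combinations(items, n):
            if sum(set(c) <= t for t in transactions) >= min_support:
                freq.add(frozenset(c))
    return freq

def itemset_changes(period1, period2, min_support):
    """Report itemsets frequent in one period but not the other."""
    f1, f2 = (frequent_itemsets(p, min_support) for p in (period1, period2))
    return {"became_infrequent": f1 - f2, "newly_frequent": f2 - f1}
```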
(see, e.g., Wikipedia [1]). In cases in which data/knowledge quality and reliability are crucial, proposals of update/insertion/deletion need to be evaluated by experts. To the best of our knowledge, no theoretical framework has been devised to model the semantics of update proposal/evaluation in the relational context. Since time is an intrinsic part of most domains (as well as of the proposal/evaluation process itself), semantic approaches to temporal relational databases (specifically, the Bitemporal Conceptual Data Model (henceforth, BCDM) [2]) are the starting point of our approach. In this paper, we propose BCDMPV, a semantic temporal relational model that extends BCDM to deal with multiple update/insertion/deletion proposals and with acceptances/rejections of the proposals themselves. We propose a theoretical framework, defining the new data structures, manipulation operations and temporal relational algebra, and proving some basic properties, namely that BCDMPV is a consistent extension of BCDM and that it is reducible to BCDM. These properties ensure consistency with most relational temporal database frameworks, facilitating implementations.
ETPL DM-041
Abstract: Designing well-structured websites to facilitate effective user navigation has long been a
challenge. A primary reason is that the web developers' understanding of how a website should be structured can be considerably different from that of the users. While various methods have been proposed to relink webpages to improve navigability using user navigation data, the completely reorganized new structure can be highly unpredictable, and the cost of disorienting users after the changes remains unanalyzed. This paper addresses how to improve a website without introducing substantial changes. Specifically, we propose a mathematical programming model to improve the user navigation on a website while minimizing alterations to its current structure. Results from extensive tests conducted on a publicly available real data set indicate that our model not only significantly improves the user navigation with very few changes, but also can be effectively solved. We have also tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we define two evaluation metrics and use them to assess the performance of the improved website using the real data set. Evaluation results confirm that the user navigation on the improved structure is
set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical one-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail.
ETPL DM-043 Modeling and Solving Distributed Configuration Problems: A CSP-Based Approach
Abstract: Product configuration can be defined as the task of tailoring a product according to the specific needs of a customer. Due to the inherent complexity of this task, which for example includes the consideration of complex constraints or the automatic completion of partial configurations, various Artificial Intelligence techniques have been explored in the last decades to tackle such configuration problems. Most of the existing approaches adopt a single-site, centralized approach. In modern supply chain settings, however, the components of a customizable product may themselves be configurable, thus requiring a multisite, distributed approach.
In this paper, we analyze the challenges of modeling and solving such distributed configuration problems and propose an approach based on Distributed Constraint Satisfaction. In particular, we advocate the use of Generative Constraint Satisfaction for knowledge modeling and show in an experimental evaluation that the use of generic constraints is particularly advantageous also in the distributed problem solving phase.
ETPL DM-044 On Similarity Preserving Feature Selection
Abstract: In the literature of feature selection, different criteria have been proposed to evaluate the
goodness of features. In our investigation, we notice that a number of existing selection criteria implicitly select features that preserve sample similarity, and can be unified under a common framework. We further point out that any feature selection criteria covered by this framework cannot handle redundant features, a common drawback of these criteria. Motivated by these observations, we
social science research and business analysis. Recently, researchers have developed privacy models similar to k-anonymity to prevent node reidentification through structure information. However, even when these privacy models are enforced, an attacker may still be able to infer one's private information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words, the label-node relationship is not well protected by pure structure anonymization methods. Furthermore, existing approaches, which rely on edge editing or node clustering, may significantly alter key graph properties. In this paper, we define a k-degree-l-diversity anonymity model that considers the protection of structural information as well as sensitive labels of individuals. We further propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm that adds noise nodes into the original graph with the consideration of introducing the least distortion to graph properties. Most importantly, we provide a rigorous analysis of the theoretical bounds on the number of noise nodes added and their impact on an important graph property. We conduct extensive experiments to evaluate the effectiveness of the proposed technique.
ETPL DM-046
Abstract: The current trend for building an ontology-based data management system (DMS) is to
capitalize on efforts made to design a preexisting well-established DMS (a reference system). The method amounts to extracting from the reference DMS a piece of schema relevant to the new application needs (a module), possibly personalizing it with extra constraints w.r.t. the application under construction, and then managing a data set using the resulting schema. In this paper, we extend the existing definitions of modules and we introduce novel properties of robustness that provide means for easily checking that a robust module-based DMS evolves safely w.r.t. both the schema and the data of the reference DMS. We carry out our investigations in the setting of description logics, which underlie modern ontology languages like RDFS, OWL, and OWL2 from W3C. Notably, we focus on the DL-liteA dialect of the DL-lite family, which encompasses the foundations of the QL
As online social networking emerges, there has been increased interest in utilizing the underlying network structure as well as the available information on social peers to better serve the information needs of a user. In this paper, we focus on improving the performance of information collection from the neighborhood of a user in a dynamic social network. We introduce sampling-based algorithms to efficiently explore a user's social network respecting its structure and to quickly approximate quantities of interest. We introduce and analyze variants of the basic sampling scheme exploring correlations across our samples. Models of centralized and distributed social networks are considered. We show that our algorithms can be utilized to rank items in the neighborhood of a user, assuming that information for each user in the network is available. Using real and synthetic data sets, we validate the results of our analysis and demonstrate the efficiency of our algorithms in approximating quantities of interest. The methods we describe are general and can probably be easily adopted in a variety of strategies aiming to efficiently collect information from a social graph.
ETPL DM-048
Supporting Flexible, Efficient, and User-Interpretable Retrieval of Similar Time Series
Abstract: Supporting decision making in domains in which the observed phenomenon dynamics have to be dealt with can greatly benefit from the retrieval of past cases, provided that proper representation and retrieval techniques are implemented. In particular, when the parameters of interest take the form of time series, dimensionality reduction and flexible retrieval have to be addressed to this end. Classical methodological solutions proposed to cope with these issues, typically based on mathematical transforms, are characterized by strong limitations, such as a difficult interpretation of retrieval results for end users, reduced flexibility and interactivity, or inefficiency.
In this paper, we describe a novel framework, in which time-series features are summarized by means of Temporal Abstractions, and then retrieved resorting to abstraction similarity. Our approach grants interpretability of the output results, and understandability of the (user-guided) retrieval process. In particular, multilevel abstraction mechanisms and proper indexing techniques are provided, for flexible query issuing, and efficient and interactive query answering. Experimental results have shown the efficiency of our approach in a scalability test, and its superiority with respect to the use of a classical mathematical technique in flexibility, user friendliness, and also quality of results.
ETPL DM-049
The Minimum Consistent Subset Cover Problem: A Minimization View of Data Mining
Abstract: In this paper, we introduce and study the minimum consistent subset cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has minimum clique partition (MCP), a dual problem of
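Since MCSC generalizes classical set covering, its simplest special case (every given subset is "consistent") can be attacked with the standard greedy approximation. A minimal Python sketch of that baseline (function names are illustrative, not from the paper):

```python
def greedy_cover(universe, subsets):
    """Greedy approximation for the classical set covering problem,
    the special case of MCSC where every given subset is consistent.
    Repeatedly pick the subset covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= set(best)
    return chosen
```

The greedy rule gives the well-known ln(n)-factor guarantee for set cover; MCSC itself additionally requires generating the consistent subsets.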
Abstract: The problem of multilabel classification has attracted great interest in the last decade, where
each instance can be assigned with a set of multiple class labels simultaneously. It has a wide variety of real-world applications, e.g., automatic image annotation and gene function analysis. Current research on multilabel classification focuses on supervised settings which assume the existence of large amounts of labeled training data. However, in many applications, the labeling of multilabeled data is extremely expensive and time consuming, while there are often abundant unlabeled data available. In this paper, we study the problem of transductive multilabel learning and propose a novel solution, called Transductive Multilabel Classification (TraM), to effectively assign a set of multiple labels to each instance. Different from supervised multilabel learning methods, we estimate the label sets of the unlabeled instances effectively by utilizing the information from both labeled and unlabeled data. We first formulate the transductive multilabel learning as an optimization problem of estimating label concept compositions. Then, we derive a closed-form solution to this optimization problem and propose an effective algorithm to assign label sets to the unlabeled instances. Empirical studies on several real-world multilabel learning tasks demonstrate that our TraM method can effectively boost the performance of multilabel classification by using both labeled and unlabeled data.
ETPL DM-051
A Method for Mining Infrequent Causal Associations and Its Application in Finding Adverse Drug Reaction Signal Pairs
Abstract: In many real-world applications, it is important to mine causal relationships where an event or event pattern causes certain outcomes with low probability. Discovering this kind of causal relationship can help us prevent or correct negative outcomes caused by their antecedents.
In this paper, we propose an innovative data mining framework and apply it to mine potential causal associations in electronic patient data sets where the drug-related events of interest occur infrequently. Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the basis of this new measure, a data mining algorithm was developed to mine the causal relationship between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved
significant challenges on both modeling similarity between uncertain objects and developing efficient computational methods. The previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this
Abstract: SUM queries are crucial for many applications that need to deal with uncertain data. In this
paper, we are interested in the queries, called ALL_SUM, that return all possible sum values and their probabilities. In general, there is no efficient solution for the problem of evaluating ALL_SUM queries. But, for many practical applications, where aggregate values are small integers or real numbers with small precision, it is possible to develop efficient solutions. In this paper, based on a recursive approach, we propose a new solution for those applications. We implemented our solution and conducted an extensive experimental evaluation over synthetic and real-world data sets; the results show its effectiveness.
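When the aggregate values are small integers, the full sum distribution over independent uncertain attributes can be built exactly by repeated convolution, which is the spirit of the recursive approach the abstract describes. A minimal Python sketch (names are illustrative, not the paper's algorithm):

```python
from collections import defaultdict

def all_sum(tuples):
    """Exact ALL_SUM over independent uncertain attributes.

    Each tuple is a dict {value: probability}. Returns a dict
    {sum_value: probability} covering all possible sums, built up
    one tuple at a time by convolving distributions."""
    dist = {0: 1.0}
    for t in tuples:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for v, q in t.items():
                nxt[s + v] += p * q
        dist = dict(nxt)
    return dist
```

With values bounded by a small constant, the distribution size stays small, which is exactly the practical regime the abstract identifies as tractable.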
ETPL DM-055
Abstract: Service composition is emerging as an effective vehicle for integrating existing web services
to create value-added and personalized composite services. As web services with similar functionality are expected to be provided by competing providers, a key challenge is to find the best web services to participate in the composition. When multiple quality aspects (e.g., response time, fee, etc.) are considered, a weighting mechanism is usually adopted by most existing approaches, which requires users to specify their preferences as numeric values. We propose to exploit the dominance relationship among service providers to find a set of best possible composite services, referred to as a composite service skyline. We develop efficient algorithms that allow us to find the composite service skyline from a significantly reduced search space instead of considering all possible service compositions. We propose a novel bottom-up computation framework that enables the skyline algorithm to scale well with the number of services in a composition. We conduct a comprehensive analytical and experimental study to evaluate the effectiveness, efficiency, and scalability of the composite skyline computation approaches.
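The dominance relationship underlying a skyline can be sketched in a few lines. Assuming QoS vectors where lower is better in every dimension (e.g., response time, fee), a service dominates another if it is no worse everywhere and strictly better somewhere; the skyline keeps the non-dominated vectors. A basic block-nested-loops version (not the paper's bottom-up framework):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every QoS dimension
    (lower is better) and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Block-nested-loops skyline: keep points not dominated by any other."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

The paper's contribution is avoiding this quadratic scan over all possible compositions; the sketch only defines the answer being computed.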
ETPL DM-056
Abstract: A spatial colocation pattern is a group of spatial features whose instances are frequently
located together in geographic space. Discovering colocations has many useful applications. For
preferences in the form of concepts by mining their clickthrough data. Due to the importance of location information in mobile search, PMSE classifies these concepts into content concepts and location concepts. In addition, users' locations (positioned by GPS) are used to supplement the location concepts in PMSE. The user preferences are organized in an ontology-based, multifacet user profile, which is used to adapt a personalized ranking function for rank adaptation of future search results. To characterize the diversity of the concepts associated with a query and their relevance to the user's need, four entropies are introduced to balance the weights between the content and location
Abstract: Skyline query processing for location-based services, which considers both spatial and
nonspatial attributes of the objects being queried, has recently received increasing attention. Existing solutions focus on solving point- or line-based skyline queries, in which the query location is an exact location point or a line segment. However, due to privacy concerns and the limited precision of localization devices, the input of a user location is often a spatial range. This paper studies a new problem of how to process such range-based skyline queries. Two novel algorithms are proposed: one is index-based (I-SKY) and the other is not based on any index (N-SKY). To handle frequent movements of the objects being queried, we also propose incremental versions of I-SKY and N-SKY, which avoid recomputing the query index and results from scratch. Additionally, we develop efficient solutions for probabilistic and continuous range-based skyline queries. Experimental results show that our proposed algorithms significantly outperform the baseline algorithm that adopts the existing line-based skyline solution. Moreover, the incremental versions of I-SKY and N-SKY save substantial computation cost, especially when the objects move frequently.
ETPL DM-060
Abstract: We assume a data set that is vertically decomposed among several servers, and a client that
wishes to compute the skyline by obtaining the minimum number of points. Existing solutions for this problem are restricted to the case where each server maintains exactly one dimension. This paper proposes a general solution for vertical decompositions of arbitrary dimensionality. We first investigate some interesting problem characteristics regarding the pruning power of points. Then, we introduce vertical partition skyline (VPS), an algorithmic framework that includes two steps. Phase 1 searches for an anchor point P_anc that dominates, and hence eliminates, a large number of records. Starting with P_anc, Phase 2 constructs incrementally a pruning area using an interesting union-intersection property of dominance regions. Servers do not transmit points that fall within the pruning area in their local subspace. Our experiments confirm the effectiveness of the proposed methods under various settings.
ETPL DM-061
Abstract: Time series is an important form of data available in numerous applications and often
contains vast amounts of private personal information. The need to protect privacy in time-series data while effectively supporting complex queries on them poses nontrivial challenges to the database community. We study the anonymization of time series while trying to support complex queries, such as range and pattern matching queries, on the published data. The conventional k-anonymity model cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel anonymization model called (k, P)-anonymity for pattern-rich time series. This model publishes both the attribute values and the patterns of time series in separate data forms. We demonstrate that our model can prevent linkage attacks on the published data while effectively supporting a wide variety of queries on the anonymized data. We propose two algorithms to enforce (k, P)-anonymity on time-series data. Our anonymity model supports customized data publishing, which allows a certain part of the values but a different part of the pattern of the anonymized time series to be published simultaneously. We present estimation techniques to support query processing on such customized data. The proposed methods are evaluated in a comprehensive experimental study. Our results verify the effectiveness and efficiency of our approach.
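To make the "pattern" component of (k, P)-anonymity concrete, one simple choice (an illustration, not the paper's exact representation) is a per-step trend string, with series grouped so that each published pattern is shared by at least k series:

```python
def pattern(series):
    """Trend pattern of a time series: 'u' (up), 'd' (down), 's' (stable)
    per step -- the pattern component published separately from values."""
    return ''.join('u' if b > a else 'd' if b < a else 's'
                   for a, b in zip(series, series[1:]))

def k_anonymous_groups(series_list, k):
    """Group series sharing an identical pattern; only groups of size
    >= k publish their pattern, so pattern alone cannot single out
    fewer than k individuals."""
    groups = {}
    for s in series_list:
        groups.setdefault(pattern(s), []).append(s)
    return {p: g for p, g in groups.items() if len(g) >= k}
```

Real (k, P)-anonymity also generalizes values and patterns jointly; the sketch shows only the grouping intuition.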
ETPL DM-063
Synchronization is a powerful and inherently hierarchical concept regulating a large variety of complex processes ranging from the metabolism in a cell to opinion formation in a group of individuals. Synchronization phenomena in nature have been widely investigated and models
examples. One practically important problem is: can the labeled data from other related sources help predict the target task, even if they have 1) different feature spaces (e.g., image versus text data), 2) different data distributions, and 3) different output spaces? This paper proposes a solution and discusses the conditions where this is highly likely to produce better results. It first unifies the feature spaces of the target and source data sets by spectral embedding, even when they have completely different feature spaces. The principle is to devise an optimization objective that preserves the original structure of the data, while at the same time, maximizes the similarity between the two. A linear projection model, as well as a nonlinear approach, are derived on the basis of this principle in closed form. Second, a judicious sample selection strategy is applied to select only related source examples. Finally, a Bayesian-based approach is applied to model the relationship between different output spaces. The three steps can bridge related heterogeneous sources in order to learn the target task. Among the 20 experimental data sets, for example, images with wavelet-transform-based features are used to predict another set of images whose features are constructed from color-histogram space; documents are used to help image classification, etc. By using these extracted examples from heterogeneous sources, the models can reduce the error rate by as much as 50 percent, compared with the methods using only the examples from the target task.
ETPL DM-065
Tweet Analysis for Real-Time Event Detection and Earthquake Reporting System Development
Abstract: Twitter has received much attention recently. An important characteristic of Twitter is its real-time nature. We investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event.
To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event
Abstract: The skyline query, aiming at identifying a set of skyline tuples that are not dominated by
any other tuple, is particularly useful for multicriteria data analysis and decision making. For uncertain databases, a probabilistic skyline query, called P-Skyline, has been developed to return skyline tuples by specifying a probability threshold. However, the answer obtained via a P-Skyline query usually includes skyline tuples undesirably dominating each other when a small threshold is specified; or it may contain much fewer skyline tuples if a larger threshold is employed. To address this concern, we propose a new uncertain skyline query, called U-Skyline query, in this paper. Instead of setting a probabilistic threshold to qualify each skyline tuple independently, the U-Skyline query searches for a set of tuples that has the highest probability (aggregated from all possible scenarios) as the skyline answer. In order to answer U-Skyline queries efficiently, we propose a number of optimization techniques for query processing, including 1) computational simplification of U-Skyline probability, 2) pruning of unqualified candidate skylines and early termination of query processing, 3) reduction of the input data set, and 4) partition and conquest of the reduced data set. We perform a
Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates
Abstract: We may want to keep sensitive information in a relational database hidden from a user or
group thereof. We characterize sensitive data as the extensions of secrecy views. The database, before returning the answers to a query posed by a restricted user, is updated to make the secrecy views
Abstract: Expert search has been studied in different contexts, e.g., enterprises, academic communities.
We examine a general expert search problem: searching experts on the web, where millions of webpages and thousands of names are considered. It has two main challenging issues: 1) webpages can be of varying quality and full of noise; 2) the expertise evidence scattered in webpages is usually vague and ambiguous. We propose to leverage the large amount of co-occurrence information to assess the relevance and reputation of a person name for a query topic. The co-occurrence structure is modeled using a hypergraph, on which a heat-diffusion-based ranking algorithm is proposed. Query keywords are regarded as heat sources, and a person name which has a strong connection with the query (i.e., frequently co-occurs with query keywords and co-occurs with other names related to query keywords) will receive most of the heat, thus being ranked high. Experiments on the ClueWeb09 web collection show that our algorithm is effective for retrieving experts and outperforms baseline algorithms significantly. This work can be regarded as one step toward addressing the more general entity search problem without sophisticated NLP techniques.
ETPL DM-072
Efficient All Top-k Computation: A Unified Solution for All Top-k, Reverse Top-k and Top-m Influential Queries
Abstract: Given a set of objects P and a set of ranking functions F over P, an interesting problem is to compute the top ranked objects for all functions. Evaluation of multiple top-k queries finds application in systems where there is a heavy workload of ranking queries (e.g., online search engines and product recommendation systems). The simple solution of evaluating the top-k queries one-by-one does not scale well; instead, the system can make use of the fact that similar queries share common results to accelerate search. This paper is the first, to our knowledge, thorough study of this problem. We propose methods that compute all top-k queries in batch.
Our first solution applies the block indexed nested loops paradigm, while our second technique is a view-based algorithm. We propose appropriate optimization techniques for the two approaches and demonstrate experimentally that the second approach is consistently the best. Our approach facilitates the evaluation of other complex queries that depend on the computation of multiple top-k queries, such as reverse top-k and top-m influential queries. We show that our batch processing technique for these complex queries outperforms the state-of-the-art by orders of magnitude.
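The problem being batched can be stated in a few lines. Assuming linear ranking functions (score = w · p), the baseline the paper improves on evaluates each function's top-k with a bounded heap in one shared pass over the objects (a sketch, not the paper's indexed or view-based algorithms):

```python
import heapq

def all_top_k(objects, weight_vectors, k):
    """Baseline all top-k evaluation: for each linear ranking function
    (weight vector w, score = w . p), return its k highest-scoring
    objects. One shared pass scores every object for every function."""
    heaps = [[] for _ in weight_vectors]
    for p in objects:
        for h, w in zip(heaps, weight_vectors):
            score = sum(wi * pi for wi, pi in zip(w, p))
            if len(h) < k:
                heapq.heappush(h, (score, p))
            elif score > h[0][0]:
                heapq.heapreplace(h, (score, p))
    return [sorted(h, reverse=True) for h in heaps]
```

This costs O(|P| · |F|) score evaluations; the paper's techniques exploit result sharing between similar functions to do far less work.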
Abstract: Although there is a long line of work on identifying duplicates in relational data, only a few
solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores in several data sets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and of effectiveness.
ETPL DM-074
Abstract:
Wireless sensor networks are widely used to continuously collect data from the environment. Because of energy constraints on battery-powered nodes, it is critical to minimize communication. Suppression has been proposed as a way to reduce communication by using predictive models to suppress the reporting of predictable data. However, in the presence of communication failures, missing data are difficult to interpret because they could have been either suppressed or lost in transmission. There is no existing solution for handling failures for general, spatiotemporal suppression that uses cascading. While cascading further reduces communication, it makes failure handling difficult, because nodes can act on incomplete or incorrect information and in turn affect other nodes. We propose a cascaded suppression framework that exploits both temporal and spatial data correlation to reduce communication, and applies coding theory and Bayesian inference to recover missing data resulting from suppression and communication failures. Experimental results show that cascaded suppression significantly reduces communication cost and improves missing data recovery compared to existing approaches.
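The core of temporal suppression is easy to see in isolation. Assuming the simplest predictive model (the sink predicts the last reported value), a node transmits only when its reading drifts beyond a tolerance epsilon, which is a minimal sketch of the idea rather than the paper's cascaded framework:

```python
def suppress(readings, epsilon):
    """Temporal suppression with a last-reported-value model: a node
    transmits a reading only when it deviates from the value the sink
    would predict (the last report) by more than epsilon."""
    reports, last = [], None
    for t, v in enumerate(readings):
        if last is None or abs(v - last) > epsilon:
            reports.append((t, v))
            last = v
    return reports
```

The failure-handling difficulty the abstract describes is visible here: a gap in `reports` is indistinguishable from a lost message without extra redundancy, which is what the paper's coding-theoretic recovery addresses.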
ETPL DM-075
Abstract: Clustering by integrating multiview representations has become a crucial issue for
knowledge discovery in heterogeneous environments. However, most prior approaches assume that the multiple representations share the same dimension, limiting their applicability to homogeneous environments. In this paper, we present a novel tensor-based framework for integrating heterogeneous multiview data in the context of spectral clustering. Our framework includes two novel formulations: multiview clustering based on the integration of the Frobenius-norm objective function (MC-FR-OI) and that based on matrix integration in the Frobenius-norm objective function (MC-FR-MI). We show that the solutions for both formulations can be computed by tensor decompositions. We evaluated our methods on synthetic data and two real-world data sets in comparison with baseline
formation in cooperative games, and to use the Shapley value of the patterns to identify clusters and cluster representatives. We show that the underlying game is convex, and this leads to an efficient biobjective clustering algorithm that we call BiGC. The algorithm yields high-quality clustering with respect to average point-to-center distance (potential) as well as average intracluster point-to-point distance (scatter). We demonstrate the superiority of BiGC over state-of-the-art clustering algorithms (including the center-based and the multiobjective techniques) through detailed experimentation using standard cluster validity criteria on several benchmark data sets. We also show that BiGC satisfies key clustering properties such as order independence, scale invariance, and richness.
ETPL DM-077
On Generalizable Low False-Positive Learning Using Asymmetric Support Vector Machines
Abstract: Support Vector Machines (SVMs) have been widely used for classification due to their ability to give low generalization error. In many practical applications of classification, however, the wrong prediction of a certain class is much more severe than that of the other classes, making the original SVM unsatisfactory. In this paper, we propose the notion of the Asymmetric Support Vector Machine (ASVM), an asymmetric extension of the SVM, for these applications. Different from existing SVM extensions such as thresholding and parameter tuning, ASVM employs a new objective that models the imbalance between the costs of false predictions from different classes in a novel way, such that user tolerance on the false-positive rate can be explicitly specified. Such a new objective formulation allows us to obtain a lower false-positive rate without much degradation of the prediction accuracy or increase in training time. Furthermore, we show that the generalization ability is preserved with the new objective.
We also study the effects of the parameters in the ASVM objective and address some implementation issues related to Sequential Minimal Optimization (SMO) to cope with large-scale data. An extensive simulation is conducted and shows that ASVM is able to yield either a noticeable improvement in performance or a reduction in training time as compared to the prior art.
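For contrast, the thresholding baseline that the abstract distinguishes ASVM from can be sketched directly: given classifier scores on held-out negatives, pick a decision threshold so that at most a target fraction of negatives would be flagged positive (a post-hoc fix, not ASVM's built-in objective; names are illustrative):

```python
def threshold_for_fpr(neg_scores, max_fpr):
    """Pick a decision threshold so that at most a max_fpr fraction of
    negative examples score above it -- the simple thresholding baseline
    that ASVM is contrasted with. Predict positive when score > threshold."""
    s = sorted(neg_scores, reverse=True)
    allowed = int(max_fpr * len(s))  # number of tolerated false positives
    if allowed >= len(s):
        return float('-inf')
    # threshold sits at the first score we are no longer willing to flag
    return s[allowed]
```

ASVM's point is that baking the tolerance into the training objective, rather than moving the threshold afterwards, avoids sacrificing accuracy on the positive class.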
ETPL DM-078
Abstract:
Given a set of spatial points DS, each of which is associated with categorical information, e.g., restaurant, pub, etc., the optimal route query finds the shortest path that starts from the query point (e.g., a home or hotel), and covers a user-specified set of categories (e.g., {pub, restaurant, museum}). The user may also specify partial order constraints between different categories, e.g., a restaurant must be visited before a pub. Previous work has focused on a special case where the query contains the total order of all categories to be visited (e.g., museum →
Abstract: Entity resolution (ER) is the problem of identifying which records in a database refer to the
same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data from the web may simply be too large to completely resolve with a reasonable amount of work. As another example, real-time applications may not be able to tolerate any ER processing that takes longer than a certain amount of time. This paper investigates how we can maximize the progress of ER with a limited amount of work using “hints,” which give information on records that are likely to refer to the same real-world entity. A hint can be represented in various formats (e.g., a grouping of records based on their likelihood of matching), and ER can use this information as a guideline for which records to compare first. We introduce a family of techniques for constructing hints efficiently and techniques for using the hints to maximize the number of matching records identified using a limited amount of work. Using real data sets, we illustrate the potential gains of our pay-as-you-go approach compared to running ER without using hints.
ETPL DM-080
Single-Database Private Information Retrieval from Fully Homomorphic Encryption
Abstract: Private Information Retrieval (PIR) allows a user to retrieve the i-th bit of an n-bit database without revealing to the database server the value of i. In this paper, we present a PIR protocol with a communication complexity of O(γ log n) bits, where γ is the ciphertext size. Furthermore, we extend the PIR protocol to a private block retrieval (PBR) protocol, a natural and more practical extension of PIR in which the user retrieves a block of bits instead of a single bit. Our protocols are built on state-of-the-art fully homomorphic encryption (FHE) techniques and provide privacy for the user if the underlying FHE scheme is semantically secure.
The total communication complexity of our PBR is O(γ log m + γn/m) bits, where m is the number of blocks. The total computation complexity of our PBR is O(m log m) modular multiplications plus O(n/2) modular additions. In terms of total protocol execution time, our PBR protocol is more efficient than existing PBR protocols, which usually require computing O(n/2) modular multiplications when the size of a block in the database is large and a high-speed
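The selection-vector idea at the heart of many FHE-based PIR constructions can be shown with plaintext arithmetic standing in for the homomorphic operations (this is purely illustrative and offers no privacy as written; under a real scheme each query entry would be an encryption of 0 or 1, so the server cannot see which index is selected):

```python
def pir_query(n, i):
    """Toy selection vector for index i: under a real FHE scheme each
    entry would be an encryption of 0 or 1; here plaintext stands in."""
    return [1 if j == i else 0 for j in range(n)]

def pir_answer(db_bits, query):
    """Server side: inner product of the database with the query.
    Homomorphically, this yields an encryption of db_bits[i] while the
    server learns nothing about i."""
    return sum(b * q for b, q in zip(db_bits, query))
```

The O(γ log n) communication of the paper's protocol comes from compressing this naive n-entry query using the structure of the FHE scheme.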
Abstract: Due to the fast evolution of the information on the Internet, update summarization has
received much attention in recent years. It aims to summarize an evolutionary document collection at the current time, supposing the users have read some related previous documents. In this paper, we propose a graph-ranking-based method. It performs constrained reinforcements on a sentence graph, which unifies previous and current documents, to determine the salience of the sentences. The constraints ensure that the most salient sentences in current documents are updates to previous documents. Since this formulation is NP-hard, we then propose an approximate method, which is polynomial-time solvable. Experiments on the TAC 2008 and 2009 benchmark data sets show the effectiveness and efficiency of our method.
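Unconstrained graph ranking over a sentence similarity graph is the familiar starting point for such methods. A PageRank-style power iteration (a generic stand-in for the constrained reinforcement described above, with illustrative names) looks like:

```python
def rank_sentences(sim, iters=50, d=0.85):
    """PageRank-style salience scores over a sentence similarity graph.

    sim: square similarity matrix (list of lists); each row is
    normalized by its sum so transitions are stochastic. d is the
    usual damping factor."""
    n = len(sim)
    row_sums = [sum(row) or 1.0 for row in sim]
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [(1 - d) / n + d * sum(r[j] * sim[j][i] / row_sums[j]
                                   for j in range(n))
             for i in range(n)]
    return r
```

The paper's method adds constraints on this reinforcement so that highly ranked current-document sentences are also novel relative to the previous documents.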
ETPL DM-084
Abstract: Change detection in streaming data relies on a fast estimation of the probability that the data
in two consecutive windows come from different distributions. Choosing the criterion is one of the multitude of questions that need to be addressed when designing a change detection procedure. This paper gives a log-likelihood justification for two well-known criteria for detecting change in streaming multidimensional data: Kullback-Leibler (K-L) distance and Hotelling's T-square test for equal means (H). We propose a semiparametric log-likelihood criterion (SPLL) for change detection. Compared to the existing log-likelihood change detectors, SPLL trades some theoretical rigor for computational simplicity. We examine SPLL together with K-L and H on detecting induced change on 30 real data sets. The criteria were compared using the area under the respective Receiver Operating Characteristic (ROC) curve (AUC). SPLL was found to be on par with H and better than K-L for the nonnormalized data, and better than both on the normalized data.
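As a concrete instance of the K-L criterion being compared, the divergence between Gaussians fitted to two consecutive 1-D windows has a closed form (this is the textbook Gaussian K-L, not the paper's SPLL):

```python
import math

def gaussian_kl(w1, w2):
    """K-L divergence between Gaussians fitted to two 1-D windows,
    usable as a change score between consecutive windows of a stream."""
    def fit(w):
        m = sum(w) / len(w)
        v = sum((x - m) ** 2 for x in w) / len(w) or 1e-12
        return m, v
    m1, v1 = fit(w1)
    m2, v2 = fit(w2)
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
```

A large score between adjacent windows signals a distribution change; SPLL keeps this log-likelihood flavor while avoiding full density estimation in high dimensions.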
ETPL DM-085
Abstract: Event relations are used in many temporal relational database approaches to represent facts
occurring at time instants. However, to the best of our knowledge, none of these approaches fully copes with the definition of events as provided, e.g., by the “consensus” temporal database glossary. We propose a new approach which overcomes this limitation, allowing one to cope with multiple events occurring in the same temporal granule. This move involves major extensions to current approaches, since indeterminacy about the time and number of occurrences of events needs to be faced. Specifically, we have introduced a new data model and new definitions of relational algebraic operators coping with the above issues, and we have studied their reducibility. Last, but not least, we have shown that our approach can be easily extended to cope with a general form of temporal indeterminacy. Such an extension further increases the applicability of our approach.
Support Vector Machines (SVMs) have been shown to achieve high performance on classification tasks across many domains, and a great deal of work has been dedicated to developing computationally efficient training algorithms for linear SVMs. One approach [1] approximately minimizes risk through use of cutting planes, and is improved by [2], [3]. We build upon this work, presenting a modification to the algorithm developed by Franc and Sonnenburg [2]. We demonstrate empirically that our changes can reduce cutting plane training time by up to 40 percent, and discuss how changes in data sets and parameter settings affect the effectiveness of our method.
ETPL DM-087
Abstract: In this paper, we propose an efficient clustering algorithm that has been applied to the
microaggregation problem. The goal is to partition N given records into clusters, each of them grouping at least K records, so that the sum of the within-partition squared error (SSE) is minimized. We propose a successive Group Selection algorithm that approximately solves the microaggregation problem in O(N^2 log N) time, based on sequential minimization of SSE. Experimental results and comparisons to existing methods with similar computation cost on real and synthetic data sets demonstrate the high performance and robustness of the proposed scheme.
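A minimal 1-D version makes the objective tangible: sort the records, cut them into consecutive groups of at least K, and replace each record by its group centroid, so every published value is shared by at least K records (a fixed-size sketch, not the paper's Group Selection algorithm):

```python
def microaggregate(values, k):
    """Fixed-size microaggregation sketch for 1-D data: sort, cut into
    consecutive groups of at least k records, and replace each record
    by its group centroid. SSE of this replacement is the quantity
    minimized in the microaggregation problem."""
    v = sorted(values)
    groups = [v[i:i + k] for i in range(0, len(v), k)]
    if len(groups) > 1 and len(groups[-1]) < k:  # merge undersized tail
        groups[-2].extend(groups.pop())
    out = []
    for g in groups:
        c = sum(g) / len(g)
        out.extend([c] * len(g))
    return out
```

Choosing where to cut (variable group sizes between K and 2K-1) is what separates good heuristics from this naive chunking.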
ETPL Modeling and Computing Ternary Projective Relations between Regions DM-088
Abstract: We report a corrected version of the algorithms to compute ternary projective relations
between regions that appeared in E. Clementini and R. Billen, "Modeling and computing ternary projective relations between regions," IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 799-814, 2006.
ETPL A Survival Modeling Approach to Biomedical Search Result Diversification Using Wikipedia DM-089
Abstract: In this paper, we propose a survival modeling approach to promoting ranking diversity for biomedical information retrieval. The proposed approach is concerned with finding relevant documents that can deliver more different aspects of a query. First, two probabilistic models derived from survival analysis theory are proposed for measuring aspect novelty. Second, a new method using Wikipedia to detect the aspects covered by retrieved documents is presented. Third, an aspect filter based on a two-stage model is introduced. It ranks the detected aspects in decreasing order of the probability that an aspect is generated by the query. Finally, the relevance and the novelty of retrieved documents are combined at the aspect level for reranking. Experiments conducted on the TREC 2006 and 2007 Genomics collections demonstrate the effectiveness of the proposed approach in promoting ranking diversity for biomedical information retrieval. Moreover, we further evaluate our approach in the web retrieval environment. The evaluation results on the ClueWeb09-T09B collection show that our
First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints. We then use an alternating expectation maximization (EM) algorithm to optimize the model. We also propose two novel methods to automatically construct and incorporate document and word constraints to support unsupervised constrained clustering: 1) automatically construct document constraints based on overlapping named entities (NEs) extracted by an NE extractor; 2) automatically construct word constraints based on their semantic distance inferred from WordNet. The results of our evaluation over two benchmark data sets demonstrate the superiority of our approaches over a number of existing approaches.
ETPL GC-092
Smartphones are nowadays equipped with a number of sensors, such as WiFi, GPS, and accelerometers. This capability allows smartphone users to easily engage in crowdsourced computing services, which contribute to the solution of complex problems in a distributed manner. In this work, we leverage such a computing paradigm to efficiently solve the following problem: comparing a query trace Q against a crowd of traces generated and stored on distributed smartphones. Our proposed framework, coined SmartTrace+, provides an effective solution without disclosing any part of the crowd traces to the query processor. SmartTrace+ relies on an in-situ data storage
are “on their own” in terms of how they manage this incompleteness. In this paper, we propose the general concept of partial information policy (PIP) operator to handle incompleteness in relational databases. PIP operators build upon preference frameworks for incomplete information, but accommodate different types of incomplete data (e.g., a value exists but is not known; a value does not exist; a value may or may not exist). Different users in the real world have different ways in which they want to handle incompleteness—PIP operators allow them to specify a policy that matches their attitude to risk and their knowledge of the application and how the data was collected. We propose index structures for efficiently evaluating PIP operators and experimentally assess their effectiveness on a real-world airline data set. We also study how relational algebra operators and PIP operators interact with one another.
ETPL Decision Trees for Mining Data Streams Based on the McDiarmid's Bound DM-094
Abstract: In mining data streams, the most popular tool is the Hoeffding tree algorithm. It uses Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature, the same Hoeffding bound was used for any evaluation function (heuristic measure), e.g., information gain or the Gini index. In this paper, it is shown that Hoeffding's inequality is not appropriate for solving the underlying problem. We prove two theorems presenting McDiarmid's bound both for the information gain, used in the ID3 algorithm, and for the Gini index, used in the Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a decision tree learning system, applied to data streams and based on McDiarmid's bound, has the property that its output is nearly identical to that of a conventional learner. These results have a great impact on the state of the art of mining data streams, and the various methods and algorithms developed so far should be reconsidered.
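For reference, the Hoeffding-style split test that the paper critiques can be sketched as follows. The value range and confidence parameters are illustrative; the McDiarmid replacement proved in the paper has a different, measure-specific form that is not reproduced here:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding's bound: with probability 1 - delta, the empirical mean
    of n i.i.d. observations confined to an interval of width value_range
    lies within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# A Hoeffding tree splits a node once the observed gap between the two
# best attributes' heuristic scores exceeds epsilon.
def should_split(best_gain, second_gain, value_range, delta, n):
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as O(1/sqrt(n)), so a node eventually accumulates enough examples to make a confident split decision.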
ETPL Discovering Characterizations of the Behavior of Anomalous Subpopulations DM-095
Abstract: We consider the problem of discovering attributes, or properties, accounting for the a priori
stated abnormality of a group of anomalous individuals (the outliers) with respect to an overall given population (the inliers). To this aim, we introduce the notion of exceptional property and define the concept of exceptionality score, which measures the significance of a property. In particular, in order
Abstract: In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-
scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads contain the information content that is the target of forum crawlers. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths connected by specific URL types that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL-type recognition problem, and we show how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying FoCUS to more than 100 community Question and Answer sites and blog sites demonstrated that the concept of implicit navigation path could apply to other social media sites.
ETPL Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy DM-097
Abstract: Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, PMI_max, that augments PMI with information about a word's number of senses. The coefficients of PMI_max are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation.
We show that it outperforms traditional PMI in the application of automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMI_max achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating data set.
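Plain PMI, which PMI_max augments, can be sketched from corpus counts as below; the sense-weighted coefficients of PMI_max itself are fit empirically in the paper and are not reproduced here:

```python
import math

def pmi(cooc, count_x, count_y, total):
    """Pointwise mutual information from corpus counts: log2 of how much
    more often x and y co-occur than expected under independence."""
    p_xy = cooc / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# A pair co-occurring 10x more than chance scores log2(10);
# an independent pair scores 0.
```

Note PMI's known bias: rare word pairs can receive inflated scores, which is part of what motivates augmenting it with polysemy estimates.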
ETPL DM-098
Abstract: Nonnegative matrix factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains a parts-based representation, correspondingly enhancing the interpretability of the problem. This survey paper mainly focuses on the theoretical research into NMF over the last five years, in which the principles, basic models, properties, and algorithms of NMF, along with its various modifications, extensions, and generalizations, are summarized systematically. The existing NMF algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF), Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles, characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed comprehensively. Some related work that is not on NMF itself, but from which NMF should learn or with which it has connections, is also covered. Moreover, some open issues that remain to be solved are discussed. Several relevant application areas of NMF are also briefly described. This survey aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
ETPL DM-100
Abstract: In large databases, there may exist critical nuggets: small collections of records or instances that contain domain-specific important information. This information can be used for future decision making, such as labeling critical, unlabeled data records and improving classification results by reducing false positive and false negative errors. This work introduces the idea of critical nuggets, proposes an innovative domain-independent method to measure criticality, suggests a heuristic to reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from some real-world data sets. It seems that only a few subsets may qualify to be critical nuggets, underscoring the importance of finding them. The proposed methodology can detect them. This work also identifies certain properties of critical nuggets and provides experimental validation of the
we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of the q-grams from the strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the
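The min-wise signature idea described above can be illustrated on plain strings. This toy sketch estimates the set resemblance of two q-gram sets; the hash scheme and signature length are arbitrary choices, and none of the MHR-tree machinery is shown:

```python
import hashlib

def qgrams(s, q=2):
    """The set of q-grams (length-q substrings) of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minwise_signature(s, q=2, k=16):
    """k min-wise hash values over the string's q-gram set."""
    grams = qgrams(s, q)
    return [min(int(hashlib.md5(f"{i}:{g}".encode()).hexdigest(), 16)
                for g in grams)
            for i in range(k)]

def estimated_resemblance(sig_a, sig_b):
    """Fraction of agreeing signature positions, which approximates the
    Jaccard resemblance of the underlying q-gram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because signatures of string unions can be combined by taking position-wise minima, an index node can summarize all strings below it in constant space, which is what enables pruning.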
Abstract: In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which
is based on support vector domain description and support vector clustering. In the proposed algorithm, the data elements of a stream are mapped into a kernel space, and the support vectors are used as the summary information of the historical elements to construct cluster boundaries of arbitrary shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained, each describing the corresponding data domain presented in the data stream. By allowing for bounded support vectors (BSVs), the proposed SVStream algorithm is capable of identifying overlapping clusters. A BSV decaying mechanism is designed to automatically detect and remove outliers (noise). We perform experiments over synthetic and real data streams, with the overlapping, evolving, and noise situations taken into consideration. Comparison results with state-of-the-art data stream clustering methods demonstrate the effectiveness and efficiency of the proposed method.
ETPL DM-105
Abstract: A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric
uses three fundamental operations as building blocks: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series. A Move operation changes the value of a single element, a Split operation converts a single element into two consecutive elements, and a Merge operation merges two consecutive elements into one. Each operation has an associated cost, and the MSM distance between two time series is defined as the cost of the cheapest sequence of operations that transforms the first time series into the second one. An efficient, quadratic-time algorithm is provided for computing the MSM distance. MSM has the desirable properties of being a metric, in contrast to the Dynamic Time Warping (DTW) distance, and of being invariant to the choice of origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments with public time series data sets demonstrate that MSM is a meaningful distance measure that often leads to a lower nearest-neighbor classification error rate than DTW and ERP.
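The quadratic-time dynamic program for MSM can be sketched as follows. The recurrence follows the operation definitions above; the Split/Merge cost constant c is a tunable parameter, and 0.1 is an arbitrary illustrative choice:

```python
def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance via a quadratic-time dynamic program."""
    def split_merge_cost(new, left, right):
        # Cost c if the new value lies between its two neighbours,
        # otherwise c plus the distance to the nearer neighbour.
        if left <= new <= right or left >= new >= right:
            return c
        return c + min(abs(new - left), abs(new - right))

    m, n = len(x), len(y)
    D = [[0.0] * n for _ in range(m)]
    D[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        D[i][0] = D[i - 1][0] + split_merge_cost(x[i], x[i - 1], y[0])
    for j in range(1, n):
        D[0][j] = D[0][j - 1] + split_merge_cost(y[j], x[0], y[j - 1])
    for i in range(1, m):
        for j in range(1, n):
            D[i][j] = min(
                D[i - 1][j - 1] + abs(x[i] - y[j]),                   # Move
                D[i - 1][j] + split_merge_cost(x[i], x[i - 1], y[j]),  # Split/Merge
                D[i][j - 1] + split_merge_cost(y[j], x[i], y[j - 1]),
            )
    return D[m - 1][n - 1]
```

Identical series have distance zero, and the recurrence is symmetric in its two arguments, consistent with MSM being a metric.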
ETPL DM-106
Abstract: As an important operation for finding existing relevant patents and validating a new patent
application, patent search has attracted considerable attention recently. However, many users have limited knowledge about the underlying patents, and they have to use a try-and-see approach to repeatedly issue different queries and check answers, which is a very tedious process. To address this
Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications such as intrusion or credit card fraud detection require an effective and efficient framework to identify deviated data instances. However, most anomaly detection methods are typically implemented in batch mode, and thus cannot be easily extended to large-scale problems without sacrificing computation and memory requirements. In this paper, we propose an online oversampling principal component analysis (osPCA) algorithm to address this problem, and we aim at detecting the presence of outliers from a large amount of data via an online
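The oversampling idea can be sketched in a batch (not online) form: duplicate one target instance, recompute the dominant principal direction, and score the instance by how far that direction rotates. This is a simplified illustration of the principle rather than the paper's online algorithm, and the oversampling ratio is an assumed parameter:

```python
import numpy as np

def ospca_score(data, idx, ratio=0.1):
    """Oversampling-PCA outlier score sketch: normal points barely move
    the principal direction when duplicated; outliers rotate it."""
    X = data - data.mean(axis=0)
    u = np.linalg.svd(X, full_matrices=False)[2][0]   # leading principal direction
    extra = np.repeat(data[idx:idx + 1], max(1, int(ratio * len(data))), axis=0)
    Xo = np.vstack([data, extra])
    Xo = Xo - Xo.mean(axis=0)
    v = np.linalg.svd(Xo, full_matrices=False)[2][0]
    return 1.0 - abs(np.dot(u, v))                    # 1 - |cos(angle)|
```

The absolute value of the dot product makes the score insensitive to the sign ambiguity of singular vectors.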
Abstract: Comparing one thing with another is a typical part of the human decision-making process.
However, it is not always easy to know what to compare and what the alternatives are. In this paper, we present a novel way to automatically mine comparable entities from the comparative questions that users post online, to address this difficulty. To ensure high precision and high recall, we develop a weakly supervised bootstrapping approach for comparative question identification and comparable entity extraction by leveraging a large collection of online question archives. The experimental results show that our method achieves an F1-measure of 82.5 percent in comparative question identification and 83.3 percent in comparable entity extraction. Both significantly outperform an existing state-of-the-art method. Additionally, our ranking results show high relevance to users' comparison intents on the web.
ETPL Cross-Space Affinity Learning with Its Application to Movie Recommendation DM-112
Abstract: In this paper, we propose a novel cross-space affinity learning algorithm over different
spaces with heterogeneous structures. Unlike most affinity learning algorithms, which operate on a homogeneous space, we construct a cross-space tensor model to learn the affinity measures on heterogeneous spaces subject to a set of order constraints from the training pool. We further enhance the model with a factorization form, which greatly reduces the number of parameters of the model with a controlled complexity. Moreover, from the practical perspective, we show that the proposed factorized cross-space tensor model can be efficiently optimized by a series of simple quadratic optimization problems in an iterative manner. The proposed cross-space affinity learning algorithm can be applied to many real-world problems that involve multiple heterogeneous data objects defined over different spaces. In this paper, we apply it to a recommendation system to measure the affinity between users and product items, where a higher affinity means a higher rating of the user on the product. For an empirical evaluation, a widely used benchmark movie recommendation data set, MovieLens, is used to compare the proposed algorithm with other state-of-the-art recommendation algorithms, and we show that very competitive results can be obtained.
ETPL DM-113
Abstract: We introduce a distributed method for detecting distance-based outliers in very large data
sets. Our approach is based on the concept of an outlier detection solving set [2], which is a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed schema exhibits excellent performance. From the theoretical point of view, for common settings, the temporal cost of our algorithm is expected to be at least three orders of
Abstract: Users of databases that are hosted on shared servers cannot take for granted that their queries
will not be disclosed to unauthorized parties. Even if the database is encrypted, an adversary who is monitoring the I/O activity on the server may still be able to infer some information about a user query. For the particular case of a B+-tree that has its nodes encrypted, we identify properties that enable the ordering among the leaf nodes to be deduced. These properties allow us to construct adversarial algorithms to recover the B+-tree structure from the I/O traces generated by range queries. Combining this structure with knowledge of the key distribution (or the plaintext database itself), the adversary can infer the selection range of user queries. To counter the threat, we propose a privacy-enhancing PB+-tree index which ensures that there is high uncertainty about what data the user has worked on, even to a knowledgeable adversary who has observed numerous query executions. The core idea in the PB+-tree is to conceal the order of the leaf nodes in an encrypted B+-tree. In particular, it groups the nodes of the tree into buckets, and employs homomorphic encryption techniques to prevent the adversary from pinpointing the exact nodes retrieved by range queries. The PB+-tree can be tuned to balance its privacy strength with the computational and I/O overheads incurred. Moreover, it can be adapted to protect access privacy in cases where the attacker additionally knows a priori the access frequencies of key values. Experiments demonstrate that the PB+-tree effectively impairs the adversary's ability to recover the B+-tree structure and deduce the query ranges in all considered scenarios.
ETPL DM-115
Abstract: Hidden Markov models (HMMs) are used to analyze real-world problems. We consider an
approach that constructs minimum-entropy HMMs directly from a sequence of observations. If an insufficient amount of observation data is used to generate the HMM, the model will not represent the underlying process. Current methods assume that the observations completely represent the underlying process. It is often the case that the training data size is not large enough to adequately capture all statistical dependencies in the system. It is, therefore, important to know the statistical significance level with which the constructed model represents the underlying process, not only the training set. In this paper, we present a method to determine whether the observation data and constructed model fully express the underlying process with a given level of statistical significance. We use the statistics of the process to calculate an upper bound on the number of samples required to guarantee that the
Abstract: Multiple kernel learning (MKL) is a promising family of machine learning algorithms using
multiple kernel functions for various challenging data mining tasks. Conventional MKL methods often formulate the problem as an optimization task of learning the optimal combination of both kernels and classifiers, which usually results in challenging optimization tasks that are difficult to solve. Different from the existing MKL methods, in this paper we investigate a boosting framework of MKL for classification tasks; i.e., we adopt boosting to solve a variant of the MKL problem, which avoids solving the complicated optimization tasks. Specifically, we present a novel framework of Multiple Kernel Boosting (MKBoost), which applies the idea of boosting techniques to learn kernel-based classifiers with multiple kernels for classification problems. Based on the proposed framework, we propose several variants of MKBoost algorithms and extensively examine their empirical performance on a number of benchmark data sets in comparison with various state-of-the-art MKL algorithms on classification tasks. Experimental results show that the proposed method is more effective and efficient than the existing MKL techniques.
ETPL DM-118
Abstract: Order-preserving submatrices (OPSMs) have been shown useful in capturing concurrent
patterns in data when the relative magnitudes of data items are more important than their exact values. For instance, in analyzing gene expression profiles obtained from microarray experiments, the relative magnitudes are important both because they represent the change of gene activities across the experiments, and because there is typically a high level of noise in data that makes the exact values
Abstract: We propose a probabilistic topic model for analyzing and extracting content-related
annotations from noisy annotated discrete data, such as webpages stored using social bookmarking services. With these services, because users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e., not content related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.
ETPL DM-120
Multiparty Access Control for Online Social Networks: Model and Mechanisms
Online social networks (OSNs) have experienced tremendous growth in recent years and have become a de facto portal for hundreds of millions of Internet users. These OSNs offer attractive means for digital social interactions and information sharing, but also raise a number of security and privacy issues. While OSNs allow users to restrict access to shared data, they currently do not provide any mechanism to enforce privacy concerns over data associated with multiple users. To this end, we propose an approach to enable the protection of shared data associated with multiple users in OSNs. We formulate an access control model to capture the essence of multiparty authorization requirements, along with a multiparty policy specification scheme and a policy enforcement mechanism. In addition, we present a logical representation of our access control model that allows us to leverage the features of existing logic solvers to perform various analysis tasks on our model. We also discuss a proof-of-concept prototype of our approach as part of an application in Facebook and provide a usability study and system evaluation of our method.
ETPL DM-121
randomization. The goal is to examine the strengths and weaknesses of randomization and explore both the potential and the pitfalls of high-dimensional randomization. Our theoretical analysis results in a number of interesting and insightful conclusions. 1) The privacy effects of randomization reduce rapidly with increasing dimensionality. 2) The properties of the underlying data set can affect the anonymity level of the randomization method. For example, natural properties of real data sets such as clustering improve the effectiveness of randomization. On the other hand, variations in data density of nonempty data localities and outliers create privacy preservation challenges for the randomization method. 3) The use of a public information-sensitive attack method makes the choice of perturbing distribution more critical than previously thought. In particular, Gaussian perturbations are significantly more effective than uniformly distributed perturbations for the high dimensional case. These insights are very useful for future research and design of the randomization method. We use the insights gained from our analysis to discuss and suggest future research directions for improvements and extensions of the randomization method.
ETPL DM-122
A fundamental data integration task faced by online commercial portals and commerce search engines is the integration of products coming from multiple providers to their product catalogs. In this scenario, the commercial portal has its own taxonomy (the master taxonomy), while each data provider organizes its products into a different taxonomy (the provider taxonomy). In this paper, we consider the problem of categorizing products from the data providers into the master taxonomy, while making use of the provider taxonomy information. Our approach is based on a taxonomy-aware processing step that adjusts the results of a text-based classifier to ensure that products that are close together in the provider taxonomy remain close in the master taxonomy. We formulate this intuition as a structured prediction optimization problem. To the best of our knowledge, this is the first approach that leverages the structure of taxonomies in order to enhance catalog integration. We propose algorithms that are scalable and thus applicable to the large data sets that are typical on the web. We evaluate our algorithms on real-world data and we show that taxonomy-aware classification provides a significant improvement over existing approaches.
ETPL The Skyline of a Probabilistic Relation DM-123
Abstract: In a deterministic relation R, tuple u dominates tuple v if u is no worse than v on all the
attributes of interest, and better than v on at least one attribute. This concept is at the heart of skyline queries, which return the set of undominated tuples in R. In this paper, we extend the notion of skyline to probabilistic relations by generalizing the definition of tuple domination to this context. Our approach is parametric in the semantics for linearly ranking probabilistic tuples and, being based on order-theoretic principles, preserves the three fundamental properties the skyline has in the deterministic case: 1) it equals the union of all top-1 results of monotone scoring functions; 2) it
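In the deterministic case, the two definitions above translate directly into code (assuming, for illustration, that smaller attribute values are better); the probabilistic extension and the ranking semantics discussed in the paper are not shown:

```python
def dominates(u, v):
    """u dominates v: no worse on every attribute (smaller is better here)
    and strictly better on at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(tuples):
    """The undominated tuples of a deterministic relation."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]
```

This naive skyline scan is quadratic in the number of tuples; practical systems use sorting- or partitioning-based algorithms instead.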
Existing models for document summarization mostly use the similarity between sentences in the document to extract the most salient sentences. The documents as well as the sentences are indexed using traditional term indexing measures, which do not take the context into consideration. Therefore, the sentence similarity values remain independent of the context. In this paper, we propose a context-sensitive document indexing model based on the Bernoulli model of randomness. The Bernoulli model of randomness has been used to find the probability of the co-occurrence of two terms in a large corpus. A new approach using the lexical association between terms to give a context-sensitive weight to the document terms has been proposed. The resulting indexing weights are used to compute the sentence similarity matrix. The proposed sentence similarity measure has been used with the baseline graph-based ranking models for sentence extraction. Experiments have been conducted over the benchmark DUC data sets, and it has been shown that the proposed Bernoulli-based sentence similarity model provides consistent improvements over the baseline IntraLink and UniformLink methods [1].
ETPL A Segmentation and Graph-Based Video Sequence Matching Method for Video Copy Detection DM-126
Abstract: In this paper, we propose a segmentation and graph-based video sequence matching method for video copy detection. Specifically, due to the good stability and discriminative ability of local features, we use the SIFT descriptor for video content description. However, matching based on the SIFT descriptor is computationally expensive owing to the large number of points and their high dimensionality. Thus, to reduce the computational complexity, we first use a dual-threshold method to segment the videos into segments with homogeneous content and extract keyframes from each segment. SIFT features are extracted from the keyframes of the segments. Then, we propose an SVD-based method to match two video frames with SIFT point-set descriptors. To obtain the video sequence matching result, we propose a graph-based method. It converts video sequence matching into finding the longest path in the frame matching-result graph under a time constraint. Experimental results demonstrate that the segmentation and graph-based video sequence matching method can detect video copies effectively. The proposed method has further advantages: it can automatically find the optimal sequence matching result from disordered matching results based on spatial features, it can reduce the noise caused by spatial feature matching, and it is adaptive to video frame rate changes. Experimental results also demonstrate that the proposed method achieves a better tradeoff between the effectiveness and the efficiency of video copy detection.
ETPL DM-127
Automatic classification of sentiment is important for numerous applications such as opinion mining, opinion summarization, contextual advertising, and market analysis. Typically, sentiment classification has been modeled as the problem of training a binary classifier using reviews annotated for positive or negative sentiment. However, sentiment is expressed differently in different domains, and annotating corpora for every possible domain of interest is costly. Applying a sentiment classifier trained using labeled data for a particular domain to classify the sentiment of user reviews on a different domain often results in poor performance, because words that occur in the train (source) domain might not appear in the test (target) domain. We propose a method to overcome this problem in cross-domain sentiment classification. First, we create a sentiment-sensitive distributional thesaurus using labeled data for the source domains and unlabeled data for both source and target domains. Sentiment sensitivity is achieved in the thesaurus by incorporating document-level sentiment labels in the context vectors used as the basis for measuring the distributional similarity between words. Next, we use the created thesaurus to expand feature vectors at train and test time in a binary classifier. The proposed method significantly outperforms numerous baselines and returns results that are comparable with previously proposed cross-domain sentiment classification methods on a benchmark data set containing Amazon user reviews for different types of products. We conduct an extensive empirical analysis of the proposed method on single- and multisource domain adaptation, unsupervised and supervised domain adaptation, and numerous similarity measures for creating the sentiment-sensitive thesaurus. Moreover, our comparisons against SentiWordNet, a lexical resource for word polarity, show that the created sentiment-sensitive thesaurus accurately captures
learning. Most existing clustering methods, for example, normalized cut and $k$-means, suffer from the fact that their optimization normally leads to an NP-hard problem because the elements of the cluster indicator matrix must be discrete. A practical way to cope with this problem is to relax the constraint and allow the elements to take continuous values. Eigenvalue decomposition can then be applied to generate a continuous solution, which must be further discretized. However, the continuous solution is likely to be mixed-signed. This may cause it to deviate severely
ease future access is a common task. The usual largeness of these collections makes manual organization a vast and expensive endeavor. As an approach to producing an effective automated classification of resources, we consider the immense amounts of annotations provided by users of social tagging systems in the form of bookmarks. In this paper, we deal with the use of these user-provided tags to perform a social classification of resources. For this purpose, we have created three large-scale social tagging data sets that include tagging data for different types of resources: webpages and books. These resources are accompanied by categorization data from sound expert-driven taxonomies. We analyze the characteristics of the three social tagging systems and study how useful social tags are for producing a social classification of resources that resembles the expert classification as closely as possible. We analyze six different tag-based representations and compare them to other data sources using three different settings of SVM classifiers. Finally, we explore combinations of different data sources with tags, using classifier committees to best classify the resources.
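The classifier-committee idea mentioned at the end can be reduced to a majority vote over member classifiers, each trained on a different data source (e.g. tags vs. page content). The member classifiers below are stand-in functions, not the SVMs used in the study.

```python
# Minimal sketch of a classifier committee: each member votes on a
# resource's category and the majority label wins. Members are stand-ins.

from collections import Counter

def committee_predict(resource, classifiers):
    """classifiers: callables mapping a resource to a category label."""
    votes = Counter(clf(resource) for clf in classifiers)
    return votes.most_common(1)[0][0]   # majority (plurality) label
```

Real committees often weight members by validation accuracy; plain majority voting is the simplest baseline.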
ETPL DM-134
Abstract: Query processing over uncertain data streams, in particular top-$k$ query processing, has become increasingly important due to its wide application in fields such as sensor network monitoring and Internet traffic control. In many real applications, multiple top-$k$ queries are registered in the system. Sharing the results of these queries is a key factor in saving computation cost and providing real-time responses. However, owing to the complex semantics of uncertain top-$k$ query processing, it is nontrivial to implement sharing among different top-$k$ queries, and few works have addressed this sharing issue. In this paper, we formulate various types of sharing among multiple top-$k$ queries over uncertain data streams based on the frequency upper bound of each top-$k$ query. We present an optimal dynamic programming solution, as well as a more efficient (in terms of time and space complexity) greedy algorithm, to compute the execution plan that saves computation cost across the queries. Experiments demonstrate that the greedy algorithm finds the optimal solution in most cases and achieves almost the same performance (in terms of latency and throughput) as the dynamic programming approach.
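The core of result sharing can be illustrated in its simplest form: if the top-$k$ list is computed once for the largest requested $k$, every registered query with a smaller $k$ can be answered from a prefix of that list instead of being re-executed. This deliberately ignores the paper's uncertain semantics and frequency-upper-bound plan selection; it only shows why sharing saves work.

```python
# Simplified sharing sketch: one execution serves all registered top-k
# queries. Deterministic scores are an assumption for illustration.

import heapq

def answer_topk_queries(scores, ks):
    """scores: item scores seen so far; ks: k values of registered queries."""
    kmax = max(ks)
    shared = heapq.nlargest(kmax, scores)    # single execution, largest k
    return {k: shared[:k] for k in ks}       # smaller queries share a prefix
```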
ETPL DM-135
Abstract: The discovery of HTML query forms is one of the main challenges in Deep web crawling.
Automatic solutions to this problem perform two main tasks. The first is locating HTML forms on the web, which is done using traditional or focused crawlers. The second is identifying
networks allow users to publish details about themselves and to connect to their friends. Some of the information revealed inside these networks is meant to be private, yet it is possible to use learning algorithms on released data to predict private information. In this paper, we explore how to launch inference attacks using released social networking data to predict private information. We then devise three possible sanitization techniques that could be used in various situations, explore their effectiveness, and attempt to use methods of collective inference to discover sensitive attributes of the data set. We show that we can decrease the effectiveness of both local and relational classification algorithms by using the sanitization methods we described.
ETPL DM-137
Principal Composite Kernel Feature Analysis: Data-Dependent Kernel Approach
Abstract: Principal composite kernel feature analysis (PC-KFA) is presented to show kernel adaptations for nonlinear features of medical image data sets (MIDS) in computer-aided diagnosis (CAD). The proposed algorithm extends existing studies on kernel feature analysis (KFA), which extracts salient features from a sample of unclassified patterns by means of a kernel method. The principal composite process of PC-KFA has been applied to kernel principal component analysis [34] and to our previously developed accelerated kernel feature analysis [20]. Unlike other kernel-based feature selection algorithms, PC-KFA iteratively constructs a linear subspace of a high-dimensional feature space by maximizing a variance condition for the nonlinearly transformed samples, which we call a data-dependent kernel approach. The resulting kernel subspace is first chosen by principal component analysis and then processed into a composite kernel subspace through the efficient combination representations used for further reconstruction and classification.
Numerical experiments based on several MID feature spaces of cancer CAD data have shown that PC-KFA generates an efficient and effective feature representation and yields better classification performance for the proposed composite kernel subspace using a simple pattern classifier.
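The composite-kernel ingredient of PC-KFA can be sketched as a weighted sum of base kernel (Gram) matrices, producing one data-dependent kernel for downstream kernel PCA or feature analysis. The base kernels and weights below are illustrative assumptions, not the paper's learned combination.

```python
# Hedged sketch: a composite Gram matrix as a weighted sum of base kernels.

import math

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel on two feature tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly(x, y, degree=2):
    """Inhomogeneous polynomial kernel."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** degree

def composite_gram(X, kernels, weights):
    """Weighted combination of base kernel matrices over samples X."""
    n = len(X)
    return [[sum(w * k(X[i], X[j]) for k, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]
```

A nonnegative weighted sum of valid kernels is itself a valid kernel, which is what makes this combination safe to feed into kernel PCA.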
networks, are characterized by continuous data streaming from multiple sources and by intermediate processing by multiple aggregators. Keeping track of data provenance in such a highly dynamic context is an important requirement, since data provenance is a key factor in assessing data trustworthiness, which is crucial for many applications. Provenance management for streaming data requires addressing several challenges, including the assurance of high processing throughput, low bandwidth consumption, storage efficiency, and secure transmission. In this paper, we propose a novel approach to securely transmit provenance for streaming data (focusing on sensor networks) by embedding provenance into the interpacket timing domain while addressing the above-mentioned issues. Because provenance is hidden in another host medium, our solution can be conceptualized as a watermarking technique. However, unlike traditional watermarking approaches, we embed provenance over the interpacket delays (IPDs) rather than in the sensor data themselves, thereby avoiding the problem of data degradation due to watermarking. Provenance is extracted by the data receiver using an optimal threshold-based mechanism that minimizes the probability of provenance decoding errors. The resiliency of the scheme against outside and inside attackers is established through an extensive security analysis. Experiments show that our technique can recover provenance up to a certain level against perturbations to interpacket timing characteristics.
ETPL DM-140
The Adaptive Clustering Method for the Long Tail Problem of Recommender Systems
long tail have only a few ratings, which makes them hard to use in recommender systems. The approach presented in this paper clusters items according to their popularity, so that recommendations for tail items are based on the ratings in more intensively clustered groups, while those for head items are based on the ratings of individual items or of groups clustered to a lesser extent. We apply this method to two real-life data sets and compare the results with those of the non-grouping and fully grouped methods in terms of recommendation accuracy and scalability. The results show that, if such adaptive clustering is done properly, the method reduces recommendation error rates for tail items while maintaining reasonable computational performance.
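The head/tail split can be illustrated with a two-level version of the idea: items with enough ratings keep their individual estimates, while sparsely rated tail items are pooled and share a group estimate. The threshold and the single pooled tail group are simplifying assumptions; the paper clusters at multiple levels of granularity.

```python
# Minimal head/tail sketch: popular items get individual average ratings,
# tail items share a pooled group average. Threshold value is an assumption.

def item_predictions(ratings, tail_threshold=3):
    """ratings: item -> list of observed ratings."""
    tail = [r for rs in ratings.values()
            if len(rs) < tail_threshold for r in rs]
    tail_avg = sum(tail) / len(tail) if tail else None
    preds = {}
    for item, rs in ratings.items():
        if len(rs) >= tail_threshold:
            preds[item] = sum(rs) / len(rs)   # head: individual estimate
        else:
            preds[item] = tail_avg            # tail: pooled group estimate
    return preds
```

Pooling trades a little bias for much lower variance on items whose own rating counts are too small to estimate reliably.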
ETPL DM-141
Abstract: Similarity joins play an important role in many application areas, such as data integration
and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
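The non-overlapping chunking idea can be sketched with a toy boundary rule: split a string right after any character in a boundary set (a stand-in for the paper's tail-restricted chunk boundary dictionary), then prune string pairs that share too few chunks before running an exact edit-distance check. The vowel boundary set and the count threshold are illustrative assumptions, not VChunkJoin's actual scheme or filter bounds.

```python
# Illustrative non-overlapping chunking with a toy boundary set, plus a
# simple shared-chunk candidate filter. Scheme details are assumptions.

def chunk(s, boundaries=set("aeiou")):
    """Split s into non-overlapping chunks, each ending at a boundary char."""
    chunks, start = [], 0
    for i, ch in enumerate(s):
        if ch in boundaries:
            chunks.append(s[start:i + 1])
            start = i + 1
    if start < len(s):
        chunks.append(s[start:])      # trailing chunk with no boundary
    return chunks

def shares_enough_chunks(s, t, min_common=1):
    """Candidate filter: prune pairs sharing fewer distinct chunks."""
    return len(set(chunk(s)) & set(chunk(t))) >= min_common
```

Because chunks do not overlap, each string produces far fewer signatures than overlapping q-grams would, which is the source of the space savings the abstract reports.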