Singapore | Madurai | Chennai | Trichy | Coimbatore | Cochin | Ramnad | Pondicherry | Trivandrum | Salem | Erode | Tirunelveli http://www.elysiumtechnologies.com, info@elysiumtechnologies.com
13 Years of Experience | Automated Services | 24/7 Help Desk Support | Experienced & Expert Developers | Advanced Technologies & Tools | Legitimate Member of all Journals | 1,50,000 Successful Records in all Languages | More than 12 Branches in Tamilnadu, Kerala & Karnataka | Ticketing & Appointment Systems | Individual Care for every Student | Around 250 Developers & 20 Researchers
227-230 Church Road, Anna Nagar, Madurai 625020. 0452-4390702, 4392702, + 91-9944793398. info@elysiumtechnologies.com, elysiumtechnologies@gmail.com
S.P.Towers, No.81 Valluvar Kottam High Road, Nungambakkam, Chennai - 600034. 044-42072702, +91-9600354638, chennai@elysiumtechnologies.com
15, III Floor, SI Towers, Melapudur main Road, Trichy 620001. 0431-4002234, + 91-9790464324. trichy@elysiumtechnologies.com
577/4, DB Road, RS Puram, Opp to KFC, Coimbatore 641002 0422- 4377758, +91-9677751577. coimbatore@elysiumtechnologies.com
1st Floor, A.R. IT Park, Rasi Color Scan Building, Ramanathapuram - 623501. 04567-223225, +91-9677704922. ramnad@elysiumtechnologies.com
Plot No: 4, C Colony, P&T Extension, Perumal puram, Tirunelveli - 627007. 0462-2532104, +91-9677733255. tirunelveli@elysiumtechnologies.com
74, 2nd Floor, K.V.K Complex, Upstairs Krishna Sweets, Mettur Road, Opp. Bus Stand, Erode - 638011. 0424-4030055, +91-9677748477. erode@elysiumtechnologies.com
No: 88, First Floor, S.V. Patel Salai, Pondicherry - 605001. 0413-4200640, +91-9677704822. pondy@elysiumtechnologies.com
TNHB A-Block, D.No. 10, Opp: Hotel Ganesh, Near Bus Stand, Salem - 636007. 0427-4042220, +91-9894444716. salem@elysiumtechnologies.com
ETPL DM-016 A Generalized Flow-Based Method for Analysis of Implicit Relationships on Wikipedia
Abstract: We focus on measuring relationships between pairs of objects in Wikipedia, whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist in Wikipedia: an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and cocitation. We propose a new method using a generalized maximum flow which reflects all three factors and does not underestimate objects having high degree. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain how mining elucidatory objects opens a novel way to deeply understand a relationship.
ETPL DM-017 A Proxy-Based Approach to Continuous Location-Based Spatial Queries in Mobile Environments
Abstract: Caching valid regions of spatial queries at mobile clients is effective in reducing the number of queries submitted by mobile clients and the query load on the server. However, mobile clients suffer from longer waiting times while the server computes valid regions. We propose in this paper a proxy-based approach to continuous nearest-neighbor (NN) and window queries. The proxy creates estimated valid regions (EVRs) for mobile clients by exploiting the spatial and temporal locality of spatial queries. For NN queries, we devise two new algorithms to accelerate EVR growth, leading the proxy to build effective EVRs even when the cache size is small. On the other hand, we propose to represent the EVRs of window queries in the form of vectors, called estimated window vectors (EWVs), to achieve larger estimated valid regions. This novel representation and the associated
to control the messages posted on their own private space and to prevent unwanted content from being displayed. Up to now, OSNs have provided little support for this requirement. To fill the gap, in this paper we propose a system allowing OSN users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system that allows users to customize the filtering criteria to be applied to their walls, and a machine-learning-based soft classifier that automatically labels messages in support of content-based filtering.
ETPL DM-020
Abstract: In this paper, we propose a new type of Dictionary-based Entity Recognition Problem,
named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but its many redundancies make the AME process inefficient and deteriorate the performance of real-world applications using the extracted substrings. The AML problem targets locating non-overlapped substrings, which are a better approximation of the true matched substrings, without generating overlapped redundancies. In order to perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune.
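The non-overlap requirement in AML can be illustrated with a naive greedy baseline: scan left to right, always take the longest dictionary entry matching at the current position, and skip past it. This sketch uses exact matching only; the paper's contribution, approximate matching with P-Prune-style candidate pruning, is not reproduced here, and the function name is illustrative.

```python
def locate_members(text, dictionary):
    """Greedy left-to-right localization of non-overlapping dictionary
    substrings (exact matching only; a toy baseline, not P-Prune)."""
    # Prefer longer entries at each position to avoid fragmenting matches.
    entries = sorted(dictionary, key=len, reverse=True)
    hits, i = [], 0
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                hits.append((i, entry))
                i += len(entry)          # skip past the match: no overlaps
                break
        else:
            i += 1                       # no entry starts here; advance
    return hits

print(locate_members("data mining and text mining",
                     {"data mining", "mining", "text"}))
# → [(0, 'data mining'), (16, 'text'), (21, 'mining')]
```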
We study the problem of privacy-preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm which is based on sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction.
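A minimal sketch of the clustering idea behind such anonymization, under a much cruder grouping rule than the sequential-clustering (Sq) algorithm above (grouping by ascending degree is an invented stand-in): nodes are packed into clusters of at least k members, and only cluster sizes and cluster-level edge counts are published, so individual links stay hidden inside clusters.

```python
def cluster_anonymize(adj, k):
    """Toy k-anonymization of a graph by clustering: publish only cluster
    memberships and cluster-level edge counts (a drastic simplification,
    not the paper's Sq algorithm)."""
    nodes = sorted(adj, key=lambda v: len(adj[v]))      # ascending degree
    clusters = [nodes[i:i + k]
                for i in range(0, len(nodes) - len(nodes) % k, k)]
    if len(nodes) % k:                  # fold the remainder into the last cluster
        clusters[-1].extend(nodes[-(len(nodes) % k):])
    label = {v: c for c, cl in enumerate(clusters) for v in cl}
    counts = {}
    for v, nbrs in adj.items():
        for u in nbrs:
            if v < u:                   # count each undirected edge once
                key = tuple(sorted((label[v], label[u])))
                counts[key] = counts.get(key, 0) + 1
    return clusters, counts
```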
ETPL DM-022 Clustering Large Probabilistic Graphs
Abstract: We study the problem of clustering probabilistic graphs. Similar to the problem of clustering
standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
ETPL DM-023
Abstract: Detecting the intrinsic loop structures of a data manifold is a necessary first step for the proper
employment of manifold learning techniques, and is of fundamental importance in discovering the essential representational features underlying data that lie on a loopy manifold. An effective strategy is proposed to solve this problem in this study. In line with our intuition, a formal definition of a loop residing on a manifold is first given. Based on this definition, theoretical properties of loopy manifolds are rigorously derived. In particular, a necessary and sufficient condition for detecting
Abstract: This paper introduces a platform for online Sensitivity Analysis (SA) that is applicable in
large scale real-time data acquisition (DAQ) systems. Here, we use the term real-time in the context of a system that has to respond to externally generated input stimuli within a finite and specified period. Complex industrial systems such as manufacturing, healthcare, transport, and finance require high-quality information on which to base timely responses to events occurring in their volatile environments. The motivation for the proposed EventTracker platform is the assumption that modern industrial systems are able to capture data in real-time and have the necessary technological flexibility to adjust to changing system requirements. The flexibility to adapt can only be assured if data is succinctly interpreted and translated into corrective actions in a timely manner. An important factor that facilitates data interpretation and information modeling is an appreciation of the effect system inputs have on each output at the time of occurrence. Many existing sensitivity analysis methods appear to hamper efficient and timely analysis due to a reliance on historical data, or sluggishness in providing a timely solution that would be of use in real-time applications. This inefficiency is further compounded by computational limitations and the complexity of some existing models. In dealing with real-time event-driven systems, the underpinning logic of the proposed method is based on the assumption that in the vast majority of cases changes in input variables will trigger events. Every single or combination of events could subsequently result in a change to the system state. The proposed event tracking sensitivity analysis method describes variables and the system state as a collection of events. The higher the numeric occurrence of an input variable at the trigger level during an event monitoring interval, the greater is its impact on the final analysis of the system state.
Experiments were designed to compare the proposed event tracking sensitivity analysis method with a comparable method (that of Entropy). An improvement of 10 percent in computational efficiency without loss of accuracy was observed. The comparison also showed that the time taken to perform the sensitivity analysis was 0.5 percent of that required when using the comparable Entropy-based method.
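The trigger-counting idea above can be sketched in a few lines, under the strong simplifying assumption that "impact" is just the normalized count of threshold crossings per input variable during a monitoring interval; variable names and thresholds below are invented for illustration, and this is not the EventTracker implementation.

```python
def event_sensitivity(samples, thresholds):
    """Sketch of trigger-level sensitivity analysis: the more often an input
    variable reaches its trigger threshold during the monitoring interval,
    the larger its estimated impact on the system state."""
    counts = {name: 0 for name in thresholds}
    for sample in samples:                          # one dict of readings per tick
        for name, level in thresholds.items():
            if sample.get(name, 0) >= level:        # trigger-level occurrence
                counts[name] += 1
    total = sum(counts.values()) or 1               # avoid division by zero
    return {name: c / total for name, c in counts.items()}
```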
ETPL DM-025 Fast Activity Detection: Indexing for Temporal Stochastic Automaton-Based
Abstract: Today, numerous applications require the ability to monitor a continuous stream of fine-grained data for the occurrence of certain high-level activities. A number of computerized systems, including ATM networks, web servers, and intrusion detection systems, systematically track every
issues in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in labeling effort. There has therefore been increasing interest both in active discovery: to identify new classes quickly, and active learning: to train classifiers with minimal supervision. These goals occur together in practice and are intrinsically related because examples of each class are required to train a classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data mining for rare classes in new domains makes inefficient use of human supervision. Developing active learning algorithms to optimise both rare class discovery and classification simultaneously is challenging because discovery and classification have conflicting requirements in query criteria. In this paper, we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them by adapting query criteria online; and a classifier combination algorithm that switches generative and discriminative classifiers as learning progresses. Extensive evaluation on a batch of standard UCI and vision data sets demonstrates the superiority of this approach over existing methods.
ETPL DM-027
clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or execution time. Halite's strengths are that it is fast and scalable, while still giving highly accurate results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in time and space with regard to the data size and dimensionality, and the dimensionality of the clusters' subspaces; 2) Usability: it is deterministic, robust to noise, does not take the number of clusters as an input parameter, and detects clusters in subspaces generated by original axes or by their linear combinations, including space rotation; 3) Effectiveness: it is accurate, providing results with equal or better quality compared to top related works; and 4) Generality: it includes a soft clustering approach. Experiments on synthetic data ranging from five to 30 axes and up to 1 million points were performed. Halite was on average at least 12 times faster than seven representative works, and always presented highly accurate results. On real data, Halite was at least 11 times faster than others, improving their accuracy by up to 35 percent. Finally, we report experiments in a real scenario where soft clustering is desirable.
ETPL DM-028
Abstract: We introduce the problem of k-pattern set mining, concerned with finding a set of k related
patterns under constraints. This contrasts with regular pattern mining, where one searches for many individual patterns. The k-pattern set mining problem is a very general problem that can be instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning, redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and at the global level, that is, on the overall pattern set. Building general solvers for the pattern set mining problem remains a challenge. Here, we investigate to what extent constraint programming (CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to constraints currently available in CP. This allows us to investigate a large number of settings within a unified framework and to gain insight into the possibilities and limitations of these solvers. This is important as it allows us to create guidelines on how to model new problems successfully and how to model existing problems more efficiently. It also opens the way for other solver technologies.
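The local/global constraint split can be illustrated by brute force, assuming one concrete local constraint (minimum support per pattern) and one concrete global constraint (the k patterns jointly cover every transaction); a CP solver as proposed in the paper would prune this search rather than materialize it, and both helper names are illustrative.

```python
from itertools import combinations

def frequent_patterns(transactions, min_support, max_len=2):
    """All itemsets up to max_len meeting the local min-support constraint."""
    items = sorted({i for t in transactions for i in t})
    out = []
    for n in range(1, max_len + 1):
        for c in combinations(items, n):
            p = frozenset(c)
            if sum(p <= t for t in transactions) >= min_support:
                out.append(p)
    return out

def k_pattern_sets(transactions, k, min_support):
    """Brute-force k-pattern set mining: keep size-k sets of frequent
    patterns that jointly cover every transaction (global constraint)."""
    pats = frequent_patterns(transactions, min_support)
    return [set(ps) for ps in combinations(pats, k)
            if all(any(p <= t for p in ps) for t in transactions)]
```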
ETPL DM-029
Abstract: The World Wide Web includes semantic relations of numerous types that exist among
different entities. Extracting the relations that exist between two entities is an important step in various Web-related tasks such as information retrieval (IR), information extraction, and social network extraction. A supervised relation extraction system that is trained to extract a particular relation type (source relation) might not accurately extract a new type of relation (target relation) for which it has not been trained. However, it is costly to create training data manually for every new relation type that one might want to extract. We propose a method to adapt an existing relation
We propose a novel method for automatic annotation, indexing and annotation-based retrieval of images. The new method, that we call Markovian Semantic Indexing (MSI), is presented in the context of an online image retrieval system. Assuming such a system, the users' queries are used to construct an Aggregate Markov Chain (AMC) through which the relevance between the keywords seen by the system is defined. The users' queries are also used to automatically annotate the images. A stochastic distance between images, based on their annotation and the keyword relevance captured in the AMC, is then introduced. Geometric interpretations of the proposed distance are provided and its relation to a clustering in the keyword space is investigated. By means of a new measure of Markovian state similarity, the mean first cross passage time (CPT), optimality properties of the proposed distance are proved. Images are modeled as points in a vector space and their similarity is measured with MSI. The new method is shown to possess certain theoretical advantages and also to achieve better Precision versus Recall results when compared to Latent Semantic Indexing (LSI) and probabilistic Latent Semantic Indexing (pLSI) methods in Annotation-Based Image Retrieval (ABIR) tasks.
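The aggregate Markov chain at the core of MSI can be approximated very roughly: count transitions between consecutive keywords in user queries, then normalize each row into transition probabilities read as keyword-to-keyword relevance. This sketch stops there; the paper's stochastic image distance, CPT-based similarity measure, and optimality results are not reproduced, and the query format is assumed.

```python
from collections import defaultdict

def keyword_relevance(query_logs):
    """Build a toy aggregate Markov chain over keywords: count transitions
    between consecutive keywords of each query and normalize rows into
    transition probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for query in query_logs:                    # each query: list of keywords
        for a, b in zip(query, query[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nbrs.values()) for b, n in nbrs.items()}
            for a, nbrs in counts.items()}
```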
ETPL DM-031
Abstract: Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain
billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also furnished with tremendous amounts of product-related images. In addition, images in such social networks are also accompanied by annotations, comments, and other information, thus forming heterogeneous image-rich information networks. In this paper, we introduce the concept of (heterogeneous) image-rich information network and the problem of how to perform information
query character by character. We study how to support search-as-you-type on data residing in a relational DBMS. We focus on how to support this type of search using the native database language, SQL. A main challenge is how to leverage existing database functionalities to meet the high-performance requirement of achieving an interactive speed. We study how to use auxiliary indexes stored as tables to increase search performance. We present solutions for both single-keyword queries and multikeyword queries, and develop novel techniques for fuzzy search using SQL by allowing mismatches between query keywords and answers. We present techniques to answer first-N queries and discuss how to support updates efficiently. Experiments on large, real data sets show that our techniques enable a DBMS on a commodity computer to support search-as-you-type on tables with millions of records.
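A toy version of the auxiliary-index idea, assuming SQLite and invented table names: every prefix of every searchable value is materialized into an indexed table, so each keystroke becomes a single SQL equality lookup. The paper's actual techniques (fuzzy matching, multikeyword queries, first-N answers, updates) go well beyond this sketch.

```python
import sqlite3

# Auxiliary prefix table stored alongside the data, so search-as-you-type
# runs entirely in SQL. Schema and names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE prefixes (prefix TEXT, id INTEGER)")
db.execute("CREATE INDEX idx_prefix ON prefixes (prefix)")

for pid, name in [(1, "anderson"), (2, "andrews"), (3, "baker")]:
    db.execute("INSERT INTO people VALUES (?, ?)", (pid, name))
    for n in range(1, len(name) + 1):       # index every prefix of the name
        db.execute("INSERT INTO prefixes VALUES (?, ?)", (name[:n], pid))

def as_you_type(q):
    rows = db.execute(
        "SELECT p.name FROM prefixes x JOIN people p ON p.id = x.id "
        "WHERE x.prefix = ? ORDER BY p.name", (q,))
    return [r[0] for r in rows]

print(as_you_type("and"))   # ['anderson', 'andrews']
```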
ETPL DM-033
Abstract: Pruning achieves the dual goal of reducing the complexity of the final hypothesis for improved comprehensibility, and improving its predictive accuracy by minimizing the overfitting due to noisy data. This paper presents a new hybrid pruning technique for rule induction, as well as an incremental post-pruning technique based on a misclassification tolerance. Although both have been designed for RULES-7, the latter is also applicable to any rule induction algorithm in general. A thorough empirical evaluation reveals that the proposed techniques enable RULES-7 to outperform other state-of-the-art classification techniques. The improved classifier is also more accurate and up to two orders of magnitude faster than before.
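Misclassification-tolerance post-pruning can be sketched generically, independent of RULES-7: drop a rule condition whenever doing so admits no more than a tolerated number of extra misclassifications. The rule and example encodings below are assumptions made for illustration, not the paper's representation.

```python
def postprune(rule, examples, tolerance):
    """Drop conditions from a rule (list of (attribute, value) tests
    predicting class True) while the extra misclassifications stay
    within `tolerance`. A generic sketch of the idea only."""
    def errors(conds):
        covered = [ex for ex in examples
                   if all(ex[a] == v for a, v in conds)]
        return sum(1 for ex in covered if not ex["class"])

    pruned = list(rule)
    for cond in list(rule):
        trial = [c for c in pruned if c != cond]
        if errors(trial) - errors(pruned) <= tolerance:
            pruned = trial              # condition adds little: drop it
    return pruned
```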
ETPL DM-034 λ-diverse nearest neighbors browsing for multidimensional data
Abstract: Traditional search methods try to obtain the most relevant information and rank it according
to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of applications since results very similar to each other cannot capture all aspects of the queried topic. In this paper, we focus on the lambda-diverse k-nearest neighbor search problem on spatial and
than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study, kappa is calculated from the 2×2 correct/wrong contingency matrix. We derive a lower bound on kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments with real data show that there is unoccupied feasible space on the diagram corresponding to (hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the ensemble accuracy.
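Both diagram axes can be computed directly from the 2×2 contingency counts. The kappa formula below is the standard chance-corrected agreement applied to correct/wrong decisions, which is one common reading of this setup; it is a sketch, not the paper's derivation.

```python
def pairwise_kappa(a, b, c, d):
    """Kappa between two ensemble members from their 2x2 correct/wrong
    contingency counts: a = both correct, b = only the first correct,
    c = only the second correct, d = both wrong."""
    n = a + b + c + d
    observed = (a + d) / n                                       # agreement seen
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2  # chance agreement
    return (observed - expected) / (1 - expected)

def averaged_error(a, b, c, d):
    """Mean individual error of the pair: the y-axis of the kappa-error diagram."""
    n = a + b + c + d
    return 1 - ((a + b) + (a + c)) / (2 * n)
```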
ETPL DM-036 A New Algorithm for Inferring User Search Goals with Feedback Sessions
Abstract: For a broad-topic and ambiguous query, different users may have different search goals
when they submit it to a search engine. The inference and analysis of user search goals can be very useful in improving search engine relevance and user experience. In this paper, we propose a novel approach to infer user search goals by analyzing search engine query logs. First, we propose a framework to discover different user search goals for a query by clustering the proposed feedback sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Second, we propose a novel approach to generate pseudo-documents to better represent the feedback sessions for clustering. Finally, we propose a new criterion, Classified Average Precision (CAP), to evaluate the performance of inferring user search goals. Experimental results are presented using user click-through logs from a commercial search engine to validate the effectiveness of our proposed methods.
ETPL DM-037
search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine-processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
ETPL DM-038 Building a Scalable Database-Driven Reverse Dictionary
Abstract: In this paper, we describe the design and implementation of a reverse dictionary. Unlike a
traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes a user input phrase describing the desired concept and returns a set of candidate words that satisfy the input phrase. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. We present a set of algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and the response-time performance of our implementation. Our experimental results show that our approach can provide significant improvements in performance at scale without sacrificing the quality of the result. Our experiments comparing the quality of our approach to that of currently available reverse dictionaries show that our approach can provide significantly higher quality than either of the other currently available implementations.
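The core of a reverse dictionary can be sketched as an inverted index from definition terms to headwords, ranking candidate words by how many terms of the input phrase their definitions share. The scalable, database-driven design described above goes far beyond this toy version; the sample entries are invented.

```python
from collections import defaultdict

def build_reverse_index(dictionary):
    """Invert word -> definition into definition-term -> {words}."""
    index = defaultdict(set)
    for word, definition in dictionary.items():
        for term in definition.lower().split():
            index[term].add(word)
    return index

def reverse_lookup(index, phrase, top=3):
    """Rank headwords by the number of phrase terms their definitions share."""
    scores = defaultdict(int)
    for term in phrase.lower().split():
        for word in index.get(term, ()):
            scores[word] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked][:top]
```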
ETPL DM-039
Abstract: Frequent itemset mining is a widely used exploratory technique that focuses on discovering
recurrent correlations among data. The steadfast evolution of markets and business environments prompts the need for data mining algorithms to discover significant correlation changes, in order to reactively suit product and service provision to customer needs. Change mining, in the context of frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets from one time period to another. The discovery of frequent generalized itemsets, i.e., itemsets that 1) frequently occur in the source data and 2) provide a high-level abstraction of the mined knowledge, raises new challenges in the analysis of itemsets that become rare, and thus are no longer extracted, from a certain point onward. This paper proposes a novel kind of dynamic pattern, namely the History GENeralized Pattern (HIGEN), which represents the evolution of an itemset in consecutive time periods by reporting information about its frequent generalizations characterized by minimal redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period.
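Change mining over frequent itemsets can be illustrated by mining each time period separately and diffing the results. The HIGEN's extra step, climbing a taxonomy to report a minimally generalized ancestor that remains frequent, is omitted from this sketch, and the exhaustive miner below is only workable for tiny data.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Exhaustively enumerate itemsets meeting min_support (toy-sized data only)."""
    items = sorted({i for t in transactions for i in t})
    freq = set()
    for n in range(1, len(items) + 1):
        for c in combinations(items, n):
            if sum(set(c) <= t for t in transactions) >= min_support:
                freq.add(frozenset(c))
    return freq

def itemset_changes(period1, period2, min_support):
    """Report itemsets frequent in one period but not the other."""
    f1, f2 = (frequent_itemsets(p, min_support) for p in (period1, period2))
    return {"became_infrequent": f1 - f2, "newly_frequent": f2 - f1}
```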
(see, e.g., Wikipedia [1]). In cases in which data/knowledge quality and reliability are crucial, proposals of update/insertion/deletion need to be evaluated by experts. To the best of our knowledge, no theoretical framework has been devised to model the semantics of update proposal/evaluation in the relational context. Since time is an intrinsic part of most domains (as well as of the proposal/evaluation process itself), semantic approaches to temporal relational databases (specifically, the Bitemporal Conceptual Data Model (henceforth, BCDM) [2]) are the starting point of our approach. In this paper, we propose BCDMPV, a semantic temporal relational model that extends BCDM to deal with multiple update/insertion/deletion proposals and with acceptances/rejections of the proposals themselves. We propose a theoretical framework, defining the new data structures, manipulation operations and temporal relational algebra, and proving some basic properties, namely that BCDMPV is a consistent extension of BCDM and that it is reducible to BCDM. These properties ensure consistency with most relational temporal database frameworks, facilitating implementations.
ETPL DM-041
Abstract: Designing well-structured websites to facilitate effective user navigation has long been a
challenge. A primary reason is that the web developers' understanding of how a website should be structured can be considerably different from that of the users. While various methods have been proposed to relink webpages to improve navigability using user navigation data, the completely reorganized new structure can be highly unpredictable, and the cost of disorienting users after the changes remains unanalyzed. This paper addresses how to improve a website without introducing substantial changes. Specifically, we propose a mathematical programming model to improve the user navigation on a website while minimizing alterations to its current structure. Results from extensive tests conducted on a publicly available real data set indicate that our model not only significantly improves the user navigation with very few changes, but also can be effectively solved. We have also tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we define two evaluation metrics and use them to assess the performance of the improved website using the real data set. Evaluation results confirm that the user navigation on the improved structure is
set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical one-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail.
ETPL DM-043 Modeling and Solving Distributed Configuration Problems: A CSP-Based Approach
Abstract: Product configuration can be defined as the task of tailoring a product according to the specific needs of a customer. Due to the inherent complexity of this task, which for example includes the consideration of complex constraints or the automatic completion of partial configurations, various Artificial Intelligence techniques have been explored in the last decades to tackle such configuration problems. Most of the existing approaches adopt a single-site, centralized approach. In modern supply chain settings, however, the components of a customizable product may themselves be configurable, thus requiring a multisite, distributed approach.
In this paper, we analyze the challenges of modeling and solving such distributed configuration problems and propose an approach based on Distributed Constraint Satisfaction. In particular, we advocate the use of Generative Constraint Satisfaction for knowledge modeling and show in an experimental evaluation that the use of generic constraints is particularly advantageous also in the distributed problem solving phase.
ETPL DM-044 On Similarity Preserving Feature Selection
Abstract: In the literature of feature selection, different criteria have been proposed to evaluate the
goodness of features. In our investigation, we notice that a number of existing selection criteria implicitly select features that preserve sample similarity, and can be unified under a common framework. We further point out that any feature selection criteria covered by this framework cannot handle redundant features, a common drawback of these criteria. Motivated by these observations, we
social science research and business analysis. Recently, researchers have developed privacy models similar to k-anonymity to prevent node reidentification through structure information. However, even when these privacy models are enforced, an attacker may still be able to infer one's private information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words, the label-node relationship is not well protected by pure structure anonymization methods. Furthermore, existing approaches, which rely on edge editing or node clustering, may significantly alter key graph properties. In this paper, we define a k-degree-l-diversity anonymity model that considers the protection of structural information as well as sensitive labels of individuals. We further propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm that adds noise nodes into the original graph with the consideration of introducing the least distortion to graph properties. Most importantly, we provide a rigorous analysis of the theoretical bounds on the number of noise nodes added and their impact on an important graph property. We conduct extensive experiments to evaluate the effectiveness of the proposed technique.
ETPL DM-046
Abstract: The current trend for building an ontology-based data management system (DMS) is to
capitalize on efforts made to design a preexisting well-established DMS (a reference system). The method amounts to extracting from the reference DMS a piece of schema relevant to the new application needs (a module), possibly personalizing it with extra constraints w.r.t. the application under construction, and then managing a data set using the resulting schema. In this paper, we extend the existing definitions of modules and we introduce novel properties of robustness that provide means for easily checking that a robust module-based DMS evolves safely w.r.t. both the schema and the data of the reference DMS. We carry out our investigations in the setting of description logics, which underlie modern ontology languages like RDFS, OWL, and OWL2 from W3C. Notably, we focus on the DL-liteA dialect of the DL-lite family, which encompasses the foundations of the QL
As online social networking emerges, there has been increased interest in utilizing the underlying network structure as well as the available information on social peers to better serve the information needs of a user. In this paper, we focus on improving the performance of information collection from the neighborhood of a user in a dynamic social network. We introduce sampling-based algorithms to efficiently explore a user's social network respecting its structure and to quickly approximate quantities of interest. We introduce and analyze variants of the basic sampling scheme exploring correlations across our samples. Models of centralized and distributed social networks are considered. We show that our algorithms can be utilized to rank items in the neighborhood of a user, assuming that information for each user in the network is available. Using real and synthetic data sets, we validate the results of our analysis and demonstrate the efficiency of our algorithms in approximating quantities of interest. The methods we describe are general and can probably be easily adopted in a variety of strategies aiming to efficiently collect information from a social graph.
ETPL DM-048
Supporting Flexible, Efficient, and User-Interpretable Retrieval of Similar Time Series
Abstract: Supporting decision making in domains in which the observed phenomenon dynamics have to be dealt with can greatly benefit from the retrieval of past cases, provided that proper representation and retrieval techniques are implemented. In particular, when the parameters of interest take the form of time series, dimensionality reduction and flexible retrieval have to be addressed to this end. Classical methodological solutions proposed to cope with these issues, typically based on mathematical transforms, are characterized by strong limitations, such as a difficult interpretation of retrieval results for end users, reduced flexibility and interactivity, or inefficiency.
In this paper, we describe a novel framework, in which time-series features are summarized by means of Temporal Abstractions, and then retrieved resorting to abstraction similarity. Our approach grants interpretability of the output results, and understandability of the (user-guided) retrieval process. In particular, multilevel abstraction mechanisms and proper indexing techniques are provided, for flexible query issuing, and efficient and interactive query answering. Experimental results have shown the efficiency of our approach in a scalability test, and its superiority with respect to the use of a classical mathematical technique in flexibility, user friendliness, and also quality of results.
ETPL DM-049
The Minimum Consistent Subset Cover Problem: A Minimization View of Data Mining
Abstract: In this paper, we introduce and study the minimum consistent subset cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has minimum clique partition (MCP), a dual problem of
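Since MCSC generalizes classical set covering, its simplest special case (every given subset is "consistent") can be attacked with the standard greedy approximation. A minimal Python sketch of that baseline (function names are illustrative, not from the paper):

```python
def greedy_cover(universe, subsets):
    """Greedy approximation for the classical set covering problem,
    the special case of MCSC where every given subset is consistent.
    Repeatedly pick the subset covering the most uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        if not uncovered & set(best):
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= set(best)
    return chosen
```

The greedy rule gives the well-known ln(n)-factor guarantee for set cover; MCSC itself additionally requires generating the consistent subsets.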
Abstract: The problem of multilabel classification has attracted great interest in the last decade, where
each instance can be assigned with a set of multiple class labels simultaneously. It has a wide variety of real-world applications, e.g., automatic image annotation and gene function analysis. Current research on multilabel classification focuses on supervised settings which assume the existence of large amounts of labeled training data. However, in many applications, the labeling of multilabeled data is extremely expensive and time consuming, while there are often abundant unlabeled data available. In this paper, we study the problem of transductive multilabel learning and propose a novel solution, called Transductive Multilabel Classification (TraM), to effectively assign a set of multiple labels to each instance. Different from supervised multilabel learning methods, we estimate the label sets of the unlabeled instances effectively by utilizing the information from both labeled and unlabeled data. We first formulate the transductive multilabel learning as an optimization problem of estimating label concept compositions. Then, we derive a closed-form solution to this optimization problem and propose an effective algorithm to assign label sets to the unlabeled instances. Empirical studies on several real-world multilabel learning tasks demonstrate that our TraM method can effectively boost the performance of multilabel classification by using both labeled and unlabeled data.
ETPL DM-051
A Method for Mining Infrequent Causal Associations and Its Application in Finding Adverse Drug Reaction Signal Pairs
Abstract: In many real-world applications, it is important to mine causal relationships where an event or event pattern causes certain outcomes with low probability. Discovering this kind of causal relationship can help us prevent or correct negative outcomes caused by their antecedents.
In this paper, we propose an innovative data mining framework and apply it to mine potential causal associations in electronic patient data sets where the drug-related events of interest occur infrequently. Specifically, we created a novel interestingness measure, exclusive causal-leverage, based on a computational, fuzzy recognition-primed decision (RPD) model that we previously developed. On the basis of this new measure, a data mining algorithm was developed to mine the causal relationship between drugs and their associated adverse drug reactions (ADRs). The algorithm was tested on real patient data retrieved from the Veterans Affairs Medical Center in Detroit, Michigan. The retrieved
significant challenges on both modeling similarity between uncertain objects and developing efficient computational methods. The previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this
Abstract: SUM queries are crucial for many applications that need to deal with uncertain data. In this
paper, we are interested in the queries, called ALL_SUM, that return all possible sum values and their probabilities. In general, there is no efficient solution for the problem of evaluating ALL_SUM queries. But, for many practical applications, where aggregate values are small integers or real numbers with small precision, it is possible to develop efficient solutions. In this paper, based on a recursive approach, we propose a new solution for those applications. We implemented our solution and conducted an extensive experimental evaluation over synthetic and real-world data sets; the results show its effectiveness.
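When the aggregate values are small integers, the full sum distribution over independent uncertain attributes can be built exactly by repeated convolution, which is the spirit of the recursive approach the abstract describes. A minimal Python sketch (names are illustrative, not the paper's algorithm):

```python
from collections import defaultdict

def all_sum(tuples):
    """Exact ALL_SUM over independent uncertain attributes.

    Each tuple is a dict {value: probability}. Returns a dict
    {sum_value: probability} covering all possible sums, built up
    one tuple at a time by convolving distributions."""
    dist = {0: 1.0}
    for t in tuples:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for v, q in t.items():
                nxt[s + v] += p * q
        dist = dict(nxt)
    return dist
```

With values bounded by a small constant, the distribution size stays small, which is exactly the practical regime the abstract identifies as tractable.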
ETPL DM-055
Abstract: Service composition is emerging as an effective vehicle for integrating existing web services
to create value-added and personalized composite services. As web services with similar functionality are expected to be provided by competing providers, a key challenge is to find the best web services to participate in the composition. When multiple quality aspects (e.g., response time, fee, etc.) are considered, a weighting mechanism is usually adopted by most existing approaches, which requires users to specify their preferences as numeric values. We propose to exploit the dominance relationship among service providers to find a set of best possible composite services, referred to as a composite service skyline. We develop efficient algorithms that allow us to find the composite service skyline from a significantly reduced search space instead of considering all possible service compositions. We propose a novel bottom-up computation framework that enables the skyline algorithm to scale well with the number of services in a composition. We conduct a comprehensive analytical and experimental study to evaluate the effectiveness, efficiency, and scalability of the composite skyline computation approaches.
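The dominance relationship underlying a skyline can be sketched in a few lines. Assuming QoS vectors where lower is better in every dimension (e.g., response time, fee), a service dominates another if it is no worse everywhere and strictly better somewhere; the skyline keeps the non-dominated vectors. A basic block-nested-loops version (not the paper's bottom-up framework):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every QoS dimension
    (lower is better) and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Block-nested-loops skyline: keep points not dominated by any other."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

The paper's contribution is avoiding this quadratic scan over all possible compositions; the sketch only defines the answer being computed.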
ETPL DM-056
Abstract: A spatial colocation pattern is a group of spatial features whose instances are frequently
located together in geographic space. Discovering colocations has many useful applications. For
preferences in the form of concepts by mining their clickthrough data. Due to the importance of location information in mobile search, PMSE classifies these concepts into content concepts and location concepts. In addition, users' locations (positioned by GPS) are used to supplement the location concepts in PMSE. The user preferences are organized in an ontology-based, multifacet user profile, which is used to adapt a personalized ranking function for rank adaptation of future search results. To characterize the diversity of the concepts associated with a query and their relevance to the user's need, four entropies are introduced to balance the weights between the content and location
Abstract: Skyline query processing for location-based services, which considers both spatial and
nonspatial attributes of the objects being queried, has recently received increasing attention. Existing solutions focus on solving point- or line-based skyline queries, in which the query location is an exact location point or a line segment. However, due to privacy concerns and the limited precision of localization devices, the input of a user location is often a spatial range. This paper studies a new problem of how to process such range-based skyline queries. Two novel algorithms are proposed: one is index-based (I-SKY) and the other is not based on any index (N-SKY). To handle frequent movements of the objects being queried, we also propose incremental versions of I-SKY and N-SKY, which avoid recomputing the query index and results from scratch. Additionally, we develop efficient solutions for probabilistic and continuous range-based skyline queries. Experimental results show that our proposed algorithms significantly outperform the baseline algorithm that adopts the existing line-based skyline solution. Moreover, the incremental versions of I-SKY and N-SKY save substantial computation cost, especially when the objects move frequently.
ETPL DM-060
Abstract: We assume a data set that is vertically decomposed among several servers, and a client that
wishes to compute the skyline by obtaining the minimum number of points. Existing solutions for this problem are restricted to the case where each server maintains exactly one dimension. This paper proposes a general solution for vertical decompositions of arbitrary dimensionality. We first investigate some interesting problem characteristics regarding the pruning power of points. Then, we introduce vertical partition skyline (VPS), an algorithmic framework that includes two steps. Phase 1 searches for an anchor point P_anc that dominates, and hence eliminates, a large number of records. Starting with P_anc, Phase 2 constructs incrementally a pruning area using an interesting union-intersection property of dominance regions. Servers do not transmit points that fall within the pruning area in their local subspace. Our experiments confirm the effectiveness of the proposed methods under various settings.
ETPL DM-061
Abstract: Time series is an important form of data available in numerous applications and often
contains vast amounts of private personal information. The need to protect privacy in time-series data while effectively supporting complex queries on them poses nontrivial challenges to the database community. We study the anonymization of time series while trying to support complex queries, such as range and pattern matching queries, on the published data. The conventional k-anonymity model cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel anonymization model called (k, P)-anonymity for pattern-rich time series. This model publishes both the attribute values and the patterns of time series in separate data forms. We demonstrate that our model can prevent linkage attacks on the published data while effectively supporting a wide variety of queries on the anonymized data. We propose two algorithms to enforce (k, P)-anonymity on time-series data. Our anonymity model supports customized data publishing, which allows a certain part of the values but a different part of the pattern of the anonymized time series to be published simultaneously. We present estimation techniques to support query processing on such customized data. The proposed methods are evaluated in a comprehensive experimental study. Our results verify the effectiveness and efficiency of our approach.
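To make the "pattern" component of (k, P)-anonymity concrete, one simple choice (an illustration, not the paper's exact representation) is a per-step trend string, with series grouped so that each published pattern is shared by at least k series:

```python
def pattern(series):
    """Trend pattern of a time series: 'u' (up), 'd' (down), 's' (stable)
    per step -- the pattern component published separately from values."""
    return ''.join('u' if b > a else 'd' if b < a else 's'
                   for a, b in zip(series, series[1:]))

def k_anonymous_groups(series_list, k):
    """Group series sharing an identical pattern; only groups of size
    >= k publish their pattern, so pattern alone cannot single out
    fewer than k individuals."""
    groups = {}
    for s in series_list:
        groups.setdefault(pattern(s), []).append(s)
    return {p: g for p, g in groups.items() if len(g) >= k}
```

Real (k, P)-anonymity also generalizes values and patterns jointly; the sketch shows only the grouping intuition.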
ETPL DM-063
Synchronization is a powerful and inherently hierarchical concept regulating a large variety of complex processes ranging from the metabolism in a cell to opinion formation in a group of individuals. Synchronization phenomena in nature have been widely investigated and models
examples. One practically important problem is: can the labeled data from other related sources help predict the target task, even if they have 1) different feature spaces (e.g., image versus text data), 2) different data distributions, and 3) different output spaces? This paper proposes a solution and discusses the conditions where this is highly likely to produce better results. It first unifies the feature spaces of the target and source data sets by spectral embedding, even when they have completely different feature spaces. The principle is to devise an optimization objective that preserves the original structure of the data, while at the same time, maximizes the similarity between the two. A linear projection model, as well as a nonlinear approach, are derived on the basis of this principle in closed form. Second, a judicious sample selection strategy is applied to select only related source examples. Finally, a Bayesian-based approach is applied to model the relationship between different output spaces. The three steps can bridge related heterogeneous sources in order to learn the target task. Among the 20 experimental data sets, for example, images with wavelet-transform-based features are used to predict another set of images whose features are constructed from color-histogram space; documents are used to help image classification, etc. By using these extracted examples from heterogeneous sources, the models can reduce the error rate by as much as 50 percent, compared with the methods using only the examples from the target task.
ETPL DM-065
Tweet Analysis for Real-Time Event Detection and Earthquake Reporting System Development
Abstract: Twitter has received much attention recently. An important characteristic of Twitter is its real-time nature. We investigate the real-time interaction of events such as earthquakes in Twitter and propose an algorithm to monitor tweets and to detect a target event.
To detect a target event, we devise a classifier of tweets based on features such as the keywords in a tweet, the number of words, and their context. Subsequently, we produce a probabilistic spatiotemporal model for the target event
Abstract: The skyline query, aiming at identifying a set of skyline tuples that are not dominated by
any other tuple, is particularly useful for multicriteria data analysis and decision making. For uncertain databases, a probabilistic skyline query, called P-Skyline, has been developed to return skyline tuples by specifying a probability threshold. However, the answer obtained via a P-Skyline query usually includes skyline tuples undesirably dominating each other when a small threshold is specified; or it may contain much fewer skyline tuples if a larger threshold is employed. To address this concern, we propose a new uncertain skyline query, called U-Skyline query, in this paper. Instead of setting a probabilistic threshold to qualify each skyline tuple independently, the U-Skyline query searches for a set of tuples that has the highest probability (aggregated from all possible scenarios) as the skyline answer. In order to answer U-Skyline queries efficiently, we propose a number of optimization techniques for query processing, including 1) computational simplification of U-Skyline probability, 2) pruning of unqualified candidate skylines and early termination of query processing, 3) reduction of the input data set, and 4) partition and conquest of the reduced data set. We perform a
Achieving Data Privacy through Secrecy Views and Null-Based Virtual Updates
Abstract: We may want to keep sensitive information in a relational database hidden from a user or
group thereof. We characterize sensitive data as the extensions of secrecy views. The database, before returning the answers to a query posed by a restricted user, is updated to make the secrecy views
Abstract: Expert search has been studied in different contexts, e.g., enterprises, academic communities.
We examine a general expert search problem: searching experts on the web, where millions of webpages and thousands of names are considered. It has two main challenging issues: 1) webpages can be of varying quality and full of noise; 2) the expertise evidence scattered in webpages is usually vague and ambiguous. We propose to leverage the large amount of co-occurrence information to assess the relevance and reputation of a person name for a query topic. The co-occurrence structure is modeled using a hypergraph, on which a heat-diffusion-based ranking algorithm is proposed. Query keywords are regarded as heat sources, and a person name which has a strong connection with the query (i.e., frequently co-occurs with query keywords and co-occurs with other names related to query keywords) will receive most of the heat, thus being ranked high. Experiments on the ClueWeb09 web collection show that our algorithm is effective for retrieving experts and outperforms baseline algorithms significantly. This work can be regarded as one step toward addressing the more general entity search problem without sophisticated NLP techniques.
ETPL DM-072
Efficient All Top-k Computation: A Unified Solution for All Top-k, Reverse Top-k and Top-m Influential Queries
Abstract: Given a set of objects P and a set of ranking functions F over P, an interesting problem is to compute the top ranked objects for all functions. Evaluation of multiple top-k queries finds application in systems where there is a heavy workload of ranking queries (e.g., online search engines and product recommendation systems). The simple solution of evaluating the top-k queries one-by-one does not scale well; instead, the system can make use of the fact that similar queries share common results to accelerate search. This paper is the first, to our knowledge, thorough study of this problem. We propose methods that compute all top-k queries in batch.
Our first solution applies the block indexed nested loops paradigm, while our second technique is a view-based algorithm. We propose appropriate optimization techniques for the two approaches and demonstrate experimentally that the second approach is consistently the best. Our approach facilitates the evaluation of other complex queries that depend on the computation of multiple top-k queries, such as reverse top-k and top-m influential queries. We show that our batch processing technique for these complex queries outperforms the state-of-the-art by orders of magnitude.
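The problem being batched can be stated in a few lines. Assuming linear ranking functions (score = w · p), the baseline the paper improves on evaluates each function's top-k with a bounded heap in one shared pass over the objects (a sketch, not the paper's indexed or view-based algorithms):

```python
import heapq

def all_top_k(objects, weight_vectors, k):
    """Baseline all top-k evaluation: for each linear ranking function
    (weight vector w, score = w . p), return its k highest-scoring
    objects. One shared pass scores every object for every function."""
    heaps = [[] for _ in weight_vectors]
    for p in objects:
        for h, w in zip(heaps, weight_vectors):
            score = sum(wi * pi for wi, pi in zip(w, p))
            if len(h) < k:
                heapq.heappush(h, (score, p))
            elif score > h[0][0]:
                heapq.heapreplace(h, (score, p))
    return [sorted(h, reverse=True) for h in heaps]
```

This costs O(|P| · |F|) score evaluations; the paper's techniques exploit result sharing between similar functions to do far less work.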
Abstract: Although there is a long line of work on identifying duplicates in relational data, only a few
solutions focus on duplicate detection in more complex hierarchical structures, like XML data. In this paper, we present a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information within the elements, but also the way that information is structured. In addition, to improve the efficiency of the network evaluation, a novel pruning strategy, capable of significant gains over the unoptimized version of the algorithm, is presented. Through experiments, we show that our algorithm is able to achieve high precision and recall scores in several data sets. XMLDup is also able to outperform another state-of-the-art duplicate detection solution, both in terms of efficiency and of effectiveness.
ETPL DM-074
Abstract:
Wireless sensor networks are widely used to continuously collect data from the environment. Because of energy constraints on battery-powered nodes, it is critical to minimize communication. Suppression has been proposed as a way to reduce communication by using predictive models to suppress the reporting of predictable data. However, in the presence of communication failures, missing data are difficult to interpret because they could have been either suppressed or lost in transmission. There is no existing solution for handling failures for general, spatiotemporal suppression that uses cascading. While cascading further reduces communication, it makes failure handling difficult, because nodes can act on incomplete or incorrect information and in turn affect other nodes. We propose a cascaded suppression framework that exploits both temporal and spatial data correlation to reduce communication, and applies coding theory and Bayesian inference to recover missing data resulting from suppression and communication failures. Experimental results show that cascaded suppression significantly reduces communication cost and improves missing data recovery compared to existing approaches.
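The core of temporal suppression is easy to see in isolation. Assuming the simplest predictive model (the sink predicts the last reported value), a node transmits only when its reading drifts beyond a tolerance epsilon, which is a minimal sketch of the idea rather than the paper's cascaded framework:

```python
def suppress(readings, epsilon):
    """Temporal suppression with a last-reported-value model: a node
    transmits a reading only when it deviates from the value the sink
    would predict (the last report) by more than epsilon."""
    reports, last = [], None
    for t, v in enumerate(readings):
        if last is None or abs(v - last) > epsilon:
            reports.append((t, v))
            last = v
    return reports
```

The failure-handling difficulty the abstract describes is visible here: a gap in `reports` is indistinguishable from a lost message without extra redundancy, which is what the paper's coding-theoretic recovery addresses.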
ETPL DM-075
Abstract: Clustering by integrating multiview representations has become a crucial issue for
knowledge discovery in heterogeneous environments. However, most prior approaches assume that the multiple representations share the same dimension, limiting their applicability to homogeneous environments. In this paper, we present a novel tensor-based framework for integrating heterogeneous multiview data in the context of spectral clustering. Our framework includes two novel formulations: multiview clustering based on the integration of the Frobenius-norm objective function (MC-FR-OI) and that based on matrix integration in the Frobenius-norm objective function (MC-FR-MI). We show that the solutions for both formulations can be computed by tensor decompositions. We evaluated our methods on synthetic data and two real-world data sets in comparison with baseline
formation in cooperative games, and to use the Shapley value of the patterns to identify clusters and cluster representatives. We show that the underlying game is convex, and this leads to an efficient biobjective clustering algorithm that we call BiGC. The algorithm yields high-quality clustering with respect to average point-to-center distance (potential) as well as average intracluster point-to-point distance (scatter). We demonstrate the superiority of BiGC over state-of-the-art clustering algorithms (including the center-based and the multiobjective techniques) through detailed experimentation using standard cluster validity criteria on several benchmark data sets. We also show that BiGC satisfies key clustering properties such as order independence, scale invariance, and richness.
ETPL DM-077
On Generalizable Low False-Positive Learning Using Asymmetric Support Vector Machines
Abstract: Support Vector Machines (SVMs) have been widely used for classification due to their ability to give low generalization error. In many practical applications of classification, however, the wrong prediction of a certain class is much more severe than that of the other classes, making the original SVM unsatisfactory. In this paper, we propose the notion of the Asymmetric Support Vector Machine (ASVM), an asymmetric extension of the SVM, for these applications. Different from existing SVM extensions such as thresholding and parameter tuning, ASVM employs a new objective that models the imbalance between the costs of false predictions from different classes in a novel way, such that user tolerance on the false-positive rate can be explicitly specified. Such a new objective formulation allows us to obtain a lower false-positive rate without much degradation of the prediction accuracy or increase in training time. Furthermore, we show that the generalization ability is preserved with the new objective.
We also study the effects of the parameters in the ASVM objective and address some implementation issues related to Sequential Minimal Optimization (SMO) to cope with large-scale data. An extensive simulation is conducted and shows that ASVM is able to yield either a noticeable improvement in performance or a reduction in training time as compared to the prior art.
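For contrast, the thresholding baseline that the abstract distinguishes ASVM from can be sketched directly: given classifier scores on held-out negatives, pick a decision threshold so that at most a target fraction of negatives would be flagged positive (a post-hoc fix, not ASVM's built-in objective; names are illustrative):

```python
def threshold_for_fpr(neg_scores, max_fpr):
    """Pick a decision threshold so that at most a max_fpr fraction of
    negative examples score above it -- the simple thresholding baseline
    that ASVM is contrasted with. Predict positive when score > threshold."""
    s = sorted(neg_scores, reverse=True)
    allowed = int(max_fpr * len(s))  # number of tolerated false positives
    if allowed >= len(s):
        return float('-inf')
    # threshold sits at the first score we are no longer willing to flag
    return s[allowed]
```

ASVM's point is that baking the tolerance into the training objective, rather than moving the threshold afterwards, avoids sacrificing accuracy on the positive class.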
ETPL DM-078
Abstract:
Given a set of spatial points DS, each of which is associated with categorical information, e.g., restaurant, pub, etc., the optimal route query finds the shortest path that starts from the query point (e.g., a home or hotel), and covers a user-specified set of categories (e.g., {pub, restaurant, museum}). The user may also specify partial order constraints between different categories, e.g., a restaurant must be visited before a pub. Previous work has focused on a special case where the query contains the total order of all categories to be visited (e.g., museum →
Abstract: Entity resolution (ER) is the problem of identifying which records in a database refer to the
same entity. In practice, many applications need to resolve large data sets efficiently, but do not require the ER result to be exact. For example, people data from the web may simply be too large to completely resolve with a reasonable amount of work. As another example, real-time applications may not be able to tolerate any ER processing that takes longer than a certain amount of time. This paper investigates how we can maximize the progress of ER with a limited amount of work using “hints,” which give information on records that are likely to refer to the same real-world entity. A hint can be represented in various formats (e.g., a grouping of records based on their likelihood of matching), and ER can use this information as a guideline for which records to compare first. We introduce a family of techniques for constructing hints efficiently and techniques for using the hints to maximize the number of matching records identified using a limited amount of work. Using real data sets, we illustrate the potential gains of our pay-as-you-go approach compared to running ER without using hints.
ETPL DM-080
Single-Database Private Information Retrieval from Fully Homomorphic Encryption
Abstract: Private Information Retrieval (PIR) allows a user to retrieve the i-th bit of an n-bit database without revealing to the database server the value of i. In this paper, we present a PIR protocol with a communication complexity of O(γ log n) bits, where γ is the ciphertext size. Furthermore, we extend the PIR protocol to a private block retrieval (PBR) protocol, a natural and more practical extension of PIR in which the user retrieves a block of bits instead of a single bit. Our protocols are built on state-of-the-art fully homomorphic encryption (FHE) techniques and provide privacy for the user if the underlying FHE scheme is semantically secure.
The total communication complexity of our PBR is O(γ log m + γn/m) bits, where m is the number of blocks. The total computation complexity of our PBR is O(m log m) modular multiplications plus O(n/2) modular additions. In terms of total protocol execution time, our PBR protocol is more efficient than existing PBR protocols, which usually require computing O(n/2) modular multiplications when the size of a block in the database is large and a high-speed
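The selection-vector idea at the heart of many FHE-based PIR constructions can be shown with plaintext arithmetic standing in for the homomorphic operations (this is purely illustrative and offers no privacy as written; under a real scheme each query entry would be an encryption of 0 or 1, so the server cannot see which index is selected):

```python
def pir_query(n, i):
    """Toy selection vector for index i: under a real FHE scheme each
    entry would be an encryption of 0 or 1; here plaintext stands in."""
    return [1 if j == i else 0 for j in range(n)]

def pir_answer(db_bits, query):
    """Server side: inner product of the database with the query.
    Homomorphically, this yields an encryption of db_bits[i] while the
    server learns nothing about i."""
    return sum(b * q for b, q in zip(db_bits, query))
```

The O(γ log n) communication of the paper's protocol comes from compressing this naive n-entry query using the structure of the FHE scheme.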
Abstract: Due to the fast evolution of the information on the Internet, update summarization has
received much attention in recent years. It aims to summarize an evolutionary document collection at the current time, supposing the users have read some related previous documents. In this paper, we propose a graph-ranking-based method. It performs constrained reinforcements on a sentence graph, which unifies previous and current documents, to determine the salience of the sentences. The constraints ensure that the most salient sentences in current documents are updates to previous documents. Since this formulation is NP-hard, we then propose an approximate method, which is polynomial-time solvable. Experiments on the TAC 2008 and 2009 benchmark data sets show the effectiveness and efficiency of our method.
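Unconstrained graph ranking over a sentence similarity graph is the familiar starting point for such methods. A PageRank-style power iteration (a generic stand-in for the constrained reinforcement described above, with illustrative names) looks like:

```python
def rank_sentences(sim, iters=50, d=0.85):
    """PageRank-style salience scores over a sentence similarity graph.

    sim: square similarity matrix (list of lists); each row is
    normalized by its sum so transitions are stochastic. d is the
    usual damping factor."""
    n = len(sim)
    row_sums = [sum(row) or 1.0 for row in sim]
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [(1 - d) / n + d * sum(r[j] * sim[j][i] / row_sums[j]
                                   for j in range(n))
             for i in range(n)]
    return r
```

The paper's method adds constraints on this reinforcement so that highly ranked current-document sentences are also novel relative to the previous documents.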
ETPL DM-084
Abstract: Change detection in streaming data relies on a fast estimation of the probability that the data
in two consecutive windows come from different distributions. Choosing the criterion is one of the multitude of questions that need to be addressed when designing a change detection procedure. This paper gives a log-likelihood justification for two well-known criteria for detecting change in streaming multidimensional data: Kullback-Leibler (K-L) distance and Hotelling's T-square test for equal means (H). We propose a semiparametric log-likelihood criterion (SPLL) for change detection. Compared to the existing log-likelihood change detectors, SPLL trades some theoretical rigor for computational simplicity. We examine SPLL together with K-L and H on detecting induced change on 30 real data sets. The criteria were compared using the area under the respective Receiver Operating Characteristic (ROC) curve (AUC). SPLL was found to be on par with H and better than K-L for the nonnormalized data, and better than both on the normalized data.
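As a concrete instance of the K-L criterion being compared, the divergence between Gaussians fitted to two consecutive 1-D windows has a closed form (this is the textbook Gaussian K-L, not the paper's SPLL):

```python
import math

def gaussian_kl(w1, w2):
    """K-L divergence between Gaussians fitted to two 1-D windows,
    usable as a change score between consecutive windows of a stream."""
    def fit(w):
        m = sum(w) / len(w)
        v = sum((x - m) ** 2 for x in w) / len(w) or 1e-12
        return m, v
    m1, v1 = fit(w1)
    m2, v2 = fit(w2)
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
```

A large score between adjacent windows signals a distribution change; SPLL keeps this log-likelihood flavor while avoiding full density estimation in high dimensions.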
ETPL DM-085
Abstract: Event relations are used in many temporal relational database approaches to represent facts
occurring at time instants. However, to the best of our knowledge, none of these approaches fully copes with the definition of events as provided, e.g., by the “consensus” temporal database glossary. We propose a new approach which overcomes this limitation, allowing one to cope with multiple events occurring in the same temporal granule. This move involves major extensions to current approaches, since indeterminacy about the time and number of occurrences of events needs to be faced. Specifically, we have introduced a new data model and new definitions of relational algebraic operators coping with the above issues, and we have studied their reducibility. Last, but not least, we have shown that our approach can be easily extended to cope with a general form of temporal indeterminacy. Such an extension further increases the applicability of our approach.
Support Vector Machines (SVMs) have been shown to achieve high performance on classification tasks across many domains, and a great deal of work has been dedicated to developing computationally efficient training algorithms for linear SVMs. One approach [1] approximately minimizes risk through use of cutting planes, and is improved by [2], [3]. We build upon this work, presenting a modification to the algorithm developed by Franc and Sonnenburg [2]. We demonstrate empirically that our changes can reduce cutting plane training time by up to 40 percent, and discuss how changes in data sets and parameter settings affect the effectiveness of our method.
ETPL DM-087
Abstract: In this paper, we propose an efficient clustering algorithm that has been applied to the
microaggregation problem. The goal is to partition N given records into clusters, each of them grouping at least K records, so that the sum of the within-partition squared error (SSE) is minimized. We propose a successive Group Selection algorithm that approximately solves the microaggregation problem in O(N^2 log N) time, based on sequential minimization of SSE. Experimental results and comparisons to existing methods with similar computation cost on real and synthetic data sets demonstrate the high performance and robustness of the proposed scheme.
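A minimal 1-D version makes the objective tangible: sort the records, cut them into consecutive groups of at least K, and replace each record by its group centroid, so every published value is shared by at least K records (a fixed-size sketch, not the paper's Group Selection algorithm):

```python
def microaggregate(values, k):
    """Fixed-size microaggregation sketch for 1-D data: sort, cut into
    consecutive groups of at least k records, and replace each record
    by its group centroid. SSE of this replacement is the quantity
    minimized in the microaggregation problem."""
    v = sorted(values)
    groups = [v[i:i + k] for i in range(0, len(v), k)]
    if len(groups) > 1 and len(groups[-1]) < k:  # merge undersized tail
        groups[-2].extend(groups.pop())
    out = []
    for g in groups:
        c = sum(g) / len(g)
        out.extend([c] * len(g))
    return out
```

Choosing where to cut (variable group sizes between K and 2K-1) is what separates good heuristics from this naive chunking.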
ETPL Modeling and Computing Ternary Projective Relations between Regions DM-088
Abstract: We report a corrected version of the algorithms to compute ternary projective relations
between regions that appeared in E. Clementini and R. Billen, "Modeling and computing ternary projective relations between regions," IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 799-814, 2006.
ETPL A Survival Modeling Approach to Biomedical Search Result Diversification Using Wikipedia DM-089
Abstract: In this paper, we propose a survival modeling approach to promoting ranking diversity for biomedical information retrieval. The proposed approach is concerned with finding relevant documents that can deliver more different aspects of a query. First, two probabilistic models derived from survival analysis theory are proposed for measuring aspect novelty. Second, a new method using Wikipedia to detect the aspects covered by retrieved documents is presented. Third, an aspect filter based on a two-stage model is introduced. It ranks the detected aspects in decreasing order of the probability that an aspect is generated by the query. Finally, the relevance and the novelty of retrieved documents are combined at the aspect level for reranking. Experiments conducted on the TREC 2006 and 2007 Genomics collections demonstrate the effectiveness of the proposed approach in promoting ranking diversity for biomedical information retrieval. Moreover, we further evaluate our approach in the web retrieval environment. The evaluation results on the ClueWeb09-T09B collection show that our
First, we combine information-theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints. We then use an alternating expectation maximization (EM) algorithm to optimize the model. We also propose two novel methods to automatically construct and incorporate document and word constraints to support unsupervised constrained clustering: 1) automatically construct document constraints based on overlapping named entities (NEs) extracted by an NE extractor; 2) automatically construct word constraints based on their semantic distance inferred from WordNet. The results of our evaluation over two benchmark data sets demonstrate the superiority of our approaches over a number of existing approaches.
ETPL GC-092
Smartphones are nowadays equipped with a number of sensors, such as WiFi, GPS, and accelerometers. This capability allows smartphone users to easily engage in crowdsourced computing services, which contribute to the solution of complex problems in a distributed manner. In this work, we leverage such a computing paradigm to efficiently solve the following problem: comparing a query trace Q against a crowd of traces generated and stored on distributed smartphones. Our proposed framework, coined SmartTrace+, provides an effective solution without disclosing any part of the crowd traces to the query processor. SmartTrace+ relies on an in-situ data storage
are “on their own” in terms of how they manage this incompleteness. In this paper, we propose the general concept of partial information policy (PIP) operator to handle incompleteness in relational databases. PIP operators build upon preference frameworks for incomplete information, but accommodate different types of incomplete data (e.g., a value exists but is not known; a value does not exist; a value may or may not exist). Different users in the real world have different ways in which they want to handle incompleteness—PIP operators allow them to specify a policy that matches their attitude to risk and their knowledge of the application and how the data was collected. We propose index structures for efficiently evaluating PIP operators and experimentally assess their effectiveness on a real-world airline data set. We also study how relational algebra operators and PIP operators interact with one another.
ETPL Decision Trees for Mining Data Streams Based on the McDiarmid's Bound DM-094
Abstract: In mining data streams, the most popular tool is the Hoeffding tree algorithm. It uses Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature, the same Hoeffding bound was used for any evaluation function (heuristic measure), e.g., information gain or the Gini index. In this paper, it is shown that Hoeffding's inequality is not appropriate for solving the underlying problem. We prove two theorems presenting McDiarmid's bound both for the information gain, used in the ID3 algorithm, and for the Gini index, used in the Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a decision tree learning system, applied to data streams and based on McDiarmid's bound, has the property that its output is nearly identical to that of a conventional learner. These results have a great impact on the state of the art of mining data streams, and the various methods and algorithms developed so far should be reconsidered.
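For reference, the Hoeffding-style split test that the paper critiques can be sketched as follows. The value range and confidence parameters are illustrative; the McDiarmid replacement proved in the paper has a different, measure-specific form that is not reproduced here:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding's bound: with probability 1 - delta, the empirical mean
    of n i.i.d. observations confined to an interval of width value_range
    lies within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# A Hoeffding tree splits a node once the observed gap between the two
# best attributes' heuristic scores exceeds epsilon.
def should_split(best_gain, second_gain, value_range, delta, n):
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as O(1/sqrt(n)), so a node eventually accumulates enough examples to make a confident split decision.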
ETPL Discovering Characterizations of the Behavior of Anomalous Subpopulations DM-095
Abstract: We consider the problem of discovering attributes, or properties, accounting for the a priori
stated abnormality of a group of anomalous individuals (the outliers) with respect to an overall given population (the inliers). To this aim, we introduce the notion of exceptional property and define the concept of exceptionality score, which measures the significance of a property. In particular, in order
Abstract: In this paper, we present Forum Crawler Under Supervision (FoCUS), a supervised web-
scale forum crawler. The goal of FoCUS is to crawl relevant forum content from the web with minimal overhead. Forum threads contain the information content that is the target of forum crawlers. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths connected by specific URL types that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL-type recognition problem, and we show how to learn accurate and effective regular expression patterns of implicit navigation paths from automatically created training sets using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as five annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98 percent effectiveness and 97 percent coverage on a large set of test forums powered by over 150 different forum software packages. In addition, the results of applying FoCUS to more than 100 community Question and Answer sites and blog sites demonstrated that the concept of implicit navigation path could apply to other social media sites.
ETPL Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy DM-097
Abstract: Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, PMI_max, that augments PMI with information about a word's number of senses. The coefficients of PMI_max are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation.
We show that it outperforms traditional PMI in the application of automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMI_max achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating data set.
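Plain PMI, which PMI_max augments, can be sketched from corpus counts as below; the sense-weighted coefficients of PMI_max itself are fit empirically in the paper and are not reproduced here:

```python
import math

def pmi(cooc, count_x, count_y, total):
    """Pointwise mutual information from corpus counts: log2 of how much
    more often x and y co-occur than expected under independence."""
    p_xy = cooc / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# A pair co-occurring 10x more than chance scores log2(10);
# an independent pair scores 0.
```

Note PMI's known bias: rare word pairs can receive inflated scores, which is part of what motivates augmenting it with polysemy estimates.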
ETPL DM-098
Abstract: Nonnegative matrix factorization (NMF), a relatively novel paradigm for dimensionality reduction, has been in the ascendant since its inception. It incorporates the nonnegativity constraint and thus obtains a parts-based representation, correspondingly enhancing the interpretability of the problem. This survey paper mainly focuses on the theoretical research into NMF over the last five years, in which the principles, basic models, properties, and algorithms of NMF, along with its various modifications, extensions, and generalizations, are summarized systematically. The existing NMF algorithms are divided into four categories: Basic NMF (BNMF), Constrained NMF (CNMF), Structured NMF (SNMF), and Generalized NMF (GNMF), upon which the design principles, characteristics, problems, relationships, and evolution of these algorithms are presented and analyzed comprehensively. Some related work that is not on NMF itself, but from which NMF should learn or with which it has connections, is also covered. Moreover, some open issues that remain to be solved are discussed. Several relevant application areas of NMF are also briefly described. This survey aims to construct an integrated, state-of-the-art framework for the NMF concept, from which follow-up research may benefit.
ETPL DM-100
Abstract: In large databases, there may exist critical nuggets: small collections of records or instances that contain domain-specific important information. This information can be used for future decision making, such as labeling critical, unlabeled data records and improving classification results by reducing false positive and false negative errors. This work introduces the idea of critical nuggets, proposes an innovative domain-independent method to measure criticality, suggests a heuristic to reduce the search space for finding critical nuggets, and isolates and validates critical nuggets from some real-world data sets. It seems that only a few subsets may qualify to be critical nuggets, underscoring the importance of finding them. The proposed methodology can detect them. This work also identifies certain properties of critical nuggets and provides experimental validation of the
we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of the q-grams from the strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the
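The min-wise signature idea described above can be illustrated on plain strings. This toy sketch estimates the set resemblance of two q-gram sets; the hash scheme and signature length are arbitrary choices, and none of the MHR-tree machinery is shown:

```python
import hashlib

def qgrams(s, q=2):
    """The set of q-grams (length-q substrings) of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minwise_signature(s, q=2, k=16):
    """k min-wise hash values over the string's q-gram set."""
    grams = qgrams(s, q)
    return [min(int(hashlib.md5(f"{i}:{g}".encode()).hexdigest(), 16)
                for g in grams)
            for i in range(k)]

def estimated_resemblance(sig_a, sig_b):
    """Fraction of agreeing signature positions, which approximates the
    Jaccard resemblance of the underlying q-gram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because signatures of string unions can be combined by taking position-wise minima, an index node can summarize all strings below it in constant space, which is what enables pruning.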
Abstract: In this paper, we propose a novel data stream clustering algorithm, termed SVStream, which
is based on support vector domain description and support vector clustering. In the proposed algorithm, the data elements of a stream are mapped into a kernel space, and the support vectors are used as the summary information of the historical elements to construct cluster boundaries of arbitrary shape. To adapt to both dramatic and gradual changes, multiple spheres are dynamically maintained, each describing the corresponding data domain presented in the data stream. By allowing for bounded support vectors (BSVs), the proposed SVStream algorithm is capable of identifying overlapping clusters. A BSV decaying mechanism is designed to automatically detect and remove outliers (noise). We perform experiments over synthetic and real data streams, with the overlapping, evolving, and noise situations taken into consideration. Comparison results with state-of-the-art data stream clustering methods demonstrate the effectiveness and efficiency of the proposed method.
ETPL DM-105
Abstract: A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric
uses three fundamental operations as building blocks: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series. A Move operation changes the value of a single element, a Split operation converts a single element into two consecutive elements, and a Merge operation merges two consecutive elements into one. Each operation has an associated cost, and the MSM distance between two time series is defined as the cost of the cheapest sequence of operations that transforms the first time series into the second one. An efficient, quadratic-time algorithm is provided for computing the MSM distance. MSM has the desirable properties of being a metric, in contrast to the Dynamic Time Warping (DTW) distance, and of being invariant to the choice of origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments with public time series data sets demonstrate that MSM is a meaningful distance measure that often leads to a lower nearest-neighbor classification error rate than DTW and ERP.
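The quadratic-time dynamic program for MSM can be sketched as follows. The recurrence follows the operation definitions above; the Split/Merge cost constant c is a tunable parameter, and 0.1 is an arbitrary illustrative choice:

```python
def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance via a quadratic-time dynamic program."""
    def split_merge_cost(new, left, right):
        # Cost c if the new value lies between its two neighbours,
        # otherwise c plus the distance to the nearer neighbour.
        if left <= new <= right or left >= new >= right:
            return c
        return c + min(abs(new - left), abs(new - right))

    m, n = len(x), len(y)
    D = [[0.0] * n for _ in range(m)]
    D[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        D[i][0] = D[i - 1][0] + split_merge_cost(x[i], x[i - 1], y[0])
    for j in range(1, n):
        D[0][j] = D[0][j - 1] + split_merge_cost(y[j], x[0], y[j - 1])
    for i in range(1, m):
        for j in range(1, n):
            D[i][j] = min(
                D[i - 1][j - 1] + abs(x[i] - y[j]),                   # Move
                D[i - 1][j] + split_merge_cost(x[i], x[i - 1], y[j]),  # Split/Merge
                D[i][j - 1] + split_merge_cost(y[j], x[i], y[j - 1]),
            )
    return D[m - 1][n - 1]
```

Identical series have distance zero, and the recurrence is symmetric in its two arguments, consistent with MSM being a metric.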
ETPL DM-106
Abstract: As an important operation for finding existing relevant patents and validating a new patent
application, patent search has attracted considerable attention recently. However, many users have limited knowledge about the underlying patents, and they have to use a try-and-see approach to repeatedly issue different queries and check answers, which is a very tedious process. To address this
Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications such as intrusion or credit card fraud detection require an effective and efficient framework to identify deviated data instances. However, most anomaly detection methods are typically implemented in batch mode, and thus cannot be easily extended to large-scale problems without sacrificing computation and memory requirements. In this paper, we propose an online oversampling principal component analysis (osPCA) algorithm to address this problem, and we aim at detecting the presence of outliers from a large amount of data via an online
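The oversampling idea can be sketched in a batch (not online) form: duplicate one target instance, recompute the dominant principal direction, and score the instance by how far that direction rotates. This is a simplified illustration of the principle rather than the paper's online algorithm, and the oversampling ratio is an assumed parameter:

```python
import numpy as np

def ospca_score(data, idx, ratio=0.1):
    """Oversampling-PCA outlier score sketch: normal points barely move
    the principal direction when duplicated; outliers rotate it."""
    X = data - data.mean(axis=0)
    u = np.linalg.svd(X, full_matrices=False)[2][0]   # leading principal direction
    extra = np.repeat(data[idx:idx + 1], max(1, int(ratio * len(data))), axis=0)
    Xo = np.vstack([data, extra])
    Xo = Xo - Xo.mean(axis=0)
    v = np.linalg.svd(Xo, full_matrices=False)[2][0]
    return 1.0 - abs(np.dot(u, v))                    # 1 - |cos(angle)|
```

The absolute value of the dot product makes the score insensitive to the sign ambiguity of singular vectors.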
Abstract: Comparing one thing with another is a typical part of the human decision-making process.
However, it is not always easy to know what to compare and what the alternatives are. In this paper, we present a novel way to automatically mine comparable entities from the comparative questions that users post online, to address this difficulty. To ensure high precision and high recall, we develop a weakly supervised bootstrapping approach for comparative question identification and comparable entity extraction by leveraging a large collection of online question archives. The experimental results show that our method achieves an F1-measure of 82.5 percent in comparative question identification and 83.3 percent in comparable entity extraction. Both significantly outperform an existing state-of-the-art method. Additionally, our ranking results show high relevance to users' comparison intents on the web.
ETPL Cross-Space Affinity Learning with Its Application to Movie Recommendation DM-112
Abstract: In this paper, we propose a novel cross-space affinity learning algorithm over different
spaces with heterogeneous structures. Unlike most affinity learning algorithms, which operate on a homogeneous space, we construct a cross-space tensor model to learn the affinity measures on heterogeneous spaces subject to a set of order constraints from the training pool. We further enhance the model with a factorization form, which greatly reduces the number of parameters of the model with a controlled complexity. Moreover, from the practical perspective, we show that the proposed factorized cross-space tensor model can be efficiently optimized by a series of simple quadratic optimization problems in an iterative manner. The proposed cross-space affinity learning algorithm can be applied to many real-world problems that involve multiple heterogeneous data objects defined over different spaces. In this paper, we apply it to a recommendation system to measure the affinity between users and product items, where a higher affinity means a higher rating of the user on the product. For an empirical evaluation, a widely used benchmark movie recommendation data set, MovieLens, is used to compare the proposed algorithm with other state-of-the-art recommendation algorithms, and we show that very competitive results can be obtained.
ETPL DM-113
Abstract: We introduce a distributed method for detecting distance-based outliers in very large data
sets. Our approach is based on the concept of an outlier detection solving set [2], which is a small subset of the data set that can also be employed for predicting novel outliers. The method exploits parallel computation in order to obtain vast time savings. Indeed, beyond preserving the correctness of the result, the proposed schema exhibits excellent performance. From the theoretical point of view, for common settings, the temporal cost of our algorithm is expected to be at least three orders of
Abstract: Users of databases that are hosted on shared servers cannot take for granted that their queries
will not be disclosed to unauthorized parties. Even if the database is encrypted, an adversary who is monitoring the I/O activity on the server may still be able to infer some information about a user query. For the particular case of a B+-tree that has its nodes encrypted, we identify properties that enable the ordering among the leaf nodes to be deduced. These properties allow us to construct adversarial algorithms to recover the B+-tree structure from the I/O traces generated by range queries. Combining this structure with knowledge of the key distribution (or the plaintext database itself), the adversary can infer the selection range of user queries. To counter the threat, we propose a privacy-enhancing PB+-tree index which ensures that there is high uncertainty about what data the user has worked on, even to a knowledgeable adversary who has observed numerous query executions. The core idea in the PB+-tree is to conceal the order of the leaf nodes in an encrypted B+-tree. In particular, it groups the nodes of the tree into buckets, and employs homomorphic encryption techniques to prevent the adversary from pinpointing the exact nodes retrieved by range queries. The PB+-tree can be tuned to balance its privacy strength with the computational and I/O overheads incurred. Moreover, it can be adapted to protect access privacy in cases where the attacker additionally knows a priori the access frequencies of key values. Experiments demonstrate that the PB+-tree effectively impairs the adversary's ability to recover the B+-tree structure and deduce the query ranges in all considered scenarios.
ETPL DM-115
Abstract: Hidden Markov models (HMMs) are used to analyze real-world problems. We consider an
approach that constructs minimum-entropy HMMs directly from a sequence of observations. If an insufficient amount of observation data is used to generate the HMM, the model will not represent the underlying process. Current methods assume that the observations completely represent the underlying process. It is often the case that the training data size is not large enough to adequately capture all statistical dependencies in the system. It is, therefore, important to know the statistical significance level with which the constructed model represents the underlying process, not only the training set. In this paper, we present a method to determine whether the observation data and constructed model fully express the underlying process with a given level of statistical significance. We use the statistics of the process to calculate an upper bound on the number of samples required to guarantee that the
Abstract: Multiple kernel learning (MKL) is a promising family of machine learning algorithms using
multiple kernel functions for various challenging data mining tasks. Conventional MKL methods often formulate the problem as an optimization task of learning the optimal combination of both kernels and classifiers, which usually results in challenging optimization tasks that are difficult to solve. Different from the existing MKL methods, in this paper we investigate a boosting framework of MKL for classification tasks; i.e., we adopt boosting to solve a variant of the MKL problem, which avoids solving the complicated optimization tasks. Specifically, we present a novel framework of Multiple Kernel Boosting (MKBoost), which applies the idea of boosting techniques to learn kernel-based classifiers with multiple kernels for classification problems. Based on the proposed framework, we propose several variants of MKBoost algorithms and extensively examine their empirical performance on a number of benchmark data sets in comparison with various state-of-the-art MKL algorithms on classification tasks. Experimental results show that the proposed method is more effective and efficient than the existing MKL techniques.
ETPL DM-118
Abstract: Order-preserving submatrices (OPSMs) have been shown useful in capturing concurrent
patterns in data when the relative magnitudes of data items are more important than their exact values. For instance, in analyzing gene expression profiles obtained from microarray experiments, the relative magnitudes are important both because they represent the change of gene activities across the experiments, and because there is typically a high level of noise in data that makes the exact values
Abstract: We propose a probabilistic topic model for analyzing and extracting content-related
annotations from noisy annotated discrete data, such as webpages stored using social bookmarking services. With these services, because users can attach annotations freely, some annotations do not describe the semantics of the content; thus they are noisy, i.e., not content related. The extraction of content-related annotations can be used as a preprocessing step in machine learning tasks such as text classification and image recognition, or can improve information retrieval performance. The proposed model is a generative model for content and annotations, in which the annotations are assumed to originate either from topics that generated the content or from a general distribution unrelated to the content. We demonstrate the effectiveness of the proposed method by using synthetic data and real social annotation data for text and images.
ETPL DM-120
Multiparty Access Control for Online Social Networks: Model and Mechanisms
Online social networks (OSNs) have experienced tremendous growth in recent years and have become a de facto portal for hundreds of millions of Internet users. These OSNs offer attractive means for digital social interactions and information sharing, but also raise a number of security and privacy issues. While OSNs allow users to restrict access to shared data, they currently do not provide any mechanism to enforce privacy concerns over data associated with multiple users. To this end, we propose an approach to enable the protection of shared data associated with multiple users in OSNs. We formulate an access control model to capture the essence of multiparty authorization requirements, along with a multiparty policy specification scheme and a policy enforcement mechanism. In addition, we present a logical representation of our access control model that allows us to leverage the features of existing logic solvers to perform various analysis tasks on our model. We also discuss a proof-of-concept prototype of our approach as part of an application in Facebook and provide a usability study and system evaluation of our method.
ETPL DM-121
randomization. The goal is to examine the strengths and weaknesses of randomization and explore both the potential and the pitfalls of high-dimensional randomization. Our theoretical analysis results in a number of interesting and insightful conclusions. 1) The privacy effects of randomization reduce rapidly with increasing dimensionality. 2) The properties of the underlying data set can affect the anonymity level of the randomization method. For example, natural properties of real data sets such as clustering improve the effectiveness of randomization. On the other hand, variations in data density of nonempty data localities and outliers create privacy preservation challenges for the randomization method. 3) The use of a public information-sensitive attack method makes the choice of perturbing distribution more critical than previously thought. In particular, Gaussian perturbations are significantly more effective than uniformly distributed perturbations for the high dimensional case. These insights are very useful for future research and design of the randomization method. We use the insights gained from our analysis to discuss and suggest future research directions for improvements and extensions of the randomization method.
ETPL DM-122
A fundamental data integration task faced by online commercial portals and commerce search engines is the integration of products coming from multiple providers to their product catalogs. In this scenario, the commercial portal has its own taxonomy (the master taxonomy), while each data provider organizes its products into a different taxonomy (the provider taxonomy). In this paper, we consider the problem of categorizing products from the data providers into the master taxonomy, while making use of the provider taxonomy information. Our approach is based on a taxonomy-aware processing step that adjusts the results of a text-based classifier to ensure that products that are close together in the provider taxonomy remain close in the master taxonomy. We formulate this intuition as a structured prediction optimization problem. To the best of our knowledge, this is the first approach that leverages the structure of taxonomies in order to enhance catalog integration. We propose algorithms that are scalable and thus applicable to the large data sets that are typical on the web. We evaluate our algorithms on real-world data and we show that taxonomy-aware classification provides a significant improvement over existing approaches.
ETPL The Skyline of a Probabilistic Relation DM-123
Abstract: In a deterministic relation R, tuple u dominates tuple v if u is no worse than v on all the
attributes of interest, and better than v on at least one attribute. This concept is at the heart of skyline queries, which return the set of undominated tuples in R. In this paper, we extend the notion of skyline to probabilistic relations by generalizing the definition of tuple domination to this context. Our approach is parametric in the semantics for linearly ranking probabilistic tuples and, being based on order-theoretic principles, preserves the three fundamental properties the skyline has in the deterministic case: 1) it equals the union of all top-1 results of monotone scoring functions; 2) it
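In the deterministic case, the two definitions above translate directly into code (assuming, for illustration, that smaller attribute values are better); the probabilistic extension and the ranking semantics discussed in the paper are not shown:

```python
def dominates(u, v):
    """u dominates v: no worse on every attribute (smaller is better here)
    and strictly better on at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline(tuples):
    """The undominated tuples of a deterministic relation."""
    return [t for t in tuples if not any(dominates(o, t) for o in tuples)]
```

This naive skyline scan is quadratic in the number of tuples; practical systems use sorting- or partitioning-based algorithms instead.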
Existing models for document summarization mostly use the similarity between sentences in the document to extract the most salient sentences. The documents as well as the sentences are indexed using traditional term indexing measures, which do not take the context into consideration. Therefore, the sentence similarity values remain independent of the context. In this paper, we propose a context-sensitive document indexing model based on the Bernoulli model of randomness. The Bernoulli model of randomness has been used to find the probability of the co-occurrence of two terms in a large corpus. A new approach using the lexical association between terms to give a context-sensitive weight to the document terms has been proposed. The resulting indexing weights are used to compute the sentence similarity matrix. The proposed sentence similarity measure has been used with the baseline graph-based ranking models for sentence extraction. Experiments have been conducted over the benchmark DUC data sets, and it has been shown that the proposed Bernoulli-based sentence similarity model provides consistent improvements over the baseline IntraLink and UniformLink methods [1].
ETPL A Segmentation and Graph-Based Video Sequence Matching Method for Video Copy Detection DM-126
Abstract: In this paper, we propose a segmentation and graph-based video sequence matching method for video copy detection. Specifically, due to the good stability and discriminative ability of local features, we use the SIFT descriptor for video content description. However, matching based on the SIFT descriptor is computationally expensive owing to the large number of points and their high dimensionality. Thus, to reduce the computational complexity, we first use a dual-threshold method to segment the videos into segments with homogeneous content and extract keyframes from each segment. SIFT features are extracted from the keyframes of the segments. Then, we propose an SVD-based method to match two video frames with SIFT point-set descriptors. To obtain the video sequence matching result, we propose a graph-based method. It converts video sequence matching into finding the longest path in the frame matching-result graph under a time constraint. Experimental results demonstrate that the segmentation and graph-based video sequence matching method can detect video copies effectively. The proposed method has further advantages: it can automatically find the optimal sequence matching result from disordered matching results based on spatial features, it can reduce the noise caused by spatial feature matching, and it is adaptive to video frame rate changes. Experimental results also demonstrate that the proposed method achieves a better tradeoff between the effectiveness and the efficiency of video copy detection.
ETPL DM-127
Automatic classification of sentiment is important for numerous applications such as opinion mining, opinion summarization, contextual advertising, and market analysis. Typically, sentiment classification has been modeled as the problem of training a binary classifier using reviews annotated for positive or negative sentiment. However, sentiment is expressed differently in different domains, and annotating corpora for every possible domain of interest is costly. Applying a sentiment classifier trained using labeled data for a particular domain to classify the sentiment of user reviews on a different domain often results in poor performance, because words that occur in the train (source) domain might not appear in the test (target) domain. We propose a method to overcome this problem in cross-domain sentiment classification. First, we create a sentiment-sensitive distributional thesaurus using labeled data for the source domains and unlabeled data for both source and target domains. Sentiment sensitivity is achieved in the thesaurus by incorporating document-level sentiment labels in the context vectors used as the basis for measuring the distributional similarity between words. Next, we use the created thesaurus to expand feature vectors at train and test time in a binary classifier. The proposed method significantly outperforms numerous baselines and returns results that are comparable with previously proposed cross-domain sentiment classification methods on a benchmark data set containing Amazon user reviews for different types of products. We conduct an extensive empirical analysis of the proposed method on single- and multisource domain adaptation, unsupervised and supervised domain adaptation, and numerous similarity measures for creating the sentiment-sensitive thesaurus. Moreover, our comparisons against SentiWordNet, a lexical resource for word polarity, show that the created sentiment-sensitive thesaurus accurately captures
learning. Most existing clustering methods, for example, normalized cut and $k$-means, suffer from the fact that their optimization normally leads to an NP-hard problem because the elements of the cluster indicator matrix must be discrete. A practical way to cope with this problem is to relax the constraint and allow the elements to take continuous values. Eigenvalue decomposition can then be applied to generate a continuous solution, which must be further discretized. However, the continuous solution is likely to be mixed-signed. This may cause it to deviate severely
ease future access is a common task. The usual largeness of these collections makes manual organization a vast and expensive endeavor. As an approach to producing an effective automated classification of resources, we consider the immense amounts of annotations provided by users of social tagging systems in the form of bookmarks. In this paper, we deal with the use of these user-provided tags to perform a social classification of resources. For this purpose, we have created three large-scale social tagging data sets that include tagging data for different types of resources: webpages and books. These resources are accompanied by categorization data from sound expert-driven taxonomies. We analyze the characteristics of the three social tagging systems and study how useful social tags are for producing a social classification of resources that resembles the expert classification as closely as possible. We analyze six different tag-based representations and compare them to other data sources using three different settings of SVM classifiers. Finally, we explore combinations of different data sources with tags, using classifier committees to best classify the resources.
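The classifier-committee idea mentioned at the end can be reduced to a majority vote over member classifiers, each trained on a different data source (e.g. tags vs. page content). The member classifiers below are stand-in functions, not the SVMs used in the study.

```python
# Minimal sketch of a classifier committee: each member votes on a
# resource's category and the majority label wins. Members are stand-ins.

from collections import Counter

def committee_predict(resource, classifiers):
    """classifiers: callables mapping a resource to a category label."""
    votes = Counter(clf(resource) for clf in classifiers)
    return votes.most_common(1)[0][0]   # majority (plurality) label
```

Real committees often weight members by validation accuracy; plain majority voting is the simplest baseline.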
ETPL DM-134
Abstract: Query processing over uncertain data streams, in particular top-$k$ query processing, has become increasingly important due to its wide application in fields such as sensor network monitoring and Internet traffic control. In many real applications, multiple top-$k$ queries are registered in the system. Sharing the results of these queries is a key factor in saving computation cost and providing real-time responses. However, owing to the complex semantics of uncertain top-$k$ query processing, it is nontrivial to implement sharing among different top-$k$ queries, and few works have addressed this sharing issue. In this paper, we formulate various types of sharing among multiple top-$k$ queries over uncertain data streams based on the frequency upper bound of each top-$k$ query. We present an optimal dynamic programming solution, as well as a more efficient (in terms of time and space complexity) greedy algorithm, to compute the execution plan that saves computation cost across the queries. Experiments demonstrate that the greedy algorithm finds the optimal solution in most cases and achieves almost the same performance (in terms of latency and throughput) as the dynamic programming approach.
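The core of result sharing can be illustrated in its simplest form: if the top-$k$ list is computed once for the largest requested $k$, every registered query with a smaller $k$ can be answered from a prefix of that list instead of being re-executed. This deliberately ignores the paper's uncertain semantics and frequency-upper-bound plan selection; it only shows why sharing saves work.

```python
# Simplified sharing sketch: one execution serves all registered top-k
# queries. Deterministic scores are an assumption for illustration.

import heapq

def answer_topk_queries(scores, ks):
    """scores: item scores seen so far; ks: k values of registered queries."""
    kmax = max(ks)
    shared = heapq.nlargest(kmax, scores)    # single execution, largest k
    return {k: shared[:k] for k in ks}       # smaller queries share a prefix
```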
ETPL DM-135
Abstract: The discovery of HTML query forms is one of the main challenges in Deep web crawling.
Automatic solutions to this problem perform two main tasks. The first is locating HTML forms on the web, which is done using traditional or focused crawlers. The second is identifying
networks allow users to publish details about themselves and to connect to their friends. Some of the information revealed inside these networks is meant to be private, yet it is possible to use learning algorithms on released data to predict private information. In this paper, we explore how to launch inference attacks using released social networking data to predict private information. We then devise three possible sanitization techniques that could be used in various situations, explore their effectiveness, and attempt to use methods of collective inference to discover sensitive attributes of the data set. We show that we can decrease the effectiveness of both local and relational classification algorithms by using the sanitization methods we described.
ETPL DM-137
Principal Composite Kernel Feature Analysis: Data-Dependent Kernel Approach
Abstract: Principal composite kernel feature analysis (PC-KFA) is presented to show kernel adaptations for nonlinear features of medical image data sets (MIDS) in computer-aided diagnosis (CAD). The proposed algorithm extends existing studies on kernel feature analysis (KFA), which extracts salient features from a sample of unclassified patterns by means of a kernel method. The principal composite process of PC-KFA has been applied to kernel principal component analysis [34] and to our previously developed accelerated kernel feature analysis [20]. Unlike other kernel-based feature selection algorithms, PC-KFA iteratively constructs a linear subspace of a high-dimensional feature space by maximizing a variance condition for the nonlinearly transformed samples, which we call a data-dependent kernel approach. The resulting kernel subspace is first chosen by principal component analysis and then processed into a composite kernel subspace through the efficient combination representations used for further reconstruction and classification.
Numerical experiments based on several MID feature spaces of cancer CAD data have shown that PC-KFA generates an efficient and effective feature representation and yields better classification performance for the proposed composite kernel subspace using a simple pattern classifier.
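The composite-kernel ingredient of PC-KFA can be sketched as a weighted sum of base kernel (Gram) matrices, producing one data-dependent kernel for downstream kernel PCA or feature analysis. The base kernels and weights below are illustrative assumptions, not the paper's learned combination.

```python
# Hedged sketch: a composite Gram matrix as a weighted sum of base kernels.

import math

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel on two feature tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly(x, y, degree=2):
    """Inhomogeneous polynomial kernel."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** degree

def composite_gram(X, kernels, weights):
    """Weighted combination of base kernel matrices over samples X."""
    n = len(X)
    return [[sum(w * k(X[i], X[j]) for k, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]
```

A nonnegative weighted sum of valid kernels is itself a valid kernel, which is what makes this combination safe to feed into kernel PCA.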
networks, are characterized by continuous data streaming from multiple sources and by intermediate processing by multiple aggregators. Keeping track of data provenance in such a highly dynamic context is an important requirement, since data provenance is a key factor in assessing data trustworthiness, which is crucial for many applications. Provenance management for streaming data requires addressing several challenges, including the assurance of high processing throughput, low bandwidth consumption, storage efficiency, and secure transmission. In this paper, we propose a novel approach to securely transmit provenance for streaming data (focusing on sensor networks) by embedding provenance into the interpacket timing domain while addressing the above-mentioned issues. Because provenance is hidden in another host medium, our solution can be conceptualized as a watermarking technique. However, unlike traditional watermarking approaches, we embed provenance over the interpacket delays (IPDs) rather than in the sensor data themselves, thereby avoiding the problem of data degradation due to watermarking. Provenance is extracted by the data receiver using an optimal threshold-based mechanism that minimizes the probability of provenance decoding errors. The resiliency of the scheme against outside and inside attackers is established through an extensive security analysis. Experiments show that our technique can recover provenance up to a certain level against perturbations to interpacket timing characteristics.
ETPL DM-140
The Adaptive Clustering Method for the Long Tail Problem of Recommender Systems
long tail have only a few ratings, which makes them hard to use in recommender systems. The approach presented in this paper clusters items according to their popularity, so that recommendations for tail items are based on the ratings in more intensively clustered groups, while those for head items are based on the ratings of individual items or of groups clustered to a lesser extent. We apply this method to two real-life data sets and compare the results with those of the non-grouping and fully grouped methods in terms of recommendation accuracy and scalability. The results show that, if such adaptive clustering is done properly, the method reduces recommendation error rates for tail items while maintaining reasonable computational performance.
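The head/tail split can be illustrated with a two-level version of the idea: items with enough ratings keep their individual estimates, while sparsely rated tail items are pooled and share a group estimate. The threshold and the single pooled tail group are simplifying assumptions; the paper clusters at multiple levels of granularity.

```python
# Minimal head/tail sketch: popular items get individual average ratings,
# tail items share a pooled group average. Threshold value is an assumption.

def item_predictions(ratings, tail_threshold=3):
    """ratings: item -> list of observed ratings."""
    tail = [r for rs in ratings.values()
            if len(rs) < tail_threshold for r in rs]
    tail_avg = sum(tail) / len(tail) if tail else None
    preds = {}
    for item, rs in ratings.items():
        if len(rs) >= tail_threshold:
            preds[item] = sum(rs) / len(rs)   # head: individual estimate
        else:
            preds[item] = tail_avg            # tail: pooled group estimate
    return preds
```

Pooling trades a little bias for much lower variance on items whose own rating counts are too small to estimate reliably.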
ETPL DM-141
Abstract: Similarity joins play an important role in many application areas, such as data integration
and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
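The non-overlapping chunking idea can be sketched with a toy boundary rule: split a string right after any character in a boundary set (a stand-in for the paper's tail-restricted chunk boundary dictionary), then prune string pairs that share too few chunks before running an exact edit-distance check. The vowel boundary set and the count threshold are illustrative assumptions, not VChunkJoin's actual scheme or filter bounds.

```python
# Illustrative non-overlapping chunking with a toy boundary set, plus a
# simple shared-chunk candidate filter. Scheme details are assumptions.

def chunk(s, boundaries=set("aeiou")):
    """Split s into non-overlapping chunks, each ending at a boundary char."""
    chunks, start = [], 0
    for i, ch in enumerate(s):
        if ch in boundaries:
            chunks.append(s[start:i + 1])
            start = i + 1
    if start < len(s):
        chunks.append(s[start:])      # trailing chunk with no boundary
    return chunks

def shares_enough_chunks(s, t, min_common=1):
    """Candidate filter: prune pairs sharing fewer distinct chunks."""
    return len(set(chunk(s)) & set(chunk(t))) >= min_common
```

Because chunks do not overlap, each string produces far fewer signatures than overlapping q-grams would, which is the source of the space savings the abstract reports.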