
Knowledge engineering

This article is about an information science discipline. For information about practitioners in this discipline, see Knowledge engineers.

Knowledge engineering (KE) was defined in 1983 by Edward Feigenbaum and Pamela McCorduck as follows: KE is an engineering discipline that involves integrating knowledge into computer systems in order to solve complex problems normally requiring a high level of human expertise.[1]

At present, it refers to the building, maintenance and development of knowledge-based systems.[2] It has a great deal in common with software engineering, and is used in many computer science domains such as artificial intelligence,[3][4] including databases, data mining, expert systems, decision support systems and geographic information systems. Knowledge engineering is also related to mathematical logic, and is strongly involved in cognitive science and socio-cognitive engineering, where the knowledge is produced by socio-cognitive aggregates (mainly humans) and is structured according to our understanding of how human reasoning and logic work.

Various activities of KE specific to the development of a knowledge-based system include:
- Assessment of the problem
- Development of a knowledge-based system shell/structure
- Acquisition and structuring of the related information, knowledge and specific preferences (IPK model)
- Implementation of the structured knowledge into knowledge bases
- Testing and validation of the inserted knowledge
- Integration and maintenance of the system
- Revision and evaluation of the system

Being still more art than engineering, KE is not as neat as the above list in practice: the phases overlap, the process may be iterative, and many challenges can appear.

Knowledge engineering principles

Since the mid-1980s, knowledge engineers have developed a number of principles, methods and tools to improve knowledge acquisition and ordering. Some of the key principles are that there are different types of knowledge, each requiring its own approach and technique; different types of experts and expertise, such that methods should be chosen appropriately; different ways of representing knowledge, which can aid the acquisition, validation and reuse of knowledge; and different ways of using knowledge, so that the acquisition process can be guided by the project aims (goal-oriented).

Structured methods increase the efficiency of the acquisition process. Knowledge engineering, in this broad sense, is the process of eliciting knowledge for any purpose, be it expert system or AI development.

Views of knowledge engineering

There are two main views of knowledge engineering:[5]

Transfer view. This is the traditional view. In this view, the assumption is to apply conventional knowledge engineering techniques to transfer human knowledge into artificial intelligence systems.

Modeling view. This is the alternative view. In this view, the knowledge engineer attempts to model the knowledge and problem-solving techniques of the domain expert in the artificial intelligence system.

A major concern in knowledge engineering is the construction of ontologies. One philosophical question in this area is the debate between foundationalism and coherentism: are fundamental axioms of belief required, or merely consistency of beliefs which may have no lower-level beliefs to justify them?

Overview of trends in knowledge engineering

Some of the trends in knowledge engineering over the last few years are discussed in this section. The text below is a brief overview of the paper "Knowledge Engineering: Principles and Methods" by Rudi Studer, V. Richard Benjamins and Dieter Fensel.

The paradigm shift from a transfer view to a modeling view

According to the transfer view, the human knowledge required to solve a problem is transferred and implemented into the knowledge base. However, this assumes that the knowledge needed to solve a problem is already present in humans in a concrete form. The transfer view disregards the tacit knowledge an individual acquires in order to solve a problem, which is one of the reasons for the paradigm shift towards the modeling view. This shift is compared to the shift from first-generation to second-generation expert systems. The modeling view is a closer approximation of reality and perceives problem solving as a dynamic, cyclic, incessant process dependent on the knowledge acquired and the interpretations made by the system. This is similar to how an expert solves problems in real life.

The evolution of role-limiting methods and generic tasks

Role-limiting methods are based on reusable problem-solving methods: different knowledge roles are decided, and the knowledge expected from each of these roles is clarified. However, the disadvantage of role-limiting methods is that there is no logical means of deciding whether a specific problem can be solved by a specific role-limiting method. This disadvantage gave rise to configurable role-limiting methods, which are based on the idea that a problem-solving method can be further broken up into several smaller subtasks, each solved by its own problem-solving method. Generic tasks (GTs) include a rigid knowledge structure, a standard strategy to solve problems, a specific input and a specific output. The GT approach is based on the strong interaction problem hypothesis, which states that the structure and representation of domain knowledge is completely determined by its use.

The usage of modeling frameworks

Closely related is the development of specification languages and problem-solving methods for knowledge-based systems. Over the past few years, the modeling frameworks that became prominent within knowledge engineering are CommonKADS, MIKE (Model-based and Incremental Knowledge Engineering) and PROTÉGÉ-II. PROTÉGÉ-II is a modeling framework influenced by the concept of ontology.

The influence of ontology

Ontologies help build a model of a domain and define the terms inside the domain and the relationships between them. There are different types of ontologies, including domain ontologies, generic ontologies, application ontologies and representational ontologies. Categorizing, storing, retrieving and managing knowledge in this way is not only useful for solving problems without the direct need of human expertise, but also leads to knowledge management efforts that enable an organization to function efficiently in the long run.
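
To make the activity of "implementation of the structured knowledge into knowledge bases" listed above more concrete, here is a minimal sketch of a forward-chaining rule engine in Python. It is an illustrative toy under invented facts and rules, not part of any particular KE methodology or tool.

```python
# Minimal forward-chaining inference over a rule-based knowledge base.
# Illustrative sketch only; the domain facts and rules are invented examples.

facts = {"has_fever", "has_cough"}

# Each rule: (set of premises, conclusion)
rules = [
    ({"has_fever", "has_cough"}, "suspect_flu"),
    ({"suspect_flu"}, "recommend_rest"),
]

def forward_chain(facts, rules):
    """Repeatedly fire rules whose premises are satisfied until no new facts appear."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(forward_chain(facts, rules))
# {'has_fever', 'has_cough', 'suspect_flu', 'recommend_rest'}
```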

Database systems and knowledge-base systems share many common principles. Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems. DKE achieves this aim by publishing original research results, technical advances and news items concerning data engineering, knowledge engineering, and the interface of these two fields.

Types of Paper
DKE covers the following topics:

1. Representation and manipulation of data and knowledge: conceptual data models, knowledge representation techniques, data/knowledge manipulation languages and techniques.
2. Architectures of database, expert, or knowledge-based systems: new architectures for database/knowledge-base/expert systems, design and implementation techniques, languages and user interfaces, distributed architectures.
3. Construction of data/knowledge bases: data/knowledge base design methodologies and tools, data/knowledge acquisition methods, integrity/security/maintenance issues.
4. Applications, case studies, and management issues: data administration issues, knowledge engineering practice, office and engineering applications.
5. Tools for specifying and developing data and knowledge bases using tools based on linguistics or human-machine interface principles.
6. Communication aspects involved in implementing, designing and using KBSs in cyberspace.

Plus conference reports, a calendar of events, book reviews, etc.

Data engineering uses data as the means for understanding a process. For a more comprehensive introduction, see our White Paper on Data Engineering. The data might be generated in many ways, or a subset of the available data may be used. Data engineering uses data analysis techniques from statistics, machine learning, pattern recognition or neural networks, together with other technologies such as visualization, optimization, database systems, prototyping tools and knowledge elicitation. The goal is to use the available data, or generate more data, and thereby understand the process being investigated. Analysing the data, creating new analysis tools specifically for the task, and working with the domain experts are key aspects of this engineering task. We will be using Bayesian data analysis methods (which occur throughout the different communities). You may also have heard of the terms data mining and knowledge discovery, exploratory data analysis, intelligent data analysis, and so forth. These are similar.
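
As a deliberately small example of the Bayesian data analysis methods mentioned above, the following sketch performs a Beta-Binomial conjugate update; the prior parameters and observed counts are invented.

```python
# Beta-Binomial conjugate update: a minimal example of Bayesian data analysis.
# Prior Beta(a, b); after observing k successes in n trials the posterior is
# Beta(a + k, b + n - k). The numbers below are illustrative only.

a, b = 1.0, 1.0          # uniform prior over a success probability
k, n = 37, 100           # hypothetical observed data

post_a, post_b = a + k, b + (n - k)
posterior_mean = post_a / (post_a + post_b)

print(f"posterior Beta({post_a:.0f}, {post_b:.0f}), mean = {posterior_mean:.3f}")
# posterior Beta(38, 64), mean = 0.373
```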

Transductive Multilabel Learning via Label Set Propagation Xiangnan Kong, Michael K. Ng, and Zhi-Hua Zhou, Fellow, IEEE

Abstract: The problem of multilabel classification has attracted great interest in the last decade, where each instance can be assigned multiple class labels simultaneously. It has a wide variety of real-world applications, e.g., automatic image annotation and gene function analysis. Current research on multilabel classification focuses on supervised settings which assume the existence of large amounts of labeled training data. However, in many applications the labeling of multilabeled data is extremely expensive and time-consuming, while abundant unlabeled data are often available. In this paper, we study the problem of transductive multilabel learning and propose a novel solution, called TRAsductive Multilabel Classification (TRAM), to effectively assign a set of multiple labels to each instance. Different from supervised multilabel learning methods, we estimate the label sets of the unlabeled instances effectively by utilizing the information from both labeled and unlabeled data. We first formulate transductive multilabel learning as an optimization problem of estimating label concept compositions. Then, we derive a closed-form solution to this optimization problem and propose an effective algorithm to assign label sets to the unlabeled instances. Empirical studies on several real-world multilabel learning tasks demonstrate that our TRAM method can effectively boost the performance of multilabel classification by using both labeled and unlabeled data. Index Terms: Data mining, machine learning, multilabel learning, transductive learning, semi-supervised learning, unlabeled data
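
The abstract does not spell out the closed-form solution, so the sketch below only illustrates the general transductive idea it builds on: spreading label information from labeled to unlabeled instances over a k-nearest-neighbor similarity graph. This is a generic label propagation scheme, not the TRAM algorithm itself; all data and parameter values are invented.

```python
import numpy as np

# Generic graph-based label propagation for transductive multilabel learning.
# NOT the TRAM formulation; only an illustration of using both labeled and
# unlabeled instances over a kNN similarity graph.

def propagate_labels(X, Y, labeled_mask, k=5, alpha=0.9, iters=50):
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    sigma2 = np.median(d2) + 1e-12
    W = np.exp(-d2 / sigma2)                               # RBF similarities
    np.fill_diagonal(W, 0.0)
    drop = np.argsort(-W, axis=1)[:, k:]                   # keep only k nearest neighbors
    for i in range(n):
        W[i, drop[i]] = 0.0
    W = np.maximum(W, W.T)                                 # symmetrize
    S = W / (W.sum(axis=1, keepdims=True) + 1e-12)         # row-normalize

    Y0 = np.where(labeled_mask[:, None], Y, 0.0).astype(float)
    F = Y0.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y0               # propagate, clamp toward seeds
    return F                                               # soft (instance, label) scores

# toy usage with random data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
Y = (rng.random((20, 3)) > 0.5).astype(float)
labeled = np.zeros(20, dtype=bool); labeled[:8] = True
scores = propagate_labels(X, Y, labeled)
print(scores.shape)   # threshold the scores per instance to obtain label sets
```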

Extending BCDM to Cope with Proposals and Evaluations of Updates Luca Anselma, Alessio Bottrighi, Stefania Montani, and Paolo Terenziani

Abstract: The cooperative construction of data/knowledge bases has recently had a significant impulse (see, e.g., Wikipedia [1]). In cases in which data/knowledge quality and reliability are crucial, proposals of update/insertion/deletion need to be evaluated by experts. To the best of our knowledge, no theoretical framework has been devised to model the semantics of update proposal/evaluation in the relational context. Since time is an intrinsic part of most domains (as well as of the proposal/evaluation process itself), semantic approaches to temporal relational databases (specifically, the Bitemporal Conceptual Data Model, henceforth BCDM [2]) are the starting point of our approach. In this paper, we propose BCDMPV, a semantic temporal relational model that extends BCDM to deal with multiple update/insertion/deletion proposals and with acceptances/rejections of the proposals themselves. We propose a theoretical framework, defining the new data structures, manipulation operations and temporal relational algebra, and proving some basic properties, namely that BCDMPV is a consistent extension of BCDM and that it is reducible to BCDM. These properties ensure consistency with most relational temporal database frameworks, facilitating implementations. Index Terms: Temporal databases, database semantics, database design, modeling and management

Annotating Search Results from Web Databases Yiyao Lu, Hai He, Hongkun Zhao, Weiyi Meng, Member, IEEE, and Clement Yu, Senior Member, IEEE

Abstract: An increasing number of databases have become web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet comparison shopping, they need to be extracted and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. Our experiments indicate that the proposed approach is highly effective. Index Terms: Data alignment, data annotation, web database, wrapper generation

AML: Efficient Approximate Membership Localization within a Web-Based Join Framework Zhixu Li, Laurianne Sitbon, Liwei Wang, Xiaofang Zhou, Senior Member, IEEE, and Xiaoyong Du
Abstract: In this paper, we propose a new type of dictionary-based entity recognition problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings from a given document, but many redundancies cause low efficiency of the AME process and deteriorate the performance of real-world applications using the extracted substrings. The AML problem targets locating non-overlapped substrings, which is a better approximation to the true matched substrings without generating overlapped redundancies. In order to perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML in application to a proposed web-based join framework scenario, a search-based approach to joining two tables using dictionary-based entity recognition from web documents. The results not only prove the advantage of AML over AME, but also demonstrate the effectiveness of our search-based approach. Index Terms: Web-based join, approximate membership localization
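
As a rough illustration of the AME/AML distinction described above (all overlapping matches versus a non-overlapping selection), here is a toy sketch using exact dictionary matching and a greedy longest-match rule. It is not the P-Prune algorithm; the dictionary and text are invented.

```python
import re

# AME-style: extract all (possibly overlapping) dictionary matches.
# AML-style: locate a set of non-overlapping matches.
# Greedy, exact-match sketch only; not the P-Prune algorithm from the paper.

dictionary = ["new york", "york city", "new york city"]   # invented entries
text = "she moved to new york city last year"

all_matches = []
for entry in dictionary:
    for m in re.finditer(re.escape(entry), text):
        all_matches.append((m.start(), m.end(), entry))

# greedy left-to-right selection of non-overlapping matches, preferring longer ones
all_matches.sort(key=lambda t: (t[0], -(t[1] - t[0])))
selected, last_end = [], -1
for start, end, entry in all_matches:
    if start >= last_end:
        selected.append((start, end, entry))
        last_end = end

print("AME-style:", all_matches)
print("AML-style:", selected)   # only "new york city" survives
```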

A Proxy-Based Approach to Continuous Location-Based Spatial Queries in Mobile Environments Jiun-Long Huang, Member, IEEE, and Chen-Che Huang

Abstract: Caching valid regions of spatial queries at mobile clients is effective in reducing the number of queries submitted by mobile clients and the query load on the server. However, mobile clients suffer from longer waiting times while the server computes valid regions. In this paper, we propose a proxy-based approach to continuous nearest-neighbor (NN) and window queries. The proxy creates estimated valid regions (EVRs) for mobile clients by exploiting the spatial and temporal locality of spatial queries. For NN queries, we devise two new algorithms to accelerate EVR growth, leading the proxy to build effective EVRs even when the cache size is small. For window queries, we propose representing the EVRs in the form of vectors, called estimated window vectors (EWVs), to achieve larger estimated valid regions. This novel representation and the associated creation algorithm result in more effective EVRs for window queries. In addition, due to their distinct characteristics, we use separate index structures, namely an EVR-tree and a grid index, for NN queries and window queries, respectively. To further increase efficiency, we develop algorithms that exploit the results of NN queries to aid grid index growth, benefiting EWV creation for window queries. Similarly, the grid index is utilized to support NN query answering and EVR updating. We conduct several experiments for performance evaluation. The experimental results show that the proposed approach significantly outperforms existing proxy-based approaches. Index Terms: Nearest neighbor query, window query, spatial query processing, location-based service, mobile computing

A Rough-Set-Based Incremental Approach for Updating Approximations under Dynamic Maintenance Environments Hongmei Chen, Tianrui Li, Senior Member, IEEE, Da Ruan, Member, IEEE, Jianhui Lin, and Chengxiang Hu

Abstract: Approximations of a concept by a variable precision rough-set model (VPRS) usually vary under a dynamic information system environment. It is thus effective to carry out incremental updating of approximations by utilizing previous data structures. This paper focuses on a new incremental method for updating approximations of VPRS while objects in the information system alter dynamically. It discusses properties of information granulation and approximations under the dynamic environment while objects in the universe evolve over time. The variation of an attribute's domain is also considered to perform incremental updating of approximations under VPRS. Finally, an extensive experimental evaluation validates the efficiency of the proposed method for dynamic maintenance of VPRS approximations. Index Terms: Variable precision rough-set model, knowledge discovery, granular computing, information systems, incremental updating
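
The approximations that the incremental method maintains can be computed from scratch as in the sketch below, which only shows the basic VPRS notions (inclusion degree and beta-lower/upper approximations) on an invented decision table; it is not the incremental updating procedure itself.

```python
from collections import defaultdict

# Variable precision rough set (VPRS) approximations on a tiny invented table.

objects = {
    "x1": (("a", 1), ("b", 0)), "x2": (("a", 1), ("b", 0)),
    "x3": (("a", 1), ("b", 1)), "x4": (("a", 0), ("b", 1)),
    "x5": (("a", 0), ("b", 1)),
}
concept = {"x1", "x2", "x4"}      # target set X (e.g., decision = "yes")
beta = 0.6                        # precision threshold, 0.5 < beta <= 1

# equivalence classes of the indiscernibility relation on condition attributes
classes = defaultdict(set)
for obj, desc in objects.items():
    classes[desc].add(obj)

lower, upper = set(), set()
for E in classes.values():
    inclusion = len(E & concept) / len(E)   # degree to which E is included in X
    if inclusion >= beta:                   # beta-lower approximation
        lower |= E
    if inclusion > 1 - beta:                # beta-upper approximation
        upper |= E

print("beta-lower:", sorted(lower))   # {'x1', 'x2'}
print("beta-upper:", sorted(upper))   # {'x1', 'x2', 'x4', 'x5'}
```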

On Similarity Preserving Feature Selection Zheng Zhao, Member, IEEE, Lei Wang, Senior Member, IEEE, Huan Liu, Senior Member, IEEE, and Jieping Ye, Senior Member, IEEE

Abstract: In the literature of feature selection, different criteria have been proposed to evaluate the goodness of features. In our investigation, we notice that a number of existing selection criteria implicitly select features that preserve sample similarity, and can be unified under a common framework. We further point out that any feature selection criterion covered by this framework cannot handle redundant features, a common drawback of these criteria. Motivated by these observations, we propose a new Similarity Preserving Feature Selection framework in an explicit and rigorous way. We show, through theoretical analysis, that the proposed framework not only encompasses many widely used feature selection criteria, but also naturally overcomes their common weakness in handling feature redundancy. In developing this new framework, we begin with a conventional combinatorial optimization formulation for similarity preserving feature selection, then extend it with a sparse multiple-output regression formulation to improve its efficiency and effectiveness. A set of three algorithms is devised to efficiently solve the proposed formulations, each of which has its own advantages in terms of computational complexity and selection performance. As exhibited by our extensive experimental study, the proposed framework achieves superior feature selection performance and attractive properties. Index Terms: Feature selection, similarity preserving, redundancy removal, multiple output regression, sparse regularization
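
To give a feel for what "similarity preserving" means here, the sketch below scores each feature by how well the similarity it induces on its own agrees with a given sample similarity matrix. This naive per-feature score is only an illustration of the idea, not the combinatorial or sparse multiple-output regression formulations proposed in the paper; the data and the label-based similarity matrix are invented.

```python
import numpy as np

# Naive "similarity preserving" feature score: agreement between the similarity
# induced by one feature and a target sample similarity matrix S.

def feature_scores(X, S):
    """X: (n_samples, n_features); S: (n, n) target similarity matrix."""
    n, d = X.shape
    scores = np.empty(d)
    for j in range(d):
        f = X[:, j] - X[:, j].mean()
        norm = np.linalg.norm(f)
        if norm < 1e-12:
            scores[j] = -np.inf          # a constant feature preserves nothing
            continue
        f = f / norm
        K = np.outer(f, f)               # similarity induced by feature j alone
        scores[j] = (K * S).sum()        # Frobenius inner product <K, S>
    return scores

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
labels = rng.integers(0, 2, size=30)
S = (labels[:, None] == labels[None, :]).astype(float)   # label-based similarity
ranking = np.argsort(-feature_scores(X, S))
print("features ranked by similarity preservation:", ranking)
```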

Building a Scalable Database-Driven Reverse Dictionary Ryan Shaw, Member, IEEE, Anindya Datta, Member, IEEE, Debra VanderMeer, Member, IEEE, and Kaushik Dutta, Member, IEEE

Abstract: In this paper, we describe the design and implementation of a reverse dictionary. Unlike a traditional forward dictionary, which maps from words to their definitions, a reverse dictionary takes a user input phrase describing the desired concept and returns a set of candidate words that satisfy the input phrase. This work has significant application not only for the general public, particularly those who work closely with words, but also in the general field of conceptual search. We present a set of algorithms and the results of a set of experiments showing the retrieval accuracy of our methods and the runtime response-time performance of our implementation. Our experimental results show that our approach can provide significant improvements in performance scale without sacrificing the quality of the result. Our experiments comparing the quality of our approach to that of currently available reverse dictionaries show that our approach can provide significantly higher quality than either of the other currently available implementations. Index Terms: Dictionaries, thesauruses, search process, web-based services
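
A toy version of the reverse-dictionary idea can be built with an inverted index over definition words, as sketched below. This only illustrates the concept, not the scalable database-driven design the paper describes; the mini-dictionary is invented.

```python
from collections import Counter

# Toy reverse dictionary: index each word by the content words of its
# definition, then rank candidate words by how many query terms they share.

dictionary = {
    "telescope":  "instrument for viewing distant objects in space",
    "microscope": "instrument for viewing very small objects",
    "binoculars": "handheld instrument with two lenses for viewing distant objects",
}
STOP = {"for", "in", "with", "of", "the", "a", "an", "very", "two"}

inverted = {}
for word, definition in dictionary.items():
    for term in set(definition.lower().split()) - STOP:
        inverted.setdefault(term, set()).add(word)

def reverse_lookup(phrase, top_k=3):
    hits = Counter()
    for term in set(phrase.lower().split()) - STOP:
        for word in inverted.get(term, ()):
            hits[word] += 1
    return hits.most_common(top_k)

print(reverse_lookup("instrument for viewing faraway objects in space"))
# [('telescope', 4), ('microscope', 3), ('binoculars', 3)]
```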

A Generalized Flow-Based Method for Analysis of Implicit Relationships on Wikipedia Xinpeng Zhang, Member, IEEE, Yasuhito Asano, Member, IEEE, and Masatoshi Yoshikawa

Abstract: We focus on measuring relationships between pairs of objects in Wikipedia, whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist in Wikipedia: an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and cocitation. We propose a new method using a generalized maximum flow which reflects all three factors and does not underestimate objects having high degree. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain that mining elucidatory objects opens a novel way to deeply understand a relationship. Index Terms: Link analysis, generalized flow, Wikipedia mining, relationship

A System to Filter Unwanted Messages from OSN User Walls Marco Vanetti, Elisabetta Binaghi, Elena Ferrari, Barbara Carminati, and Moreno Carullo

Abstract: One fundamental issue in today's Online Social Networks (OSNs) is to give users the ability to control the messages posted on their own private space, so that unwanted content is not displayed. Up to now, OSNs provide little support for this requirement. To fill the gap, in this paper we propose a system allowing OSN users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system that allows users to customize the filtering criteria to be applied to their walls, and a machine learning-based soft classifier that automatically labels messages in support of content-based filtering. Index Terms: Online social networks, information filtering, short text classification, policy-based personalization

Anonymization of Centralized and Distributed Social Networks by Sequential Clustering Tamir Tassa and Dror J. Cohen

Abstract: We study the problem of privacy preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm based on sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta, which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction. Index Terms: Social networks, clustering, privacy preserving data mining, distributed computation

Discovering Temporal Change Patterns in the Presence of Taxonomies Luca Cagliero

Abstract: Frequent itemset mining is a widely used exploratory technique that focuses on discovering recurrent correlations among data. The steadfast evolution of markets and business environments prompts the need for data mining algorithms to discover significant correlation changes, in order to reactively suit product and service provision to customer needs. Change mining, in the context of frequent itemsets, focuses on detecting and reporting significant changes in the set of mined itemsets from one time period to another. The discovery of frequent generalized itemsets, i.e., itemsets that (1) frequently occur in the source data and (2) provide a high-level abstraction of the mined knowledge, raises new challenges in the analysis of itemsets that become rare, and thus are no longer extracted, from a certain point. This paper proposes a novel kind of dynamic pattern, namely the HIstory GENeralized Pattern (HIGEN), that represents the evolution of an itemset in consecutive time periods by reporting the information about its frequent generalizations characterized by minimal redundancy (i.e., minimum level of abstraction) in case it becomes infrequent in a certain time period. To address HIGEN mining, it proposes HIGEN MINER, an algorithm that avoids itemset mining followed by postprocessing by exploiting a support-driven itemset generalization approach. To focus the attention on the minimally redundant frequent generalizations, and thus reduce the number of generated patterns, the discovery of a smart subset of HIGENs, namely the NONREDUNDANT HIGENs, is addressed as well. Experiments performed on both real and synthetic datasets show the efficiency and effectiveness of the proposed approach, as well as its usefulness in a real application context. Index Terms: Data mining, mining methods and algorithms
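
The core notion of change mining over frequent itemsets can be illustrated without taxonomies: mine frequent itemsets in two consecutive periods and flag those that become infrequent. The sketch below does exactly that and is not the HIGEN MINER algorithm (there is no taxonomy-driven generalization); transactions and thresholds are invented.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_sup, max_len=2):
    """Return all itemsets (up to max_len items) with relative support >= min_sup."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    return {itemset for itemset, c in counts.items() if c / n >= min_sup}

period1 = [["milk", "bread"], ["milk", "bread", "eggs"], ["milk", "eggs"], ["bread"]]
period2 = [["milk"], ["eggs"], ["milk", "eggs"], ["bread", "eggs"]]

min_sup = 0.5
f1 = frequent_itemsets(period1, min_sup)
f2 = frequent_itemsets(period2, min_sup)

disappeared = f1 - f2        # itemsets that became infrequent in the second period
print("frequent in period 1 only:", sorted(disappeared))
# [('bread',), ('bread', 'milk'), ('eggs', 'milk')]
```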

A Bound on Kappa-Error Diagrams for Analysis of Classifier Ensembles Ludmila I. Kuncheva, Member, IEEE

Abstract: Kappa-error diagrams are used to gain insights about why an ensemble method is better than another on a given data set. A point on the diagram corresponds to a pair of classifiers. The x-axis is the pairwise diversity (kappa), and the y-axis is the averaged individual error. In this study, kappa is calculated from the 2 × 2 correct/wrong contingency matrix. We derive a lower bound on kappa which determines the feasible part of the kappa-error diagram. Simulations and experiments with real data show that there is unoccupied feasible space on the diagram corresponding to (hypothetical) better ensembles, and that individual accuracy is the leading factor in improving the ensemble accuracy. Index Terms: Classifier ensembles, kappa-error diagrams, ensemble diversity, limits
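
A point on a kappa-error diagram can be computed directly from the correct/wrong outputs of two classifiers on the same test set. The sketch below applies the standard kappa formula to the 2 × 2 contingency matrix mentioned in the abstract; the prediction vectors are invented.

```python
import numpy as np

def kappa_error_point(correct1, correct2):
    """correct1, correct2: boolean arrays, True where each classifier is correct."""
    c1 = np.asarray(correct1, dtype=bool)
    c2 = np.asarray(correct2, dtype=bool)
    n = len(c1)
    a = np.sum(c1 & c2)       # both correct
    b = np.sum(c1 & ~c2)      # only classifier 1 correct
    c = np.sum(~c1 & c2)      # only classifier 2 correct
    d = np.sum(~c1 & ~c2)     # both wrong
    p_o = (a + d) / n                                     # observed agreement
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # agreement expected by chance
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    avg_error = ((c + d) / n + (b + d) / n) / 2           # averaged individual error
    return kappa, avg_error

c1 = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], dtype=bool)
c2 = np.array([1, 0, 1, 0, 1, 1, 1, 0, 0, 1], dtype=bool)
print(kappa_error_point(c1, c2))   # one (kappa, averaged error) point on the diagram
```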

Modeling and Solving Distributed Configuration Problems: A CSP-Based Approach Dietmar Jannach and Markus Zanker

Abstract: Product configuration can be defined as the task of tailoring a product according to the specific needs of a customer. Due to the inherent complexity of this task, which for example includes the consideration of complex constraints or the automatic completion of partial configurations, various Artificial Intelligence techniques have been explored over the last decades to tackle such configuration problems. Most of the existing approaches adopt a single-site, centralized approach. In modern supply chain settings, however, the components of a customizable product may themselves be configurable, thus requiring a multisite, distributed approach. In this paper, we analyze the challenges of modeling and solving such distributed configuration problems and propose an approach based on Distributed Constraint Satisfaction. In particular, we advocate the use of Generative Constraint Satisfaction for knowledge modeling and show in an experimental evaluation that the use of generic constraints is particularly advantageous in the distributed problem-solving phase as well. Index Terms: Product configuration, distributed constraint satisfaction

Facilitating Effective User Navigation through Website Structure Improvement Min Chen and Young U. Ryu

Abstract: Designing well-structured websites to facilitate effective user navigation has long been a challenge. A primary reason is that the web developers' understanding of how a website should be structured can be considerably different from that of the users. While various methods have been proposed to relink webpages to improve navigability using user navigation data, the completely reorganized new structure can be highly unpredictable, and the cost of disorienting users after the changes remains unanalyzed. This paper addresses how to improve a website without introducing substantial changes. Specifically, we propose a mathematical programming model to improve user navigation on a website while minimizing alterations to its current structure. Results from extensive tests conducted on a publicly available real data set indicate that our model not only significantly improves user navigation with very few changes, but also can be effectively solved. We have also tested the model on large synthetic data sets to demonstrate that it scales up very well. In addition, we define two evaluation metrics and use them to assess the performance of the improved website using the real data set. Evaluation results confirm that user navigation on the improved structure is indeed greatly enhanced. More interestingly, we find that heavily disoriented users are more likely to benefit from the improved structure than less disoriented users. Index Terms: Website design, user navigation, web mining, mathematical programming

Clustering Large Probabilistic Graphs George Kollios, Michalis Potamias, and Evimaria Terzi

Abstract: We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free; therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges. Index Terms: Uncertain data, probabilistic graphs, clustering algorithms, probabilistic databases
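
The edit-distance objective extended to probabilistic graphs can be written down in a few lines: a pair inside a cluster is expected to be connected (expected cost 1 - p of adding a missing edge), a pair across clusters is not (expected cost p of deleting a present edge). The sketch below evaluates this expected cost for a given clustering; it is not the paper's approximation algorithm, and the toy graph is invented.

```python
from itertools import combinations

edge_prob = {            # undirected edges with existence probabilities
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7,
    ("c", "d"): 0.2, ("d", "e"): 0.6,
}
nodes = ["a", "b", "c", "d", "e"]

def p(u, v):
    return edge_prob.get((u, v), edge_prob.get((v, u), 0.0))

def expected_edit_cost(clustering):
    """clustering: dict node -> cluster id; returns expected number of edge edits."""
    cost = 0.0
    for u, v in combinations(nodes, 2):
        if clustering[u] == clustering[v]:
            cost += 1.0 - p(u, v)   # expect an edge inside a cluster
        else:
            cost += p(u, v)         # expect no edge across clusters
    return cost

good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
bad  = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0}
print(expected_edit_cost(good), expected_edit_cost(bad))   # 1.2 vs 5.6
```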

A New Algorithm for Inferring User Search Goals with Feedback Sessions Zheng Lu, Student Member, IEEE, Hongyuan Zha, Xiaokang Yang, Senior Member, IEEE, Weiyao Lin, Member, IEEE, and Zhaohui Zheng

Abstract: For a broad-topic and ambiguous query, different users may have different search goals when they submit it to a search engine. The inference and analysis of user search goals can be very useful in improving search engine relevance and user experience. In this paper, we propose a novel approach to infer user search goals by analyzing search engine query logs. First, we propose a framework to discover different user search goals for a query by clustering the proposed feedback sessions. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Second, we propose a novel approach to generate pseudo-documents to better represent the feedback sessions for clustering. Finally, we propose a new criterion, Classified Average Precision (CAP), to evaluate the performance of inferring user search goals. Experimental results are presented using user click-through logs from a commercial search engine to validate the effectiveness of our proposed methods. Index Terms: User search goals, feedback sessions, pseudo-documents, restructuring search results, classified average precision

λ-Diverse Nearest Neighbors Browsing for Multidimensional Data Onur Kucuktunc and Hakan Ferhatosmanoglu

Abstract: Traditional search methods try to obtain the most relevant information and rank it according to the degree of similarity to the queries. Diversity in query results is also preferred by a variety of applications, since results very similar to each other cannot capture all aspects of the queried topic. In this paper, we focus on the λ-diverse k-nearest neighbor search problem on spatial and multidimensional data. Unlike the approach of diversifying query results in a postprocessing step, we naturally obtain diverse results with the proposed geometric and index-based methods. We first make an analogy with the concept of Natural Neighbors (NatN) and propose a natural-neighbor-based method for 2D and 3D data and an incremental browsing algorithm based on Gabriel graphs for higher-dimensional spaces. We then introduce a diverse browsing method based on the distance browsing feature of spatial index structures, such as R-trees. The algorithm maintains a priority queue with the mindivdist of the objects, depending on both relevancy and angular diversity, and efficiently prunes nondiverse items and nodes. We experiment with a number of spatial and high-dimensional data sets, including Factual's (http://www.factual.com/) US points-of-interest data set of 13M entries. On the experimental setup, the diverse browsing method is shown to be more efficient (regarding disk accesses) than k-NN search on R-trees, and more effective (regarding Maximal Marginal Relevance (MMR)) than the diverse nearest neighbor search techniques found in the literature. Index Terms: Diversity, diverse nearest neighbor search, angular similarity, natural neighbors, Gabriel graph
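
The paper measures effectiveness with Maximal Marginal Relevance (MMR), and the postprocessing approach it improves upon is essentially greedy MMR re-ranking. The sketch below shows that baseline trade-off between relevance and diversity; it is not the proposed index-based browsing method, and the data are random.

```python
import numpy as np

def mmr_diverse_knn(query, points, k, lam=0.7):
    """Greedily pick k points balancing closeness to the query (relevance)
    against closeness to already selected points (redundancy)."""
    points = np.asarray(points, dtype=float)
    rel = -np.linalg.norm(points - query, axis=1)        # higher = more relevant
    selected = [int(np.argmax(rel))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(len(points)):
            if i in selected:
                continue
            # similarity to the closest already-selected point (negative distance)
            redundancy = max(-np.linalg.norm(points[i] - points[j]) for j in selected)
            score = lam * rel[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

rng = np.random.default_rng(2)
pts = rng.uniform(size=(50, 2))
print(mmr_diverse_knn(np.array([0.5, 0.5]), pts, k=5))   # indices of 5 diverse neighbors
```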

Sampling Online Social Networks Manos Papagelis, Gautam Das, Member, IEEE, and Nick Koudas, Member, IEEE

Abstract: As online social networking emerges, there has been increased interest in utilizing the underlying network structure as well as the available information on social peers to improve the information needs of a user. In this paper, we focus on improving the performance of information collection from the neighborhood of a user in a dynamic social network. We introduce sampling-based algorithms to efficiently explore a user's social network respecting its structure and to quickly approximate quantities of interest. We introduce and analyze variants of the basic sampling scheme exploring correlations across our samples. Models of centralized and distributed social networks are considered. We show that our algorithms can be utilized to rank items in the neighborhood of a user, assuming that information for each user in the network is available. Using real and synthetic data sets, we validate the results of our analysis and demonstrate the efficiency of our algorithms in approximating quantities of interest. The methods we describe are general and can probably be easily adopted in a variety of strategies aiming to efficiently collect information from a social graph. Index Terms: Information networks, search process, query processing, performance evaluation of algorithms and systems

Robust Module-Based Data Management François Goasdoué and Marie-Christine Rousset

Abstract: The current trend for building an ontology-based data management system (DMS) is to capitalize on efforts made to design a preexisting well-established DMS (a reference system). The method amounts to extracting from the reference DMS a piece of schema relevant to the new application needs (a module), possibly personalizing it with extra constraints w.r.t. the application under construction, and then managing a data set using the resulting schema. In this paper, we extend the existing definitions of modules and introduce novel properties of robustness that provide means for easily checking that a robust module-based DMS evolves safely w.r.t. both the schema and the data of the reference DMS. We carry out our investigations in the setting of description logics, which underlie modern ontology languages like RDFS, OWL, and OWL2 from W3C. Notably, we focus on the DL-liteA dialect of the DL-lite family, which encompasses the foundations of the QL profile of OWL2 (i.e., DL-liteR): the W3C recommendation for efficiently managing large data sets. Index Terms: Models and principles, database management, personalization, algorithms for data and knowledge management, artificial intelligence, intelligent web services, Semantic Web

Protecting Sensitive Labels in Social Network Data Anonymization Mingxuan Yuan, Lei Chen, Member, IEEE, Philip S. Yu, Fellow, IEEE, and Ting Yu

Abstract: Privacy is one of the major concerns when publishing or sharing social network data for social science research and business analysis. Recently, researchers have developed privacy models similar to k-anonymity to prevent node reidentification through structure information. However, even when these privacy models are enforced, an attacker may still be able to infer one's private information if a group of nodes largely share the same sensitive labels (i.e., attributes). In other words, the label-node relationship is not well protected by pure structure anonymization methods. Furthermore, existing approaches, which rely on edge editing or node clustering, may significantly alter key graph properties. In this paper, we define a k-degree-l-diversity anonymity model that considers the protection of structural information as well as sensitive labels of individuals. We further propose a novel anonymization methodology based on adding noise nodes. We develop a new algorithm that adds noise nodes into the original graph with the consideration of introducing the least distortion to graph properties. Most importantly, we provide a rigorous analysis of the theoretical bounds on the number of noise nodes added and their impact on an important graph property. We conduct extensive experiments to evaluate the effectiveness of the proposed technique. Index Terms: Social networks, privacy, anonymous

The Minimum Consistent Subset Cover Problem: A Minimization View of Data Mining Byron J. Gao, Martin Ester, Member, IEEE Computer Society, Hui Xiong, Senior Member, IEEE, Jin-Yi Cai, and Oliver Schulte

Abstract: In this paper, we introduce and study the minimum consistent subset cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has minimum clique partition (MCP), a dual problem of graph coloring, as an instance. Many common data mining tasks in rule learning, clustering, and pattern mining can be formulated as MCSC instances. In particular, we discuss the minimum rule set (MRS) problem that minimizes the model complexity of decision rules, the converse k-clustering problem that minimizes the number of clusters, and the pattern summarization problem that minimizes the number of patterns. For any of these MCSC instances, our proposed generic algorithm CAG can be directly applied. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering the ground set. Index Terms: Minimum consistent subset cover, set covering, graph coloring, minimum clique partition, minimum star partition, minimum rule set, converse k-clustering, pattern summarization
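
A simple greedy baseline makes the MCSC problem statement concrete: repeatedly grow one consistent subset from the still-uncovered elements until everything is covered. The sketch below uses "all elements share one class label" as the consistency constraint, echoing the rule-learning instance mentioned above; it is a baseline heuristic, not the CAG algorithm, and the data are invented.

```python
# Greedy heuristic for the minimum consistent subset cover (MCSC) problem.

ground_set = {
    "e1": "yes", "e2": "yes", "e3": "no",
    "e4": "no",  "e5": "yes", "e6": "no",
}

def consistent(subset):
    """Constraint t: all elements in the subset carry the same class label."""
    labels = {ground_set[e] for e in subset}
    return len(labels) <= 1

def greedy_mcsc(elements, consistent):
    uncovered = set(elements)
    cover = []
    while uncovered:
        subset = set()
        for e in sorted(uncovered):            # deterministic order for the demo
            if consistent(subset | {e}):
                subset.add(e)
        cover.append(subset)
        uncovered -= subset
    return cover

print(greedy_mcsc(ground_set, consistent))
# e.g., [{'e1', 'e2', 'e5'}, {'e3', 'e4', 'e6'}]
```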

Information-Theoretic Outlier Detection for Large-Scale Categorical Data Shu Wu, Member, IEEE, and Shengrui Wang, Member, IEEE

Abstract: Outlier detection can usually be considered a pre-processing step for locating, in a data set, those objects that do not conform to well-defined notions of expected behavior. It is very important in data mining for discovering novel or rare events, anomalies, vicious actions, exceptional phenomena, etc. We are investigating outlier detection for categorical data sets. This problem is especially challenging because of the difficulty of defining a meaningful similarity measure for categorical data. In this paper, we propose a formal definition of outliers and an optimization model of outlier detection, via a new concept of holoentropy that takes both entropy and total correlation into consideration. Based on this model, we define a function for the outlier factor of an object which is solely determined by the object itself and can be updated efficiently. We propose two practical 1-parameter outlier detection methods, named ITB-SS and ITB-SP, which require no user-defined parameters for deciding whether an object is an outlier. Users need only provide the number of outliers they want to detect. Experimental results show that ITB-SS and ITB-SP are more effective and efficient than mainstream methods and can be used to deal with both large and high-dimensional data sets where existing algorithms fail. Index Terms: Outlier detection, holoentropy, total correlation, outlier factor, attribute weighting, greedy algorithms
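
A much-simplified information-based outlier score for categorical data is sketched below: each object is scored by the summed surprisal of its attribute values. This ignores the total-correlation component of holoentropy and is not the ITB-SS or ITB-SP method; the data set is invented.

```python
import math
from collections import Counter

# Naive surprisal-based outlier score for categorical data:
# score(obj) = sum over attributes of -log(relative frequency of its value).

data = [
    ("red",   "small", "round"),
    ("red",   "small", "round"),
    ("red",   "large", "round"),
    ("blue",  "small", "round"),
    ("green", "huge",  "square"),     # intuitively the outlier
]

n = len(data)
freq = [Counter(row[j] for row in data) for j in range(len(data[0]))]

def surprisal_score(obj):
    return sum(-math.log(freq[j][v] / n) for j, v in enumerate(obj))

for obj in data:
    print(obj, round(surprisal_score(obj), 2))   # the last row scores highest
```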

Supporting Flexible, Efficient, and User-Interpretable Retrieval of Similar Time Series Stefania Montani, Giorgio Leonardi, Alessio Bottrighi, Luigi Portinale, and Paolo Terenziani

Abstract: Supporting decision making in domains in which the dynamics of the observed phenomenon have to be dealt with can greatly benefit from the retrieval of past cases, provided that proper representation and retrieval techniques are implemented. In particular, when the parameters of interest take the form of time series, dimensionality reduction and flexible retrieval have to be addressed to this end. Classical methodological solutions proposed to cope with these issues, typically based on mathematical transforms, are characterized by strong limitations, such as a difficult interpretation of retrieval results for end users, reduced flexibility and interactivity, or inefficiency. In this paper, we describe a novel framework in which time-series features are summarized by means of Temporal Abstractions, and then retrieved by resorting to abstraction similarity. Our approach grants interpretability of the output results and understandability of the (user-guided) retrieval process. In particular, multilevel abstraction mechanisms and proper indexing techniques are provided for flexible query issuing, and efficient and interactive query answering. Experimental results have shown the efficiency of our approach in a scalability test, and its superiority with respect to the use of a classical mathematical technique in flexibility, user friendliness, and quality of results. Index Terms: Decision support, knowledge representation formalisms and methods, knowledge retrieval, information search and retrieval
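
The basic Temporal Abstraction step that such a framework retrieves on can be illustrated with a simple trend abstraction: turning a numeric series into increasing/decreasing/stable intervals. The sketch below shows only this idea, not the multilevel abstractions or indexing techniques of the paper; the series and threshold are invented.

```python
# Minimal trend-based Temporal Abstraction: convert a numeric time series into
# labeled intervals (I = increasing, D = decreasing, S = stable) that can be
# compared symbol-by-symbol instead of point-by-point.

def trend_abstraction(series, eps=0.1):
    """Return a list of (label, start_index, end_index) intervals."""
    labels = []
    for i in range(1, len(series)):
        delta = series[i] - series[i - 1]
        labels.append("I" if delta > eps else "D" if delta < -eps else "S")
    intervals, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            intervals.append((labels[start], start, i))   # covers points start..i
            start = i
    return intervals

heart_rate = [70, 71, 75, 82, 90, 90, 89, 84, 78, 78]     # invented measurements
print(trend_abstraction(heart_rate, eps=1.0))
# [('S', 0, 1), ('I', 1, 4), ('S', 4, 6), ('D', 6, 8), ('S', 8, 9)]
```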
