EFFICIENT AND ACCURATE DISCOVERY OF PATTERNS IN SEQUENCE DATA SETS
CHAPTER – 1
INTRODUCTION
1.1 OVERVIEW
Frequent pattern mining is a popular research area. Many algorithms have been proposed, and various methods have been applied to its core problems, such as the Apriori strategy, pattern-matching strategies, and even evolutionary algorithms. Sequence pattern mining, an important branch of frequent pattern mining, aims to discover frequent subsequence patterns in a single sequence or a sequence database. Such patterns are strongly correlated with meaningful events or knowledge within the data and are applied in numerous fields, such as mining customer purchase patterns, mining tree-structured information, travel-landscape recommendation, time-series analysis and prediction, bug repositories, sequence classification, biological sequence data analysis, and temporal uncertainty.
In the existing process, the NOSEP algorithm is used for pattern mining, and pattern occurrences are counted mainly under one of two conditions: 1) the no-condition; 2) the one-off condition. A Nettree is used to construct a tree from a derived DNA data set under these two conditions. In this Nettree, the DNA data are stored in whatever order they were uploaded to the database. Replication of DNA data is not restricted, and the gap constraints between letters are not calculated, which prevents exact matching of DNA data.
Of these two conditions, the no-condition focuses only on matching the DNA patterns in the data set. When conditions are applied during pattern matching, there is a possibility of neglecting some of the data, depending on the conditions applied.
Although this yields a reasonable way of matching, it has drawbacks: under the no-condition, every item in the data set is checked, and data are retained even when they only weakly match the existing data. Repeated data therefore accumulate, and finding the matched data becomes very difficult.
The one-off condition, in contrast, imposes a condition based on the updated data and can therefore eliminate some data abruptly. Under this condition, some data sets are removed entirely before they are considered for pattern matching, so some relevant data are never detected. Data may thus be lost when constructing the Nettree, and it becomes difficult to trace the occurrence path of the data.
In addition, using only these two conditions makes it problematic to determine the preferred data sets, and recurrent or noisy data may not be filtered out by them. Consequently, constructing the Nettree and finding frequent patterns in DNA does not give a faultless result, because DNA data are highly complex and contain repeated components.
The existing method mainly focuses on finding pairs (or sets) of motifs that co-occur in the data set within a short distance of each other. It studies only a simple mismatch-based definition of noise and does not consider more complex motif models such as a substitution matrix or a compatibility matrix.
Whereas in the proposed process, a new motif-detector algorithm called MEED is used to find the accuracy of matching and non-matching DNA data. The motif detector incorporates non-overlapping occurrence handling and the detection of gap constraints.
We describe a novel motif-detector algorithm called MEED that uses a concurrent traversal of two processes to efficiently explore the space of all motifs. We then present an algorithm that uses MEED as a building block and can mine combinations of simple approximate motifs under relaxed constraints. MEED is a versatile algorithm that can be used in several real motif-mining tasks and outperforms existing time-series mining algorithms (Random Projections) by more than an order of magnitude.
The approach we take in MEED explores the space of all possible models. To carry out this exploration efficiently, we first construct two suffix trees: a suffix tree on the actual data set that stores a count in every node (called the data suffix tree), and a suffix tree on the set of all possible model strings (called the model suffix tree).
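The idea of a data suffix tree with per-node occurrence counts can be sketched in Java. This is an illustrative sketch, not the MEED implementation: it builds an uncompressed suffix trie (a real suffix tree compresses edges), and the class and method names are our own.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a "data suffix tree" with counts: a trie built from every suffix
// of the data string; a node's count equals the number of occurrences of the
// substring spelled on the path from the root to that node.
public class SuffixTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count = 0;  // occurrences of the substring ending at this node
    }

    final Node root = new Node();

    SuffixTrie(String data) {
        // Insert every suffix; each node visited gets its count incremented.
        for (int i = 0; i < data.length(); i++) {
            Node cur = root;
            for (int j = i; j < data.length(); j++) {
                cur = cur.children.computeIfAbsent(data.charAt(j), c -> new Node());
                cur.count++;
            }
        }
    }

    int occurrences(String pattern) {
        Node cur = root;
        for (char c : pattern.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return 0;   // pattern never occurs in the data
        }
        return cur.count;
    }

    public static void main(String[] args) {
        SuffixTrie t = new SuffixTrie("ACGTACG");
        System.out.println(t.occurrences("ACG"));  // prints 2
    }
}
```

Querying walks the pattern down from the root; the count stored at the final node is the number of suffixes passing through it, i.e. the number of occurrences of the pattern in the data.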
Non-overlapping handling addresses the case where occurrences of DNA data with a similar pattern but a different sequential order may overlap with similar DNA data already present. Using this technique, overlapping can be blocked or reduced, which in turn reduces the time needed to search and match the data.
Here the data are searched as sequence patterns. Because a sequence pattern comprises multiple pattern letters occurring in sequential order, when the pattern occurs in the sequence, two pattern letters may appear in the required sequential order but with varying numbers of letters between them.
Gap constraints are mainly used to find the allocation of the nucleotide bases, such as thymine, cytosine, adenine, and guanine, whose allocation may vary. With the help of gap constraints one can also calculate the matching criteria and the allocation of the DNA data set.
A small gap between pattern letters is too restrictive to find valid patterns, whereas a large gap makes the pattern too general to represent meaningful knowledge within the data. Because gap constraints allow users to set gap flexibility to meet their distinct needs, sequence pattern mining with gap constraints has been applied in many fields, including medical emergency identification, mining biological characteristics, mining customer purchase patterns, and feature extraction.
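The interplay of gap constraints and the non-overlapping condition can be illustrated with a small Java sketch. This is not the MEED or Nettree algorithm; it is a simple greedy search (the class and method names are our own) that counts occurrences of a pattern under two conditions: consecutive pattern letters must be separated by between a and b sequence letters, and no two counted occurrences may share a sequence position.

```java
// Illustrative sketch: count occurrences of pattern p in sequence s such that
// (1) consecutive pattern letters are separated by a..b letters (gap constraint),
// (2) no two counted occurrences share a position (non-overlapping condition).
public class GapPatternCounter {

    // Find one occurrence whose letters are all at unused positions,
    // searching from position 'from'; returns the positions or null.
    static int[] findOccurrence(String s, String p, int a, int b,
                                boolean[] used, int from) {
        for (int start = from; start < s.length(); start++) {
            if (used[start] || s.charAt(start) != p.charAt(0)) continue;
            int[] pos = new int[p.length()];
            pos[0] = start;
            if (extend(s, p, a, b, used, pos, 1)) return pos;
        }
        return null;
    }

    // Recursively place pattern letter k at an allowed gap after letter k-1.
    static boolean extend(String s, String p, int a, int b,
                          boolean[] used, int[] pos, int k) {
        if (k == p.length()) return true;
        for (int g = a; g <= b; g++) {
            int i = pos[k - 1] + 1 + g;   // g letters lie between the two matches
            if (i >= s.length()) break;
            if (!used[i] && s.charAt(i) == p.charAt(k)) {
                pos[k] = i;
                if (extend(s, p, a, b, used, pos, k + 1)) return true;
            }
        }
        return false;
    }

    public static int count(String s, String p, int a, int b) {
        boolean[] used = new boolean[s.length()];
        int n = 0;
        int[] occ;
        while ((occ = findOccurrence(s, p, a, b, used, 0)) != null) {
            for (int i : occ) used[i] = true;  // reserve positions: non-overlapping
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Pattern "AA" with 1 to 2 letters allowed between the two A's:
        // occurrences at positions (0,2) and (5,7) in the sequence below.
        System.out.println(count("ATACGATAA", "AA", 1, 2));  // prints 2
    }
}
```

A greedy leftmost search like this may undercount relative to an optimal non-overlapping assignment; the Nettree-based algorithms discussed in this report exist precisely to handle such cases efficiently.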
In addition, based on MEED, we also address more general problems such as data accuracy, data mining, and clustering. In the MEED algorithm, non-overlapping extraction and gap constraints allow mining frequent combinations of sequence data sets and motifs under unperturbed constraints to frame the predicted data. From these sorted data the final report and the suffix tree are generated; hence the result attains maximum accuracy and better outcomes, which is useful for finding exact matches and makes the DNA data set easy to track.
However, in a sequence-data environment, counting the occurrences of a pattern in the sequence is inherently complicated, because a letter in the sequence may match multiple pattern letters, and different matchings may yield different frequency-counting results.
Data mining, also called knowledge discovery in databases, is the process in computer science of discovering interesting and useful patterns and relationships in large volumes of data. Data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and methods may vary when applied to different types of data, and indeed the challenges presented by different types of data vary considerably.
Data mining is being put into use and studied for many kinds of databases, including relational databases, object-relational and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial, multimedia, time-series, and textual databases, and even flat files.
The field combines tools from statistics and artificial intelligence with database management to analyze large digital collections, known as data sets. Data mining is widely used in business, scientific research, and government security.
In data mining, the process of extracting knowledge from text is called text data mining: the detection by computer of new, previously unknown information, by automatically extracting it from a usually large amount of different unstructured textual resources. In text mining, patterns are mined from natural-language text rather than from databases.
Database Data:
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data. The software programs provide mechanisms for defining
database structures and data storage; for specifying and managing concurrent, shared,
or distributed data access; and for ensuring consistency and security of the information
stored despite system crashes or attempts at unauthorized access.
A relational database is a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes (columns or fields) and usually stores a large
set of tuples (records or rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values. A semantic data model, such as an
entity-relationship (ER) data model, is often constructed for relational databases. An ER data
model represents the database as a set of entities and their relationships.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks perform induction on the current data in order to make predictions. Data mining functionalities, and the kinds of patterns they can discover, are described below. Interesting patterns represent knowledge.
TEXT MINING:
Text mining, also referred to as text data mining and roughly comparable to text analytics, refers to the process of deriving high-quality information from text. In text mining, information is extracted from different types of documents. We can give different definitions of text mining, each inspired by a specific perspective on the area:
Text Mining = Information Extraction. The first approach assumes that text mining essentially corresponds to information extraction: the extraction of facts from texts.
Text Mining = Text Data Mining. Text mining can also be described, similarly to data mining, as the application of algorithms and methods from the fields of machine learning and statistics to texts, with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly.
Text Mining = KDD Process. Following the knowledge discovery process model, text mining frequently appears in the literature as a process with a series of partial steps, among them information extraction as well as the use of data mining or statistical procedures. The following four types are most commonly used to represent documents:
Characters:
The individual component-level letters, numerals, special characters and spaces are the building blocks of higher-level semantic features such as words, terms, and concepts. A character-level representation can include the full set of all characters for a document or some filtered subset. Character-based representations that include some level of positional information are somewhat more useful and common.
Words:
Specific words selected directly from a "native" document are at what might be described as the basic level of semantic richness. For this reason, word-level structures are sometimes referred to as existing in the native feature space of a document.
Terms:
Terms are single words and multiword expressions selected directly from the corpus of a native document by means of term-extraction methodologies. Term-level features, in the sense of this definition, can only be made up of specific words and expressions found within the native document for which they are meant to be generally representative. A term-based representation of a document is necessarily composed of a subset of the terms in that document.
Concepts:
Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling.
WORKING FUNCTIONS:
Text mining includes the application of methods from areas such as information retrieval, natural language processing, information extraction and data mining. These several phases of a text-mining method can be combined into a single workflow.
Information extraction (IE) involves structuring the data that the NLP system produces. The objective of information-extraction methods is the extraction of specific information from text documents, which is then stored in database-like patterns.
TEXT ENCODING:
For mining large document collections it is necessary to pre-process the text documents and store the information in a data structure that is more appropriate for further processing than a plain text file.
TEXT PREPROCESSING:
In order to obtain all words that are used in a given text, a tokenization process is required, i.e. a text document is split into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spaces. This tokenized representation is then used for further processing. The set of different words obtained by merging all text documents of a collection is called the dictionary of the document collection.
In order to reduce the size of the dictionary, and thus the dimensionality of the description of documents within the collection, the set of words describing the documents can be compacted by filtering and by lemmatization or stemming methods.
To further reduce the number of words to be used, indexing or keyword-selection algorithms can be applied; in that case, only the selected keywords are used to describe the documents. A simple technique for keyword selection is to extract keywords based on their entropy.
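The tokenization and dictionary-building steps described above can be sketched in a few lines of Java (an illustrative sketch; the regular expressions and class names are our own choices):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of tokenization: punctuation removed, tabs and other whitespace
// collapsed to single spaces, lower-cased; the dictionary is the set of
// distinct words merged over all documents of the collection.
public class Tokenizer {
    static List<String> tokenize(String doc) {
        String cleaned = doc.replaceAll("\\p{Punct}", " ")  // strip punctuation
                            .replaceAll("\\s+", " ")        // collapse whitespace
                            .trim()
                            .toLowerCase();
        return cleaned.isEmpty() ? new ArrayList<>() : Arrays.asList(cleaned.split(" "));
    }

    static Set<String> dictionary(List<String> docs) {
        Set<String> dict = new TreeSet<>();   // sorted set of distinct words
        for (String d : docs) dict.addAll(tokenize(d));
        return dict;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Data mining, in brief.",
                                          "Mining\tsequence data!");
        System.out.println(dictionary(docs));  // prints [brief, data, in, mining, sequence]
    }
}
```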
Despite its simple data structure, which uses no explicit semantic information, the vector space model enables very efficient analysis of huge document collections. It was originally introduced for indexing and information retrieval, but is now also used in several text mining approaches as well as in most of the currently available document retrieval systems.
The vector space model represents documents as vectors in m-dimensional space, i.e. each document d is described by a numerical feature vector w(d) = (x(d; t1), ..., x(d; tm)). The main task of the vector space representation of documents is to find an appropriate encoding of the feature vector. Each element of the vector typically represents a word (or a group of words) of the document collection, i.e. the size of the vector is defined by the number of words (or groups of words) of the whole document collection. A common choice is the tf-idf weighting x(d; t) = tf(d; t) · log(N/nt), where N is the size of the document collection D and nt is the number of documents in D that contain term t.
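Assuming the common tf-idf weighting x(d; t) = tf(d; t) · log(N/nt), a feature weight can be computed as in the following Java sketch (illustrative only; the class and method names are ours):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of tf-idf weighting for the vector space model:
// x(d, t) = tf(d, t) * log(N / n_t), where N is the collection size and
// n_t is the number of documents containing term t.
public class TfIdf {
    static double weight(List<List<String>> docs, int d, String term) {
        long tf = docs.get(d).stream().filter(term::equals).count();      // term frequency in d
        long nt = docs.stream().filter(doc -> doc.contains(term)).count(); // document frequency
        if (tf == 0 || nt == 0) return 0.0;
        return tf * Math.log((double) docs.size() / nt);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("dna", "pattern", "mining"),
            Arrays.asList("text", "mining"),
            Arrays.asList("dna", "dna", "sequence"));
        // "dna" occurs twice in document 2 and appears in 2 of 3 documents:
        System.out.printf("%.3f%n", weight(docs, 2, "dna"));  // 2 * ln(3/2) ≈ 0.811
    }
}
```

Terms that appear in every document get weight zero (log(N/N) = 0), which is exactly the intended effect: they carry no discriminating information.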
Linguistic Pre-processing:
1.3 CLASSIFICATION:
Classification is used to find out to which group each data instance belongs within a given dataset. It is used for classifying data into various classes according to some constraints. Several important classification algorithms, including C4.5, ID3, k-nearest neighbour, naive Bayes, SVM, and ANN, are used for classification. Usually a classification method follows one of three approaches: statistical, machine learning, or neural network. Against this background, this work provides a survey of different classification algorithms together with their features and limitations.
Data mining, in common terms, means mining or digging deep into data, which exists in different forms, to find patterns and to gain knowledge from those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.
Classification is a data-analysis task: the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Example: before beginning any project, we need to check its feasibility. In this case, a classifier is required to forecast class labels such as 'Safe' and 'Risky' for adopting the project and approving it further. Classification is a two-step process:
1. Learning step (training phase): construction of the classification model. Various algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained so that it predicts accurate results.
2. Classification step: the model is used to predict class labels; the constructed model is tested on test data, and the accuracy of the classification rules is estimated.
Text Classification:
As document collections often comprise more than 100,000 different words, we may select the most informative ones for a specific classification task, to reduce the number of words and thus the complexity of the classification problem at hand.
Probabilistic classifiers start with the assumption that the words of a document di have been generated by a probabilistic mechanism. It is assumed that the class L(di) of document di has some relation to the words which appear in the document. This may be described by the conditional distribution p(t1, ..., tni | L(di)) of the ni words given the class. Bayes' formula then yields the probability of a class given the words of a document.
Instead of building explicit models for the different classes, we may select documents from the training set which are "similar" to the target document. The class of the target document may then be inferred from the class labels of these similar documents. If the k most similar documents are considered, the approach is known as k-nearest-neighbour classification.
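A minimal sketch of k-nearest-neighbour text classification in Java, using word overlap as the similarity measure (an assumption made here for illustration; real systems would more likely use cosine similarity on tf-idf vectors):

```java
import java.util.*;

// k-nearest-neighbour sketch: the class of a new document is the majority
// class among the k training documents most similar to it.
public class Knn {
    // Similarity: number of distinct words two documents share.
    static int overlap(Set<String> a, Set<String> b) {
        int n = 0;
        for (String w : a) if (b.contains(w)) n++;
        return n;
    }

    static String classify(List<Set<String>> docs, List<String> labels,
                           Set<String> query, int k) {
        // Sort training-document indices by similarity to the query, descending.
        Integer[] idx = new Integer[docs.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (i, j) -> overlap(docs.get(j), query) - overlap(docs.get(i), query));
        // Majority vote among the k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++)
            votes.merge(labels.get(idx[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("gene", "dna", "sequence")),
            new HashSet<>(Arrays.asList("dna", "protein")),
            new HashSet<>(Arrays.asList("stock", "market")));
        List<String> labels = Arrays.asList("bio", "bio", "finance");
        Set<String> query = new HashSet<>(Arrays.asList("dna", "gene"));
        System.out.println(classify(docs, labels, query, 3));  // prints bio
    }
}
```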
JFreeChart
JFreeChart is a free, 100% Java chart library that makes it easy for developers to display professional-quality charts in their applications. JFreeChart's wide feature set includes: a consistent and well-documented API, supporting a wide range of chart types; a flexible design that is easy to extend and that targets both server-side and client-side applications; support for many output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG). JFreeChart is "open source" or, more specifically, free software: it is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits use in proprietary applications.
Decision Trees:
Decision trees are classifiers which consist of a set of rules applied in a sequential way, finally yielding a decision. They are best explained by observing the training process, which starts with a comprehensive training set and uses a divide-and-conquer strategy: for a training set M with labelled documents, the word ti is selected which predicts the class of the documents best, e.g. by the information gain.
Decision trees are a standard tool in data mining. They are fast and scalable both in the number of variables and in the size of the training set. For text mining, however, they have the shortcoming that the final decision depends on relatively few terms.
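The information-gain criterion mentioned above can be computed as in this Java sketch for a two-class problem (illustrative; the counting interface is our own simplification):

```java
// Information gain of a term for a two-class training set: the criterion a
// decision-tree learner uses to pick the best splitting word.
public class InfoGain {
    // Entropy (in bits) of a two-class distribution with the given counts.
    static double entropy(int pos, int neg) {
        double total = pos + neg;
        double h = 0.0;
        for (int c : new int[]{pos, neg})
            if (c > 0) { double p = c / total; h -= p * (Math.log(p) / Math.log(2)); }
        return h;
    }

    // posWith/negWith: class counts among documents containing the term;
    // posWithout/negWithout: counts among documents not containing it.
    static double gain(int posWith, int negWith, int posWithout, int negWithout) {
        int n = posWith + negWith + posWithout + negWithout;
        int with = posWith + negWith, without = posWithout + negWithout;
        return entropy(posWith + posWithout, negWith + negWithout)
             - (double) with / n * entropy(posWith, negWith)
             - (double) without / n * entropy(posWithout, negWithout);
    }

    public static void main(String[] args) {
        // A term that perfectly separates 4 positive from 4 negative documents
        // has gain 1 bit; a term whose split leaves both halves mixed has gain 0.
        System.out.printf("%.2f %.2f%n", gain(4, 0, 0, 4), gain(2, 2, 2, 2));  // 1.00 0.00
    }
}
```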
Classifier Evaluations:
Over the last years, text classifiers have been evaluated on a number of benchmark document collections. It turned out that the level of performance naturally depends on the document collection.
Classification is one of the data mining techniques most often used to examine a given data set: it takes each instance of the data set and assigns it to a specific class such that the classification error is minimized. It is used to extract models that accurately describe important data classes within the given data set. Classification is a two-step process: in the first step the model is created by applying a classification algorithm to the training data set; in the second step the extracted model is tested against a predefined test data set to measure the trained model's performance and accuracy. Classification is thus the procedure of assigning a class label to data whose class label is unknown.
The Java programming language was designed with the following characteristics in mind:
Simple
Architecture neutral
Object-oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, you first translate a program into an intermediate language called Java bytecodes: the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java bytecode instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
You can think of Java bytecodes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java bytecodes help make "write once, run anywhere" possible. You can compile a program into bytecodes on any platform that has a Java compiler. The bytecodes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.
A platform is the hardware or software environment in which a program runs; most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it is a software-only platform that runs on top of other hardware-based platforms.
You have already been introduced to the Java VM. It's the base for the Java platform and is ported onto various hardware-based platforms. The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights what functionality some of the packages in the Java API provide.
The following figure depicts a program that is running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after you compile it, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time bytecode compilers can bring performance close to that of native code without threatening portability.
Every full implementation of the Java platform gives you the following features:
The essentials: objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.
Internationalization: used to write programs that can be localized for users worldwide. Programs automatically adapt to specific locales and can be displayed in the appropriate language.
Security: both low level and high level, including electronic signatures, public and private key management, access control, and certificates.
Software components: known as JavaBeans, which can plug into existing component architectures.
Object serialization: allows lightweight persistence and communication via Remote Method Invocation (RMI).
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
Still, the Java programming language is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:
Write once, run anywhere: Because 100% Pure Java programs are compiled
into machine-independent byte codes, they run consistently on any Java
platform.
Distribute software more easily: You can upgrade applets easily from a
central server. Applets take advantage of the feature of allowing new classes to
be loaded “on the fly,” without recompiling the entire program.
One design aim of Java is portability: programs written for the Java platform should run identically on any combination of hardware and operating system with adequate runtime support. This is achieved by compiling the Java language code to an intermediate representation called Java bytecode, instead of directly to architecture-specific machine code.
Java bytecode instructions are analogous to machine code, but they are intended to be executed by a virtual machine (VM) written specifically for the host hardware. End users commonly use a Java Runtime Environment (JRE) installed on their own machine for standalone Java applications, or in a web browser for Java applets. Standard libraries provide a generic way to access host-specific features such as graphics, threading, and networking.
The use of universal bytecode makes porting simple. However, the overhead of interpreting bytecode into machine-code instructions meant that interpreted programs almost always ran more slowly than native executables. Just-in-time (JIT) compilers that compile bytecode to machine code at runtime were therefore introduced at an early stage. Java is itself platform-independent and is adapted to the particular platform it is to run on by a Java virtual machine, which translates the Java bytecode into the platform's machine language.
BioJava is a mature open-source project that provides a framework for processing biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats, and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language.
BioJava is written entirely in the Java programming language and will run on any platform for which a Java 1.5 run-time environment is available. Java 5 and Java 6 provide advanced language features, and we shall be taking advantage of these in the next major release, both to aid maintenance of the library and to make it even easier for novice Java developers to use the BioJava APIs.
At the core of BioJava is a symbolic alphabet API which represents sequences as a list of references to singleton symbol objects that are derived from an alphabet. Lists of symbols are stored, where possible, in a compressed form of up to four symbols per byte of memory.
In addition to the fundamental symbols of a given alphabet (A, C, G and T in the case of DNA), all BioJava alphabets implicitly contain extra symbol objects representing all possible combinations of the fundamental symbols.
When biological sequence data first became available, it was necessary to find a convenient way to communicate it. A logical approach is to represent every monomer in a biological macromolecule using a single letter, usually the initial letter of the chemical entity being described, for instance 'T' for thymidine residues in DNA. When this type of data was entered into computers, it was logical to use the same scheme.
A lot of computational biology software is based upon the normal string-handling APIs. While the concept of a sequence as a string of ASCII characters has served us well to date, there are several issues which can present problems to the programmer:
Validation:
Ambiguity:
The meaning of each symbol is not necessarily clear. The 't' which denotes thymidine in DNA is the same 't' that denotes a threonine residue in a protein sequence.
Limited alphabet:
While there are obvious encodings for nucleic-acid and protein sequence data as strings, the same method does not always work well for other kinds of data generated by biological sequence analysis software. BioJava takes a rather different approach to sequence data: instead of using a string of ASCII characters, a sequence is modelled as a list of Java objects implementing the Symbol interface.
This class, and the others defined here, are part of the Java package org.biojava.bio.symbol.
All Symbol instances have a name property (for instance, thymidine). They may optionally have extra information associated with them (for instance, information about the chemical properties of a DNA base) stored in a standard BioJava data structure called an Annotation. Annotations are just sets of key-value data.
The last method, getMatches, is only important for ambiguous symbols, which are covered at the end of this chapter. The set of Symbol objects which may be found in a particular type of sequence data is defined in an Alphabet. It is always possible to define custom Symbols and Alphabets, but BioJava supplies a set of predefined alphabets for representing biological molecules.
These are accessible through a central registry, called the AlphabetManager, and through convenience methods:
FiniteAlphabet dna = DNATools.getDNA();
Iterator dnaSymbols = dna.iterator();
while (dnaSymbols.hasNext()) {
    Symbol s = (Symbol) dnaSymbols.next();
    System.out.println(s.getName());
}
Input: Sequence database SDB, minsup, gap = [a, b], len = [minlen, maxlen]
Output: The frequent patterns in meta
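A toy version of such a mining procedure can be sketched in Java. This is not the actual algorithm of this report: support is simplified here to the number of database sequences containing at least one gap-constrained occurrence (rather than the non-overlapping occurrence count), which keeps support anti-monotone and allows Apriori-style pruning. All class and method names are our own.

```java
import java.util.*;

// Toy gap-constrained frequent pattern miner over the DNA alphabet.
public class MiniMiner {
    static final char[] ALPHABET = {'A', 'C', 'G', 'T'};

    // Does s contain p with a..b letters allowed between consecutive pattern letters?
    static boolean occurs(String s, String p, int a, int b) {
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) == p.charAt(0) && extend(s, p, 1, i, a, b)) return true;
        return false;
    }

    static boolean extend(String s, String p, int k, int prev, int a, int b) {
        if (k == p.length()) return true;
        for (int j = prev + 1 + a; j <= prev + 1 + b && j < s.length(); j++)
            if (s.charAt(j) == p.charAt(k) && extend(s, p, k + 1, j, a, b)) return true;
        return false;
    }

    // Support = number of database sequences with at least one occurrence.
    static int support(List<String> sdb, String p, int a, int b) {
        int n = 0;
        for (String s : sdb) if (occurs(s, p, a, b)) n++;
        return n;
    }

    // Level-wise enumeration: a pattern is extended only if it is frequent,
    // since any occurrence of p+c contains an occurrence of p (anti-monotone).
    static List<String> mine(List<String> sdb, int minsup, int a, int b,
                             int minlen, int maxlen) {
        List<String> meta = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>();
        for (char c : ALPHABET) queue.add(String.valueOf(c));
        while (!queue.isEmpty()) {
            String p = queue.poll();
            if (support(sdb, p, a, b) < minsup) continue;  // prune whole branch
            if (p.length() >= minlen) meta.add(p);
            if (p.length() < maxlen)
                for (char c : ALPHABET) queue.add(p + c);
        }
        return meta;
    }

    public static void main(String[] args) {
        List<String> sdb = Arrays.asList("ACGTAC", "ATGCAC", "AAGTCC");
        // minsup = 3, gap = [0, 2], len = [2, 2]
        System.out.println(mine(sdb, 3, 0, 2, 2, 2));  // prints [AC, AG, AT, GC, TC]
    }
}
```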
CHAPTER – 2
LITERATURE SURVEY
OVERVIEW:
A literature review is a description of what has been published on a topic by accredited scholars and researchers. In writing the literature review, the purpose is to convey to the reader what knowledge and ideas have been established on the topic, and what their strengths and weaknesses are.
As a part of writing, the literature review one must be defined by a guiding the
concept (e.g., your research objective, the problem or issue you are discussing or your
argumentative thesis). It is not just a descriptive list of the material available, or a set of
summaries, and it is part of the introduction to an essay, research report.
The support (or the number of occurrences) of a pattern is calculated to determine whether the pattern is frequent or not. A state-of-the-art approach to sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to represent the frequency of a pattern. Non-overlapping means that no two occurrences may use the same character of the sequence at the same position of the pattern.
This paper examines strict pattern matching under the non-overlapping condition. It first shows that the problem is in P. It then proposes an algorithm, called NETLAP-Best, which uses the Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree, iterates to find the rightmost root-leaf path, and prunes the useless nodes in the Nettree after removing that path. The paper shows that NETLAP-Best is a complete algorithm and analyses its time and space complexities. Extensive experimental results validate the correctness and efficiency of NETLAP-Best.
the support of the pattern. We use the discovered patterns to produce confident classification rules, and present two different ways of building a classifier. The first classifier is based on an enhanced version of an existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object.
Experimental results show that the rule-based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Moreover, we test a number of pattern-feature-based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well and prove effective for the classification task.
An algorithm named NAMEIC (Nettree for pattern Matching with flexible wildcard Constraints), based on the Nettree, is designed to solve pattern matching with flexible wildcard constraints. The problem is exponential with regard to the pattern length m.
We prove the correctness of the algorithm and illustrate how it works through an example. NAMEIC is W*m times faster than an existing approach, because the results can be given after creating the Nettree in one pass, where W is the maximal gap flexibility. Experiments validate the correctness and efficiency of NAMEIC.
We consider string matching with variable length gaps. Given a string T and a pattern
P consisting of strings separated by variable length gaps (arbitrary strings of length in a
specified range), the problem is to find all ending positions of substrings in T that match P.
This problem is a basic primitive in computational biology applications. Let m and n be the
lengths of P and T, respectively, and let k be the number of strings in P.
We present a new algorithm achieving time O((n+m) log k+α) and space O(m+A),
where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total
number of occurrences of the strings in P within T. Compared to the previous results, this
bound essentially achieves the best known time and space complexities simultaneously.
Consequently, our algorithm obtains the best known bounds for almost all combinations of m,
n, k, A, and α. Our algorithm is surprisingly simple and straightforward to implement.
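The problem can be made concrete with a naive brute-force sketch. The class and method names below are hypothetical, and this version does not achieve the O((n + m) log k + α) bound of the algorithm described above; it only illustrates what "strings separated by variable length gaps" means.

```java
import java.util.ArrayList;
import java.util.List;

// Naive matcher for a pattern of k strings separated by variable length gaps.
// pieces[i] must occur after pieces[i-1] with a gap length in [minGap[i-1], maxGap[i-1]].
public class GapMatcher {

    // Returns the ending positions (exclusive) in t of substrings matching the pattern.
    public static List<Integer> endPositions(String t, String[] pieces,
                                             int[] minGap, int[] maxGap) {
        List<Integer> ends = new ArrayList<>();
        for (int i = 0; i + pieces[0].length() <= t.length(); i++) {
            if (t.startsWith(pieces[0], i)) {
                search(t, pieces, minGap, maxGap, 1, i + pieces[0].length(), ends);
            }
        }
        return ends;
    }

    // Try every allowed gap length before the k-th piece, recursing on success.
    private static void search(String t, String[] pieces, int[] minGap,
                               int[] maxGap, int k, int from, List<Integer> ends) {
        if (k == pieces.length) { ends.add(from); return; }
        for (int g = minGap[k - 1]; g <= maxGap[k - 1]; g++) {
            int start = from + g;
            if (start + pieces[k].length() > t.length()) break;
            if (t.startsWith(pieces[k], start)) {
                search(t, pieces, minGap, maxGap, k + 1, start + pieces[k].length(), ends);
            }
        }
    }
}
```

For example, matching the pattern "A", gap of exactly 2, "B" against the text "AXXB" reports the single ending position 4.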
CHAPTER – 3
METHODOLOGY
This chapter describes a novel motif detection algorithm called MEED that uses a concurrent traversal of two suffix trees to efficiently explore the space of all motifs. We then present an algorithm that uses MEED as a building block and can mine combinations of simple approximate motifs under relaxed constraints.
The approach we take in MEED explores the space of all possible models. In order to perform this exploration efficiently, we first construct two suffix trees: a suffix tree on the actual data set that contains counts in every node (called the data suffix tree), and a suffix tree on the set of all possible model strings (called the model suffix tree).
Input: Sequence S, Pattern P, gap = [a, b], len = [minlen, maxlen], and minsup
Output: sup(P, S)
1: Create a nettree of P in S;
2: Prune nodes without child nodes (per Lemma 3);
3: for each root n_i^1 in the nettree do
4:     node[1] ← n_i^1; // node is used to store an occurrence
5:     for j = 1 to nettree.level − 1 step 1 do
6:         node[j+1] ← the leftmost child of node[j] meeting the length constraints;
7:     end for
8:     sup(P, S) ← sup(P, S) + 1;
9: end for
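A minimal Java sketch of the greedy idea behind this pseudocode follows, assuming the gap [a, b] counts the characters skipped between adjacent pattern characters and the one-off condition marks each matched position as used. The class and method names are illustrative, not the authors' implementation, and the length constraint is omitted for brevity.

```java
// Greedy leftmost counting of non-overlapping occurrences of pattern p in
// sequence s, with a gap of a..b characters between adjacent pattern characters.
public class NonOverlapCounter {

    public static int support(String s, String p, int a, int b) {
        boolean[] used = new boolean[s.length()];
        int sup = 0;
        for (int start = 0; start < s.length(); start++) {
            if (used[start] || s.charAt(start) != p.charAt(0)) continue;
            int[] occ = tryExtend(s, p, a, b, used, start);
            if (occ != null) {
                for (int pos : occ) used[pos] = true; // one-off condition
                sup++;
            }
        }
        return sup;
    }

    // Extend the occurrence greedily: for each remaining pattern character,
    // take the leftmost unused matching position within the allowed gap window.
    private static int[] tryExtend(String s, String p, int a, int b,
                                   boolean[] used, int start) {
        int[] occ = new int[p.length()];
        occ[0] = start;
        for (int j = 1; j < p.length(); j++) {
            int prev = occ[j - 1];
            int found = -1;
            for (int pos = prev + 1 + a; pos <= prev + 1 + b && pos < s.length(); pos++) {
                if (!used[pos] && s.charAt(pos) == p.charAt(j)) { found = pos; break; }
            }
            if (found < 0) return null; // cannot complete this occurrence
            occ[j] = found;
        }
        return occ;
    }
}
```

For example, in the sequence "AABB" the pattern "AB" with gap [0, 1] has two non-overlapping occurrences, (0, 2) and (1, 3), because no sequence position is shared between them.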
Dataset Processing
In this module the data sets are loaded from the system into the application. Here we mainly upload DNA data. DNA data sets are typically very large in practice, so finding patterns in them is a highly expensive task in terms of system speed, accuracy, and size.
Gap Restraints:
Gap restraints are used to detect the exact location of the data and the incidence of null data. With them, the number of mined patterns and the mining speed are comparatively high and accurate. Using these restraints produces accurate data comparison values.
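A small hypothetical helper shows how such a check might look for a single candidate occurrence, assuming the gap is the number of characters skipped between adjacent matches and the length is the overall span of the occurrence. The class name is illustrative.

```java
// Check that a candidate occurrence (positions of each matched pattern
// character in the sequence) respects the gap restraint [a, b] and the
// overall length restraint [minlen, maxlen].
public class GapRestraint {

    public static boolean satisfies(int[] occ, int a, int b, int minlen, int maxlen) {
        for (int j = 1; j < occ.length; j++) {
            int gap = occ[j] - occ[j - 1] - 1; // characters skipped between matches
            if (gap < a || gap > b) return false;
        }
        int span = occ[occ.length - 1] - occ[0] + 1; // total length of the occurrence
        return span >= minlen && span <= maxlen;
    }
}
```

For example, the occurrence at positions (0, 2, 4) satisfies gap = [1, 1] (one character skipped between each pair of matches) but fails gap = [0, 0].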
Non-overlapping:
This mainly eliminates overlapping of data when the data are loaded into the data set. It also performs the segregation of similar patterns with different sequential orders.
CHAPTER – 4
IMPLEMENTATION:
Implementation is the phase of the project when the theoretical design is turned into a working system. It can thus be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.
PROCESSING
Load Dataset: Processing is the first main menu in this project. This menu loads the text file, extracts and updates the data, and then splits the data from the extracted file.
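A hedged sketch of this loading step follows, assuming one DNA sequence per line of the text file; the file format, class name, and validation rule are illustrative assumptions, not the project's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Load a text file of DNA sequences (assumed one per line) and keep only
// lines composed of the bases A, C, G, T, normalising them to upper case.
public class DatasetLoader {

    public static List<String> filter(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        for (String line : lines) {
            String s = line.trim().toUpperCase();
            if (!s.isEmpty() && s.matches("[ACGT]+")) sequences.add(s);
        }
        return sequences;
    }

    // Read the whole file and apply the same splitting/validation.
    public static List<String> load(Path file) throws IOException {
        return filter(Files.readAllLines(file));
    }
}
```

This keeps the later mining steps free of malformed records, which matters because, as noted above, the DNA data sets are large and errors are expensive to discover late.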
FLAME
Data Suffix Tree: This sub-menu fetches the data from the loaded data set and splits it into the tree format. Show Suffix Tree is used to view the split tree format.
Flexible and Accurate Motif Detector: In this module we enter the DNA value and the character length, choose occurrence-based or sequence-based matching, and perform pattern discovery based on length. Finally, the filtered data is shown in the field.
VISUALIZATION
The visualization module is used to view the final dataset report and the flame chart report.
CHAPTER – 5
RESULTS AND DISCUSSION
This chapter presents the analysis and experimental results. The experimental analysis results show the comparison of results based on the proposed algorithms.
Table 5.1 Comparison of the accuracy of various algorithms

Algorithm     Accuracy (Range 100%)
Apriori       75%
RARM          68%
FP-Growth     57%
MEED          92%

Table 5.1 shows the comparative analysis results based on accuracy. Here the proposed algorithm MEED is compared with the existing algorithms: the Apriori algorithm, RARM, and the FP-Growth algorithm. When the accuracy range is taken as 100%, the value obtained for the MEED algorithm is 92%, which is higher than the accuracy range of the other algorithms.
Fig 5.1 shows the accuracy range of various algorithms such as Apriori algorithm,
RARM algorithm, FP-Growth algorithm and MEED algorithm.
Table 5.2 shows the comparative analysis results for the mining speed on the data sets. Here the proposed algorithm MEED is compared with the existing algorithms: the Apriori algorithm, RARM, and the FP-Growth algorithm. The mining speed of MEED is comparatively higher than that of the other algorithms.
Table 5.3 Comparison of the number of mined patterns under different gaps in DNA sequences
Table 5.3 shows the comparative analysis results for the number of mined patterns under different gap restraints in DNA sequence data sets. Here the proposed algorithm MEED is compared with the existing algorithms: the Apriori algorithm, RARM, and the FP-Growth algorithm. The number of patterns mined by MEED under the gap restraints is comparatively higher than for the other algorithms.
Table 5.4 Comparison of the number of mined patterns under different lengths in DNA sequences
Table 5.4 shows the comparative analysis results for the number of mined patterns under different lengths in DNA sequence data sets. Here the proposed algorithm MEED is compared with the existing algorithms: the Apriori algorithm, RARM, and the FP-Growth algorithm. The number of patterns mined by MEED under the different lengths is comparatively higher than for the other algorithms.
CHAPTER – 6
6.1 CONCLUSION
This paper presented a powerful new model, (L, M, s, k), for motif mining in sequence databases. The (L, M, s, k) model subsumes several existing models and provides additional flexibility that makes it applicable in a wider variety of data mining applications. It also presented MEED, a flexible and accurate algorithm that can find (L, M, s, k) motifs. Through a series of experiments on real and synthetic data sets, we demonstrated that MEED is a versatile algorithm that can be used in several real motif mining tasks.
We also showed that MEED outperforms existing time series mining algorithms (Random Projections) by more than an order of magnitude. MEED is also superior to motif finding algorithms used in computational biology (more accurate than Weeder, significantly faster than YMF). We also presented experiments which show that MEED can scale to handle motif mining tasks much larger than attempted before. Finally, we presented and evaluated a flexible method for extracting combinations of simple motifs under relaxed constraints.
Given the way in which MEED computes the support for various candidate models, the algorithm can easily combine the computation for many different lengths if the number of mismatches is common across all lengths.
MEED builds the suffix tree on all strings of length Lmax. At any node, if the length of the model falls in the range of lengths considered and the support is greater than the minimum support, we output that model and continue the traversal. When considering only one length at a time, a valid model would only be found at a leaf node of the suffix tree, since the tree consists of strings only of length L. However, by allowing lengths in the range Lmin to Lmax, we can output valid models at depths starting at Lmin.
6.3 BIBLIOGRAPHY
[1] C. C. Aggarwal and J. Han, Frequent Pattern Mining. Cham,Switzerland: Springer, 2014.
[2] S. Ventura and J. M. Luna, Pattern Mining With Evolutionary Algorithms. Cham,
Switzerland: Springer, 2016.
[3] C. Li, Q. Yang, J. Wang, and M. Li, “Efficient mining of gap-constrained subsequences
and its various applications,” ACM Trans. Knowl. Disc. Data, vol. 6, no. 1, p. 2, 2012.
[4] B. Le, M.-T. Tran, and B. Vo, “Mining frequent closed inter-sequence patterns efficiently
using dynamic bit vectors,” Appl. Intell., vol. 43, no. 1, pp. 74–84, 2015.
[5] S. Zhang, Z. Du, and J. T. L. Wang, “New techniques for mining frequent patterns in
unordered trees,” IEEE Trans. Cybern., vol. 45, no. 6, pp. 1113–1125, Jun. 2015.
[6] L. Zhang et al., “Occupancy-based frequent pattern mining*,” ACM Trans. Knowl. Disc.
Data, vol. 10, no. 2, p. 14, 2015.
[7] F. Min, Y. Wu, and X. Wu, “The Apriori property of sequence pattern mining with
wildcard gaps,” Int. J. Funct. Informat. Personalised Med.,vol. 4, no. 1, pp. 15–31, 2012.
[8] C. D. Tan, F. Min, M. Wang, H.-R. Zhang, and Z.-H. Zhang, “Discovering patterns with
weak-wildcard gaps,” IEEE Access, vol. 4, pp. 4922–4932, 2016.
[9] F. Rasheed and R. Alhajj, “A framework for periodic outlier pattern detection in time-series sequences,” IEEE Trans. Cybern., vol. 44, no. 5, pp. 569–582, May 2014.
[10] H. Jiang, J. Zhang, H. Ma, N. Nazar, and Z. Ren, “Mining authorship characteristics in
bug repositories,” Sci. China Inf. Sci., vol. 60, no. 1, pp. 1–16, 2017.
[11] E. Egho, D. Gay, M. Boullé, N. Voisine, and F. Clérot, “A user parameter-free approach
for mining robust sequential classification rules,” Knowl. Inf. Syst., vol. 52, no. 1, pp. 53–81,
2017.
[12] C. Zhou, B. Cule, and B. Goethals, “Pattern based sequence classification,” IEEE Trans.
Knowl. Data Eng., vol. 28, no. 5, pp. 1285–1298, May 2016.
[13] X. Wu, X. Zhu, Y. He, and A. N. Arslan, “PMBC: Pattern mining from biological
sequences with wildcard constraints,” Comput. Biol. Med., vol. 43, no. 5, pp. 481–492, 2013.
[15] H. Yang et al., “Mining top-k distinguishing sequential patterns with gap constraint,” J.
Softw., vol. 26, no. 11, pp. 2994–3009, 2015.
[16] Y. Wu, L. Wang, J. Ren, W. Ding, and X. Wu, “Mining sequential patterns with periodic
wildcard gaps,” Appl. Intell., vol. 41, no. 1, pp. 99–116, 2014.
[17] M. Zhang, B. Kao, D. W. Cheung, and K. Y. Yip, “Mining periodic patterns with gap
requirement from sequences,” ACM Trans. Knowl. Disc. Data, vol. 1, no. 2, p. 7, 2007.
[18] H.-F. Wang et al., “Efficient mining of distinguishing sequential patterns without a
predefined gap constraint,” Chin. J. Comput., vol. 39, no. 10, pp. 1979–1991, 2016.
[19] P. Bille, I. L. Gørtz, H. W. Vildhøj, and D. K. Wind, “String matching with variable
length gaps,” Theor. Comput. Sci., vol. 443, pp. 25–34, Jul. 2012
[20] X. Wu, J.-P. Qiang, and F. Xie, “Pattern matching with flexible wildcards,” J. Comput.
Sci. Technol., vol. 29, no. 5, pp. 740–750, 2014.
CHAPTER – 7
APPENDIX
A) SYSTEM DIAGRAM:
(System diagram components: the pattern / data string is checked against the DB to get all possible matches.)
B) SCREENSHOT
HOME PAGE:
ADMIN PAGE:
UPLOAD PAGE:
REPORT PAGE: