
EFFICIENT AND ACCURATE DISCOVERY OF PATTERNS IN SEQUENCE DATA SETS

ABSTRACT

Sequence mining algorithms mostly focus on mining for subsequences. However, a large class of applications, such as DNA and protein motif mining, requires efficient mining of "maximum" patterns that are contiguous. Most previous algorithms that can be applied to such contiguous approximate pattern mining, such as the Apriori, FP-Growth and RARM algorithms, have drawbacks: poor scalability, no assurance of finding the pattern, and difficulty in adapting to other applications. In this research, we present a new algorithm called Maximum and Efficient Detector (MEED). MEED is a versatile algorithm that can be used to find frequent patterns under a variety of pattern-model definitions. It is also accurate, as it always discovers the pattern if it exists. Using both real and synthetic data sets, we demonstrate that MEED is fast and scalable and outperforms the existing NOSEP algorithm on a variety of performance metrics. In addition, based on MEED, we also address more general problems such as data accuracy, data mining, and clustering. The MEED algorithm combines non-overlapping extraction and gap constraints, which allows mining of frequent combinations from sequence data sets and of motifs whose gaps satisfy the given constraints, so as to frame the predicted data. From this filtered data the final report and the suffix tree are generated; the result therefore attains maximum accuracy and better outcomes, which is useful for finding exact matches and makes the DNA data set easy to track.


CHAPTER – 1

INTRODUCTION

1.1 OVERVIEW
Frequent pattern mining is a popular research area. Many algorithms have been proposed and many kinds of methods have been applied to tackle its issues, such as the Apriori strategy, pattern matching strategies, and even evolutionary algorithms. Sequence pattern mining, an important branch of frequent pattern mining research, aims to discover frequent subsequence patterns in a single sequence or in a sequence database. Such patterns are strongly correlated with meaningful events or knowledge within the data and are applied in numerous fields, such as mining customer purchase patterns, mining tree-structured information, travel-landscape recommendation, time-series analysis and prediction, bug repositories, sequence classification, biological sequence data analysis, and temporal uncertainty.

In the existing process, the NOSEP algorithm is used for pattern mining, and pattern occurrences are counted mainly under two approaches: 1) the no-condition; and 2) the one-off condition. A net-tree is used to construct a tree from the DNA data set under these two conditions. In this net-tree the DNA data are stored in whatever order they were uploaded to the database. Replication of DNA data is not restricted and the gap constraints between them are not computed, which leads to failures in matching the exact DNA data.
Of the two, the no-condition focuses only on matching the DNA patterns in the data set. When conditions are applied during pattern matching, there is a possibility of neglecting some of the data, depending on the conditions applied.
Even though this yields a reasonable way of matching, some drawbacks occur: under the no-condition, all available data in the data set are checked, and data are retained even if they only weakly match the existing data. Repetition of data therefore occurs and finding the matched data becomes very difficult.


The one-off condition, in contrast, imposes a fixed condition based on the data uploaded. With this condition some of the data are eliminated abruptly: some data sets are removed completely before they are even considered for pattern matching. Hence, some relevant data are never detected, data may be lost when constructing the net-tree, and it becomes difficult to trace the occurrence path of the data.
In addition, using only these two conditions it is problematic to determine the preferred data sets, and occurrences of recurrent or noisy data are not filtered out by these conditions. Consequently, constructing the net-tree and finding the frequent data in DNA will not give a faultless result, because DNA data are highly complex and contain many repeated components.

This method mainly focuses on finding pairs (or sets) of motifs that co-occur in the data set within a short distance of each other. It considers only a simple mismatch-based definition of noise, and does not consider other, more complex motif models such as a substitution matrix or a compatibility matrix.

In the proposed process, a new motif detector algorithm called MEED is used to measure the accuracy of matching and non-matching DNA data. The motif detector incorporates non-overlapping extraction of the data and detection of gap constraints.

We describe a novel motif detector algorithm called MEED that uses a concurrent traversal of two suffix trees to efficiently explore the space of all motifs. We then present an algorithm that uses MEED as a building block and can mine combinations of simple approximate motifs under relaxed constraints. MEED is a versatile algorithm that can be used in several real motif mining tasks and outperforms existing time series mining algorithms (Random Projections) by more than an order of magnitude.

The approach we take in MEED explores the space of all possible models. In order to carry out this exploration efficiently, we first construct two suffix trees: a suffix tree on the actual data set that stores a count in every node (called the data suffix tree), and a suffix tree on the set of all possible model strings (called the model suffix tree).


Non-overlapping extraction addresses the case where an occurrence of DNA data has a similar pattern in a different sequential order and therefore has a chance to overlap with similar DNA data that already exist. Using this technique such overlaps are blocked and reduced, which in turn reduces the time spent searching and matching the data.

Here the data are searched as sequence patterns. Because a sequence pattern comprises multiple pattern letters occurring in a sequential order, it is possible that, when the pattern occurs in the sequence, two pattern letters appear in the required sequential order but with varying numbers of letters between them.
Gap constraints are mainly used to find the allocation of the bases, such as thymine, cytosine, adenine and guanine, whose allocation may vary. With the help of these gap constraints one can also calculate the matching criteria and the allocation of the DNA data set.

A small gap between pattern letters is too restrictive to find valid patterns, whereas a large gap makes the pattern too general to represent meaningful knowledge within the data. Because gap constraints allow users to set the gap flexibility to meet their distinct needs, sequence pattern mining with gap constraints has been applied to many fields, including medical emergency identification, mining biological characteristics, mining customer purchase patterns, and feature extraction.
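To make the idea of a gap constraint concrete, the small sketch below (our illustration, not part of the MEED implementation) checks whether a pattern such as T-C-A occurs in a DNA sequence when between minGap and maxGap arbitrary letters are allowed between consecutive pattern letters; all names are assumptions introduced for the example.

// Minimal sketch (not the MEED implementation): test whether a pattern
// occurs in a sequence when each pair of adjacent pattern letters may be
// separated by a gap of between minGap and maxGap arbitrary letters.
public class GapMatch {

    // Returns true if pattern occurs in seq under the gap constraint [minGap, maxGap].
    static boolean occurs(String seq, String pattern, int minGap, int maxGap) {
        for (int start = 0; start < seq.length(); start++) {
            if (seq.charAt(start) == pattern.charAt(0)
                    && extend(seq, pattern, 1, start, minGap, maxGap)) {
                return true;
            }
        }
        return false;
    }

    // Try to place pattern letter p after position prev, respecting the gap.
    private static boolean extend(String seq, String pattern, int p, int prev,
                                  int minGap, int maxGap) {
        if (p == pattern.length()) {
            return true;                         // all pattern letters placed
        }
        for (int pos = prev + minGap + 1; pos <= prev + maxGap + 1 && pos < seq.length(); pos++) {
            if (seq.charAt(pos) == pattern.charAt(p)
                    && extend(seq, pattern, p + 1, pos, minGap, maxGap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "TCA" occurs in the first sequence if 0..2 letters may separate T, C and A.
        System.out.println(occurs("TGCGA", "TCA", 0, 2));   // true
        System.out.println(occurs("TGGGCA", "TCA", 0, 2));  // false (gap of 3 before C)
    }
}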
In addition, based on MEED, we also address more general problems such as data accuracy, data mining, and clustering. The MEED algorithm combines non-overlapping extraction and gap constraints, which allows mining of frequent combinations from sequence data sets and of motifs whose gaps satisfy the given constraints, so as to frame the predicted data. From this filtered data the final report and the suffix tree are generated; the result therefore attains maximum accuracy and better outcomes, which is useful for finding exact matches and makes the DNA data set easy to track.
However, in a sequence data environment, counting the occurrences of a pattern in the sequence is inherently complicated, because a letter in the sequence may match multiple pattern letters and different matchings may yield different frequency counts.


1.2 DATA MINING:

Data mining, also called knowledge discovery in databases, is, in computer science, the process of discovering interesting and useful patterns and relationships in large volumes of data. Data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and methods may vary when applied to different types of data, and indeed the challenges presented by different types of data vary considerably.

Data mining is being put into use and studied for many kinds of databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files.

The field combines tools from statistics and artificial intelligence with database
management to analyze large digital collections, known as data sets. Data mining is widely
used in business, scientific research, and government security.

In data mining, the process of extracting patterns from text is called text data mining: the detection by computer of new, previously unknown information, by automatically extracting it from a usually large amount of different unstructured textual resources. In text mining, patterns are mined from natural language text rather than from databases.

Database Data:
A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data. The software programs provide mechanisms for defining
database structures and data storage; for specifying and managing concurrent, shared,
or distributed data access; and for ensuring consistency and security of the information
stored despite system crashes or attempts at unauthorized access.
A relational database is a collection of tables, each of which is assigned a unique
name. Each table consists of a set of attributes (columns or fields) and usually stores a large
set of tuples (records or rows). Each tuple in a relational table represents an object identified
by a unique key and described by a set of attribute values. A semantic data model, such as an

entity-relationship (ER) data model, is often constructed for relational databases. An ER data
model represents the database as a set of entities and their relationships.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks perform induction on the current data in order to make predictions. Data mining functionalities, and the kinds of patterns they can discover, are described below. Interesting patterns represent knowledge.

TEXT MINING:

Text mining, also referred to as text data mining and roughly comparable to text analytics, refers to the process of deriving high-quality information from text. In text mining, different types of documents are processed. We can give different definitions of text mining, each inspired by a specific perspective on the area:

Text Mining = Information Extraction. The first approach assumes that text mining essentially corresponds to information extraction, the extraction of facts from texts.

Text Mining = Text Data Mining. Text mining can also be described, similarly to data mining, as the application of algorithms and methods from the fields of machine learning and statistics to texts with the goal of finding useful patterns. For this purpose it is essential to pre-process the texts accordingly.

Text Mining = KDD Process. Following the knowledge discovery process model, text mining is frequently described in the literature as a process with a series of partial steps, among them information extraction as well as the use of data mining or statistical procedures. The following four levels are most commonly used to represent documents:

Characters:

The separate component-level letters, numerals, special characters and spaces are the
building blocks of higher level semantic features such as words, terms, and concepts. A

character-level representation can include the full set of all characters for a document or some filtered subset. Character-based representations that include some level of positional information are somewhat more useful and common.

Words:

Specific words selected directly from a “native” document are at what might be described as the basic level of semantic richness. For this reason, word-level structures are sometimes referred to as existing in the native feature space of a document.

Terms:

Terms are single words and multiword expressions selected directly from the corpus of a native document by means of term-extraction methodologies. Term-level features, in the sense of this definition, can only be made up of specific words and expressions found within the native document for which they are meant to be generally representative. A term-based representation of a document is necessarily composed of a subset of the terms in that document.

Concepts:

Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid methodologies. Concept-level features can be generated manually for documents but are now more commonly extracted from documents using complex preprocessing routines that identify single words and multiword expressions related to specific concept identifiers. High-quality information is typically derived through the formulation of patterns and trends by means such as statistical pattern learning.

Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling.

TEXT MINING MODEL AND WORKING FUNCTIONS:


Figure.1 Text Mining Model

WORKING FUNCTIONS:

Text mining includes the application of methods from areas such as information retrieval, natural language processing, information extraction and data mining. These several phases of a text-mining method can be combined into a single workflow.

Information retrieval (IR)

Information retrieval is the discovery of the documents which contain answers to questions, not the finding of the answers themselves. In order to attain this goal, statistical measures and methods are used for the automatic processing of text data and its comparison with the given question. Information retrieval in the broader sense deals with the whole range of information processing, from data retrieval to knowledge retrieval.

Natural language processing (NLP)

The overall aim of NLP is to attain a better understanding of natural language by use of computers. Some approaches also employ simple and robust techniques for the fast processing of text as it is presented. The range of techniques reaches from the simple manipulation of strings to the automatic processing of natural language queries, analysing the text in structures based on human speech. NLP permits the computer to perform a grammatical analysis of a sentence in order to “read” the text.

Information extraction (IE)

IE involves structuring the data that the NLP system produces. The objective of information extraction methods is the extraction of specific information from text documents, which is then stored in database-like patterns.


TEXT ENCODING:

For mining huge document collections it is necessary to pre-process the text documents and
store the information in a data structure, which is more appropriate for further processing
than a plain text file.

TEXT PREPROCESSING:

In order to obtain all words that are used in a given text, a tokenization process is required, i.e. a text document is split into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spaces. This tokenized representation is then used for further processing. The set of distinct words obtained by merging all text documents of a collection is called the dictionary of the document collection.
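As a simple illustration of the tokenization step described above, the following sketch splits a document into word tokens and builds the dictionary of distinct words; the regular expression and the lower-casing step are illustrative choices, not requirements taken from the source.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Minimal sketch: split a document into word tokens by removing punctuation
// and collapsing whitespace, then collect the distinct words (the "dictionary").
public class Tokenizer {
    public static void main(String[] args) {
        String document = "Text mining, also called text data mining, derives patterns from text.";

        // Replace everything that is not a letter or digit with a single space,
        // then split on whitespace.
        String[] tokens = document.toLowerCase()
                                  .replaceAll("[^\\p{L}\\p{N}]+", " ")
                                  .trim()
                                  .split("\\s+");

        Set<String> dictionary = new LinkedHashSet<>(Arrays.asList(tokens));
        System.out.println(tokens.length + " tokens, " + dictionary.size() + " distinct words");
        System.out.println(dictionary);
    }
}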

TEXT PREPROCESSING METHODS:

Filtering, Lemmatization and Stemming:

In order to reduce the size of the dictionary, and thus the dimensionality of the description of documents within the collection, the set of words describing the documents can be shrunk by filtering and by lemmatization or stemming methods.

Index Term Selection:

To further reduce the number of words that must be used, indexing or keyword-selection algorithms can also be applied. In this case, only the selected keywords are used to describe the documents. A simple technique for keyword selection is to extract keywords based on their entropy.

The Vector Space Model:

Regardless of its simple data structure without using any explicit semantic
information, the vector space model enables very efficient analysis of huge document
collections. It was originally introduced for indexing and information retrieval but is now
used also in several text mining approaches as well as in most of the currently available
document retrieval systems.


The vector space model represents documents as vectors in m-dimensional space, i.e. each document d is described by a numerical feature vector w(d) = (x(d, t1), ..., x(d, tm)). The foremost task of the vector space representation of documents is to find a suitable encoding of the feature vector. Each element of the vector typically represents a word (or a group of words) of the document collection, i.e. the size of the vector is determined by the number of words (or groups of words) of the whole document collection.

A size normalization factor is used to ensure that all documents have identical probabilities of being retrieved, independent of their lengths:

where N is the size of the document collection D and n_t is the number of documents in D that contain the term t.
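The weighting formula itself is not reproduced in the source text; with N and n_t as just defined, a standard tf-idf weight with length (cosine) normalization, stated here as an assumed reconstruction rather than a quotation, would be:

\[
w(d,t) = \frac{\mathrm{tf}(d,t)\,\log(N/n_t)}{\sqrt{\sum_{s=1}^{m} \mathrm{tf}(d,t_s)^2 \,\log^2(N/n_{t_s})}}
\]

where tf(d, t) denotes the frequency of term t in document d.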

Linguistic Pre-processing:

Frequently, text mining methods can be applied without additional preprocessing. Sometimes, however, additional linguistic preprocessing may be used to enhance the available information about terms. For this, the following approaches are frequently applied:

 Word sense disambiguation (WSD) tries to resolve the ambiguity in the sense of single words. An example is ‘bank’, which may have, among others, the senses ‘financial institution’ or ‘border of a river or lake’.
 Parsing yields a full parse tree of a sentence. From the parse, one can find the relation of each word in the sentence to all the others, and typically also its function in the sentence (e.g. subject, object, etc.).


1.3 CLASSIFICATION:

Classification is used to find out to which group each data instance belongs within a given dataset. It is used for classifying data into various classes according to some constraints. Several main classification algorithms, including C4.5, ID3, the k-nearest neighbour classifier, Naive Bayes, SVM, and ANN, are used for classification. Usually a classification method follows one of three approaches: statistical, machine learning or neural network. With these approaches in mind, this work provides an inclusive survey of different classification algorithms along with their features and limitations.

Data mining, in common terms, means mining or digging deep into data, which comes in different forms, to obtain patterns and to gain knowledge from those patterns. In the process of data mining, huge data sets are first sorted, then patterns are identified and relationships are recognized in order to perform data analysis and solve problems.

Classification is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Example: before beginning any project, we need to check its feasibility. In this case, a classifier is required to forecast class labels such as ‘Safe’ and ‘Risky’ for adopting the project and further approving it. It is a two-step process:
1. Learning step (training phase): construction of the classification model. Diverse algorithms are used to build the classifier by making the model learn from the available training set. The model has to be trained so that it predicts accurate results.
2. Classification step: the model is used to predict class labels; the constructed model is tested on test data in order to estimate the accuracy of the classification rules.

Text Classification:

Text classification aims at assigning pre-defined classes to text documents. An instance would be to automatically label each incoming news story with a topic
like “sports”, “politics”, or “art”. Whatever the specific method employed, a data mining classification task starts with a training set D = (d1, ..., dn) of documents that are already labelled with a class L ∈ L (e.g. sport, politics).

The task is then to determine a classification model f : D → L, f(d) = L, which is able to assign the correct class to a new document d of the domain. To measure the performance of a classification model, a random fraction of the labelled documents is set aside and not used for training.

Index Term Selection:

As document collections often comprise more than 100,000 different words, we may select the most informative ones for a specific classification task in order to reduce the number of words and thus the complexity of the classification problem at hand.

Naive Bayes Classifier:

Probabilistic classifiers start with the assumption that the words of a document di have been generated by a probabilistic mechanism. It is assumed that the class L(di) of document di has some relation to the words which appear in the document. This may be described by the conditional distribution p(t1, ..., tni | L(di)) of the ni words given the class. The Bayesian formula then yields the probability of a class given the words of a document.
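In its usual naive form, which the description above alludes to, the words are assumed to be conditionally independent given the class; the resulting decision rule is the standard one and is stated here as an assumption rather than as a quotation of the source:

\[
p(L \mid t_1,\dots,t_{n_i}) \propto p(L)\prod_{j=1}^{n_i} p(t_j \mid L),
\qquad
L(d_i) = \arg\max_{L}\; p(L)\prod_{j=1}^{n_i} p(t_j \mid L)
\]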

Nearest Neighbor Classifier:

Instead of building explicit models for the different classes, we may select documents from the training set which are “similar” to the target document. The class of the target document may then be inferred from the class labels of these similar documents. If the k most similar documents are considered, the approach is also known as k-nearest neighbour classification.
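A minimal k-nearest-neighbour sketch over term-frequency vectors follows; the cosine similarity measure, the majority vote and all names are illustrative assumptions, not details taken from the source.

import java.util.*;

// Minimal sketch: classify a document vector by majority vote among the
// k most similar (cosine similarity) labelled training vectors.
public class KnnClassifier {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training documents by decreasing similarity to the query.
        Arrays.sort(idx, (i, j) -> Double.compare(cosine(train[j], query), cosine(train[i], query)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++) {
            votes.merge(labels[idx[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] train = {{3, 0, 1}, {2, 1, 0}, {0, 4, 2}, {1, 3, 3}};
        String[] labels  = {"sports", "sports", "politics", "politics"};
        System.out.println(classify(train, labels, new double[]{2, 0, 1}, 3)); // "sports"
    }
}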

Classification Using Frequent Patterns:


Frequent patterns show interesting relationships between attribute–value pairs that occur frequently in a given data set. For example, we may find that the attribute–value
pairs age = youth and credit = OK occur in 20% of data tuples describing AllElectronics customers who buy a computer. We can think of each attribute–value pair as an item, so the search for these frequent patterns is known as frequent pattern mining or frequent itemset mining. We saw how association rules are derived from frequent patterns, where the associations are commonly used to analyze the purchasing patterns of customers in a store. Such analysis is useful in many decision-making processes such as product placement, catalogue design, and cross-marketing. Here we examine how frequent patterns can be used for classification: in associative classification, association rules are generated from frequent patterns and used for classification. The general idea is that we can search for strong associations between frequent patterns and class labels.
Discriminative frequent pattern–based classification, where frequent patterns serve as combined features, which are considered in addition to single features when building a classification model. Because frequent patterns explore highly confident associations among multiple attributes, frequent pattern–based classification may overcome some constraints introduced by decision tree induction, which considers only one attribute at a time. Studies have shown many frequent pattern–based classification methods to have greater accuracy and scalability than some traditional classification methods.

JFree Chart

JFreeChart is a free, 100 percent Java chart library that makes it easy for developers to display professional-quality charts in their applications. JFreeChart's wide feature set includes: a consistent and well-documented API, supporting a wide range of chart types; a flexible design that is easy to extend and targets both server-side and client-side applications; support for numerous output types, including Swing components, image files (including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG). JFreeChart is "open source" or, more specifically, free software. It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits use in proprietary applications.

Time Series Chart Interactivity:


Implement a new (to JFree Chart) feature for interactive time series charts --- to
display a separate control that shows a small version of ALL the time series data, with a
sliding "view" rectangle that allows you to select the subset of the time series data to display
in the main chart.
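As a small, hedged illustration of the library in use, the sketch below builds a simple (non-interactive) time-series chart and saves it as a PNG; the class and method names are from the public JFreeChart 1.0.x API, while the data values and file name are invented for the example.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.time.Day;
import org.jfree.data.time.TimeSeries;
import org.jfree.data.time.TimeSeriesCollection;

// Minimal sketch (JFreeChart 1.0.x API): build a one-series time-series
// chart and save it as a PNG image.
public class TimeSeriesChartDemo {
    public static void main(String[] args) throws Exception {
        TimeSeries series = new TimeSeries("Pattern count");
        series.add(new Day(1, 1, 2020), 12.0);   // day, month, year
        series.add(new Day(2, 1, 2020), 17.0);
        series.add(new Day(3, 1, 2020), 9.0);

        JFreeChart chart = ChartFactory.createTimeSeriesChart(
                "Mined patterns per day",        // chart title
                "Date",                          // x-axis label
                "Patterns",                      // y-axis label
                new TimeSeriesCollection(series),
                true, false, false);             // legend, tooltips, URLs

        ChartUtilities.saveChartAsPNG(new File("patterns.png"), chart, 800, 400);
    }
}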

Decision Trees:

Decision trees are classifiers which consist of a set of rules that are applied in a sequential way and finally yield a decision. They can best be explained by observing the training process, which starts with a comprehensive training set. It uses a divide-and-conquer strategy: for a training set M with labelled documents, the word ti is selected which can predict the class of the documents in the best way, e.g. by the information gain.

Decision trees are a standard tool in data mining. They are fast and scalable both in the number of variables and in the size of the training set. For text mining, though, they have the shortcoming that the final decision depends on only relatively few terms.

Support Vector Machines and Kernel Methods:

A Support Vector Machine (SVM) is a supervised classification algorithm that has recently been applied successfully to text classification tasks. The SVM algorithm determines a hyperplane which is located between the positive and negative examples of the training set. This amounts to a constrained quadratic optimization problem which can be solved efficiently for a large number of input vectors.

Figure.2 Hyper plane


Classifier Evaluations:

In recent years, text classifiers have been evaluated on a number of benchmark document collections. It turned out that the level of performance of course depends on the document collection.

Classification is one of the data mining techniques most commonly used to examine a given data set: it takes each instance of the data set and assigns it to a specific class such that the classification error is minimized. It is used to extract models that accurately describe important data classes within the given data set. Classification is a two-step process. In the first step, the model is built by applying a classification algorithm to the training data set; in the second step, the extracted model is tested against a predefined test data set to measure the performance and accuracy of the trained model. Classification is therefore the procedure of assigning a class label to data whose class label is unknown.

1.4 JAVA TECHNOLOGY:


Java technology is both a programming language and a platform. The Java programming language is a high-level language that can be characterized by all of the following buzzwords:

 Simple
 Architecture neutral
 Object-oriented
 Portable
 Distributed
 High performance
 Interpreted
 Multithreaded
 Robust
 Dynamic
 Secure
With most programming languages, one either compiles or interprets a program so that it can be run on a computer. The Java programming language is unusual in that a program is both compiled and interpreted. The compiler first translates a program into an intermediate language called Java bytecodes, the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java bytecode instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.



One can think of Java bytecodes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it is a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java bytecodes help make “write once, run anywhere” possible. One can compile the program into bytecodes on any platform that has a Java compiler. The bytecodes can then be run on any implementation of the Java VM. That means that, provided a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.



The Java Platform:

A platform is the hardware or software environment in which a program runs. We have already mentioned some of the most popular platforms, such as Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating
system and hardware. The Java platform differs from most other platforms in that it is a software-only platform that runs on top of other hardware-based platforms.

The Java platform has two components:

 The Java Virtual Machine (Java VM)
 The Java Application Programming Interface (Java API)

You have already been introduced to the Java VM. It is the base for the Java platform and is ported onto numerous hardware-based platforms. The Java API is a large collection of ready-made software components which provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights the functionality that some of the packages in the Java API provide.

The following figure represents a program that is running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.


Native code is code that, after being compiled, runs on a particular hardware platform. As a platform-independent environment, the Java platform can be a little slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time bytecode compilers can bring performance close to that of native code without compromising portability.

Every complete implementation of the Java platform provides the following features:

The essentials: objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on.

Applets: the set of conventions used by applets.

Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses.

Internationalization: help for writing programs that can be localized for users worldwide. Programs automatically adapt to specific locales and can be displayed in the appropriate language.

Security: both low level and high level, including electronic signatures, public and private key management, access control, and certificates.

Software components: known as JavaBeans, which can plug into existing component architectures.


Object serialization: allows lightweight persistence and communication via Remote Method Invocation (RMI).

Java Database Connectivity (JDBC): provides uniform access to a wide range of relational databases.

The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure represents what is included in the Java 2 SDK.

The Java programming language is likely to make your programs better and to require less effort than other languages. We believe that Java technology will help you do the following:

 Get started quickly: Although the Java programming language is a powerful


object-oriented language, it’s easy to learn, especially for programmers
already familiar with C or C++.
 Write less code: Comparisons of program metrics (class counts, method
counts, and so on) suggest that a program written in the Java programming
language can be four times smaller than the same program in C++.
 Write better code: The Java programming language encourages good coding
practices, and its garbage collection helps you avoid memory leaks. Its object
orientation, its JavaBeans component architecture, and its wide-ranging, easily
extendible API let you reuse other people’s tested code and introduce fewer
bugs.
 Develop programs more quickly: Your development time may be as much as
twice as fast versus writing the same program in C++. Why? You write fewer
lines of code and it is a simpler programming language than C++.
 Avoid platform dependencies with 100% Pure Java: You can keep your
program portable by avoiding the use of libraries written in other languages.
The 100% Pure Java Product Certification Program has a repository of
historical process manuals, white papers, brochures, and similar materials
online.


 Write once, run anywhere: Because 100% Pure Java programs are compiled
into machine-independent byte codes, they run consistently on any Java
platform.
 Distribute software more easily: You can upgrade applets easily from a
central server. Applets take advantage of the feature of allowing new classes to
be loaded “on the fly,” without recompiling the entire program.

Java JVM and Byte code

One design aim of Java is portability: programs written for the Java platform should run alike on any combination of hardware and operating system with adequate runtime support. This is achieved by compiling the Java language code to an intermediate representation called Java bytecode, instead of directly to architecture-specific machine code.

Java bytecode instructions are analogous to machine code, but they are intended to be executed by a virtual machine (VM) written specifically for the host hardware. End users commonly use a Java Runtime Environment (JRE) installed on their own machine for standalone Java applications, or in a web browser for Java applets. Standard libraries provide a generic way to access host-specific features such as graphics, threading, and networking.

The use of universal bytecode makes porting simple. However, the overhead of interpreting the bytecode into machine code instructions meant that interpreted programs almost always ran more slowly than native executables. Just-in-time (JIT) compilers that compile bytecode to machine code at runtime were therefore introduced at an early stage. Java itself is platform-independent and is adapted to the particular platform it is to run on by a Java virtual machine, which translates the Java bytecode into the platform's machine language.


1.4.1 BIO JAVA:

BioJava is a mature open-source project that provides a framework for processing biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats, and packages for manipulating sequences and 3D structures. It allows rapid bioinformatics application development in the Java programming language.

BioJava is written entirely in the Java programming language and will run on any platform for which a Java 1.5 run-time environment is available. Java 5 and Java 6 provide advanced language features, and we shall take advantage of these in the next major release, both to aid maintenance of the library and to make it even easier for novice Java developers to make use of the BioJava APIs.

At the core of BioJava is a symbolic alphabet API which represents sequences as a list of references to singleton symbol objects that are derived from an alphabet. Lists of symbols are stored, where possible, in a compressed form of up to four symbols per byte of memory.

In addition to the fundamental symbols of a given alphabet (A, C, G and T in the case
of DNA), all Bio Java alphabets implicitly contain extra symbol objects representing all
possible combinations of the fundamental symbols.

Symbols and Alphabets:

When biological sequence data first became available, it was necessary to find a convenient way to communicate it. A logical approach is to represent every monomer in a biological macromolecule using a single letter, usually the initial letter of the chemical entity being described, for instance `T' for thymidine residues in DNA. When this type of data was entered into computers, it was logical to use the same scheme.

A lot of computational biology software is based upon the normal string-handling APIs. While the concept of a sequence as a string of ASCII characters has served us well to date, there are several issues which can present problems to the programmer:

Validation:

It is possible to pass any string to a routine which is expecting a biological sequence. Any validation has to be performed on an ad hoc basis.


Ambiguity:

The meaning of every symbol is not necessarily clear. The `t' which denotes thymidine in DNA is the same `t' that denotes a threonine residue in a protein sequence.

Limited alphabet:

While there are obvious encodings for nucleic acid and protein sequence data as strings, the same method does not always work well for other kinds of data generated by biological sequence analysis software. BioJava takes a rather different approach to sequence data: rather than using a string of ASCII characters, a sequence is modelled as a list of Java objects implementing the Symbol interface.

This class, and the others defined here, are part of the Java package org.biojava.bio.symbol:

public interface Symbol {
    public String getName();
    public Annotation getAnnotation();
    public Alphabet getMatches();
}

All Symbol instances have a name property (for instance, thymidine). They may optionally have extra information associated with them (for instance, information about the chemical properties of a DNA base) stored in a standard BioJava data structure called an Annotation. Annotations are just sets of key-value data.

The last method, getMatches, is only important for ambiguous symbols, which are described at the end of this chapter. The set of Symbol objects which may be found in a particular type of sequence data is defined in an Alphabet. It is always possible to define custom Symbols and Alphabets, but BioJava supplies a set of predefined alphabets for representing biological molecules.

These are accessible through a central registry, called the AlphabetManager, and through convenience methods:

FiniteAlphabet dna = DNATools.getDNA();



Iterator dnaSymbols = dna.iterator();

while (dnaSymbols.hasNext()) {
    Symbol s = (Symbol) dnaSymbols.next();
    System.out.println(s.getName());
}


1.5 NOSEP ALGORITHM:


In the existing process, the NOSEP algorithm is used for pattern mining, and pattern occurrences are counted mainly under two approaches: 1) the no-condition; and 2) the one-off condition. A net-tree is used to construct a tree from the DNA data set under these two conditions. In this net-tree the DNA data are stored in whatever order they were uploaded to the database.
The no-condition is used during sequence pattern matching of data in the database. Under the no-condition, repeated data and noisy data are not excluded, so irrelevant data also undergo mining. The one-off condition is applied when the maximum and minimum lengths of the net-tree in frequent pattern matching are calculated.
This method mainly focuses on finding pairs (or sets) of motifs that co-occur in the data set within a short distance of each other. It reflects only a simple mismatch-based definition of noise and does not consider other, more complex motif models such as a substitution matrix or a compatibility matrix.

Algorithm 1: NOSEP: Mining All the Frequent Patterns Based on Pattern

Input: Sequence database SDB, minsup, gap = [a, b], len = [minlen, maxlen]
Output: The frequent patterns in meta

1: Scan sequence database SDB once, calculate the support of each event item, and store the frequent patterns with length 1 in a queue meta[1];
2: len ← 1;
3: C ← gen_candidate(meta[len]); // generate candidate set C
4: while C <> null do
5:   for each cand in C do
6:     candsup ← 0;
7:     supneeded ← minsup;
8:     for each sequence sk in SDB do

9:       sup ← NET(cand, supneeded);
10:      supneeded ← supneeded − sup;
11:      candsup ← candsup + sup;
12:      if candsup ≥ minsup then
13:        meta[len + 1].enqueue(cand);
14:        break;
15:      end if
16:    end for
17:  end for
18:  len ← len + 1;
19:  C ← gen_candidate(meta[len]);
20: end while
21: return meta[1] ∪ meta[2] ... ∪ meta[len];


CHAPTER – 2

LITERATURE SURVEY
OVERVIEW:
A literature review is a description of what has been published on a topic by accredited scholars and researchers. In writing the literature review, the purpose is to convey to the reader what knowledge and ideas have been established on the topic, and what their strengths and weaknesses are.

As part of the writing, a literature review must be defined by a guiding concept (e.g., your research objective, the problem or issue you are discussing, or your argumentative thesis). It is not just a descriptive list of the material available, or a set of summaries; it forms part of the introduction to an essay or research report.

2.1.“New techniques for mining frequent patterns in unordered trees”


Author: S. Zhang, Z. Du, and J. T. L. Wang,
We consider a new tree mining problem that aims to discover restrictedly embedded subtree patterns from a set of rooted, labeled, unordered trees. We study the properties of a canonical form of unordered trees, and develop new Apriori-based techniques to generate all candidate subtrees level by level through two efficient rightmost expansion operations: 1) pairwise joining and 2) leg attachment. Next, we show that restrictedly embedded subtree detection can be achieved by calculating the restricted edit distance between a candidate subtree and a data tree.

These techniques are then integrated into an efficient algorithm, named frequent restrictedly embedded subtree miner (FRESTM), to solve the tree mining problem at hand. The correctness of the FRESTM algorithm is proved and the time and space complexities of the algorithm are discussed. Experimental results on synthetic and real-world data demonstrate the effectiveness of the proposed approach.

2.2. “Strict pattern matching under non-overlapping condition”


Author: Y. Wu, C. Shen, H. Jiang, and X. Wu,
Pattern matching (or string matching) is an essential task in computer science, especially in sequential pattern mining, since pattern matching methods can be used to
calculate the support (or the number of occurrences) of a pattern and thereby determine whether the pattern is frequent or not. A state-of-the-art approach in sequential pattern mining with gap constraints (or flexible wildcards) uses the number of non-overlapping occurrences to represent the frequency of a pattern. Non-overlapping means that any two occurrences cannot use the same character of the sequence at the same position of the pattern.

In this paper, we examine strict pattern matching under the non-overlapping condition. We first show that the problem is in P. We then propose an algorithm, called NETLAP-Best, which uses the Nettree structure. NETLAP-Best transforms the pattern matching problem into a Nettree and iteratively finds the rightmost root-leaf path, pruning the useless nodes in the Nettree after removing the rightmost root-leaf path. We show that NETLAP-Best is a complete algorithm and analyse the time and space complexities of the algorithm. Extensive experimental results validate the correctness and efficiency of NETLAP-Best.

2.3. “Discovering patterns with weak-wildcard gaps,”


Author: C. D. Tan, F. Min, M. Wang, H.-R. Zhang, and Z.-H. Zhang,
Time series analysis is an essential data mining task in areas such as the stock market and the petroleum industry. One interesting problem in knowledge discovery is the detection of previously unknown frequent patterns. With the existing types of patterns, some similar subsequences are overlooked or dissimilar ones are matched. In this paper, we define patterns with weak-wildcard gaps to represent subsequences with noise and shifts, and design efficient algorithms to obtain frequent and strong patterns.

First, we convert a numeric time series into a sequence according to the data fluctuation. Second, we define the pattern mining with weak-wildcard gaps problem, where a weak wildcard matches any character in an alphabet subset. Third, we design an Apriori-like algorithm with an efficient pruning technique to obtain frequent and strong patterns. Experimental results show that our algorithm is effective and can discover frequent and strong patterns.

2.4. “Pattern based sequence classification,”


Author: C. Zhou, B. Cule, and B. Goethals,
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and
the support of the pattern. We use the discovered patterns to produce confident classification rules, and present two different ways of building a classifier. The first classifier is based on an enhanced version of an existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that the rule-based classifiers outperform the existing comparable classifiers in terms of accuracy and stability. Moreover, we test a number of pattern-feature-based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well and prove effective for the classification task.

2.5. “Efficient mining of closed repetitive gapped subsequences from a sequence


database,”
Author: B. Ding, D. Lo, J. Han, and S.-C. Khoo,
There is a vast wealth of sequence data available, for instance, customer purchase histories, program execution traces, and DNA and protein sequences. Evaluating this wealth of data to mine important knowledge is certainly a worthwhile goal. In this paper, as a step towards examining patterns in sequences, we introduce the problem of mining closed repetitive gapped subsequences and propose efficient solutions.

Given a database of sequences where every sequence is an ordered list of events, the pattern we would like to mine is called a repetitive gapped subsequence, which is a subsequence of some sequences in the database.

We introduce the idea of repetitive support to measure how frequently a pattern repeats in the database. Different from the sequential pattern mining problem, repetitive support captures not only the repetitions of a pattern in different sequences but also the repetitions within a sequence. Given a user-specified support threshold min_sup, we study finding the set of all patterns with repetitive support no less than min_sup. To acquire a compact yet complete result set and improve efficiency, we also study finding closed patterns.

2.6. “A Nettree for pattern matching with flexible wildcard constraints,”


Author: Y. Wu, X. Wu, F. Min, and Y. Li,
In this work, a new nonlinear structure called a Nettree is proposed. A Nettree differs from a tree in that a node may have more than one parent. An algorithm, named
the Nettree for pattern Matching with flexible wildcard Constraints (NAMEIC), based on the Nettree, is designed to solve pattern matching with flexible wildcard constraints. The problem is exponential with regard to the pattern length m.

We prove the correctness of the algorithm and illustrate how it works through an example. NAMEIC is W*m times faster than an existing approach, because the results can be given after creating the Nettree in one pass, where W is the maximal gap flexibility. Experiments validate the correctness and efficiency of NAMEIC.

2.7. “String matching with variable length gaps,”

Author: P. Bille, I. L. Gørtz, H. W. Vildhøj, and D. K. Wind,

We consider string matching with variable length gaps. Given a string T and a pattern
P consisting of strings separated by variable length gaps (arbitrary strings of length in a
specified range), the problem is to find all ending positions of substrings in T that match P.
This problem is a basic primitive in computational biology applications. Let m and n be the
lengths of P and T, respectively, and let k be the number of strings in P.

We present a new algorithm achieving time O((n+m) log k+α) and space O(m+A),
where A is the sum of the lower bounds of the lengths of the gaps in P and α is the total
number of occurrences of the strings in P within T. Compared to the previous results this
bound essentially achieves the best known time and space complexities simultaneously.
Consequently, our algorithm obtains the best known bounds for almost all combinations of m,
n, k, A, and α. Our algorithm is surprisingly simple and straightforward to implement.
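As a simple illustration of the matching semantics (not the algorithm from the paper), a pattern with bounded variable-length gaps can be written as a regular expression with bounded wildcards; a short Python sketch:

import re

# The pattern "AT, then a gap of 1 to 3 arbitrary characters, then GT" written
# as a bounded regular expression. This is far less efficient than the
# specialized algorithm in the paper, but it shows what the occurrences look
# like (finditer reports non-overlapping matches only).
text = "ATCCGTAAATGGT"
for m in re.finditer(r"AT.{1,3}GT", text):
    print(m.start(), m.group())     # 0 ATCCGT, then 8 ATGGT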

CHAPTER – 3
METHODOLOGY

3.1 PROBLEM DEFINITION

This chapter describes a novel motif detector algorithm called MEED that uses a
concurrent traversal of two suffix trees to efficiently explore the space of all motifs. It also
presents an algorithm that uses MEED as a building block and can mine combinations of
simple approximate motifs under relaxed constraints.

The approach we take in MEED explores the space of all possible models. In order to
perform this exploration in an efficient way, we first construct two suffix trees: a suffix tree
on the actual data set that contains counts in every node (called the data suffix tree), and a
suffix tree on the set of all possible model strings (called the model suffix tree).
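For intuition, the following sketch (a simple suffix trie rather than a compressed suffix tree, and not the project's actual implementation) shows the data-side structure, where each node stores how many suffixes pass through it:

# A suffix *trie* with per-node counts, standing in for the data suffix tree.
# A node's count is the number of suffixes passing through it, which equals
# the number of occurrences of the substring spelled on the path from the root.
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0

def build_suffix_trie(text):
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
    return root

def support(root, s):
    # Number of occurrences of substring s in the indexed text.
    node = root
    for ch in s:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.count

trie = build_suffix_trie("ATATGTA$")
print(support(trie, "TA"))   # 2, since "TA" occurs twice in ATATGTA$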

3.2 MEED ALGORITHM


1. The motif detector algorithm MEED is superior to motif finding algorithms used
in computational biology (more accurate than Weeder, significantly faster than YMF).
2. MEED can scale to handle motif mining tasks that are much larger than attempted before.
3. Gap Restraints are used to detect the exact location of the data and the incidence of
null data.
4. The number of mined patterns and the mining speed are comparatively high, and the
results are accurate.
5. Non-overlapping extraction performs the segregation of similar patterns with different
sequential orders.

Algorithm 2: MEED Algorithm (N-GAP)

Input: Sequence S, pattern P, gap = [a, b], len = [minlen, maxlen], and minsup
Output: sup(P, S)

1: Create a nettree of P in S;
2: Prune nodes without child nodes (per Lemma 3);
3: for each node n_i^1 in the first level of the nettree do
4:     node[1] ← n_i^1; // node is used to store one occurrence
5:     for j = 1 to nettree.level − 1 step 1 do
6:         node[j+1] ← the leftmost child of node[j] that meets the length constraints;
7:     end for
8:     sup(P, S) ← sup(P, S) + 1;
9:     if sup(P, S) > minsup return sup(P, S);
10:    Prune nodes without child nodes (per Lemma 3);
11: end for; return sup(P, S);
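To make the idea concrete, the following sketch (in Python, using simple character patterns and greedy leftmost matching rather than the nettree structure above) counts non-overlapping occurrences under a gap constraint [a, b] and a length constraint [minlen, maxlen]:

# Greedily find one occurrence of P in S at a time. Adjacent pattern characters
# must be separated by a gap of a..b positions, the whole occurrence must span
# minlen..maxlen positions, and positions already consumed by an earlier
# occurrence are never reused (the non-overlapping condition).
def find_one(S, P, used, a, b, minlen, maxlen):
    for first in range(len(S)):
        if S[first] != P[0] or first in used:
            continue
        positions, ok = [first], True
        for ch in P[1:]:
            prev = positions[-1]
            candidates = range(prev + 1 + a, min(prev + 1 + b, len(S) - 1) + 1)
            nxt = next((j for j in candidates if S[j] == ch and j not in used), None)
            if nxt is None:
                ok = False
                break
            positions.append(nxt)
        if ok and minlen <= positions[-1] - positions[0] + 1 <= maxlen:
            return positions
    return None

def nonoverlapping_support(S, P, a, b, minlen, maxlen):
    used, count = set(), 0
    occ = find_one(S, P, used, a, b, minlen, maxlen)
    while occ is not None:
        used.update(occ)
        count += 1
        occ = find_one(S, P, used, a, b, minlen, maxlen)
    return count

print(nonoverlapping_support("ATCGATAG", "AG", 0, 2, 2, 4))   # prints 2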

3.3 MODULE DESCRIPTION


• Dataset Processing
• MEED (Maximum and Efficient Detector)
• Model Suffix Tree

Dataset Processing
In this module the datasets are being loaded from system to the application. Mainly
here we prefer to upload the DNA data to the system.DNA data are basically large in real

32
EFFICIENT AND ACCURATE DISCOVERY OF PATTERNS IN SEQUENCE DATA SETS

time, so finding the patterns among this data set are highly expensive task in terms of system
speed, accuracy and size.
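A minimal loading sketch (the file name and plain-text format are assumptions, not the project's actual input handling):

# Load a DNA data set from a plain-text file, keep only valid bases, and
# report its size before any mining is attempted.
def load_dna(path):
    with open(path) as f:
        raw = f.read()
    return "".join(ch for ch in raw.upper() if ch in "ACGT")

dna = load_dna("dna_dataset.txt")   # hypothetical input file
print("Loaded", len(dna), "bases")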

MEED (Maximum and Efficient Detector)


Sequence-based and Accurate Motif Detector: in this module we need to enter the DNA
value and the character length, and also choose occurrence-based or sequence-based pattern
discovery based on length. Finally, we get the filtered data in the field. The other two main processes are:

Gap Restraints:
Gap Restraints are used to detect the exact location of the data and the incidence of
null data. The number of mined patterns and the mining speed are comparatively high and
accurate. Using these restraints produces accurate data-comparison values.

Non- overlapping:
This mainly eradicates overlapping of data when the data are loaded
into the data set. It also performs the segregation of similar patterns with different
sequential orders.

Model Suffix Tree


The next step after constructing the data suffix tree is constructing the model suffix
tree. Since the second suffix tree (built on all possible model strings) can be extremely large,
MEED does not actually construct this suffix tree. Rather, it algorithmically generates portions
of this tree as and when needed.
MEED then explores the model space by traversing this (conceptual) model suffix
tree. Using the suffix tree on the data set, MEED computes support at various nodes in the
model space and prunes away large portions of the model space that are assured not to
produce any results under the model. This careful pruning ensures that MEED does not
waste any time exploring models that do not have enough support. The MEED algorithm
simply stops when it has finished traversing the model suffix tree and outputs the model
strings that had sufficient support.
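A minimal sketch of this pruning idea (reusing build_suffix_trie and support from the sketch in Section 3.1; exact-match models only, with no mismatches, unlike the full MEED model space):

# Enumerate candidate model strings of length L over the DNA alphabet, pruning
# any prefix whose support in the data already falls below minsup. Support can
# only shrink as a string grows, so pruning a prefix safely discards all of
# its extensions.
def mine_models(data_trie, L, minsup, prefix=""):
    if prefix and support(data_trie, prefix) < minsup:
        return []                       # prune this whole branch
    if len(prefix) == L:
        return [prefix]                 # a model with sufficient support
    results = []
    for ch in "ACGT":
        results += mine_models(data_trie, L, minsup, prefix + ch)
    return results

trie = build_suffix_trie("ATATGTA$")
print(mine_models(trie, 2, 2))          # ['AT', 'TA'], each occurring twice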
For instance, the suffix tree of ATATGTA$:
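(The tree indexes the eight suffixes ATATGTA$, TATGTA$, ATGTA$, TGTA$, GTA$, TA$, A$, and $; suffixes sharing a prefix, such as ATATGTA$ and ATGTA$, share an initial path from the root, and each node's count is the number of suffixes passing through it, i.e., the number of occurrences of the corresponding substring.)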


CHAPTER – 4

IMPLEMENTATION:
Implementation is the phase of the project when the theoretical design is turned
into a working system. Thus it can be considered the most critical stage in achieving a
successful new system and in giving the user confidence that the new system will work and
be effective. The implementation stage includes careful planning, investigation of the
existing system and its constraints on implementation, designing of methods to achieve
changeover, and evaluation of changeover methods.

• Input design is the process of converting a user-oriented description of the input into a
computer-based system. This design is significant to avoid errors in the data input
process and to show the correct direction to the management for getting correct
information from the computerized system.
• A quality output is one which meets the requirements of the end user and presents
the information clearly. It is the most significant and direct source of
information to the user. Effective and intelligent output design improves the
system's relationship with the user and helps user decision-making.

• PROCESSING

Load Dataset: The Processing menu is the first main menu in this project. This menu loads
the text file, extracts and updates the data, and then splits the data from the extracted file.

• FLAME

Data Suffix Tree: This sub menu fetches the data from the loaded dataset and splits it into
the tree format. Show Suffix Tree is used to view the split tree format.
Flexible and Accurate Motif Detector: In this module we need to enter the DNA value and
the character length, and also choose occurrence-based or sequence-based pattern discovery
based on length. Finally, we get the filtered data in the field.

• VISUALIZATION

The Visualization menu is used to view the final dataset, the generated report, and the
FLAME chart report.
CHAPTER – 5
RESULTS AND DISCUSSION
This chapter presents the analysis and experimental results. The experimental
analysis shows the comparison of results based on the proposed and existing algorithms.

Table 5.1 Comparative Analysis of Algorithms

Algorithm     Accuracy Range (out of 100%)
Apriori       75%
RARM          68%
FP-Growth     57%
MEED          92%

The Table 5.1 shows the comparative analysis results based on accuracy. Here the
proposed algorithm MEED is compared with the existing algorithms, namely the Apriori,
RARM, and FP-Growth algorithms. When the accuracy range is assumed as 100%, the value
obtained for the MEED algorithm is 92%, which is higher than the accuracy range of the other
algorithms.

Fig 5.1 Comparison of mining patterns of various algorithms (values presented in percentage)

Fig 5.1 shows the accuracy range of various algorithms, namely the Apriori, RARM,
FP-Growth, and MEED algorithms.

Table 5.2 Comparison of the Mining Speed on Datasets

Algorithm     MS1 (min=700)   MS1 (min=800)   MS1 (min=900)   MS2 (min=700)   MS2 (min=800)   MS2 (min=900)
Apriori       29559           21042           18036           370250          251000          198550
RARM          3127            1638            1224            8400            3900            2610
FP-Growth     3456            1102            8060            3254            5037            1054
MEED          56078           49856           32679           420561          335000          219667

The Table 5.2 shows the comparative analysis of the mining speed on the datasets.
Here the proposed algorithm MEED is compared with the existing algorithms, namely the
Apriori, RARM, and FP-Growth algorithms. The mining speed of MEED is comparatively
higher than that of the other algorithms.

Table 5.3 Comparison of the number of mined patterns under different gaps in DNA sequences

Algorithm     gap=(0,1)   gap=(0,2)   gap=(0,3)   gap=(0,4)   gap=(0,5)   gap=(0,6)
Apriori       16          29          41          82          138         178
RARM          16          34          82          275         293         1010
FP-Growth     16          34          82          280         560         1117
MEED          20          34          87          299         650         1901

The Table 5.3 shows the comparative analysis of the number of patterns mined under
different gap restraints in DNA sequence datasets. Here the proposed algorithm MEED is
compared with the existing algorithms, namely the Apriori, RARM, and FP-Growth
algorithms. The number of patterns mined by MEED under each gap restraint is
comparatively higher than that of the other algorithms.

Table 5.4 Comparison of the number of mined patterns under different lengths in DNA
sequences

Algorithm     maxlen=5   maxlen=11   maxlen=26   maxlen=39   maxlen=41   maxlen=57
Apriori       98         101         124         147         154         169
RARM          102        115         127         152         165         185
FP-Growth     205        229         254         301         379         408
MEED          284        295         315         422         465         610

The Table 5.4 shows the comparative analysis of the number of patterns mined under
different lengths in DNA sequence datasets. Here the proposed algorithm MEED is
compared with the existing algorithms, namely the Apriori, RARM, and FP-Growth
algorithms. The number of patterns mined by MEED under each maximum length is
comparatively higher than that of the other algorithms.

CHAPTER – 6

6.1 CONCLUSION
In this paper, we presented a powerful new model, (L, M, s, k), for motif mining in
sequence databases. The (L, M, s, k) model subsumes several existing models and provides
additional flexibility that makes it applicable in a wider variety of data mining applications.
We also presented MEED, a flexible and accurate algorithm that can find (L, M, s, k) motifs.
Through a series of experiments on real and synthetic data sets, we demonstrated that MEED
is a versatile algorithm that can be used in several real motif mining tasks.
We also showed that MEED outperforms existing time series mining algorithms (Random
Projections) by more than an order of magnitude. MEED is also superior to motif finding
algorithms used in computational biology (more accurate than Weeder, significantly faster
than YMF). We also presented experiments which show that MEED can scale to handle
motif mining tasks which are much larger than attempted before. Finally, we presented and
evaluated a flexible method for extracting combinations of simple motifs under relaxed constraints.

6.2 FUTURE WORK


When applying MEED to a practical problem, there are opportunities for optimization
that one might exploit. Below we describe a few techniques that can be used to great benefit.

Given the way in which MEED computes the support for various candidate models, the
algorithm can easily combine the computation for many different lengths if the number of
mismatches is common across all lengths.
Here it builds the suffix tree on all strings of length Lmax. At any node, if the length of
the model happens to be in the range of lengths considered and the support is greater than the
minimum support, we output that model and continue the traversal. When considering
only one length at a time, a valid model would only be found at a leaf node of the suffix tree,
since the tree consists of strings only of length L. However, by allowing lengths in the range
Lmin to Lmax, we can output valid models at depths starting from Lmin, as illustrated by the
sketch below.
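A minimal sketch of this multi-length traversal (reusing build_suffix_trie and support from the sketch in Section 3.1; exact-match models only):

# Combine the mining of all lengths in [Lmin, Lmax] into one traversal by
# emitting any prefix whose length already lies in the range, instead of
# emitting only at depth Lmax. Pruning by support is unchanged.
def mine_models_range(data_trie, Lmin, Lmax, minsup, prefix=""):
    if prefix and support(data_trie, prefix) < minsup:
        return []                                    # prune the whole branch
    results = [prefix] if Lmin <= len(prefix) <= Lmax else []
    if len(prefix) < Lmax:
        for ch in "ACGT":
            results += mine_models_range(data_trie, Lmin, Lmax, minsup, prefix + ch)
    return results

trie = build_suffix_trie("ATATGTA$")
print(mine_models_range(trie, 1, 2, 2))              # ['A', 'AT', 'T', 'TA']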

6.3 BIBLIOGRAPHY

[1] C. C. Aggarwal and J. Han, Frequent Pattern Mining. Cham, Switzerland: Springer, 2014.
[2] S. Ventura and J. M. Luna, Pattern Mining With Evolutionary Algorithms. Cham,
Switzerland: Springer, 2016.
[3] C. Li, Q. Yang, J. Wang, and M. Li, “Efficient mining of gap-constrained subsequences
and its various applications,” ACM Trans. Knowl. Disc. Data, vol. 6, no. 1, p. 2, 2012.
[4] B. Le, M.-T. Tran, and B. Vo, “Mining frequent closed inter-sequence patterns efficiently
using dynamic bit vectors,” Appl. Intell., vol. 43, no. 1, pp. 74–84, 2015.
[5] S. Zhang, Z. Du, and J. T. L. Wang, “New techniques for mining frequent patterns in
unordered trees,” IEEE Trans. Cybern., vol. 45, no. 6, pp. 1113–1125, Jun. 2015.
[6] L. Zhang et al., “Occupancy-based frequent pattern mining,” ACM Trans. Knowl. Disc.
Data, vol. 10, no. 2, p. 14, 2015.
[7] F. Min, Y. Wu, and X. Wu, “The Apriori property of sequence pattern mining with
wildcard gaps,” Int. J. Funct. Informat. Personalised Med., vol. 4, no. 1, pp. 15–31, 2012.
[8] C. D. Tan, F. Min, M. Wang, H.-R. Zhang, and Z.-H. Zhang, “Discovering patterns with
weak-wildcard gaps,” IEEE Access, vol. 4, pp. 4922–4932, 2016.
[9] F. Rasheed and R. Alhajj, “A framework for periodic outlier pattern detection in
time-series sequences,” IEEE Trans. Cybern., vol. 44, no. 5, pp. 569–582, May 2014.
[10] H. Jiang, J. Zhang, H. Ma, N. Nazar, and Z. Ren, “Mining authorship characteristics in
bug repositories,” Sci. China Inf. Sci., vol. 60, no. 1, pp. 1–16, 2017.
[11] E. Egho, D. Gay, M. Boullé, N. Voisine, and F. Clérot, “A user parameter-free approach
for mining robust sequential classification rules,” Knowl. Inf. Syst., vol. 52, no. 1, pp. 53–81,
2017.
[12] C. Zhou, B. Cule, and B. Goethals, “Pattern based sequence classification,” IEEE Trans.
Knowl. Data Eng., vol. 28, no. 5, pp. 1285–1298, May 2016.

[13] X. Wu, X. Zhu, Y. He, and A. N. Arslan, “PMBC: Pattern mining from biological
sequences with wildcard constraints,” Comput. Biol. Med., vol. 43, no. 5, pp. 481–492, 2013.

[14] J. Ge, Y. Xia, J. Wang, C. H. Nadungodage, and S. Prabhakar, “Sequential pattern


mining in databases with temporal uncertainty,” Knowl. Inf. Syst., vol. 51, no. 3, pp. 821–
850, 2017.

[15] H. Yang et al., “Mining top-k distinguishing sequential patterns with gap constraint,” J.
Softw., vol. 26, no. 11, pp. 2994–3009, 2015.

[16] Y. Wu, L. Wang, J. Ren, W. Ding, and X. Wu, “Mining sequential patterns with periodic
wildcard gaps,” Appl. Intell., vol. 41, no. 1, pp. 99–116, 2014.
[17] M. Zhang, B. Kao, D. W. Cheung, and K. Y. Yip, “Mining periodic patterns with gap
requirement from sequences,” ACM Trans. Knowl. Disc. Data, vol. 1, no. 2, p. 7, 2007.

[18] H.-F. Wang et al., “Efficient mining of distinguishing sequential patterns without a
predefined gap constraint,” Chin. J. Comput., vol. 39, no. 10, pp. 1979–1991, 2016.

[19] P. Bille, I. L. Gørtz, H. W. Vildhøj, and D. K. Wind, “String matching with variable
length gaps,” Theor. Comput. Sci., vol. 443, pp. 25–34, Jul. 2012.
[20] X. Wu, J.-P. Qiang, and F. Xie, “Pattern matching with flexible wildcards,” J. Comput.
Sci. Technol., vol. 29, no. 5, pp. 740–750, 2014.


CHAPTER – 7

APPENDIX
A) SYSTEM DIAGRAM:

A system architecture is the conceptual design that describes the structure or
behavior of a system. An architecture description is a formal description of a system,
organized in a way that supports reasoning about the structural properties of the
system. It defines the system components or building blocks and provides a plan from
which products can be procured and systems developed that will work together to
implement the overall system.

System architecture can also be defined as the fundamental organization of a system,
embodied in its components, their relationships to each other and to the environment, and
the principles governing its design and evolution; as the composite of the design
architectures for products and their life-cycle processes; and as a representation of a system
in which there is a mapping of functionality onto hardware and software components, a
mapping of the software architecture onto the hardware architecture, and human interaction
with these components.

It can further be described as an allocated arrangement of physical elements which
provides the design solution for a consumer product or life-cycle process intended to satisfy
the requirements of the functional architecture and the requirements baseline. Architecture
comprises the most important, pervasive, top-level, strategic inventions, decisions, and their
associated rationales about the overall structure (i.e., essential elements and their
relationships) and associated characteristics and behaviour.

Fig 7.1 System Architecture (Datasets → MEED → Model Suffix Tree → FLAME)

Fig 7.2 System Architecture (data flow from the sequence data sets through data mining,
MEED, and the suffix tree to the matched pattern output)

B) SCREENSHOTS

HOME PAGE:

Fig 7.3 Home page


Fig 7.4 Login page

ADMIN PAGE:

Fig 7.5 Admin page

UPLOAD PAGE:


Fig 7.6 Upload of Data set page

COMPARISON OF DNA DATA:

Fig 7.7 Comparison and Graph view of Data set page


Fig 7.8 Comparison and Graph view of Data set page

NON-OVERLAPPING AND GAP RESTRAINTS:

Fig 7.9 Non Overlapping and Gap Constraints page

SUFFIX TREE PAGE:


Fig 7.10 Suffix Tree page

REPORT PAGE:

Fig 7.11 Report page

