
Research Article

A Robust Gene Finding Classifier for Variable Length DNA Sequences
P Kiran Sree1, SSSN Usha Devi N2
Abstract
A deep learning algorithm has two phases: a learning (training) phase and a testing phase. In the
training phase, the algorithm is trained on the available patterns. After training, the model
performs the prediction task in the testing phase. Depending on the nature of the training, deep
learning falls into two broad categories. The first is supervised learning, where the algorithm is
trained with a set of examples; it analyzes the input data and produces an inferred function that
can predict unseen or untrained instances. The second is unsupervised learning, where the classes
are inferred from the input patterns using a similarity measure, and the user can define the
number of classes in the training phase. We use a versatile deep learning algorithm to find genes
in DNA sequences of different lengths. The results demonstrate the efficiency of the proposed
algorithm when compared with existing methods.

Keywords: Algorithm, Deep learning, DNA sequencing


Introduction
Data Collection and Methods
The datasets are extracted from the Irvine Primate Gene Finding junction database
(http://archive.ics.uci.edu/ml/machine-learning-database).1 The data set consists of 3190 DNA
sequences, each of length 60 bp. Of these 3190 sequences, about 25% belong to the donor site
category, about 25% belong to the acceptor site category, and about 50% belong to neither.

Among the 767 donor sites, we used 191 sequences for constructing the DL-SSMACA tree and 192 for
checking the accuracy of the tree. The remaining 384 sequences are used for testing.

Among the 768 acceptor sites, we used 192 sequences for constructing the DL-SSMACA tree and 192 for
checking the accuracy of the tree. The remaining 384 sequences are used for testing.

Among the 1655 neutral sites (neither acceptor nor donor), we used 413 sequences for constructing
the DL-SSMACA tree and 414 for checking the accuracy of the tree. The remaining 828 sequences are
used for testing. The window length is fixed at 60.
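
The partition described above can be tallied with a short Python sketch. It is only a bookkeeping check on the counts quoted in the text; the names `SPLITS` and `summarize` are illustrative assumptions and are not part of DL-SSMACA.

```python
# Counts per class as quoted in the text:
# (tree construction, accuracy check, testing).
SPLITS = {
    "donor": (191, 192, 384),
    "acceptor": (192, 192, 384),
    "neither": (413, 414, 828),
}


def summarize(splits):
    """Return per-class totals, the grand total, and each class's share."""
    totals = {name: sum(parts) for name, parts in splits.items()}
    grand_total = sum(totals.values())
    shares = {name: total / grand_total for name, total in totals.items()}
    return totals, grand_total, shares


totals, grand_total, shares = summarize(SPLITS)
for name, (build, check, test) in SPLITS.items():
    print(f"{name:9s} total={totals[name]:4d} share={shares[name]:.1%} "
          f"build/check/test={build}/{check}/{test}")
print("all classes:", grand_total)
```

The per-class totals add up to 767, 768 and 1655, which together give the 3190 sequences of the data set. Within each class, roughly a quarter of the sequences build the tree, a quarter check it, and half are held out for testing.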
1 Professor, Department of Computer Science and Engineering, Shri Vishnu Engineering College for
Women (Autonomous), Bhimavaram, Andhra Pradesh, India.
2 Assistant Professor, Department of Computer Science and Engineering, UCEK, Jawaharlal Nehru
Technological University, Kakinada, Andhra Pradesh, India.
Correspondence: Mr. P Kiran Sree, Shri Vishnu Engineering College for Women (Autonomous),
Bhimavaram, Andhra Pradesh, India.
E-mail Id: drkiransree@gmail.com
Orcid Id: http://orcid.org/0000-0001-8601-4304
How to cite this article: Sree PK, Devi SSSNUN. A Robust Gene Finding Classifier for Variable
Length DNA Sequences. J Adv Res Comp Tech Soft Appl 2017; 4 (1&2): 20-23.

© ADR Journals 2017. All Rights Reserved.


J. Adv. Res. Comp. Tech. Soft. Appl. 2017; 4(1&2) Sree PK et al.

DL-SSMACA for Gene Finding Junction Prediction

The main aim of the learning algorithm is to encode the DNA in multiples of three and produce a
DL-SSMACA with n attractors, k cells and m classes. Since the input is of fixed length (60 bp), we
have fixed n at 4, k at 3 and m at 3. At the end of the execution of the learning algorithm, we
have a set of basins that represent the classes.

Learning Algorithm

Input

DNA sequence

Output

DL-SSMACA tree with n attractor basins.

• Step 1: Read the input DNA sequence and process the sequence in multiples of three (a
  three-neighborhood CA is used)
• Step 2: Encode the input in multiples of three
• Step 3: Choose a high-fitness rule and apply it to the input to construct an n-attractor,
  k-cell, 3-class DL-SSMACA
• Step 4: Store all the basins constructed; repeat steps 1, 2 and 3 until n attractors are stored
• Step 5: Stop

Testing Algorithm

The main aim of the testing algorithm is to distribute the corresponding input into the generated
basins. During this process, the fitness and diversity of the intermediate node are calculated for
efficient development of the desired tree. Once the DNA sequence identifies the basin uniquely, we
can report the class associated with the basin.

Input

DNA sequence

Output

DNA class (Acceptor/Donor/Neither)

• Step 1: Read the input DNA sequence and process the sequence in multiples of three
• Step 2: Encode the input in multiples of three
• Step 3: Distribute the input into the generated DL-SSMACA basins until the entire sequence falls
  into an attractor of the tree
• Step 4: Report the basin and the corresponding class
• Step 5: Stop

Output and Experimental Results of DL-SSMACA

This section shows the output of the proposed classifier. DL-SSMACA takes a DNA sequence as input
and reports the Gene Finding sites in both strands of the sequence. Input 1, shown below, is
processed by the DL-SSMACA classifier, which predicts two donor sites, one in the forward strand
and one in the reverse strand. Input 2 is processed by DL-SSMACA, and the classifier identifies an
acceptor site in the forward strand. Input 3 is processed by DL-SSMACA, and the classifier
identifies the sequence as belonging to neither the donor nor the acceptor class.

Input 1

CCCAAGGCCAACCGCGAGAAGATGACCCAGGTGAGTGGCCCGCTACCTCTTCTGGTGGCC

Output

# Sequence Sequence_human_Kiran_Gene Finding_123jntuh = 60 bps


Table 1. Sequence_human_Kiran_Gene Finding_123jntuh, Human Gene Finding Prediction

Donor Site Prediction
START  END  SCORE  EXON     INTRON
24     38   0.99   GACCCAG  GT GAGTGG

Donor Site Prediction in Reverse Strand
START  END  SCORE  EXON     INTRON
53     39   0.72   AGAAGAG  GT AGCGGG

Acceptor Site Prediction
Nil

Acceptor Site Prediction in Reverse Strand
Nil
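
The coordinates in Table 1 can be cross-checked against Input 1 with a short Python sketch. It assumes that START and END are 1-based, inclusive positions on the forward strand, and that the reverse-strand row (53 to 39) refers to the reverse complement of forward positions 39 to 53. These conventions are not stated in the text; they are assumptions that happen to reproduce the table. The helper names are illustrative only.

```python
INPUT_1 = (
    "CCCAAGGCCAACCGCGAGAAGATGACCCAGGTGAGTGG"
    "CCCGCTACCTCTTCTGGTGGCC"
)

# Watson-Crick base pairing used to build the reverse strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def window(seq, start, end):
    """Return seq[start..end] with 1-based, inclusive coordinates (assumed)."""
    return seq[start - 1:end]


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def split_donor(site, exon_len=7):
    """Split a donor window into (exon, GT, rest of intron).

    The 7/2/6 layout is read off Table 1; it is not a documented format.
    """
    return site[:exon_len], site[exon_len:exon_len + 2], site[exon_len + 2:]


forward_site = window(INPUT_1, 24, 38)
reverse_site = reverse_complement(window(INPUT_1, 39, 53))

print(split_donor(forward_site))  # ('GACCCAG', 'GT', 'GAGTGG')
print(split_donor(reverse_site))  # ('AGAAGAG', 'GT', 'AGCGGG')
```

Under these assumptions, both rows of Table 1 are recovered exactly from the 60 bp input, which suggests that the reverse-strand coordinates are reported in forward-strand numbering.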

Input 2

CTCCCTGATGCCCTCAGAATCTCCCCACAGGCCGCCTGATCTTTGACAACTTGAAGAAAT

Output

# Sequence Sequence_human_Kiran_Gene Finding_83jntuh = 60 bps

Table 2. Sequence_human_Kiran_Gene Finding_83jntuh, Human Gene Finding Prediction

Donor Site Prediction
Nil

Donor Site Prediction in Reverse Strand
Nil

Acceptor Site Prediction
START  END  SCORE  INTRON EXON
10     50   0.95   GCCCTCAGAATCTCCCCACAGGCCGCCTGATCTTTGACAAC

Acceptor Site Prediction in Reverse Strand
Nil
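
Table 2 reports the acceptor prediction as one 41-base window rather than as separate intron and exon columns. Since an intron canonically ends with the dinucleotide AG, the following Python sketch lists the AG positions inside the reported window. It assumes 1-based, inclusive coordinates; which AG the classifier treats as the boundary is not stated in the text, so the output is a list of candidates only. The function name `find_dinucleotide` is illustrative.

```python
INPUT_2 = (
    "CTCCCTGATGCCCTCAGAATCTCCCCACAGGCCGCCTGAT"
    "CTTTGACAACTTGAAGAAAT"
)


def find_dinucleotide(seq, motif, start, end):
    """Return 1-based start positions of motif lying fully inside [start, end]."""
    hits = []
    pos = seq.find(motif, start - 1)
    while pos != -1 and pos + len(motif) <= end:
        hits.append(pos + 1)
        pos = seq.find(motif, pos + 1)
    return hits


# Window reported in Table 2 (START 10, END 50).
reported_window = INPUT_2[10 - 1:50]
candidates = find_dinucleotide(INPUT_2, "AG", 10, 50)

print(reported_window)
print(candidates)  # [16, 29]
```

The AG at positions 29 and 30 sits at the centre of the 60 bp input, the same place where the GT of the forward donor site in Table 1 begins one base later (positions 31 and 32).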

Input 3

CCAGCAGGCTGAGGGCCAGAGCGGCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGA

Output

# Sequence Sequence_human_Kiran_Gene Finding_89jntuh = 60 bps

Table 3. Sequence_human_Kiran_Gene Finding_89jntuh, Human Gene Finding Prediction

Donor Site Prediction
Nil

Donor Site Prediction in Reverse Strand
Nil

Acceptor Site Prediction
Nil

Acceptor Site Prediction in Reverse Strand
Nil

Conclusion

We have developed an unsupervised classifier with deep learning for gene prediction. The proposed
algorithm classifies DNA sequences with an accuracy of 90.2%. The classifier can also process
sequences of different lengths without padding bits.

References

1. Sree PK, Babu IR, Devi NU. An extensive report on cellular automata based artificial immune
   system for strengthening automated protein prediction. Advances in Biomedical Engineering
   Research 2013; 1(3): 45-51.
2. Sree PK, Babu IR, Devi NU. A Novel Protein Coding Region Identifying Tool using Cellular
   Automata Classifier with Trust-Region Method and Parallel Scan Algorithm (NPCRITCACA).
   International Journal of Biotechnology and Biochemistry 2008; 4(2): 177-89.
3. Sree PK, Babu IR, Devi NU. HMACA: Towards proposing cellular automata based tool for protein
   coding, promoter region identification and protein structure prediction. International Journal
   of Research in Computer Applications and Information Technology 2013; 1(1): 26-31.
4. Sree PK, Babu IR, Devi NU. PRMACA: A promoter region identification using Multiple Attractor
   Cellular Automata (MACA) in the proceedings CT and critical infrastructure. Advances in
   Intelligent Systems and Computing 2014; 248: 393-9.
5. Sree PK, Babu IR, Devi NU. AIS-PSMACA: Towards proposing an artificial immune system for
   strengthening PSMACA: an automated protein structure prediction using multiple attractor
   cellular automata. Global Journal of Computer Science and Technology 2013; 13(4).
6. Sree PK, Babu IR, Devi NU. Multiple Attractor Cellular Automata (MACA) for addressing major
   problems in bioinformatics. Review of Bioinformatics and Biometrics 2013; 2(3): 70-6.
7. Sree PK, Babu IR, Devi NU. Protein coding region identification. International Conference on
   Proteomics Bioinformatics, Embassy Suites Las Vegas, USA. Special Issue of Journal of
   Proteomics and Bioinformatics 2012; 5(6): 123.
8. Sree PK, Babu IR, Devi NU. Hybrid attractor cellular automata for addressing major problems in
   bioinformatics in research and reviews. Journal of Engineering and Technology 2013; 2(4): 42-8.
9. Sree PK, Babu IR, Devi NU. AIS-PRMACA: artificial immune system based multiple attractor
   cellular automata for strengthening PRMACA, Promoter Region Identification. The Standard
   International Journals 2013; 1(4): 124-7.
10. Sree PK, Babu IR, Devi NU. A novel AIS-MACAX classifier in bioinformatics. World Congress on
    Biotechnology, Valencia Conference Centre, Valencia, Spain. 2014.
