
Automated Classification of mRNA and ncRNA
Edward Bujak

Franklin and Marshall College


Bioinformatics Research Fellowships
June-July, 2010
Dr. Jing Hu’s computer science lab
Stager Hall
Goals
 Machine learning to reliably differentiate
coding RNA (mRNA) from non-coding
RNA (ncRNA)
 How – by recognizing complex patterns
and making intelligent decisions based on
data
 Why – time, cost, huge search space
Topics
 Machine Learning
 Classification Model
 Evaluation of Classification performance
Why are we doing this?
ATCGCTCGCTAGATCGATCGATCGATCGCGCTCGCTA
TGTTTCGCTATCGCTAGCTACGTACGCTAGCTACGTA
CGCTATATCTCGCTCTAGCTCTAGCTAGCTATCTATA
TGGTTCGATAGCTAGCTAGCTAGCTATCGGCTCGATC
TGATCGACTAGCTAGCTAGCTAGCTAGCTCTCGCTCT
AGATCGCTAGCTAGCTTTCTCGATCGGCTATCGATCG
ATCGATCGATCGACTAGGCGCTAGATCGATCGATCGA
TCGATCGGCGCGCTAGAGCTAGCTAGCTAGCATCGAT
CGACTAGCGATCGATCGCTACGTACTATAGCTCGATC
Machine Learning and
Bioinformatics
 Many bioinformatics problems are
classification problems (supervised
learning)
 Classify protein function, secondary
structure, find genes, etc.
 Some others are clustering problems
 Group similar proteins, similar gene
expression patterns, etc.
Machine Learning Algorithm Types
 Supervised learning: classification
 Naïve Bayesian classifier
 Bayesian network
 K-Nearest Neighbor (KNN)
 Decision Tree
 Artificial Neural Network
 Support Vector Machine
 Markov Chain / Hidden Markov Model (HMM)
 Unsupervised learning: clustering
 K-Means
 Etc.
Machine Learning Algorithm
Markov Chain Model

[Figure: a transition matrix shown alongside the equivalent transition diagram, with example transition counts]

a_st = P(s → t) = P(x_i = t | x_{i-1} = s)

# of transition probabilities = (# of states)^(order+1)
Machine Learning Algorithm
Markov Chain Model
Transition Matrix (a_sj)

Above is for order = 1 (conditioning on only the 1 preceding state/character)
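As a sketch of how such an order-1 transition matrix might be estimated from training sequences by maximum-likelihood counting (the function and variable names here are illustrative, not from the lab's actual code):

```python
from collections import defaultdict

def transition_matrix(sequences, alphabet="ACGT"):
    # Count first-order transitions s -> t over all training sequences
    counts = {s: defaultdict(int) for s in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    # Normalize each row into probabilities a_st = P(x_i = t | x_{i-1} = s)
    a = {}
    for s in alphabet:
        total = sum(counts[s].values())
        a[s] = {t: (counts[s][t] / total if total else 1.0 / len(alphabet))
                for t in alphabet}
    return a
```

Each row of the resulting matrix sums to 1; unseen states fall back to a uniform row so no probability is undefined.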


Machine Learning Algorithm
Markov Chain Model
A key property of Markov Chains is that the
probability of each symbol xi depends only on
the preceding symbol xi-1 rather than the
entire sequence.
P(x) = P(x1) · ∏ (i = 2 to L) a_{x_{i-1} x_i}
This equation shows that, in addition to the
transition probabilities, we must specify P(x1),
the probability of starting in a particular state:

P(x | model) = P(x1) · ∏ (i = 2 to L) a_{x_{i-1} x_i}

So we also need to capture P(x1), the initial probability vector.
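A sketch of scoring a sequence under a trained model, assuming the transition matrix and initial vector are plain dictionaries (working in log space avoids numerical underflow on long sequences):

```python
import math

def log_prob(seq, p1, a):
    # log P(x | model) = log P(x1) + sum over i >= 2 of log a_{x_{i-1} x_i}
    lp = math.log(p1[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        lp += math.log(a[prev][cur])
    return lp
```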


Classification Model
 Discriminating or sorting the data into choices
 For a binary classifier, it is + or – (mRNA or ncRNA)

Training data → Machine Learning algorithm → Classifier
Test data → Classifier → Evaluation of classifier
Novel data → Classifier → Prediction
Classification Model
 Split data into a training set and a test set
 Use the training set to train a classifier
 Test the classifier on the test set
 The final classifier is applied to novel data

Training data → Machine Learning algorithm → Classifier
Test data → Classifier → Evaluation of classifier
Novel data → Classifier → Prediction
Training: 10-Fold Cross Validation

[Figure: 10-fold cross-validation partitioning into M classes]

For us, M = 2 → a binary classifier: each sequence is either coding RNA (mRNA) or non-coding RNA (ncRNA)
Training: 10-Fold Cross Validation
 10 folds
 Binary classifier → a + transition matrix and a – transition matrix for each fold
 Therefore: 20 transition probability matrices, 20 starting probability vectors, and 20 output prediction files

[Sample output prediction file]
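The fold bookkeeping can be sketched with a generic k-fold splitter (names are illustrative; each held-out fold serves once as the test set while the rest train the classifier):

```python
def k_fold(items, k=10):
    # Partition items into k interleaved folds; each fold is held out
    # once as the test set while the remaining folds form the training set.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```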
Evaluation of Classification
Performance Contingency Matrices
(Confusion Matrices)

TP TP  TN
Sensitivity = TPR  Accuracy =
TP  FN TP  TN  FP  FN

TN TP  TN  FP  FN
Specificity = TNR  MCC =
TN  FP TP  FN TP  FP TN  FP TN  FN 
Evaluation of Classification
Performance (example)
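The four metrics above can be computed directly from the contingency counts; a minimal sketch (the function name is illustrative):

```python
import math

def metrics(tp, tn, fp, fn):
    # Standard binary-classification metrics from contingency counts
    sens = tp / (tp + fn)                          # sensitivity = TPR
    spec = tn / (tn + fp)                          # specificity = TNR
    acc = (tp + tn) / (tp + tn + fp + fn)          # accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))  # Matthews corr. coef.
    return sens, spec, acc, mcc
```

A perfect classifier (no false positives or negatives) scores 1.0 on all four.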
Process Review
 Step 0 - preface
 K-fold cross-validation is a commonly used statistical technique that takes a
set of m examples and partitions them into K sets ("folds") of size m/K. For
each fold, a classifier is trained on the other folds and then tested on that fold.
 Step 1 - training data - Markov Chain
 Train on known + sequences → a+sj (+ transition probability matrix) and P+(x1)
(+ starting probability vector)
 Train on known – sequences → a–sj (– transition probability matrix) and P–(x1)
(– starting probability vector)
 Step 2 - known test data - (known + sequences and known – sequences)
 Binary classify (sort) each sequence as + or – using conditional probabilities
(discrimination threshold)
 Step 3 - novel test data
 Generate contingency matrix - TP, TN, FP, FN
 Classifier performance/evaluation metrics (Sensitivity, Specificity, Accuracy,
MCC, etc.)
 Step 4 - evaluate
 Based on your design and/or goals: evaluate, tweak → go back to step 1 (or
maybe 0)
 Hence the name “supervised learning”
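Step 2's discrimination rule can be sketched as a log-odds test; the two scoring functions here are stand-ins for the trained + and – Markov models:

```python
def classify(seq, log_prob_pos, log_prob_neg, threshold=0.0):
    # Log-odds discrimination: predict '+' (mRNA) when the sequence is
    # more probable under the + model than under the - model by more
    # than the chosen discrimination threshold.
    return "+" if log_prob_pos(seq) - log_prob_neg(seq) > threshold else "-"
```

Raising the threshold trades sensitivity for specificity, which is exactly what sweeping it produces for the ROC analysis.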
Receiver Operating Characteristic (ROC)
Classification Performance Metric
 In signal detection theory, a receiver operating
characteristic (ROC) curve is a plot of
sensitivity (true positive rate) vs. 1 − specificity
(false positive rate) for a binary classifier
system as its discrimination threshold is varied.
 An ROC space is defined by FPR and TPR as
the x and y axes respectively, depicting the
relative trade-off between true positives
(benefits) and false positives (costs). Since
TPR is equivalent to sensitivity and FPR equals
1 − specificity, the ROC graph is sometimes
called the sensitivity vs. (1 − specificity) plot.
Each prediction result, or one instance of a
contingency matrix, represents one point in
ROC space.
 The Area Under the Curve (AUC) is equal to the
probability that a classifier will rank a randomly
chosen positive instance higher than a randomly
chosen negative one.
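That probabilistic reading of AUC suggests a direct rank-based computation, equivalent to the normalized Mann–Whitney U statistic; a sketch, assuming one score per test sequence:

```python
def auc(pos_scores, neg_scores):
    # AUC = P(a random positive scores higher than a random negative);
    # tied pairs contribute 1/2.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```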
Receiver Operating Characteristic (ROC)
Classification Performance Metric (example)
Status
 Preliminary Results
 Code
 Documented and robust
 Handles a variable discrimination threshold
for the classifier: log P(x | +) − log P(x | −)
 Accepts Markov order = 1, 2, 3
Near Future
 Investigate behavior when varying
the Markov order (2 or 3)
 ROC – further investigate
classifier discrimination
thresholds
 Supervised machine learning:
examine, adjust, iterate
Long Future
 Combine this classifier with other
simple classifiers such as KNN,
naïve Bayesian, etc., to generate
an ensemble classifier (a
two-level classifier model).
 Each simple classifier makes its
own prediction (positive or
negative), and different classifiers
may disagree. The second-level
classifier combines these
predictions and generates a
final prediction.
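A minimal first-level combiner might look like the following majority vote (a trained second-level classifier, as described above, would replace the plain vote):

```python
def ensemble_predict(classifiers, seq):
    # First level: each simple classifier votes '+' or '-'.
    # Second level (here, a plain majority vote) combines the votes
    # into a single final prediction.
    votes = [clf(seq) for clf in classifiers]
    return "+" if votes.count("+") > votes.count("-") else "-"
```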
Take backs to class
Bioinformatics
 Research! Learning!
 Complexity
 Interdisciplinary nature of
bioinformatics
[Diagram: Venn diagram of Biology, Computer Science, and Applied Mathematics]

 "Bioinformatics, the computer-based
analysis of biological data sets, is an
interdisciplinary field at the intersection of
biology, computer science and applied
mathematics." – Dr. Dick Fluck,
The Diplomat, December 3, 2009
Thanks
 Dr. Jing Hu
 Dr. Ellie Rice
 Franklin and Marshall College
 HHMI
