Classification of mRNA and ncRNA
Edward Bujak
a_st = P(x_i = t | x_{i-1} = s)
# transitions = (# states)^(order+1)
Machine Learning Algorithm
Markov Chain Model
Transition Matrix (a_sj)
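The transition matrix above can be estimated directly from labeled training sequences by counting adjacent-symbol pairs. A minimal sketch, assuming an RNA alphabet and a pseudocount of 1 for unseen transitions (the function name and pseudocount choice are illustrative, not from the slides):

```python
def train_markov(sequences, alphabet="ACGU"):
    """Estimate first-order transition probabilities
    a_st = P(x_i = t | x_{i-1} = s) from training sequences.
    A pseudocount of 1 avoids zero probabilities for unseen transitions."""
    counts = {s: {t: 1.0 for t in alphabet} for s in alphabet}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):  # each adjacent pair is one transition
            counts[s][t] += 1.0
    probs = {}
    for s in alphabet:
        total = sum(counts[s].values())
        probs[s] = {t: counts[s][t] / total for t in alphabet}
    return probs
```

Each row of the result sums to 1, as a row of a stochastic matrix must.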
Classification Model
[Diagram: training data → machine learning algorithm → classifier; test data → classifier → evaluation of classifier; novel data → classifier → prediction]
Split data into a training and a test set
Use the training set to train a classifier
Test the classifier on the test set
The final classifier is applied to novel data
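The first step above, splitting labeled data into training and test sets, can be sketched as a small helper (names and the 25% test fraction are illustrative assumptions):

```python
import random

def split_data(examples, test_fraction=0.25, seed=0):
    """Shuffle labeled examples and split them into (training set, test set).
    A fixed seed keeps the split reproducible across runs."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```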
Training: 10-Fold Cross Validation
Sensitivity = TPR = TP / (TP + FN)
Specificity = TNR = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
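The four metrics above follow directly from the contingency matrix; a minimal sketch (the function name is illustrative):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Sensitivity (TPR), Specificity (TNR), Accuracy, and MCC
    from the four cells of a contingency matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # MCC undefined -> 0
    return sensitivity, specificity, accuracy, mcc
```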
Evaluation of Classification Performance (example)
Process Review
Step 0 - preface
K-fold cross-validation is a commonly used statistical technique that takes a
set of m examples and partitions them into K sets ("folds") of size m/K. For
each fold, a classifier is trained on the other folds and then tested on that fold.
Step 1 - training data - Markov Chain
Train on known + sequences: a+_sj (+ transition probability matrix) and P+(X1) (+ starting probability vector)
Train on known - sequences: a-_sj (- transition probability matrix) and P-(X1) (- starting probability vector)
Step 2 - known test data - (known + sequences and known – sequences)
Binary classify (sort) each sequence as + or – using conditional probabilities
(discrimination threshold)
Step 3 - novel test data
Generate contingency matrix - TP, TN, FP, FN
Classifier performance/evaluation metrics (Sensitivity, Specificity, Accuracy,
MCC, etc.)
Step 4 - evaluate
Based on your design and/or goals: evaluate, tweak, and go back to step 1 (or
maybe step 0)
Hence the name “supervised learning”
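Step 2's binary discrimination can be sketched as a log-odds score under the two trained models: the sequence is classified + when log P(seq | + model) - log P(seq | - model) exceeds the discrimination threshold. All names here are illustrative, and the matrices are assumed to be in the nested-dict form of the training sketch above:

```python
import math

def log_odds(seq, a_pos, a_neg, p_pos, p_neg):
    """Discrimination score for one sequence:
    log P(seq | + model) - log P(seq | - model).
    Classify as + when the score exceeds the chosen threshold (e.g. 0)."""
    # starting-probability term, then one term per transition
    score = math.log(p_pos[seq[0]]) - math.log(p_neg[seq[0]])
    for s, t in zip(seq, seq[1:]):
        score += math.log(a_pos[s][t]) - math.log(a_neg[s][t])
    return score
```

Sweeping the threshold away from 0 trades sensitivity against specificity, which is exactly what the ROC curve in the next section visualizes.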
Receiver Operating Characteristic (ROC)
Classification Performance Metric
In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a plot of sensitivity (true positive rate) vs. 1 - specificity (false positive rate) for a binary classifier system as its discrimination threshold is varied.
An ROC space is defined by FPR and TPR as the x and y axes respectively, depicting the relative trade-off between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR equals 1 - specificity, the ROC graph is sometimes called the sensitivity vs. (1 - specificity) plot. Each prediction result, or one instance of a contingency matrix, represents one point in ROC space.
The Area Under the Curve (AUC) is equal to the
probability that a classifier will rank a randomly
chosen positive instance higher than a randomly
chosen negative one.
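That rank interpretation of AUC can be computed directly, without drawing the curve at all: count the fraction of (positive, negative) score pairs where the positive ranks higher, with ties worth 1/2. A minimal sketch (function name illustrative; the O(n*m) pair loop is fine for small evaluations):

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC via its rank interpretation: the probability that a randomly
    chosen positive scores above a randomly chosen negative (ties = 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))
```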
Receiver Operating Characteristic (ROC)
Classification Performance Metric (example)
Status
Preliminary Results
Code
Documented and robust
Can handle various discrimination thresholds for the classifier [log(positive identification) - log(negative identification)]
Accepts Markov order = 1, 2, 3
Near Future
Investigate behavior varying Markov order (2 or 3)
ROC: further investigate classifier discrimination thresholds
Supervised machine learning: examine, adjust, iterate
Long Future
Include this classifier with other simple classifiers, such as KNN, naive Bayes, etc., to generate an ensemble classifier (it could be a two-level classifier model). Each simple classifier will make a prediction, and they may make different predictions (positive or negative). The second-level classifier combines these prediction results and generates a final prediction.
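In its simplest form, the second-level combiner described above is a majority vote over the base classifiers' +1/-1 predictions. A minimal sketch of that idea (names illustrative; a trained second-level model could replace the vote):

```python
def ensemble_predict(classifiers, x):
    """Two-level ensemble reduced to majority vote: each base classifier
    maps x to +1 (positive) or -1 (negative); ties default to negative."""
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes > 0 else -1
```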
Takeaways for class
Bioinformatics
Research! Learning!
Complexity
Interdisciplinary nature of bioinformatics
[Venn diagram: Biology, Applied Mathematics, Computer Science]
"Bioinformatics, the computer-based analysis of biological data sets, is an interdisciplinary field at the intersection of biology, computer science and applied mathematics." - Dr. Dick Fluck, The Diplomat, December 3, 2009
Thanks
Dr. Jing Hu
Dr. Ellie Rice
Franklin and Marshall College
HHMI