
Walkthrough of genome assembly and evaluating its quality.

Workflow of a genomics experiment


Genome vs transcriptome
First-generation sequencing vs NGS
Results of a run: reads of 40-400 nt; read quality is good at the beginning of a
read and errors accumulate towards the end
MiSeq and HiSeq
Illumina sequencing flowcell.
What you get as output.
Assessing quality: reads
Assessing quality: tiles
Assessing quality: final
De novo genome assembly: assembling short sequence fragments into contigs
Shortest superstring problem: given a set of strings S, we want to find a string
that contains every string in S as a substring and is as short as possible
We assume that: reads are 100% accurate, identical reads must come from the
same location on the genome, and the simplest (shortest) reconstruction is the best
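The shortest superstring idea can be sketched by brute force: try every ordering of the reads, merge neighbours by their longest overlap, and keep the shortest result. This is only an illustration under the assumptions above (error-free reads, no read contained in another); the function names `overlap` and `shortest_superstring` are made up for this sketch, and the approach is only feasible for a handful of reads.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_superstring(reads):
    """Brute force: merge the reads in every possible order, keep the shortest."""
    unique_reads = sorted(set(reads))  # identical reads are assumed to be one locus
    if not unique_reads:
        return ""
    best = None
    for order in permutations(unique_reads):
        merged = order[0]
        for read in order[1:]:
            # Append only the part of the read that is not already covered.
            merged += read[overlap(merged, read):]
        if best is None or len(merged) < len(best):
            best = merged
    return best
```

For the reads AAT, ATG, TGC and GCC this returns AATGCC. The number of orderings grows factorially with the number of reads, which is why real assemblers do not work this way.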
The problem is that even modern sequencers that generate 100 nt reads do not
capture every 100-mer present in the genome, so people generally work with
shorter k-mers of a chosen length.
3-mers are used here, obtained by cutting the original reads into all overlapping
substrings of length 3.
We then:
Make the 3-mers unique, and draw a directed edge from x to y wherever the suffix
of x overlaps the prefix of y.
Find the Hamiltonian path, which is the path that visits every vertex exactly
once.
Record the first letter of each vertex along the path, and all letters of the last vertex.
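The k-mer, overlap-graph and Hamiltonian-path steps can be sketched as follows. This is a minimal illustration, assuming error-free reads and a graph small enough to search by brute force; the names `kmers` and `hamiltonian_assembly` are invented for this example.

```python
from itertools import permutations


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def hamiltonian_assembly(reads, k=3):
    # Vertices are the unique k-mers found in the reads.
    vertices = sorted({kmer for read in reads for kmer in kmers(read, k)})
    if not vertices:
        return None
    # Directed edge x -> y when the (k-1)-suffix of x equals the (k-1)-prefix of y.
    edges = {(x, y) for x in vertices for y in vertices
             if x != y and x[1:] == y[:-1]}
    # Brute-force search for a path that visits every vertex exactly once.
    for path in permutations(vertices):
        if all((a, b) in edges for a, b in zip(path, path[1:])):
            # First letter of each vertex, plus all letters of the last vertex.
            return "".join(v[0] for v in path[:-1]) + path[-1]
    return None
```

With the reads AATGC and TGCCA and k = 3, the unique 3-mers are AAT, ATG, TGC, GCC and CCA, and the only Hamiltonian path spells AATGCCA. Trying every permutation is exponential in the number of vertices, which is the practical face of the difficulty described next.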
Unfortunately, the Hamiltonian path problem is difficult to solve: it is NP-complete, which means that no efficient (polynomial-time) algorithm for it is known.
Euler studied the Seven Bridges of Königsberg problem and found that a connected
graph with undirected edges contains an Eulerian cycle exactly when every
node in the graph has an even number of edges touching it.
Euler's theorem states that a connected directed graph has an Eulerian cycle if and
only if it is balanced, i.e. every vertex has indegree equal to outdegree. Finding an
Eulerian path is computationally much simpler than finding a Hamiltonian path.
Construct a de Bruijn graph
Edges represent k-mers
Vertices correspond to (k-1)-mers
First form a node for every distinct prefix or suffix of a k-mer
Second connect vertex x to vertex y with a directed edge if some k-mer has
prefix x and suffix y, and label the edge with this k-mer.
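The two construction steps can be written down directly. The sketch below stores the graph as a plain dictionary mapping each (k-1)-mer to a list of (successor, edge label) pairs; this representation and the name `de_bruijn_graph` are choices made for this example, not something prescribed by the notes.

```python
def de_bruijn_graph(kmers):
    """Build a de Bruijn graph: vertices are (k-1)-mers, edges are k-mers."""
    graph = {}
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        # Directed edge prefix -> suffix, labelled with the k-mer itself.
        graph.setdefault(prefix, []).append((suffix, kmer))
        # Make sure the suffix exists as a vertex even if it has no outgoing edge.
        graph.setdefault(suffix, [])
    return graph
```

For the 3-mers AAT, ATG, TGC, GCC and CCA this gives the vertices AA, AT, TG, GC, CC and CA, joined in a single chain; the vertex CA has no outgoing edges.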
A vertex is semi-balanced if its indegree and outdegree differ by exactly one.

A connected directed graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all of its other vertices are balanced.
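The degree condition and the path itself can be computed in time linear in the number of k-mers. The sketch below uses Hierholzer's algorithm, a standard way to find an Eulerian path that the notes do not name; the function name `eulerian_assembly` is invented for this example, and the k-mers are assumed to be error-free.

```python
from collections import Counter, defaultdict


def eulerian_assembly(kmers):
    """Spell a sequence by walking an Eulerian path through the de Bruijn graph."""
    out_edges = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for kmer in kmers:
        x, y = kmer[:-1], kmer[1:]
        out_edges[x].append(y)
        outdeg[x] += 1
        indeg[y] += 1
    nodes = set(indeg) | set(outdeg)
    if not nodes:
        return None
    # At most two semi-balanced vertices, every other vertex balanced.
    semi = [v for v in nodes if abs(indeg[v] - outdeg[v]) == 1]
    if len(semi) > 2 or any(abs(indeg[v] - outdeg[v]) > 1 for v in nodes):
        return None
    # Start at the vertex with one extra outgoing edge, if there is one.
    starts = [v for v in semi if outdeg[v] - indeg[v] == 1]
    start = starts[0] if starts else min(v for v in nodes if outdeg[v] > 0)
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:
        return None  # some edges were unreachable: the graph is not connected
    # First vertex in full, then the last letter of every following vertex.
    return path[0] + "".join(v[-1] for v in path[1:])
```

For the 3-mers AAT, ATG, TGC, GCC and CCA the path is AA, AT, TG, GC, CC, CA, which spells AATGCCA. When the graph admits several Eulerian paths (for example because of repeats), this sketch returns just one of them.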
Underlying assumptions
Four hidden assumptions that do not hold for next-generation sequencing:
We can generate all k-mers present in the genome
All k-mers are error free
Each k-mer appears at most once in the genome
The genome consists of a single chromosome.
The smaller the k-mer, the higher the probability that we see all k-mers.
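A small deterministic check makes this concrete: for a fixed set of reads, count what fraction of the genome's k-mers appear in at least one read as k grows. The toy genome, the reads and the function names below are all invented for illustration.

```python
def kmer_set(sequence, k):
    """Set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def fraction_of_kmers_seen(genome, reads, k):
    """Fraction of the genome's distinct k-mers found in at least one read."""
    genome_kmers = kmer_set(genome, k)
    if not genome_kmers:
        raise ValueError("k is longer than the genome")
    read_kmers = set()
    for read in reads:
        read_kmers |= kmer_set(read, k)
    return len(genome_kmers & read_kmers) / len(genome_kmers)
```

With the toy genome ATGCGTAC and the two reads ATGCG and CGTAC, all 3-mers are seen (fraction 1.0), four of the five 4-mers are seen (0.8), and only two of the four 5-mers are seen (0.5).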
k-mer multiplicity.
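Multiplicity here is taken to mean the number of times each k-mer occurs. A minimal sketch of counting it, assuming a plain string as input (the name `kmer_multiplicity` is made up for this example):

```python
from collections import Counter


def kmer_multiplicity(sequence, k):
    """Count how many times each k-mer occurs in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

In the toy sequence ATGATGCAT the 3-mer ATG occurs twice, while every 5-mer occurs only once. This is the trade-off with the previous point: a smaller k makes it more likely that all k-mers are seen, but also more likely that a k-mer appears more than once in the genome.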
How good the assembly is depends principally on four things:
how long the reads are (how much overlap you can get)
how many reads you have (the more you have, the greater the chance of
overlap)
the error rate
the nature of your DNA.
If you have repeats, this causes problems for the assembly, as the program
does not know where the read should go.
Paired-end sequencing allows you to overcome the problem of repeats, as you
know which unique sequence each repeat should be near.
