A METHOD FOR REFERENCE-FREE GENOME ASSEMBLY QUALITY
ASSESSMENT
by
JOSHUA BURKHART
A THESIS
Presented to the Department of Computer and Information Science
and the Graduate School of the University of Oregon
in partial fulfillment of the requirements
for the degree of
Master of Science 
June 2013
THESIS APPROVAL PAGE
Student: Joshua Burkhart
Title: A Method for Reference-Free Genome Assembly Quality Assessment
This thesis has been accepted and approved in partial fulfillment of the requirements for 
the Master of Science degree in the Department of Computer and Information Science 
by:
John S. Conery Chairperson
and
Kimberly Andrews Espy Vice President for Research and Innovation;
Dean of the Graduate School 
Original approval signatures are on file with the University of Oregon Graduate School.
Degree awarded June 2013
ii
© 2013 Joshua Burkhart 
iii
THESIS ABSTRACT
Joshua Burkhart
Master of Science
Department of Computer and Information Science
June 2013
Title: A Method for Reference-Free Genome Assembly Quality Assessment
How to assess the quality of a genome assembly without the help of a reference
sequence is an open question. Only a few techniques are currently used in the literature
and each has obvious bias. An additional method, using restriction enzyme associated
DNA (RAD) marker alignment, is proposed here. With high enough density, this method 
should be able to assess the quality of de novo assemblies without the biases of current 
methods.
With the growing ambition to sequence new genomes and the accelerating ability 
to  do  so  cost  effectively,  methods  to  assess  the  quality  of  reference-free  genome 
assemblies will become increasingly important. In addition to the existing methods of 
known sequence alignment, RAD marker alignment may contribute to this effort.
iv
CURRICULUM VITAE
NAME OF AUTHOR:  Joshua Burkhart
GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene
Grand Valley State University, Allendale, Michigan
DEGREES AWARDED:
Master of Science, Computer and Information Science, 2013
University of Oregon
Bachelor of Science, Computer Science, 2010
Grand Valley State University, Michigan
AREAS OF SPECIAL INTEREST:
Scientific Computing
Data Mining
Information Theory
Bioinformatics
PROFESSIONAL EXPERIENCE:
Graduate Teaching Fellow, University of Oregon, March 2012 - June 2013
TechStudent-PostGraduate, Raytheon, June 2012 - September 2012
Software Engineer, Blue Medora, January 2011 - December 2011
GRANTS, AWARDS, AND HONORS:
Grand Valley State University Upsilon Pi Epsilon Hon. Soc., Vice President, 2010
Grand Valley State University, Student Senator, 2010
v
ACKNOWLEDGMENTS
I wish to express sincere appreciation to Dr. John Conery, Dr. William Bradshaw,
Dr. Christina Holzapfel, Rudy Borowczak, and Alida Gerritsen for their assistance in the
preparation of this manuscript. In addition, special thanks are due to Dr. Matt Streisfeld
and Dr. Jay Sobel for first introducing me to to the science and wonder of evolution and
genomic research.
vi
To my parents, Dr. Gary Burkhart and Marilyn Burkhart, who have inspired me to
think  critically  about  my  intuitions.  And  to  my  two  dear  friends,  Alex  Boorsma  and
Kimberly Baker, who have provided emotional support during my academic pursuits.
vii
TABLE OF CONTENTS
Chapter Page
I. BACKGROUND...................................................................................................... 1
II. PROJECT INTRODUCTION................................................................................. 9
III. PROCEDURE........................................................................................................ 10
IV. FUTURE DIRECTIONS....................................................................................... 14
APPENDIX: FIGURES............................................................................................... 16
REFERENCES CITED................................................................................................ 30
viii
LIST OF FIGURES
Figure Page
1. A sequence with 13 base pairs along with its 5-mers and 7-mers......................... 16
2. A sequence of 13 base pairs along with its 5-mers and the frequencies of each
5-mer...................................................................................................................... 17
3. Plotting several k-mer distributions together may lead to insights about the
information contained in a source sequence.......................................................... 18
4. Many k-mers are found with frequencies below 150,000...................................... 19
5. Every k-mer found in the source sequence appears at least three times................ 20
6. We see N50 = 2...................................................................................................... 21
7. The length of the contig containing the position is reported as the N50 Length... 22
8. Two RAD markers aligning................................................................................... 23
9. Two cut sites near each other and two RAD markers aligning.............................. 23
10. One RAD marker aligning..................................................................................... 24
11. Two cut sites somewhat separated and two RAD markers aligning...................... 24
12. It is possible to reconstruct the source sequence by finding an Eulerian path....... 25
13. Sample execution records...................................................................................... 26
14. How two genome assemblies align to each other.................................................. 27
15. How two genomes assemblies' predicted protein products align to each other..... 28
16. Complex alignments.............................................................................................. 29
ix
CHAPTER I
BACKGROUND
Whole genome sequencing enumerates the nucleotides found in the chromosomal DNA of 
an  organism  and  provides  a  most  intimate  view  of  an  individual.  Today,  genome 
sequencing contributes to our understanding of biology in many ways including medical 
diagnoses, gene network discovery, and evolutionary adaptation.
Sequencing even a single organism can assist in medical diagnoses and understanding 
gene  networks  by  both  identifying  genetic  markers  known to  be  linked with  certain 
conditions and identifying expressed and unexpressed regions of DNA along with the 
areas that may contribute to those regions' promotion or repression.
 
Comparative genomics focuses on how different genome sequences relate to each other and 
necessarily involves sequencing more than one organism. It  can be used to find inter-
species  and  intra-species  differences.  Common  differences  among  genomes  include 
inversions,  repeats,  deletions,  and  transpositions.  It  can  be  especially  interesting  if 
genotypic differences can be well correlated with phenotypic variation.
The brief history of genome sequencing began when Watson, J. and Crick, F. published 
their seminal paper[1] on the structure of DNA in 1953. Early advancements in the field 
include Maxam-Gilbert sequencing[2] and Sanger Sequencing[3], both published in 1977. 
They provided methods by which DNA could be sequenced with high accuracy, though at a 
1
cost  that  was  prohibitive  for  large projects.  A revolutionary  technique  termed  random 
sequencing  or  shotgun sequencing  was  used  as  early  as  1995[4]  to  sequence  the 
Haemophilus influenzae Rd. genome and was later used to sequence the human genome[5]. 
Shotgun sequencing was later parallelized[6] allowing for large amounts of DNA to be 
sequenced at low cost. This parallelization of DNA sequencing is sometimes called high 
throughput sequencing or  next-generation sequencing (NGS). A further extension of this 
technique  allowed  for  multiple  libraries  to  be  sequenced  simultaneously  using  unique 
identifying sequences  or  barcodes[37].  The state  of  the  art  is  exemplified  by industry 
leaders like Life Technologies SOLiD Next-Generation Sequencing, 454 Sequencing GS 
Systems, and Illumina HiSeq Systems.
NGS has many sequencing applications. One of these is genome sequencing. At a high 
level of abstraction, a typical NGS genome sequencing project is outlined below.
1. Cells from organism(s) of interest are collected.
2. DNA is separated from other cellular components.
3. DNA is amplified using a cloning process like the polymerase chain reaction[39] 
(PCR)
4. DNA is  fragmented  or  sheared randomly  either  by  chemical  or  mechanical 
processes
5. DNA fragments of a specific size are collected, forming a library
6. The library is loaded into a sequencing machine
2
7. Each DNA fragment in the library has up to 100 base pairs of one or both of its 
ends sequenced, resulting in reads
8. The sequencing machine produces files containing reads to be used for  further 
analysis
Because the NGS process yields  only short  disconnected sequences,  several  additional 
stages must be completed before meaningful insights can be made.
Firstly, by removing or filtering some of the reads produced by the sequencing process, one 
may be able to detect and address several issues that may hinder the remainder of the 
genome sequencing process. Several features should be filtered out from a set of reads prior 
to further analysis.
1. Sequencing machines are prone to misidentifying base pairs due to one or more of 
several phenomena, described well by Ledergerber and Dessimoz[7], resulting in 
what is termed a miscall or mismatch.
2. Due to the nature of the algorithms used in the later stage of genome assembly, it is 
difficult  to  determine  the  length  and  correct  placement  of  highly  repetitive 
sequences like  AAAAAAAA... or  AGTAGTAGTAGT... These sequences should be 
discarded.
3
3. Sequence contamination is the inclusion of 'unlikely' sequence in a genome. It has 
been reported that several published genomes have been found to contain sequence 
data probably describing another organism in the same experimental environment, 
such as that of human laboratory equipment operators[8].
Modern DNA sequencers record both the base pair and a quality score associated with each 
base pair and report it using the FASTQ file format[9]. Quality scores give a measure of 
assurance of each call and are typically generated using Phred[10].
Using filtering programs that average base pair quality scores along each read, portions of 
reads can be discarded or trimmed once predefined confidence thresholds are reached[11].
A k-mer is an ordered subset of necessarily adjacent base pairs with a length of some 
natural number, k, base pairs found to be in some larger sequence. By enumerating all the 
k-mers of length k, the amount of unique information in a genome can be measured using 
the distribution of resultant k-mers. By varying k, it is as if viewing a sequence through 
different  lenses,  each  providing a  slightly  altered  picture  of  the information contained 
therein (See APPENDIX: FIGURES: Fig. 1).
A k-mer frequency is the number of times a specific k-mer is seen in a larger sequence (See 
APPENDIX: FIGURES: Fig. 2).
4
Filters can use k-mer distributions to keep only those k-mers with maximum or minimum 
frequencies. Highly frequent k-mers may indicate repetitive sequence, contamination, or 
other over-represented DNA due to errors during the cloning process and can be discarded. 
Additionally, rare k-mers may indicate sequencer error and can often be discarded.
Typically, k-mer frequency is plotted against the number of k-mers with each frequency, or 
k-mer  count in  graphs  (See  APPENDIX:  FIGURES:  Fig.  3,  4,  5).  Notice  how scale 
changes affect each view.
Genome assembly is the process by which short sequences are connected or assembled into 
longer contiguous sequences termed contigs. There are several algorithms that perform this 
task but most genome assembly software today uses algorithms based on one of either de 
Bruijn graphs or string graphs.
A popular algorithm used for genome assembly relies on de Bruijn graphs which consider 
(k-1)-mers as vertices and k-mers as edges in a directed graph and search for an Eulerian 
path in order to reconstruct a source sequence (See APPENDIX: FIGURES: Fig. 12). For 
a more complete description, see [12].
Alternatively,  algorithms  using  string  graphs  consider  reads  as  strings or  curves and 
attempt to find the best intersection between these curves. For a more complete description, 
see [13].
5
The  goal  of  genome  assembly  is  to  accurately  reproduce  the  underlying  contiguous 
sequences present in a genome. Theoretically, each contig should represent a chromosome. 
When sequencing an organism for the first time, a reference genome is not available. This 
is called reference-free or de novo genome sequencing. The accuracy of such an assembly 
is  difficult  to  assess.  Several  methods  have  been  proposed  but  no  single  method 
consistently assures the highest accuracy[14].
An early method used to gauge assembly quality was to quantify connectivity. NG50 is the 
number of contigs into which the first 50% of the base pairs in the estimated genome 
assemble when ordered from largest to shortest. N50 is the number of contigs into which 
the first 50% of the base pairs in contigs assemble when ordered from largest to shortest 
(See APPENDIX: FIGURES: Fig. 6). N50 Length is the size of the middle contig (See 
APPENDIX: FIGURES: Fig. 7). Metrics like this can be taken with other percentages too: 
75%, 90%, etc. Together, these are termed "NX" statistics.
In addition to NX statistics, the mean contig length and longest contig length are simple 
ways to  determine how well  an assembly's  contigs  are connected,  though they do not 
address how well any of the sequence produced represents the actual genome.
One thought is to compare the quality of genome assemblies produced by different genome 
assembly software pipelines, select the best one, and trust its output. An advantage to this 
6
method is that testing is theoretically easy. A synthetic genome can be created as was done 
in Assemblathon 1[15], whose sequence is known. It can then be artificially broken into 
reads, used as input to the pipelines, and aligned with the contigs produced by the pipelines 
to test for accuracy. Unfortunately, this method has not been shown to produce a clear 
winner and the most accurate genome assembly pipeline remains undecided[14, 15, 16]. 
Another thought is to align some known sequence to a genome assembly and call the 
assembly "accurate" based on whether or not a high percentage of the known sequence can 
be matched to corresponding sequence in the assembly.
Known sequence can be obtained using Expressed Sequence Tags[17] (EST's) and aligned 
to  a  genome  assembly  to  test  for  congruence.  These  are  produced  by  sequencing 
complementary  DNA (cDNA)  created  using  RNA expressed  from  genic  regions.  A 
drawback of EST alignment is representation bias. Genic regions are not evenly distributed 
around a genome; extragenetic factors introduce positional bias, such as epigenetic gene 
silencing[38] which forces EST's to cluster around less tightly packed euchromatic regions 
of  a  chromosome[18].  This  can leave  more  tightly  packed  heterochromatic regions  of 
chromosomes untested, thus whole-genome assembly quality uncertain.
Another method of using known sequence to assess genome assembly quality is to align 
sequences  believed to  be  conserved in  an organism to  its  genome assembly[14].  This 
requires both making a priori assumptions about the structure of an organism's genome and 
7
suffers from the same uneven distribution bias as EST alignment.
A novel method to assess the quality of genome assemblies is to consider the alignment of 
RAD markers[19].  Because some restriction enzymes have been shown to digest DNA 
indiscriminately[40],  regardless  of  gene  density  or  epigenetic  structure,  resultant  RAD 
markers are theoretically more evenly distributed around a genome than either EST's or 
known conserved sequence.  Additionally,  several RAD treatments may be applied to a 
genome to increase marker density. Each restriction enzyme cut site should result in the 
creation of two RAD markers, so by aligning restriction enzyme cut sites to a genome 
assembly and comparing the ratio of cut site alignments to RAD marker alignments, the 
percent RAD marker alignment can be computed. 
t/2c = a
where:
t = number of RAD markers that align to genome assembly
c = number of restriction-enzyme cut sites that align to genome assembly
a = RAD marker alignment ratio
(A high ratio indicates a genome assembly expected to be of high quality.)
8
CHAPTER II
PROJECT INTRODUCTION
An example of a project using genome assembly is the ongoing effort to sequence several 
Wyeomyia smithii populations found along the eastern seaboard of North America in order 
to  study the genetic  basis  for several  varying characters  including photoperiodism and 
propensity to feed on blood.
9
CHAPTER III
PROCEDURE
Following DNA collection from groups of organisms representative of target  W. smithii 
populations, reads were obtained from an Illumina HiSeq 2000 in FASTQ format with 
Illumina descriptors[9].
The  FASTQ files  underwent  an  initial  assessment  so as  to  estimate  storage,  memory, 
processor, and bandwidth usage.
Initially, read filtering was performed using a custom perl script that counted quality scores 
until  a  predefined threshold  was reached.  Several  more  sophisticated  methods of  read 
filtering have been published and this method was ultimately abandoned.
The kmer_filter program, one of the components of the Stacks pipeline[20], filters reads 
using k-mer distributions with maximum and minimum frequency thresholds optimized 
using k-mer distribution visual representations (See APPENDIX: FIGURES: Fig. 3, 4, 5).
A project  called  Quake[21]  is  a  package  developed  for  use  especially  with  Illumina 
machine output and uses read quality values in addition to known error rates in order to 
estimate miscalls and correct them when possible.
A program called Diginorm[22] is a package that normalizes k-mer coverage, narrowing 
10
the range of k-mer frequencies, and has the effect of greatly decreasing the amount of data 
in read files (over 50% reductions were seen during this project) without greatly reducing 
the amount of information.
The  effects  of  filtering  can  be  quantified  in  several  ways.  First,  the  number  of  reads 
retained can be counted and compared to the number of reads prior to filtering. If a high 
percentage of reads are left, the filter may not have been effective. Second, the resultant k-
mer distribution can be plotted and inspected for interesting features[23]. Third, executing a 
genome assembly using the filtered reads may indicate what filtering works well. In fact, 
much  time  was  spent  "bouncing"  between  read  filtering  and  genome  assembly  in  an 
attempt to optimize the results of both processes.
Using the results of several genome assembler competitions[14, 15, 16] as a guide, several 
attractive assemblers were selected at the outset of this project.
The  Broad  Institute's  ALLPATHS-LG  assembler[42]  was  considered  due  to  its  high 
ratings in Assemblathon 1[15], 2[14], and GAGE[16] but required DNA fragments of 
varying length and was thus unfit for the available dataset which consisted entirely of 
(common length) short reads.
11
SOAPdenovo2[43] had shown good results in Assemblathon 1[15], 2[14], and GAGE[16] 
but  numerous  parameters  and  sparse  documentation  made  it  nebulous  and  a  poor 
candidate as a tool for unfamiliar users. 
MSR-CA, an early version of MaSuRCA[44]. It was shown to be a competitive candidate 
in GAGE[16], but, due to its early stage of development at the outset of this project, only 
registered  users  had  access  to  it.  Thus  it  was  not  available  for  installation  on  the 
ACISS[26] system.
The String Graph Assembler[24] is an assembler that relies on a string graph algorithm 
based  on  overlaps.  It  was  used  following  the  procedure  described  in  the  source 
repository[25].  The  reported  memory  usage  was  low,  even  for  large  genomes  in 
comparison to other assemblers and it was a high scorer in Assemblathon 1[15], 2[14], 
and GAGE[16].  Unfortunately, the longest available execution queue available on the 
ACISS cluster computer is 336 hours[27]. SGA was unable to complete an assembly in 
this amount of time. Additionally, SGA was found to be poorly documented and require 
unrealistic execution times for large genomes.
Velvet[28] was a mediocre scorer in Assemblathon 1[15] and GAGE[16] but was well 
documented, could finish executions within the time limits of ACISS queues, and could 
accept  a  single  length  DNA fragment  library  as  input.  Velvet  is  a  memory-intensive 
program and  only  special  "fat"  nodes  with  384  GB RAM[27]  could  complete  some 
12
genome  assemblies.  Additionally,  not  all  Velvet  executions  finished  in  a  reasonable 
amount  of  time.  Velvet  tended  to  have  unpredictable  run  times  and parameter  value 
testing  and  optimization  was  necessary  to  produce  assemblies  (See  APPENDIX: 
FIGURES:  Fig.  13).  Initial  parameter  estimates  were  made  with  the  help  of  a  ruby 
script[47] that reported average nucleotide coverage and expected k-mer coverage for 
several values of k. To further assist in optimizing Velvet parameters, an R script was 
written[45]  that  displayed  node  coverage,  as  explained  in  the  Velvet  manual[46].  In 
summary, Velvet was a good choice for the ACISS computing environment as the high 
memory  resources  required  were  available  and  the  time  required  for  some  other 
assemblers was not. 
Several methods were used to compare genome assemblies. Connectivity statistics were 
computed  using  a  custom  ruby  script[29]  and  RAD  marker  alignment  scores  were 
computed  using  a  package  called  Radiqual[30].  Review  the  Radiqual  README.md 
document for a brief description and usage[31] (See APPENDIX: FIGURES: Fig. 8, 9, 10, 
11).
13
CHAPTER IV
FUTURE DIRECTIONS
To fully validate the ability of RAD marker alignment to discern accurate assemblies 
from  inaccurate  ones  further  research  is  required.  Genome  assembly  alignment  to 
synthetic or finished reference genomes should be used as bases for quality and compared 
to EST, known conserved region, and RAD marker alignment for those assemblies. A 
relatively  strong  correlation  between  observed  accuracy  and  that  predicted  by  RAD 
alignment would confirm its usefulness. 
The variation of features found in k-mer distributions deserves further investigation. It is 
interesting that Fig. 5 shows 69-mers with an opposite concavity to that of 51-mers at low 
frequencies and that the k-mer distributions in Fig. 3 have similar inflection points.
As we seek to  better  understand the  origin and nature  of  life  on  this  planet,  projects 
conducting de novo genome sequencing are becoming more numerous. A few examples are 
below.
1. The 1000 Genomes Project[32] has a goal of sequencing 1000 human genomes in 
order to discover low-frequency genetic variation.
2. The Genome 10K[33] is a project whose goal is to assemble 10,000 vertebrate 
species in order to make discoveries about genetic diversity.
14
3. The 1000 plant genomes initiative[34] attempts to generate sequence for 1000 plant 
species.
With the assemblies of multiple organisms, many interesting discoveries can be made. 
Software like MUMmer[48] can be used to align genomes to one another to search for 
large-scale  differences(See  APPENDIX:  FIGURES:  Fig.  14,  15)  and  small-scale 
differences(See  APPENDIX:  FIGURES:  Fig.  16)  that  could  explain  the  origins  of 
disease, gene network development, or speciation events. 
Genome sequencing is a field of active research and technological development. Promising 
technologies include Illumina's 150-250 base pair insert sequence size library protocol[35], 
allowing even short reads to be more connective then the current 100 base pair size. Oxford 
Nanopore Technologies has also developed biosensors that may one day be used to provide 
even longer reads[36].
With the growing ambition to sequence new genomes and the accelerating ability to do so 
cost effectively, methods to assess the quality of reference-free genome assemblies will 
become increasingly important. In addition to the existing methods of EST and conserved 
sequence alignment, RAD marker alignment may contribute to this effort.
15
APPENDIX:
FIGURES
Fig. 1. A sequence with 13 base pairs along with its 5-mers and 7-mers. Each of the 5-
mers and 7-mers appears only once in the source sequence.
16
Fig. 2. A sequence of 13 base pairs along with its 5-mers and the frequencies of each 5-
mer. Some of the 5-mers appear only once in the source sequence but (4) and (5) each 
appear three times.
17
Fig. 3. Plotting several k-mer distributions together may lead to insights about the 
information contained in a source sequence.
18
Fig. 4. Many k-mers are found with frequencies below 150,000.
19
Fig.  5.  Every k-mer found in the source sequence appears at  least  three times.  Also,  
notice the distribution of 69-mers behaves differently from the other k-mer distributions 
plotted here. 
20
Fig. 6. Color-coded contigs (top) in no particular order, representing the way they are 
output from a genome assembler. The sum of the lengths of these contigs is taken as N, 
and the contigs are ordered from longest to shortest (middle). By halving N (bottom), a 
position along the ordered contigs is selected, indicated here by a black line. The number 
of contigs encountered as this position is approached, from largest to shortest, is the N50. 
In this figure we see N50 = 2.
21
Fig. 7. Similar to Fig. 6, above, an ordered list of contigs is required to calculate N50 
Length. Instead of counting the number of contigs encountered as the N/2 position is 
approached,  the  length  of  the  contig  containing  the  position  is  reported  as  the  N50 
Length.
22
Fig. 8. Two RAD markers aligning to the genome assembly around a single cut site.
Fig. 9. Two cut sites near each other and two RAD markers aligning to the genome 
assembly.
23
Fig. 10. One RAD marker aligning to the genome assembly around a single cut site.
Fig. 11. Two cut sites somewhat separated and two RAD markers aligning to the genome 
assembly around those cut sites.
24
Fig. 12. By constructing a digraph using the (k-1)-mers as nodes, indicated here as blue 
circles, and the k-mers as directed edges, indicated here as orange arrows, it is possible to 
reconstruct the source sequence by finding an Eulerian path. Circular source sequences, 
such as bacterial chromosomes, are reconstructed using Eulerian cycles instead.
25
Fig. 13. Sample execution records are shown here. Notice the wide variation in results  
given slight changes in parameters like min_k_freq and expected_cov.
26
Fig. 14. This figure shows how two genome assemblies align to each other. Notice much 
of the sequence aligns well. Forward alignments are in red, reverse alignments are in 
blue.
27
Fig. 15. This figure shows how two genomes assemblies' predicted protein products align 
to each other. Notice the interesting difference in concavity when comparing the forward 
alignments, in blue, and the reverse alignments, in red.
28
Fig. 16. This figure shows contigs from a reference genome assembly, in blue, at top, and 
contigs  from  a  query  genome  assembly,  in  green,  at  bottom.  Notice  the  complex 
alignments some contigs from the query have with those from the reference.
29
REFERENCES CITED
[1] Watson, J. D., Crick, F. H. C., A Structure for Deoxyribose Nucleic Acid, Nature, Vol.  
171, No. 4356, pg. 737-738, 1953.
[2] Maxam, A. M., Gilbert, W., A New Method for Sequencing DNA, PNAS, Vol. 74, No. 
2, pg. 560-564, 1977.
[3]  Sanger,  F.,  Nicklen,  S.,  Coulson,  A.  R.,  DNA Sequencing with  Chain-Terminating 
Inhibitors, PNAS, Vol. 74, No. 12, pg. 5463-5467, 1977.
[4]  Flieshmann,  R.  D.,  et.  al.,  Whole-Genome Random Sequencing  and  Assembly  of 
Haemophilus influenzae Rd., Science., Vol. 269, No. 5223, pg. 496-512, 1995.
[5] Venter, J. C., et. al., The Sequence of the Human Genome, Science, Vol. 291, No. 5507, 
pg. 1304-1351, 2001.
[6]  Brenner,  S.,  et.  al.,  Gene  Expression  Analysis  by  Massively  Parallel  Signature 
Sequencing (MPSS) on Microbead Arrays, Nature Biotechnology, Vol. 18, pg. 630-
634, 2000.
[7]  Ledergerber,  C.,  Dessimoz,  C.,  Base-Calling  for  Next-Generation  Sequencing 
Platforms, Briefings in Bioinformatics, Vol. 12, No. 5, pg. 489-497, 2011.
[8]  Longo,  M.  S.,  Abundant  Human  DNA Contamination  Identified  in  Non-Primate 
Genome Databases, PLoS ONE, Vol. 6, No. 2, 2011.
[9] Cock, P. J.  A., et. al., The Sanger FASTQ File Format for Sequences with Quality 
Scores, and the Solexa/Illumina FASTQ Variants, Nucleic Acids Research, Vol. 38, 
No. 6, pg. 1767-1771, 2009.
[10]  Ewing,  B.,  et.  al.,  Base-Calling  of  Automated  Sequencer  Traces  Using  Phred.  I. 
Accuracy Assessment, Genome Research, Vol. 8, No. 3, pg. 175-185, 1998.
[11] Chou, H., Holmes, M. H., DNA Sequence Quality Trimming and Vector Removal, 
Bioinformatics, Vol. 17, No. 12, pg. 1093-1104, 2001.
[12] Compeau, P. E. C., et. al., How to Apply de Bruijn Graphs to Genome Assembly, 
Nature Biotechnology, Vol. 29, pg. 987-991, 2011.
[13] Myers, E. W., The Fragment Assembly String Graph, Bioinformatics, Vol 21, Sup. No. 
2, pg. ii79-ii85, 2005. 
30
[14] Bradnam, K. R., et. al., Assemblathon 2: Evaluating de novo Methods of GEnome 
Assembly  in  Three  Vertebrate  Species,  pre-print  available  at 
http://arxiv.org/pdf/1301.5406v1.
[15] Earl, D. A., et. al., Assemblathon 1: A Competitive Assessment of de novo Short Read 
Assembly Methods, Genome Research, published online, 2011.
[16] Salzberg, S. L., GAGE, A Critical Evaluation of Genome Assemblies and Assembly 
Algorithms, Genome Research, published online, 2011.
[17] Parkinson, J.,  Blaxter, M., Expressed Sequence Tags: An Overview, Methods Mol 
Biol, Vol. 533, pg. 1-12, 2009.
[18] Stralfors, A., Ekwall, K., Heterochromatin and Euchromatin--Organization, 
Boundaries, and Gene Regulation, Encyclopedia of Molecular Cell Biology and 
Molecular Medicine, Wiley-VCH Verlag GmbH & Co., 2011.
[19] Miller, M. R., et. al., Rapid and Cost-Effective Polymorphism Identification and 
Genotyping using Restriction Site Associated DNA (RAD) Markers, Genome 
Research, Vol. 17, pg. 240-248, 2007.
[20] Catchen, J., et. al., Stacks: An Analysis Tool Set for Population Genomics, Molecular 
Ecology, Vol. 22, No. 11, pg. 3124-3140, 2013.
[21] Kelly, D. R., et. al., Quake: Quality-Aware Detection and Correction of Sequencing 
Errors, Genome Biology, Vol. 11, R116, 2010.
[22] Brown, T. C., A Reference-Free Algorithm for Computational Normalization of 
Shotgun Sequencing Data, pre-print available at http://arxiv.org/abs/1203.4802.
[23]  Chor,  B.,  et.  al.,  Genomic DNA k-mer Spectra:  Models  and Modalities,  Genome 
Biology, Vol. 10, R108, 2009.
[24] Simpson, J. T.,  Durbin,  R.,  Efficient  de novo Assembly of Large Genomes using 
Compressed Data Structures, Genome Research, published online, 2011.
[25] https://github.com/jts/sga/blob/master/src/examples/sga-celegans.sh
[26] MRI-R2: Acquisition of an Applied Computational Instrument for Scientific Synthesis 
(ACISS), Grant #: OCI-0960354.
[27] https://aciss.uoregon.edu/wiki/Submission_queues
[28] Zerbino, D. R., Birney, E., Velvet: Algorithms for de novo Short Read Assembly using 
31
de Bruijn Graphs, Genome Research, Vol. 18, No. 5, pg. 821-829, 2008.
[29] https://github.com/joshuaburkhart/bio/blob/master/param_ests.rb
[30] https://github.com/joshuaburkhart/RadiQual
[31] https://github.com/joshuaburkhart/RadiQual/blob/master/README.md
[32] http://www.1000genomes.org/
[33] https://genome10k.soe.ucsc.edu/
[34] http://www.onekp.com/
[35]  http://investor.illumina.com/phoenix.zhtml?c=121127&p=irol-
newsArticle&ID=1252407
[36]  http://www.nanoporetech.com/technology/introduction-to-nanopore-
sensing/introduction-to-nanopore-sensing
[37] Wong, K. H., et. al., Multiplex Illumina Sequencing Using DNA barcoding, Current 
Protocols in Molecular Biology, Ch. 7, Unit 7.11, 2013.
[38] Jaenisch, R., Bird, A., Epigenetic Regulation of Gene Expression: How the Genome 
Integrates Intrinsic and Environmental Signals, Nature Genetics Supplement, Vol. 
33, 2003.
[39] Kramer, M. F., Coen, D. M., The Polymerase Chain Reaction, Current Protocols in 
Molecular Biology, Wiley Interscience, Ch. 15, Unit 15.0.1-15.0.3, 2009.
[40] Tiwari, P. K., Lakhotia, S. C., Restriction Enzyme Digestion of Heterochromatin in 
Drosophila Nasuta, Journal of Biosciences, Vol. 16, No. 4, pg. 187-197, 1991.
[41] ten Bosch, J. R., Grody, W. W., Keeping Up With the Next Generation: Massively 
Parallel Sequencing in Clinical Diagnostics, Journal of Molecular Diagnostics, Vol. 
10, No. 6, 2008.
[42]  Gnerre,  S.,  et.  al.,  High-Quality  Draft  Assemblies  of  Mammalian  Genomes  from 
Massively Parallel Sequence Data, PNAS, Vol. 108, No. 4, pg. 1513-1518, 2010.
[43] Luo, R., et. al., SOAPdenovo2: An Empirically Improved Memory-Efficient Short-
Read de novo Assembler, GigaScience, Vol. 1, No. 18, 2012.
[44] Zimin, A. V., et. al., The MaSuRCA Genome Assembler, Submitted to Genome 
32
Biology, 2013.
[45] https://github.com/joshuaburkhart/bio/blob/master/plot_coverage.R
[46] http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf
[47] https://github.com/joshuaburkhart/bio/blob/master/param_ests.rb
[48] Kurtz, S., et. al., Versatile and Open Software for Comparing Large Genomes, 
Genome Biology, Vol. 5, R12, 2004.
33