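Attribution-NonCommercial-NoDerivs 2.0 Korea. Users may freely copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below. Attribution: you must credit the original author. NonCommercial: you may not use this work for commercial purposes. NoDerivs: you may not alter, transform, or build upon this work. For any reuse or distribution, you must make clear the license terms applied to this work. These conditions do not apply if you obtain separate permission from the copyright holder. Your rights as a user under copyright law are not affected by the above. This is a human-readable summary of the Legal Code. Disclaimer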
A DISSERTATION FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Diversity and evolution of Panax ginseng and
its relatives inferred from complete chloroplast
genome and nrDNA sequences
By
KYUNGHEE KIM
FEBRUARY, 2016
MAJOR IN CROP SCIENCE AND BIOTECHNOLOGY DEPARTMENT OF PLANT SCIENCE
UNDER THE DIRECTION OF DR. TAE-JIN YANG
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF SEOUL NATIONAL UNIVERSITY
BY
KYUNGHEE KIM
MAJOR IN CROP SCIENCE AND BIOTECHNOLOGY DEPARTMENT OF PLANT SCIENCE
NOVEMBER, 2015
APPROVED AS A QUALIFIED DISSERTATION OF KYUNGHEE KIM FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
BY THE COMMITTEE MEMBERS
FEBRUARY, 2016 CHAIRMAN Hee-Jong Koh, Ph.D. VICE-CHAIRMAN Tae-Jin Yang, Ph.D. MEMBER Do-Soon Kim, Ph.D. MEMBER Yeisoo Yu, Ph.D. MEMBER Gyoungju Nah, Ph.D.
GENERAL ABSTRACT
Chloroplast (cp) genomes and nuclear ribosomal DNA (nrDNA) are the main sequences used to study genetic diversity and evolution in the plant kingdom. The cp genomes are 57- to 217-kb circular DNA molecules containing ~100 conserved genes. In this study, a high-throughput method of de novo assembly and error correction was developed to simultaneously obtain complete sequences of the chloroplast genome and nuclear ribosomal DNA units from relatively small amounts of whole genome shotgun (WGS) sequence produced by a next generation sequencing (NGS) platform; the method was named de novo assembly of low coverage WGS (dnaLCW). The dnaLCW method was successfully used to obtain both types of sequence for hundreds of plants with various genome sizes. This research opens a new era for practical application of NGS data to high-copy genomic components and represents a breakthrough technology for analyzing genetic diversity, for barcoding at both the inter- and intra-species levels, and for fundamental understanding of evolution in the plant kingdom. I applied dnaLCW to study evolution and genetic diversity among and within Panax species. The cp genomes and 45S nrDNAs of five Panax species and five species from related genera, including new sequences from seven species, were investigated simultaneously. I studied genetic diversity and evolutionary history based on representative sequences of both the cytoplasmic and nuclear genomes. The cp genomes ranged from 155,993 bp to 156,730 bp in size and shared the same structure, with 79 protein-coding, 30 tRNA, and 4 rRNA genes in common. The complete 45S unit is ~11 kbp, with a 5.8-kbp transcribed region and a variable intergenic spacer region. The ten cp genomes showed 97.5-99.6% sequence similarity and synonymous substitution rates of 0.009 to 0.032. Nucleotide diversity varied among 73 cp protein-coding genes, and the pabM, rps19 and rpl22 genes showed the highest Ks values. Three genes, atpF, ycf2 and clpP, showed the highest Ka/Ks values (>1), suggesting that these three genes might have been under positive selection pressure during speciation of the ten Panax relatives. Sequence polymorphism rates in the 45S nrDNAs were 0.2-1.5%, and the 26S rRNA gene showed the highest polymorphism. Based on phylogenomic analysis inferred from both cp genomes and nrDNAs, the taxonomic positions of Panax relatives were clearly resolved into two monophyletic lineages as Panax-Aralia and
reveals 9-12.5 million years ago (MYA) for diversification of species in the Araliaceae family and 3.9 MYA for diversification of the Panax genus. Divergence and speciation in Panax presumably occurred during the uplift of the Himalaya-Tibetan plateau approximately 3.9 MYA, followed by a recent tetraploidization event and speciation between P. ginseng and P. quinquefolius 0.9-2.25 MYA. For a comprehensive study of intra-species diversity in the cp genome and 45S nrDNA sequences of P. ginseng, I obtained complete cp genome and nrDNA sequences from 11 ginseng cultivars in Korea. The cp genome sizes ranged from 156,241 to 156,425 bp, and the major size variation was derived from differences in the copy number of tandem repeats in the ycf1 gene and in the intergenic regions of rps16-trnUUG and rpl32-trnUAG. The complete 45S nrDNA unit sequences were 11,091 bp, representing a consensus single transcriptional unit with an intergenic spacer region. Comparative analysis of these sequences together with those previously reported for three Chinese accessions identified very rare but unique polymorphisms in the cp genome among P. ginseng cultivars. There were 12 intra-species polymorphisms (six SNPs and six InDels) among the 14 cultivars. I also identified five SNPs in the 45S nrDNA of the 11 Korean ginseng cultivars. From the 17 unique informative polymorphic sites, I developed six reliable markers for practical analysis of ginseng diversity and cultivar authentication.
Key words: Next generation sequencing (NGS), chloroplast (cp), nuclear ribosomal DNA (nrDNA), Panax ginseng, polyploidization
CONTENTS
GENERAL ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
GENERAL INTRODUCTION
REFERENCES
CHAPTER I
High throughput and simultaneous assembly of complete chloroplast and nuclear ribosomal DNA sequences from plant genomes
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Preparation of whole-genome NGS reads
WGS assembly and building of complete cp genome and nrDNA sequences
Annotation and comparative analysis of cp and nrDNA sequence
Validation of polymorphic regions in cp genome sequences
RESULTS
De novo assembly of low coverage WGS for plant genomes
Optimization of dnaLCW to obtain complete cp genome sequence
Identification and correction of cp genome assembly errors
Obtaining complete sequences for major nrDNA units
DISCUSSION
The dnaLCW workflow for simultaneous determination of complete cp and nrDNA sequences
Application to various plant species
CONCLUSION
REFERENCES
CHAPTER II
Evolution of Panax relatives inferred from complete chloroplast genome and nrDNAs
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Plant materials
DNA preparation and whole-genome shotgun sequencing
De novo assembly and validation of cpDNAs and 45S nrDNA sequences
Gene prediction of cpDNA and 45S nrDNA sequences
Comparative analysis of cp genomes and 45S nrDNA sequences
Phylogenetic analysis and estimation of divergence time
RESULTS
Complete cp genome and nrDNA sequences
Substitution rate based on cpDNAs of ten species in Araliaceae
Sequence divergence of nrDNAs in Panax and its relatives
Molecular clock and divergence in Panax and its relatives
DISCUSSION
Genetic diversity inferred from sequence variation based on cpDNAs
Low sequence variation of nrDNA among Panax relatives
Divergence and phylogeny based on cpDNA and nrDNA of Araliaceae
Evolution and molecular clocks for speciation of Panax relatives
REFERENCES
CHAPTER III
Comprehensive survey of genetic diversity in chloroplast genomes and 45S nrDNAs within Panax ginseng species
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Plant materials
DNA preparation and whole-genome shotgun sequencing
Cp genome and 45S nrDNA assembly
Comparative analysis and development of DNA markers
RESULTS
Complete cp genome and nrDNA sequences of 11 ginseng cultivars
Sequence variations among cp genomes of 14 P. ginseng accessions
Sequence divergence of 45S nrDNAs within P. ginseng species
Validation of intra-species polymorphism and development of cultivar authentication markers
DISCUSSION
Complete cp genome and nrDNA sequences derived from low-coverage whole-genome NGS data
SNPs and InDels at the inter- and intra-species level
Hotspot polymorphic sites in the cp genome of Panax species
Development of molecular markers for authentication of ginseng cultivars
REFERENCES
APPENDIX
LIST OF TABLES
Table 1-1 Statistics for assembly and copy numbers for Cp and nrDNAs
Table 1-2 Characterization of the 30 longest contigs in the de novo assembly of the rice Os2 dataset (1x genome coverage; 50x cp genome coverage)
Table 1-3 Characterization of the 30 longest contigs among the de novo assembly of the ginseng Pg2 dataset (0.05x genome coverage; 50x cp genome coverage)
Table 1-4 Summary of chloroplast genome assembly using different amounts of WGS data from O. sativa (cv. Nipponbare) and P. ginseng (cv. Chunpoong)
Table 2-1 Cp genomes and 45S nrDNA sequences used for comparative analysis of 10 Araliaceae species
Table 2-2 Ks values and estimated divergence time of 10 Araliaceae species
Table 3-1 Statistics of WGS and assembly summary for nine P. ginseng accessions
Table 3-2 Summary of nucleotide polymorphisms in Cp genomes and 45S nrDNA sequences of 14 P. ginseng accessions
LIST OF FIGURES
Figure 1-1 Characterization of the 30 longest contigs in assembly of Oryza and Ginseng species
Figure 1-2 Optimization of datasets and Genome assembler for de novo assembly
Figure 1-3 Identification and correction of mis-assembled in false gaps region
Figure 1-4 Identification and correction of mis-assembled in false SNPs region
Figure 1-5 Identification and correction of mis-assembled in tandem repeat copy number variation region
Figure 1-6 Identification and correction of mis-assembled in monopolymer copy number variation region
Figure 1-7 Schematic diagram of the method used to obtain a complete 45S unit
Figure 1-8 Confirming complete nrDNA units to Oryza genome
Figure 1-9 Validation of polymorphic regions of CP genome & nrDNA sequences
Figure 1-10 dnaLCW pipeline for simultaneous completion of the cp genome and 45S sequences
Figure 2-1 Gene map and nucleotide polymorphism of cp genomes in ten Araliaceae species
Figure 2-2 Summary of Ka/Ks and Ks value among cp genomes of ten species
Figure 2-3 Validation for polymorphic sites with TR CNV in cpDNA of Araliaceae species
Figure 2-4 Assembly and comparison of 45S nrDNA sequences
Figure 2-5 Phylogenomic tree and divergence time of 10 Araliaceae species
Figure 3-1 Summary of cp genome assembly of nine ginseng cultivars
Figure 3-2 Chloroplast genome map of nine Panax ginseng
Figure 3-3 Schematic diagram of a representative complete 45S nrDNA
Figure 3-4 Validation of copy number variation (CNV) of TR in rps16 ~ trnUUG region
LIST OF ABBREVIATIONS
Cp Chloroplast
CP Cultivar ‘Chunpoong’
dnaLCW de novo assembly of low coverage whole genome shotgun sequencing
InDel Insertion/Deletion
IGS Intergenic spacer
IR Inverted repeat
ITS1, 2 Internal transcribed spacer 1 and 2
Ka Substitutions per non-synonymous site
Ks Substitutions per synonymous site
LSC Large single copy
Mya Million years ago
NGS Next generation sequencing
NMPT Nuclear and mitochondrial partial
NORs Nucleolar organizer regions
nrDNA Nuclear ribosomal DNA
PE Paired-end
PQ Panax quinquefolius
SNP Single nucleotide polymorphism
SSC Small single copy
TR Tandem repeat
WGS Whole genome shotgun
GENERAL INTRODUCTION
Plant cells contain three genomes with different evolutionary origins and history: nuclear, mitochondrial and chloroplastic. Chloroplast (cp) genomes and nuclear ribosomal DNA (nrDNA) units are the primary sequences used to analyze plant genetic diversity as well as nuclear and cytoplasmic evolution (Qiu et al. 1999; Soltis et al. 1999). The cp genomes are maintained uni-parentally, usually via maternal inheritance (Wolfe et al. 1989; Reboud and Zeyl 1994), and are 120- to 217-kb circular DNA molecules containing ~100 conserved genes and relatively diverse intergenic spaces (IGSs) (Palmer 1985; Harris and Ingram 1991; Wolfe and Randle 2004; Shaver et al. 2006; Rivarola et al. 2011; Wang and Messing 2011). Within plant nuclear genomes, nrDNA is organized into highly abundant tandemly-repeated transcription units (up to 22,000 copies) that make up nucleolar organizer regions (NORs) (Rogers and Bendich 1987). Due to their conserved roles in ribosome assembly and nucleolus formation, these high-copy nrDNA units have remained highly homogeneous within species through concerted genome evolution and are well-conserved among species. The 45S blocks include tandemly arrayed copies of the 45S cistron unit, which comprises conserved 18S, 5.8S, and 26S (or 28S) gene clusters, relatively variable internal transcribed spacers (ITS1 and ITS2), and variable long IGSs (Álvarez and Wendel 2003; Wicke et al. 2011; Galián et al. 2012).
Although next-generation sequencing (NGS) technology has enabled remarkable progress in nuclear genomics, sequencing of chloroplast and nuclear ribosomal DNA has remained challenging due to their high-copy characteristics. More than 600 complete cp genome sequences have nevertheless been reported in GenBank, and many recent studies have used NGS platforms to obtain complete cp genome sequences by various approaches. In this study, a method for de novo assembly of low coverage WGS (dnaLCW) is proposed to assemble WGS reads into high-quality cp genome data and, simultaneously, complete sequences for nrDNA units. Its simple gap-filling and error-correction procedure resolves mis-assembled regions without additional efforts such as PCR and Sanger sequencing. To date, new complete sequences of the cp genome and nrDNA units have been generated for more than 300 species or cultivars with a range of genome sizes. This method should greatly facilitate the use of highly informative plastome and nrDNA dynamics data to elucidate the evolution of land plants and can also be extended to mitochondrial genomes and nuclear repeats.
Araliaceae (the ginseng family), to which P. ginseng belongs, comprises about 1,500 species, many of which, like the ginseng species, have long been used in oriental medicine. Panax, Eleutherococcus and Aralia are major groups in this family (Tang & Eisenbrand 1992; Davydov & Krikorian 2000). Panax, a major genus, consists of 15 species, including P. ginseng, P. quinquefolius, P. notoginseng, P. japonicus and P. vietnamensis. Most Araliaceae species are distributed in northern Asia, parts of central Asia, and North America. Cytological work has shown that Araliaceae species are highly diverse in ploidy and in nuclear genome size. Most species of Araliaceae have a conserved basic chromosome number of x = 12, with 2n = 24 to 192, reflecting highly diverse ploidy levels (Plant syst. Evol. 1999; Yi et al. 2004; Rattenbury 1957; Stace et al. 1993). Within Panax, both diploid species (2n = 24), such as P. notoginseng, P. vietnamensis and P. japonicus, and tetraploid species (2n = 48), such as P. ginseng and P. quinquefolius, have been reported. Taxonomic classifications of ginseng species using molecular data derived from chloroplast (cp) DNA and nuclear DNA have been reported (Wen et al. 1998; Wen et al. 2001; Plunkett et al. 2004; Lee et al. 2004; Li et al. 2013; Kim et al. 2015a; Kim et al. 2015b). Despite these previous data, the ginseng family is still regarded as a difficult group in which to resolve taxonomic positions and evolutionary history, owing to its complicated morphological characters and a deficiency of genetic resources. In addition, although P. ginseng was domesticated more than 500 years ago, its breeding is difficult due to a long life cycle and low seed yield. In Korea, three local landraces, Jakyung, Chungkyung and Hwangsook, have been cultivated traditionally, and nine elite cultivars have been bred and registered through pure line selection from the landraces. The nine registered cultivars show many agricultural traits and unique characteristics. Two local landraces, ‘Jakyung’ and ‘Hwangsook’, are still the main types cultivated in Korea, due to the lack of an established ginseng seed industry. Thus, although various species and cultivars are recognized, their genetic diversity and evolution are not fully understood at either the inter- or intra-species level. Genomic studies of the diversity and evolution of ginseng species are therefore needed.
REFERENCES
Álvarez, I. & Wendel, J. F. (2003). Ribosomal ITS sequences and plant phylogenetic inference. Mol. Phylogenet. Evol. 29, 417-434.
Davydov, M., & Krikorian, A. D. (2000). Eleutherococcus senticosus (Rupr. & Maxim.) Maxim.(Araliaceae) as an adaptogen: a closer look. Journal of Ethnopharmacology, 72(3), 345-393.
Galián, J. A., Rosato, M. & Rosselló, J. A. (2012). Early evolutionary colocalization of the nuclear ribosomal 5S and 45S gene families in seed plants: evidence from the living fossil gymnosperm Ginkgo biloba. Heredity 108, 640–646.
Harris, S. A. & Ingram, R. (1991). Chloroplast DNA and biosystematics: The effects of intraspecific diversity and plastid transmission. Taxon 40, 393-412
Kim, K., Lee, S. C., Lee, J., Lee, H. O., Joh, H. J., Kim, N. H., ... & Yang, T. J. (2015). Comprehensive survey of genetic diversity in chloroplast genomes and 45S nrDNAs within Panax ginseng species. PloS one, 10(6).
Kim, K., Lee, S. C., Lee, J., Yu, Y., Yang, K., Choi, B. S., ... & Yang, T. J. (2015). Complete chloroplast and ribosomal sequences for 30 accessions elucidate evolution of Oryza AA genome species. Scientific reports, 5
Lee, C., & Wen, J. (2004). Phylogeny of Panax using chloroplast trnC–trnD intergenic region and the utility of trnC–trnD in interspecific studies of plants. Molecular phylogenetics and evolution, 31(3), 894-903.
Li, R., Ma, P. F., Wen, J., & Yi, T. S. (2013). Complete sequencing of five Araliaceae chloroplast genomes and the phylogenetic implications. PloS one, 8(10), e78568.
Palmer, J. D. (1985). Comparative organization of chloroplast genomes. Annu. Rev. Genet. 19, 325-354
Plunkett, G. M., Wen, J., & Lowry Ii, P. P. (2004). Infrafamilial classifications and characters in Araliaceae: Insights from the phylogenetic analysis of nuclear (ITS) and plastid (trnL-trnF) sequence data. Plant Systematics and Evolution, 245(1-2), 1-39.
Qiu, Y.-L. et al. (1999). The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402, 404–407.
Rattenbury, J. A. (1957). Chromosome numbers in New Zealand angiosperms. In Transactions of the Royal Society of New Zealand (Vol. 84, No. 4, pp. 936-38).
Reboud, X. & Zeyl, C. (1994). Organelle inheritance in plants. Heredity 72,132-140
Rivarola, M. et al. (2011). Castor bean organelle genome sequencing and worldwide genetic diversity analysis. PLoS ONE 6, e21743
Rogers, S. O. & Bendich, A. J. (1987). Heritability and variability in ribosomal RNA genes of Vicia faba. Genetics 117, 285-295.
Shaver, J. M., Oldenburg, D. J. & Bendich, A. J. (2006). Changes in chloroplast DNA during development in tobacco, Medicago truncatula, pea, and maize. Planta 224, 72-82
Soltis, P. S., Soltis, D. E. & Chase, M. W. (1999). Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402, 402-404.
Stace, H. M., Armstrong, J. A., & James, S. H. (1993). Cytoevolutionary patterns in Rutaceae. Plant Systematics and Evolution, 187(1-4), 1-28.
Tang, W., & Eisenbrand, G. (1992). Panax ginseng CA Mey (pp. 711-737). Springer Berlin Heidelberg.
Wang, W. & Messing, J. (2011). High-throughput sequencing of three Lemnoideae (Duckweeds) chloroplast genomes from total DNA. PLoS ONE 6, e24670
Wen, J., Shi, S., Jansen, R., & Zimmer, E. (1998). Phylogeny and biogeography of Aralia sect. Aralia (Araliaceae). American Journal of Botany, 85(6), 866-866.
Wen, J., Plunkett, G. M., Mitchell, A. D., & Wagstaff, S. J. (2001). The evolution of Araliaceae: a phylogenetic analysis based on ITS sequences of nuclear ribosomal DNA. Systematic Botany, 26(1), 144-167.
Wicke, S., Costa, A., Muñoz, J. & Quandt, D. (2011). Restless 5S: The re-arrangement(s) and evolution of the nuclear ribosomal DNA in land plants. Mol. Phyl. Evol. 61, 321-332.
Wolfe, A. D. & Randle, C. P. (2004). Recombination, heteroplasmy, haplotype polymorphism, and paralogy in plastid genes: Implications for plant molecular systematics. Systematic Botany 29, 1011-1020
Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. & Li, W. H. (1989). Date of the monocot-dicot divergence estimated from chloroplast DNA sequence data. Proc. Natl. Acad. Sci. USA. 86, 6201-6205.
Yi, T., Lowry, P. P., & Plunkett, G. M. (2004). Chromosomal evolution in Araliaceae and close relatives. Taxon, 53(4), 987-1005.
CHAPTER I
High throughput and simultaneous assembly of complete chloroplast and nuclear ribosomal DNA sequences from plant genomes
ABSTRACT
Chloroplast genomes and nuclear ribosomal DNA units are the primary sequences used to analyze plant genetic diversity. In this study, I describe a high-throughput method to simultaneously obtain complete sequences of the chloroplast genome and nuclear ribosomal DNA units, the representative sequences for diversity of cytoplasmic and nuclear genomes, respectively. De novo assembly of low-coverage whole-genome shotgun next-generation sequence (dnaLCW) and in silico error correction were developed to obtain both types of sequence. The dnaLCW method was successfully applied to obtain complete cp genomes and nrDNA sequences from various plants. Genomic study using these sequences, which represent maternally inherited chloroplast genomes and bi-parentally inherited ribosomal DNAs, elucidates aspects of the molecular taxonomic classification and evolutionary history of plants. This research represents a breakthrough in applying next-generation sequencing information to the analysis of genetic diversity and plant evolution.
INTRODUCTION
Plant cells contain three genomes with different evolutionary origins and history: nuclear, mitochondrial and chloroplastic. Chloroplast (cp) genomes and nuclear ribosomal DNA (nrDNA) units are the primary sequences used to analyze plant genetic diversity as well as nuclear and cytoplasmic evolution (Qiu et al. 1999; Soltis et al. 1999). The cp genomes are maintained uni-parentally, usually via maternal inheritance (Wolfe et al. 1989; Reboud and Zeyl 1994), and are 57- to 217-kb circular DNA molecules containing ~100 conserved genes and relatively diverse intergenic spaces (IGSs) (Palmer 1985; Harris and Ingram 1991; Wolfe and Randle 2004; Shaver et al. 2006; Rivarola et al. 2011; Wang and Messing 2011). Within plant nuclear genomes, nrDNA is organized into highly abundant tandemly-repeated transcription units (up to 22,000 copies) that make up nucleolar organizer regions (NORs) (Rogers and Bendich 1987). Due to their conserved roles in ribosome assembly and nucleolus formation, these high-copy nrDNA units have remained highly homogeneous through concerted genome evolution within species and are well-conserved among species. Four rRNA components usually reside in two independent chromosomal locations, the 5S nrDNA (5S) and 45S nrDNA (45S) blocks in higher plants, though some ancient plants such as Ginkgo biloba, moss, and algae have the 5S and 45S components in one tandem unit (Wicke et al. 2011; Galián et al. 2012). The 45S blocks include tandemly arrayed copies of the 45S cistron unit, which comprises conserved 18S, 5.8S, and 26S (or 25S or 28S) gene
clusters, relatively variable internal transcribed spacers (ITS1 and ITS2), and variable long IGSs (Álvarez and Wendel 2003; Wicke et al. 2011). Both cp genome and nrDNA sequences are widely used to examine plant genetic diversity and evolution.
Although next-generation sequencing (NGS) technology has enabled remarkable progress in nuclear genomics, sequencing of chloroplast and nuclear ribosomal DNA has remained challenging due to their high-copy characteristics. Whereas more than 600 cp genomes have been reported in GenBank, complete 45S unit sequences are known for only a few species, including rice and tomato, because of their highly repetitive nature. Most reported cp genome sequences were obtained by conventional methods (Golenberg et al. 1990; Sang et al. 1997). Recently, several studies have utilized NGS platforms to obtain complete cp genome sequences using isolated chloroplast DNA, reference cp-guided mapping, or de novo sequence assembly, followed by significant efforts to fill gaps using PCR and Sanger sequencing (Nock et al. 2011; Wang and Messing 2011; Zhang et al. 2011; Asif et al. 2013; Liu et al. 2013; McPherson et al. 2013).
Plant whole genome shotgun (WGS) data based on NGS technologies always contain cp sequences at various levels, depending on the tissue and the extraction method used for DNA preparation. Here, a method for de novo assembly of low coverage WGS (dnaLCW) was developed to assemble those reads into high-quality cp genome data and simultaneously obtain complete sequences for nrDNA units. Simple and efficient solutions are provided for gap-filling and error correction in sequence assembly without additional efforts such as PCR and Sanger sequencing. Theoretically, this method can be used to simultaneously generate complete cp genomes and nrDNA units for more than 50 samples using data from a single lane of Illumina HiSeq2000. I successfully generated new complete sequences of the cp genome and 45S nrDNA units for more than 300 species or cultivars with a range of genome sizes. This method should revolutionize the use of plastome and nrDNA dynamics to elucidate the evolution of land plants and can also be extended to mitochondrial genomes and nuclear repeats.
MATERIALS AND METHODS
Preparation of whole-genome NGS reads
Leaf samples were harvested from a P. ginseng cultivar grown on a farm of Seoul National University, Suwon, Korea, and high-quality genomic DNA was extracted using a modified CTAB method (Allen et al. 2004). A PE library with 500-bp insert size was constructed using the Illumina PE DNA library kit according to the manufacturer’s instructions and sequenced using an Illumina HiSeq2000 by the National Instrumentation Center for Environmental Management (NICEM, http://nicem.snu.ac.kr/, Korea) and Macrogen (http://dna.macrogen.com/, Korea), or using an Illumina MiSeq or NextSeq500 by LabGenomics (www.labgenomics.co.kr, Korea). A total of 44.4 Gbp and 220.9 Gbp of Illumina PE reads were produced from total genomic DNA of the rice cultivar ‘Nipponbare’ (NP) and the ginseng cultivar ‘Chunpoong’ (ChP), respectively. Assemblies of WGS reads representing more than 70x genome coverage in rice and ginseng yielded no suitably long, unique cp contigs. I then tested assembly of the cp genome and nrDNA using low-coverage WGS sequences.
WGS assembly and building of complete cp genome and nrDNA sequences
Raw reads with Phred scores of 20 or less were removed from the total NGS PE reads using the CLC quality trim tool (quality_trim, included in the CLC ASSEMBLY CELL package ver. 4.06 beta. 67189). Sub-datasets with various levels of cp genome coverage were extracted from the trimmed NP and ChP WGS reads and assembled using either the CLC de novo assembler included in the CLC ASSEMBLY CELL package or SOAPdenovo included in the SOAP package (ver. 1.12), with default parameters. Sequence gaps were filled by Gapcloser, included in the SOAP package (ver. 1.12). Representative contigs for the cp genome or nrDNAs were retrieved from the total contigs using Nucmer (Kurtz et al. 2004) with reference sequences. Extracted contigs were ordered based on built-in BLASTZ analysis (http://nature.snu.ac.kr/tools/blastz_v3.php) (Schwartz et al. 2003) against the related cp genome sequence and connected into a single draft sequence by joining overlapping terminal sequences. Tentative error sites were identified by mapping raw reads to the draft sequences using the CLC mapping tool (clc_ref_assemble in the CLC ASSEMBLY CELL package) and visualized using the CLC viewer (clc_assembly_viewer in the CLC ASSEMBLY CELL package). Errors found in repeat, InDel, and SNP regions were manually corrected and validated by PCR amplification and Sanger sequencing.
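The step of joining ordered contigs through their overlapping terminal sequences can be illustrated with a minimal Python sketch. It is not the procedure actually used in this work, which relied on BLASTZ ordering and manual joining; the function names `terminal_overlap` and `join_contigs`, the `min_len` threshold, and the toy sequences are all hypothetical, and the contigs are assumed to be already ordered and on the same strand.

```python
def terminal_overlap(left: str, right: str, min_len: int = 10) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def join_contigs(ordered_contigs: list[str], min_len: int = 10) -> str:
    """Join already-ordered, same-strand contigs into a single draft sequence."""
    draft = ordered_contigs[0]
    for contig in ordered_contigs[1:]:
        k = terminal_overlap(draft, contig, min_len)
        if k == 0:
            # No shared terminus: a real gap remains and needs gap filling.
            raise ValueError("no terminal overlap found between adjacent contigs")
        draft += contig[k:]  # append only the non-overlapping part
    return draft


# Toy example with made-up sequences (not real chloroplast data).
contigs = ["ACGTACGTTTGACCA", "TTGACCAGGCATCGA", "GCATCGATTACG"]
draft = join_contigs(contigs, min_len=5)
print(draft)
```

An exact-match overlap is the simplest possible criterion; real contig ends may carry low-quality bases, so a production workflow would more plausibly tolerate mismatches or rely on alignment coordinates.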
Annotation and comparative analysis of cp and nrDNA sequence
The cp genome sequence was annotated using the DOGMA program (http://dogma.ccbb.utexas.edu/) (Wyman et al. 2004) and BLAST searches. Circular and comparative maps of the cp genome were generated using OGDRAW
(http://ogdraw.mpimp-golm.mpg.de/) (Lohse et al. 2007) and mVISTA
of rRNAs, ITS, and IGS in assembled 45S sequences were determined by comparison with reported sequences and BLAST searches.
Validation of polymorphic regions in cp genome sequences
Specific primers were designed from conserved sequences flanking polymorphic regions such as SNPs and InDels found among cp genomes. Genomic DNA was used as the template for PCR amplification using Ex-Taq polymerase (Takara, Japan), and amplified fragments were analyzed using a Fragment Analyzer (Advanced Analytical Technologies Inc., USA) according to the manufacturer’s instructions. DNA fragments amplified using dCAPS primers were digested with the appropriate restriction enzyme and then separated on the same Fragment Analyzer.
Amplification of nrDNA IGS regions
Specific primers were designed from conserved sequences flanking the IGS regions in assembled 45S. Genomic DNA of Oryza and Panax species was used as template for PCR amplification using Ex-Taq polymerase (Takara, Japan) and amplified fragments were analyzed by separation in agarose gels and ethidium bromide staining.
RESULTS
De novo assembly of low coverage WGS for plant genomes
The number of WGS reads needed to obtain complete sequences for cp genomes and nrDNA units was first optimized, using rice reference cultivar ‘Nipponbare’ (NP) (International Rice Genome Sequencing Project 2005) and ginseng reference cultivar ‘Chunpoong’ (ChP) (Choi et al. 2014). I tested whether high-copy genome components such as cp, mitochondria (mt), and nrDNA sequences, could be assembled using low-coverage WGS sequences. In de novo assemblies of rice 1x haploid genome-equivalent WGS data, the 30 longest contigs revealed that 5, 15, and 1 contigs represented cp, mt, and nrDNA sequences, respectively, and the remaining 9 contigs represented major rice repeats, mainly transposable elements (TEs) (Table 1-2; Fig. 1-1a). Importantly, the five cp contigs covered the entire 134,551 bp cp genome with approximately 20-bp overlap between adjacent contigs, indicating that those could be combined into a single complete cp genome sequence (Fig. 1-1b). One 6,889-bp contig covered most of the 45S nrDNA unit (i.e. 86%), while 15 contigs (summing to 130 kb) represented partial coverage of the mt genome (i.e. 26%; Fig. 1-1a). Similar results were obtained from assembly of 151.5 Mbp ginseng WGS data (0.05x whole genome coverage) where 3, 12, and 1 contigs represented cp, mt, and nrDNA sequences, respectively, and the remaining 14 contigs were classified as unknown (Table 1-3; Fig. 1-1a). The
complete cp genome was covered by alignment of only three contigs that overlapped slightly (Fig. 1-1b, c). One 9,423 bp contig represented the 45S unit and 12 contigs (38 kb) represented the mt genome.
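The roughly 20-bp overlap between adjacent rice cp contigs can be checked directly from the match coordinates listed in Table 1-2. The short Python sketch below does only that arithmetic; the function name `adjacent_overlaps` is hypothetical, and the coordinates are the match begin and match end values reported for the five cp contigs against GU592207.1.

```python
# (contig, match begin, match end) on the rice cp reference GU592207.1,
# copied from Table 1-2.
cp_hits = [
    ("ctg_39", 18, 53_535),
    ("ctg_40", 53_516, 72_007),
    ("ctg_135", 71_988, 80_624),
    ("ctg_56", 80_605, 101_406),
    ("ctg_562", 101_387, 113_769),
]


def adjacent_overlaps(hits):
    """Return (left contig, right contig, overlap in bp) for neighbouring hits."""
    ordered = sorted(hits, key=lambda h: h[1])  # order by match begin
    result = []
    for (name_a, _, end_a), (name_b, begin_b, _) in zip(ordered, ordered[1:]):
        # Inclusive coordinates: positions begin_b..end_a are shared.
        result.append((name_a, name_b, end_a - begin_b + 1))
    return result


overlaps = adjacent_overlaps(cp_hits)
for left, right, n_bp in overlaps:
    print(f"{left} -> {right}: {n_bp} bp overlap")
```

Each of the four junctions comes out at 20 bp with these coordinates, which matches the overlap described in the text.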
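Table 1-1. Statistics for assembly and NGS for cp and nrDNAs

| Species | Genome size (Mbp) | WGS reads for cp assembly: amount (Mbp) | Coverage (x): genome | Coverage (x): cp | Complete sequence (bp): cp | Complete sequence (bp): 45S | Complete sequence (bp): 5S | Estimated copy number: 45S | Estimated copy number: 5S |
|---|---|---|---|---|---|---|---|---|---|
| O. sativa J. NP | 430 | 860 | 2.0 | 99 | 134,551 | 7,928 | 324 | 390 | 593 |
| P. ginseng ChP | 3,120 | 303 | 0.1 | 99 | 156,248 | 11,091 | 898 | 4,093 | 9,376 |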
Table 1-2. Characterization of the 30 longest contigs in the de novo assembly of the rice Os2 dataset (1x genome coverage; 50x cp genome coverage).

| Ctg_no. | Ctg length (bp) | Ctg coverage (x) | Best hit in GenBank: Acc. no. | Length (bp) | Match begin | Match end | E-value | Description |
|---|---|---|---|---|---|---|---|---|
| ctg_39 | 53,713 | 48.89 | GU592207.1 | 134,551 | 18 | 53,535 | 0.0 | O. sativa japonica chloroplast genome |
| ctg_56 | 20,802 | 97.14 | GU592207.1 | 134,551 | 80,605 | 101,406 | 0.0 | O. sativa japonica chloroplast genome |
| ctg_911 | 19,415 | 10.76 | BA000029.3 | 490,520 | 173,036 | 183,181 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_40 | 18,492 | 48.63 | GU592207.1 | 134,551 | 53,516 | 72,007 | 0.0 | O. sativa japonica chloroplast genome |
| ctg_619 | 16,859 | 12.57 | BA000029.3 | 490,520 | 214,506 | 228,529 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_596 | 16,024 | 10.00 | BA000029.3 | 490,520 | 343,504 | 359,495 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_562 | 12,383 | 47.71 | GU592207.1 | 134,551 | 101,387 | 113,769 | 0.0 | O. sativa japonica chloroplast genome |
| ctg_380 | 11,412 | 9.89 | BA000029.3 | 490,520 | 36,168 | 47,528 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_135 | 8,628 | 53.43 | GU592207.1 | 134,551 | 71,988 | 80,624 | 0.0 | O. sativa japonica chloroplast genome |
| ctg_72 | 8,112 | 154.94 | MUDR1_OS | 8,052 | 966 | 7,715 | 0.0 | O. sativa MuDR-type DNA transposon |
| ctg_948 | 7,982 | 8.40 | BA000029.3 | 490,520 | 80,797 | 88,762 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_1284 | 7,882 | 8.33 | BA000029.3 | 490,520 | 382,357 | 390,429 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_567 | 7,173 | 29.88 | MDR2 | 6,967 | 76 | 6,516 | 0.0 | O. sativa MuDR-type DNA transposon |
| ctg_183 | 7,166 | 12.39 | SZ-55_I | 7,062 | 1 | 6,973 | 0.0 | O. sativa retrotransposon SZ-55 |
| ctg_173 | 6,889 | 342.62 | M11585.1 | 3,377 | 2 | 3,377 | 0.0 | O. sativa 25S ribosomal RNA gene |
| ctg_180 | 6,776 | 24.93 | RIRE7_I | 5,899 | 1,394 | 5,899 | 0.0 | RIRE7 gypsy-like endogenous retrovirus |
| ctg_316 | 6,622 | 95.83 | MuDR3_OS | 8,604 | 97 | 6,666 | 0.0 | O. sativa MuDR-type DNA transposon |
| ctg_580 | 6,607 | 8.34 | BA000029.3 | 490,520 | 73,247 | 79,345 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_319 | 6,403 | 108.69 | RIRE1_I | 5,276 | 800 | 5,275 | 0.0 | RIRE1, a copia-like retrotransposon |
| ctg_343 | 6,331 | 69.87 | SZ-37_I | 8,831 | 1 | 5,674 | 0.0 | O. sativa retrotransposon SZ-37 |
| ctg_265 | 6,210 | 44.78 | CRM-I_OS | 5,933 | 689 | 5,933 | 0.0 | O. sativa centromeric LTR retrotransposon |
| ctg_3958 | 6,059 | 7.67 | BA000029.3 | 490,520 | 207,511 | 213,320 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_1105 | 5,713 | 7.93 | BA000029.3 | 490,520 | 256,888 | 262,540 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_2119 | 5,635 | 7.67 | BA000029.3 | 490,520 | 397,094 | 400,264 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_675 | 5,512 | 40.40 | MDR1 | 7,383 | 867 | 6,361 | 0.0 | O. sativa MuDR-type DNA transposon |
| ctg_4531 | 5,341 | 8.03 | BA000029.3 | 490,520 | 299,598 | 304,451 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_1460 | 5,281 | 9.83 | BA000029.3 | 490,520 | 53,451 | 58,496 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_2770 | 5,280 | 10.04 | BA000029.3 | 490,520 | 66,550 | 70,862 | 0.0 | O. sativa japonica mitochondrial genome |
| ctg_1612 | 5,236 | 9.26 | BA000029.3 | 490,520 | 65,707 | 70,862 | 0.0 | O. sativa japonica mitochondrial genome |

The best hit sequences were found by BlastN searches using contig sequences as queries. Contigs similar to sequences from chloroplasts, mitochondria, and ribosomes are indicated by green, orange, and blue, respectively.
Table 1-3. Characterization of the 30 longest contigs in the de novo assembly of the ginseng Pg2 dataset (0.05x genome coverage; 50x cp genome coverage).

| Ctg_no. | Ctg length (bp) | Ctg coverage (x) | Best hit in GenBank: Acc. no. | Length (bp) | Match begin | Match end | E-value | Description |
|---|---|---|---|---|---|---|---|---|
| ctg_3 | 86,351 | 49.12 | AY582139.1 | 156,318 | 1 | 86,125 | 0.0 | P. ginseng chloroplast genome |
| ctg_8 | 26,153 | 103.15 | AY582139.1 | 156,318 | 86,107 | 112,177 | 0.0 | P. ginseng chloroplast genome |
| ctg_15 | 18,122 | 45.15 | AY582139.1 | 156,318 | 112,159 | 130,266 | 0.0 | P. ginseng chloroplast genome |
| ctg_30 | 9,423 | 150.85 | GQ178077.1 | 3,362 | 1 | 3,362 | 0.0 | P. ginseng cv Yunpoong 26S ribosomal RNA gene |
| ctg_671 | 5,196 | 8.55 | HQ874649.1 | 502,773 | 224,443 | 225,174 | 0.0 | Ricinus communis mitochondrial genome |
| ctg_49 | 4,446 | 87.98 | Copia-74_ALY-I | 5,483 | 1,540 | 5,043 | 0.0 | Arabidopsis lyrata LTR retrotransposon |
| ctg_347 | 4,234 | 12.01 | Gypsy18-PTR_I | 7,283 | 640 | 4,969 | 0.0 | Populus trichocarpa LTR retrotransposon |
| ctg_269 | 3,671 | 7.49 | GQ856147.1 | 3,792,376 | 277,261 | 278,161 | 0.0 | Citrullus lanatus mitochondrial genome |
| ctg_41 | 3,555 | 21.38 | Copia-88_VV-I | 7,228 | 1,627 | 2,612 | 4.00E-174 | Vitis vinifera LTR retrotransposon |
| ctg_658 | 3,462 | 4.13 | EU365401.1 | 509,941 | 103,101 | 103,722 | 0.0 | Bambusa oldhamii mitochondrial genome |
| ctg_258 | 3,447 | 4.27 | HQ874649.1 | 502,773 | 223,813 | 224,397 | 0.0 | Ricinus communis mitochondrial genome |
| ctg_181 | 3,350 | 43.54 | EnSpm-6_STu | 6,594 | 3,865 | 4,636 | 3.00E-50 | Solanum tuberosum EnSpm DNA transposon |
| ctg_321 | 3,346 | 4.86 | JQ248574 | 281,132 | 118,398 | 119,277 | 0.0 | Daucus carota ssp. sativus mitochondrial genome |
| ctg_428 | 3,242 | 6.19 | JQ248574 | 281,132 | 225,651 | 227,064 | 0.0 | Daucus carota ssp. sativus mitochondrial genome |
| ctg_256 | 3,189 | 23.66 | EnSpm2_PTr | 12,422 | 2,262 | 2,393 | 3.00E-28 | Populus trichocarpa EnSpm-type DNA transposon |
| ctg_34 | 3,147 | 6.08 | JQ248574 | 281,132 | 152,185 | 153,281 | 0.0 | Daucus carota ssp. sativus mitochondrial genome |
| ctg_359 | 3,142 | 13.20 | Copia-3_CP-I | 4,073 | 1,463 | 2,198 | 1.00E-140 | Carica papaya LTR retrotransposon |
| ctg_407 | 3,077 | 16.79 | MuDR-1_STu | 9,408 | 1,590 | 1,687 | 2.00E-09 | Solanum tuberosum MuDR-type DNA transposon |
| ctg_93 | 2,900 | 4.55 | EU431224.1 | 476,890 | 136,810 | 136,903 | 2.00E-29 | Carica papaya mitochondrial genome |
| ctg_352 | 2,893 | 39.53 | GYPOT1_I | 5,099 | 2,780 | 3,366 | 3.00E-87 | Populus trichocarpa internal sequence of GYPOT LTR |
| ctg_157 | 2,637 | 4.77 | JN375330.1 | 715,001 | 704,715 | 705,342 | 0.0 | Phoenix dactylifera mitochondrial genome |
| ctg_101 | 2,609 | 10.09 | Copia-72_VV-I | 4,168 | 2,384 | 3,690 | 1.00E-153 | Vitis vinifera LTR retrotransposon |
| ctg_935 | 2,556 | 4.15 | BA000042.1 | 430,597 | 127,414 | 127,860 | 0.0 | Nicotiana tabacum mitochondrial genome |
| ctg_833 | 2,534 | 5.41 | Copia-4_PD-I | 4,256 | 1,437 | 1,511 | 4.00E-07 | Phoenix dactylifera LTR retrotransposon |
| ctg_631 | 2,466 | 5.16 | AY061993.1 | 4,584 | 824 | 1,259 | 0.0 | Daucus carota mitochondrial genome |
| ctg_420 | 2,460 | 30.33 | PSAT6 | 607 | 456 | 553 | 3.00E-11 | Pisum sativum dispersed repetitive DNA, PSAT6 |
| ctg_240 | 2,385 | 19.22 | MuDR-5_ALy | 9,235 | 2,468 | 2,573 | 3.00E-13 | Arabidopsis lyrata MuDR-type DNA transposon |
| ctg_26 | 2,380 | 34.72 | Gypsy22-VV_I | 5,609 | 1,978 | 2,480 | 1.00E-64 | Vitis vinifera LTR retrotransposon |
| ctg_687 | 2,338 | 5.31 | FR714868.1 | 396,947 | 135,650 | 136,768 | 0.0 | Malus x domestica mitochondrial genome |
| ctg_75 | 2,240 | 120.50 | No hit | | | | | |

The best hit sequences were found by BlastN searches using contig sequences as queries. Contigs similar to sequences from chloroplasts, mitochondria, and ribosomes are indicated by green, orange, and blue, respectively.
Figure 1-1. Characterization of the 30 longest contigs in the assemblies of rice and ginseng. (a) Classification based on best hit. The number of contigs and the percent coverage of cp, nrDNA, mt and other sequences are presented above the bars. (b,c) Alignment of the five and three contigs covering the complete cp genome sequences of rice (b) and ginseng (c), respectively. The contig numbers are indicated under the contigs, and hit positions are given in parentheses under the reference cp genome sequences for rice [GU592207] and ginseng [NC_006290].
Optimization of dnaLCW to obtain complete cp genome sequence
Because almost complete cp and nrDNA sequences of rice and ginseng could be obtained with WGS data of 1x and 0.05x genome equivalents, respectively, despite their different genome sizes (430 Mbp for rice, International Rice Genome Sequencing Project 2005; 3,120 Mbp for ginseng, Hong et al. 2004), I next optimized the WGS dataset size needed to obtain complete cp genome and nrDNA sequence assemblies. As the NP and ChP WGS reads included ~1.7% and ~6.0% cp genome-derived reads, respectively, 10 WGS datasets with between 25x and 5,000x coverage of the cp genome were extracted for independent assembly (Table 1-4; Fig. 1-2).
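The relationship between the cp read fraction and the amount of WGS data to subsample can be written as a short calculation. This is an illustrative sketch under stated assumptions, not the script used in this study: the function names are hypothetical, the cp read fractions are the rounded values quoted above (~1.7% and ~6.0%), and cp-derived reads are assumed to be spread evenly over the cp genome.

```python
def wgs_bases_for_cp_coverage(target_cp_coverage, cp_genome_bp, cp_read_fraction):
    """Total WGS bases whose cp-derived share yields the target cp depth."""
    return target_cp_coverage * cp_genome_bp / cp_read_fraction


def genome_equivalents(total_bases, genome_bp):
    """Haploid genome coverage represented by a WGS dataset."""
    return total_bases / genome_bp


# Rounded inputs taken from the text: cp genome sizes, genome sizes, cp fractions.
rice_bases = wgs_bases_for_cp_coverage(100, 134_551, 0.017)
ginseng_bases = wgs_bases_for_cp_coverage(100, 156_248, 0.060)

print(f"rice:    {rice_bases / 1e6:.0f} Mbp "
      f"({genome_equivalents(rice_bases, 430e6):.2f}x genome)")
print(f"ginseng: {ginseng_bases / 1e6:.0f} Mbp "
      f"({genome_equivalents(ginseng_bases, 3_120e6):.3f}x genome)")
```

Because the fractions are rounded, the results are only approximate; the datasets with about 100x cp coverage in Table 1-4 (Os3 and Pg3) used 860 Mbp and 303 Mbp, respectively.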
The number of contigs covering the entire cp genome and the number of assembly errors were used as criteria for assessing assembly quality. Datasets 3-6, with 100x to 250x cp coverage, showed the best assembly performance for the cp genomes of both species (Table 1-4; Fig. 1-2). Datasets 3-6 represent 0.9-4.3 and 0.3-1.5 Gbp of WGS sequence, corresponding to 2-10x and 0.1-0.5x haploid genome equivalents, for rice and ginseng, respectively. In ginseng, assembly errors and contig numbers remained relatively consistent even with larger datasets of up to 30 Gbp, representing 10,000x cp coverage and 10x genome coverage (Table 1-4; Fig. 1-2). By contrast, assembly errors and contig numbers in rice increased rapidly when more than 8.6 Gbp was used, representing 1,000x cp coverage and 20x whole-genome coverage. The fact that more erroneous cp contigs were generated in assemblies with larger amounts of input data suggests that short NGS reads
originating from nuclear or mitochondrial plastid DNAs (NMPTs; cp sequences inserted into the nuclear or mitochondrial genome) were erroneously co-assembled into cp contigs. The different assembly behaviors between rice and ginseng with regard to input data could be attributable to rice having a higher NMPT content compared to ginseng. Therefore, it is important to use the proper amount of data for assembly to minimize erroneous cp contigs caused by NMPT sequences.
The performance of two popular genome assemblers, SOAPdenovo (Li et al. 2010) and the CLC de novo assembler (http://www.clcbio.com/products/clc-assembly-cell/), was compared in terms of generating a small number of long contigs covering the entire cp genome from various WGS datasets of rice and ginseng. The CLC de novo assembler outperformed SOAPdenovo (Table 1-4; Fig. 1-2).
Table 1-4. Summary of chloroplast genome assembly using different amounts of WGS data from O. sativa (cv. Nipponbare) and P. ginseng (cv. Chunpoong)

| Species | Dataset | Amount of WGS used (bp) | Coverage (x) to genome (a) | Coverage (x) to cp (b) | No. of cp contigs (c) | False gaps | False SNPs | Tandem repeats | Mono-polymer | Total errors |
|---|---|---|---|---|---|---|---|---|---|---|
| O. sativa | Os1 | 215,000,114 | 0.5 | 24.63 | 4 | 1 | 5 | 0 | 2 | 8 |
| | Os2 | 430,000,026 | 1 | 49.31 | 5 | 5 | 2 | 0 | 0 | 7 |
| | Os3 | 860,000,052 | 2 | 99.26 | 5 | 0 | 0 | 0 | 1 | 1 |
| | Os4 | 1,290,000,078 | 3 | 148.67 | 3 | 4 | 0 | 0 | 0 | 4 |
| | Os5 | 1,720,000,104 | 4 | 198.48 | 3 | 3 | 6 | 0 | 0 | 9 |
| | Os6 | 2,150,000,130 | 5 | 248.5 | 3 | 2 | 0 | 0 | 0 | 2 |
| | Os7 | 4,300,000,260 | 10 | 496.03 | 4 | 4 | 0 | 0 | 1 | 5 |
| | Os8 | 8,600,000,520 | 20 | 980.42 | 4 | 7 | 7 | 1 | 0 | 15 |
| | Os9 | 25,800,001,560 | 50 | 2,534.87 | 6 | 2 | 1 | 0 | 1 | 4 |
| | Os10 | 44,425,734,760 | 100 | 5,006.54 | 10 | 13 | 38 | 3 | 1 | 55 |
| P. ginseng | Pg1 | 75,750,000 | 0.025 | 24.49 | 3 | 6 | 4 | 3 | 0 | 13 |
| | Pg2 | 151,500,000 | 0.05 | 49.89 | 3 | 9 | 2 | 1 | 0 | 12 |
| | Pg3 | 303,000,000 | 0.1 | 99.06 | 3 | 4 | 0 | 2 | 0 | 6 |
| | Pg4 | 454,500,000 | 0.15 | 150.38 | 3 | 3 | 1 | 2 | 0 | 6 |
| | Pg5 | 606,000,000 | 0.2 | 200.54 | 3 | 3 | 0 | 2 | 0 | 5 |
| | Pg6 | 757,500,000 | 0.25 | 260.13 | 3 | 3 | 0 | 2 | 0 | 5 |
| | Pg7 | 1,515,000,000 | 0.5 | 512.77 | 3 | 0 | 0 | 4 | 0 | 4 |
| | Pg8 | 3,030,000,000 | 1 | 1,044.53 | 3 | 4 | 0 | 2 | 0 | 6 |
| | Pg9 | 7,575,000,000 | 2 | 2,564.97 | 3 | 1 | 0 | 3 | 0 | 4 |
| | Pg10 | 15,150,000,000 | 5 | 5,101.95 | 3 | 2 | 0 | 2 | 0 | 4 |
| | Pg11 | 30,300,000,000 | 10 | 10,008.88 | 2 | 0 | 2 | 5 | 0 | 7 |

The four error columns (false gaps, false SNPs, tandem repeats, mono-polymer) and the total give the number of errors in the initial assembly.
(a) Coverage to genome was calculated as the ratio of total bases to genome size (430 Mb for O. sativa and 3.12 Gb for P. ginseng).
(b) Coverage to cp genome was based on the content of cp reads, determined by mapping raw reads to the reference cp genome.
(c) Number of contigs representing the entire cp genome sequence.
Finishing: identification and correction of de novo assembly errors
A single circular draft cp genome could be constructed by joining the initially assembled, overlapping cp contigs. However, I identified several types of assembly errors by aligning paired-end (PE) reads onto the assembled contigs. The mis-assembled regions were typically characterized by an accumulation of discordantly mapped reads or by abnormally high read mapping depth. The assembly errors identified included false gaps, false SNPs, and copy number errors for tandem repeats (TRs) or monopolymers. Detailed in silico methods for the identification and correction of each type of error were established.
1) False gaps: This type of error occurs in regions with ambiguous “N” nucleotides in draft assembly contigs. The sequences flanking an “N” on the left and right are duplicated, leading to an accumulation of commonly mis-mapped reads in the flanking regions (Fig. 1-3). Such errors can be corrected by merging the duplicated sequences flanking the “N”, and the correction can be validated by re-mapping reads onto the edited sequence. If the edited sequence is correct, the mapped reads will match the sequence cleanly.
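The merging step for a false gap can be sketched as follows. This is a simplified illustration rather than the procedure used in this study: the function name `merge_false_gaps`, the `min_overlap` threshold and the toy sequence are assumptions, and the sketch requires the duplicated flanks to be exactly identical.

```python
import re


def merge_false_gaps(draft, min_overlap=10):
    """Collapse each run of N whose flanking sequences are duplicated."""
    pieces = re.split(r"(N+)", draft)  # sequence, gap, sequence, gap, ...
    merged = pieces[0]
    for gap, right in zip(pieces[1::2], pieces[2::2]):
        overlap = 0
        # Look for the longest suffix of the left side that is repeated
        # at the start of the right side.
        for k in range(min(len(merged), len(right)), min_overlap - 1, -1):
            if merged.endswith(right[:k]):
                overlap = k
                break
        # Drop the gap and one copy of the duplicate, or keep the gap as is.
        merged += right[overlap:] if overlap else gap + right
    return merged


# Toy example: 12 bp (positions 8-19) are duplicated on both sides of the N.
true_seq = "ACGTTGCAAGGCTTAACCGGTACGATCG"
draft = true_seq[:20] + "N" + true_seq[8:]
print(merge_false_gaps(draft) == true_seq)
```

As described above, the edited sequence would still need to be validated by re-mapping the raw reads.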
2) False SNPs: DNA fragments homologous to those of the cp genome are ubiquitous in mitochondrial and nuclear genomes of rice (Bevan et al. 1998; Matsuo et al. 2005) and can interfere with cp genome assembly (Compeau et al. 2011), leading to false SNPs. Each false SNP could be corrected by assigning the consensus nucleotide sequence to the false SNP location based on the reads showing the highest depth in the paired read mapping, because ~8-100-fold more reads originate from the cp genome than from the nuclear or mitochondrial genome. For example, the assembly of the Os5 dataset, which provides 4x and 200x coverage of the nuclear and cp genomes, respectively, showed two false SNPs, G/T at 51,940 nt and T/A at 51,944 nt (Fig. 1-4). The 212 reads mapped to the region revealed clear patterns of origin, in which 186 reads (from the cp) contained T and A nucleotides at those positions, 24 reads (from the mt) contained G and T, and 2 (from the nucleus) contained T and T. Overall, false SNPs in the initial contigs can be easily corrected using read mapping followed by assigning the consensus nucleotide with the highest depth.
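The consensus assignment for the Os5 example can be reproduced from the read counts given above. This is a minimal sketch: the helper `consensus_base` is hypothetical, and the per-position base lists are reconstructed from the reported counts (186 cp, 24 mt and 2 nuclear reads) rather than from an actual read pileup.

```python
from collections import Counter


def consensus_base(observed_bases):
    """Return the most frequent base and its share of the mapped reads."""
    counts = Counter(observed_bases)
    base, depth = counts.most_common(1)[0]
    return base, depth / sum(counts.values())


# Bases reported at the two positions: cp reads carry T and A,
# mt reads carry G and T, and nuclear reads carry T and T.
pos_51940 = ["T"] * 186 + ["G"] * 24 + ["T"] * 2
pos_51944 = ["A"] * 186 + ["T"] * 24 + ["T"] * 2

for name, bases in (("51,940", pos_51940), ("51,944", pos_51944)):
    base, share = consensus_base(bases)
    print(f"position {name}: consensus {base} ({share:.1%} of {len(bases)} reads)")
```

A simple majority vote of this kind relies on cp-derived reads greatly outnumbering nuclear and mitochondrial reads, which is the premise stated in the text.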
3) Tandem repeat copy number error: Copy number errors can readily arise during de novo assembly of short reads (Phillippy et al. 2008; Alkan et al. 2011; Treangen and Salzberg 2011). My data show that 18-bp TR units were mis-assembled into two copies with the default assembly options, whereas four complete copies of the 18-bp TR were correctly assembled using a k-mer length of 64 (Fig. 1-5). When repeats are shorter than the read length, increasing the k-mer value above the TR unit length can reduce mis-assembly. Copy number errors in the assembly can be identified by comparing the read depth at the TR with that of the flanking regions. If raw reads are mapped to a region incorrectly assembled with too few copies of a TR, mis-mapped reads will be abundant and abnormally high read depth will be found at the collapsed region (Fig. 1-5). Most TR units found in cp genomes are simple and shorter than 100 bp, unlike those in the nuclear genome. Therefore, most errors derived from copy number variation of TRs can be fixed.
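The depth comparison between a TR and its flanking region can be sketched as a simple ratio test. The function names, the 1.5-fold threshold and the depth values below are hypothetical and chosen only for illustration; the estimate also assumes that read depth scales linearly with copy number.

```python
def is_collapsed(tr_depth, flank_depth, min_ratio=1.5):
    """Flag a tandem repeat whose read depth clearly exceeds that of its flanks."""
    return tr_depth / flank_depth >= min_ratio


def estimate_tr_copies(assembled_copies, tr_depth, flank_depth):
    """Scale the assembled copy number by the depth ratio (assumes even coverage)."""
    return round(assembled_copies * tr_depth / flank_depth)


# Hypothetical depths for illustration only; they are not values from the study.
assembled_copies = 2
tr_depth, flank_depth = 200.0, 100.0

collapsed = is_collapsed(tr_depth, flank_depth)
estimated = estimate_tr_copies(assembled_copies, tr_depth, flank_depth)
print(collapsed, estimated)
```

An estimate obtained this way is only a starting point; the corrected sequence would still be confirmed by re-mapping reads, as for the other error types.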
Figure 1-5. Identification and correction of a mis-assembled region with tandem repeat copy number variation.
4) Monopolymer copy number error: A total of 95 and 91 regions contained monopolymer tracts of more than 8 nt in the cp genomes of NP and ChP, respectively. Most monopolymers were poly A or poly T (Fig. 1-6). Monopolymer regions in the cp genome are susceptible to sequencing errors due to polymerase slippage and to mis-assembly caused by interference from homologous mitochondrial or nuclear sequences containing monopolymers of different lengths. One such assembly error was detected in the poly T tract at 78,424 bp in the NP cp genome (Fig. 1-6). Similar sequences with different poly T tracts (7, 8, 9, 10, 11, 12, 15 and 17 nt long) were found in 10 chromosomal regions of the NP genome (Fig. 1-6). The initial assembly of the Os3 dataset generated an erroneous (T)8 monopolymer tract caused by interference from T monopolymers derived from sequences on rice chromosomes 5, 6, 7, and 9 (Fig. 1-6). This error could be corrected by selecting the T monopolymer tract showing the highest read depth after mapping raw reads with 100% identity onto hypothetical T monopolymer sequences. Among the eight putative sequences, the draft sequence with the correct (T)17 monopolymer showed the highest mapping depth, 33.14, as expected (Fig. 1-6).
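The selection among hypothetical monopolymer lengths can be illustrated with a small sketch. It is a simplification of the approach described above: exact substring matching stands in for read mapping at 100% identity, and the flanking sequences, the reads and the function name `best_monopolymer_length` are all invented for illustration.

```python
def best_monopolymer_length(reads, left_flank, right_flank, base, lengths):
    """Count reads that contain each candidate tract, flanks included, exactly."""
    support = {}
    for n in lengths:
        candidate = left_flank + base * n + right_flank
        support[n] = sum(candidate in read for read in reads)
    best = max(support, key=support.get)
    return best, support


# Hypothetical flanks; neither touches the tract with a T, so each read
# supports exactly one tract length.
left, right = "GACCA", "GCATG"
reads = [f"AA{left}{'T' * 17}{right}CC"] * 5   # cp-like reads
reads += [f"AA{left}{'T' * 8}{right}CC"] * 2   # nuclear-like reads

candidate_lengths = (7, 8, 9, 10, 11, 12, 15, 17)
best, support = best_monopolymer_length(reads, left, right, "T", candidate_lengths)
print(best, support)
```

With real data, reads must span the whole tract plus both flanks to be informative, so this kind of check suits tracts much shorter than the read length.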
Figure 1-6. Identification and correction of a mis-assembled region with monopolymer copy number variation.
Ultimately, I obtained the complete 134,551-bp cp genome sequence for ‘Nipponbare’, using the dnaLCW approach followed by in silico correction of seven errors detected in the initial assembly. The final assembly was 100% identical to the reference cp sequence of NP (GU592207), demonstrating the effectiveness of this approach.
Obtaining complete sequences for major nrDNA units
The dnaLCW assembly also generated contigs representing the 5S and 45S nrDNA units. The initial 5S contigs contained the complete 5S units, of 324 bp and 898 bp for NP and ChP, respectively. By contrast, the 45S contigs were incomplete single contigs longer than 6 kb that included the main 45S transcriptional unit and part of the flanking IGS. A simple method to close the gaps in the IGS, based on the highly homogeneous, tandemly arrayed nature of the 45S units, was developed in this study. I generated a two-unit 45S tandem array using the initial contig and manually inserted 100 unknown nucleotides, (N)100, between the two units to represent the remaining gap in the IGS (Fig. 1-7). I then applied iterative gap closing to fill the gap between the units using Gapcloser with the raw reads. Occasionally, GC-rich regions and sub-repeat elements in the IGS made gap filling ineffective (Fig. 1-8); however, representative complete 324-bp 5S and 7,928-bp 45S units were successfully obtained and were identical to the 5S and 45S tandem arrays found on chromosomes 11 and 9 of NP, respectively.
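Construction of the hypothetical two-unit tandem array can be sketched in a few lines. The helper names and the toy contig are assumptions made for illustration; the gap-closing run itself is not shown, because it depends on the Gapcloser installation and its configuration.

```python
def tandem_scaffold(contig, gap_len=100, units=2):
    """Join copies of a partial 45S contig with runs of N as placeholders."""
    return ("N" * gap_len).join([contig] * units)


def fasta_record(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy 20-bp contig standing in for the initial 45S contig.
contig = "ACGT" * 5
scaffold = tandem_scaffold(contig)
print(fasta_record("45S_two_unit_scaffold", scaffold), end="")
```

The resulting scaffold is the input for gap closing; if the N run is only partly filled, the step is repeated on the updated sequence.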
Figure 1-7. Schematic diagram of the method used to obtain a complete 45S unit. (a) A draft single contig included the 45S cistron and partial IGS. (b) To obtain the full-length IGS, a hypothetical tandem array was constructed using two copies of the contig and intervening Ns. Through gap closing, the Ns were filled in by sequences originating from the IGS. (c) If the IGS remains partial, gap closing with Ns is repeated as necessary. Ultimately, a complete, full-length nrDNA unit can be obtained. (d) Structure of the nrDNA of Oryza and Panax species. (e,f) Black and red lines indicate read depth and GC content per 100-bp unit length, respectively.
The 45S unit length varied from 7,745 to 8,164 bp among Oryza species, and sequence and length variations were more frequent in the IGS region (Fig. 1-9). I also obtained complete nrDNA units from two P. ginseng cultivars, ‘Chunpoong’ (ChP) and ‘Yunpoong’ (YP), and a close relative, P. quinquefolius, by low-coverage whole-genome sequencing. The two ginseng cultivars, ChP and YP, had identical 11,091-bp 45S units, and P. quinquefolius had an 11,169-bp 45S unit, with a 74-bp InDel between the two Panax species in the IGS. The sequences of the coding genes were highly conserved between species, even between rice and ginseng, whereas the ITS and IGS showed much variation between species (Fig. 1-9).
The lengths of the complete 5S units varied from 302 to 499 bp among Oryza species owing to sequence and length divergence in the IGS, although the coding gene sequence was highly conserved. The coding sequences of 5S were more conserved than those of 45S across rice and ginseng, while the IGS region differed between the two genera and between Oryza species (data not shown).
DISCUSSION
The dnaLCW workflow for simultaneous determination of complete cp and nrDNA sequences
Recent NGS-based approaches to obtaining complete cp genome sequences have relied on isolated pure chloroplast DNA, reference cp-guided mapping, or de novo sequence assembly followed by extensive gap filling based on PCR and Sanger sequencing (Nock et al. 2011; Wang and Messing 2011; Zhang et al. 2011; Asif et al. 2013; McPherson et al. 2013). Similar to my approach, one study attempted to obtain partial organelle genome sequences and nrDNA units simultaneously from 454 platform reads in moss (Liu et al. 2013). However, these approaches require considerable further effort to obtain complete sequences, indicating that they might not be suitable for high-throughput application.
Here, I propose a workflow to obtain cp and nrDNA sequences simultaneously based on dnaLCW (Fig. 1-10). This simple method uses standard procedures for DNA preparation, PE library construction and Illumina sequencing, and only a small amount of NGS data (less than 1 Gbp) from WGS reads suffices to assemble complete cp and nrDNA sequences. Therefore, complete cp and nrDNA sequences for 50-100 samples can be assembled using 500 bp PE data generated in a single lane of an Illumina HiSeq2000, with about 1 µg of PCR-quality DNA per sample. For the cp genome, high quality
trimmed PE reads with 100 to 500x coverage can be used as the basal dataset for de novo assembly using the CLC de novo assembler. Sequence gaps in contigs are then filled using the Gapcloser program. Major contigs representing the cp genome are chosen and ordered based on the cp genome sequence of any related plant. These cp contigs can be merged into a single draft cp sequence guided by the cp genome of a relative and/or by their overlapping terminal sequences. Putative errors such as false gaps, false SNPs and copy number errors for TRs and monopolymers can be identified by mapping raw reads onto the draft cp contigs and then corrected as described above. At the same time, the main nrDNA contigs can be identified, and the complete nrDNA unit sequence can be determined by iterative gap closing and raw read mapping.
Figure 1-10. dnaLCW pipeline for simultaneous completion of the cp genome and 45S sequences. (a) Generation of raw data and assembly of contigs. (b) Workflow for obtaining error-free cp whole-genome sequences. (c) Workflow for obtaining complete nrDNA units
Application to various plant species
The number of sequences originating from the cp varies in plant WGS data depending on the tissue sampled and the DNA extraction method used (i.e. nuclei vs. standard genomic DNA preparations). The cp sequences accounted for 1.7-14.5% and 3.3-7.0% of the whole NGS data for Oryza and Panax, respectively. This is equivalent to 50-100x and 430-1,000x cp coverage for Oryza and Panax, respectively, for every 1x raw sequence coverage of the nuclear genome. These findings suggest that about 1 Gbp of random WGS data (about 5 million pairs of 100-bp PE reads) is sufficient to assemble complete cp genomes with more than 100x cp coverage for most plant species.
The dnaLCW method was also applied to publicly available sequence data for wild rice. The cp and nrDNA sequences of O. rufipogon W1943 were assembled from 500 Mbp of Illumina sequences downloaded from the GenBank database (http://www.ncbi.nlm.nih.gov/sra/ERX096841). In addition to rice and ginseng, dnaLCW was applied to more than 100 plants, including Wisteria floribunda, a tree that lacks inverted repeat blocks like chickpea, a fellow member of the Leguminosae (Jansen et al. 2008); the oldest Hibiscus syriacus tree in Korea (over 60 years old); and herbal plants with highly complex genomes, such as one onion accession and a Lilium tsingtauense plant in its native habitat, with 16 and 34 Gbp haploid genome equivalents, respectively (Fig. S1-1, 2, 3).
CONCLUSION
Considering that 600 Gbp is produced in a single run of the HiSeq2000, the dnaLCW method can complete cp genome and nrDNA sequences for more than 600 plants at the same time. Concurrent analysis of the cytoplasmic cp genome and the nrDNA sequences derived from the same plant is advantageous for understanding the concurrent evolution of the multiple genomes in plants. The dnaLCW approach provides a valuable avenue for exploring diversification, adaptation, domestication, genome evolution and the tree of life in the plant kingdom (Qiu et al. 1999; Soltis et al. 1999), as well as for practical applications such as biodiversity conservation and plant barcoding.
REFERENCES
Allen, G. C., Flores-Vergara, M. A., Krasynanski, S., Kumar, S. & Thompson, W. F. (2006). A modified protocol for rapid DNA isolation from plant tissues using cetyltrimethylammonium bromide. Nat. Protoc. 1, 2320-2325.
Álvarez, I. & Wendel, J. F. (2003). Ribosomal ITS sequences and plant phylogenetic inference. Mol. Phylogenet. Evol. 29, 417-434.
Burger, G., Lavrov, D. V., Forget, L. & Lang, B. F. (2007). Sequencing complete mitochondrial and plastid genomes. Nat. Protoc. 2, 603-614.
Cheng, Z., Buell, C. R., Wing, R. A., Gu, M. & Jiang, J. (2001). Toward a cytological characterization of the rice genome. Genome Res. 11, 2133-2141.
Choi, H. I. et al. (2014). Major repeat components covering one third of the ginseng (Panax ginseng C.A. Meyer) genome and evidence for allotetraploidy. Plant J. 77, 906-916.
Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M. & Dubchak, I. (2004). VISTA: computational tools for comparative genomics. Nucleic Acids Res. 32, W273-W279.
Galián, J. A., Rosato, M. & Rosselló, J. A. (2012). Early evolutionary colocalization of the nuclear ribosomal 5S and 45S gene families in seed plants: evidence from the living fossil gymnosperm Ginkgo biloba. Heredity 108, 640-646.
Gerlach, W. L. & Bedbrook, J. R. (1979). Cloning and characterization of ribosomal RNA genes from wheat and barley. Nucleic Acids Res. 7, 1869-1885.
species. Nature 344, 656-658.
Harris, S. A. & Ingram, R. (1991). Chloroplast DNA and biosystematics: The effects of intraspecific diversity and plastid transmission. Taxon 40, 393-412.
Hwang, Y. et al. (2009). Karyotype analysis of three Brassica species using five different repetitive DNA markers by fluorescence in situ hybridization. Horticulture, Environment, and Biotechnology 27, 456-463.
International Rice Genome Sequencing Project. (2005). The map-based sequence of the rice genome. Nature 436, 793-800.
Kurtz, S. et al. (2004). Versatile and open software for comparing large genomes. Genome Biol. 5, R12.
Li, R. et al. (2010). De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265-272.
Liu, Y., Forrest, L. L., Bainard, J. D., Budke, J. M. & Goffinet, B. (2013). Organellar genome, nuclear ribosomal DNA repeat unit, and microsatellites isolated from a small-scale of 454 GS FLX sequencing on two mosses. Mol. Phylogenet. Evol. 66, 1089-1094.
Lohse, M., Drechsel, O., Kahlau, S. & Bock, R. (2013). OrganellarGenomeDRAW-a suite of tools for generating physical maps of plastid and mitochondrial genomes and visualizing expression data sets. Nucleic Acids Res. 41, W575-W581.
Matsuo, M., Ito, Y., Yamauchi, R. & Obokata, J. (2005). The rice nuclear genome continuously integrates, shuffles, and eliminates the chloroplast genome to cause chloroplast-nuclear DNA flux. Plant Cell 17, 665-675.
McPherson, H. et al. (2013). Capturing chloroplast variation for molecular ecology studies: a simple next generation sequencing approach applied to a rainforest tree. BMC Ecol. 13, 8.
Nock, C. J. et al. (2011). Chloroplast genome sequences from total DNA for plant identification. Plant Biotech. J. 9, 328-333.
Ohmido, N., Kijima, K., Akiyama, Y., de Jong, J. H. & Fukui, K. (2000). Quantification of total genomic DNA and selected repetitive sequences reveals concurrent changes in different DNA families in indica and japonica rice. Mol. Gen. Genet. 263, 388-394.
Palmer, J. D. (1985). Comparative organization of chloroplast genomes. Annu. Rev. Genet. 19, 325-354.
Qiu, Y.-L. et al. (1999). The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402, 404-407.
Reboud, X. & Zeyl, C. (1994). Organelle inheritance in plants. Heredity 72, 132-140.
Rivarola, M. et al. (2011). Castor bean organelle genome sequencing and worldwide genetic diversity analysis. PLoS ONE 6, e21743.
Rogers, S. O. & Bendich, A. J. (1987). Heritability and variability in ribosomal RNA genes of Vicia faba. Genetics 117, 285-295.
Sang, T., Crawford, D. & Stuessy, T. F. (1997). Chloroplast DNA phylogeny, reticulate evolution, and biogeography of Paeonia (Paeoniaceae). Am. J. Bot. 84, 112.
Schwartz, S. et al. (2003). Human-mouse alignments with BLASTZ. Genome Res. 13, 103-107.
Shaver, J. M., Oldenburg, D. J. & Bendich, A. J. (2006). Changes in chloroplast DNA during development in tobacco, Medicago truncatula, pea, and maize. Planta 224, 72-82.
Soltis, P. S., Soltis, D. E. & Chase, M. W. (1999). Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402, 402-40.
Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30, 2725-2729.
The EU Arabidopsis Genome Project, et al. (1998). Analysis of 1.9Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391, 485-488.
Waminal, N. E. et al. (2012). Karyotype analysis of Panax ginseng C.A.Meyer, 1843 (Araliaceae) based on rDNA loci and DAPI band distribution. Comp. Cytogenet. 6, 425-441.
Wang, W. & Messing, J. (2011). High-throughput sequencing of three Lemnoideae (Duckweeds) chloroplast genomes from total DNA. PLoS ONE 6, e24670.
Wicke, S., Costa, A., Muñoz, J. & Quandt, D. (2011). Restless 5S: The re-arrangement(s) and evolution of the nuclear ribosomal DNA in land plants. Mol. Phylogenet. Evol. 61, 321-332.
Wolfe, A. D. & Randle, C. P. (2004). Recombination, heteroplasmy, haplotype polymorphism, and paralogy in plastid genes: Implications for plant molecular systematics. Systematic Botany 29, 1011-1020.
Wolfe, K. H., Gouy, M., Yang, Y. W., Sharp, P. M. & Li, W. H. (1989). Date of the monocot-dicot divergence estimated from chloroplast DNA sequence data. Proc. Natl. Acad. Sci. USA 86, 6201-6205.
Wyman, S. K., Jansen, R. K. & Boore, J. L. (2004). Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20, 3252-3255.
Zhang, Y. J., Ma, P. F. & Li, D. Z. (2011). High-throughput sequencing of six bamboo chloroplast genomes: phylogenetic implications for temperate woody bamboos (Poaceae: Bambusoideae). PLoS ONE 6, e20596.