874850). MERS-CoV data were subsampled to match sample sizes with SARS-CoV and HCoV-OC43. Sarbecovirus, HCoV-OC43 and SARS-CoV data were assembled from GenBank to be as complete as possible, with sampling year as an inclusion criterion. Below, we report divergence time estimates based on the HCoV-OC43-centred rate prior for NRR1, NRR2 and NRA3 and summarize corresponding estimates for the MERS-CoV-centred rate priors in Extended Data Fig. The boxplots show divergence time estimates (posterior medians) for SARS-CoV-2 (red) and the 2002-2003 SARS-CoV virus (blue) from their most closely related bat virus. ... from the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement no. ...). Pangolin was developed to implement the dynamic nomenclature of SARS-CoV-2 lineages, known as the Pango nomenclature. TMRCA estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent for the different data sets and different rate priors in our analyses. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. In light of these time-dependent evolutionary rate dynamics, a slower rate is appropriate for calibration of the sarbecovirus evolutionary history. ... collected SARS-CoV data and assisted in analyses of SARS-CoV and SARS-CoV-2 data. Of the countries that have contributed SARS-CoV-2 data, 30% had genomes of this lineage. Collectively, our analyses point to bats being the primary reservoir for the SARS-CoV-2 lineage.
COVID-19 lineage names can be confusing to navigate; there are many aliases, and if you want to catch them all to examine further in data analyses it helps to ... Boxes show 95% HPD credible intervals. Nevertheless, the viral population is largely spatially structured according to provinces in the south and southeast on one lineage, and provinces in the centre, east and northeast on another (Fig. ...). Even before the COVID-19 pandemic, pangolins had been making headlines. ... 27) receptors and its RBD being genetically closer to a pangolin virus than to RaTG13 (refs. ...). With horseshoe bats currently the most plausible origin of SARS-CoV-2, it is important to consider that sarbecoviruses circulate in a variety of horseshoe bat species with widely overlapping species ranges (ref. 57). ... is funded by the MRC (no. ...). T.T.-Y.L. ... Posterior rate distributions for MERS-CoV (far left) and HCoV-OC43 (far right) using BEAST on n = 27 sequences spread over 4 years (MERS-CoV) and n = 27 sequences spread over 49 years (HCoV-OC43). ... performed S recombination analysis. Combining regions A, B and C and removing the five named sequences gives us putative NRR1, as an alignment of 63 sequences.
Across a large region of the virus genome, corresponding approximately to ORF1b, it did not cluster with any of the known bat coronaviruses, indicating that recombination probably played a role in the evolutionary history of these viruses (refs. 5,7). We focused on these three non-recombining regions/alignments for divergence time estimation; this avoids inappropriate modelling of evolutionary processes with recombination on strictly bifurcating trees, which can result in artefacts such as homoplasies that inflate branch lengths and lead to apparently longer evolutionary divergence times.
In our analyses of the sarbecovirus datasets, we incorporated the uncertainty of the sampling dates when exact dates were not available. The unsampled diversity descended from the SARS-CoV-2/RaTG13 common ancestor forms a clade of bat sarbecoviruses with generalist properties (with respect to their ability to infect a range of mammalian cells) that facilitated its jump to humans and may do so again. We demonstrate that the sarbecoviruses circulating in horseshoe bats have complex recombination histories, as reported by others (refs. 15,20,21,22,23,24,25,26). In a third, consensus-based approach for identifying recombinant regions in individual sequences, we used six different recombination detection methods in RDP5 (ref. 36) (RDP, GENECONV, MaxChi, Bootscan, SisScan and 3SEQ) and considered recombination signals detected by more than two methods for breakpoint identification. We named the length-sorted BFRs as: BFR A (nt positions 13,291-19,628, length = 6,338 nt), BFR B (nt positions 3,625-9,150, length = 5,526 nt), BFR C (nt positions 9,261-11,795, length = 2,535 nt), BFR D (nt positions 27,702-28,843, length = 1,142 nt) and six further regions (E-J). It performs k-mer based detection; mapping/alignment and variant calling; consensus sequence generation; and lineage/clade analysis using Pangolin and NextClade. The DRAGEN COVID Lineage App is available on BaseSpace Sequence Hub. Since experts have suggested that pangolins may be the reservoir species for COVID-19, the scaly anteater has been catapulted into headlines, news reports and conversations, and some are calling COVID-19 "the revenge of the ... However, formal testing using marginal likelihood estimation (ref. 41) does provide some evidence of a temporal signal, albeit with limited log Bayes factor support of 3 (NRR1), 10 (NRR2) and 3 (NRA3); see Supplementary Table 1.
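The lengths quoted for the four longest BFRs follow directly from their coordinates, which gives a quick consistency check. The snippet below is a minimal sketch, assuming the nucleotide positions are 1-based and inclusive; the dictionary `BFRS` and the helper `region_length` are illustrative names, not part of the authors' code.

```python
# Breakpoint-free regions (BFRs) as quoted in the text: name -> (start, end).
# Assumption: positions are 1-based and inclusive at both ends.
BFRS = {
    "A": (13291, 19628),
    "B": (3625, 9150),
    "C": (9261, 11795),
    "D": (27702, 28843),
}


def region_length(start, end):
    """Length in nucleotides of an inclusive 1-based interval."""
    return end - start + 1


# Length of each region, then region names ordered from longest to shortest.
lengths = {name: region_length(*coords) for name, coords in BFRS.items()}
by_length = sorted(lengths, key=lengths.get, reverse=True)

for name in by_length:
    print(f"BFR {name}: {lengths[name]} nt")
```

Under the inclusive-coordinate assumption, the computed lengths match the quoted 6,338, 5,526, 2,535 and 1,142 nt, and the length-sorted order is A, B, C, D.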
We used an uncorrelated relaxed clock model with log-normal distribution for all datasets, except for the low-diversity SARS data, for which we specified a strict molecular clock model. ... 2, bottom) show that SARS-CoV-2 is unlikely to have acquired the variable loop from an ancestor of Pangolin-2019 because these two sequences are approximately 10-15% divergent throughout the entire S protein (excluding the N-terminal domain). Because coronaviruses are known to be highly recombinant, we used three different approaches to identify non-recombinant regions for use in our Bayesian time-calibrated phylogenetic inference. GARD identified eight breakpoints that were also within 50 nt of those identified by 3SEQ. The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus ... We extracted a similar number (n = 35) of genomes from a MERS-CoV dataset analysed by Dudas et al. (ref. 59) using the phylogenetic diversity analyser tool (ref. 60) (v.0.5). Current sampling of pangolins does not implicate them as an intermediate host. The red and blue boxplots represent the divergence time estimates for SARS-CoV-2 (red) and the 2002-2003 SARS-CoV (blue) from their most closely related bat virus, with the light- and dark-coloured versions based on the HCoV-OC43- and MERS-CoV-centred priors, respectively.
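The comparison of GARD and 3SEQ breakpoints within a 50 nt window can be illustrated with a small helper. This is only a sketch of the matching rule: the function name `concordant_breakpoints` is hypothetical, and the positions are made up for illustration; they are not the breakpoints reported by either method.

```python
def concordant_breakpoints(primary, secondary, tolerance=50):
    """Return breakpoints in `primary` lying within `tolerance` nt of any in `secondary`."""
    return [
        p for p in primary
        if any(abs(p - s) <= tolerance for s in secondary)
    ]


# Made-up positions for illustration only; NOT the breakpoints
# reported by GARD or 3SEQ in the analysis described above.
gard_like = [1000, 5200, 9000]
threeseq_like = [1030, 5300, 8960, 12000]

matches = concordant_breakpoints(gard_like, threeseq_like)
print(matches)
```

Here 1000 and 9000 each have a partner within 50 nt (1030 and 8960), while 5200 does not (its nearest partner, 5300, is 100 nt away).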
... bioinformatics program called Pangolin was developed. Relevant bootstrap values are shown on branches, and grey-shaded regions show sequences exhibiting phylogenetic incongruence along the genome. In our second stage, we wanted to construct non-recombinant regions where our approach to breakpoint identification was as conservative as possible. All three approaches to removal of recombinant genomic segments point to a single ancestral lineage for SARS-CoV-2 and RaTG13. These means are based on the mean rates estimated for MERS-CoV and HCoV-OC43, respectively, while the standard deviations are set ten times higher than empirical values to allow greater prior uncertainty and avoid strong bias (Extended Data Fig. ...). Green boxplots show the TMRCA estimate for the RaTG13/SARS-CoV-2 lineage and its most closely related pangolin lineage (Guangdong 2019), with the light- and dark-coloured versions based on the HCoV-OC43- and MERS-CoV-centred priors, respectively. The Sichuan (SC2018) virus appears to be a recombinant of northern/central and southern viruses, while the two Zhejiang viruses (CoVZXC21 and CoVZC45) appear to carry a recombinant region from southern or central China. For the HCoV-OC43, MERS-CoV and SARS datasets we specified flexible skygrid coalescent tree priors. Regions B and C span nt 3,625-9,150 and 9,261-11,795, respectively.
The genetic distances between SARS-CoV-2 and Pangolin Guangdong 2019 are consistent across all regions except the N-terminal domain, implying that a recombination event between these two sequences in this region is unlikely. ISSN 2058-5276 (online). A phylogenetic tree using RAxML v8.2.8 (ref. ...). Nat Microbiol 5, 1408-1417 (2020). Based on the identified breakpoints in each genome, only the major non-recombinant region is kept in each genome while other regions are masked. Pangolin relies on a novel algorithm called pangoLEARN. The fact that these estimates lie between the rates for MERS-CoV and HCoV-OC43 is consistent with the intermediate sampling time range of about 18 years (Fig. ...). Regions A-C were further examined for mosaic signals by 3SEQ, and all showed signs of mosaicism. A third approach attempted to minimize the number of regions removed while also minimizing signals of mosaicism and homoplasy. As informative rate priors for the analysis of the sarbecovirus datasets, we used two different normal prior distributions: one with a mean of 0.00078 and s.d. ... The coronavirus genome that these researchers had assembled, from pangolin lung-tissue samples, contained some gene regions that were ninety-nine per cent similar to equivalent parts of the SARS ... Early detection via genomics was not possible during Southeast Asia's initial outbreaks of avian influenza H5N1 (1997 and 2003-2004) or the first SARS outbreak (2002-2003). While such models have recently been made available, we lack the information to calibrate the rate decline over time (for example, through internal node calibrations; ref. 44). The time-calibrated phylogeny represents a maximum clade credibility tree inferred for NRR1.
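The masking step described above, in which only the major non-recombinant region of each genome is kept, can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the function `mask_outside`, the use of `N` as the mask character and the toy sequence are all assumptions, and coordinates are taken to be 1-based and inclusive.

```python
def mask_outside(seq, keep_start, keep_end, mask_char="N"):
    """Mask every position outside the inclusive 1-based interval [keep_start, keep_end]."""
    if not (1 <= keep_start <= keep_end <= len(seq)):
        raise ValueError("interval must lie within the sequence")
    left = mask_char * (keep_start - 1)          # masked prefix
    kept = seq[keep_start - 1:keep_end]          # retained region, unchanged
    right = mask_char * (len(seq) - keep_end)    # masked suffix
    return left + kept + right


# Toy sequence for illustration only; not real viral sequence data.
masked = mask_outside("ACGTACGTAC", 3, 6)
print(masked)
```

Masking rather than deleting keeps every genome at its original length, so alignment coordinates stay comparable across sequences.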
... 3) to examine the sensitivity of date estimates to this prior specification. It is clear from our analysis that viruses closely related to SARS-CoV-2 have been circulating in horseshoe bats for many decades. By mid-January 2020, the virus was spreading widely within Hubei province, and by early March SARS-CoV-2 was declared a pandemic (ref. 8). pangoLEARN: store of the trained model for pangolin to access. c, Maximum likelihood phylogenetic trees rooted on a 2007 virus sampled in Kenya (BtKy72; root truncated from images), shown for five BFRs of the sarbecovirus alignment. Sliding window analysis of changes in the patterns of sequence similarity between human SARS-CoV-2, and pangolin and bat coronaviruses as described further in Fig.