Understanding Evolution with Simulations: Three Tales about Trees by Murillo Fernando Rodrigues A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biology Dissertation Committee: Matt Streisfeld, Chair Andrew Kern, Co-advisor Peter Ralph, Co-advisor Patrick Phillips, Core Member Jayson Paulose, Institutional Representative University of Oregon Winter 2024 © 2024 Murillo Fernando Rodrigues This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY-NC-SA). 2 DISSERTATION ABSTRACT Murillo Fernando Rodrigues Doctor of Philosophyin Biology Title: Understanding Evolution with Simulations: Three Tales about Trees Evolutionary processes impact patterns of genetic variation, so there is an opportunity to reverse this relationship and use genetic data to learn about past evolutionary events. Traditional evolutionary inference from genetic data is plagued by a few issues: (i) different processes can impact a particular feature in similar ways, making it difficult to disentangle them; (ii) there is a growing need for modeling interactions between processes; and (iii) many models do not make full use of genomic data and instead assume that loci are unlinked. Simulation-based evolutionary inference can help alleviate many of these issues. It is now possible to simulate complex evolutionary scenarios, and these can be used to approximate analytically intractable likelihoods, for example by using supervised machine learning. The major downsides to using simulations is the computational cost, but recent advancements both in hardware and software have lessened this cost. In this dissertation, I pushed the boundaries on how simulation- based inference can be applied in evolutionary genetics. To mitigate the costs associated with simulations, I made a few contributions to the tskit ecosystem of evolutionary simulation tools. First, I developed a way to partially parallelize the simulation of multiple populations to make inference using multi-population genomic datasets more feasible. Second, I helped create standards for reproducible simulations with natural selection within the stdpopsim consortium. I implemented the ability to simulate using previously 3 published distribution of fitness effects (DFEs) and to simulate selective sweeps. I demonstrate the utility of this tool by tackling the long-standing question of whether the power to detect sweeps varies along realistic chromosomes. Next, I used simulations to better understand the behavior of a complex multi-population model. Species can be thought as semi-independent realizations of the same (or very similar) evolutionary process. Thus, by looking at multiple species at once it may be possible to better disentangle the processes that shape variation along genomes. Using simulations, I show that positive selection is necessary to explain the genetic data obtained from multiple great ape species. Further, I lay down a framework for leveraging multi-species information to better understand the effects of different processes on a group’s evolutionary history. Lastly, I present a new method that uses whole-genome genealogies for evolutionary inference. This data structure efficiently and sufficiently encodes evolutionary processes. I develop a machine-learning framework, tsNN, that takes whole-genome genealogies as inputs and is flexible enough to perform tasks at different scales (e.g., inferring mutation times, demographic parameters, etc.). I demonstrate that tsNN can learn to predict mutation times accurately, outperforming current likelihood-based methods. tsNN, represents an important step in genealogy-based evolutionary inference, but there still much work to be done in applying deep learning to gain new insights into past evolutionary events. This dissertation includes both published and unpublished coauthored material. 4 CURRICULUM VITAE NAME OF AUTHOR: Murillo Fernando Rodrigues GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED: University of Oregon, Eugene, OR, USA Universidade de São Paulo, São Paulo, SP, Brasil DEGREES AWARDED: Doctor of Philosophy, Biology, 2024, University of Oregon Master of Science, Genetics and Evolutionary Biology, 2018, Universidade de São Paulo Bachelor of Science, Biology, 2015, Universidade de São Paulo AREAS OF SPECIAL INTEREST: Evolutionary Biology Population Genetics Computational Biology GRANTS, AWARDS AND HONORS: Harvey E. Lee Graduate Scholarship, University of Oregon, 2022-2023 Marthe E. Smith Memorial Science Scholarship, University of Oregon, 2022- 2023 Hill Fund Graduate Award, University of Oregon, 2020-2021 Genetics Training Grant, University of Oregon, 2019-2021 Research Internship Abroad Fellowship, The São Paulo Research Foundation, 2017-2018 Master’s Research Fellowship, The São Paulo Research Foundation, 2016- 2018 Undergraduate Research Fellowship, The São Paulo Research Foundation, 2013-2014 5 PUBLICATIONS: Rodrigues, M. F., Kern, A. D., & Ralph, P. L. (2024). Shared evolutionary processes shape landscapes of genomic variation in the great apes. Genetics, iyae006. Estevez-Castro, C.F., Rodrigues, M.F., Babarit, A., ... & Olmo, R. P. (2024). Neofunctionalization driven by positive selection led to the retention of the loqs2 gene encoding an Aedes specific dsRNA binding protein. BMC Biology 22, s12915-024-01821-4. Lauterbur, M. E., Cavassim, M. I. A., Gladstein, A. L., Gower, G., Pope, N. S., Tsambos, G., ..., Rodrigues, M. F., ... & Gronau, I. (2023). Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations. eLife, 12, RP84874. Baumdicker, F., Bisschop, G., Goldstein, D., Gower, G., Ragsdale, A. P., Tsambos, G., ..., Rodrigues, M. F., ... & Kelleher, J. (2022). Efficient ancestry and mutation simulation with msprime 1.0. Genetics, 220(3), iyab229. Rodrigues, M. F., Vibranovski, M. D., & Cogni, R. (2021). Clinal and seasonal changes are correlated in Drosophila melanogaster natural populations. Evolution, 75(8), 2042-2054. Rodrigues, M. F., & Cogni, R. (2021). Genomic responses to climate change: Making the most of the Drosophila model. Frontiers in Genetics, 12, 676218. Stankowski, S., Chase, M. A., Fuiten, A. M., Rodrigues, M. F., Ralph, P. L., & Streisfeld, M. A. (2019). Widespread selection and gene flow shape the genomic landscape during a radiation of monkeyflowers. PLoS Biology, 17(7), e3000391. 6 ACKNOWLEDGEMENTS With some sense of accomplishment, I am delighted to present this dissertation, which is the culmination of over five years of dedication to expanding our understanding of biology. This journey would not have been possible without the support, guidance, and encouragement of many. I wish I could thank everyone involved personally, but my memory is not what it used to be when I first embarked on this quest... As I reflect on this long process, I would like to offer my gratitude to those who have been instrumental in shaping my academic growth and in keeping my mental sanity. I am so fortunate to have found not one, but two incredibly kind, patient and generous advisors. Andy, thank you for your enthusiasm for science, for supporting my growth as a scientist, and for your encouragement through ups and downs. Peter, I deeply appreciate your generosity and patience, the valuable and thorough insights, and your steadfast support throughout my time as a PhD student. I am so incredibly thankful for all your effort in making me feel at home here in Oregon. Your dedication to nurturing sense of community and collegiality within our CoLab has been admirable. Next, I would like to thank my lab mates, collaborators and colleagues who have provided academic and emotional support throughout my journey. Thank you to my amazingly helpful, supportive and fun lab mates: Jeff Adrion, CJ Battey, Jared Galloway, Matt Lukac, Anastasia Teterina, Gabby Coffing, Victoria Caudill, Saurabh Belsare, Vince Buffalo, Gilia Patterson, Chris Smith, Clara Rehmann, Jordan Anderson, Nate Pope, Jiseon Min, Jordan Rodriguez, Georgia Tsambos, Silas Tittes, Scott Small and Bruce Edelman. Nate Pope, in particular, has been 7 extremely patient and kind in collaborating with me to develop tsNN, herein presented as my fourth chapter. I would like to thank my committee members, Matt Streisfeld, Kelley Harris, Jayson Paulose, and Patrick Phillips for all their support throughout these years. My cohort mates have been so wonderful through the years and helped me make it through: Sabrina Mostoufi, Sophia Frantz, Monika Ruwaimana, Lina Aoyama, Max Spencer, Matt Lukac and Kayla Evans. My friends and former roommates Zac Bush, Matt Lukac and Ethan Shaw made my graduate school experience so much more fun and enjoyable. Clara Rehmann, my lab mate and friend, has been so kind to offer me shelter in Eugene over the past few months. I am thankful to many other members of the Institute of Ecology and Evolution, including the behind the scenes staff that helped me with so many travel and purchase needs over these years; thank you Arlene Crain, Sara Nash, Maria Heider, and Leah Frazier. Thanks are also in order to my always supportive collaborators and members of the tskit ecosystem and the PopSim consortium. Before my PhD, I was fortunate to encounter amazing individuals who played a pivotal role in shaping me as a scientist and biologist. I would like to thank my undergraduate thesis advisor, Fernando Marques, for helping me get my feet wet with academic research. Rodrigo Cogni, my master’s advisor, who helped me make the jump into empirical population genetics. John Pool welcomed me into his lab and gave me incredible freedom to do whatever research I wanted. Paulo Inácio Prado, Alexandre Adalardo de Oliveira and Adriana Martini opened my eyes to the importance of theory, models and simulation in biology. The entire team behind the Atlantic Forest Field Ecology course (offered by the Institute of Biosciences at the Universidade de São Paulo) contributed significantly to my growth as a naturalist. 8 I want to wholeheartedly thank my family and close friends, whose unwavering encouragement and love have sustained me during the demanding years of my academic pursuit. I have grown so much alongside my long-time friends: Bunni, Carol, Isa, Gabriel and Nat. Looking forward to many more years of adventures and laughter together. My Brazilians in Eugene crew – Bia, Ciro, Giovani, Helena, Jack, Karl, Ĺıvia, Suenia – have made my life in Eugene much more colorful. I would like to thank my siblings, Giovanna and João, for being a source of laughter, support and love. Thank you to my mom, Kelly, for always supporting me with anything I set my mind to. Her strength and determination are inspiring, and I would not have made it through without it. Thank you to my dad, Aguinaldo, for showing me the great things one can accomplish with hard work and grit. I am proud for what my parents accomplished in raising me and my siblings at such a young age with little family support and no college education. I also need to acknowledge my parents for partially supporting me financially throughout my entire academic career, as this journey would have been orders of magnitude harder without it. Lastly, I want to express gratitude to my partner, Luiza, and my pets, Ollie and Milo, who bring me daily moments of happiness. Luiza, thank you for providing steadfast support and understanding over the past (almost) decade. Your love, caring, patience, and belief in me have been a great source of strength. I could not have done any of this without you by my side anchoring and championing me! 9 TABLE OF CONTENTS Chapter Page I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 16 1.1. Decoding evolution: a lay-friendly prelude . . . . . . . . . . . . 16 1.2. The current landscape in inference of evolutionary processes from genetic data . . . . . . . . . . . . . . . . . . . 18 1.2.1. The impact of evolutionary processes on genetic variation . . 18 1.2.2. Challenges in evolutionary inference . . . . . . . . . . . . 20 1.2.3. The promise of simulation-based evolutionary inference . . . 23 1.2.4. Outline of this dissertation . . . . . . . . . . . . . . . . 24 II. ROBUST AND EFFICIENT TOOLS FOR EVOLUTIONARY SIMULATIONS . . . . . . . . . . . . . . . . . 27 2.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2. A way to parallelize multi-population simulations with tree sequences . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3. Towards more realistic and reproducible simulations: Introducing models of selection to Stdpopsim . . . . . . . . . . . 34 2.3.1. Adding selection to simulations using Stdpopsim . . . . . . 36 2.3.2. Analysis of power to detect sweeps along realistic chromosomes 36 2.4. Bridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 III. SHARED EVOLUTIONARY PROCESSES SHAPE LANDSCAPES OF GENOMIC VARIATION IN THE GREAT APES . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2. Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.1. Genomic data . . . . . . . . . . . . . . . . . . . . . 52 3.2.2. Simulations . . . . . . . . . . . . . . . . . . . . . . 54 10 Chapter Page 3.2.3. Visualizing correlated landscapes of diversity and divergence . 56 3.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.3.1. Landscapes of within-species diversity and between-species divergence . . . . . . . . . . . . . . . . 59 3.3.2. Remarkable correlations between landscapes of diversity and divergence . . . . . . . . . . . . . . . . . 62 3.3.3. Neutral demographic processes . . . . . . . . . . . . . . 65 3.3.4. GC-biased gene conversion . . . . . . . . . . . . . . . . 67 3.3.5. Positive and negative natural selection . . . . . . . . . . . 69 3.3.6. Mutation rate variation . . . . . . . . . . . . . . . . . 71 3.3.7. Visualizing similarity between simulations and data . . . . . 73 3.3.8. Correlations between genomic features and diversity and divergence . . . . . . . . . . . . . . . . . 75 3.4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.5. Bridge . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 IV. A POWERFUL MACHINE LEARNING FRAMEWORK FOR EVOLUTIONARY INFERENCE USING WHOLE- GENOME GENEALOGIES . . . . . . . . . . . . . . . . . . . . 90 4.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.2. Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.2.1. tsNN . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.2.2. GNN . . . . . . . . . . . . . . . . . . . . . . . . . 96 4.2.3. Training and validation simulations . . . . . . . . . . . . 96 4.3. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.3.1. Comparing tsNN and GNN . . . . . . . . . . . . . . . . 98 4.3.2. Improving inference of mutation times . . . . . . . . . . . 100 4.4. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 100 11 Chapter Page V. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . 104 APPENDIX: SUPPLEMENTAL MATERIAL FOR CHAPTER III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 A.0.1. Correlation between divergences that share branches . . . . . 107 12 LIST OF FIGURES Figure Page 2.1. Example of a population history . . . . . . . . . . . . . . . . . . 29 2.2. An example tree sequence and its tabular encoding . . . . . . . . . . 31 2.3. The union operation for tree sequences . . . . . . . . . . . . . . . 33 2.4. Boundary effects in a background selection simulation . . . . . . . . . 38 2.5. Power to detect a selective sweep along chromosome 1 . . . . . . . . . 40 2.6. Relationship between power and genetic recombination . . . . . . . . 42 2.7. Relationship between statistics and recombination rates under neutraility . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.8. Relationship between statistics and recombination rates under the sweep model . . . . . . . . . . . . . . . . . . . . . . 44 3.1. Simulated demographic history of the great apes . . . . . . . . . . . 57 3.2. Landscapes of diversity, divergence, recombination rate and exon density . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.3. Visualizing the relationships between nucleotide diversity and divergence statistics between closely related taxa . . . . . . . . . 63 3.4. Correlations between landscapes of diversity and divergence across the great apes . . . . . . . . . . . . . . . . . . . . . . . 66 3.5. Landscapes are not well correlated in a neutral simulation . . . . . . . 68 3.6. Correlations between landscapes of divergence partitioned by site type (W-W/S-S and W-S) . . . . . . . . . . . . . . . . . 69 3.7. Correlations between landscapes of diversity and divergence in simulations with natural selection . . . . . . . . . . . . . . . . 72 3.8. Correlations between landscapes of diversity and divergence across the great apes for simulations with variation in mutation rate along the chromosome . . . . . . . . . . . . . . . . 74 13 Figure Page 3.9. PCA visualization of data and simulations . . . . . . . . . . . . . . 76 3.10. Correlations and covariances between landscapes of diversity and divergence and annotation features in the real great apes data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.1. Schematic representation of the tsNN algorithm . . . . . . . . . . . 95 4.2. Edge traversal order in tsNN . . . . . . . . . . . . . . . . . . . . 98 4.3. Accuracy of tsNN and GNN in predicting mutation times . . . . . . . 99 4.4. The effect of number of mutations and chromosome length on accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 14 LIST OF TABLES Table Page 3.1. Range of parameters explored in the simulations . . . . . . . . . . . 56 15 CHAPTER I INTRODUCTION 1.1 Decoding evolution: a lay-friendly prelude Evolution can occur over incredibly large spatial scales over the course of thousands to millions of years. This makes Evolutionary Biology unlike other fields within Biology in that hypotheses can not always be subject to experimentation. Thus, we are left with the task of putting together what happened in the past based on information we have today. To do so we need a sizable amount of data, models and statistical and computational tools. One way to make inferences about past evolutionary processes is by studying genetic variation within and between species. Evolutionary processes, such as population size changes, natural selection and mutation, impact genetic variation in many ways. For example, we expect the amount of variation within a species to increase with the mutation rate (i.e., the higher the mutation rate, the more variation a population will harbor). On the other hand, real populations lose variation every generation due to genetic drift (i.e., sampling error), and the magnitude of this loss is proportional to the population size. We can write models in form of equations that describe how some of these evolutionary processes impact genetic variation. For example, π (a measure of genetic variation) depends on the size of the population (N) and the mutation rate (µ), such that π ∼ 4Nµ. If we knew the mutation rate, we could rearrange the equation to estimate the size of a population N = π from genetic data! To arrive 4µ at this simple equation, we make the assumption that an individual is equally likely to mate with any other individual in the population (i.e., panmixia). It is easy to see how this may not always hold: for example, a tree is much more likely to mate 16 with its neighbor than a distant tree across the forest (because pollen cannot travel too far). Adding biological realism to our equation would prove to be complicated. Alternatively, we can more easily simulate this process with a computer: we can create virtual individuals and tell a computer how they choose their mates, pass on genetic material to their offspring, and change over time. If we were to run virtual experiments like this many times, it would be possible to map the relationship between genetic variation (π) and our parameters of interest (population size N and mutation rate µ). This new map would be closer to the reality than our simplified mathematical model above, but this approach is not a panacea, as simulations have a high computational cost (i.e., they can take a long time to run). For my dissertation, I have explored how we can incorporate computer simulations to help us understand evolution. To this end, we need computational tools that are fast and reproducible (so that other researchers can use them). So, in my second chapter, I discuss my contributions to tools that make these simulations run faster and more reproducible. Next, I sought to answer a long standing biological question: to what extent have humans and other apes been shaped by natural selection? In my third chapter, I use simulations to tackle this question and show that both positive and negative selection are necessary to explain the genetic data obtained from multiple individuals for all great ape species. Lastly, I wanted to build a proper statistical method that uses simulations to infer evolutionary processes from genetic data. In my fourth chapter, I present a new method for evolutionary inference that uses genealogical trees along chromosomes (instead of tables with samples and mutations). These trees capture the genetic data perfectly, 17 but they are much more scalable (use less space) and they organize the data in a way that could be more conducive to extracting relevant evolutionary information. 1.2 The current landscape in inference of evolutionary processes from genetic data Many questions in evolutionary biology are of a historical nature, because evolution is restricted in space and time and it cannot be replicated (Cleland, 2002; Losos, 2009). An avenue for investigating past evolutionary events and processes lies in genetic data. As evolutionary processes leave footprints on the genetic composition of populations, the possibility arises to invert this relationship, that is to use genetic data to make inferences about past events. Indeed, much of evolutionary genetics is tasked with trying to infer past evolutionary events and processes from genetic data. Among the major questions explored in the field are: Is it possible to find regions of the genome that are constrained by natural selection (and thus may be of functional importance)? Can we infer the movement of individuals and changes in population sizes over time? What are the relative contributions of different evolutionary processes, such as demography and selection, in shaping patterns of genetic variation? 1.2.1 The impact of evolutionary processes on genetic variation. To understand how it is possible to answer these questions with genetic data, it is necessary to map how different evolutionary processes affect genetic variation. At each position along a chromosome, there is a tree that describes how sampled individuals are related to each other. Due to linkage, neighboring trees are more like each other than distant trees. These trees or genealogies can fully encode the outcomes of evolutionary processes. Therefore, it is usually more intuitive to think about how these processes impact tree shapes. 18 The trees, along with mutations, can then be used to understand genetic variation itself. Demographic processes directly impact the underlying genealogies because of the inverse relationship between population size and coalescence rate (coalescences are the merging of lineages backwards to a most recent common ancestor) (Wakely, 2016). For example, for a population that goes through a bottleneck (i.e., a sudden decrease in population size), the coalescences are expected to be concentrated during the bottleneck period. On the other hand, for a population that undergoes a size expansion, most of the coalescences will happen in the period preceding the expansion. These underlying trees yield different patterns of genetic variation in the extant population. One summary of genetic data that can be used to distinguish between these tree shapes and, in turn, between possible demographies is the site frequency spectrum (SFS). The SFS is the distribution of allele frequencies across sites in a sampled set of genomes. (For the sake of simplicity, I will focus on the unfolded SFS, where the frequency of the derived alleles have been measured; derived allele refers to the allele that is new to a population). A genealogy that has coalescences concentrated near the present (due to a recent bottleneck) will yield a SFS that is skewed towards the high frequency derived alleles. Conversely, a genealogy with coalescences happening earlier on will yield a SFS with an excess of low frequency derived alleles. Natural selection can also impact genealogies and genetic variation. When a new (sufficiently) beneficial mutation arises in a population it rapidly increases in frequency, ultimately reaching fixation. This leads to many of the coalescences to occur during the fixation trajectory. If we were to sample the population right after fixation, the SFS would be skewed towards the high frequencies (Y. Kim, 2006), 19 similar to the effect of a population bottleneck. A strongly deleterious mutation, on the other hand, is rapidly purged from the population. Therefore, the shape of the underlying tree is less affected, though the tree will be shorter (Barton & Etheridge, 2004; S. Williamson & Orive, 2002). Both beneficial and deleterious mutations lead to loss of genetic variation at the selected site (either due to fixation or purging), but nearby sites are also affected due to linkage. Linked mutations are carried along as a beneficial mutation increases in frequency, in a process called selective sweep (Coop & Ralph, 2012; Kaplan et al., 1989; Maynard Smith & Haigh, 1974). The further away from the focal site, the more likely it is for recombination to happen, allowing mutations to escape this “sweep”. A deleterious mutation can also affect linked mutations in a similar way, and this process is called background selection (Charlesworth et al., 1993). Importantly, most of the models for selection assume strong selection, because it is complicated to deal with the interactions between multiple selected mutations in linkage disequilibrium (Y. Kim & Stephan, 2000). 1.2.2 Challenges in evolutionary inference. Although we have a good understanding of how different evolutionary processes impact variation, a few issues remain in the way of inversing this relationship to infer these processes from genetic data. First, different processes can impact genetic variation in similar ways, making it difficult to determine the specific processes that are consistent with data. Second, there is a growing need for modeling interactions between processes. Third, many models do not properly incorporate linkage information and treat whole genome data as unlinked single loci. Summaries of genetic data may not contain enough information to distinguish between evolutionary scenarios. For example, both positive and negative 20 selection remove variation surrounding the selected site. Because the width of the effects of sweeps and background selection depends on the recombination rate, both these models predict relationships between (i) recombination rate and genetic diversity and (ii) density of functional sites and genetic diversity (Charlesworth et al., 1993; Kaplan et al., 1989; J. M. Smith & Haigh, 1974). Indeed, in many species of plants and animals it has been observed that regions of high recombination harbor more genetic variation (Corbett-Detig et al., 2015). Disentangling the relative contributions of sweeps and background selection to these patterns, however, is complicated (Andolfatto, 2001; Leffler et al., 2012). One alternative lies in going beyond single species metrics of variation, and to instead compare multiple species. Selection also directly impacts divergence between species: whereas positive selection increases fixation rates, negative selection decreases the rate of substitutions (Andolfatto, 2007; Cai et al., 2009; Macpherson et al., 2007) . It may also be possible to disentangle sweeps and background selection by aggregating multiple different measures of variation (Schrider, 2020). Individually dissecting and modelling the effects of different processes on genetic variation is not enough. In the early days of evolutionary genetics, it was possible to survey just a few genetic markers, usually microsatellites. These markers are short tandem repeats (1-6bp) that are abundant in the genomes of many taxa. Microsatellites have a high mutation rate and are thought to be mostly neutral (Field & Wills, 1997). This allowed researchers to use these markers for understanding demographic processes almost independently of other processes, such as natural selection. There have been incredible advancements in our ability to obtain whole-genome data, and now it is feasible to sequence thousands of individuals for multiple species at once. More data opens up the 21 opportunities to investigate more complex questions (i.e., beyond the traditional studies of population structure), such as understanding constraint along genomes, recombination rate variation, among others. However, with whole-genome data the assumptions of neutrality, for example, are much less likely to hold. So there is a need for models that account for complex interactions between processes, which is not trivial to do. One interesting (and perhaps unintuitive) way that processes can interact is that demographic processes can exacerbate the effects of background selection (Torres et al., 2018, 2020). Inferences of background selection that do not take into account demography can lead to biased estimates of the rate of deleterious mutations and their fitness effects. Conversely, both positive and negative selection can bias demographic inferences in many ways (Ewing & Jensen, 2016; Johri et al., 2021; Schrider et al., 2016). Taking full advantage of linkage information is complicated, and many current methods treat whole-genome data as a collection of unlinked loci. Much of demography inference relies on allele frequency data (i.e., the SFS) ignoring the information contained in patterns of linkage between sites (Gutenkunst et al., 2009; Schraiber & Akey, 2015). It is possible to incorporate linkage information by using the sequentially Markovian coalescent (SMC), a model which approximates the coalescent with recombination by disregarding long-range correlations between genealogies (Harris & Nielsen, 2013; Li & Durbin, 2011; Schraiber & Akey, 2015). Inference of selective sweeps usually proceeds by breaking up the chromosome into non-overlapping genomic windows of a particular size (DeGiorgio et al., 2016; Garud et al., 2015; Pavlidis et al., 2013), and thus it disregards linkage between nearby windows. This issue may be partially mitigated by training a classifier over larger regions, an approach which has been sucessfully implemented (Schrider & 22 Kern, 2016, 2017). The a priori choice of window sizes for calling sweeps remains an important consideration because the signatures of a sweep depend on the strength of a sweep and the local recombination rate (which determine the linked effects of a sweep). Choosing a particular window size can limit the power to detect sweeps of a specific strength (Caldas et al., 2022). 1.2.3 The promise of simulation-based evolutionary inference. A computer simulation is a virtual model that mimics the behavior of a complex system with known rules. It is widely used in many disciplines to gain insights into systems that are too complex for analytical solutions. The major downsides to using simulations are the runtime and computational costs, but recent advances in computing have lessened these costs. Simulations have played an integral part in evolutionary inference for the past five decades (Hoban et al., 2012). The main three uses for simulations in evolutionary genetics are: (i) to describe analytically intractable evolutionary models, (ii) to verify analytical solutions based on approximations, and (iii) to perform parameter estimation and/or model selection. For example, Ohta and Kimura (1974) used simulations to compare the distribution of allele frequencies yielded from the simple infinite alleles model with the analyitically intractable stepwise mutation model, which is more appropriate for allozyme data that was abundant at the time. On the other hand, Kimura (1980) used simulations to verify their analyitical solution of the average time to fixation for new beneficial alleles. Using evolutionary simulations for model selection and parameter estimation has gained momentum in the recent past. As genomic data availability has increased, so has our ability to interrogate more complex questions from these data. However, as detailed in the earlier sections, writing down mathematical 23 models and likelihoods is unfeasible in many instances (Coop & Ralph, 2012)e.g.. Simulations can be used to generate data under complex evolutionary scenarios more easily. Then, replicates of these data can be generated under different evolutionary models (or over a parameter range) so that they can be used for inference. Advancements in computational power and algorithms have lessened the burden both in generating data and inferring parameters or doing model selection on them. For example, deep learning algorithms can handle complex inference tasks better than earlier approaches, such as Approximate Bayesian Computation (Sheehan & Song, 2016). 1.2.4 Outline of this dissertation. My overarching goal with this dissertation has been to push the boundaries of simulation-based inference in evolutionary genetics. Although I do not expect simulation-based inference to solve all the challenges in evolutionary inference, simulations hold promise in helping us model interactions between processes and in allowing us to explore the utility of metrics for which there is no theory yet. An intermediate step to enable robust simulation-based inference is to develop and maintain scalable and reproducible tools for evolutionary simulations. In Chapter II, I present two of my contributions to the tskit ecosystem of evolutionary simulation tools. A drawback of simulations is the computational cost associated with them. So I worked out a way to parallelize multi-species simulations (either forward or backward-in-time). This allows for faster and less computationally intensive simulations, which helps enable inference from multi- species genomic datasets. Another challenge with evolutionary simulations has been reproducibility and compatibility (across tools). As part of the Stdpopsim consortium, we have been standardizing evolutionary models and simulations 24 (Adrion et al., 2020; Lauterbur et al., 2023). Using this new resource, I explore how the power to detect selective sweeps vary along chromosomes in humans. This illustrates how our tools enable the study of interactions between processes (in this case, recombination rate variation, background selection and selective sweeps). We found that the power to detect sweeps varies substantially along chromosomes mostly due to variation in recombination rates. Surprisingly, the variation in power along the chromosome is greater than across humans sampled from different parts of the world (which have drastically different demographic histories). Simulations can help us understand the behavior of complex evolutionary models. In Chapter III, we use simulations to understand how the landscapes of diversity and divergence change over time under different evolutionary scenarios. By looking at multiple species at once, it is possible to better disentangle the processes that shape variation along genomes. Using a well-sampled set of great ape genomes, we find that landscapes of diversity (and divergence) are highly correlated over long time scales (about 60N generations, assuming N = 10, 000). We tease apart how different processes, such as GC-biased gene conversion, mutation rate variation, positive and negative selection, can contribute to the observed correlations. Although many of these processes are able to produce patterns similar to the observed data, we find that positive selection is necessary to fully explain the data. This chapter has been published with co-authors Andrew D. Kern and Peter L. Ralph. Beyond qualitative explorations of complex evolutionary models, simulations can be used for proper statistical inferences, such as model selection and parameter estimation. In Chapter IV, I present a new deep learning method for estimating parameters from genetic data using whole-genome genealogies. Whole-genome 25 genealogies may be useful in evolutionary inference because this data structure naturally encodes evolutionary processes (e.g., coalescences over time) and they can be more efficient than other representations (such as genotype matrices). We describe a neural network architecture that can be used to infer parameters at different scales using whole-genome genealogies. Next, we test the usefulness of the whole-genome genealogies for inference by estimating mutation times given a known demographic history. We find that our neural network performs well, even outperforming other approaches (that use genealogies or genotype matrices). Our proposed whole-genome genealogy inference framework is able to extract relevant evolutionary information from this data structure and it might be well suited for genome bank scale datasets. This chapter includes unpublished material with co- authors Nathaniel S. Pope, Peter L. Ralph, and Andrew D. Kern. Taken together, these chapters demonstrate how simulations can be employed to help us understand the role of different evolutionary processes in shaping genetic variation. I demonstrate how useful simulations are both in helping us describe intractable evolutionary models and in inferring parameters or performing model selection from genetic data. Simulations can enable us to answer complex questions that have tormented the field of evolutionary genetics for the past decades, and I anticipate that the contributions of simulation-based studies will grow even more as we overcome computational constraints. 26 CHAPTER II ROBUST AND EFFICIENT TOOLS FOR EVOLUTIONARY SIMULATIONS 2.1 Introduction As the availability of genetic data has increased, so has our ability to infer evolutionary processes at finer scales. An essential tool for evolutionary inference has been simulations, and many flexible and efficient evolutionary simulators have been developed over the past decade (Bulmer, 1976; Carvajal-Rodriguez, 2008; Hoban et al., 2012; Hudson, 1983; Hudson et al., 1987; Ohta & Kimura, 1974). Until recently, little had been done to integrate different tools and to ensure compatibility of inferred models across studies. A unifying thread that has emerged is the tree sequence data structure (Kelleher et al., 2016a). The tree sequence provides a way of concisely representing correlated genealogies along chromosomes. Both msprime, an efficient coalescent simulator, and SLiM, the most widely used forward-in-time simulator, record the genealogical history of all samples in a population in the form of tree sequences (Haller & Messer, 2019; Haller et al., 2019; Kelleher et al., 2018). One of the major advantages of doing so in forward-in-time simulations is that it allows neutral mutations to be omitted. By definition, such mutations do not impact the underlying genealogies, but they pose immense computational demand (due to bookkeeping). Therefore, by omitting the neutral mutations from the forward step, it is possible to decrease computational resources dramatically. After completion of the forward simulation, it is possible to simply overlay the genealogies with the neutral mutations if necessary. As inference becomes ever more complex, now enabled by powerful simulation tools, the need for easy and error-free dissemination of estimated models 27 has increased. Many researchers used to re-implement models independently, a process which is error-prone and results in duplication of effort. Indeed, a recent study described errors in the implementation of a demographic model in two published papers, which affected the interpretation of biological signals (Ragsdale et al., 2020). Stdpopsim is a community-driven open source package developed to minimize these issues (Adrion et al., 2020; Lauterbur et al., 2023). We organized a well-documented library with published simulation models from a range of organisms, which can be accessed with simple Python and command-line interfaces. The library contains demographic models, recombination maps and mutation rates for tens of species. All the models have to pass a quality control method, in which an independent party validates the model. Further, we provide a Python API to easily run simulations using the catalog, with msprime and SLiM as simulators on the backend, decreasing even further the barrier to reproducing simulations. In this chapter, my contributions to the current ecosystem of evolutionary simulation tools in two vignettes. First, I describe a method I devised to make multi-population simulations faster with parallelization, which is enabled by tree sequences. Second, I report my main contribution to stdpopsim: the inclusion of selection models. Using this new feature, I investigate how the power to detect sweeps along more realistic chromosomes. Both of these contributions will appear (in some form) in future publications of the tree sequence toolkit library (tskit) and of the stdpopsim consortium. 28 2.2 A way to parallelize multi-population simulations with tree sequences A major problem in simulation-based inference is the computational cost of simulations. In many cases, it is possible to minimize the time cost of simulations by running them in parallel, that is to divide up the simulation over many processes that can be executed concurrently over many CPUs (or cores). It is not always clear how to divide a simulation into sub-tasks, however. In the case of evolutionary simulations, a natural way to break up a big simulation might be population splits. After a population splits into two (or more) subpopulations, their histories become independent (assuming no migration). Therefore, it is possible to parallelize a multi-population simulation by dividing the simulation over population splits. The history of populations A, B and C (shown in Figure 2.1) can be parallelized: any two branches stemming from the same node are independent. Figure 2.1. The relationships between populations A, B and C are depicted. Note how branches of the same color can be simulated in parallel if there is no migration. To parallelize the simulations, we need to (i) simulate each branch of the tree independently and (ii) put together the bits that are simulated independently 29 at the end. There are many ways to parallelize the simulations, but see Rodrigues and Ralph (2021) for an idea. Below, I will describe the operation for putting together two (or more) tree sequences, which I implemented in tskit (see https: //tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.union for the documentation). A description of this operation will appear in an upcoming publication describing the tree sequence toolkit library (tskit). The information underlying a tree sequence is stored as a collection of tables which define different features. This simple tabular format allows for rapid accessing of information. The main components are the Node, Edge, Site and Mutation tables. A haploid genome can be thought of as a node in a tree, which exists at a particular time. An edge defines genetic inheritance between two nodes, and it consists of a parent node, a child node, and the left and right coordinates (along a chromosome) over which the child genome inherited from the parent genome. A site is a location in the genome, and a mutation defines a change of state at a particular node. See Figure 2.2a for an example tree sequence and Figure 2.2b for the corresponding collection of tables. Note how information about the trees are completely independent from the notion of genetic variation (defined by the Site and Mutation tables). After simulating two independent populations, one might want to obtain the node-wise union of these two tree sequences (e.g., for analyzing patterns of between population variation). This is almost as simple as concatenating the Node, Edge, Site and Mutation tables. However, nodes from the second tree sequence need to be re-enumerated, and this new numeric order needs to be propagated onto the other tables. I simulated the history of two populations independently using msprime (Baumdicker et al., 2022; Kelleher et al., 2016a) (Figures 2.3a and 2.3b). Then, I 30 Nodes Edges 7 7 115695.67 Mutations Sites 0 96106.40 6 3 1 87564.35 5 4 10400.98 4 4 2 0.00 0 1 2 3 0 3 1 2 0 47 200 Genome position (b) The underlying collection of (a) An example tree sequence. tables. Figure 2.2. The relationship between four samples are depicted in a sequence of trees. Note how because of shared ancestry, mutations that are common to many of the samples need to be represented only once. However, in a matrix of variant sites against samples, these mutations would be repeated unnecessarily. used the union operation to merge the two tree sequences (Figure 2.3c). Lastly, I added the pre-split history to the merged tree sequence so that all samples coalesce back-in-time (Figure 2.3d). See the code in Python to reproduce this analysis (Listing 2.1). By doing the simulation this way, we are able to run the simulation of each extant population in parallel. In the case of large simulations (as we will see in Section 2.4), this is essential so that we can distribute the memory usage over different computer nodes. import msprime import numpy as np import t s k i t # Fi r s t , we bu i ld an msprime . Demography ob j e c t with our two−populat ion h i s t o r y that s p l i t s 100 ,000 un i t s o f time ago 31 demography = msprime . Demography ( ) demography . add populat ion (name= ‘ ‘A ’ ’ , i n i t i a l s i z e =10 000 ) demography . add populat ion (name= ‘ ‘B ’ ’ , i n i t i a l s i z e =100) demography . add populat ion (name= ‘ ‘C ’ ’ , i n i t i a l s i z e =100 000 ) demography . add popu l a t i o n sp l i t ( time=100 000 , der ived =[ ‘ ‘A ’ ’ , ‘ ‘B ’ ’ ] , a n c e s t r a l = ‘ ‘C ’ ’ ) # Now, we can s imulate the h i s t o r i e s o f the two extant popu la t i ons independent ly t s1 = msprime . s im ance s t ry ( samples ={ ‘ ‘A ’ ’ : 4} , p l o idy=1, demography= demography , s equence l eng th=50, r e combinat i on ra te=1e−6, random seed =123) t s2 = msprime . s im ance s t ry ( samples ={ ‘ ‘B ’ ’ : 4} , p l o idy=1, demography= demography , s equence l eng th=50, r e combinat i on ra te=1e−6, random seed =1) # Then , we can union these two t r e e sequences . # Note that because the two popu la t i ons do not share any h i s to ry , # we s p e c i f y a ‘ node mapping ’ in which none o f the nodes in ‘ t s2 ’ have an equ iva l en t in ‘ t s1 ’ . t su = ts1 . union ( ts2 , node mapping=np . f u l l ( t s2 . num nodes , t s k i t .NULL) , add populat ions=False ) # Fina l ly , we can add the pre−s p l i t h i s t o r y to the union ’ ed t r e e sequence with msprime . s im ance s t ry . t su r e cap = msprime . s im ance s t ry ( i n i t i a l s t a t e = tsu , demography= demography , random seed=3) Listing 2.1 Python code to union two independently simulated populations using msprime and tskit. 32 7 158.72 7 723320.81 6 108.17 6 19402.96 6 101.30 5 17694.55 5 97.88 4 4 2080.20 4 4 0.00 0.00 0 1 3 2 0 1 2 3 0 1 2 3 0 3 1 2 0 45 50 0 12 50 Genome position Genome position (a) Tree sequence of population 1. (b) Tree sequence of population 2. 7 7 7 23320.81 19402.96 6 6 17694.55 5 2080.20 4 4 4 15 158.72 14 14 108.17 14 101.30 13 13 97.88 12 12 12 0.00 8 9 11 10 0 1 2 3 8 9 11 10 0 3 1 2 0 3 1 2 8 9 10 11 0 12 45 50 Genome position (c) Union of tree sequences 1 and 2. 16 16 16 125868.85 23320.81 7 7 7 19402.96 6 6 17694.55 5 2080.20 4 4 4 158.72 15 108.17 14 14 14 101.30 13 13 97.88 12 12 12 0.00 0 1 2 3 8 9 11 10 0 3 1 2 8 9 11 10 0 3 1 2 8 9 10 11 0 12 45 50 Genome position (d) Adding the pre-split history to the union. Figure 2.3. The union operation for tree sequences. (a) and (b) show the tree sequences for two independently simulated populations. (c) depicts the union of these tree sequences. (d) shows the merged tree sequences with added history for the ancestral population. 33 A trickier use case for the node-wise union operation is when there are two tree sequences which share part of their histories. That is, imagine you simulate the ancestor population of A and B, and then start two new independent simulations of A and B. In this case, it is necessary to defined which portions are not shared between the two tree sequences, so that the histories can be joined properly. To define the non-shared portions, it is necessary to specify which nodes in one tree sequence are equivalent to those in the other. Then, the portions that are not shared from one tree sequence are copied over onto the other. New nodes are given a new numeric identifier which are propagated to the other tables. Refer to Rodrigues and Ralph (2021) for a concrete example on how to union tree sequences with some shared history. 2.3 Towards more realistic and reproducible simulations: Introducing models of selection to Stdpopsim Population genetics can aid in identifying the genetic basis of adaptation. As a new beneficial mutation increases in frequency due to positive selection, it wipes out genetic variation surrounding the selected site (Kaplan et al., 1989; J. M. Smith & Haigh, 1974). This genetic footprint can be used to infer past selection events in natural populations (Enard et al., 2014; Garud et al., 2015; Hernandez et al., 2011; Nielsen, 2000; Przeworski, 2002; Schrider & Kern, 2017; R. J. Williamson et al., 2014). Our ability to detect selective sweeps depends on many parameters, such as the recombination rate, the time since fixation, and the demographic history. The recombination rate determines the width of a selective signal (together with the strength of selection) (Kaplan et al., 1989). As time passes by, new mutations are accumulated restoring pre-sweep levels of genetic variation and erasing sweep signatures. Demographic events can leave footprints similar to a selective sweep, 34 confounding detection (Jensen et al., 2005; Przeworski, 2002). Conversely, sweep signatures can be erased by recent demographic events, such as bottlenecks. Another process that can impact our ability to detect sweeps is background selection. Background selection is the process whereby neutral genetic variation linked to deleterious mutations are lost (Charlesworth et al., 1993), and the extent of this effect is also modulated by strength of selection and recombination rate. Because of this relationship with recombination rate, background selection can confound the search for selective sweeps (Andolfatto, 2001). This process seems to be pervasive in multiple species; so much so that background selection may be considered a better null hypothesis in population genomic studies rather than neutrality (Comeron, 2017). Background selection may be discernible from selective sweeps, specially when considering more realistic models (Schrider, 2020). Previous studies that found background selection can confound sweep calling assumed that the region constrained by selection is flanked by large streches of neutral sites, but the locations of exons and other constrained elements are usually scattered throughout the chromosome. In such cases, sweeps can be separated from constraint more easily. To better understand the role of positive selection in shaping genetic variation across the genome, it is necessary to better delineate how different processes impact our ability to detect selective sweeps. I present below our efforts to include models of natural selection to stdpopsim, our community-driven library of evolutionary models. Colleagues and I included the ability to easily simulate background selection and selective sweeps using previously published distribution of fitness effects (DFEs). I then leveraged this new feature to understand how 35 power to detect selective sweeps varies over a realistic looking human chromosome, with recombination rate variation and a map of constraint taken from human annotations. These results will appear in an upcoming publication of the stdpopsim consortium. 2.3.1 Adding selection to simulations using Stdpopsim. There are two main ways of adding selection to a simulation in stdpopsim: (i) it is possible to specify a distribution of fitness effects (DFE) for new mutations that can occur across the genome or over some pre-specified regions; and (ii) we can introduce and track a single mutation to a population, as one would need to study selective sweeps. Beyond the machinery to actually perform these two kinds of simulations, we now added to Stdpopsim a library of previously published distributions of fitness effects and genomic annotations (e.g., exon and intron coordinates) which can be used to build realistic models with selection. See our Tutorial for more details on how to implement models with selection in stdpopsim: https://popsim-consortium.github.io/stdpopsim-docs/stable/tutorial.html# incorporating-selection. 2.3.2 Analysis of power to detect sweeps along realistic chromosomes. To demonstrate the utility of stdpopsim, we produced maps of power to detect sweeps along a realistic chromosome. We ran simulations of human chromosome 1 using the three population out-of-Africa model (identified as OutOfAfrica 3G09 in the stdpopsim library; Gutenkunst et al., 2009) and the genetic map estimated in Frazer et al. (2007) (HapMapII GRCh38). We had three classes of simulations: (i) neutral (with the specified demography and genetic map), (ii) background selection, in which we applied the DFE estimated in B. Y. Kim et al. (2017) (Gamma K17) to exon annotations (taken from Ensembl; 36 ensembl havana 104 exons), and (iii) sweep with background selection, in which we simulated the hard sweep as described with the addition of constraint in exonic regions (same as for ii). For computational efficiency, we simulated a 5Mb region flanking 100 evenly distributed points along the chromosome 1. When simulating truncated regions with selection, it is necessary to consider a boundary effect effect: a new deleterious mutation will be linked to fewer neutral mutations if it arises near the ends of a chromosome, so linked selection is not as efficient. Thus, we first established this effect in simulations in which new deleterious mutations can happen anywhere in the chromosome with the same DFE as in our main simulations, but with constant recombination rate. We found that by adding a buffer surrounding our focal simulated regions of 2.5cM it is possible to mitigate the edge effect (Figure 2.4). We rescaled the simulations to reduce computational cost by simulating smaller populations (factor of 2) while increasing times, mutation and recombination rates, and selection coefficients by the same amount (see Uricchio and Hernandez, 2014 and Adrion et al., 2020 for more details). In total, we produced 200 replicates for each class at each poisition, totalling 60,000 simulations. We computed three statistics used for finding selective sweeps along the simulated regions: (i) nucleotide diversity, (ii) the probability of a sweep using a machine learning sweep classifier called diploSHIC (Kern & Schrider, 2018; Schrider & Kern, 2016), and (iii) the composite likelihood ratio (CLR) (Nielsen et al., 2005). Nucleotide diversity was computed directly from the underlying tree sequences using tskit (Ralph et al., 2020) over 10 equally sized and non- overlapping subwindows. diploSHIC was applied over 15 partially overlapping 37 Figure 2.4. Linked selection is not as efficient near the ends of chromosomes. Note how the boundary effect plateaus around 2Mb from the ends (assuming a constant recombination rate). 38 subwindows. CLR was computed at 21 uniformly distributed points along the simulated region. We then classified the central subwindow as a sweep or not sweep. To do so, we empirically determined the distribution of each statistic under a null model and defined the top 5% quantile as a cutoff to call a sweep. We considered two null models: neutral and background selection (as explained before). We computed the power to detect a sweep at each one of the 100 locations along chromosome 1, which is the proportion of simulations with a sweep that were correctly classified as a sweep. There is significant amount of variation in the power to detect sweeps along human chromosome 1 (Figure 2.5). Sweeps are easier to detect in YRI than in CEU, most likely because of the higher effective population size of YRI. Variation in power varies across statistics, with nucleotide diversity yielding the greatest variation in power. The power varies tremendously when using nucleotide diversity to call sweeps, ranging from 0% to 100%. diploSHIC has good power overall, reaching close to 100% at many locations with YRI demography, but it falters at around 15% of the locations where it dips below 75%. CLR, on the other hand, maintains a good average power of around 80% and it does not vary as much. Using background selection as a null model decreases power for both CLR and Diversity. The effects of background selection on these two statistics are similar to a sweep, so using this null model leads to more stringent cutoffs to call sweeps. diploSHIC behaves differently: using background selection as a null model increases power. This indicates that background selection does not confound sweep detection using diploSHIC, as has been hypothesized before (Schrider, 2020). It 39 CEU YRI 1.00 0.75 0.50 0.25 0.00 Exon overlap 1.00 6% 4% 2% 0.75 0% Null model 0.50 BGS Neutral 0.25 cM 0.03 0.00 0.02 0.01 1.00 0.00 0.75 0.50 0.25 0.00 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 Position Figure 2.5. There is variation in the power to detect sweeps along human chromosome 1 for all three statistics considered (CLR, diploSHIC probability of a sweep, and nucleotide diversity). Note that power is computed by assuming either a neutral (blue) or background selection (red) null model. 40 Power (TPR) logCLR diploSHIC Diversity is encouraging that a composite statistic such as diploSHIC can discern between modes of selection, which has been a major problem in population genetics. To gain a better understanding at which genomic features affect variation in power, we plotted power against recombination rates and percentage of exon overlap. We found that sweeps placed regions of higher recombination are generally harder to detect (Figure 2.6). diploSHIC is not as affected by recombination rates, probably because it can distinguish between locations directly affected by sweeps and locations linked to sweeps. This decrease in power with high recombination rates is caused by two things: (i) the effect of recombination rate on the variance of statistics under the null, and (ii) the effect of recombination rate on the mean of the statistics under a sweep. With lower recombination rates, we expect the variance of statistics to increase (due to coalescent noise). Indeed, both nucleotide diversity and diploSHIC probability of a sweep vary more in lower recombination regions (Figure 2.7). CLR is not as affected, and this is probably why power does not vary as much along the chromosome. In general, we expect regions of lower recombination to be more affected by a sweep, because it is harder for linked variation to escape a sweep. This is true for both nucleotide diversity and CLR, and we see that the median value for these statistics decrease with higher recombination rates (Figure 2.8). However, the probability of sweep computed by diploSHIC increases with higher recombination rates in our sweep simulations. Because diploSHIC models the difference between regions directly affected by a sweep and regions linked to a sweep, it is able to distinguish between classes better in regions of higher recombination rate. 41 CEU YRI 1.00 0.75 0.50 0.25 0.00 1.00 0.75 Null model 0.50 BGS Neutral 0.25 0.00 1.00 0.75 0.50 0.25 0.00 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 cM Figure 2.6. Recombination rates affect power to detect sweeps. The scatterplots depict the relationship between the genetic recombination rate and power for all three statistics (CLR, diploSHIC and Diversity). Note that power is computed by assuming either a neutral (blue) or background selection (red) null model. 42 Power(TPR) logCLR diploSHIC Diversity CEU YRI 2.5 0.0 −2.5 −5.0 1.00 0.75 0.50 0.25 0.00 0.0035 0.0030 0.0025 0.0020 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 cM Figure 2.7. Recombination rates affect the variance of statistics under neutrality. Red lines delimit the central 90% of the distribution for the three statistics at each recombination rate. 43 ValueNeutral Neutral Neutral logCLR diploSHIC Diversity CEU YRI 5.0 2.5 0.0 −2.5 −5.0 1.00 0.75 0.50 0.25 0.00 0.0036 0.0032 0.0028 0.0024 0.00 0.01 0.02 0.03 0.00 0.01 0.02 0.03 cM Figure 2.8. Recombination rates affect the median of statistics under the sweep model. The blue line is the median of the distribution for the three statistics at each recombination rate. 44 ValueSweep Sweep Sweep logCLR diploSHIC Diversity Taken together, our results demonstrate that power to detect sweeps varies considerably along realistic looking chromosomes. Indeed, power along chromosomes may vary almost as much as they do with different demographic models. It is necessary for future studies to consider these effects more carefully, and we provide the tools to do so using the stdpopsim library and software package. It is possible that previous scans have only found sweeps where they can be found. Thus, the effects of positive selection may have been understated in the literature, and important sweeps might have not been correctly identified. It seems important to better investigate whether ascertainment biases in reported sweeps correlate with power along chromosomes, and how this varies with different methods and across species. 45 2.4 Bridge In Chapter II, I presented my contributions to make evolutionary tools more robust and efficient. To gain insights into the complex interactions between evolutionary processes that shape genetic variation, it is necessary to increase the biological realism of our models. This increase in realism comes at a cost: simulations that are closer to reality take longer to run and consume more computational resources. In Chapter III, we apply our ideas presented in this chapter on how to parallelize multi-population simulations and on simulating realistic chromosomes. The realistic simulations we produce in Chapter III were extremely costly, despite our best efforts to optimize, and each replicate took over 30 days to run consuming at the peak over 160Gb of RAM. Nevertheless, we were able to push the boundaries in what is possible with our tools and computational resources today to answer a long-standing question in evolutionary genetics: What are the relative contributions of different evolutionary processes in shaping genetic variation in the great apes? 46 CHAPTER III SHARED EVOLUTIONARY PROCESSES SHAPE LANDSCAPES OF GENOMIC VARIATION IN THE GREAT APES This chapter was published in the journal Genetics in 2024. Andrew D. Kern and Peter L. Ralph are co-authors on this paper. Co-authors and I conceptualized the study, I curated the dataset, I performed analyses and simulations with input from my co-authors, and I wrote the manuscript with editorial assistance from my co-authors. The citation for this publication is as follows: Rodrigues, M. F., Kern, A. D., & Ralph, P. L. (2024). Shared evolutionary processes shape landscapes of genomic variation in the great apes. Genetics, iyae006. https://doi.org/10.1093/genetics/iyae006 3.1 Introduction Genetic variation is determined by the combined action of mutation, demographic processes, recombination and natural selection. However, there is still no consensus on the relative contributions of these processes and their interactions in shaping patterns of genetic variation. Two major open questions are: how does the influence of selection compare to other processes? And, to what degree is genetic variation influenced by beneficial versus deleterious mutations? Genetic variation can be measured within a species or between species with two related metrics: within-species genetic diversity and between-species genetic divergence. Both can be estimated with genetic data by computing the per site average number of differences between pairs of samples within a species or between two species, and these are estimates of the mean time to coalescence. 47 (Note that we do not discuss relative divergence, which is often measured using FST .) Evolutionary processes impact diversity and divergence in different ways, so the relationship between these carries information regarding these processes. Natural selection directly impacts genetic diversity because it can reduce the frequencies of alleles that are deleterious (negative selection) or increase those of beneficial alleles (positive selection). Selection can also directly affect between- species genetic divergence. Deleterious alleles are more likely to be lost from the population, thus reducing divergence at the affected sites. On the other hand, beneficial alleles have a higher probability of fixation. This leads to an increase in the rate of substitution at sites under positive selection that in turn increases divergence between species. Thus, contrasting patterns of diversity and divergence at the same time can help disentangle between modes of selection (Hudson et al., 1987). Indeed, perhaps the most widely used test for detecting adaptive evolution, the McDonald-Kreitman test, compares diversity and divergence contrasted between neutral (e.g., synonymous) and functional (e.g., non-synonymous) site classes (McDonald & Kreitman, 1991). This test and its extensions have been applied to a myriad of taxa, and it has become clear that a substantial proportion of amino acid substitutions are driven by positive selection in a number of taxa (Galtier, 2016; Ingvarsson, 2010; Slotte, 2014; N. Smith & Eyre-Walker, 2002). Selection also disturbs genetic variation at nearby locations on the genome, and this indirect effect of selection on diversity is called “linked selection”. Linked selection can be caused by at least two familiar mechanisms: genetic hitchhiking and background selection. Under genetic hitchhiking, as a beneficial mutation quickly increases in frequency in a population, its nearby genetic background is carried along, causing local reductions in levels of genetic diversity. The size of the 48 region affected by the sweep depends on the strength of selection, which determines how fast fixation happens, and the crossover rate, because recombination allows linked sites to escape from the haplotype carrying the beneficial mutation (Kaplan et al., 1989; Maynard Smith & Haigh, 1974). Under background selection, neutral variation linked to deleterious mutations is removed from the population unless, as before, focal lineages escape via recombination (Charlesworth et al., 1993). Both of these processes leave similar footprints on patterns of within-species genetic diversity, and so attempts to determine the contributions of positive and negative selection in shaping levels of genetic variation genome-wide have proven to be difficult (Andolfatto, 2001; Y. Kim & Stephan, 2000), although the processes seem separable more locally (Schrider, 2020; Schrider & Kern, 2017). Importantly, linked selection does not affect between-species genetic divergence as strongly, as a beneficial or deleterious mutation does not affect the substitution rate of linked, neutral mutations (Birky & Walsh, 1988) (although it does affect divergence through ancestral levels of polymorphism (Begun et al., 2007; Phung et al., 2016)). The effects of linked selection in shaping genetic variation are pervasive across genomes (Begun & Aquadro, 1992; Cai et al., 2009; Corbett-Detig et al., 2015; Lohmueller et al., 2011; Murphy et al., 2022). For example, dips in nucleotide diversity surrounding functional substitutions have been uncovered in many taxa, such as fruit flies (Kern et al., 2002; Sattath et al., 2011), rodents (Halligan et al., 2013), Capsella (R. J. Williamson et al., 2014) and maize (Beissinger et al., 2016). In Drosophila melanogaster, levels of synonymous diversity (which is putatively neutral) and amino acid divergence are negatively correlated (Andolfatto, 2007; Macpherson et al., 2007); positive selection can cause such a pattern if beneficial amino acid mutations are fixing and as they do reducing levels of linked neutral 49 variation via selective sweeps. In contrast in humans, levels of synonymous diversity are roughly the same near amino acid substitutions and synonymous substitutions, suggesting recent, recent fixations at amino acids sites may not be the result of strongly beneficial alleles (Hernandez et al., 2011; Lohmueller et al., 2011). However, in the human genome, amino acid substitutions tend to be located in regions of lower constraint than synonymous substitutions, implying that the signal of positive selection may be confounded by the effects of background selection (Enard et al., 2014). Two major challenges remain in the way of a fuller characterization of the effects of selection on genetic variation: (i) it is hard to model interactions between evolutionary processes (e.g., sweeps within highly constrained regions), and (ii) model identifiability is challenging for some summaries of the data (e.g., sweeps and background selection may impact diversity in similar ways). Recent computational advances have made it possible for us to move from simpler backwards-in-time coalescent models (Hudson, 1983) to more complex and computationally demanding forward-in-time simulations, and these have provided a route to studying these hard to model interactions between evolutionary processes across multiple sites (Haller & Messer, 2019; Haller et al., 2019; Kelleher et al., 2016a). Simulation-based inference can then allow us to better describe the roles of different modes of selection and other processes in shaping genomic variation. However, the problem of identifying features of the data that are informative of the strength and mode of selection still remains. One promising approach might be to compare patterns of genetic variation in multiple species jointly as each species can be thought of as semi-independent realizations of the same evolutionary processes (c.f. Won & Hey, 2005). In 50 speciation genomics studies, it is common to visualize large scale patterns of genetic variation along chromosomes (so-called landscapes of diversity and divergence), which may contain substantial information to help us disentangle evolutionary processes. Earlier empirical surveys have focused on the identification of regions of accentuated relative divergence between populations (Cruickshank & Hahn, 2014; Harr, 2006; Turner et al., 2005), although patches of increased divergence can be the result of myriad forces besides reproductive isolation and adaptation. Recent comparative studies have found that landscapes of diversity are highly correlated between related groups of species, such as Ficedula flycatchers (Burri et al., 2015; Ellegren et al., 2012), warblers (Irwin et al., 2016), stonechats (van Doren et al., 2017), hummingbirds (Battey, 2020), monkeyflowers (Stankowski et al., 2019) and Populus (Wang et al., 2020). Burri (2017) proposed that we could capitalize on correlated genomic landscapes to study the interplay between different forms of selection and other evolutionary processes. Neutral processes, such as retained ancestral diversity (i.e., incomplete lineage sorting) or migration, could potentially produce significant correlations in levels of diversity across species, however strong correlations have been observed among taxa with long divergence times and without evidence of gene flow. For example, Stankowski et al. (2019) found that landscapes of diversity and divergence are highly correlated across a radiation of monkeyflowers which spans one million year (or about 10Ne generations, where Ne is the effective population size), far longer than the time scale on which we expect to see effects of ancestral variation and incomplete lineage sorting (since the coalescent timescale spans just a few multiples of Ne). However, a shared process that independently occurs in the branches of a group of species could maintain correlations over long timescales. For example, if two species’ physical arrangement 51 of functional elements and local recombination rates are similar, the direct and indirect effects of selection could make it so that peaks and valleys on the landscape of diversity are similar, maintaining correlation between their landscapes over evolutionary time (Burri, 2017; Delmore et al., 2018). Further, if mutational processes are heterogeneous across the genome in a manner that is shared among species, then correlated landscapes of diversity could be created through mutational variation as well. Here, we aim to (i) describe whether and in what ways landscapes of within- species diversity and between-species divergence are correlated, and (ii) to tease apart the relative roles of positive and negative selection and other processes (e.g., ancestral variation, mutation rate variation) in shaping patterns of genetic variation. To understand processes driving these correlations, we employ highly realistic, chromosome-scale, forward-in-time simulations, since analytical predictions are not available. We use the great apes as a system to investigate correlated patterns of genetic variation because there is high quality population genomic data for all species (Prado-Martinez et al., 2013), the clade is about 12 million years old or 60Ne generations (but there have not been many chromosomal arrangements Jauch et al., 1992), and lastly the landscapes of gene density, recombination rate and mutation rate are roughly conserved (Kronenberg et al., 2018; Stevison et al., 2016). Our study demonstrates that correlated landscapes can be useful in distinguishing between modes of selection and the balance of direct and linked selection shaping genomic variation. 3.2 Methods 3.2.1 Genomic data. We retrieved SNP calls for ten great ape populations made on high coverage (∼25×) short-read sequencing data from the 52 Great Ape Genome Project (Prado-Martinez et al., 2013), mapped onto the human reference genome (NCBI36/hg18). We analyzed 86 individuals divided into the following populations: human (n = 9 samples), bonobo (n = 13), Nigeria- Cameroon chimpanzee (n = 10), eastern chimpanzee (n = 6), central chimpanzee (n = 4), western chimpanzee (n = 4), eastern lowland gorilla (n = 3), western gorilla (n = 27), Sumatran orangutan (n = 5), Bornean orangutan (n = 5) (we excluded two samples from the original dataset: the Cross River gorilla and the chimpanzee hybrid). Prado-Martinez et al. (2013) applied several quality filters to the SNP calls (see Section 2.1 of their Supplementary Information) and, for each species, identified the genomic regions in which it would be unreliable to call SNPs (uncallable regions). For our downstream analyses, we only considered sites which were callable in all populations. We calculated nucleotide diversity and divergence (dXY ) in non-overlapping 1Mb windows using scikit-allel (Miles et al., 2020). Windows in which there were less than 40% callable sites were not used in any of the analyses. For example, this yielded 129 (out of 132) 1Mb windows in chromosome 12 in which 75% of the sites were callable on average. To tease apart the effects of GC-biased gene conversion (gBGC), we decomposed diversity and divergence by allelic states. gBGC is expected to affect weak bases (A or T) which are disfavored when in heterozygotes which also carry a strong base (G or C). Thus, one way understand the effects of gBGC is by comparing sites which were weak to those that were strong in the ancestor (ancestrally strong alleles are not affected by gBGC, but ancestrally weak alleles can be). We assumed that the state in the ancestor of the great apes to be the state seen in rhesus macaques (genome version RheMac2) — sites without enough 53 information in RheMac2 were excluded. Then, we computed divergence only considering sites which were ancestrally weak or ancestrally strong (Figure A.4). This approach has two major drawbacks: (i) many of the sites cannot be used because they are missing in RheMac2 and (ii) sites can be mispolarized. Thus, we came up with a second approach to tease apart the effects of gBGC on correlations between genomic landscapes. When comparing two landscapes of divergence (which encompass four species), we can classify each site by the change in state that happened without needing to polarize mutations by looking at the ancestor. For example, if we have allelic states for four species and we see A-A-T-T as the configuration of alleles at a particular site, we know that there must have been one mutation which changed the state from a weak base to another weak base (W- W). On the other hand, if we see A-G-A-A there must have been one mutation from weak to strong (W-S) (or vice-versa). Sites with multiple mutations (e.g., A- G-G-C) were removed from the analyses. Sites that did not change from W to S (or vice-versa) are not expected to be affected by gBGC, and we refer to these as W-W or S-S mutations (Figure 3.6A). Sites where there may have been a weak to strong change (W-S mutations) may be affected by gBGC (Figure 3.6B). We only considered windows with at least 5% of callable sites in these analyses. 3.2.2 Simulations. We implemented forward-in-time Wright-Fisher simulations of the entire evolutionary history of the great apes using SLiM (Haller & Messer, 2019; Haller et al., 2019). Each branch in the great apes’ tree was simulated as a single population with constant size (Figure 3.1). Population splits occurred in a single generation, and there was no contact between populations post- split. Population sizes and split times were taken from the estimates in Prado- Martinez et al. (2013). Across all our simulations, we simulated crossover events 54 occurred with the sex-averaged rates from the deCODE genetic map (in assembly NCBI36/hg18 coordinates) (Kong et al., 2002). We then computed diversity and divergence in the same windows used for the real data using tskit (Kelleher et al., 2018; Ralph et al., 2020). To improve run time, we simulated sister branches in parallel and recorded the final genealogies as tree sequences (Kelleher et al., 2016a). Further, neutral mutations were not simulated with SLiM and were added after the fact with msprime. The resulting tree sequences were later joined and recapitated (i.e., we simulated genetic variation in the ancestor of all great apes using the coalescent) using msprime, tskit and pyslim (Kelleher et al., 2016a, 2018; Rodrigues & Ralph, 2021). Despite our efforts to improve run time, our simulations of the entire history of the great apes were still incredibly costly (taking over a month to complete in many instances). In our neutral simulations, we assumed that neutral mutations occurred at a rate of 2 × 10−8 new mutations per generation per site (Scally & Durbin, 2012), uniformly across the entire chromosome. To understand the effects of natural selection on landscapes, we simulated beneficial and deleterious mutations only within exons, assuming that the locations of exons were shared across all great apes (Kronenberg et al., 2018) and using exon annotations from the human reference genome NCBI36/hg18. We varied the proportions of neutral, beneficial and deleterious mutations within exons, but the distribution of fitness effects (DFE) for both deleterious and beneficial mutations were shared across all apes. The DFE for deleterious mutations was gamma-distributed with a fixed shape α and scale as estimated in Castellano et al. (2019), and the DFE for beneficial mutations followed an exponential distribution (Orr, 2003). 55 By default, we added neutral mutations to the simulated genealogies with msprime so that the total mutation rate (of neutral plus non-neutral mutations, if any) was constant along the genome. In addition, to simulate local variation in mutation rates along the chromosome, we selected three simulated genealogies (the fully neutral simulation, one with deleterious mutations and one with both beneficial and deleterious mutations) to add neutral mutations in a way that resulted in varying levels of neutral mutation rate variation along the chromosome. To do this, we built mutation rate maps by sampling mutation rates for each 1Mb window independently from a normal distribution with mean 2 × 10−8 and standard deviation chosen from σ/2× 10−8 = {0.005, 0.007, 0.011, 0.016, 0.023, 0.033, 0.048, 0.070.0.103, 0.150}. In simulations with non-neutral mutations, we subtracted the non-neutral mutation rate from the respective window mutation rate for the intervals that intersected with exons. In total, we explored 56 different parameter combinations with all the different simulations (see Table 3.1 and Table A.1 for the parameter space). The code used to produce the simulations can be found at https://github.com/kr-colab/greatapes sims. Regime Neutral Deleterious only Beneficial only Both Proportion of deleterious mutations 0% 10% – 70% 0% 10% – 70% Proportion of beneficial mutations 0% 0% 0.005% – 0.5% 0.005% – 0.5% Gamma distributed with Gamma distributed with Deleterious DFE — s̄ = {− —0.015,−0.03} and α = 0.16 s̄ = {−0.015,−0.03} and α = 0.16 Exponentially distributed with Exponentially distributed with Beneficial DFE — — s̄ = {0.01, 0.005} s̄ = {0.01, 0.005} Table 3.1. Range of parameters explored in the simulations. Non-neutral mutations were only allowed within exons. “DFE” refers to the distribution of fitness effects. Gamma distribution was parameterized with shape α and mean s̄=α/β, where β is the rate parameter. 3.2.3 Visualizing correlated landscapes of diversity and divergence. To compare landscapes of diversity and divergence along chromosomes, we computed the Spearman correlation between the landscapes 56 Great apes Orangutans African apes Gorillas Human-Pan Pan Chimps Eastern-Central Nigerian-Western Figure 3.1. Range of parameters explored in the simulations. Arrows indicate population splits. Branch widths are proportional to population size. For example, the population size was 125, 089 for the great apes branch and 7, 672 for the humans branch. Figure was produced using demesdraw (Gower et al., 2022). 57 time ago (Kya) 0 2000 4000 6000 8000 10000 12000 Sumatran orangutan Bornean orangutan Western gorilla Eastern gorilla Western chimp Nigerian chimp Eastern chimp Central chimp Bonobo Humans across windows within a chromosome. Because of computational constraints, we focus on chromosome 12. Chromosome 12 is one of the smallest chromosomes in the great apes, there are no major inversions, and it has good variation in exon density and recombination rate. The choice was made blindly before looking at the data, but we found it behaves similarly to other chromosomes (see Figure A.10 through Figure A.31). We expected landscapes of two closely related species to be more correlated than the landscapes of two distantly related species. Thus, the correlation between any two landscapes of diversity and divergence is expected to depend on distances between them in the phylogenetic tree. We decided to plot our correlations against distance (in generations) between the most common recent ancestor (MRCA) of each landscape. These distances were computed using the demographic model estimated in Prado-Martinez et al. (2013). In comparing two landscapes of diversity, this amounts to the total distance between the two tips in the species tree. For instance, the phylogenetic distance dT between diversity in humans and diversity in bonobos is the sum of the lengths of the human, pan and bonobo branches in the species tree used for simulation (shown in Figure 3.1). In comparing a landscape of diversity to a landscape of divergence, this amounts to the distance between the species of the landscape of diversity and the MRCA of the two species involved in the divergence. For example, dT for the landscapes of diversity in humans and divergence between Sumatran orangutans and eastern gorillas would be the distance between the humans tip and the great apes internal node. dT for the landscapes of divergence between the orangutans and divergence between the gorillas would be the distance between the orangutan and gorilla 58 internal nodes. Some divergences may share branches in the tree, but these are excluded from our main figures; see subsection A.0.1 and Figure A.2. 3.3 Results First, we will provide a qualitative view of the landscapes of diversity and divergence in the great apes. Then, we explore the correlations between landscapes in the real data and how they vary depending on phylogenetic distance. To understand the processes that can drive these correlations, we use forward-in- time simulations of the great apes history under different models (e.g., with and without natural selection). Lastly, we describe how genomic features are related to patterns of diversity and divergence in the real great apes data, and we speculate which processes can explain what we see in the data and simulations. 3.3.1 Landscapes of within-species diversity and between- species divergence. There is considerable variation in levels of genetic diversity across the great apes (Figure 3.2). Species may differ in overall levels of diversity due to population size history: species with greater historical population sizes (e.g., central chimps and western gorillas) harbor the most amount of genetic variation (Prado-Martinez et al., 2013). Levels of diversity vary along the chromosome, but do not appear to be strongly structured. Instead, diversity seems to haphazardly fluctuate up and down along the chromosome, and this variation might be attributed to neutral genealogical and mutational processes alone. A notable feature is the large dip in diversity around the 50Mb mark, which is so extensive that it almost erases the differences between-species. This dip coincides with three of the windows with the highest exon density, possibly pointing to the role of selection in shaping genetic variation in those windows. 59 Levels of between-species genetic divergence also vary along the genome, by an even greater amount in absolute terms. Interestingly, diversity (π) varies (along the chromosome) by about 0.2%, whereas divergence (dXY ) varies by more than 0.5%. Because d = πancXY + rT (where π anc is diversity in the ancestor, r is the substitution rate and T is the split time between the two species), this excess in variance may be due to the substitution process. Landscapes of divergence which share their most common recent ancestor (e.g., human-Bornean orangutan and bonobo-Bornean orangutan divergences — both colored in red in Figure 3.2A) overlap almost perfectly with each other. Curiously, divergence seems to accumulate faster in the ends of the chromosome, leading to a “smiley” pattern in the landscape of divergence — which is not apparent in the landscape of diversity. That is, with deeper split times, divergence in the ends of the chromosome seem to increase faster than in other regions of the genome (see how the divergences whose MRCA is the great apes look more like a convex parabola than a horizontal line in Figure 3.2A; see also Figure A.1). In comparing landscapes across species side by side, a remarkable pattern emerges: levels of genetic diversity and divergence along chromosomes have similar peaks and troughs. To get a sense of how strong this observation is, we can compare it to one of the most well studied properties of genomic variation: the correlation between exon density and genetic diversity. We found that the correlation between human diversity and exon density is −0.2 (at the 1Mb scale), but the correlation between levels of diversity in humans and western gorillas is 0.48. Below, we dissect this observation of strong correlation between landscapes across the great apes and discuss the processes that may cause it. 60 A π 0.3% sumatran−orangutan 0.2% central−chimp western−gorilla bonobo 0.1% humans 0.0e+00 5.0e+07 1.0e+08 1.5e+08 dXY great−apes 3% 2% african−apes human−pan 1% pan 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Chromosomal position B 4 2 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 C 20% 15% 10% 5% 0% 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Chromosomal position Figure 3.2. A) Landscapes of nucleotide diversity (π) and divergence (dXY ) in 1Mb windows along chromosome 12. Lines are colored by species on the top plot and by the most common recent ancestor (MRCA) on the bottom. Genomic windows with less than 40% of callable sites were masked. Only a subset of the species are displayed for clarity. B) Recombination rate estimates from humans (deCODE map; Kong et al., 2002). C) Exon density along chromosome 12, computed as the percentage of callable nucleotides in a window that fall within an exon. 61 Exon density Recombination rate(cM/Mb) Per site differences 3.3.2 Remarkable correlations between landscapes of diversity and divergence. The landscapes of diversity and divergence are highly correlated across the great apes. To interpret this signal, we first need to understand what processes can cause such correlations, and so first we describe the toy example depicted in Figure 3.3. Both genetic diversity (π) and divergence (dXY ) are estimates of the mean time to the most recent common ancestor (multiplied by twice the effective mutation rate). Populations V and W split recently, and so ancestral variation contributes significantly to within-species diversity (i.e., the coalescences for samples within species happen before the species split). As a result, samples from one population may coalesce first with a sample from another population (e.g., samples v2 and w1), a pattern called incomplete lineage sorting (ILS) (see the branch marked with * in the gene tree). This sharing of ancestral variation causes πV and πW to be correlated with each other. The −T probability two samples from V coalesce before the split with W is 1 − e 2Ne , where T is the split time and Ne is the effective population size. Therefore, split time (T ) should be a good predictor of the correlation between two landscapes of diversity (and/or divergence). Thus, we decided to visualize correlations between landscapes of diversity and divergence by computing the phylogenetic distance dT , which is simply the distance in generation time between two statistics. For example, we define dT (πW , dXY ) = 2TVWXY −TXY . Divergences may share branches by definition (irrespective of split times), as you can see with dV X and dXY (see subsection 3.2.3 for more details). In such cases, our chosen metric dT would not be a good proxy for expected correlations, so we omit such cases from our main figures. See subsection 3.2.3 and Figure A.2 for more on the correlations between landscapes that share branches. 62 Figure 3.3. Visualizing the relationships between nucleotide diversity and divergence statistics between closely related taxa. A population and gene tree for four populations (V, W, X, Y) are depicted with the light gray polygon and gray solid line, respectively. The branch that is shared between πV and πW due to incomplete lineage sorting is highlighted in pink. 63 Figure 3.4 shows the pairwise correlations between great apes landscapes of diversity and divergence against phylogenetic distance (dT , which is computed from the split times of the model shown in Figure 3.1). We see ancestral variation seems to play a role in structuring correlations between landscapes: pairs of species that recently split have their landscapes of diversity highly correlated. Surprisingly, correlations still plateau at around 0.5. We expect ancestral variation to play a minor role when comparing orangutans and chimps, which separated around 60×Ne generations ago, but their landscapes are still highly correlated. Population size history seems to affect the correlation between landscapes since the weakest correlations involve the landscape of diversity of one of the species with small historical population sizes (i.e., bonobos, eastern gorillas and western chimps). Correlations between landscapes of divergence and diversity and between landscapes of divergence are also quite high, often surpassing 0.5, and they also decay with phylogenetic distance (dT ) (see middle and right most plots in Figure 3.4). In theory, these landscapes can also be correlated due to ancestral variation. To see how ancestral variation can create correlations even between landscapes with no overlap in the tree, consider Figure 3.3: divergence between X and Y and divergence between V and W can each contain contributions from ancestral diversity if lineages have not coalesced in both branches leading from the ancestor. If a particular portion of the genome happens to have higher diversity in the ancestor, it will also have higher divergence. Since this correlation is produced by sharing of ancestral variation, it is expected to have a very small effect except when branches are short. As discussed in subsection 3.2.3, two divergences can also be correlated by definition (because they share branches in the tree). For example, when comparing human-Bornean orangutan and gorilla-Bornean orangutan 64 divergence we expect some correlation because these divergences share the large African apes and orangutan branches in the tree (Figure 3.1). In Figure 3.4 we excluded these comparisons where branches are shared. Such comparisons can be seen in Figure A.2. We found that even these comparisons that share branches have an excess of correlation compared to a theoretical expectation (derived from a simplified neutral model), that is the correlations are above the y = x line in Figure A.2 even for distantly related species. There are many processes that could maintain landscapes correlated. Above, we discussed how we expect ancestral variation to explain these correlations. The alternative would be to have a process that structures variation along chromosomes which is shared across species. Using forward-in-time simulations, we set out to (i) confirm that ancestral variation alone is not causing landscapes to remain correlated, and (ii) test which process or processes that when shared among a group of species could maintain correlations in similar ways to what we observed in the great apes’ data. 3.3.3 Neutral demographic processes. To assess the extent to which ancestral variation alone could explain our observations, we performed a forward-in-time simulation of the great apes’ evolutionary history. As expected, the resulting landscapes of diversity and divergence are not well correlated (Figure 3.5). Ancestral variation seems to maintain correlations between some landscapes; for instance, the landscapes of diversity in central and eastern chimps have a 0.61 correlation, the highest across all pairs of comparisons (Figure 3.5A, point a). Nevertheless, correlations between landscapes of diversity and divergence decay quickly with phylogenetic distance to 0. Some distant comparisons are moderately correlated (e.g., the landscape of diversity in Bornean orangutans and divergence 65 Great apes dataset π − π dXY − π dXY − dXY 1.00 ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● 0.75 ●● ● ●● ●● ● ●● ● ●●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● π in low N ● ● ●● ●● e ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● 0.50 ●● ● ●● species? ● ● ●● ● ●●● ●● ● ●● ● No Yes 0.25 0.00 0 250 500 750 1 0000 250 500 750 1 0000 250 500 750 1 000 Phylogenetic distance (103) Figure 3.4. Correlations between landscapes of diversity and divergence across the great apes. Each point on the plots correspond to the (Spearman) correlation between two landscapes of diversity/divergence, computed on 1Mb windows across the entire chromosome 12. Correlations were split by type of landscapes compared (π − π, π − dXY , dXY − dXY ). dT is the phylogenetic distance (in number of generations) between the most common recent ancestor of the two landscapes compared (e.g., the dT for correlation between landscapes of diversity in humans and divergence between eastern gorillas and orangutans is distance between the humans and the great apes nodes in the phylogenetic tree, Figure 3.1). Note that species with low Ne — for which the estimated species Ne was less than 8 × 103: bonobos, eastern gorillas and western chimps — have a different point shape. Only comparisons for which the definition of the statistics do not overlap are shown, as explained in subsection 3.2.3. 66 Correlation between landscapes between central and western chimps have a correlation coefficient of 0.23, see Figure 3.5A, point b), but that seems to be driven by the outlier window around 80Mb. This outlier window has a recombination rate close to 0 (Figure 3.2C), so the average nucleotide diversity over the window has a higher variance because of coalescent noise (see the extreme peaks and valleys in Figure 3.5). Recombination rate variation can create some moderate correlations, but when we look at multiple species at once it becomes clear that the mean correlation goes to 0. 3.3.4 GC-biased gene conversion. A prominent feature of the landscapes of divergence in the great apes is the faster accumulation of divergence in the ends of the chromosomes (Figure 3.2). This feature was not present in any of our simulations, so we sought to understand its possible causes. Double strand breaks are more common at the ends of chromosomes (Kong et al., 2002, 2010), and these can be repaired either by crossover or gene conversion events. GC-biased gene conversion (gBGC), the process whereby weak alleles (A and T) are replaced by strong alleles (G and C) in the repair of double-stranded breaks in heterozygotes, mimics positive selection – in that it increases the probability of fixation of G and C alleles (e.g., Galtier et al., 2009). We suspected gBGC could have caused the increased rate of accumulation divergence in the ends of chromosomes, as has been observed previously (Katzman et al., 2010), and contributes to the maintenance of correlations between landscapes over long time scales. To tease apart the effects of gBGC on correlated landscapes, we partitioned divergence by mutation type (weak to weak, strong to strong and weak to strong). If correlations are being driven by gBGC, then we would expect the correlation between landscapes of divergence to be stronger for weak to strong mutations. We found that the overall correlations are very similar across mutation types, 67 A Neutral simulation π − π dXY − π dXY − dXY 0.8 a π in low N● e● ● 0.4 ● species? ● ● ● b ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ●● ● ● ●●● ● ● No 0.0 ● ● ● ●● ● ● ●●● ●● ●● ●●● ● ● ●● ● ●●●● ●● ● ● Yes ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ●● ●● ●●●● ● ● 0 250 500 750 1 0000 250 500 750 1 0000 250 500 750 1 000 Phylogenetic distance (103) B π dXY 0.3% 3% great−apes 0.2% central−chimp 2% western−gorilla african−apes sumatran−orangutan 0.1% human−pan humans 1% pan bonobo 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 Chromosomal position Figure 3.5. Landscapes are not well correlated in a neutral simulation. (A) Correlations between landscapes of diversity and divergence in a neutral simulation. See Figure 3.4 for more details. (B) Nucleotide diversity and divergence along the simulated neutral chromosome. See Figure 3.2A for details. 68 Per site differences Correlation between landscapes suggesting gBGC does not play a strong role in structuring the correlations between landscapes (Figure 3.6). A B dXY − dXY for W−W or S−S mutations dXY − dXY for W−S mutations 1.00 1.00 ● ●● ●● ● ● ● ● ●● ●●● ● ● ●●● 0.75 ● ●●● ●●●●●● ● ● ●●●●●● ● 0.75● ● ● ●● ● ●● ●● ●●● ●●●●●● ● ●●●● ●● ●● ● ●● ● ● ●●● ● ● ● ● ● 0.50 ●● 0.50 ● ● ● ●●●● ●● ●● ●●● ● ● 0.25 0.25 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Figure 3.6. Correlations between landscapes of divergence partitioned by site type (W-W/S-S and W-S). W-W sites are sites in which the state did not change between-species (and remained weak which corresponds to A or T). Similar logic applies to S-S sites (S or strong states are G or C). W-S sites are sites in which a new mutation appeared either going from weak to strong or from strong to weak. Note these definitions do not rely on identifying the exact ancestral state, we simply compare the current states in the four species involved (two species per dXY landscape). For example, if by looking at the four species we see the following states A,T,A,T the site would be classified as W-W. If we saw G,A,A,A the site would be classified as W-S. Other details are the same as in the rightmost panel in Figure 3.4. 3.3.5 Positive and negative natural selection. Another process whose intensity is likely correlated across all branches in the great apes tree is natural selection. If targets of selection and recombination maps are shared across species, then we would expect both the direct and indirect effects of selection to be shared across branches. It can be difficult to model natural selection in a realistic manner because we do not know precisely which locations of the genome 69 Correlation between landscapes are subject to stronger selection. Nevertheless, exons are expected to have higher density of functional mutations than other places in the genome. Thus, we ran simulations in which beneficial and deleterious mutations can happen only within exons. Using human annotations, we simulated the great apes’ history assuming a common recombination map and exon locations. See the landscapes resulting from these simulations in Figure A.3. We found that negative selection can slightly increase correlations between landscapes (Figure 3.7A-C). If 30% of all mutations within exons were strongly deleterious (mean selection coefficient s̄ = −0.03), landscapes would be weakly correlated (Figure 3.7B). The correlations between landscapes rarely surpass 0.5, even with 70% of all mutations within exons being strongly deleterious (Figure 3.7C). Positive selection, on the other hand, can quickly increase correlations between landscapes. A beneficial mutation rate within exons of µ̄ = 1 × 10−12p produced moderate correlations between landscapes (Figure 3.7D). With too much positive selection, correlations can break down because of the contrasting effects of positive selection on diversity and divergence. That is, while positive selection increases fixation rates and hence divergence between-species, its linked effects decrease diversity within the species. This can create negative correlations between landscapes, as can be seen in Figure 3.7F. Note that some correlations between landscapes of diversity and divergence remain high when the divergence is computed between closely related species (e.g., central and eastern chimps). Divergence is d = πancXY + 2rT , where πanc is diversity in the ancestor, r is the substitution rate and T is the time since species split. Thus, for the divergences 70 in which the two species split recently are dominated by genetic diversity in the ancestor, correlations between π − dXY remain high because dXY ≃ πanc. Positive and negative selection can work synergistically to produce correlated landscapes that look like the real data. For example, comparing figures Figure 3.7D,G,H which differ in rate of negatively selected mutations µn, it is possible to see that the correlations between landscapes start to resemble the real data with more deleterious mutations. Figure 3.7H seems to resemble the data fairly well, with π − dXY and dXY − dXY correlations plateauing around 0.5. The π − π correlations are a bit lower than the real data, however. Recent demographic events can affect genetic diversity and although our simulations are heavily parameterized with respect to the effects of selection, we are not capturing all the variation caused by more realistic demographic models. Figure 3.7D and H look very similar to each other. These have the same amount of positive selection, but the first did not have any negative selection. The major difference between them is that with negative selection there is a more clear separation between the correlations involving low Ne species, similar to what is seen in the data. 3.3.6 Mutation rate variation. Since mutation rate can vary along chromosomes, if this mutation rate map were shared across species, it would maintain correlations between landscapes over longer periods of time. To assess this, we used three of our previous simulated genealogies of the great apes and replaced all neutral mutations assuming a common neutral mutation rate map across the phylogeny: for each window, we drew a mutation rate from a normal distribution with mean 2 × 10−8 (the same as all other simulations) and standard deviation µSD. We found that, under neutrality, a mutation rate map with µSD close to 7% × 2 × 10−8 would be needed to get correlations similar to the data 71 A μₙ B C=2.0e-09 μₙ=6.0e-09 μₙ=1.4e-08 π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y 1.00 1.00 1.00 0.75 0.75 0.75 0.50 0.50 0.50 0.25 0.25 0.25 0.00 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) D μₚ E F=1e-12 μₚ=1e-11 μₚ=1e-10 π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y 1.00 1.00 1.00 0.75 0.75 0.75 0.50 0.50 0.50 0.25 0.25 0.25 0.00 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) G H I μₙ=2.0e-09 μₚ=1e-12 μₙ=1.2e-08 μₚ=1e-12 μₙ=1.4e-08 μₚ=1e-11 π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y 1.00 1.00 1.00 0.75 0.75 0.75 0.50 0.50 0.50 0.25 0.25 0.25 0.00 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) Figure 3.7. Correlations between landscapes of diversity and divergence in simulations with natural selection. (A-C) Simulations with negative selection. (D- F) Simulations with positive selection. (G-I) Simulations with both negative and positive selection. The selection parameters µn and µp are the rate of mutations in exons with negative and positive fitness effects, respectively. The mean fitness effect was s̄ = −0.03 for deleterious mutations and s̄ = 0.01 for beneficial mutations (see subsection 3.2.2 for more details). See how panel H looks the most like the data (Figure 3.4). 72 Correlation Correlation Correlation Correlation Correlation Correlation Correlation Correlation Correlation (Figure 3.8A-C). Although mean correlations look similar to the data, we see that correlations tend to increase slightly with time in the simulations with mutation rate variation. This is expected because windows with higher mutation rate accumulate divergence faster, creating a correlation with mutation rate that gets stronger with time. In the great apes’ data, however, we see a slow but steady decrease in correlations with time. When we added variation in the neutral mutation rate to simulations with selection, we found that a mutation rate map with a standard deviation of rates of slightly less than µSD = 7% could plausibly create the correlations observed in the real data (Figure 3.8D-I). The neutral and deleterious simulations with mutation rate variation fail to recover one aspect of the real data: the lower correlations between landscapes that include at least one low Ne species (seen in the π − π and π − dXY comparisons of Figure 3.4). This feature, however, is seen in the simulations with both beneficial and deleterious mutations (Figure 3.8G-I). 3.3.7 Visualizing similarity between simulations and data. To see how a particular simulation resembles the real data, we can use figures Figure 3.4 and Figure 3.7 to compare how the patterns of all 1260 pairwise correlations between landscapes match the real data. However, it is difficult to assess the fit of the simulated scenarios to real data from such a comparison. Instead, we use principal component analysis (PCA) and create a low dimensional visualization, shown in Figure 3.9, in which each point is a simulation or the real data (shown in yellow). We created this PCA from the 57 × 1260 matrix in which rows are the simulations and the data, and columns are the pairwise Spearman correlations between landscapes. Unlike in the plots above, here we include the correlations between overlapping landscapes (as detailed in subsection 3.2.3) 73 A B C μ-sd=0.033 μ-sd=0.070 μ-sd=0.150 π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y 1.00 1.00 1.00 0.75 0.75 0.75 0.50 0.50 0.50 0.25 0.25 0.25 0.00 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) D E F μₙ=1.4e-08 μ-sd=0.033 μₙ=1.4e-08 μ-sd=0.070 μₙ=1.4e-08 μ-sd=0.150 π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y 1.00 1.00 1.00 0.75 0.75 0.75 0.50 0.50 0.50 0.25 0.25 0.25 0.00 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) G ₙ H Iμ =1.2e-08 μₚ=1e-12 μ-sd=0.033 μₙ=1.2e-08 μₚ=1e-12 μ-sd=0.070 μₙ=1.2e-08 μₚ=1e-12 μ-sd=0.150 π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y π − π dX Y − π dX Y − dX Y 1.00 1.00 1.00 0.75 0.75 0.75 0.50 0.50 0.50 0.25 0.25 0.25 0.00 0.00 0.00 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 0 250 500 750 1 000 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) Figure 3.8. Correlations between landscapes of diversity and divergence across the great apes for simulations with variation in mutation rate along the chromosome. Panels A through I show different simulations in which we varied the standard deviation in neutral mutation rate between 1Mb windows, in each setting the standard deviation to the mean mutation rate (2 × 10−8) multiplied by µSD. First row (A-C) use a neutral simulation, second row (D-F) a simulation with negative selection, and third row (G-I) a simulation with both positive and negative selection. The selection parameters µn and µp are the rate of mutations in exons with negative and positive fitness effects, respectively. The mean fitness effect was s̄ = −0.03 for deleterious mutations and s̄ = 0.01 for beneficial mutations (see subsection 3.2.2 for more details). See how without selection (A-C), the simulation with µSD = 7% (panel B) looks close to the data (Figure 3.4). With selection, the simulation with both positive and negative selection and µSD = 3.3% looks even more similar to the data (correlations between divergences decay over time, and there is a more pronounced differentiation between low and high Ne comparisons). 74 Correlation Correlation Correlation Correlation Correlation Correlation Correlation Correlation Correlation (Figure 3.9). In PC space, the data most closely resembles a subset of our simulations with both positive and negative selection (e.g., µ̄p = 1 × 10−12 and µ̄ = 1.2 × 10−8n ), including no or very little variation in mutation rates (less than 4%). We also performed PCA on correlations computed at two different scales, 500Kb and 5Mb, in addition to the previously shown results for 1Mb (Figure A.5, Figure A.6). At 500Kb, the observed data are slightly more distant from simulations than at the higher scales, possibly because the recombination map used in simulations had a coarser resolution. Nevertheless, the observed data most closely match the simulations with both positive and negative selection in all scales. 3.3.8 Correlations between genomic features and diversity and divergence. Next, we describe how two important genomic features (i.e., exon density and recombination rate) are related to diversity and divergence in the real great apes data set. The correlations between recombination rate and genetic diversity are positive in all great apes (Figure 3.10A). The strongest correlation between genetic diversity and recombination rate is seen in humans, which is unsurprising given our recombination map was estimated for humans. Recent demographic events also seem to impact the strength of the correlation; for example, the correlation between recombination rate and diversity is higher in Nigerian chimps than in western chimps, which have a much lower recent effective population size. We found that diversity is negatively correlated with exon density across all species (Figure 3.10D). Contrary to what we observed with recombination rate, the correlation between exon density and diversity was even stronger in most other apes than in humans. Species with smaller Ne tend to show weaker correlation between diversity and exon density (see Nam et al., 2017 for related 75 20 μ-sd=0.070 Mutation rate 10 Constant ₚ Local variation μ =1e-12 μₙ=1.2e-08 μₚ=1e-12 0 μₙ=1.4e-08 μₚ=1e-12 Regime Empirical data Empirical data Neutral Deleterious -10 Positive μₚ=1e-11 Both Neutral -20 -75 -50 -25 0 25 PC1 (87.47%) Figure 3.9. Principal component analysis (PCA) visualization of data and simulations. Colors differentiate the empirical data from simulations with different parameters: “Neutral” refers to the simulation without any selection, “Deleterious” refers to simulations with deleterious mutations, “Positive” refers to simulations with beneficial mutations, “Both” refers to simulations with both beneficial and deleterious mutations. The shape of the points differentiate simulations with constant mutation rate along the genome and variable local mutation rates. Local variation in mutation rates were added on top of three simulations: neutral, deleterious with µn = 1.4× 10−8, and both with µn = 1.2× 10−8 and µp = 1× 10−12. The PCA performed on a matrix containing all pairwise correlations between landscapes across the great apes (i.e., all π−π, π−dXY and dXY −dXY comparisons) for the great apes dataset and simulations (with selection and with mutation rate variation). We excluded simulations with µp ≥ 1 × 10−10 from the PCA analysis because PC2 was capturing negative correlations caused by strong positive selection — as seen in Figure 3.7F. 76 PC2 (5.56%) findings). A striking feature of the correlations of between-species divergence and genomic features, shown in Figure 3.10, is that the correlations get stronger with the amount of phylogenetic time that goes into the comparison (i.e., the TMRCA), in a way that is roughly linear with time. To describe why this increase in correlation with time might occur, we turn to an analytic approach. Genetic divergence (D) in the ith window between two species that split t generations ago can be decomposed as: Di(t) = πi(t) +Rit+ εi, where πi(t) is the genetic diversity in the ancestor at time t, Ri is the substitution rate in the window and εi is a contribution from genealogical and mutational noise (which has mean zero). This decomposition follows from the definition of genetic divergence as the number of mutations since the common ancestor, as depicted in Figure 3.3 (see how D = πancV X + 2RTVWXY ). The covariance between D(t), the vector of divergences along windows, and a genomic feature X is, using bilinearity of covariance, Cov(D(t), X) = Cov(π(t), X) + tCov(R,X) + Cov(ε,X). (3.1) Happily, this equation predicts the linear change of the covariance with time that is seen in Figure 3.10C and perhaps Figure 3.10F. However, caution is needed because the correlation between diversity and the genomic feature (Cov(π(t), X)) may be different in different ancestors, and indeed the inferred effective population size is greater in older ancestors in the great apes (Figure 3.1). Next consider covariances of diversity with recombination rate, Figure 3.10C. Consulting the equation above, the fact that the covariance between divergence and recombination rate increases with time can be caused by two 77 factors (taking X to be the vector of mean recombination rates along the genome): (i) a positive covariance between substitution rates and recombination rates (Cov(R,X) > 0), and/or (ii) greater genetic diversity in longer ago ancestors (Ne(t) larger for larger t). It is unlikely that the increase in Ne in more ancient ancestors was sufficient to produce the dramatic increase in covariance seen in Figure 3.10C, since it would require Cov(π(t), X) to be far larger in the ancestral species than is seen in any modern species. On the other hand, there are various plausible mechanisms that would affect Cov(R,X). One factor that certainly contributes is the “smile”: we found that divergence increases faster near the ends of the chromosomes where recombination rate is greater, probably in part because of GC-biased gene conversion. Interestingly, positive and negative selection are predicted to have opposite effects here: greater recombination rate increases the efficacy of both through reduced interference among selected alleles, so positive selection would increase substitution rate and hence increase Cov(R,X), while negative selection would decrease Cov(R,X). When considering only the middle half of the chromosome (i.e., excluding the effect of gBGC) (Figure A.7), the covariances between divergence and recombination rate flip to negative, and they continue to decrease over time. Thus, it seems that negative selection is the most important driver of divergence in the middle, whereas gBGC strongly affects the tails of the chromosome. The covariance of diversity and exon density has a less clear pattern (Figure 3.10F), although it generally gets more strongly negative with time. This decrease could be a result of a negative covariance between substitution rates and exon density and/or an increase in the population sizes of the ancestors (if Cov(ν,X) < 0, as expected since ν is relative diversity and X is now exon density). 78 As before, positive selection in exons would be expected to produce a positive covariance between exon density and substitution rate, while negative selection would produce a negative covariance. It is hard to determine a priori which is likely to be stronger, because although negative selection is thought to be much more ubiquitous, a small amount of positive selection can have a strong effect on substitution rates. The fact that covariance generally goes down with time suggests that negative selection (i.e., constraint) is more strongly affecting substitution rates. It is at first surprising that the correlations between exon density and divergence go up with time, but the covariances go down with time (Figure 3.10E,F). However, correlation is defined as Cor(Dt, X) = Cov(Dt, X)/ SD(Dt) SD(X). Thus, if the variance in divergences increases over time the correlations will decrease over time. Indeed, we see this happening as gBGC increases divergences on the ends of the chromosome faster than in the middle, leading to an increase in variance of divergence along the genome. This also explains why correlations of landscapes of very recent times are very noisy, but covariances are not. Indeed, the patterns are clearer when we exclude the tails of the chromosome (Figure A.7): there is only a modest increase in the correlation between exon density and divergence over time and the covariances go down with time more linearly. 3.4 Discussion A central goal of population genetics is to understand the balance of evolutionary forces at work in shaping the origin and maintenance of variation within and between-species (Lewontin, 1974). While the field has been historically data-limited, with the current flood of genome sequencing data, we are poised to make progress on such old questions. Over the past decades, an important lever 79 A B C 0.0015 ●● ● ● ● ● ● 0.5 ● 0.5 ● ● ● ● ● 0.4 0.4 0.0010 Species ● ● bonobo 0.3 ● 0.3 ●● ●● bornean−orangutan● ● ● central−chimp ● ● ● eastern−chimp ● 0.2 0.2 0.0005 ● eastern−gorilla ● humans ● ● nigerian−chimp 0.1 0.1 ● sumatran−orangutan ● ● western−chimp● ● western−gorilla 0.0 0.0 0.0000 0 250 500 750 1 000 0 250 500 750 1 000 Split time (103) Split time (103) Includes low Ne species? ● No D E F Yes −0.1 −0.1 ● ●●●● ● MRCA ● −0.2 −0.2 −0.0005 ● ● african−apes ● ● ● chimps ● ● ● eastern−central● −0.3 −0.3 ● gorilla●● −0.0010 ● great−apes ● ● ● ●● ● human−pan● ● −0.4 −0.4 ●● ● nigerian−western● ● ● ● orangutans −0.0015 ● pan −0.5 ● −0.5● ●● ● −0.0020 −0.6 ● −0.6 ●● 0 250 500 750 1 000 0 250 500 750 1 000 Split time (103) Split time (103) Figure 3.10. Correlations and covariances between landscapes of diversity and divergence and annotation features in the real great apes data. Exon density and recombination rates were obtained as detailed in Figure 3.2. Split time is the time distance between the two species involved in the divergence. Points are colored by the species of within-species diversity (π) in plots A and D. In plots B,C,E,F, the points are colored by the most common recent ancestor of the species for which between-species divergence was computed. Species with low Ne — for which the estimated species N was less than 8 × 103e : bonobos, eastern gorillas and western chimps — have a different point shape. 80 Cor( π , % Exon ) Cor( π , Recombination rate ) Cor( dXY , % Exon ) Cor( dXY , Recombination rate ) Cov( dXY , % Exon ) Cov( dXY , Recombination rate ) in understanding the relative impact of genetic drift versus selection in shaping genomic patterns of variation has been to examine the relationship between levels of diversity and genomic features, such as recombination rate and exon density. The overarching observation has been that regions of reduced crossing over generally harbor less variation than regions of increased crossing over in many but not all species (e.g., Begun & Aquadro, 1992; Corbett-Detig et al., 2015). This observation is consistent with a role for linked selection shaping patterns of variation in recombining genomes, but the relative contributions of deleterious and beneficial mutations is still largely unknown. Indeed, it seems likely that some complex mixture of both processes shapes variation in natural populations (Kern & Hahn, 2018). In this paper, we moved beyond genetic diversity within a single species to look at how divergence between closely related species changes with time and how this correlates with genomic features. Previous studies (e.g., Stankowski et al., 2019) looked at similar patterns (in monkeyflowers) and found strong correlations between landscapes of diversity and divergence between related species, despite deep split times. Landscapes of closely related species can remain correlated for two main reasons (i) shared ancestral variation or (ii) shared heterogeneous process. If two species recently split, their landscapes of diversity are expected to be correlated due to shared ancestral variation. If the process that structures genetic diversity along chromosomes is heterogeneous and somewhat shared between-species, then their landscapes are expected to remain correlated over longer periods of time. For example, if the effects of selection are concentrated in the same genomic regions in two species, then their landscapes of diversity will be correlated. By incorporating information from multiple species at once, we are able to pool 81 information across species and thus increase our power to disentangle the role of different evolutionary forces. Patterns across multiple species are more likely to be robust to the idiosyncrasies of any one species, such as demographic history. For instance within-species metrics can be confounded by demography: demographic events can create spurious troughs of diversity (Simonsen et al., 1995) or exacerbate the effects of background selection on diversity (Torres et al., 2018). However, correlations between landscapes can only be produced due to shared ancestral variation or a shared heterogeneous process. In the great apes, we found that landscapes of within-species diversity and between-species divergence are highly correlated across the phylogeny. Those correlations are often stronger than those that have been historically used as evidence for the effects of selection on genetic variation. For example, the correlation between genetic diversity in humans and exon density is −0.2, yet the correlation between diversity in humans and diversity in western gorillas is 0.48. This stronger correlation may not be entirely due to shared landscape of selection — it may also be a result of shared ancestral variation (and incomplete lineage sorting), mutation rate variation, and/or GC-biased gene conversion. To understand how much of the correlation between landscapes can be attributed to ancestral variation, we performed extensive simulations of the great apes’ evolutionary history, and found that ancestral variation explains very little of the correlations we observed. Thus, a shared heterogeneous process seems to be needed to explain the data. Two neutral processes can be heterogeneous along the genome and shared across species: GC-biased gene conversion and mutation. GC-biased gene conversion (gBGC) is thought to be an important factor in shaping levels of 82 variation in humans (Chen et al., 2007; Glémin et al., 2015; Pouyet et al., 2018), and it has similar effects to those of natural selection. However, if gBGC were a major driver of correlations we would expect to see a difference in overall levels of correlation between different classes of substitution, and we do not (Figures 3.6 and A.4). As such gBGC seems to be a minor contributor to the correlations we observe, although it does seem to be leading to increased substitution rates near the telomeres (where divergences are increasing roughly 5% faster; see Figure 3.2 and Figure A.1). In birds, an excess of divergence near telomeres has been attributed to meiotic drives (Ellegren et al., 2012). When the history of the great apes is simulated with a shared heterogeneous mutation map, correlations between landscapes do emerge. These were as strong as seen in the data when the rates were drawn from a normal distribution with a standard deviation of the mutation rate of at least a 7% of the mean mutation rate. However, our mutation map was perfectly shared among was species in our simulations, so it is possible that a mutation map which changes over time might move closely to match the data. T. C. A. Smith et al. (2018) estimated the standard deviation of de novo mutation rate in humans at the 1Mb scale to be around 25% of the mean mutation rate. However, the lack of congruency in de novo mutations identified in different data sets raises questions about the role of ascertainment biases that need to be addressed in future studies (Castellano et al., 2020). Our simulations showed a facet of shared mutational heterogeneity along the genome that we do not observe in real data: with variable mutation rate correlations increase over time, whereas in the real data they decrease. It is unknown how conserved mutation rate heterogeneity is across the great apes, so it remains to be seen how an evolving heterogeneous mutation rate map affects 83 landscapes of diversity and divergence. A major driver of mutation rate variation stems from CpG dinucleotides, which have much higher mutation rates than other sites (Agarwal & Przeworski, 2021; Hodgkinson & Eyre-Walker, 2011; Nachman & Crowell, 2000). Nevertheless, when we partitioned the landscapes of divergence by mutation types, we did not see an excess of correlation between landscapes with mutations that can be affected by CpG-induced mutation rate variation (Figures 3.6 and A.4). Natural selection can also structure genetic variation heterogeneously along the genome. In simulations, both positive and negative selection are needed for the correlations between landscapes to resemble the data. By examining the correlations between landscapes (summarized in Figure 3.9), we found that the best fitting simulation is the one with a beneficial mutation rate within exons of 1 × 10−12 and deleterious rate within exons of 1.2 × 10−8. Positive selection seems to be needed to explain one particular feature of the data: the separation between correlations involving a low Ne species (i.e., correlations are lower if diversity is computed in a species with a low Ne – as seen in humans, bonobos, western chimps and eastern gorillas; see Figure 3.4). Bottlenecks can erase sweep signatures (Jensen et al., 2005; Nielsen et al., 2005; Przeworski, 2002), but demography does not affect local variation in mutation rates, and it can exacerbate signatures of background selection (Torres et al., 2018). Thus, if sweeps are causing correlations between landscapes, we expect it to be more sensitive to the strong bottleneck in humans than the other processes. This conclusion largely agrees with previous studies which found that positive selection is necessary to explain reduction in genetic diversity surrounding genes in the great apes (Nam et al., 2017). 84 Another way we might characterize our simulations is through examination of substitution processes. In our best fitting simulation, we get a fixation rate of beneficial mutations of around 1 × 10−9 per generation per exon base pair, what amounts to approximately 9% of the fixations within exons (along the human lineage), i.e., about one new fixation of a beneficial mutation every 250 generations. Fixation rate within exons is decreased to around 60% of the rate in our neutral simulation due to the constant removal of deleterious mutations within these regions. Indeed, previous studies (Boyko et al., 2008; Laval et al., 2021; Zhen et al., 2021) have estimated that between 10% and 16% of amino acid differences between humans and chimpanzees were caused by positive selection, which is strikingly similar to our best fitting simulation. We would expect to see the fixation of around 16 beneficial mutations in the past 4000 generations, which is close to the number of hard sweeps genome scans for selections have found in humans over this same time period (Schrider & Kern, 2016, 2017). Our best fitting simulation with selection assumes that 60% of new mutations within exons are deleterious, similar to estimates from the site frequency spectrum (Boyko et al., 2008; Huber et al., 2017; B. Y. Kim et al., 2017). Thus while we have not done exhaustive model fitting due to computational constraints, our simulations reproduce estimates from studies which model a different facet of genetic variation (i.e., the site frequency spectrum). Heterogeneous processes that correlate with a genomic feature will create differences in rates of substitution along the genome that correlate with the genomic feature. As shown in Equation (3.1), this implies that the covariance along the genome between a genomic feature and divergence is expected to increase with time, and the rate of increase is equal to the covariance between 85 that feature and the substitution rate. (It is important to note that varying covariances with ancestral diversity can be a confounding factor, and that the observation applies to covariance, not correlation.) Indeed, the covariance between divergence and recombination rate increases roughly linearly with time (see Figure 3.10C), as expected because the rate of gBGC-induced fixations are correlated with recombination rate. Once this effect is removed (see Figure A.7F), the covariance between exon density and divergence decreases linearly with time, as we would expect due to the effects of negative selection directly removing deleterious mutations in or near exons. The magnitude of this slope might produce a quantitative estimate of the strength of this effect, although more work is needed to disentangle confounders. It is important to contrast this observation, which applies mostly to the direct effects of selection, to other observations which also include linked effects (as discussed in Phung et al., 2016). Although simulations allow for more biological realism, we made assumptions to constrain the parameter space explored. Our simulations used randomly mating populations of constant size, as inferred in Prado-Martinez et al. (2013). These population sizes were inferred using neutral model, and so are likely affected by the effects of selection (Jensen et al., 2005). Mean levels of diversity and divergence in our neutral simulation match the data, but simulations with natural selection differ, at times substantially (Figure A.9). On the other hand, our simulations with selection match the data more closely with respect to standard deviation in levels of diversity and divergence along genomes. However, inaccuracy of the demographic model should not affect any of our main observations, because the effects of demography on levels of variation along genomes are not shared across multiple species. 86 We chose exons to be the targets of selection in our simulations. Exons cover about 1% of the human genome, and in reality selection affects non-coding regions as well. However, a substantial portion of this selection affects cis- regulatory regions, whose density along the genome is well-predicted by coding sequence itself. Furthermore, highly conserved noncoding sequences have long been identified and characterized as functional (Bejerano et al., 2004; Katzman et al., 2007; Siepel et al., 2005). In the great apes, non-coding diversity is correlated with recombination rate, pointing to the role of selection (Castellano et al., 2020). However, it would be circular to include conserved noncoding elements in our simulations because such elements are identified based in part on levels of divergence, which themselves depend on ancestral levels of genetic diversity. Because conserved noncoding elements generally occur close to coding regions of the genome (at the 1Mb scale, the correlation between density of exons and PhastCons elements is around 0.6), we might expect a more realistic model to have the same amount of selection (in terms of total influx of selected mutations), but spread out over a somewhat wider region of the genome since we have omitted such sites. Even without considering all potential targets of selection, patterns of genetic diversity in our simulations match the data well: we see a correlation between simulated and observed diversity of 0.45 for chimps (Figure A.8), for example. While it has long been recognized that genetic variation among species might be structured similarly due to shared targets of selection, our results demonstrate that these correlations contain important information about the processes at work that has yet to be utilized fully. Here we have used large-scale simulations to demonstrate the combination of forces required to pattern shared divergence and diversity as we observe it in nature. Indeed, our results show that 87 a combination of negative and positive selection, GC-biased gene conversion and mutation rate variation all contribute in shaping genetic variation in the great apes. Although some processes are not necessarily needed to recapitulate the real data (e.g., mutation rate variation), positive selection seems to be the only force that can explain most of our observations. There is clearly a need for future analytical work that might describe expected correlations across the genome given variation in local mutation, recombination, and selection. Further, statistical model fitting, based on theory or simulation is clearly desirable, although our experience suggests that the latter approach would prove computationally expensive. 88 3.5 Bridge As presented in Chapter III, simulations can be useful to help us distinguish between evolutionary models. We showed that although many processes are consistent with great apes population genomic data, positive selection seems necessary to fully explain the observed data. Beyond qualitative assessments, in many cases we might want to infer parameters from a model. The evolutionary simulations of the entire great apes history were too costly to allow for proper parameter estimation. In Chapter IV, we present a new method for simulation- based parameter estimation using whole-genome genealogies. This method is expected to resolve a different bottleneck that plagues evolutionary inference: the scaling with number of samples and genome sizes. Whole-genome genealogies are much more efficient than genotype matrices (matrices with dimensions of number of samples by number of sites). Moreover, they are expected to organize genetic information in a way that can be more conducive for estimating evolutionary parameters. Our main contribution, a deep learning architecture that takes whole- genome genealogies as input, seems to perform well over different tasks. More importantly, it could enable evolutionary inference over biobank scale datasets (that contain millions of samples). 89 CHAPTER IV A POWERFUL MACHINE LEARNING FRAMEWORK FOR EVOLUTIONARY INFERENCE USING WHOLE-GENOME GENEALOGIES Nathaniel S. Pope, Peter L. Ralph and Andrew D. Kern are co-authors on this manuscript. Co-authors and I conceptualized the study, Nathaniel S. Pope and I developed the machine learning framework, I performed analyses with input from my co-authors, I wrote the manuscript with editorial assistance from my co- authors. 4.1 Introduction Processes like mutation, recombination, assortative mating and selection leave footprints on genetic variation. A major goal of population genetics has been to invert this relationship and infer past, unobserved evolutionary events from genetic variation data (Schraiber & Akey, 2015). With the dramatic increase in our ability to generate whole-genome data, there is a growing need for more efficient inference methods capable of scaling well to handle over tens of thousands of individuals. Although our interest is to infer evolutionary processes from population genetic data, it is usually simpler to first think about the effects of these processes on the shape of underlying genealogies (Wakely, 2016), and how this ultimately translates to observed genetic variation. In the absence of recombination, a genealogy is a tree that describes the shared ancestry of a sample of DNA molecules (e.g. haploid individuals), until they “coalesce” into a single common ancestor. The shared ancestry of individuals reflects historical processes that acted upon their ancestors. For example, recent contractions in population sizes will lead to genealogies where individuals coalesce relatively recently. 90 For recombinant DNA, the genealogy is not a tree but instead a graph that encodes the transmission of genetic material from ancestors to present-day individuals at every point in the genome, known as the “ancestral recombination graph” (ARG). The ARG records two types of historical events: coalescences (where separate lineages merge into a common ancestor), and recombinations (where a single lineage splits into multiple lineages, moving backward in time). Crucially, ancestors are “local” in the sense that only a subset of the ancestors in the entire ARG are associated with each location in the genome. Thus, one factorization of ARG is into a sequence of correlated marginal trees, where changes in tree structure are caused by recombination events (Kelleher et al., 2016a, 2019; Speidel et al., 2019). The branches of these trees have two dimensions: the time separating descendant from ancestor, and the length of sequence spanned by the tree containing the branch. A second factorization is into the minimal set of ancestral haplotypes and transmission paths between these ancestors, which is equivalent to collapsing coalescence nodes and branches across marginal trees. If a genetic variant is segregating in a sample of present-day individuals, then the mutational events leading to that variant can be mapped to particular edges in the ARG. In particular, neutral mutations occur on a given edge at a rate proportional to its area. Thus, the distribution of variant frequency reflects the shape of the underlying ARG. More generally, summary statistics of frequencies of neutral variants are noisy measurements of equivalent summary statistics of branch areas across marginal trees in the underlying ARG (Ralph et al., 2020). A typical strategy for evolutionary inference is to compare summary statistics (based on genotype matrices) to their expectations under a parametrized model. For example, the site frequency spectrum (SFS) describes the distribution 91 of allele frequencies across sites in the genome, and is frequently used to infer the demographic histories of natural populations (Gutenkunst et al., 2009; Schraiber & Akey, 2015). However, individual statistics that summarize genetic variation cannot capture all aspects of the data that are informative of the underlying processes. These limitations may be circumvented to some degree by using multiple summary statistics either in a composite likelihood framework or with likelihood-free methods (Caldas et al., 2022; DeGiorgio et al., 2016; Nielsen et al., 2005; Pavlidis et al., 2013; Sheehan & Song, 2016); and by stratifying data into windows across chromosomes (Flagel et al., 2019; Schrider & Kern, 2016; Sheehan & Song, 2016). However, there is inevitably a tradeoff between granularity and informativeness, both in the choice of summary (from raw genotypes to compact summaries like the SFS) and scope (from single bases to large windows). Assume the underlying ARG for a dataset could be observed. Using the ARG as an input would solve many of the issues above (Rasmussen et al., 2014). This is because it is simultaneously: (A) sufficient, in the sense that it contains all information obtainable from observed genetic data; (B) compact, in the sense that shared genetic variation is encoded as relationships between ancestral haplotypes; (C) multiscale, in the sense that the span of ancestral haplotypes and edges naturally reflects the ”spatial” and temporal scales at which shared ancestry impacts variation, without arbitrary discretization into windows. This core idea of using ARGs for evolutionary inference has two challenges in practice. First, we cannot observe the ARG directly but instead we have to infer it. There are approaches for inferring ARGs in a more-or-less scalable way (Kelleher et al., 2019; Speidel et al., 2021; Zhang et al., 2023). But these will inevitably contain error (e.g., in genotyping, phasing, etc.), so any inference method 92 using inferred ARGs as input must be able to model error either explicitly or implicitly. Second, we need some way to “parameterize” the ARG in terms of quantities we are interested in. This would traditionally be done by specifying some generative process, like the coalescent with recombination, and calculating likelihood of observed genetic data conditional on this process (Fan et al., 2023). However, these likelihoods are only tractable under a very small class of models. Likelihood-free inference seems to be the most promising solution, specially now given recent advances in population genetic simulation tools that allow ARG recording (Haller et al., 2019; Kelleher et al., 2016a; Ralph et al., 2020). Graph neural networks (GNNs), an emerging class of deep learning algorithms, provide a natural way to exploit ARGs for evolutionary inference. These networks fall into the broader message passing paradigm whereby information is aggregated in neighborhood of nodes, ultimately yielding an embedding for nodes in a graph (Battaglia et al., 2018; Bronstein et al., 2017; Hamilton et al., 2018). These node embeddings can then be used for downstream tasks such as node classification, graph classification, and edge prediction. GNNs can leverage information from genealogies (i.e., historical relationships between nodes and the timing of coalescent events) (Korfmann et al., 2023), but they would not take into account additional structure in ARGs induced by the spatial (along the chromosome) and temporal (from parent to child) ordering of nodes. Here, we present a new method that uses whole-genome genealogies for evolutionary inference. Our main goal is to develop an architecture that can efficiently use genealogies for inference at different levels. As a proof-of-concept, we test our method on dating mutations in an ARG. Our approach, tsNN, outperforms an out-of-the-box graph neural network and current likelihood-based 93 approaches. Taken together, our results demonstrate that tsNN is a powerful and flexible framework for leveraging information from whole-genome genealogies for evolutionary inference. 4.2 Methods 4.2.1 tsNN. Taking inspiration from the graph neural network literature (Rossi et al., 2020), we developed an encoder that can take whole-genome genealogies as input and generates node embeddings (Figure 4.1). In summary, we obtain a lower dimension representation of these genealogies by passing messages between nodes. We start with an initialized embedding for nodes, and then we use an update scheme that takes an edge and updates the embedding of both child and parent nodes. Because the flow of information is structured from parent to child and from left-to-right, we hoped this neural network would learn a better representation of a tree sequence than a more naive approach, such as a simpler graph neural network. The node embeddings can then be decoded to obtain embeddings at different levels (e.g., at the edge, mutation or tree sequence level). First, we define an encoder module for obtaining node embeddings. We randomly (or arbitrarily) initialize a vector with node features H of dimension number of nodes by number of node features. From a given tree sequence, we also compute a vector E of edge features of dimension number of edges by number of edge features. We define two edge features: the span (width in number of base pairs that an edge spans, scaled by the mutation rate) and edge length (the number of mutations on an edge). We traverse the tree sequence over edge differences (from letft-to-right, and optionally from right-to-left), and with each edge addition, we build a message that is used to update the node features H. The message for the edge that links the parent node p to the child node c is 94 A. Input 6 1.93 5 1.21 1 2 3 4 4 4 4 0.19 4 4 0.19 3 0.00 0 1 2 0 1 2 0 1 2 0 216 386 1000 Genome position B. Encoder Edge   Message Updated   e mbedding Input GRU     Node   e mbedding Hidden state        C. Decoder  Neural network that takes n ode embeddings as input - ARG classification - Allele age prediction - Node clustering ...  Figure 4.1. Schematic representation of the tsNN algorithm. A) An example tree sequence that can be used as input to tsNN. Edges that differ from the previous tree are highlighted with thicker lines. Note how the embedding of node 4 (h4) is updated with edge differences along the tree sequence. B) The tsNN encoder updates node embeddings as it iterates over edge insertions along the tree sequence. The update is done with a Gated Recurrent Unit (GRU) that takes the span and length of the edge along with the node embedding for the parent and child nodes. C) The node embeddings can be decoded in different ways for a supervised task. 95 Node time (generations) mpc = epc ∥ (hp ∥ hc) where epc is the row of E for the edge p − c, and hp and hc are the rows of H for nodes p and c. hp and hc are then updated with a Gated Recurrent Unit cell, such that houtp ∥ houtc = GRU(mpc, hp ∥ hc) The node embedding module above can be combined with different neural network architectures to decode the node embeddings to infer parameters at different levels. For our mutation time task, we first compute edge embeddings, where for each edge we concatenate the corresponding parent and child node embeddings. For each mutation, we assign its embedding as the corresponding edge embedding (we ignore mutations that map to more than one edge). Then, we feed these embeddings into a Multi-layer Perceptron (MLP) of an arbitrary shape with an output size of 1 (one predicted mutation time for each mutation). 4.2.2 GNN. We also tested a simple graph neural network architecture (GNN), similar to Korfmann et al. (2023). The GNN leverages topological information, as well as edge features similarly to tsNN. The GNN performs graph convolutions (passing information between related nodes) to obtain a node embedding. However, most spatial (along chromosome) information is lost. These node embeddings can then be decoded in the same way as described above for tsNN. 4.2.3 Training and validation simulations. We used stdpopsim to simulate the ARG of a sample (100 diploid individuals) under the three 96 population Out-of-Africa demographic model (OutOfAfrica 3G09), with constant recombination rate of 1.3 × 10−8 and mutation rate of 2 × 10−8. To avoid leaking of information from the true simulated ARGs, we reordered the nodes by mean number of descendants (as opposed to true node times) with the constraint that parent nodes are older the respective child nodes. The only features, beyond the topologies, that the neural networks saw were edge features (span and length). Span was scaled by the mutation rate and edge length was computed as the number of mutations on an edge. We simulated 900 replicate ARGs under this model for training and 100 for validation (note no hyper-parameter tuning was performed). To compare tsNN and GNN, we simulated 10Kb genomes. However, we also trained tsNN on 1Mb genomes to understand the relationship between sequence length and accuracy. We trained both models with mean squared error loss on the log10 transformed mutation times. 4.3 Results We devised a test to better understand whether a neural network can extract relevant evolutionary information from ARGs. The process of ARG inference is usually split into two steps: (i) local tree topology estimation, and (ii) inference of branch lengths and node times is performed using a “molecular clock” (after mapping mutations onto trees). The second step is straightforward, but strong coalescent priors are used and dating is heavily affected by issues in the data. Thus, we reasoned that estimating times from tree sequences is a simple, yet fundamental task in ARG inference. The choice of estimating mutation times, as opposed to node times, is due to the fact that true nodes are never known in practice, as the only actual data we can observe are mutations. Importantly, 97 we sought to estimate mutation times by presenting the neural networks with a modified true ARG, in which nodes are reordered based on the mean number of descendants. Further, the network only sees span (scaled by the mutation rate) and edge lengths (as the number of mutations in an edge). 4.3.1 Comparing tsNN and GNN. The fundamental difference between tsNN and GNN is the fact that tsNN induces a particular ordering to updating node embeddings, that we hypothesize is important for extracting evolutionary information. The node embeddings are updated with the edge insertion oder along a tree sequence (Figure 4.2). Because younger haplotypes persist far longer than old haplotypes, a young haplotypes will be involved in many more updates than an old one. 3500 2 3000 0 2500 2 2000 4 1500 6 1000 8 500 10 0 0.0 0.2 0.4 0.6 0.8 1.0 Position 1e7 Figure 4.2. Edges in an arbitrary tree sequence, with the edge traversal order in tsNN represented by a color gradient. Edges are represented as horizontal lines, where the y-axis denotes the age of an edge’s child and x-axis denotes the left and right positions along the chromosome. 98 Age of edge's child Edge traversal order We found that tsNN outperforms the GNN in inferring mutation times (Figure 4.4), even though the GNN architecture has about 6 times more parameters (560,000 trainable parameters in the GNN versus 90,000 parameters in the tsNN). Current dating algorithms can achieve an R2 of around 0.9 (Wohns et al., 2022) using the true ARG. Our model trained on 10Kb sequences fail to properly estimate the time of mutations younger than log210 generations, greatly affecting overall accuracy. Mutation time (R2 = 0.701) 4 3 2 1 0 1 0 1 2 3 4 5 Target (a) tsNN Mutation time (R2 = 0.621) 4 3 2 1 0 1 0 1 2 3 4 5 Target (b) GNN Figure 4.3. Accuracy of tsNN and GNN in predicting mutation times. Scatterplots show predicted against target mutation times for 100 tree sequences. Times are log10 transformed. 99 Prediction Prediction 4.3.2 Improving inference of mutation times. The failure to correctly infer the ages of young mutations could be caused by two factors: (i) the training dataset does not contain enough young mutations, and (ii) the model needs larger sequences (more than 10Kb) to better learn the recombination clock (i.e., the spans of young haplotypes will not be artificially truncated by 10Kb). To test what could be causing this issue, we changed the training data either by scaling the mutation rate or by increasing the simulated chromosome. We found that accuracy increases as mutation rates increases (Figures 4.4a and 4.4b), suggesting that the number of young mutations does limit learning. However, increasing the chromosome length increases accuracy even more (Figure 4.4c). Thus, it seems the network is learning the recombination clock and using it to improve prediction of mutation times. 4.4 Discussion Ancestral Recombination Graphs (ARGs) can solve many issues currently plaguing evolutionary inference, because it fully encodes evolutionary outcomes in a compact manner. Treating the ARG as a latent parameter to be averaged out is not easy, because the state space is overwhelmingly large (Griffiths & Marjoram, 1996; Nielsen, 2000). More tractable approximations, such as the Sequentially Markovian coalescent (SMC), have been widely applied in evolutionary inference (McVean & Cardin, 2005; Schraiber & Akey, 2015), but the simplifying assumptions of these models limit their utility. An alternative lies in shifting towards inferring a single plausible ARG (Kelleher et al., 2019; Speidel et al., 2019; Wong et al., 2023). There is a clear need for methods that leverage an inferred ARG for population genetic inference and to quantify potential gains in accuracy. Nevertheless, it is not obvious how to use the all the information encoded in ARGs 100 (a) 10Kb with mutation rate of 2× 10−7 (b) 10Kb with mutation rate of 2× 10−6 (c) 1Mb with mutation rate of 2× 10−8 Figure 4.4. The effect of number of mutations and chromosome length on accuracy. Scatterplots show predicted against target mutation times for 100 tree sequences. Times are log10 transformed. 101 for inference, and current studies either compute summary statistics based on ARGs or assume independency between marginal trees (Fan et al., 2023; Hejase et al., 2022). Our main goal has been to develop a framework that can leverage most of the information encoded in ARGs for evolutionary inference. We show that our neural network, tsNN, can learn how to date mutations in an ARG, outperforming likelihood-based models (tsNN R2 = 0.927, tsdate R2 = 0.902) (Wohns et al., 2022). By comparing performance on different training sets, we demonstrate that the neural network learns to leverage both the mutation and recombination clocks to produce accurate estimates of mutation times. Although ARGs carry sufficient information for inferring evolutionary events, ARG inference methods could significantly impact downstream applications. Indeed, the two most widely adopted ARG inference programs, tsinfer and Relate, tend to overestimate small coalescence times and underestimate large ones (Y C Brandt et al., 2022). In our mutation time inference task, we used the true ARGs for training and validation, but some of these biases can be alleviated in a supervised machine learning framework by training on inferred ARGs, as opposed to true ARGs. A trickier issue might arise when the true generative process, which yielded the real data, does not match that of the training data. In our tests, we assumed that the training and validation data came from the exact model. In the future, tests where we slightly alter the simulation model for training will aid in understanding the robustness of tsNN to model mis-specification. Previous studies have shown that machine learning methods can be as robust to model mis-specification as their likelihood-based counterparts (Hejase et al., 2022). A 102 new approach to deal with the mis-specification issue, called Domain-Adaptive Networks, can mitigate mis-speficication by leveraging target-domain data in an unsupervised manner (Mo & Siepel, 2023). An exciting avenue for future research lies in expanding the tasks that tsNN can perform, what can be accomplished by developing different decoders. For example, the rich information contained in the ARG could be used to distinguish between complex demographic models that are unidentifiable with allele frequency data (e.g., the site frequency spectrum) (Fan et al., 2023; Schraiber & Akey, 2015). The inference of selective sweeps has much to gain from leveraging ARGs, mostly because it allows us to bypass the specification of a window size. The span of ancestral haplotypes and edges naturally encode both the spatial and temporal scales over which processes impact variation, and an ARG-based method can implicitly leverage this information. With the recent advances in scalable ARG inference methods, ARG-based evolutionary inference methods are still in its infancy. A powerful framework that is able to efficiently leverage all the rich information contained in ARGs is needed. The field of graph neural networks is ripe with ideas that can be applied to ARGs. Indeed, we demonstrate that our method, tsNN, represents an important step in this direction, but there still much work to be done in applying deep learning to gain new insights into past evolutionary events. 103 CHAPTER V CONCLUSION Understanding the balance of evolutionary forces shaping the origin and maintenance of genetic variation has been the core driver of population genetics (Lewontin, 1974). Our ability to collect data has exponentially increased over the last few decades, moving from allozyme gels of a few samples to whole-genome data of thousands of individuals. With this flood of data, we are poised to make huge progress on long standing evolutionary questions. However, the traditional framework for evolutionary inference has somewhat stalled new discoveries. Many of the issues we are facing can be mitigated with evolutionary simulations. A great deal of progress has been made on evolutionary simulation tools (Adrion et al., 2020; Haller & Messer, 2019; Haller et al., 2019; Kelleher et al., 2016a). These advancements, coupled with huge increases in computational power, now allow us to use simulations to effectively infer previously intractable likelihoods, and to explore features of the data that are not easy to model mathematically. Indeed, over the past decade simulation-based inference has gained immense popularity (Caldas et al., 2022; Chan et al., 2018; Korfmann et al., 2023; Schrider & Kern, 2016; Torres et al., 2018). As the questions and models increase in complexity, there is a growing need for standards in simulation models and for increase reproducibility. stdpopsim is a community-maintained library for previously published simulation models, which includes species-specific population genetic parameters (e.g., mutation rates, recombination maps) as well as demographic and selection models. In Chapter II, I presented my contributions to evolutionary simulation tools, which will help facilitate simulation-based inference in population genetics. Much more is needed 104 still in two fronts: (i) benchmarking our current tools under a common variety of evolutionary scenarios, and (ii) verifying for consistency of published models, for example by ensuring that the models can yield simulated data that actually resembles the real data. It is now easier than ever to use evolutionary simulations to better understand complex models and features of the data for which there is no theory yet. Indeed, simulations have opened up new opportunities to better understand interactions between evolutionary processes. For example, Schrider (2020) used more realistic simulations to show that the footprints selective sweeps are not as easily confounded by background selection as previously thought (Andolfatto, 2001). In Chapter III, I leveraged complex and realistic simulations of the entire great apes history to learn about which processes have shaped genetic variation in the group. Without simulations, it would not be possible to jointly study the effects of positive and negative selection, mutation rate variation and GC-biased gene conversion on large-scale patterns of genomic variation. Another factor that is bound to improve evolutionary inference is the shift towards Ancestral Recombination Graphs (ARGs). This data structure is compact, and so it can enable inference over larger scales, both in number of samples as well as in number of sites. ARGs are now used as a backbone to many different evolutionary simulation tools, allowing for a better integration (Haller et al., 2019; Kelleher et al., 2016b). The field has moved away from treating the ARG as a latent parameter to be averaged out (Griffiths & Marjoram, 1996; Nielsen, 2000) to inferring a single plausible ARG (Kelleher et al., 2019; Speidel et al., 2019). The state space for possible ARGs is overwhelmingly large, so this shift allows us to leverage ARGs at scale. However, it is still unclear how to actually use all the 105 information encoded in ARGs for inference. Indeed, many recent studies instead either develop topological summary statistics or treat marginal trees in the ARG as independent (Fan et al., 2023; Hejase et al., 2022). In Chapter IV, I proposed a new neural network framework that can make better use of ARGs for inference, but there is much to be done in this space still. Computer simulations have drastically altered the field of evolutionary genetics in the past decade. One of the major downside of simulations is the computational cost. The parameter space for any moderately complex model can quickly become prohibitively large. No major leaps in the efficiency of simulators or in hardware are in the horizon, so likelihood-based models will be useful in complementing simulation studies. For example, it is often useful to constrain the parameter space using estimates from likelihood-based models. Simulation- based inference can produce biased estimates when the generative process in simulations does not match that of the actual data, an issue also known as model mis-specification. Some of this can be alleviated by incorporating error, such as genotyping error and missingness, to the idealized simulations. However, there is still no consensus on how to include such errors to simulations in a systematic way. There are tools for dealing with mis-specification in the broader machine learning literature, and these seem promising in evolutionary contexts as well (Mo & Siepel, 2023). I expect that our ability to leverage simulations to gain evolutionary insights will only increase with improvement in these areas. 106 APPENDIX SUPPLEMENTAL MATERIAL FOR CHAPTER III Supplementary material A.0.1 Correlation between divergences that share branches. Landscapes of divergence can be correlated by their definition, as they can share part of their histories. In most of our analyses (except for Figure A.2), we do not show the correlations for such cases but below we describe how this sharing would affect correlations (using a simplified theory). For example, in Figure 3.3 dV X and dXY share the branch X; depending on how the length of the branch X compares to the total tree length, these two landscapes are bound to be correlated. Assuming that mutations follow a Poisson process and that coalescences happen instantaneously, we derive the following. There are three non-overlapping parts in the tree between these, the branch from the XY ancestor to X with length E[τX ] = TXY , the branch from the XY ancestor to Y with length E[τY ] = TXY and the branch from V to the XY ancestor with length E[τV ] = 2TVWXY − TXY . If we just consider the genealogical definition of divergence and assume dV X = τV + τX and dXY = τX + τY (i.e., ignoring the contributions of ancestral diversity to divergence), then Cov[dV X , dXY ] = Cov[τX + τV , τX + τY ] :0 :0 :0 = Cov[τX , τX ] +Cov[τ  X , τY ] + Cov[τ,   V τX ] +Cov [τ  V , τY ] = Var(τX) = E[τX ] = TX Therefore, 107 √ Cov[τX + τV , τX + τY ]Cor[dV X , dXY ] = √Var[τX + τV ] Var[τX + τY ] Var[τX ]2 = √(Var[τX ] + Var[τV ])(Var[τX ] + Var[τY ]) Var[τX ] Var[τX ] = √Var[τX ] + Var[τV ] Var[τX ] + Var[τY ] TX TX = TX + TV TX + TY √ = pd pV X dXY where p = TXd is the proportion of dV X that is shared with dXY , andV X TX+TV p = TXd is the proportion of dXY that is shared with d .XY T +T V XX Y 108 µN µP s̄N s̄P Regime µSD 0 0 0 0 Neutral 0 0 0 0 0 Variable µ 0.005 0 0 0 0 Variable µ 0.007 0 0 0 0 Variable µ 0.011 0 0 0 0 Variable µ 0.016 0 0 0 0 Variable µ 0.023 0 0 0 0 Variable µ 0.033 0 0 0 0 Variable µ 0.048 0 0 0 0 Variable µ 0.070 0 0 0 0 Variable µ 0.103 0 0 0 0 Variable µ 0.150 0 1 × 10−12 0 1 × 10−2 Beneficial 0 0 1 × 10−11 0 1 × 10−2 Beneficial 0 2 × 10−9 0 −3 × 10−2 0 Deleterious 0 2 × 10−9 1 × 10−11 −3 × 10−2 1 × 10−2 Both 0 2 × 10−9 0 −1.5 × 10−2 0 Deleterious 0 2 × 10−9 1 × 10−11 −1.5 × 10−2 1 × 10−2 Both 0 2 × 10−9 0 −1 × 10−2 0 Deleterious 0 2 × 10−9 1 × 10−12 −1 × 10−2 5 × 10−3 Both 0 2 × 10−9 1 × 10−12 −1 × 10−2 1 × 10−2 Both 0 2 × 10−9 0 −3 × 10−3 0 Deleterious 0 2 × 10−9 1 × 10−12 −3 × 10−3 5 × 10−3 Both 0 2 × 10−9 1 × 10−12 −3 × 10−3 1 × 10−2 Both 0 2 × 10−9 0 −1 × 10−3 0 Deleterious 0 2 × 10−9 1 × 10−11 −1 × 10−3 1 × 10−2 Both 0 6 × 10−9 0 −3 × 10−2 0 Deleterious 0 6 × 10−9 1 × 10−11 −3 × 10−2 1 × 10−2 Both 0 6 × 10−9 0 −1.5 × 10−2 0 Deleterious 0 6 × 10−9 1 × 10−11 −1.5 × 10−2 1 × 10−2 Both 0 1 × 10−8 1 × 10−11 −1 × 10−3 1 × 10−2 Both 0 1.2 × 10−8 0 −3 × 10−2 0 Deleterious 0 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.005 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.007 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.011 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.016 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.023 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.033 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.048 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.070 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.103 1.2 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0.150 1.2 × 10−8 1 × 10−11 −3 × 10−2 1 × 10−2 Both 0 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.005 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.007 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.011 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.016 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.023 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.033 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.048 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.070 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.103 1.4 × 10−8 0 −3 × 10−2 0 Deleterious 0.150 1.4 × 10−8 1 × 10−12 −3 × 10−2 1 × 10−2 Both 0 1.4 × 10−8 1 × 10−11 −3 × 10−2 1 × 10−2 Both 0 Table A.1. Parameter space explored with simulations. µN and µP are the rates of mutations under negative and positive selection, respectively. s̄N and s̄P and the mean fitness effects of negatively and positively selected mutations. µSD is the scaled standard deviation of the mutation rate map. See Table 3.1 and subsection 3.2.2 for more details. 109 win−size_1000000_merged−mask_True_state_all_curr_all_prop−acc_0.4 ● ●●●●●●● ●● ● 0.03 ●●●●●●●● ●● ● ● ● ●●●● ●● ●● ●● ●●●●●●● ● ●● 0.02 ●●●●●●●●● ●●● ●● ●● ●●● ●●●● ●●●●●●●●●● ●●●●●●● 0.01 ●●●●● ● ● ●● ● ●● ● ●●●● ● ●● ● ● ● ●● ● ● ●●●●●● ● ● ●●●●● ●●●●●●● ●●●● ●● 0.00 ● ● 0 250000 500000 750000 1000000 dT Rec:%Exon ● LO_REC:LO_EX ● LO_REC:HI_EX ● HI_REC:LO_EX ● HI_REC:HI_EX Figure A.1. Effect of exon density and recombination rate on the accumulation of genetic divergence in chromosome 12 with phylogenetic distance. Within-species genetic diversities are shown at dT = 0. Mean diversity and divergences were computed for four groups depending on whether they fell or not on the top 90% percentile of recombination rate and exon density. 110 Diversity/Divergence Great apes dataset dXY − π dXY − dXY 1.00 ● ●●● ●●●●●● ● ●●●●●●●●● ● ● ● Species involved ● ● ● ●● ● ● ●● ●●●● ●● ●●● ●●●●●●●●● in the comparison ● 0.75 ● ●● ●● ● ●●● ● ●● ● ●●●● ●● ●● ● ●●●●●●●● ●● ●● 2 ● ●● ● ● ● ●● ● ● ●● 3● ● ●● ● ● ●●●● ●● ●● ● 4●● ●● ● 0.50 π in low Ne 0.25 species? ● No 0.00 Yes 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Branch overlap ( 4 pXpY ) Figure A.2. Correlations between landscapes of diversity and divergence for comparisons with branch overlap. For example, diversity in humans and divergence between humans and bonobos share part of their history. Each point on the plots correspond to the (Spearman) correlation between two landscapes of diversity/divergence, computed on 1Mb windows across the entire genome. Correlations were split by type of landscapes compared (π − dXY , dXY − dXY ). The x-axis is a metric of expected branch overlap between the landscapes. See subsection A.0.1 for more information. Note that species with low Ne (bonobos, eastern gorillas and western chimps) have a different point shape. The colors reflect the number of species involved in the comparison. For example, the comparison between human-western gorilla and eastern chimp-Sumatran orangutan divergences includes four different species. On the other hand, the comparison between human- western gorilla and human-Sumatran orangutan divergences includes just three species. 111 Correlation between landscapes A μₙ B C=2.0e-09 μₙ=6.0e-09 μₙ=1.4e-08 π dX Y π dX Y π dX Y great-apes 0.005 great-apes great-apes 0.0020 central-chimp 0.00200.004 western-gorilla 0.0015 western-0go.0rill2a 0.02 0.0015 0.020.003 central-chimp sumatran-orangutan african-apes african-apes 0.0010 sumatran-orangutan0.0010 western-gorilla african-apes humans human-pan 0.002 bonobo 0.01 0.01 0.01 human-pan 0.0005 0.001 humcaennstral-chimp human-pansumatran-orangutan 0.0005 panbonobo humans pan bonobo pan 0.0000 0.0000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) D ₚ E F μ =1e-12 μₚ=1e-11 μₚ=1e-10 π dX Y π dX Y π dX Y 0.025 0.0015 0.0015 0.020 great-apes0.0015 sumatra0n-.o0ra2ngutan great-apes 0.020 great-apes western-gorilla 0.0010 african-apes 0.0010 0.015 0.015 central-chimp african-apes0.0010 central-chimp african-apes 0.01 human-pan humwaensstern-0g.o0ril1la00.0005 0.0005 bonobo0.010 human-pan humans 0.0005 0.005 hcuemntaranls-chimpsbuomnoabtroan-orangutan human-pan 0.005bonobo panpan pan 0.0000 sumatwraens-toerann-ggourtiallan0.0000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) G μₙ H I=2.0e-09 μₚ=1e-12 μₙ=1.2e-08 μₚ=1e-12 μₙ=1.4e-08 μₚ=1e-11 π dX Y π dX Y π dX Y 0.0020 0.0020 0.025 western-gorilla 0.0015 0.0015 0.02 great-apes 0.0015 0.02 0.020central-chimp great-apes great-apessumatran-orangutan 0.0010 sumatran-orangutan 0.015 western-gorilla african-apes 0.0010 0.0010 african-apes african-apes central-chimp sumatran-orangutan 0.01 human-pan central-chimpbonobo 0.01 0.010humans 0.0005 0.0005 human-panwesternh-ugmoarinllas 0.0005 pan human-pan bonobo humans 0.005 pan bonobo pan 0.0000 0.0000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Phylogenetic distance (103) Phylogenetic distance (103) Phylogenetic distance (103) Figure A.3. Landscapes of diversity and divergence in selected simulations with natural selection. The selection parameters µn and µp are the rate of mutat1io1n2s in exons with negative and positive fitness effects, respectively. The mean fitness effect was s̄ = −0.03 for deleterious mutations and s̄ = 0.01 for beneficial mutations (see subsection 3.2.2 for more details). Other details are as in Figure 3.2. C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n C o r r e l a t i o n A dXY for ancestral W mutations 0.04 great−apes 0.03 0.02 human−pan 0.01 african−apes pan 0.00 0.0e+00 5.0e+07 1.0e+08 1.5e+08 B dXY for ancestral S mutations 0.06 0.04 great−apes human−pan 0.02 african−apes pan 0.00 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window Figure A.4. Landscapes of divergence partitioned by allele state in the ancestor. Ancestral states were assumed to be the same as seen in rhesus macaques (RheMac2), and sites not called in macaques were not used. dXY for W sites is simply the mean pairwise differences between samples in species X and Y per ancestral W sites (A/T). Similar reasoning applies for dXY for S ancestral sites, but only considering (G/C) sites. Points were colored by the most common recent ancestor of the two species compared in each divergence. Lines were fitted using local linear regression. Note that for ancestrally weak mutations (A) there is an increase in divergence at the ends of the chromosomes, but that is not seen for ancestrally strong mutations (B). 113 Value Value 20 Mutation rate 10 μ-sd=0.103 Constant Local variation μₚ=1e-12 μₙ=1.2e-08 μₚ=1e-12 Regime 0 μₙ=1.4e-08 μₚ=1e-12 Empirical data Empirical data Neutral Deleterious Positive Both -10 Neutral μₚ=1e-11 -50 -25 0 25 50 PC1 (87.7%) Figure A.5. PCA visualization of data and simulations at 500Kb. The colors differentiate the empirical data from simulations with different parameters: Neutral refers to the simulation without any selection, Deleterious refers to simulations with deleterious mutations, Positive refers to simulations with beneficial mutations, Both refers to simulations with both beneficial and deleterious mutations. The shape of the points differentiate simulations with constant mutation rate along the genome and variable local mutation rates. Principal component analysis (PCA) applied to a matrix with all pairwise correlations between landscapes across the great apes (including π − π, π − dXY and dXY − dXY comparisons) for the great apes dataset and simulations (with selection and with mutation rate variation). We excluded simulations with µp ≥ 1 × 10−10 from the PCA analysis because PC2 was capturing negative correlations caused by strong positive selection — as seen in Figure 3.7F. 114 PC2 (6.21%) 20 μ-sd=0.033 Mutation rate 10 μₙ=1.4e-08 μ-sd=0.033 Constant μₙ=2.0e-09 μₚ=1e-12 μₚ=1e-12 Local variation Empirical data μₙ=1.2e-08 μₚ=1e-12 0 Regime Empirical data Neutral -10 Deleterious Positive Both -20 Neutral -50 0 PC1 (76.34%) Figure A.6. PCA visualization of data and simulations at 5Mb. The colors differentiate the empirical data from simulations with different parameters: Neutral refers to the simulation without any selection, Deleterious refers to simulations with deleterious mutations, Positive refers to simulations with beneficial mutations, Both refers to simulations with both beneficial and deleterious mutations. The shape of the points differentiate simulations with constant mutation rate along the genome and variable local mutation rates. Principal component analysis (PCA) applied to a matrix with all pairwise correlations between landscapes across the great apes (including π − π, π − dXY and dXY − dXY comparisons) for the great apes dataset and simulations (with selection and with mutation rate variation). We excluded simulations with µp ≥ 1 × 10−10 from the PCA analysis because PC2 was capturing negative correlations caused by strong positive selection — as seen in Figure 3.7F. 115 PC2 (5.34%) A B C ● 0.00005 ● 0.2 0.2 ● ● 0.00000 0.1 0.1 ●● Species ● ● ● bonobo 0.0 0.0 −0.00005 ● bornean−orangutan ● ● central−chimp ●● ● ● eastern−chimp ● ● ● eastern−gorilla −0.1 −0.1 ●●● ●● ● ●●●● ● ● ●● −0.00010 ● humans ●● ●● nigerian−chimp● ● ● sumatran−orangutan ● ● −0.2 −0.2 western−chimp ● −0.00015 western−gorilla● ● 0 250 500 750 1 000 0 250 500 750 1 000 Split time (103) Split time (103) Includes low Ne species? ● No D E F Yes0.000 ● −0.3 −0.3 ● MRCA ● −0.001 ● ●● african−apes● −0.4 −0.4 ● chimps ● ● eastern−central ● gorilla ● −0.5 −0.5 great−apes● ● −0.002 ●● human−pan ●●●● ● ● ● nigerian−western ● orangutans −0.6 −0.6 ●●● ● ● pan● ● −0.003 ● −0.7 −0.7 ●● ●●● ● ● ● ●●● ● ● ● ● −0.004 0 250 500 750 1 000 0 250 500 750 1 000 Split time (103) Split time (103) Figure A.7. Correlations and covariances between landscapes of diversity and divergence and annotation features in the real great apes data. Only windows in the middle half of chromosome 12 were included. Compare to Figure 3.10. 116 Cor( π , % Exon ) Cor( π , Recombination rate ) Cor( dXY , % Exon ) Cor( dXY , Recombination rate ) Cov( dXY , % Exon ) Cov( dXY , Recombination rate ) Both Deleterious Neutral R = 0.18, p = 0.039 R = 0.23, p = 0.0088 0.0015 R = 0.1, p = 0.25 0.0010 0.0005 0.0015 R = 0.063, p = 0.48 R = 0.078, p = 0.38 R = − 0.082, p = 0.35 0.0010 0.0005 0.0000 0.003 R = 0.45, p = 1e-07 R = 0.32, p = 0.00026 R = − 0.018, p = 0.84 0.002 Regime Both Deleterious Neutral 0.001 R = 0.47, p = 1.8e-08 0.0020 R = 0.43, p = 3.2e-07 R = 0.17, p = 0.052 0.0015 0.0010 0.0005 0.00125 R = 0.37, p = 1.4e-05 R = 0.29, p = 0.00084 0.00100 R = − 0.15, p = 0.086 0.00075 0.00050 0.00025 0.001 0.002 0.003 0.001 0.002 0.003 0.001 0.002 0.003 Diversity in real data Figure A.8. Scatterplots of genetic diversity in the real great apes data against diversity seen in simulations. Simulation with deleterious mutations had µn = exp 1.4−8, and the simulation with both deleterious and beneficial mutations had µn = exp 1.4−8 and µp = exp 1−12. 117 Diversity in simulation humans bonobo central-chimp western-gorilla bornean-orangutan 4e-04 0.00150 0.00125 3e-04 0.00100 Species 2e-04 bonobo bornean-orangutan 0.00075 central-chimp humans 1e-04 western-gorilla 0.00050 Species combo Data Neutral Deleterious Both Data Neutral Deleterious Both Regime Regime bonobo_bornean-orangutan bonobo_central-chimp 0.03 0.003 bonobo_humans bonobo_western-gorilla bornean-orangutan_central-chimp bornean-orangutan_humans bornean-orangutan_western-gorilla central-chimp_humans 0.02 0.002 central-chimp_western-gorilla humans_western-gorilla 0.01 0.001 Data Neutral Deleterious Both Data Neutral Deleterious Both Regime Regime Figure A.9. Summaries of genetic diversity (top) and divergence (bottom) across all species for the data and simulations. Mean is shown on the left and standard deviation on the right. “Neutral” refers to the simulation without any selection, “Deleterious” refers to the simulation with deleterious mutations ocurring at a rate of 1.4 × 10−8, “Both” refers to the simulation with both beneficial and deleterious mutations, with rates 1× 10−12 and 1.4× 10−8 respectively. 118 Mean Divergence Mean Diversity SD Divergence SD Diversity chr1 0.03 0.02 0.01 0.00 0.005 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr1 ● 15 ● R = 0.1 6 ● ● ● 10 ● ● ● ● ● ● 4 ●● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● 5 ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● 2 ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●●●● ● ●●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● 0 ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●● ● ●● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 Window Figure A.10. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 1. See Figure 3.2 for more details. 119 Rec rate Value % Exon dxy pi chr2 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr2 R = 0.054 ● 6 ● ● ● 10 ● ● 4 ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ●● ● ● ● ● ● ● ● ● ●● ● ● 2 ● ● ●● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●●●● ●● ● ● ● ● ● ● ● ●● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● 0 ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ●● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 Window Figure A.11. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 2. See Figure 3.2 for more details. 120 Rec rate Value % Exon dxy pi chr3 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr3 R = 0.072 ● 6 ● 10 ● ● ● ● 4 ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ● 2 ● ●● ● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ● ● ● ● ● ●●● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● 0 ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 Window Figure A.12. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 3. See Figure 3.2 for more details. 121 Rec rate Value % Exon dxy pi chr4 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr4 R = 0.24 6 ● ● 10 ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●●● ●● ●● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● 0 ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ●● ●●● ●● ● ●●●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window Figure A.13. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 4. See Figure 3.2 for more details. 122 Rec rate Value % Exon dxy pi chr5 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr5 10.0 ● 20 7.5 15 R = 0.24 5.0 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2.5 ● ● ●● ●● ● ● ● ●● ● 5● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window Figure A.14. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 5. See Figure 3.2 for more details. 123 Rec rate Value % Exon dxy pi chr6 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr6 ● 8 ● 15 R = 0.16 6 ● ● ● 10 ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5● ● 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window Figure A.15. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 6. See Figure 3.2 for more details. 124 Rec rate Value % Exon dxy pi chr7 0.04 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr7 ● 8 15 ● R = 0.054 6 ● ● 10 ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● 5● ● ● 2 ● ● ● ●● ● ●● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window Figure A.16. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 7. See Figure 3.2 for more details. 125 Rec rate Value % Exon dxy pi chr8 0.05 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr8 ● 20 9 15 6 R = 0.22 ● 10 ● ● ● 3 ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● 5● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 0.0e+00 5.0e+07 1.0e+08 1.5e+08 Window Figure A.17. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 8. See Figure 3.2 for more details. 126 Rec rate Value % Exon dxy pi chr9 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0e+00 5e+07 1e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr9 ● ● 7.5 15 R = 0.0033 ● 5.0 ●● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 2.5 ● ●● ●● ●● 5 ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● 0 0e+00 5e+07 1e+08 Window Figure A.18. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 9. See Figure 3.2 for more details. 127 Rec rate Value % Exon dxy pi chr10 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0e+00 5e+07 1e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr10 R = 0.032 6 ● ● 10 ● 4 ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ● ● ● ● ● ● 2 ●● ●● ● ● ●● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● 0 ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● 0 0e+00 5e+07 1e+08 Window Figure A.19. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 10. See Figure 3.2 for more details. 128 Rec rate Value % Exon dxy pi chr11 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 0e+00 5e+07 1e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr11 ● 15 R = 0.046 6 ●● ● ● 10 ● ● ● ● 4 ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● 2 ●● ● ● ● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● 0 ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 0e+00 5e+07 1e+08 Window Figure A.20. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 11. See Figure 3.2 for more details. 129 Rec rate Value % Exon dxy pi chr12 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0e+00 5e+07 1e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr12 R = 0.14 ● 6 ● ● ● ● ● 10 ● ● ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● 5 ● ● ● ● ●2 ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● 0 ● ● ● ● ●● ● ● ● ● 0 0e+00 5e+07 1e+08 Window Figure A.21. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 12. See Figure 3.2 for more details. 130 Rec rate Value % Exon dxy pi chr13 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 3e+07 6e+07 9e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr13 R = 0.36 6 10 ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ●● ● ● ● ● ● 2 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● 0 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 3e+07 6e+07 9e+07 Window Figure A.22. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 13. See Figure 3.2 for more details. 131 Rec rate Value % Exon dxy pi chr14 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 2e+07 4e+07 6e+07 8e+07 1e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr14 8 ● 15 ● R = 0.32 ● 6 ● 10 4 ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ● ● ● 2 ●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● 0 ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 0 2e+07 4e+07 6e+07 8e+07 1e+08 Window Figure A.23. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 14. See Figure 3.2 for more details. 132 Rec rate Value % Exon dxy pi chr15 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 2e+07 4e+07 6e+07 8e+07 1e+08 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr15 R = − 0.53 6 ● ● 10 ● ● ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● 5 ● ● ● ● ● ● ● ● ● 2 ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ●● 0 2e+07 4e+07 6e+07 8e+07 1e+08 Window Figure A.24. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 15. See Figure 3.2 for more details. 133 Rec rate Value % Exon dxy pi chr16 0.04 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0.0e+00 2.5e+07 5.0e+07 7.5e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr16 ● 25 12 ● 20 ● 8 ● 15 R = − 0.26 ● ● ● ● 10 4 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● 0 0.0e+00 2.5e+07 5.0e+07 7.5e+07 Window Figure A.25. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 16. See Figure 3.2 for more details. 134 Rec rate Value % Exon dxy pi chr17 0.04 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0e+00 2e+07 4e+07 6e+07 8e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr17 ● 10.0 20 ● 7.5 ● 15 ● R = − 0.23 ● ● ● ● ● 5.0 ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 2.5 ●● ● ●● ● ●● ● ● ● ● 5● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ●● ● ● ● 0 0e+00 2e+07 4e+07 6e+07 8e+07 Window Figure A.26. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 17. See Figure 3.2 for more details. 135 Rec rate Value % Exon dxy pi chr18 0.04 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0e+00 2e+07 4e+07 6e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr18 R = 0.082 6 10 ● ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● ● ●2 ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● 0 ● ● ● ●● ● ●● ● ● 0 0e+00 2e+07 4e+07 6e+07 Window Figure A.27. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 18. See Figure 3.2 for more details. 136 Rec rate Value % Exon dxy pi chr19 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0e+00 2e+07 4e+07 6e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr19 10.0 ● 20 ● ● 7.5 ● 15 ● ● R = 0.19 ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● 5.0 ● ● ●● 10 ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● 2.5 ●● ● ● 5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 ● ● 0 0e+00 2e+07 4e+07 6e+07 Window Figure A.28. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 19. See Figure 3.2 for more details. 137 Rec rate Value % Exon dxy pi chr20 0.04 0.03 0.02 0.01 0.00 0.003 0.002 0.001 0.000 0e+00 2e+07 4e+07 6e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr20 ● R = − 0.28 ● 6 ● ● 10 ● 4 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● 5 ● ● ●● ● ● ● 2 ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● 0 ● ● ●●● ● ● ● ● ● ● ● 0 0e+00 2e+07 4e+07 6e+07 Window Figure A.29. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 20. See Figure 3.2 for more details. 138 Rec rate Value % Exon dxy pi chr21 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 0.000 2e+07 3e+07 4e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr21 R = 0.13 6 ● 10 ● ● 4 ● ● ● ● ● ● ● ● ● ●● ● 5 ● ● 2 ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● 0 ●● ● ● ● 0 2e+07 3e+07 4e+07 Window Figure A.30. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 21. See Figure 3.2 for more details. 139 Rec rate Value % Exon dxy pi chr22 0.04 0.03 0.02 0.01 0.00 0.004 0.003 0.002 0.001 2e+07 3e+07 4e+07 Window humans nigerian−chimp bornean−orangutan nigerian−western human−pan central−chimp bonobo sumatran−orangutan chimps african−apes Species western−chimp eastern−gorilla eastern−central orangutans great−apes eastern−chimp western−gorilla gorilla pan chr22 R = − 0.27 6 ● ● ● 10 ● ● ● ● 4 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 ● ● ● 2 ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● 0 ● ● 0 2e+07 3e+07 4e+07 Window Figure A.31. Landscapes of diversity, divergence, exon density and recombination rate across chromosome 22. See Figure 3.2 for more details. 140 Rec rate Value % Exon dxy pi Bibliography Adrion, J. R., Cole, C. B., Dukler, N., Galloway, J. G., Gladstein, A. L., Gower, G., Kyriazis, C. C., Ragsdale, A. P., Tsambos, G., Baumdicker, F., Carlson, J., Cartwright, R. A., Durvasula, A., Gronau, I., Kim, B. Y., McKenzie, P., Messer, P. W., Noskova, E., Ortega-Del Vecchyo, D., . . . Kern, A. D. (2020). A community-maintained standard library of population genetic models. eLife, 9, e54967. https://doi.org/10.7554/eLife.54967 Agarwal, I., & Przeworski, M. (2021). Mutation saturation for fitness effects at human CpG sites (J. Ross-Ibarra & P. J. Wittkopp, Eds.). eLife, 10, e71513. https://doi.org/10.7554/eLife.71513 Andolfatto, P. (2001). Adaptive hitchhiking effects on genome variability. Current Opinion in Genetics & Development, 11 (6), 635–641. https://doi.org/10. 1016/S0959-437X(00)00246-X Andolfatto, P. (2007). Hitchhiking effects of recurrent beneficial amino acid substitutions in the drosophila melanogaster genome. Genome Research, 17 (12), 1755–1762. https://doi.org/10.1101/gr.6691007 Barton, N. H., & Etheridge, A. M. (2004). The effect of selection on genealogies. Genetics, 166 (2), 1115–1131. https://doi.org/10.1093/genetics/166.2.1115 Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K., Nash, C., Langston, V., . . . Pascanu, R. (2018, October 17). Relational inductive biases, deep learning, and graph networks. https : // doi.org/10.48550/arXiv.1806.01261 141 Battey, C. J. (2020). Evidence of linked selection on the z chromosome of hybridizing hummingbirds*. Evolution, 74 (4), 725–739. https ://doi .org/ 10.1111/evo.13888 Baumdicker, F., Bisschop, G., Goldstein, D., Gower, G., Ragsdale, A. P., Tsambos, G., Zhu, S., Eldon, B., Ellerman, E. C., Galloway, J. G., Gladstein, A. L., Gorjanc, G., Guo, B., Jeffery, B., Kretzschumar, W. W., Lohse, K., Matschiner, M., Nelson, D., Pope, N. S., . . . Kelleher, J. (2022). Efficient ancestry and mutation simulation with msprime 1.0. Genetics, 220 (3), iyab229. https://doi.org/10.1093/genetics/iyab229 Begun, D. J., & Aquadro, C. F. (1992). Levels of naturally occurring DNA polymorphism correlate with recombination rates in d. melanogaster. Nature, 356 (6369), 519–520. https://doi.org/10.1038/356519a0 Begun, D. J., Holloway, A. K., Stevens, K., Hillier, L. W., Poh, Y.-P., Hahn, M. W., Nista, P. M., Jones, C. D., Kern, A. D., Dewey, C. N., Pachter, L., Myers, E., & Langley, C. H. (2007). Population genomics: Whole-genome analysis of polymorphism and divergence in drosophila simulans [Publisher: Public Library of Science]. PLOS Biology, 5 (11), e310. https://doi.org/10.1371/journal.pbio.0050310 Beissinger, T. M., Wang, L., Crosby, K., Durvasula, A., Hufford, M. B., & Ross- Ibarra, J. (2016). Recent demography drives changes in linked selection across the maize genome. Nature Plants, 2 (7), 1–7. https ://doi .org/10 . 1038/nplants.2016.84 Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W. J., Mattick, J. S., & Haussler, D. (2004). Ultraconserved elements in the human genome. 142 Science (New York, N.Y.), 304 (5675), 1321–1325. https : / /doi . org/10 . 1126/science.1098119 Birky, C. W., & Walsh, J. B. (1988). Effects of linkage on rates of molecular evolution. Proceedings of the National Academy of Sciences, 85 (17), 6414–6418. https://doi.org/10.1073/pnas.85.17.6414 Boyko, A. R., Williamson, S. H., Indap, A. R., Degenhardt, J. D., Hernandez, R. D., Lohmueller, K. E., Adams, M. D., Schmidt, S., Sninsky, J. J., Sunyaev, S. R., White, T. J., Nielsen, R., Clark, A. G., & Bustamante, C. D. (2008). Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS genetics, 4 (5), e1000083. https://doi. org/10.1371/journal.pgen.1000083 Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., & Vandergheynst, P. (2017). Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34 (4), 18–42. https://doi.org/10.1109/MSP.2017. 2693418 Bulmer, M. G. (1976). The effect of selection on genetic variability: A simulation study [Publisher: Cambridge University Press]. Genetics Research, 28 (2), 101–117. https://doi.org/10.1017/S0016672300016797 Burri, R. (2017). Interpreting differentiation landscapes in the light of long-term linked selection. Evolution Letters, 1 (3), 118–131. https://doi.org/10.1002/ evl3.14 Burri, R., Nater, A., Kawakami, T., Mugal, C. F., Olason, P. I., Smeds, L., Suh, A., Dutoit, L., Bureš, S., Garamszegi, L. Z., Hogner, S., Moreno, J., Qvarnström, A., Ružić, M., Sæther, S.-A., Sætre, G.-P., Török, J., & Ellegren, H. (2015). Linked selection and recombination rate variation 143 drive the evolution of the genomic landscape of differentiation across the speciation continuum of ficedula flycatchers. Genome Research, 25 (11), 1656–1665. https://doi.org/10.1101/gr.196485.115 Cai, J. J., Macpherson, J. M., Sella, G., & Petrov, D. A. (2009). Pervasive hitchhiking at coding and regulatory sites in humans. PLOS Genetics, 5 (1), e1000336. https://doi.org/10.1371/journal.pgen.1000336 Caldas, I. V., Clark, A. G., & Messer, P. W. (2022, July 20). Inference of selective sweep parameters through supervised learning [Pages: 2022.07.19.500702 Section: New Results]. https://doi.org/10.1101/2022.07.19.500702 Carvajal-Rodriguez, A. (2008). Simulation of genomes: A review. Current Genomics, 9 (3), 155–159. https://doi.org/10.2174/138920208784340759 Castellano, D., Eyre-Walker, A., & Munch, K. (2020). Impact of mutation rate and selection at linked sites on DNA variation across the genomes of humans and other homininae. Genome Biology and Evolution, 12 (1), 3550–3561. https: //doi.org/10.1093/gbe/evz215 Castellano, D., Macià, M. C., Tataru, P., Bataillon, T., & Munch, K. (2019). Comparison of the full distribution of fitness effects of new amino acid mutations across great apes. Genetics, 213 (3), 953–966. https://doi.org/ 10.1534/genetics.119.302494 Chan, J., Perrone, V., Spence, J. P., Jenkins, P. A., Mathieson, S., & Song, Y. S. (2018). A likelihood-free inference framework for population genetic data using exchangeable neural networks. Advances in Neural Information Processing Systems, 31, 8594–8605. 144 Charlesworth, B., Morgan, M. T., & Charlesworth, D. (1993). The effect of deleterious mutations on neutral molecular variation. Genetics, 134 (4), 1289–1303. Chen, J.-M., Cooper, D. N., Chuzhanova, N., Férec, C., & Patrinos, G. P. (2007). Gene conversion: Mechanisms, evolution and human disease. Nature Reviews Genetics, 8 (10), 762–775. https://doi.org/10.1038/nrg2193 Cleland, C. E. (2002). Methodological and epistemic differences between historical science and experimental science [Publisher: Cambridge University Press]. Philosophy of Science, 69 (3), 474–496. https://doi.org/10.1086/342455 Comeron, J. M. (2017). Background selection as null hypothesis in population genomics: Insights and challenges from drosophila studies. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 372 (1736), 20160471. https://doi.org/10.1098/rstb.2016.0471 Coop, G., & Ralph, P. (2012). Patterns of neutral diversity under general models of selective sweeps. Genetics, 192 (1), 205–224. https://doi.org/10.1534/ genetics.112.141861 Corbett-Detig, R. B., Hartl, D. L., & Sackton, T. B. (2015). Natural selection constrains neutral diversity across a wide range of species. PLOS Biology, 13 (4), e1002112. https://doi.org/10.1371/journal.pbio.1002112 Cruickshank, T. E., & Hahn, M. W. (2014). Reanalysis suggests that genomic islands of speciation are due to reduced diversity, not reduced gene flow. Molecular Ecology, 23 (13), 3133–3157. https://doi.org/10.1111/mec.12796 DeGiorgio, M., Huber, C. D., Hubisz, M. J., Hellmann, I., & Nielsen, R. (2016). SweepFinder2: Increased sensitivity, robustness and flexibility. 145 Bioinformatics, 32 (12), 1895–1897. https://doi.org/10.1093/bioinformatics/ btw051 Delmore, K. E., Lugo Ramos, J. S., van Doren, B. M., Lundberg, M., Bensch, S., Irwin, D. E., & Liedvogel, M. (2018). Comparative analysis examining patterns of genomic differentiation across multiple episodes of population divergence in birds. Evolution Letters, 2 (2), 76–87. https://doi.org/10.1002/ evl3.46 Ellegren, H., Smeds, L., Burri, R., Olason, P. I., Backström, N., Kawakami, T., Künstner, A., Mäkinen, H., Nadachowska-Brzyska, K., Qvarnström, A., Uebbing, S., & Wolf, J. B. W. (2012). The genomic landscape of species divergence in ficedula flycatchers. Nature, 491 (7426), 756–760. https://doi. org/10.1038/nature11584 Enard, D., Messer, P. W., & Petrov, D. A. (2014). Genome-wide signals of positive selection in human evolution. Genome Research, 24 (6), 885–895. https://doi. org/10.1101/gr.164822.113 Ewing, G. B., & Jensen, J. D. (2016). The consequences of not accounting for background selection in demographic inference [ eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/mec.13390]. Molecular Ecology, 25 (1), 135–141. https://doi.org/10.1111/mec.13390 Fan, C., Cahoon, J. L., Dinh, B. L., Ortega-Del Vecchyo, D., Huber, C., Edge, M. D., Mancuso, N., & Chiang, C. W. (2023). A likelihood-based framework for demographic inference from genealogical trees. bioRxiv, 2023.10.10.561787. https://doi.org/10.1101/2023.10.10.561787 Field, D., & Wills, C. (1997). Long, polymorphic microsatellites in simple organisms [Publisher: Royal Society]. Proceedings of the Royal Society of 146 London. Series B: Biological Sciences, 263 (1367), 209–215. https://doi.org/ 10.1098/rspb.1996.0033 Flagel, L., Brandvain, Y., & Schrider, D. R. (2019). The unreasonable effectiveness of convolutional neural networks in population genetic inference. Molecular Biology and Evolution, 36 (2), 220–238. https://doi.org/10.1093/molbev/ msy224 Frazer, K. A., Ballinger, D. G., Cox, D. R., Hinds, D. A., Stuve, L. L., Gibbs, R. A., Belmont, J. W., Boudreau, A., Hardenbol, P., Leal, S. M., Pasternak, S., Wheeler, D. A., Willis, T. D., Yu, F., Yang, H., Zeng, C., Gao, Y., Hu, H., Hu, W., . . . Stewart, J. (2007). A second generation human haplotype map of over 3.1 million SNPs. [Place: England]. Nature, 449 (7164), 851–861. https://doi.org/10.1038/nature06258 Galtier, N. (2016). Adaptive protein evolution in animals and the effective population size hypothesis. PLOS Genetics, 12 (1), e1005774. https://doi. org/10.1371/journal.pgen.1005774 Galtier, N., Duret, L., Glémin, S., & Ranwez, V. (2009). GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates. Trends in genetics: TIG, 25 (1), 1–5. https://doi.org/10.1016/j.tig.2008.10.011 Garud, N. R., Messer, P. W., Buzbas, E. O., & Petrov, D. A. (2015). Recent selective sweeps in north american drosophila melanogaster show signatures of soft sweeps [Publisher: Public Library of Science]. PLOS Genetics, 11 (2), e1005004. https://doi.org/10.1371/journal.pgen.1005004 Glémin, S., Arndt, P. F., Messer, P. W., Petrov, D., Galtier, N., & Duret, L. (2015). Quantification of GC-biased gene conversion in the human genome. Genome Research, 25 (8), 1215–1228. https://doi.org/10.1101/gr.185488.114 147 Gower, G., Ragsdale, A. P., Bisschop, G., Gutenkunst, R. N., Hartfield, M., Noskova, E., Schiffels, S., Struck, T. J., Kelleher, J., & Thornton, K. R. (2022). Demes: A standard format for demographic models. Genetics, 222 (3), iyac131. https://doi.org/10.1093/genetics/iyac131 Griffiths, R., & Marjoram, P. (1996). Ancestral inference from samples of DNA sequences with recombination [Publisher: Mary Ann Liebert, Inc., publishers]. Journal of Computational Biology, 3 (4), 479–502. https://doi. org/10.1089/cmb.1996.3.479 Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., & Bustamante, C. D. (2009). Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data [Publisher: Public Library of Science]. PLOS Genetics, 5 (10), e1000695. https://doi.org/10.1371/journal.pgen. 1000695 Haller, B. C., Galloway, J., Kelleher, J., Messer, P. W., & Ralph, P. L. (2019). Tree- sequence recording in SLiM opens new horizons for forward-time simulation of whole genomes. Molecular Ecology Resources, 19 (2), 552–566. https://doi. org/10.1111/1755-0998.12968 Haller, B. C., & Messer, P. W. (2019). SLiM 3: Forward genetic simulations beyond the wright-fisher model. Molecular Biology and Evolution, 36 (3), 632–637. https://doi.org/10.1093/molbev/msy228 Halligan, D. L., Kousathanas, A., Ness, R. W., Harr, B., Eöry, L., Keane, T. M., Adams, D. J., & Keightley, P. D. (2013). Contributions of protein-coding and regulatory change to adaptive molecular evolution in murid rodents. PLoS Genetics, 9 (12). https://doi.org/10.1371/journal.pgen.1003995 148 Hamilton, W. L., Ying, R., & Leskovec, J. (2018, September 10). Inductive representation learning on large graphs. https://doi.org/10.48550/arXiv. 1706.02216 Harr, B. (2006). Genomic islands of differentiation between house mouse subspecies. Genome Research, 16 (6), 730–737. https://doi.org/10.1101/gr.5045006 Harris, K., & Nielsen, R. (2013). Inferring demographic history from a spectrum of shared haplotype lengths [Publisher: Public Library of Science]. PLOS Genetics, 9 (6), e1003521. https://doi.org/10.1371/journal.pgen.1003521 Hejase, H. A., Mo, Z., Campagna, L., & Siepel, A. (2022). A deep-learning approach for inference of selective sweeps from the ancestral recombination graph. Molecular Biology and Evolution, 39 (1), msab332. https://doi.org/10. 1093/molbev/msab332 Hernandez, R. D., Kelley, J. L., Elyashiv, E., Melton, S. C., Auton, A., McVean, G., 1000 Genomes Project, Sella, G., & Przeworski, M. (2011). Classic selective sweeps were rare in recent human evolution. Science (New York, N.Y.), 331 (6019), 920–924. https://doi.org/10.1126/science.1198878 Hoban, S., Bertorelle, G., & Gaggiotti, O. E. (2012). Computer simulations: Tools for population and evolutionary genetics [Number: 2 Publisher: Nature Publishing Group]. Nature Reviews Genetics, 13 (2), 110–122. https://doi. org/10.1038/nrg3130 Hodgkinson, A., & Eyre-Walker, A. (2011). Variation in the mutation rate across mammalian genomes. Nature Reviews Genetics, 12 (11), 756–766. https : //doi.org/10.1038/nrg3098 Huber, C. D., Kim, B. Y., Marsden, C. D., & Lohmueller, K. E. (2017). Determining the factors driving selective effects of new nonsynonymous 149 mutations. Proceedings of the National Academy of Sciences, 114 (17), 4465–4470. https://doi.org/10.1073/pnas.1619508114 Hudson, R. R. (1983). Testing the constant-rate neutral allele model with protein sequence data [Publisher: [Society for the Study of Evolution, Wiley]]. Evolution, 37 (1), 203–217. https://doi.org/10.2307/2408186 Hudson, R. R., Kreitman, M., & Aguadé, M. (1987). A test of neutral molecular evolution based on nucleotide data. Genetics, 116 (1), 153–159. https://doi. org/10.1093/genetics/116.1.153 Ingvarsson, P. K. (2010). Natural selection on synonymous and nonsynonymous mutations shapes patterns of polymorphism in populus tremula. Molecular Biology and Evolution, 27 (3), 650–660. https://doi.org/10.1093/molbev/ msp255 Irwin, D. E., Alcaide, M., Delmore, K. E., Irwin, J. H., & Owens, G. L. (2016). Recurrent selection explains parallel evolution of genomic regions of high relative but low absolute differentiation in a ring species. Molecular Ecology, 25 (18), 4488–4507. https://doi.org/10.1111/mec.13792 Jauch, A., Wienberg, J., Stanyon, R., Arnold, N., Tofanelli, S., Ishida, T., & Cremer, T. (1992). Reconstruction of genomic rearrangements in great apes and gibbons by chromosome painting. Proceedings of the National Academy of Sciences of the United States of America, 89 (18), 8611–8615. Retrieved February 21, 2020, from https://www.ncbi .nlm.nih.gov/pmc/articles/ PMC49970/ Jensen, J. D., Kim, Y., DuMont, V. B., Aquadro, C. F., & Bustamante, C. D. (2005). Distinguishing between selective sweeps and demography using DNA 150 polymorphism data. Genetics, 170 (3), 1401–1410. https://doi.org/10.1534/ genetics.104.038224 Johri, P., Riall, K., Becher, H., Excoffier, L., Charlesworth, B., & Jensen, J. D. (2021). The impact of purifying and background selection on the inference of population history: Problems and prospects. Molecular Biology and Evolution, 38 (7), 2986–3003. https://doi.org/10.1093/molbev/msab050 Kaplan, N. L., Hudson, R. R., & Langley, C. H. (1989). The ”hitchhiking effect” revisited. Genetics, 123 (4), 887–899. Katzman, S., Kern, A. D., Bejerano, G., Fewell, G., Fulton, L., Wilson, R. K., Salama, S. R., & Haussler, D. (2007). Human genome ultraconserved elements are ultraselected. Science, 317 (5840), 915–915. https://doi.org/ 10.1126/science.1142430 Katzman, S., Kern, A. D., Pollard, K. S., Salama, S. R., & Haussler, D. (2010). GC-biased evolution near human accelerated regions (J. Zhang, Ed.). PLoS Genetics, 6 (5), e1000960. https://doi.org/10.1371/journal.pgen.1000960 Kelleher, J., Etheridge, A. M., & McVean, G. (2016a). Efficient coalescent simulation and genealogical analysis for large sample sizes. PLoS computational biology, 12 (5), e1004842. https://doi.org/10.1371/journal. pcbi.1004842 Kelleher, J., Etheridge, A. M., & McVean, G. (2016b). Efficient coalescent simulation and genealogical analysis for large sample sizes [Publisher: Public Library of Science]. PLOS Computational Biology, 12 (5), e1004842. https: //doi.org/10.1371/journal.pcbi.1004842 Kelleher, J., Thornton, K. R., Ashander, J., & Ralph, P. L. (2018). Efficient pedigree recording for fast population genetics simulation. PLoS 151 computational biology, 14 (11), e1006581. https://doi.org/10.1371/journal. pcbi.1006581 Kelleher, J., Wong, Y., Wohns, A. W., Fadil, C., Albers, P. K., & McVean, G. (2019). Inferring whole-genome histories in large population datasets [Number: 9 Publisher: Nature Publishing Group]. Nature Genetics, 51 (9), 1330–1338. https://doi.org/10.1038/s41588-019-0483-y Kern, A. D., & Hahn, M. W. (2018). The neutral theory in light of natural selection. Molecular Biology and Evolution, 35 (6), 1366–1371. https://doi. org/10.1093/molbev/msy092 Kern, A. D., Jones, C. D., & Begun, D. J. (2002). Genomic effects of nucleotide substitutions in drosophila simulans. Genetics, 162 (4), 1753–1761. Retrieved February 9, 2020, from https://www.genetics.org/content/162/4/1753 Kern, A. D., & Schrider, D. R. (2018). diploS/HIC: An updated approach to classifying selective sweeps. G3 Genes—Genomes—Genetics, 8 (6), 1959–1970. https://doi.org/10.1534/g3.118.200262 Kim, B. Y., Huber, C. D., & Lohmueller, K. E. (2017). Inference of the distribution of selection coefficients for new nonsynonymous mutations using large samples. Genetics, 206 (1), 345–361. Kim, Y., & Stephan, W. (2000). Joint effects of genetic hitchhiking and background selection on neutral variation. Genetics, 155 (3), 1415–1427. Retrieved February 8, 2020, from https ://www.ncbi .nlm.nih .gov/pmc/articles/ PMC1461159/ Kim, Y. (2006). Allele frequency distribution under recurrent selective sweeps. Genetics, 172 (3), 1967–1978. https://doi.org/10.1534/genetics.105.048447 152 Kimura, M. (1980). Average time until fixation of a mutant allele in a finite population under continued mutation pressure: Studies by analytical, numerical, and pseudo-sampling methods [Publisher: Proceedings of the National Academy of Sciences]. Proceedings of the National Academy of Sciences, 77 (1), 522–526. https://doi.org/10.1073/pnas.77.1.522 Kong, A., Gudbjartsson, D. F., Sainz, J., Jonsdottir, G. M., Gudjonsson, S. A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Palsson, S. T., Frigge, M. L., Thorgeirsson, T. E., Gulcher, J. R., & Stefansson, K. (2002). A high-resolution recombination map of the human genome. Nature Genetics, 31 (3), 241–247. https://doi.org/10.1038/ng917 Kong, A., Thorleifsson, G., Gudbjartsson, D. F., Masson, G., Sigurdsson, A., Jonasdottir, A., Walters, G. B., Jonasdottir, A., Gylfason, A., Kristinsson, K. T., Gudjonsson, S. A., Frigge, M. L., Helgason, A., Thorsteinsdottir, U., & Stefansson, K. (2010). Fine-scale recombination rate differences between sexes, populations and individuals. Nature, 467 (7319), 1099–1103. https://doi.org/10.1038/nature09525 Korfmann, K., Sellinger, T., Freund, F., Fumagalli, M., & Tellier, A. (2023, May 11). Simultaneous inference of past demography and selection from the ancestral recombination graph under the beta coalescent [Pages: 2022.09.28.508873 Section: New Results]. https://doi.org/10.1101/2022. 09.28.508873 Kronenberg, Z. N., Fiddes, I. T., Gordon, D., Murali, S., Cantsilieris, S., Meyerson, O. S., Underwood, J. G., Nelson, B. J., Chaisson, M. J. P., Dougherty, M. L., Munson, K. M., Hastie, A. R., Diekhans, M., Hormozdiari, F., Lorusso, N., Hoekzema, K., Qiu, R., Clark, K., Raja, A., 153 . . . Eichler, E. E. (2018). High-resolution comparative analysis of great ape genomes. Science, 360 (6393). https://doi.org/10.1126/science.aar6343 Lauterbur, M. E., Cavassim, M. I. A., Gladstein, A. L., Gower, G., Pope, N. S., Tsambos, G., Adrion, J., Belsare, S., Biddanda, A., Caudill, V., Cury, J., Echevarria, I., Haller, B. C., Hasan, A. R., Huang, X., Iasi, L. N. M., Noskova, E., Obšteter, J., Pavinato, V. A. C., . . . Gronau, I. (2023). Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations [Publisher: eLife Sciences Publications Limited]. eLife, 12. https://doi.org/10.7554/eLife.84874 Laval, G., Patin, E., Boutillier, P., & Quintana-Murci, L. (2021). Sporadic occurrence of recent selective sweeps from standing variation in humans as revealed by an approximate bayesian computation approach. Genetics, 219 (4), iyab161. https://doi.org/10.1093/genetics/iyab161 Leffler, E. M., Bullaughey, K., Matute, D. R., Meyer, W. K., Ségurel, L., Venkat, A., Andolfatto, P., & Przeworski, M. (2012). Revisiting an old riddle: What determines genetic diversity levels within species? PLOS Biology, 10 (9), e1001388. https://doi.org/10.1371/journal.pbio.1001388 Lewontin, R. C. (1974). The genetic basis of evolutionary change (Vol. 560). Columbia University Press New York. Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome sequences [Number: 7357 Publisher: Nature Publishing Group]. Nature, 475 (7357), 493–496. https://doi.org/10.1038/nature10231 Lohmueller, K. E., Albrechtsen, A., Li, Y., Kim, S. Y., Korneliussen, T., Vinckenbosch, N., Tian, G., Huerta-Sanchez, E., Feder, A. F., Grarup, N., Jørgensen, T., Jiang, T., Witte, D. R., Sandbæk, A., Hellmann, I., 154 Lauritzen, T., Hansen, T., Pedersen, O., Wang, J., & Nielsen, R. (2011). Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome. PLOS Genetics, 7 (10), e1002326. https://doi.org/10.1371/journal.pgen.1002326 Losos, J. B. (2009, August 15). Evolutionary biology as a historical science. In J. Losos (Ed.), Lizards in an evolutionary tree: Ecology and adaptive radiation of anoles (p. 0). University of California Press. https://doi.org/10. 1525/california/9780520255913.003.0001 Macpherson, J. M., Sella, G., Davis, J. C., & Petrov, D. A. (2007). Genomewide spatial correspondence between nonsynonymous divergence and neutral polymorphism reveals extensive adaptation in drosophila. Genetics, 177 (4), 2083–2099. https://doi.org/10.1534/genetics.107.080226 Maynard Smith, J., & Haigh, J. (1974). The hitch-hiking effect of a favourable gene. Genetical Research, 23 (1), 23–35. McDonald, J. H., & Kreitman, M. (1991). Adaptive protein evolution at the adh locus in drosophila. Nature, 351 (6328), 652–654. https://doi.org/10.1038/ 351652a0 McVean, G. A., & Cardin, N. J. (2005). Approximating the coalescent with recombination [Publisher: Royal Society]. Philosophical Transactions of the Royal Society B: Biological Sciences, 360 (1459), 1387–1393. https://doi.org/ 10.1098/rstb.2005.1673 Miles, A., Bot, P. I., Rodrigues, M. F., Ralph, P., Harding, N., Pisupati, R., & Rae, S. (2020). Cggh/scikit-allel: V1. 3.2. Zenodo. Mo, Z., & Siepel, A. (2023). Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data [Publisher: 155 Public Library of Science]. PLOS Genetics, 19 (11), e1011032. https://doi. org/10.1371/journal.pgen.1011032 Murphy, D. A., Elyashiv, E., Amster, G., & Sella, G. (2022). Broad-scale variation in human genetic diversity levels is predicted by purifying selection on coding and non-coding elements (M. Nordborg, Ed.) [Publisher: eLife Sciences Publications, Ltd]. eLife, 11, e76065. https://doi .org/10.7554/ eLife.76065 Nachman, M. W., & Crowell, S. L. (2000). Estimate of the mutation rate per nucleotide in humans. Genetics, 156 (1), 297–304. https : / / doi . org / 10 . 1093/genetics/156.1.297 Nam, K., Munch, K., Mailund, T., Nater, A., Greminger, M. P., Krützen, M., Marquès-Bonet, T., & Schierup, M. H. (2017). Evidence that the rate of strong selective sweeps increases with population size in the great apes. Proceedings of the National Academy of Sciences, 114 (7), 1613–1618. https: //doi.org/10.1073/pnas.1605660114 Nielsen, R. (2000). Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics, 154 (2), 931–942. https : //doi.org/10.1093/genetics/154.2.931 Nielsen, R., Williamson, S., Kim, Y., Hubisz, M. J., Clark, A. G., & Bustamante, C. (2005). Genomic scans for selective sweeps using SNP data [Company: Cold Spring Harbor Laboratory Press Distributor: Cold Spring Harbor Laboratory Press Institution: Cold Spring Harbor Laboratory Press Label: Cold Spring Harbor Laboratory Press Publisher: Cold Spring Harbor Lab]. Genome Research, 15 (11), 1566–1575. https://doi.org/10.1101/gr. 4252305 156 Ohta, T., & Kimura, M. (1974). SIMULATION STUDIES ON ELECTROPHORETICALLY DETECTABLE GENETIC VARIABILITY IN a FINITE POPULATION. Genetics, 76 (3), 615–624. https://doi.org/10.1093/genetics/76.3.615 Orr, H. A. (2003). The distribution of fitness effects among beneficial mutations. Genetics, 163 (4), 1519–1526. https://doi.org/10.1093/genetics/163.4.1519 Pavlidis, P., Živković, D., Stamatakis, A., & Alachiotis, N. (2013). SweeD: Likelihood-based detection of selective sweeps in thousands of genomes. Molecular Biology and Evolution, 30 (9), 2224–2234. https ://doi .org/10. 1093/molbev/mst112 Phung, T. N., Huber, C. D., & Lohmueller, K. E. (2016). Determining the effect of natural selection on linked neutral divergence across species. PLOS Genetics, 12 (8), e1006199. https://doi.org/10.1371/journal.pgen.1006199 Pouyet, F., Aeschbacher, S., Thiéry, A., & Excoffier, L. (2018). Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences (K. Veeramah, P. J. Wittkopp & I. Gronau, Eds.). eLife, 7, e36317. https://doi.org/10.7554/eLife.36317 Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente- Galdos, B., Veeramah, K. R., Woerner, A. E., O’Connor, T. D., Santpere, G., Cagan, A., Theunert, C., Casals, F., Laayouni, H., Munch, K., Hobolth, A., Halager, A. E., Malig, M., Hernandez-Rodriguez, J., . . . Marques-Bonet, T. (2013). Great ape genetic diversity and population history. Nature, 499 (7459), 471–475. https://doi.org/10.1038/nature12228 Przeworski, M. (2002). The signature of positive selection at randomly chosen loci. Genetics, 160 (3), 1179–1189. https://doi.org/10.1093/genetics/160.3.1179 157 Ragsdale, A. P., Nelson, D., Gravel, S., & Kelleher, J. (2020). Lessons learned from bugs in models of human history. The American Journal of Human Genetics, 107 (4), 583–588. https://doi.org/10.1016/j.ajhg.2020.08.017 Ralph, P., Thornton, K., & Kelleher, J. (2020). Efficiently summarizing relationships in large samples: A general duality between statistics of genealogies and genomes. Genetics, 215 (3), 779–797. https://doi.org/10. 1534/genetics.120.303253 Rasmussen, M. D., Hubisz, M. J., Gronau, I., & Siepel, A. (2014). Genome-wide inference of ancestral recombination graphs [Publisher: Public Library of Science]. PLOS Genetics, 10 (5), e1004342. https://doi.org/10.1371/journal. pgen.1004342 Rodrigues, M. F., Kern, A. D., & Ralph, P. L. (2024). Shared evolutionary processes shape landscapes of genomic variation in the great apes. Genetics, iyae006. https://doi.org/10.1093/genetics/iyae006 Rodrigues, M. F., & Ralph, P. L. (2021). Vignette: Parallelizing SLiM simulations in a phylogenetic tree — PySLiM manual. Retrieved September 24, 2021, from https://tskit.dev/pyslim/docs/latest/vignette parallel phylo.html Rossi, E., Chamberlain, B., Frasca, F., Eynard, D., Monti, F., & Bronstein, M. (2020, October 9). Temporal graph networks for deep learning on dynamic graphs. https://doi.org/10.48550/arXiv.2006.10637 Sattath, S., Elyashiv, E., Kolodny, O., Rinott, Y., & Sella, G. (2011). Pervasive adaptive protein evolution apparent in diversity patterns around amino acid substitutions in drosophila simulans. PLoS genetics, 7 (2), e1001302. https: //doi.org/10.1371/journal.pgen.1001302 158 Scally, A., & Durbin, R. (2012). Revising the human mutation rate: Implications for understanding human evolution [Number: 10 Publisher: Nature Publishing Group]. Nature Reviews Genetics, 13 (10), 745–753. https : / / doi.org/10.1038/nrg3295 Schraiber, J. G., & Akey, J. M. (2015). Methods and models for unravelling human evolutionary history [Number: 12 Publisher: Nature Publishing Group]. Nature Reviews Genetics, 16 (12), 727–740. https://doi.org/10.1038/nrg4005 Schrider, D. R. (2020). Background selection does not mimic the patterns of genetic diversity produced by selective sweeps. Genetics, 216 (2), 499–519. https : //doi.org/10.1534/genetics.120.303469 Schrider, D. R., & Kern, A. D. (2016). S/HIC: Robust identification of soft and hard sweeps using machine learning. PLOS Genetics, 12 (3), e1005928. https: //doi.org/10.1371/journal.pgen.1005928 Schrider, D. R., & Kern, A. D. (2017). Soft sweeps are the dominant mode of adaptation in the human genome. Molecular Biology and Evolution, 34 (8), 1863–1877. https://doi.org/10.1093/molbev/msx154 Schrider, D. R., Shanku, A. G., & Kern, A. D. (2016). Effects of linked selective sweeps on demographic inference and model selection. Genetics, 204 (3), 1207–1223. https://doi.org/10.1534/genetics.116.190223 Sheehan, S., & Song, Y. S. (2016). Deep learning for population genetic inference [Publisher: Public Library of Science]. PLOS Computational Biology, 12 (3), e1004845. https://doi.org/10.1371/journal.pcbi.1004845 Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L. W., Richards, S., Weinstock, G. M., Wilson, R. K., Gibbs, R. A., Kent, W. J., Miller, W., & Haussler, D. (2005). 159 Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15 (8), 1034–1050. https://doi.org/10.1101/gr. 3715005 Simonsen, K. L., Churchill, G. A., & Aquadro, C. F. (1995). Properties of statistical tests of neutrality for DNA polymorphism data. Genetics, 141 (1), 413–429. https://doi.org/10.1093/genetics/141.1.413 Slotte, T. (2014). The impact of linked selection on plant genomic variation. Briefings in Functional Genomics, 13 (4), 268–275. https : //doi . org/10 . 1093/bfgp/elu009 Smith, J. M., & Haigh, J. (1974). The hitch-hiking effect of a favourable gene [Publisher: Cambridge University Press]. Genetics Research, 23 (1), 23–35. https://doi.org/10.1017/S0016672300014634 Smith, N., & Eyre-Walker, A. (2002). Adaptive protein evolution in drosophila. Nature, 415 (6875), 1022–1024. https://doi.org/10.1038/4151022a Smith, T. C. A., Arndt, P. F., & Eyre-Walker, A. (2018). Large scale variation in the rate of germ-line de novo mutation, base composition, divergence and diversity in humans. PLOS Genetics, 14 (3), e1007254. https://doi.org/10. 1371/journal.pgen.1007254 Speidel, L., Cassidy, L., Davies, R. W., Hellenthal, G., Skoglund, P., & Myers, S. R. (2021). Inferring population histories for ancient genomes using genome- wide genealogies. Molecular Biology and Evolution, 38 (9), 3497–3511. https: //doi.org/10.1093/molbev/msab174 Speidel, L., Forest, M., Shi, S., & Myers, S. R. (2019). A method for genome-wide genealogy estimation for thousands of samples [Number: 9 Publisher: Nature 160 Publishing Group]. Nature Genetics, 51 (9), 1321–1329. https://doi.org/10. 1038/s41588-019-0484-x Stankowski, S., Chase, M. A., Fuiten, A. M., Rodrigues, M. F., Ralph, P. L., & Streisfeld, M. A. (2019). Widespread selection and gene flow shape the genomic landscape during a radiation of monkeyflowers. PLOS Biology, 17 (7), e3000391. https://doi.org/10.1371/journal.pbio.3000391 Stevison, L. S., Woerner, A. E., Kidd, J. M., Kelley, J. L., Veeramah, K. R., McManus, K. F., Bustamante, C. D., Hammer, M. F., & Wall, J. D. (2016). The time scale of recombination rate evolution in great apes. Molecular Biology and Evolution, 33 (4), 928–945. https://doi.org/10.1093/molbev/ msv331 Torres, R., Stetter, M. G., Hernandez, R. D., & Ross-Ibarra, J. (2020). The temporal dynamics of background selection in nonequilibrium populations. Genetics, 214 (4), 1019–1030. https://doi.org/10.1534/genetics.119.302892 Torres, R., Szpiech, Z. A., & Hernandez, R. D. (2018). Human demographic history has amplified the effects of background selection across the genome. PLOS Genetics, 14 (6), e1007387. https://doi.org/10.1371/journal.pgen.1007387 Turner, T. L., Hahn, M. W., & Nuzhdin, S. V. (2005). Genomic islands of speciation in anopheles gambiae. PLoS Biology, 3 (9). https : / /doi . org / 10.1371/journal.pbio.0030285 Uricchio, L. H., & Hernandez, R. D. (2014). Robust forward simulations of recurrent hitchhiking. Genetics, 197 (1), 221–236. https : / / doi . org / 10 . 1534/genetics.113.156935 van Doren, B. M., Campagna, L., Helm, B., Illera, J. C., Lovette, I. J., & Liedvogel, M. (2017). Correlated patterns of genetic diversity and 161 differentiation across an avian family. Molecular Ecology, 26 (15), 3982–3997. https://doi.org/10.1111/mec.14083 Wakely, J. (2016, April 22). Coalescent theory: An introduction [Google-Books-ID: x30RAgAACAAJ]. Macmillan Learning. Wang, J., Street, N. R., Park, E.-J., Liu, J., & Ingvarsson, P. K. (2020). Evidence for widespread selection in shaping the genomic landscape during speciation of populus. Molecular Ecology, 29 (6), 1120–1136. https://doi.org/10.1111/ mec.15388 Williamson, R. J., Josephs, E. B., Platts, A. E., Hazzouri, K. M., Haudry, A., Blanchette, M., & Wright, S. I. (2014). Evidence for widespread positive and negative selection in coding and conserved noncoding regions of capsella grandiflora. PLoS Genetics, 10 (9). https://doi.org/10.1371/journal.pgen. 1004622 Williamson, S., & Orive, M. E. (2002). The genealogy of a sequence subject to purifying selection at multiple sites. Molecular Biology and Evolution, 19 (8), 1376–1384. https://doi.org/10.1093/oxfordjournals.molbev.a004199 Wohns, A. W., Wong, Y., Jeffery, B., Akbari, A., Mallick, S., Pinhasi, R., Patterson, N., Reich, D., Kelleher, J., & McVean, G. (2022). A unified genealogy of modern and ancient genomes [Publisher: American Association for the Advancement of Science]. Science, 375 (6583), eabi8264. https://doi. org/10.1126/science.abi8264 Won, Y.-J., & Hey, J. (2005). Divergence population genetics of chimpanzees. Molecular Biology and Evolution, 22 (2), 297–307. https : / /doi . org / 10 . 1093/molbev/msi017 162 Wong, Y., Ignatieva, A., Koskela, J., Gorjanc, G., Wohns, A. W., & Kelleher, J. (2023, November 4). A general and efficient representation of ancestral recombination graphs [Pages: 2023.11.03.565466 Section: New Results]. https://doi.org/10.1101/2023.11.03.565466 Y C Brandt, D., Wei, X., Deng, Y., Vaughn, A. H., & Nielsen, R. (2022). Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics, 221 (1), iyac044. https : / / doi . org / 10 . 1093/genetics/iyac044 Zhang, B. C., Biddanda, A., Gunnarsson, Á. F., Cooper, F., & Palamara, P. F. (2023). Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits [Number: 5 Publisher: Nature Publishing Group]. Nature Genetics, 55 (5), 768–776. https://doi.org/10. 1038/s41588-023-01379-x Zhen, Y., Huber, C. D., Davies, R. W., & Lohmueller, K. E. (2021). Greater strength of selection and higher proportion of beneficial amino acid changing mutations in humans compared with mice and drosophila melanogaster. Genome Research, 31 (1), 110–120. https://doi.org/10.1101/gr.256636.119 163