Harnessing Single Cell RNA Sequencing to Uncover Developmental Genetic Underpinnings of Syngnathid Fishes’ Derived Traits by Hope Meredith Healey A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biology Dissertation Committee: Nadia Singh, Chair William Cresko, Advisor Matthew Barber, Core Member John Postlethwait, Core Member Kirstin Sterner, Institutional Representative University of Oregon Fall 2024 2 © 2024 Hope Meredith Healey This work is openly licensed via CC BY. 3 DISSERTATION ABSTRACT Hope Meredith Healey Doctor of Philosophy in Biology Fall 2024 Title: Harnessing Single Cell RNA Sequencing to Uncover Developmental Genetic Underpinnings of Syngnathid Fishes’ Derived Traits Syngnathid fishes are extraordinary creatures with highly derived features. Notably, these fishes have elongated, toothless snouts, male pregnancy, and exoskeletons. In addition to their highly altered morphologies, syngnathids have lost highly conserved signaling genes (fgf3 and 4), transcription factors (eve1), and tooth mineralization genes (scpp genes). Although these genomic changes are predicted to contribute to the development, or lack thereof, of syngnathid’s derived traits, it is unknown whether this is the case. To investigate developmental origins of syngnathid’s highly altered traits, I harnessed single cell RNA sequencing (scRNAseq) methods for my dissertation research. First, I identified limitations for completing scRNAseq analysis in non-traditional model organisms, then overcame these challenges using scISOseq to improve gene annotations. Then, I used these methods to study the unique craniofacial morphology (elongate head and loss of teeth) of syngnathids. I identified the cell types present and signaling genes expressed at the start of craniofacial elongation in Gulf pipefish. Additionally, I found unique changes in signaling gene expression patterns during early craniofacial development of Gulf pipefish. This thesis discovered local modifications (select Fgf, Wnt, and Bmp genes) in pipefish craniofacial development, but found no evidence for global gene network re-wiring. Overall, this study suggests that syngnathid’s highly derived heads evolved from local network re-wiring. This dissertation includes previously published and unpublished co-authored material. 4 ACKNOWLEDGMENTS I am extraordinarily lucky to have worked with and been supported by amazing people during my dissertation. My advisor, Dr. Bill Cresko, has been instrumental for my scientific growth and achievement; he has always supported my scientific goals and pushed me to grow as a researcher. My advancement would not have been possible without my role models, Dr. Susie Bassham and Dr. Emily Beck, who showed me how to be a strong, successful woman in science, provided scientific guidance, and extended opportunities to me. I am also lucky to have worked with Dr. Clay Small, who has always provided key scientific input. Additionally, I thank Mark Currey and Tiffany Thornton (and many undergraduate assistants) for taking excellent care of all the fish used in this dissertation. I was fortunate to ford through graduate school with other Cresko Lab graduate students: Sophia Frantz, Shannon Snyder, Carla Campos, and Amberly Buer. Additionally, I am grateful for all the undergraduates who supported my work: Starla Chambrose, Thomas Skates, Micah Woods, Vithika Goyal, and Hayden Penn. Lastly, I want to thank rotation students who have inspired and assisted with the work from these chapters: LyAndra Lujan, Clara Rehmann, and Aubrey Mayer. I am grateful for the support and guidance of my committee: Dr. Kirstin Sterner, Dr. Nadia Singh, Dr. Matt Barber, and Dr. John Postlethwait. Additionally, I am lucky to have had a second home in the Bioinformatics and Genomics Master’s Program, and thank Dr. Stacey Wagner, Dr. Leslie Coonrod, Pete Batzel, and Jason Sydes. I am also grateful for the BGMP cohort from 2019-2020, especially Wyatt Eng, who grew with me as bioinformaticians. I also want to thank our wonderful collaborators from University of Idaho, Dr. Adam Jones and Dr. Balan Ramesh, for truly making this work possible. I am grateful for additional collaborators and scientific mentors including Dr. Thomas Desvignes, Dr. Angel Amores, Dr. Martin Stervander, Dr. Ingo Braasch, Dr. Andrew Thompson, and Dr. Emily Rose. I would further like to thank my community here in the Institute of Ecology and Evolution and the Department of Biology. I am grateful for support from the Genetics Training Program, the GrEBES student group, and zebrafish groupie. But most importantly, I am grateful for my IE2 cohort: Victoria Caudill, Gabrielle Coffing, Aidan Short, and Caitlin Smith. I am also thankful for crucial administrative support from Sara Nash, Arlene Deyo, and Leah Frazier. 5 Finally, I am thankful for Women in Graduate Sciences, which brought together a supportive community ready to make positive impacts on our local community. I am grateful for the funding that made this research possible. Specifically, funding from the Genetics Training Program (NIH T32GM007413), F31 funding from NIDCR (5F31DE032559-02), University of Oregon Research Excellence funds (Bill Cresko), and a National Science Foundation Grant (OPP-2015301; Bill Cresko, Susie Bassham, and Clay Small). Finally, I want to thank all my loved ones for supporting and making this dream possible. 6 DEDICATION For my grandfather, who gave me a love for learning 7 TABLE OF CONTENTS Chapter Page I. INTRODUCTION .................................................................................................... 14 Introduction ........................................................................................................... 17 References ............................................................................................................. 17 II. SINGLE-CELL ISO-SEQUENCING ENABLES RAPID GENOME ANNOTATION FOR SCRNASEQ ANALYSIS ........................................................................................... 18 Introduction ........................................................................................................... 18 Materials and Methods .......................................................................................... 19 Tissue Dissociation to Generate a Pool of Single Cells .................................. 19 Library Preparation .......................................................................................... 20 PacBio Data Processing ................................................................................... 20 Annotation Analysis ........................................................................................ 22 ScRNAseq Analysis ........................................................................................ 22 Results and Discussion .......................................................................................... 23 ScISOr-Seq Captured Novel Isoforms and Improved 3’ UTR ........................ 23 ScISOr-Seq Improved the Number of Reads Retained for ScRNAseq ........... 25 ScISOr-Seq Enhances Biological Interpretation of Cell Clusters ................... 28 Conclusion ............................................................................................................. 30 References ............................................................................................................. 32 Bridge .................................................................................................................... 36 8 Chapter Page III. SINGLE CELL RNA SEQUENCING PROVIDES CLUES FOR THE DEVELOPMENTAL GENETIC BASIS OF SYNGNATHID FISHES EVOLUTIONARY ADAPTATIONS ..................................................................................................................................... 37 Introduction ........................................................................................................... 37 Materials and Methods .......................................................................................... 40 Single Cell RNA Sequencing Libraries Preparation ....................................... 40 Single Cell Atlas Construction ........................................................................ 41 Single Cell Atlas Cluster Identification ........................................................... 42 Differentiation State Analysis ......................................................................... 44 Single Cell KEGG Analysis ............................................................................ 44 Single Cell Atlas Network Analysis ................................................................ 44 In situ Hybridization ........................................................................................ 45 Bone and Cartilage Staining ............................................................................ 46 Gene Cluster Analysis ..................................................................................... 46 Results ................................................................................................................... 47 Valuable cRNAseq Atlas for Studying Syngnathid Development .................. 47 Discovery of Cell Cluster Function and State Using KEGG Analysis ........... 49 Commonalities of Cell Clusters, Unique Networks, and Elusive Cell Types Identified in Network Analysis ............................................................................................ 51 Conserved Signaling Pathways Are Active during Syngnathid Craniofacial Development.. .................................................................................................. 53 9 Gulf Pipefish Retain Tooth Development Genes but Likely Lack Onset of Tooth Development .................................................................................................... 55 Tooth and Skeletal Genes Are Expressed during Dermal Armor Development 58 Epithelial Expression of Immune and Nutrient Processing Genes May Facilitate Embryo- Paternal Interactions in the Brood Pouch ........................................................ 59 Discussion .............................................................................................................. 61 Our atlas Represents a Novel Resource for Evo-Devo Research .................... 62 Conserved Pathways May Contribute to Derived Syngnathid Heads ............. 62 No Evidence of Primordia Suggests Changes in Early, Not Late, Tooth Development Are Likely at the Root of Evolutionary Tooth Loss ............................................... 64 Redeployment of the Bone Gene Network to Build Dermal Armor ............... 65 Signatures of Embryonic Interactions within the Novel Pouch Environment 65 Conclusion ............................................................................................................. 67 References ............................................................................................................. 68 Bridge .................................................................................................................... 81 IV. REDUNDACY AND PATHWAY EVOLUTION IN SYNGNATHID FISH CRANIOFACIAL GENE NETWORKS .................................................................... 82 Introduction ........................................................................................................... 82 Materials and Methods .......................................................................................... 84 Production of ScRNAseq Sequencing Libraries ............................................. 84 Production of ATACseq Sequencing Libraries ............................................... 85 ScRNAseq Library Initial Processing ............................................................. 85 Ortholog Identification .................................................................................... 85 10 ScRNAseq Atlas Construction ........................................................................ 86 ScRNAseq Trajectory Analysis ....................................................................... 87 ScRNAseq Differentially Expressed Gene Analysis ....................................... 87 ScRNAseq Gene Network Analysis ................................................................ 87 ScRNAseq Cross-Species Correlation Analysis ............................................. 87 ATACseq Read Processing and Analysis ........................................................ 88 Whole Genome Alignment and Identification of Conserved Elements .......... 88 In situ Hybridization ........................................................................................ 89 Results ................................................................................................................... 89 Successful Integration of Gulf Pipefish, Stickleback, and Zebrafish Single Cell Dataset… ............................................................................................................................... 89 Cranial Neural Crest Pharyngeal Arch Core Gene Regulatory Network is Largely Conserved in Gulf Pipefish ............................................................................. 92 Gene Losses and Potential Novel Genes in Gulf Pipefish Pharyngeal Arch Cells Alter Gene Regulatory Networks ............................................................................. 94 GENESPACE Analysis Identifies Candidates for Broader Syngnathid Evolution… ............................................................................................................................... 95 Numerous Candidate Genes Identified from Global Analyses, but few are Promising for Craniofacial Evolution ..................................................................................... 96 Gulf Pipefish has Novel Fgf Expression in Pharyngeal Arch Cells ................ 96 Potential Changes in Gene Sequence and Regulation may Lead to Gulf Pipefish Expression Differences .................................................................................... 98 Discussion .............................................................................................................. 100 11 Extensive Teleost Genome Duplication Paralog Evolution but no Evidence for Global Re-wiring in Pipefish CNC ............................................................................. 100 Local Gene Regulatory Network Re-wiring through Co-option of Fgf23 in Pipefish… ............................................................................................................................... 101 Evolution of Redundancy in Fgf pathway may Enable Genome Evolution ... 102 References ............................................................................................................. 104 V. CONCLUSION ...................................................................................................... 112 APPENDICES ............................................................................................................. 115 A. SUPPORTING INFORMATION FOR CHAPTER II ..................................... 115 B. SUPPORTING INFORMATION FOR CHAPTER III .................................... 117 C. SUPPORTING INFORMATION FOR CHAPTER IV .................................... 198 12 LIST OF FIGURES Figure Page 2.1 SQANTI3 Classification of ScISOr-Seq Data Revealed that the Majority of Isoforms were Previously Unannotated in the Stickleback Genome .................................................. 24 2.2 ScISOr-Seq and Bulk Iso-Seq Captured a Limited Number of Annotated Transcripts, but Both Resulted in an Increase in Reads Mapping to the Transcriptome ....................... 25 2.3 ScISOr-Seq Improved Gene Models by Extending the 3’UTRs Illustrated Here by the Two Models for Nkx2.3 ....................................................................................................... 26 2.4 Annotation improvements using Iso-Seq (bulk Iso-Seq or ScISOr-Seq) data extended genes both 5’ and 3’, with 3’ expansions leading to greater numbers of reads counted. ...... 27 2.5 Using the Updated Gene Models from Our Merged and Pooled Dataset, We Identified Expected Cell Types in 70hpf Stickleback ScRNAseq Data and Illustrated How Annotations Improvements Are Relevant for Analysis ................................................................... 29 3.1 Gulf Pipefish Exemplify Syngnathid Derived Traits ............................................ 40 3.2 Gulf Pipefish Single Cell Atlas Contains Cells from the Entire Embryo and Identifies Genetic Pathways Active in Different Cell Types .................................................................... 48 3.3 Weighted Gene Network Analysis (WGCNA) Identifies Gene Modules that Define and Unite Cell Clusters ................................................................................................................ 50 3.4 Conserved Cell Types and Gene Pathways Build Unique Faces of Syngnathids. 52 3.5 Pipefish Do Not Possess Identifiable Tooth Primordium Cells, but Continue to Express Tooth Development Genes in Other Contexts ....................................................................... 57 3.6 Tooth and Bone Development Genes Expressed during Exoskeleton Development 59 3.7 Gene Expression Signatures Suggest Embryonic Interactions within Brood Pouch Environment ................................................................................................................ 60 4.1 Two Hypotheses for Syngnathid Craniofacial Development and Evolution ........ 83 13 4.2 Integrated Atlas Encompasses Cells from the Entire Embryo .............................. 91 4.3 Number of Cells per Cluster Varied by Species .................................................... 92 4.4 Pharyngeal Arch Cells Express Highly Conserved Core Genetic Program .......... 94 4.5 Novel Fgf and Wnt Pathway Expression Changes in Gulf Pipefish CNC Pharyngeal Arch Cells. ........................................................................................................................... 98 4.6 Potential Changes in Gene Sequence and Regulation may Lead to Gulf Pipefish Expression Differences .................................................................................................................. 99 14 CHAPTER I INTRODUCTION The beauty of life is one of evolution’s greatest mysteries. Solving life’s riddles may be impossible, but uncovering the underlying mechanisms that create the incredible is worth it. In this thesis, I attempted to understand the extraordinary: the development and evolution of highly derived traits. Highly derived traits appear across the diversity of life and often allow species to adapt to new niches. Although the ecological and evolutionary consequences of these traits is occasionally clear, their developmental genetic underpinnings are often challenging to discern. In practice, finding these needles in the haystack requires comparisons to close outgroups, hypothesis driven sampling, and different modalities of data. New genome assemblies have recently enabled these types of analyses in the charismatic clade of fishes known as syngnathids. These popular fishes include seahorses, pipefishes, and seadragons. They are known for their derived traits including male pregnancy, dermal armor, elongated snouts, and jaws without teeth. However, the developmental basis of these traits is unknown. Syngnathids’ genomes have been extensively studied to identify candidates for their trait evolution, and numerous candidates have been uncovered through these approaches. Notably, syngnathids have lost fgf3 and fgf4, two deeply conserved signaling genes involved in craniofacial development (Small et al. 2022). The loss of fgf3 is particularly intriguing given that the absence of fgf3 in zebrafish is lethal (Herzog et al. 2004). Additionally, syngnathids have lost genes related to tooth development including eve1 and most scpp genes (Small et al. 2016; Lin et al. 2016; Zhang et al. 2020). Beyond these protein coding losses, conserved non-coding elements (including in dlx1a and near dlx2a) and miRNAs (including mir130 clusters and mir196b) are also missing in syngnathids (Small et al. 2015; Small et al. 2022). Connecting these genomic losses with syngnathid craniofacial evolution requires transcriptomic data to uncover whether these losses led to shifts in developmental networks. To probe developmental transcription patterns in syngnathid fishes, I used a sequencing method called single cell RNA sequencing (scRNAseq). This technique determines the RNA expression profile of individual cells from a sample. Using these data, cell identities can be fleshed out post-hoc. Therefore, using this approach in non-traditional systems, including 15 syngnathids, is powerful because expression patterns of cell types can be deduced without dissection or genetic modifications. These datasets can be used to determine the cell types present at different stages of development, expression patterns of signaling pathway genes, and detect genetic networks. In Chapter II, I identified a limitation of scRNAseq in non-traditional organisms and solved this problem using single cell ISO-sequencing (scISOseq). Although numerous developmental scRNAseq atlases exist for model systems, there are few atlases in non-model systems. Due to the 3’ bias of scRNAseq reads, many sequencing reads land in the 3’ UTR of gene models which are often poorly annotated in non-model organisms. I addressed this limitation using scISOseq, a sequencing method that used leftover cells from the scRNAseq library to produce full length RNA reads. Using this method, I lengthened 3’ UTR annotations in the stickleback genome assembly leading to a 26.1% improvement in the number of scRNAseq reads counted. These findings suggest that scRNAseq is feasible in non-model systems but requires careful planning to ensure proper read quantification. This chapter has been published and is co-authored with Dr. Susan Bassham and Dr. William Cresko (Healey et al. 2022). In Chapter III, I explored the developmental genetic basis of numerous syngnathid derived traits using scRNAseq. Although genomic candidates have been identified for their derived traits, there are few studies that examine how these changes relate to development. In this study, I developed a scRNAseq atlas that probed the elongate snout, loss of teeth, dermal armor, and male pregnancy (from the embryonic lens). Using Gulf pipefish embryos from early craniofacial elongation, I identified candidate genes (sfrp1a, bmp4, and prdm16) expressed in mesenchyme above the lengthening ethmoid plate in the face. Next, I searched for tooth primordial cells to determine whether tooth development initiates in syngnathids. Although I did not find these tooth primordial cells, expression of select tooth genes (dlx3b and scpp1) was observed in the developing dermal armor. Finally, I identified an enrichment of immune and metabolism genes in the embryo epidermis, potentially indicating nutrient uptake from the brood pouch. These findings suggest that syngnathid highly derived traits emerge from re-deployment of signaling pathways. This chapter has been published and is co-authored with Hayden Penn, Dr. Clay Small, Dr. Susan Bassham, Micah Woods, Vithika Goyal, and Dr. William Cresko (Healey et al. 2024). 16 In Chapter IV, I asked whether syngnathids have changes in the pharyngeal arch cells during early craniofacial development that could relate to their unique morphology. In laboratory and evolutionary models, changes in gene expression or numbers of cranial neural crest cells that build the pharyngeal arches can lead to changes in head shapes. Since syngnathids have an evolutionary knockout of fgf3, a gene essential during pharyngeal arch development, I probed the extent of genetic network changes in syngnathid pharyngeal arch cells. Using comparative scRNAseq, I identified that Gulf pipefish have largely conserved pharyngeal arch gene networks. Furthermore, I discovered that fgf22 from the fgf3 sub-family had evolved pharyngeal arch expression prior to the loss of fgf3 in syngnathids. These findings suggest that redundancy permitted the loss of fgf3 in syngnathids and, furthermore, there has not been extensive craniofacial gene regulatory network re-wiring in these fishes. This chapter is co-authored with Clara Rehmann, Dr. Susan Bassham, and Dr. William Cresko. Together, these chapters form the foundation for successful developmental scRNAseq atlases in non-traditional systems. Chapters III and IV contribute significantly to the syngnathid literature by providing the first transcriptomic insights into their unique development. 17 References Healey, Bassham, Cresko. 2022. Single-cell Iso-Sequencing enables rapid genome annotation for scRNAseq analysis. Genetics 220 (3): iyac017. Healey, Penn, Small, Bassham, Goyal, Woods, Cresko. 2024. Single Cell RNA Sequencing Provides Clues for the Developmental Genetic Basis of Synganthidae’s Evolutionary Adaptations. eLIFE. Herzog, W., Sonntag, C., von der Hardt, S., Roehl, H. H., Varga, Z. M., & Hammerschmidt, M. (2004). Fgf3 signaling from the ventral diencephalon is required for early specification and subsequent survival of the zebrafish adenohypophysis. Development, 131(15), 3681–3692. Lin, Q., Fan, S., Zhang, Y., Xu, M., Zhang, H., Yang, Y., Lee, A. P., Woltering, J. M., Ravi, V., Gunter, H. M., Luo, W., Gao, Z., Lim, Z. W., Qin, G., Schneider, R. F., Wang, X., Xiong, P., Li, G., Wang, K., … Venkatesh, B. (2016a). The seahorse genome and the evolution of its specialized morphology. Nature, 540(7633), 395–399. Small, C. M., Bassham, S., Catchen, J., Amores, A., Fuiten, A. M., Brown, R. S., Jones, A. G., & Cresko, W. A. (2016). The genome of the Gulf pipefish enables understanding of evolutionary innovations. Genome Biology, 17(1), 258. https://doi.org/10.1186/s13059-016- 1126-6 Small, Clayton M., Healey, H. M., Currey, M. C., Beck, E. A., Catchen, J., Lin, A. S. P., Cresko, W. A., & Bassham, S. (2022). Leafy and weedy seadragon genomes connect genic and repetitive DNA features to the extravagant biology of syngnathid fishes. Proceedings of the National Academy of Sciences of the United States of America, 119(26), e2119602119. Zhang, Y., Ravi, V., Qin, G., Dai, H., Zhang, H., Han, F., Wang, X., & Liu, Y. (2020a). Comparative genomics reveal shared genomic changes in syngnathid fishes and signatures of genetic convergence with placental mammals ! Phylogenomics and evolutionary rate Transposable elements in syngnathids Expansion and contraction of gene families in t. National Science Review, 7(6), 964–977. 18 CHAPTER II SINGLE-CELL ISO-SEQUENCING ENABLES RAPID GENOME ANNOTATION FOR SCRNASEQ ANALYSIS From Healey, Bassham, Cresko. 2022. Single-cell Iso-Sequencing enables rapid genome annotation for scRNAseq analysis. Genetics 220 (3): iyac017. This chapter was published in Genetics. Susan Bassham and William Cresko are co-authors on this publication. The project originated from conversations between William Cresko and me. I prepared the sample for sequencing, performed the scRNAseq and scISOr-Seq analysis, and took the lead on writing the manuscript. Susan Bassham contributed to the analysis and made writing contributions. William Cresko oversaw the analysis, contributed to writing, and acquired funding. The full supplemental material for this publication can be found at https://doi.org/10.25386/genetics.16811614.v1 Introduction Single cell RNA sequencing (scRNAseq) is a revolutionary technique in biology that provides expression information from tissues and embryos (Shapiro et al. 2013). By barcoding RNA from individual cells directly from dissociated samples, scRNAseq allows for post hoc analysis of cell types and can be used to ascertain novel cell populations, explore developmental trajectories, and define gene regulatory networks (Luecken and Theis 2019). To maximize the utility of scRNAseq datasets, however, 3’ UTRs must be annotated for several reasons. ScRNAseq captures transcripts through poly(A) tails leading to a 3’ bias in coverage (Hwang et al. 2018). Partially annotated genes may be represented in a dataset with lower read counts, leading to erroneous conclusions regarding their magnitude of expression across cell types. In addition, downstream scRNAseq analysis clusters cell types through the determination of a covariance structure across highly variable genes (Stuart et al. 2019). The systematic absence of such genes can lead to inferential errors in multivariate analyses and obscure biological reality. 19 The annotations of even intensively studied models, such as mice and zebrafish, continue to be improved (Gupta et al. 2018; Lawson et al. 2020). Most other organismal genome assemblies are even less well annotated. Facilitating a broader utility of scRNAseq requires more efficient methods for 3’ UTR annotation. Recently, full length, single-molecule isoform sequencing (Iso-Seq) has been used to improve genome annotations (Kuo et al. 2017; Beiki et al. 2019; Kuo et al. 2020; Ali et al. 2021). PacBio’s Iso-seq has been further adapted to use the 10x Genomics platform for scRNA barcoding (ScISOr-Seq) to track cell type specific isoform expression (Gupta et al. 2018; Zheng et al. 2020). Here we show that ScISOr-Seq in the context of a scRNAseq experiment allows rapid 3’UTR annotation in threespine stickleback fish (Gasterosteus aculeatus). This fish has long been a focus of study in behavior, ecology and evolution (Bell and Foster 1994; Colosimo et al. 2004; Cresko et al. 2004a; Shapiro et al. 2004; Cresko et al. 2007; Hohenlohe et al. 2010; Reid et al. 2021), and is now a nascent system for biomedical research (Miller et al. 2007; Gardell et al. 2017; Small et al. 2017; Beck et al. 2020; Beck et al. 2021; Fuess et al. 2021). Although stickleback have a well assembled genome, its 3’ UTR annotations are incomplete which limits scRNAseq’s utility. We demonstrate that a single PacBio SMRT cell of ScISOr-Seq data is sufficient to significantly improve the stickleback annotations to an extent on par with zebrafish for the purpose of scRNAseq analysis at this stage. Our findings demonstrate that ScISOr-Seq will be a useful tool to efficiently improve genome annotations for scRNAseq in many organisms. Materials and Methods Tissue Dissociation to Generate a Pool of Single Cells We crossed a laboratory line of stickleback originally isolated from Cushman Slough (Oregon) and raised embryos to 70 hours post fertilization (hpf) at 20°C using standard procedures from the Cresko Laboratory Stickleback Facility (Cresko et al. 2004b). 70hpf stickleback embryos are developmentally at an equivalent stage to 24 hpf zebrafish (Cresko et al. 2003). We euthanized 36 embryos in MS-222 following IACUC approved procedures then dechorionated and deyolked them at room temperature. We limited our embryo dissection to 20 minutes based on (Farrell et al. 2018). Following protocols from (Bresciani et al. 2018), we dissociated the cells for 6 minutes in 0.25% trypsin in PBS at 30°C, pipetting up and down every 20 30 seconds, then, stopped the dissociation using 10% FBS DMEM and spun down cells at 400xG for 3min at 4°C. We resuspended cells in 1mL of 0.1% BSA PBS, centrifuged at 400xG for 3min at 4°C, and resuspended in 100ul 0.04% BSA in PBS. At room temperature, we filtered cells through a 40uM cell strainer (Thomas scientific 1181X52) then washed the original tube twice with 100ul of 0.04% BSA in PBS and poured over the same cell strainer. Library Preparation The scRNAseq and ScISOr-Seq libraries were prepared and sequenced by the University of Oregon Genomics and Cell Characterization Core Facility (https://gc3f.uoregon.edu). The dissociated cells were diluted to 800 cells/ul using 0.04% BSA in PBS to target 10,000 cells for the 10X Genomics Single Cell 3' Genome Expression (GEX) mRNA-Seq prep with v3.1 NextGem chemistry. The single 10X preparation was split to create the scRNAseq and ScISOr- Seq libraries. The scRNAseq sample was sequenced on 1/7th of a single S4 lane on a NovaSeq 6000 platform (Illumina). For the ScISOr-Seq library, 400ng of the amplified 10X cDNA was used as input for the SMRTbell Express Template Prep Kit 2.0 (P/N 100-938-900) without reamplification. Sample specific barcode (ATATAGCGCGCGTGTG) was added using the Barcoded Adapter Kit 8B—OVERHANG (P/N 101-628-500). The ScISOr-Seq library was sequenced on a single SMRT Cell 8M on the PacBio Sequel II platform using the v4 primer, v2.1 polymerase, 1 hour binding, 30-hour movie, and 2-hour pre-extension time at a loading concentration of 100pM. PacBio Data Processing The University of Oregon Genomics and Cell Characterization Core Facility (https://gc3f.uoregon.edu) generated “circular consensus” reads for our ScISOr-Seq dataset (ccs - j 39 --min-passes 3 --min-snr 2.5 --min-length 10 --max-length 50000 --min-rq 0.99; v6.6.0) and used lima (-j 39 –isoseq; v2.2.0) to remove a sample specific barcode (ATATAGCGCGCGTGTG) as a part of the PacBio SMRT Analysis software (Supplemental file 6). After barcode clipping, we used a custom ScISOr-Seq processing script (scISOr_Seq_processing.py) that removed the sample primers (3p: CTACACGACGCTCTTCCGATCT; 5p: CCCATGTACTCTGCGTTGATACCACTGCTT), removed and saved cell and UMI barcodes, removed poly(A) tails, then filtered out duplicated reads. This script outputs reads that had the expected presence and orientation of primers and 21 polyA tails for downstream analysis. The script additionally outputs reads without the expected primers in a separate file; however, these other reads were not used for further analysis. We aligned reads to the stickleback genome (BROAD S1, 104.1 database version) using minimap2 (v2.7; parameters: -ax splice, -uf, -06,24, -B4; (Li 2018). We clustered for unique transcripts with collapse_isoforms_by_sam.py script (c=.99, i=.95; (Tseng 2021) from cDNA Cupcake (v27.0.0), classified transcripts with the sqanti3_qc.py script from SQANTI3 (v4.2), and refined transcripts with sqanti3_RulesFilter.py script (Tardaguila et al. 2018). We used the sqanti_classification.filtered_lite.gtf that resulted from the refining step (Supplemental file 2) to run Cell Ranger from 10x Genomics (v3.2.0) as described below. We downloaded the raw data from (Naftaly et al. 2021) from NCBI. This adult Iso-Seq dataset contains gonad, pronephros, brain, and liver reads from both sexes, a total of 16 SMRT cells. We processed reads from this dataset with the following steps: we generated circular consensus reads with ccs (--min-rq=.9; v6.0.0), clipped barcodes listed in Naftaly, Pau, and White (2021) with lima (--isoseq –dump-clips; v2.2.0), removed polyA tails with isoseq3 refine (--require-polyA; v3.4.0-0), and clustered reads with isoseq3 cluster (v3.4.0-0) from the PacBio SMRT Analysis software (Supplemental file 6). For the analysis with solely these data, we aligned reads to the stickleback genome (BROAD S1, 104.1 database version), collapsed transcripts with cDNA Cupcake (v28.0.0), and filtered and classified with SQANTI3 (v4.2) as done with the ScISOr-Seq data. Again, we used the sqanti_classification.filtered_lite.gtf from SQANTI3 (Supplemental file 3) to run Cell Ranger as described below. Due to reduced novel and increased annotated genes relative to (Naftaly et al. 2021), we additionally tried using minimap2-2.15 with -ax splice -uf -C5 -secondary=no to match their parameters; however, we observed negligible changes. We propose that differences in our analyses arise from different earlier processing steps, different versions of stickleback gene models, and an updated version of SQANTI. We pooled both sets of Iso-Seq reads and merged it with existing Ensembl annotations to create an improved version of the Ensembl annotations. Previously, we created a modified version of the Ensembl annotations where we extended the 3’ UTRs (Supplemental files 7,8) for several marker genes (tbx16, sox10, sox32, and eya1) and fgf/fgfr genes (fgfr1a, fgfr1b, fgfrl1a, FGFRL1, fgfr2, fgfr3, fgfr4, fgf3, fgf4, fgf16, fgf17, fgf6, fgf6a, fgf8a, and fgf8b). Because we lacked Iso-Seq reads for fgf4, we used this modified gtf for the merging (Supplemental file 8). 22 We completed the processing of the ScISOr-Seq and bulk Iso-Seq data separately as described above then combined the files prior to alignment. We aligned the combined files with minimap2, collapsed transcripts with cDNA cupcake, and refined and classified isoforms with SQANTI3 as explained above. We merged the sqanti_classification.filtered_lite.gtf resulting from SQANTI3 with the Ensembl annotation using TAMA (Kuo et al. 2020). To prepare the gtf files for TAMA Merge, we converted them to bed files using bedparse gtf2bed (--extraFields gene_id). Then, modified the output with awk to rearrange the columns of the bed file such that the gene id was separated from the transcript id by a semicolon (awk -v OFS=’\t’ ‘{print $1,$2,$3, $13 “;” $4, $5, $6,$7,$8,$9,$10,$11,$12}’). We merged with TAMA’s script tama_merge.py (-s ensembl -cds ensembl -d merge_dup). We applied the ensembl gene names to genes with corresponding IDs using a custom script (tama_associating_ensembl_ids_with_genes.py). For our final gtf file, we converted the output bed file back to a gtf with TAMA’s tama_convert_bed_gtf_ensembl_no_cds.py (Supplemental file 4). We also generated a gtf file specifically for scRNAseq analysis where mitochondrial genes started with “MT” by transforming the bed file with a custom script (tama_associating_ensembl_ids_with_genes_for_scRNAseq.py) and then converting the output back to a gtf in the same fashion as above (Supplemental file 5). Annotation analysis We assessed the Iso-Seq improvements to Ensembl gene models in two ways: 1) differences in starting and ending positions and 2) differences in reads captured. We compared the 5’ start and 3’ end of Ensembl and Iso-Seq generated gene models with a custom script written by H.M.H. (tama_observations_bed.py). Using the same custom script, we also compared the reads present in the first and last exons between Ensembl and Iso-Seq gene models. scRNAseq analysis We quantified gene counts using 10x Genomics Cell Ranger v3.0.2 (Supplemental file 1). We generated a reference using mkgtf and mkref, retaining protein coding and non-protein coding genes, and then aligned and counted reads using the stickleback genome (BROAD S1, 104.1). We analyzed the counts using the Seurat package (v3.2.3; (Stuart et al. 2019) on R (v4.0.2). We retained all cells for the analysis. We normalized the counts using SCTransform and 23 regressed out the mitochondrial genes using the glmGamPoi method (Hafemeister and Satija 2019). Based on the inflection point in the elbow plot of the PCA results, we chose 38 dimensions for generating the UMAP and identifying clusters. We identified cluster identities using three marker gene approaches. First, we searched for marker genes identified in Farnsworth et al. (2020) for specific cell types found in a similar developmental stage from zebrafish (24hpf in zebrafish). Next, we identified markers with Seurat’s FindAllMarkers searching for genes that were positively upregulated in each cluster compared to the other identities, in a minimum of 25% of cells, and a log fold change threshold of 0.25. Finally, we identified markers with Seurat’s FindMarkers searching for genes that were positively upregulated in each cluster versus all the other cells. For the second and third approaches, we searched for the homologous gene in zebrafish and used expression data from zfin to predict which cell types expressed the gene. To quantify frizzled gene counts, we used FetchData to isolate raw counts for each gene and the number of cells expressing the gene. To determine the number of cell clusters each frizzled gene was expressed in, we saved a dotplot, subsetted for clusters where the percent of cells expressing the gene was greater than 10%, and then counted the number of clusters left for each gene. Results and Discussion ScISOr-Seq captured novel isoforms and improved 3’ UTRs 24 A ScISOr-Seq library was produced using dissociated cells from 70 hpf stickleback embryos. The ScISOr-Seq reads were classified with SQANTI3 (Tardaguila et al. 2018) using Ensembl gene models to describe isoforms based on how well they match splice variants (Figure 2.1A). The most common structural category (45.15% of isoforms) was “novel not in catalog” containing at least one new splicing site relative to the existing annotation (Figure 2.1A and 2.1B). 17.42% of collapsed isoforms were “full splice matches” (FSM) that contained splicing sites and exons present in the Ensembl annotation but allowed differences in 5’ or 3’ ends. Notably, only 3% of the unique isoforms matched the existing gene annotations, while 97% improved existing or added gene models. Confirming that 3’ UTRs were poorly annotated, 46.2% FSM isoforms had alternative 3’ ends and 27.4% had alternative 3’ and 5’ ends relative to Figure 2.1: SQANTI3 classification of ScISOr-Seq data revealed that the majority of isoforms were previously unannotated in the stickleback genome. A) SQANTI3 categorizes isoforms based on their match to the reference, the major categories are shown here. B) the distribution of ScISOr-Seq isoforms in the major SQANTI3 classes illustrating that the bulk of isoforms are novel not in catalog. C) within the novel not in catalog category, isoforms are largely coming from cases where there is at least one new splicing donor or acceptor with some coming from cases of intron retention (for example, nkx2.3 isoforms from Figure 2). D) the majority of isoforms that are full splice matches have at least an alternative 3’ site while few are complete reference matches. A B C Full Splice Match Novel not in catalog Annotated transcript Incomplete splice match Novel in catalog D Genic genomic Antisense 25 Ensembl gene models (Figure 2.1C). Therefore, the existing stickleback annotation - in addition to having incomplete 3’ UTRs - is also missing many splice variants. Our work indicates that additional ScISOr-Seq or bulk Iso-Seq is necessary to capture these variants as well as prune erroneous transcript models from the Ensembl annotation. ScISOr-Seq improved the number of reads retained for scRNAseq We initially tested how well the existing stickleback gene models from Ensembl (BROAD S1, 104.1 database version) would capture scRNAseq reads. Strikingly, less than 50% of reads were retained for downstream scRNAseq analysis (Figure 2.2B; Supplemental file 1). Using such an incomplete annotation for scRNAseq would likely result in erroneous interpretation of gene expression patterns for genes lacking 3’ UTR annotations. Next, we used ScISOr-sequencing data to generate new gene models and tested how well these new models would capture 10X based Illumina scRNAseq reads (Supplemental file 2). Using the ScISOr-Seq dataset, 13,028 of 22,456 previously annotated genes and 2,942 novel genes were identified (Figure 2A). The ScISOr-Seq annotations lead to a notable 26.1% increase Figure 2.2: ScISOr-Seq and bulk Iso-Seq captured a limited number of annotated transcripts, but both resulted in an increase in reads mapping to the transcriptome. Ultimately, pooling this data and merging with the ensembl annotation resulted in the largest gains in reads mapped. A) the number of genes in common and unique to each Iso-Seq read category is compared to the ensembl genome. B) results from cell ranger runs using the different genome annotation files illustrating the overall improvements by using Iso-Seq data particularly the ScISOr-Seq gene models. 26 in reads retained from scRNAseq compared to the Ensembl gene models alone (Figure 2B; Supplemental file 1). The alignment of scRNAseq reads with existing Ensembl and new ScISOr-Seq gene models illustrates that ScISOr-Seq models retain greater numbers of reads due to improved annotation of 3’ UTRs (Figure 2.3; Figure S2.1). This result compares favorably with current research in the much better studied zebrafish model. Farnsworth, Saunders, and Miller (2020) completed their zebrafish atlas with 80% of reads retained in 5dpf (days post fertilization) fish. After (Lawson et al. 2020) improved the zebrafish transcriptome using RNAseq data, they noted a 4% increase in reads retained that lead to an increase of 2,257 cells and 8 clusters using the same 5 dpf atlas (Farnsworth et al. 2020). The similarity in number of retained reads in our dataset with those published in zebrafish improves the validity of stickleback scRNAseq investigations. Figure 2.3: ScISOr-Seq improved gene models by extending the 3’ UTRs illustrated here by the two models for nkx2.3. scRNAseq reads are shown at the top in grey. The existing ensembl model is shown in dark blue. Notably, this model does not overlap with the majority of the scRNAseq reads. The ScISOr-Seq reads (in grey) are shown above the respective gene models (in red) that they generated. Colored lines on the ScISOr-Seq and scRNAseq reads indicate a different base pair in the read than the stickleback reference genome (BROAD S1, 104.1 database version). These ScISOr-Seq gene models capture all of the scRNAseq reads. Ensembl ScISOr-Seq !"!#$"%%%&'( !")))"*%%&'( +%,-..&/01234 A 27 Recently, (Naftaly et al. 2021) published a large bulk analysis of adult stickleback Iso- Seq data from 16 PacBio SMRT cells and 4 tissue types (gonad, brain, pronephros, and liver). We reanalyzed this dataset to test whether we would see comparable improvements in our scRNAseq analysis (Supplemental file 3). Although we identified similar overall annotated and novel genes, the bulk Iso-Seq annotation from adult stickleback captured 11% fewer reads than our embryonic ScISOr-Seq annotation (Figure 2.2A and 2.2B; Supplemental file 1). These results are likely due to differential expression of transcripts between the adult and embryonic Figure 2.4: Annotation improvements using Iso-Seq (bulk Iso-Seq or ScISOr-Seq) data extended genes both 5’ and 3’, with 3’ expansions leading to greater numbers of reads counted. Comparing the median difference in starting or ending position between the Ensembl annotations and pooled Iso-seq annotations shows that bulk Iso-Seq and ScISOr- Seq reads overall lead to earlier starting positions (median=26, a) and extended ending positions (median=94, b). These Iso-Seq differences increased the number of reads counted in the final exon of the gene (median=341, d); however, the alterations led to negligible changes of reads counted in the first exon (median=0, c), which highlights the 30 bias of scRNAseq reads. The x-axis is plotted on a pseudo log scale to account for the negative values. 0 5000 10000 15000 20000 −1 0,0 00 ,00 0 0 Final exon read count difference (Iso−Seq minus Ensembl) on psuedo−log scale 10 ,00 0,0 00 −1 0,0 00 ,00 0 0 10 ,00 0,0 00 −1 0,0 00 ,00 0 − 10 ,00 0 10 ,00 0 0 2500 5000 7500 10000 -2, 00 0,0 00 0 2,0 00 ,00 0 Ending position difference (Iso−Seq minus Ensembl) on psuedo−log scale -2, 00 0 2, 00 0 A B C D 0 2500 5000 7500 10000 0 Starting position difference (Iso−Seq minus Ensembl) on psuedo−log scale C ou nt s -2, 00 0,0 00 2,0 00 ,00 0 -2, 00 0 2 ,00 0 !"#$"%&'()$*$("& +$,-.-"/- 0*1.*$"%&'()$*$(" &+$,-.-"/- 0 5000 10000 15000 20000 -10 ,00 0,0 00 0 1 0,0 00 ,00 0 First exon read count difference (Iso−Seq minus Ensembl) on psuedo−log scale C ou nt s 10 ,00 0 -1 0,0 00 2$"13&!4("&5-1#&6(7"*& +$,-.-"/- 2$.)*&!4("&5-1#&6(7"*& +$,-.-"/- 28 samples and highlight that even for species with previous Iso-Seq libraries, additional sequencing may be needed for scRNAseq analysis in other developmental stages or tissue types. Because both our ScISOr-Seq and the bulk Iso-Seq gtf files improved the number of scRNAseq reads retained, we pooled them to create a comprehensive annotation and also merged this pooled dataset with the Ensembl annotations. We retained all transcripts and united them under a single gene model for scRNAseq analysis (Supplemental file 4). The Iso-Seq generated gene models extended 3’ ends of transcripts and increased reads counted in the final exon (Figure 2.4). This new annotation contained 6,992 genes unique to Ensembl, 15,464 genes containing transcripts from Ensembl and Iso-Seq, and 5,206 genes unique to the pooled Iso-Seq dataset (Figure 2.2A). Since reads are filtered during analysis, the 6,992 genes unique to Ensembl might still be expressed in 70hpf embryos or the adult tissues. For instance, 2,152 Ensembl annotated genes were removed from the pooled Iso-Seq dataset by SQANTI3’s intra- priming and template switching filter. Although pooling and merging added 16 SMRT cells of Iso-Seq data as well as the full suite of Ensembl annotations, this effort only improved the proportion of reads retained by +0.8% beyond the annotation created by just our single ScISOr- Seq sample (Figure 2.2B; Supplemental file 1). Therefore, the gene models originating from a matching ScISOr-Seq library alone could be used for scRNAseq analysis if a system lacks a prior annotation. ScISOr-Seq enhances biological interpretation of cell clusters Using the pooled (ScISOr-Seq and bulk Iso-Seq) annotation that was merged with Ensembl annotation, we proceeded with scRNAseq analysis (Supplemental file 5) to test whether biologically meaningful cell clusters would be created. We clustered cells by their transcriptional profiles into cell identities and analyzed the results with Seurat (Stuart et al. 2019). We identified 30 clusters of cells with 38 principal components (Figure 2.5A). We putatively annotated cell types via distinguishing sets of marker genes based on zebrafish literature (Howe et al. 2013; Wagner et al. 2018; Farnsworth et al. 2020). Although zebrafish and stickleback diverged ~ 229.9 m.y.a. (Howe et al. 2021), we identified similar cell types as observed in existing zebrafish atlases for similar developmental stages as we analyzed in stickleback (Wagner et al. 2018; Farnsworth et al. 2020). Additionally, we compared the expression of sox9a and sox9b in our atlas to published descriptions of stickleback in situ expression at the same stage (Cresko et al. 2003). Supporting our cluster 29 annotations, we observed expression of both SOX genes in roughly the same cell types as defined by in situ hybridization (Figure S2.2A and S2.2B; Cresko et al. 2003). Despite our success using a single SMRT cell for ScISOr-Seq for isoform annotation, we would need to sequence additional SMRT cells to have enough data to correlate cell types and specific isoforms. Since the pooled and merged dataset has higher numbers of cells, genes per cell, and total genes detected than the Ensembl scRNAseq dataset (Figure 2.2B), assessing the degree to which pooled and merged gene models improved the scRNAseq analysis is complex. For instance, a higher overall expression of a gene in a cluster might be the result of different clustering patterns, more cells retained, and/or more reads counted. We compared the raw counts of specific genes, number of cells expressing these genes, and number of clusters where at least Figure 2.5: Using the updated gene models from our merged and pooled dataset, we identified expected cell types in 70hpf stickleback scRNAseq data and illustrated how annotations improvements are relevant for analysis. A) the clustering and cell identities of the scRNAseq data is illustrated on the UMAP. B) frizzled gene family expression compared between the ensembl annotation scRNAseq analysis and the pooled and merged analysis using total raw counts, number of cells expressing each gene, and the number of clusters where over 10% of cells are expressing each gene. C) fzd1 gene model is greatly extended in the pooled and merged annotation (purple) compared to the ensembl model (blue) which allows for much greater counting of reads (grey). D) fzd9b had minor changes in the pooled and merged model (purple), a longer 5’ UTR, in comparison with the ensembl annotation (blue) which led to minimal increases in reads (grey) counted. The scRNAseq read pileups in C and D have colored lines on base pairs that are different from the stickleback reference genome (BROAD S1, 104.1 database version). 0 Brain progenitors 1 Unknown cells 2 Brain 1 3 Neural 1 4 Neural 2 5 Neural 3 6 Somites 1 7 Brain 2 8 Developing muscle 9 Somites 2 10 Retinal progenitors 11 Neural 4 12 Epidermis 13 Neural 5 14 Pharyngeal arch endoderm 15 Developing mesenchyme/cartilage 16 Di!erentiating retinal cells 17 Blood 18 Pharyngeal arch neural crest 19 Skeletal muscle 20 Retina 21 Pigment cells 22 Hematopoietic cells 23 Heart 24 Neural 6 25 Digestive system precursors 26 Endothelial 27 Neural crest 28 Cranial neural crest derived neural cells 29 Putative hatching gland A B C D Genes To tal c ounts (E NS) To tal c ounts (P &M) # ce lls expre ssi ng gene (P &M)# ce lls expre ssi ng gene (E NS) # cl uste rs expre sse d (P&M) # cl uste rs expre sse d (E NS) !"# $%&" !"% #%'& % '!"#$ ()!) $#"( !""( !)%& !# !%!"#%& 103 3238 101 1625 0 9!"#$ 13 736 13 539 0 2!"#% 157335 3164 35 80!"#&' 689 693 420 424 2 1!"#() !"#$%"&&&'() !"#$*"&&&'() ENS P&M +#"##%"!&&'() +#"#*&"&&&'() ENS P&M 0 1 2 34 56 78 9 10 11 12 1314 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 !"#* 132 2327 129 1594 0 8 !"#+ 0 200 0 192 0 0 !"#,- 20 20 20 20 0 0 !"#,' 82 685 77 477 0 1 !"#. 1352 1962 753 3 5591 !"#$ *+, !"#%& !"# 30 10% of cells expressed them (Figure 2.5B). Since Cell Ranger provided a subset of these values on a global scale (Figure 2.2B), we chose to examine changes in gene counts, cell number, and number of clusters with a case study: the frizzled genes, a family of Wnt-pathway signaling molecules expressed in a wide range of tissues (Huang and Klein 2004; Wang et al. 2016). For 10/11 frizzled genes, we observed increases in raw counts and number of cells expressing them in the pooled and merged annotation (Figure 5B). Some genes, however, exhibited dramatic increases (fzd1, fzd2, fzd3b, fzd4, fzd6, fzd7, and fzd8b) while others exhibited minor increases (fzd5, fzd9b, and fzd10). Comparing the gene models of fzd1 and fzd9b, we determined which factors influenced expression detection. Importantly, the 5’ and 3’ UTRs expanded for fzd1 in the merged and pooled annotation, however only the 5’ end of fzd9b was extended (Figure 2.5C and 2.5D). Based on the read alignments (Figure 2.5C and 2.5D), the increase in counted reads for fzd1 was due solely to 3’ UTR changes, while a slight increase in fzd9b reads was due to 5’ end extension. In addition to increased read counts, 7/11 genes were expressed in more cell clusters in the pooled and merged dataset. Two genes, fzd9b and fzd10, were found in one less cluster. We hypothesized that two clusters (containing fzd9b and fzd10) were combined in the merged and pooled dataset relative to Ensembl. Overall, this gene family comparison illustrated the problem of using an annotation with incomplete 3’UTRs. If the Ensembl dataset’s fzd counts were accepted as their true qualitative and quantitative expression, flawed conclusions would have been made regarding the contributions of each gene in the family. Conclusion Incomplete 3’UTR annotations can hinder single cell transcriptional profiling studies. In addition to reducing the overall number of genes included in an analysis, systematic differences in preliminary UTR annotations could lead to significant inferential errors. We illustrated in stickleback fish that a minimal ScISOr-Seq dataset, generated concurrently with scRNAseq data, was capable of dramatic improvements in retained read counts (+26.1%). Additionally, we showed that pooling additional Iso-Seq reads from adult bulk samples and merging with existing Ensembl annotations improved scRNAseq reads retained negligibly beyond models solely from embryonic ScISOr-Seq (+0.8%). Using the improved annotation for scRNAseq permitted identification of cell types and increased the observed expression of numerous genes. Overall, 31 our work illustrates that ScISOr-Seq is a rapid and cost-effective method to annotate genomes of various organisms for scRNAseq and can improve efficacy of biological inferences. 32 References Ali A, Thorgaard GH, Salem M. 2021. PacBio Iso-Seq improves the rainbow trout genome annotation and identifies alternative splicing associated with economically important phenotypes. Front. Genet. 12. Beck EA, Currey MC, Small CM, Cresko WA. 2020. QTL mapping of intestinal neutrophil variation in threespine stickleback reveals possible gene targets connecting intestinal inflammation and systemic health. G3. 10(2):613–622. Beck EA, Healey HM, Small CM, Currey MC, Desvignes T, Cresko WA, Postlethwait JH. 2021 Jul 29. Advancing human disease research with fish evolutionary mutant models. Trends Genet. 38(1):22-44. Beiki H, Liu H, Huang J, Manchanda N, Nonneman D, Smith TPL, Reecy JM, Tuggle CK. 2019. Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data. BMC Genomics. 20(1):1–19. Bell MA, Foster SA. 1994. The evolutionary biology of the threespine. 1st ed. New York: Oxford University Press. Bresciani E, Broadbridge E, Liu PP. 2018. An efficient dissociation protocol for generation of single cell suspension from zebrafish embryos and larvae. MethodsX. 5:1287. Colosimo P, Summers B, Kingsley D, Hosemann K. 2004. A simple and efficient microinjection protocol for making transgenic sticklebacks. Behaviour. 141(11–12):1345–1355. Cresko WA, Amores A, Wilson C, Murphy J, Currey M, Phillips P, Bell MA, Kimmel CB, Postlethwait JH. 2004a. Parallel genetic basis for repeated evolution of armor loss in Alaskan threespine stickleback populations. Proc. Natl. Acad. Sci. U. S. A. 101(16):6050–6055. Cresko WA, Amores A, Wilson C, Murphy J, Currey M, Phillips P, Bell MA, Kimmel CB, Postlethwait JH. 2004b. Parallel genetic basis for repeated evolution of armor loss in Alaskan threespine stickleback populations. Proc. Natl. Acad. Sci. 101(16):6050–6055. Cresko WA, McGuigan KL, Phillips PC, Postlethwait JH. 2007. Studies of threespine stickleback developmental evolution: Progress and promise. Genetica. 129(1):105–126. Cresko WA, Yan YL, Baltrus DA, Amores A, Singer A, Rodríguez-Marí A, Postlethwait JH. 2003. Genome duplication, subfunction partitioning, and lineage divergence: Sox9 in stickleback and zebrafish. Dev. Dyn. 228(3):480–489. 33 Farnsworth DR, Saunders LM, Miller AC. 2020. A single-cell transcriptome atlas for zebrafish development. Dev. Biol. 459(2):100–108. Farrell JA, Wang Y, Riesenfeld SJ, Shekhar K, Regev A, Schier AF. 2018. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science. 360(6392). Fuess LE, den Haan S, Ling F, Weber JN, Steinel NC, Bolnick DI. 2021. Immune gene expression covaries with gut microbiome composition in stickleback. MBio. 12(3). Gardell AM, von Hippel FA, Adams EM, Dillon DM, Petersen AM, Postlethwait JH, Cresko WA, Buck CL. 2017. Exogenous iodide ameliorates perchlorate-induced thyroid phenotypes in threespine stickleback. Gen. Comp. Endocrinol. 243:60–69. Gupta I, Collier PG, Haase B, Mahfouz A, Joglekar A, Floyd T, Koopmans F, Barres B, Smit AB, Sloan SA, et al. 2018. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nat. Biotechnol. 36(12):1197–1202. Hafemeister C, Satija R. 2019. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20(1):296. Hohenlohe PA, Bassham S, Etter PD, Stiffler N, Johnson EA, Cresko WA. 2010. Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet. 6(2). Howe DG, Bradford YM, Conlin T, Eagle AE, Fashena D, Frazer K, Knight J, Mani P, Martin R, Moxon SAT, et al. 2013. ZFIN, the Zebrafish Model Organism Database: increased support for mutants and transgenics. Nucleic Acids Res. 41(D1):D854–D860. Howe KL, Achuthan P, Allen James, Allen Jamie, Alvarez-Jarreta J, Amode MR, Armean IM, Azov AG, Bennett R, Bhai J, et al. 2021. Ensembl 2021. Nucleic Acids Research. 49(D1):D884–D891. Huang H-C, Klein PS. 2004. The Frizzled family: receptors for multiple signal transduction pathways. Genome Biol. 5(7):1–7. Hwang B, Lee JH, Bang D. 2018. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. and Mol. Med. 50(8). Kuo RI, Cheng Y, Zhang R, Brown JWS, Smith J, Archibald AL, Burt DW. 2020. Illuminating the dark side of the human transcriptome with long read transcript sequencing. BMC Genomics. 21(1):1–22. 34 Kuo RI, Tseng E, Eory L, Paton IR, Archibald AL, Burt DW. 2017. Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human. BMC Genomics. 18(1):1–19. Lawson ND, Li R, Shin M, Grosse A, Onur YMS, Stone OA, Kucukural A, Zhu LJ. 2020. An improved zebrafish transcriptome annotation for sensitive and comprehensive detection of cell type-specific genes. eLife. 9:1–76. Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34(18):3094–3100. Luecken MD, Theis FJ. 2019. Current best practices in single‐cell RNA‐seq analysis: a tutorial. Mol. Syst. Biol. 15(6). Miller CT, Beleza S, Pollen AA, Schluter D, Kittles RA, Shriver MD, Kingsley DM. 2007. cis- Regulatory changes in Kit ligand expression and parallel evolution of pigmentation in sticklebacks and humans. Cell. 131(6):1179–1189. Naftaly AS, Pau S, White MA. 2021. Long-read RNA sequencing reveals widespread sex- specific alternative splicing in threespine stickleback fish. Genome Res. 31(8):1486– 1497. Reid K, Bell MA, Veeramah KR. 2021. Threespine stickleback: a model system for evolutionary genomics. Annu Rev Genomics Hum Genet. 22:357–383. Shapiro E, Biezuner T, Linnarsson S. 2013. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat. Rev. Gen. 14(9):618–630. Shapiro MD, Marks ME, Peichel CL, Blackman BK, Nereng KS, Jonsson B, Schluter D, Kingsley DM. 2004. Genetic and developmental basis of evolutionary pelvic reduction in threespine sticklebacks. Nature. 428(6984):717–723. Small CM, Milligan-Myhre K, Bassham S, Guillemin K, Cresko WA. 2017. Host genotype and microbiota contribute asymmetrically to transcriptional variation in the threespine stickleback gut. Genome Biol. Evol. 9(3):504–520. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. 2019. Comprehensive integration of single-cell data. Cell. 177(7):1888-1902.e21. Tardaguila M, Fuente L de la, Marti C, Pereira C, Pardo-Palacios FJ, Risco H del, Ferrell M, Mellado M, Macchietto M, Verheggen K, et al. 2018. SQANTI: extensive 35 characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28(3):396. Tseng E. 2021. cDNA cupcake. Wagner DE, Weinreb C, Collins ZM, Briggs JA, Megason SG, Klein AM. 2018. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science. 360(6392):981–987. Wang Y, Chang H, Rattner A, Nathans J. 2016. Frizzled receptors in development and disease. Curr. Top. Dev. Biol. 117:113. Zheng Y-F, Chen Z-C, Shi Z-X, Hu K-H, Zhong J-Y, Wang C-X, Shi W, Chen Y, Xie S-Q, Luo F, et al. 2020 Jul 28. HIT-scISOseq: high-throughput and high-accuracy single-cell full- length isoform sequencing for corneal epithelium. bioRxiv.:2020.07.27.222349. 36 BRIDGE Chapter II detailed challenges to using single cell RNA sequencing (scRNAseq) in less traditional model organisms. This technique captures reads at the 3’ end of mRNA transcripts, therefore it requires high quality 3’ UTR annotations that are often incomplete in non-model systems. I harnessed single cell ISO-sequencing data to improve 3’ UTR annotations in threespine stickleback. Through lengthening 3’ UTR sequences, I increased the percentage of scRNAseq reads retained by 26.5%. In Chapter III, I demonstrated the utility of these techniques for analyzing the development of highly derived traits in Gulf pipefish. I created a single cell atlas for the Gulf pipefish, using single cell ISO-sequencing to improve gene models and single cell RNA-sequencing to quantify gene expression. Using this atlas, I studied the development of pipefish’s elongate snout and exoskeleton, explored developmental origins of their loss of teeth, and asked whether epidermal cells express genes related to development within the brood pouch. This study provided preliminary evidence that pipefish’s derived traits may originate from changes within developmental gene networks. 37 CHAPTER III SINGLE CELL RNA SEQUENCING PROVIDES CLUES FOR THE DEVELOPMENTAL GENETIC BASIS OF SYNGNATHID FISHES EVOLUTIONARY ADAPTATIONS From Healey, Penn, Small, Bassham, Goyal, Woods, Cresko. 2024. Single Cell RNA Sequencing Provides Clues for the Developmental Genetic Basis of Synganthidae’s Evolutionary Adaptations. eLIFE. The full supplemental material for thesis chapter can be found at https://doi.org/10.7554/eLife.97764.1 This project was conceptualized by William Cresko and I. Sequencing data was generated by Clay Small and me. I completed the scRNAseq analysis. The in situ hybridizations were largely completed by Hayden Penn and myself with assistance from Susan Bassham, Vithika Goyal, and Micah Woods. I took the lead on writing the manuscript. The paper was reviewed and revised by all authors. Introduction Seahorses, pipefishes, and seadragons are extraordinary fishes in the family Syngnathidae with diverse body plans, coloration, and elaborate structures for paternal brooding. The syngnathid clade comprises over 300 diverse species that vary in conservation status, distribution, ecology, and morphology (Leysen et al., 2011; Manning et al., 2019; Schneider, Woltering, et al., 2023; Stiller et al., 2022). Syngnathids have numerous highly altered traits, trait losses, and evolutionary novelties. They have elongated snouts bearing small, toothless jaws (Leysen et al., 2011) specialized for capture of small zooplankton (Van Wassenbergh et al., 2009). Additionally, syngnathids are distinctly protected by a bony dermal armor rather than scales (Jungerson, 1910). Other skeletal differences include a lack of ribs and pelvic fins, and an expansion in the number of vertebrae (Schneider, Woltering, et al., 2023). Finally, syngnathids exhibit male pregnancy, which has involved the evolution of specialized brooding tissues and structures (Whittington & Friesen, 2020). Paternal investment varies among lineages; for example, seadragons tether embryos externally to their tails while seahorses and some pipefishes 38 have enclosed brood pouches proposed to support embryos through nutrient transfer and osmotic regulation (Carcupino, 2002; Melamed et al., 2005; Ripley & Foran, 2006). Despite advances in understanding the ecology and evolution of syngnathid novelties, the developmental genetic basis for these traits is largely unknown. The recent production of high quality syngnathid genome assemblies (Qu et al., 2021; Ramesh et al., 2023; Small et al., 2022; Wolf et al., 2024) provides initial clues for the developmental genetic basis of some evolutionary changes. Studies have found that syngnathids lack several genes with deeply conserved roles in vertebrate development, including pharyngeal arch development (fgf3), tooth development (fgf3, fgf4, and eve1, and most scpp genes), fin development (tbx4), and immune function (MHC pathway components) (Lin et al., 2016; Qu et al., 2021; Small et al., 2016; Small et al., 2022; Zhang et al., 2020). Though these gene losses are highly suggestive of leading to unique changes, exploration of the actual developmental consequences of their losses is needed. To fill this gap in knowledge, we used single cell RNA sequencing (scRNAseq) to investigate how these striking genomic changes have affected the developmental genetic and cellular basis of syngnathids’ evolutionary derived traits. The Gulf pipefish (Syngnathus scovelli) is an attractive model for this study (Figure 3.1A). This species has a high-quality reference genome assembly annotated by NCBI and is amenable to laboratory culture (Anderson & Jones, 2019; Ramesh et al., 2023). Furthermore, species from the Syngnathus genus are used worldwide to address questions about syngnathid evolution in microbial, developmental, histological, transcriptomic, ecotoxicological, and genomic studies (Berglund et al., 1986; Carcupino, 2002; Fuiten & Cresko, 2021; Harada et al., 2022; Harlin-Cognato et al., 2006; Partridge et al., 2007; Ripley & Foran, 2006a; Rose et al., 2023; Roth et al., 2012; Small et al., 2016; Small et al., 2013). In this paper, we focus on a subset of unique traits, including their elongated head, toothlessness, dermal armor, and development of embryos inside the brood pouch. These traits represent the diversity of evolutionary changes observed in the syngnathid clade (highly altered, lost, and novel traits) and hypotheses from studies of model organisms suggest developmental pathways involved in their evolution (Lin et al., 2016b; Roth et al., 2020; Small et al., 2016; Small et al., 2022). scRNAseq atlases are a powerful complement to genomic analyses (Shema et al., 2019; Ton et al., 2020). They can provide crucial insights into types of cells present, genes that distinguish cell types (marker genes), active gene networks, and a means to identify expression of 39 genes of interest within predicted cell types (Farnsworth et al., 2020; Farrell et al., 2018; Williams et al., 2019). Specifically, scRNAseq captures RNA expression profiles from individual cells allowing cell types to be inferred post-hoc. scRNAseq has successfully been applied to syngnathid adult kidneys (Parker et al., 2022), but there are no published syngnathid developmental atlases. Here we report the first developmental scRNAseq atlas for syngnathids from late embryogenesis staged Gulf pipefish. We delineate the overall structure of this atlas, which describes 38 cell clusters composed of 35,785 cells, and use these data to make inferences about the morphological evolution of several syngnathid innovations. In addition to inferring present cell types and their underlying genetic networks, we detail in situ spatial expression patterns of select marker and other candidate genes in pipefish embryos and juveniles. We found conserved signaling pathways expressed during craniofacial development but did not detect evidence of tooth primordia. The embryonic dermis and epidermis, respectively, expressed genes for the dermal armor development (bone development pathways) and genes potentially involved in interaction with the male brood pouch (e.g., nutrient acquisition genes). Overall, this atlas provides a deeper understanding of the development of Gulf pipefish and identifies gene candidates for understanding the development of syngnathid evolutionary innovations. In 40 addition to these discoveries, this atlas provides a significant resource for researchers studying syngnathid evolution and development. Materials and Methods Single Cell RNA Sequencing Libraries Preparation We created scRNAseq atlases from embryos of wild caught Gulf pipefish (Syngnathus scovelli, acquired from our collaborator Emily Rose using collection permit SAL-21-0182-E), and all work was performed under an approved IACUC protocol. We harvested 20 embryos per pouch from two wild caught male pipefish. The embryos were at a stage before the tubular face was fully elongated, and while the head skeleton is cartilaginous with minimal signs of mineralization of superficial intramembranous bones. This corresponds to a stage termed “frontal jaws” in a recent description of pipefish development (Sommer et al., 2012). We dissociated the embryos using 460ul of .25% trypsin in water and 40ul 100 mg/mL collagenase I (Sigma C0130-200mg) for 16 minutes. We filtered cells using a 40uM cell strainer Figure 3.1: Gulf pipefish exemplify syngnathid derived traits. Gulf pipefish have elongate snouts, have lost teeth on their oral and pharyngeal jaws, possess dermal armor, and have brood pouches in males (A-B). Cartilage (alcian) and bone (alizarin) stained clutch siblings of embryos from the two scRNAseq samples are shown in panels C-H. Embryos have cartilaginous craniofacial skeletons (panels B, C, F; E marks the (Mes)ethmoid cartilage, C indicates the Ceratohyal, H shows the Hyosymplectic cartilage, M marks the Meckel's cartilage, Q indicates the quadrate, B shows the Basihyal, and P marks the palatoquadrate) with the onset of ossification in the jaw. They have cartilaginous fin radials in the dorsal fin (D and G; F indicates fin radials and N denotes the notochord). The embryos do not have signs of ossification in the trunk where the exoskeleton will form later (panels E and H; PD shows the region where dermal armor primordia will arise). Panels C, D, F, and G, scale bar 200µm; E, H scale bar 100 µm. A B C D E F G E E M M C C H H Elongated ethmoid region No teeth Dermal armor Male brooding N F N PD F N PD N E H C B Q M P H P 41 (Thomas Scientific #1181X52). We quantified cell concentrations using the TC20 Automated Cell Counter (Biorad) then diluted the samples to 800 cells/ul in 4% BSA in PBS. The University of Oregon Genomics and Cell Characterization Core (GC3F; https://gc3f.uoregon.edu) prepared single cell libraries for each sample using 10X Genomics Single Cell 3’ Genome Expression mRNAseq kit with NextGEM v3.1 chemistry. We sequenced these libraries on an S4 lane on the NovaSeq 6000 at the GC3F. To improve the 3’ UTR genome annotations, we also prepared scISOrSeq libraries from the first embryonic sample and from dissociated pouch cells from pregnant and nonpregnant males. These libraries were produced in accordance with (Healey et al., 2022). Embryonic, pregnant pouch and non-pregnant pouch libraries were sequenced separately on PacBio Sequel II - SMRT Cells 8M. To turn the scISOrSeq reads into gene models, we followed the pipeline from Healey et al. (2022). We ran our custom script (scISOr_Seq_processing.py) to remove barcodes, identify cell barcodes, and demultiplex with the single cell flag and appropriate barcodes (5’ CCCATGTACTCTGCGTTGATACCACTGCT and 3’ CTACACGACGCTCTTCCGATCT). We aligned the reads to the 2022 Gulf pipefish genome GenBank: GCA_024217435.2) using minimap v2.9 (cite: Li 2018). We filtered the reads using cDNA cupcake to remove duplicate transcripts (Tseng, 2021). We used SQANTI3 to identify gene models and filter them (Tardaguila et al., 2018). We merged the SQANTI3 annotations with the Gulf pipefish genome assembly (NCBI: ROL_Ssco_1.1) using TAMA merge (Kuo et al., 2017). Since the Gulf pipefish genome assembly does not contain mitochondrial genes, we appended the genome files with the Gulf pipefish mitochondrial genome (NCBI RefSeq: NC_065499.1). Single Cell Atlas Construction We ran Cell Ranger (10X Genomics 3.0.2) using our scRNAseq reads, the Gulf pipefish genome assembly with mitochondrial genome, and the modified gene annotations. Cell Ranger estimated 20,733 cells for sample one, 23,682 genes expressed, and 21,039 mean reads per cell. For sample two, Cell Ranger predicted 17,626 cells, 23,740 genes expressed, and 29,804 mean reads per cell. We analyzed Cell Ranger’s output using Seurat (v4.1.0) on R (v4.0.2; Butler et al., 2018; Hafemeister & Satija, 2019). 42 To remove extraneous RNA counts from the dataset, we used SoupX (v1.5.2; Young & Behjati, 2020). We identified doublet scores for our dataset using scrublet (v0.2.3). The doublet removal step reduced the first sample by 114 cells (from 20,733 cells to 20,619 cells) and second sample by 167 cells (from 17,626 cells to 17,459 cells). We finally removed cells with less than 500 features, greater than 9000 features, greater than 1E5 RNA counts, with a scrublet score greater than the detected threshold (.76 for sample 2 and .21 for sample 2), or greater than 10% mitochondrial reads. The second filtering step removed 727 cells from sample one (20,619 cells to 19,892 cells) and 1,566 cells from sample two (from 17,459 cells to 15,893 cells). We normalized the datasets with SCTransform (v0.3.3). We used Seurat’s integration tools, SelectIntegrationFeatures using 3,000 feature genes, FindIntegrationAnchors using SCT normalization, and IntegrateData using SCT normalization, to integrate the two datasets. After integration, our combined atlas had 35,785 cells (Supplementary File 1,2). We then used the integrated dataset to complete PCA analysis. We tested using a variety of principle components for further analysis and chose 30 PCs for our analysis based on the clear delineation of major cell types. We next clustered the cells using 30 PCs and plotted the data on a UMAP with Seurat. Single Cell Atlas Cluster Identification To identify cluster identities, we used the RNA assay of the scRNAseq data to find cluster markers with Seurat’s FindAllMarkers command with the parameters only.pos = TRUE and logfc.threshold = 0.25, requiring markers to be upregulated in the cluster and have a log fold change of at least .25. We found a second set of cluster markers through our custom function which searched through all genes and identified genes uniquely expressed in greater than 60% of cells in the cluster and in less than 10% of cells in every other cluster using Seurat’s DotPlots. We searched for our identified markers in available zebrafish datasets (Fabian et al., 2022; Farnsworth et al., 2020; Lange et al., 2023), ZFIN (Howe et al., 2013), NCBI, medlineplus.gov, and genecards to give the clusters initial annotations. For each cluster, we examined multiple genes using DotPlots and FeaturePlots to propose the cluster identity. Next, we identified one gene for each cluster which marked the cluster best (expressed in the most cells in the focal cluster and expressed in as few of the other clusters as possible) through consulting the two marker gene lists and examining markers with Dot Plots (Supplementary File 43 3). Using these markers, we completed a set of in situ hybridizations to hone our cluster annotations. Due to challenges culturing Gulf pipefish, we used both embryos and larvae from Syngnathus leptorhynchus, a local pipefish from the same genus, and from cultured Gulf pipefish for the cluster annotation in situ experiments. We caught pregnant male Syngnathus leptorhynchus using a beach seine near Coos Bay, Oregon under Oregon Department of Fisheries and Wildlife permit number 26987. Syngnathus scovelli used for in situ experiments were purchased from Alyssa’s Seahorse Saavy and Gulf Specimens Marine Lab then reared in our facility at 25 degrees C water and 25-28 PPT Salinity. Since Syngnathus leptorhynchus does not have a published genome assembly, we designed probes using NCBI Primer Blast with the Gulf pipefish genome and produced these probes using Gulf pipefish embryonic cDNA pools. To create the probes, we completed two rounds of PCR. The first round used a gene specific forward primer with a reverse gene specific primer that had 10 nucleotides of T7 promotor sequence attached. PCR products were cleaned with Zymo clean and concentrator DNA kit and eluted in 15 ul of elution buffer. Round two of PCR used the same gene specific forward primer with a modified T7 promoter sequence (TGGACTAATACGACTCACTATAGGG) as the reverse primer, and finally the product was cleaned again with Zymo clean and concentrator DNA kit and eluted in 15 ul of elution buffer. Round one PCR conditions are as followed: 95 degrees Celsius for 3:00 minutes, 40 cycles of denaturation (95 degrees for 30 seconds), annealing (30 seconds, annealing temperature varied by probe), and extension (72 degrees for 1:00 minute), and a final extension step of 3:00 minutes. Round two PCR conditions are as followed: 98 degrees Celsius for 30 seconds, 30 degrees for 10 seconds, 72 degrees for 50 seconds, then 35 rounds of denaturation (98 degrees for 10 seconds), annealing (50 degrees for 10 seconds), and extension (72 degrees for 50 seconds), and a final extension step of 10 minutes at 72 degrees. For round two PCRs with multiple bands (specifically, ifitm5), the band of the expected size was excised with a razorblade and DNA was extracted with Zymoclean gel DNA recovery kit. The round two PCR product was Sanger sequenced to confirm identity. All the marker genes primers and successful PCR conditions are in Supplementary File 4. The probes were transcribed with T7 polymerase for 2-6 hours then cleaned with Zymo RNA clean and concentrator and eluted into 30 ul of water. For the in situ hybridizations, we selected embryos and newly spawned larvae close to the developmental stage used in the atlas. We completed in situ 44 hybridizations in keeping with Thisse & Thisse, 2007, leaving the embryos in stain until background was observed. For spawned larvae, we completed a bleaching step (1% H2O2 and .5% KOH for 8 minutes) prior to the proteinase K digest. After imaging, we used the levels tool in Adobe Photoshop (v23.4.2) to white balance the pictures. Differentiation State Analysis To assess whether proposed primordial cell clusters were composed of undifferentiated cells relative to other clusters from their lineages, we completed a differentiation analysis using CytoTRACE (v0.3.3). Clusters from similar lineages (neural: 0, 3, 7, 8, 12, 22, 25, 33, and 35; muscle: 2, 10, 17, and 37; connective: 4, 5, 6, 9, 15, 16, 18, 24, 27, 28, and 29) were isolated using subset. Cell counts were gathered using as.Matrix(GetAssayData), then CytoTRACE was run on these counts. Data was plotted on a VlnPlot to show the variability. Single Cell KEGG Analysis To identify pathways upregulated in cell clusters, we completed a KEGG analysis. We downloaded gulf pipefish KEGG pathways (https://www.kegg.jp). For the KEGG analysis, we used the marker genes identified from our FindAllMarkers list as the input genes. We converted these gene ids using keggConv to KEGG ids. We used a Wilcoxon enrichment test to ask whether cluster marker genes were enriched for each KEGG pathway. Single Cell Atlas Network Analysis To identify genetic networks present in our atlas, we completed a weighted gene network correlation analysis using WGCNA (v1.72-1,Langfelder & Horvath, 2008). We selected 3,000 variable features from the integrated assay of the single cell dataset for the WGCNA. We created an adjacency matrix from the data using bicor with a maxPOutliers of .05. To decide on a beta value or the soft threshold power, we created an adjacency matrix plot using pickSoftThreshold and picked the threshold where the scale free topology model fit leveled off. We selected the value of two. Then, we raised the adjacency matrix to the power of two. We calculated the dissimilarity matrix by calculating TOM similarity of the adjacency matrix and subtracting it from one. We then created a gene tree through hclust of the dissimilarity matrix with the average method. From the gene tree, we created modules using 45 cutreeDynamic with deepSplit of two and the minClusterSize of 15 (setting the smallest cluster size to 15). We calculated module eigengenes using moduleEigengenes of the adjacency matrix then calculated the dissimilarity of the module eigengenes with cor of the module eigengenes subtracted from one and clustered the module eigengenes with hclust. We then chose a dissimilarity of 35% as the cut off for merging modules; however, no modules had dissimilarity scores with each other below 40%. Since the module eigengenes are calculated for each cell in the dataset, we used these values to calculate the t-statistic for each cell cluster (considered each sample) in the module (all of the module eigengene scores were used as the population). We next tested the hypothesis that certain cell clusters were strongly associated with specific gene modules through a two-way permutation test (1000 permutations) and corrected p- values to control the False Discovery Rate (FDR). Additionally, we tested whether specific cell clusters drove underlying gene network structure by measuring network connectivity. The network connectivity was first calculated using the entire dataset, then each cell cluster was progressively dropped from the dataset and the connectivity was remeasured. This resulted in changes in connectivity scores for each module-cluster pair. To assess whether these changes were significant, we completed 1000 permutations whereby cells were randomly dropped (the number of cells dropped was equal to the cell cluster size of the focal cluster), connectivity measured, and the change in connectivity recorded. The p-value is the number of instances where the change in connectivity is greater in the permutations than in the focal cell cluster run. P- values were corrected to control the FDR. Using the module gene lists, we identified the number of genes from each module that were found in each KEGG pathway. Since the KEGG modules do not have p-values associated with the genes, we could not complete a Wilcoxon enrichment test. We instead removed any pathways where there were fewer than three genes present for any pathway and note that there is no statistical test run on these KEGG results. We visualized the networks using Cytoscape (v3.10.0). In situ Hybridization For genes chosen for further follow up analysis, we completed in situ hybridizations in Gulf pipefish (Syngnathus scovelli) or bay pipefish (Syngnathus leptorhynchus). Primer sequences were designed using NCBI Primer Blast with the Gulf pipefish genome assembly and 46 synthesized using Gulf pipefish embryonic cDNA. The sfrp1a probe was synthesized using the PCR based probe preparation protocol described in the Single Cell Identification section. All other probes (bmp4, dlx2a, dlx3b, fgf22, lhx6a, pitx2, and scpp1) were prepared using TOPO cloning. Probe primer sequences as well as the species used for the in situ experiments are described in supplementary file In_Situ_Probe_and_Samples_Used.xlsx. Pregnant male bay pipefish were caught as described above. For Gulf pipefish embryos, we allowed the fish to mate in our facility then harvested embryos once they reached the appropriate stages. Fish were reared in 25 degrees C water with 25-28 ppt salinity. Embryos and larvae were fixed in 4% PFA, dehydrated through a series of PBT/MeOH washes, and stored in MeOH at -20C. 5-12 embryos were used for each probe. We completed in situ hybridizations in keeping with Thisse & Thisse, 2007, leaving the embryos in stain until background was observed. After imaging, we used the levels tool in Adobe Photoshop (v23.4.2) to white balance the pictures. Bone and Cartilage Staining We used alcian and alizarin stains to mark cartilage and bones. Specifically, we assayed cartilage and bone development in siblings of the scRNAseq samples. These embryos were fixed in 4% PFA then stored at -20C in MeOH. We followed the protocol from Walker and Kimmel (2007) with minor alterations. We stored samples in 50% glycerol/0.1% KOH at 4C and imaged them in 100% glycerol. After imaging, we white balanced the photographs using Photoshop (v23.4.2) levels tool. Gene Cluster Analysis To examine close syngnathid outgroups, we downloaded the 2023 mandarin dragonet genome assembly from NCBI (GenBank assembly accession: GCA_027744825.1) and used an unpublished highly contiguous cornetfish genome assembly. Since these genome assemblies were unannotated, we manually identified scpp genes. We searched for scpp genes using BLASTN with medaka and additional fish sequences as the query. We also searched for these genes with mVISTA plots across conserved gene synteny regions (LAGAN alignment using translated anchoring) with medaka as the focal species. To identify scpp genes in additional species, we gathered cluster information from NCBI and ensembl. Additionally, we used mVISTA plots to further search for unannotated scpp genes. 47 Results Valuable scRNAseq atlas for studying syngnathid development We produced the first developmental scRNAseq atlas for a syngnathid from two samples comprising 20 similarly staged embryos from pregnant, wild-caught Gulf pipefish (Syngnathus scovelli) males. The samples represent a late organogenesis developmental stage (Figure 1B-H). These embryos had a primarily cartilaginous skeleton with minimal mineralization, including jaw cartilages that were at the onset of mineralization and ethmoid elongation. The embryos also possessed cartilaginous dorsal fin pterygiophores but had no signs of dermal armor mineralization. This stage is referred to as “frontal jaws” in syngnathid literature (Sommer et al., 2012). 48 The atlas included 35,785 cells (19,892 and 15,893 cells from each sample; S3.1:S3.2), which formed 38 cell clusters (Figure 3.2A, Supplementary File 2). We classified cells into 4 different broad tissue types – epithelial, connective, neural, and muscle – using Seurat identified marker genes and published model organism resources (S3.3:S3.57). We next used Seurat identified marker genes to pinpoint single marker genes that were most unique to each cluster (Figure S3.58, Supplementary File 3). We completed in situ hybridization using Gulf and bay pipefish embryos for cell clusters for which examining gene expression would help hone and Figure 3.2: Gulf pipefish single cell atlas contains cells from the entire embryo and identifies genetic pathways active in different cell types. The UMAP plot (panel A) shows all of the cell clusters and their identities reduced to the first two UMAP dimensions. The graph in panel B displays results of the KEGG pathway analysis in cell clusters identified as connective tissue (excluding blood, pigment, digestive, and immune cells). The number of Seurat identified marker genes for each cluster that was a part of each pathway is displayed on the y axis. Bars are colored and labeled by cell cluster. ! " #$ _% −&' −&( −' ( ' &( −&( −' ( ' &( &' !"#$_& ! " #$ % ( & % ) * ' + , - . &( && &% &) &* &' &+ &, &- &. %( %& %% %) %* %' %+ %, %- %. )( )& )% )) )* )' )+ ), (/01234567809263597 &/0:5992;<=4280>?55@ %/0"67;?2 )/01234567809263597 +/0:5992;<=428057<25;A59@350B2729;ACB2 ,/0123456780>3D=909263597 -/0123456780E577=>?20>3D=909263590E35F29=<537 ./0:5992;<=4280<29@590D9@0?=FDB29<0;2??7 &(/0"67;?280E577=>?20E35F29=<537 &&/0!9G95H9 &%/01234567809263597 &*/0:5992;<=428029@5?D7<7 &./0:5992;<=4280>?55@ %(/0:5992;<=4280E=FB29< %&/0:5992;<=4280=BB692 %%/0123456780F?=D %)/0IE=?D7<7 %-/0:5992;<=4280J9 %./0:5992;<=4280B2729;ACB2 )(/0:5992;<=4280=BB692 )&/012345678092635970K32<=9DL )%/0:5992;<=4280>?55@ ))/0123456780AD=30;2??7 )'/01234567809263597 )+/0"67;?2 ),/0"67;?2 )& . &' &+ %+ &* && %- &-%, ' %* + %. &) * )% &. & %& )( )+ %( %% &( % &, %) ),)* &% -%' )) , ) ( )' '/0:5992;<=428057<25;A59@350B2729;ACB2 */0:5992;<=428057<25;A59@350B2729;ACB2 &)/0:5992;<=4280J>35>?D7<7 &'/0:5992;<=428095<5;A53@ )*/0:5992;<=4280F6<0;2??7 A B 14 14 16 5 14 27 28 14 28 16 16 13 14 18 2627 28 29 16 13 14 2628 29 14 16 14 26 28 28 5 14 13 14 16 27 14 26 29 28 26 14 16 27 28 13 14 14 14 13 14 0 30 60 90 ABC tra nsp orte rs Adherens ju ncti on ATPdependent c hromatin re modelin g Calci um sig nalin g pathway Cell a dhesio n m olecu les Cell c ycle Cellu lar s enesce nce Cytokine-cy tokine re ce ptor in teracti on DNA re plic atio n ECMrece ptor in teracti on Endocy tosis Fa nco ni a nemia pathway Foca l a dhesio n Hedgehog sig nalin g pathway MAPK sig nalin g pathway Melanogenesis Motor p roteins Neuroacti ve lig andrece ptor in teracti on Phagoso me PPA R sig nalin g pathway Regulatio n of a cti n cy toskeleton TGFbeta sig nalin g pathway Tight ju ncti on Vascu lar s mooth m uscl e co ntra cti on Wnt s ignalin g pathway An no ta te d G en es in P at hw ay factor(cluster) 5 13 14 16 18 26 27 28 29 49 validate cluster annotations (Figure S3.2, S3.59:S3.71). In total, our atlas contained 13,027 connective tissue cells (excluding cells from blood, immune, and the digestive system) from 14 clusters, 10,112 nervous system cells from ten clusters, 4,363 muscle cells from five clusters, 4,133 blood cells from three clusters, 650 immune cells from two clusters, 432 pigment cells from one cluster, 370 epidermal cells from one cluster, and 137 gut cells from one cluster. Within the connective tissue cell types, we also identified cartilage (302 cells), developing bone (442 cells), fins (253 cells), and notochord (693 cells). The number of recovered cells per identity may not necessarily represent organismal cellular proportions because of potential variability in dissociation success for different cell types (Denisenko et al., 2020; Venema et al., 2022). Discovery of cell cluster function and state using KEGG analysis To affirm identities and discover potential properties of each cluster, we completed a KEGG pathway analysis for each cluster using Seurat’s marker genes (Figure S3.4, Figure 3.2B). For eight of the clusters (1, 4, 6, 9, 11, 15, 19, and 24), we did not find any significantly enriched pathways, possibly due to similar gene expression profiles across cell types that reduced the number of identified markers. However, we found one or more significantly enriched pathways for the other 29 cell clusters. We observed enriched pathway terms that supported cluster annotations. For example, phototransduction in the retina cluster, melanogenesis in the pigment cluster, cardiac muscle contraction in muscle clusters, and neuroactive ligand receptor interaction in neuronal clusters. The inferred KEGG pathways demonstrated some commonalities across the different tissue types, including in signaling pathways and cell states. Notably, our identified KEGG terms delineated progenitor and differentiated cell clusters. Based on their KEGG terms, we classified clusters 8, 10, and 16 as possible neural, muscle, and connective tissue progenitor cells, respectively. We also detected expression of pax3a and pax3b, muscle primordia markers, in cluster 10, supporting this annotation. These clusters had enriched KEGG terms associated with cell division (‘cell cycle’, ‘DNA replication’, ‘nucleotide excision repair’, and ‘homologous recombination’), and lacked enrichment for KEGG pathways present with differentiated cell types of their lineage. Specifically, cluster 8 lacked the neural KEGG term ‘neuroactive ligand receptor 50 interaction’, cluster 10 lacked muscle KEGG terms ‘adrenergic signaling in cardiomyocytes’, ‘calcium signaling pathways’, and ‘cardiac muscle contraction’, and cluster 16 lacked connective Figure 2.3: Weighted Gene Network Analysis (WGCNA) identifies gene modules that define and unite cell clusters. A) The strength of association between the gene modules and cell clusters is shown in panel A with dendrogram clustering illustrating the distance between modules and cell clusters. Gene modules are represented by rows and cell clusters by columns. The modules and clusters are clustered using the Pearson distance method. The number of genes in each gene module are shown in the right hand bar plots. Cell clusters are colored based on their identity. The asterisks indicate the module-cluster relationships that have a p-value less than .05 from a two-sided permutation test after correction for multiple tests (FDR). T