Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning

by
Steven Walton

A dissertation accepted and approved in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Dissertation Committee:
Hank Childs, Chair
Humphrey Shi, Core Member
Daniel Lowd, Core Member
Thien Nguyen, Core Member
Edward Rubin, Institutional Representative

University of Oregon
Summer 2025

© 2025 Steven Walton
This work, including text and images of this document but not including supplemental files (for example, not including software code and data), is licensed under a Creative Commons Attribution 4.0 International License.
http://creativecommons.org/licenses/by/4.0/

DISSERTATION ABSTRACT

Steven Walton
Doctor of Philosophy in Computer Science
Title: Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning

Major advancements in the capabilities of computer vision models have been primarily fueled by rapid expansion of datasets, model parameters, and computational budgets, leading to ever-increasing demands on computational infrastructure. However, as these models are deployed in increasingly diverse and resource-constrained environments, there is a pressing need for architectures that can deliver high performance while requiring fewer computational resources. This dissertation focuses on architectural principles through which models can achieve increased performance while reducing their computational demands. We discuss strides towards this goal through three directions. First, we focus on data ingress and egress, investigating how information may be passed into and retrieved from our core neural processing units. This ensures that our models make the most of available data, allowing smaller architectures to become more performant. Second, we investigate modifications to the core neural architecture, applied to restricted attention in vision transformers.
This section explores how removing uniform context windows in restricted attention increases the expressivity of the underlying neural architecture. Third, we explore the natural structures of Normalizing Flows and how we can leverage these properties to better distill model knowledge. These contributions demonstrate that careful design of neural architectures can increase the efficiency of machine learning algorithms, allowing them to become smaller, faster, and cheaper. This dissertation includes previously published and unpublished co-authored material.

CURRICULUM VITAE

NAME OF AUTHOR: Steven Walton

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene, OR, USA
Embry-Riddle Aeronautical University, Prescott, AZ, USA

DEGREES AWARDED:
Doctor of Philosophy in Computer Science, 2025, University of Oregon
Master of Science in Computer Science, 2023, University of Oregon
Bachelor of Science in Space Physics, 2014, Embry-Riddle Aeronautical University

AREAS OF SPECIAL INTEREST:
Computer Vision
Machine Learning
Artificial Intelligence
Generative Modeling

PROFESSIONAL EXPERIENCE:
Graduate Researcher, University of Oregon, Eugene, OR, Aug. 2018 - Jun. 2025
Metropolis Intern, Nvidia, Sep. 2023 - Mar. 2024
Ph.D. Research Intern, Picsart AI Research, Eugene, OR, Jun. 2021 - Nov. 2022
Computer Science Intern, Lawrence Livermore National Laboratory, Livermore, CA, Jun. - Sept. 2020
Computer Science Intern, Lawrence Livermore National Laboratory, Livermore, CA, Jun. - Sept. 2019
ASTRO Intern, Oak Ridge National Laboratory, Oak Ridge, TN, Jun. - Aug. 2018

GRANTS, AWARDS AND HONORS:
Outstanding Reviewer, CVPR 2025

PUBLICATIONS:
Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Efficient image generation with variadic attention heads. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025

Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, and Humphrey Shi.
Distilling normalizing flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025.

Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, and Humphrey Shi. Generalized neighborhood attention: Multi-dimensional sparse attention at the speed of light, 2025. arXiv:2504.16922

Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexandru Coca, Mikah Dang, Sebastian Dziadzio, Jakob D. Kunz, Kaiqu Liang, Alexander Lo, Brian Pulfer, Steven Walton, Charig Yang, Kai Han, and Samuel Albanie. Zerobench: An impossible visual benchmark for contemporary large multimodal models, 2025. arXiv:2502.09696

Noble Kennamer, Steven Walton, and Alexander Ihler. Design amortization for bayesian optimal experimental design. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8220–8227, 2023.

Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6185–6194, 2023.

Steven Walton. Isomorphism, normalizing flows, and density estimation: Preserving relationships between data, 2022. https://www.cs.uoregon.edu/Reports/AREA-202307-Walton.pdf

Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li, Steven Walton, and Humphrey Shi. Semask: Semantically masked transformers for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 752–761, 2023.
Jiachen Li, Ali Hassani, Steven Walton, and Humphrey Shi. Convmlp: Hierarchical convolutional mlps for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 6307–6316, 2023.

David Pugmire, James Kress, Jieyang Chen, Hank Childs, Jong Choi, Dmitry Ganyushin, Berk Geveci, Mark Kim, Scott Klasky, Xin Liang, Jeremy Logan, Nicole Marsaglia, Kshitij Mehta, Norbert Podhorszki, Caitlin Ross, Eric Suchyta, Nick Thompson, Steven Walton, Lipeng Wan, Matthew Wolf, Jeffrey Nichols, Becky Verastegui, Arthur ‘Barney’ Maccabe, Oscar Hernandez, Suzanne Parete-Koon, and Theresa Ahearn. Visualization as a Service for Scientific Data. In Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI, pages 157–174, Cham, 2020. Springer International Publishing.

Steven Walton, Ali Hassani, Abulikemu Abuduweili, and Humphrey Shi. Training compact transformers from scratch in 30 minutes with pytorch. medium.com/pytorch, 2021. arXiv:2104.05704

Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, and Humphrey Shi. Escaping the big data paradigm with compact transformers, 2022.

Steven Walton. Datum: Dotted attention temporal upscaling method. 2020. https://www.cs.uoregon.edu/Reports/DRP-202006-Walton.pdf

ACKNOWLEDGEMENTS

I’d like to thank my mentors and professors from my Universities for helping get me to where I am today. Thank you, Jeff Spear, for being the first to show me how to be creative with math. To Karla Westphal, for helping me find passion and dedication to the subject. To my undergraduate professors: Timothy Callahan, Andri Gretarsson, Edward Poon, Hisaya Tsutsui, and Darrel Smith, who taught me a passion for math and physics and provided me the tools to understand the world around me. To my graduate professors and advisors, who helped get me through these difficult times.
I especially want to thank Hank Childs for encouraging me to pursue Machine Learning and for being my acting advisor after Humphrey moved to Georgia Tech. I want to thank Humphrey Shi for being my advisor, helping me make all the connections, and pushing me to become a better researcher.

I’d like to thank my friends and family for helping me get through this. It was a journey that I could not have made alone. Noble, you’ve been a close friend for so many years and your insights helped shape my research and encouraged me to go to graduate school. You constantly challenge my ideas, often frustratingly so, but they always end up better and more refined for it. Never change. Ali, I couldn’t ask for a better co-author nor friend. Your intelligence and work ethic have always pushed me to better myself, and I look forward to calling you “doctor”. I want to thank my cat Hypatia, who has been my best friend for the last decade. She’s had to listen to many explanations, and I’m sorry she has not received formal recognition for her contributions despite frequent appearances in my works (including this one). Lastly, I want to thank my wonderful girlfriend: Jaichung Lee. We have been through so much and I could not have crossed the finish line without you. I know it was as much of a challenge for you as it was for me, and this PhD would not have been possible without your many efforts. Thank you.

DEDICATED TO
My mom, and the many years of watching Star Trek together.
My dad, and the many years of reading Asimov together.
Jaichung, and the many years to build the future together.

TABLE OF CONTENTS

I. INTRODUCTION
  1.1. Motivation
  1.2. Research Goals and Approaches
  1.3. Dissertation Outline
  1.4. Co-Authored Material
II. BACKGROUND
  2.1. Learned Data Mappings
  2.2. Scale Is Not All You Need
    2.2.1. Scaling Data
    2.2.2. Model Size
  2.3. The Foundations That Shape Us
    2.3.1. Transformers
    2.3.2. Adversarial Generation
    2.3.3. Normalizing Flows
  2.4. The Tyranny of Measurements
III. ESCAPING THE BIG DATA PARADIGM
  3.1. Vision Transformers
  3.2. Data Efficient Vision Transformers
    3.2.1. Convolutional Tokenizer
    3.2.2. SeqPool
  3.3. Experiments
    3.3.1. Datasets
    3.3.2. Computational Resources
    3.3.3. Hyperparameters
    3.3.4. Transformers On Small Datasets
    3.3.5. Ablations
    3.3.6. Scaling Study
    3.3.7. Natural Language Processing
  3.4. Conclusion
IV. VARIADIC NEIGHBORHOOD ATTENTION
  4.1. Localized Attention
  4.2. Neighborhood Attention
  4.3. Variadic Attention Heads
  4.4. Generating The Right Experiment
    4.4.1. Datasets
    4.4.2. Hyperparameters
  4.5. When Faced With Sparse Attention
  4.6. A Bump While Headed To Church
  4.7. Metrics Are Not Enough
    4.7.1. The Face Says It All
    4.7.2. Quick Training on Deep Fake Detection
    4.7.3. Fingerprints
      4.7.3.1. StyleGAN
      4.7.3.2. StyleSwin
      4.7.3.3. StyleNAT
    4.7.4. Attention To Details
V. DISTILLATION OF INVERTIBLE NETWORKS
  5.1. Model Distillation
  5.2. Distilling Normalizing Flows
    5.2.1. Categories of Flow Distillations
      5.2.1.1. Latent Knowledge Distillation
      5.2.1.2. Intermediate Latent Knowledge Distillation
      5.2.1.3. Synthesized Knowledge Distillation
      5.2.1.4. All Together
  5.3. Distillation Experiments
    5.3.1. Density Estimation
    5.3.2. Image Generation
  5.4. Conclusion
VI. CONCLUSION AND FUTURE DIRECTIONS
  6.1. Summary
  6.2. Future Directions
    6.2.1. Core Challenges
    6.2.2. Scaling
    6.2.3. Ingress and Egress of Data
      6.2.3.1. Parameterization
      6.2.3.2. Automated Preprocessing
      6.2.3.3. Making The Most of it
    6.2.4. Core Processing Architectures
      6.2.4.1. Flexible Learning
      6.2.4.2. Is Beauty in the Eye of the Beholder?
    6.2.5. Structurally Aware Architectures
  6.3. Conclusion

LIST OF FIGURES

1. Domain, Range, Image, Preimage diagram
2. Vision Transformer Architecture
3. Taxonomy of Generative Models
4. Diagram of Injection, Surjection, and Bijection
5. Architectural design of Compact Transformers
6. Variations of ViT Architectures
7. Vision Transformer Salient Maps
8. Accuracy of ViTs on Restricted Samples per Class
9. Vision Transformer Resolution Based Performance
10. StyleNAT Samples (FFHQ-256, FFHQ-1024, LSUN Church)
11. NAT vs Swin Vs ConvNext ImageNet Performance
12. Neighborhood Attention Transformer (NAT)
13. StyleNAT Architecture
14. StyleNAT: FID vs. Throughput vs. Parameters
15. StyleNAT Samples: FFHQ & LSUN Church
16. StyleNAT FID vs Iteration
17. StyleGAN3 Visual Artifacts
18. StyleSwin Visual Artifacts
19. StyleNAT Visual Artifacts
20. StyleNAT and StyleSwin Attention Maps
21. StyleNAT and StyleSwin Attention Maps
22. Illustration of Knowledge Transfer for Normalizing Flows
23. Distilling Normalizing Flow CIFAR-10 Samples
24. Distilling Normalizing Flow CelebA Samples

LIST OF TABLES

1. Terminology of Mathematical Sets
2. CCT Hyperparameters
3. CCT Main Results
4. Extended CCT Training
5. CCT Ablation Study
6. CCT Positional Embedding Comparison
7. CCT ImageNet Accuracy
8. CCT Flowers-102 Accuracy
9. CCT Text Classification
10. StyleNAT Configurations
11. Comparison of Generative Models
12. StyleNAT Ablations
13. GLOW and MAF Model Configurations
14. Distilling Normalizing Flow Density Estimation Metrics
15. Distilling Normalizing Flow Performance Metrics
16. Distilling Normalizing Flow Model Configurations for Image Generation
17. Distilling Normalizing Flow Image Generation Metrics
18. Distilling Normalizing Flow CelebA FID

CHAPTER I
INTRODUCTION

I don’t believe in empirical science. I only believe in a priori truth.
Kurt Gödel

1.1 Motivation

This thesis focuses on the development of machine learning algorithms that train efficiently, primarily applied to Computer Vision. Our focus is to develop methods which allow for a reduction in the computational resources required to train and deploy models.

Machine Learning is a subfield of Artificial Intelligence which aims to process data and automate the discovery of structures within the data. This process reduces the burden of needing to derive explicit formulations, instead allowing automation through optimization. This process allows algorithms to “learn” by “training” on the data.

Computer Vision applies to a wide range of problems related to perception. Traditionally associated with image and video processing, the field extends to processing of other data, such as LIDAR, radio, depth estimation, and other forms of signal processing. The domain involves a broad range of tasks, including: regression, which models quantitative relationships between variables; discrimination, the process of distinguishing relevant objects or patterns; and generation, or data synthesis. The primary focus of this thesis revolves around discrimination and generation of images.

Image processing presents unique challenges, often due to the high dimensionality of the embeddings. This high dimensionality causes difficulties in formulating explicit descriptions of our data and the underlying structures within it. The goal of computer vision is to create the machinery necessary to automate this process for us, as efficiently as possible.

While we may not be able to create fully formulated descriptions, the descriptions we provide our algorithms can both help and hinder them. For example, images usually have spatial relationships, with pixels that are spatially local having a high probability of being related to one another. This has led to the use of Convolutional Neural Networks, as their architecture is able to exploit this natural bias.
But such relationships may not always hold. For example, a QR code contains sharp transitions, where neighboring pixels do not aid the prediction of one another. More flexible architectures, such as attention, can better process such imagery by reducing the importance of locality. Therefore, to efficiently process data we must consider the biases implicit to the neural architectures that we use.

The modern success of these algorithms has presented additional challenges. It has been found that many of these methods can be improved through simple means: making them larger and providing them with more training data [142]. While this has led to dramatic improvements, it has similarly led to dramatic increases in the computational resources necessary to train and deploy these models. Once trained, these models may still be quite difficult to deploy, with their high computational demands greatly limiting where they can be used. This has led many researchers to consider how these models can be more efficiently trained, requiring less data, less time to train, and fewer computational resources. Similar challenges exist with respect to the deployment of these models.

1.2 Research Goals and Approaches

The focus of this thesis revolves around two primary questions:

– How do we reduce the model’s data dependence?
– How do we reduce the model’s computational demands?

These questions are fundamentally intertwined, necessitating solutions which address the problems simultaneously. Naturally, reducing the amount of data that a model must ingest reduces the amount of time for which the model must be trained. Conversely, the more efficiently a model extracts information from its data, the less data it will need to achieve a given performance level. This is because a model’s parameters do not just determine its information capacity, but also play an integral role in the solution space during training [133].
Many works have found that, once trained, models are often significantly over-parametrized, meaning only a subset of their parameters is being used to model the data [33, 97]. These findings are further evidenced by the continued increase in performance of smaller models [66], and strongly suggest our models can be trained more efficiently.

Our motivation to reduce a model’s data dependence exists beyond our desire to be cost effective. Real-world large datasets provide two primary challenges which require our models to be data efficient. First, many important structures within the data are subtle and difficult to recover. Second, data is often heavy-tailed, meaning that many parts of the distribution are represented by only a few samples. Fundamentally, these require our models to generalize relationships with minimal examples. While we may focus on explicitly constrained data to aid the interpretation of our work, the resulting data efficiency provides benefits as our models and data expand in size.

These feats are primarily accomplished through the development of neural architectures and optimization methods. This thesis focuses on the former, specifically studying the design of Computer Vision architectures which reduce parameters, data dependence, and system resources. These goals must be simultaneously optimized. Our objective is not to develop models with a small number of parameters if they also require substantially greater costs during training or deployment. Similarly, it would undermine our own goals to reduce a model’s data dependence at significant cost to its performance.

This thesis investigates three critical aspects of our neural architectures and is structured to follow a natural progression in complexity. The first work focuses on understanding how our core neural architecture takes in data and how to efficiently extract the relationships it uncovers.
Without efficiently providing data to and extracting data from a model, the model becomes wasteful, which hinders the ability to develop more efficient core architectures. The second work focuses on the core architecture, which performs the majority of the data processing. These two aspects are studied as applied to vision transformers, with the works building directly on one another. The third work revolves around knowledge distillation of Normalizing Flows. These models are structurally aware, explicitly designed to preserve the structures within the data. From these three lenses this thesis seeks to better understand how to build neural architectures that are smaller, faster, and cheaper.

1.3 Dissertation Outline

This dissertation is organized as follows:

Chapter 2 provides the background and foundational information necessary to understand the research objectives. This background is necessary for understanding how the works are connected and the ways we seek to resolve underlying issues.

Chapter 3 presents the work Escaping the Big Data Paradigm with Compact Transformers [51], and focuses on efficiently embedding and extracting data from Vision Transformers.

Chapter 4 presents the work Efficient Image Generation with Variadic Attention Heads [156], as well as the work it builds upon, Neighborhood Attention Transformer [52].

Chapter 5 presents the work Distilling Normalizing Flows, which provides a framework for knowledge distillation with Normalizing Flow architectures and studies the categories of distillation methods.

Chapter 6 provides an overview of the findings and recommendations for future work.

1.4 Co-Authored Material

The research presented herein involves previously published material. Below is a listing of the prior works in relation to the chapter material. Details of the division of labor are provided in the preface to each chapter.

– Chapter 2: This chapter includes material that was part of Steven Walton’s Area Exam [154].
– Chapter 3: This chapter contains materials from Escaping the Big Data Paradigm with Compact Transformers [51]. This work was a collaboration with Ali Hassani, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, Humphrey Shi, and myself.
– Chapter 4: This chapter contains materials from both Neighborhood Attention Transformer [52] and Efficient Image Generation with Variadic Attention Heads [156], with focus around the latter. The former is a collaboration between Ali Hassani, Jiachen Li, Shen Li, Humphrey Shi, and myself. The latter was a collaboration between Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi, and myself.
– Chapter 5: This chapter contains material from a collaboration between Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi, and myself.

CHAPTER II
BACKGROUND

Mathematicians do not deal in objects, but in the relationships among objects.
Henri Poincaré

Nota Bene: Some of the text and figures from this section were part of Steven Walton’s Area Exam [154], which has been publicly released by The University of Oregon. Steven was the sole author of this work.

This section covers the background necessary for understanding the motivation and purpose of the work performed. It includes some necessary discussion about how machine learning algorithms work, how data is processed, and the inherent biases of different learning architectures. The last of these is the main focus of this thesis. While subsequent chapters will contain less mathematical notation and formulation, those herein provide important context and intuition for the work ahead.

To reach our goal of making our machine learning models smaller, faster, and cheaper, we need to have some core understandings as to how these models work. It is not enough to treat them as black boxes; rather we have to look inside.
Much of machine learning terminology has not been standardized, thus this section may be used to contextualize these terms and their usage within this thesis.

2.1 Learned Data Mappings

The learning procedure can be understood through a mapping between two sets, where our neural network is a learned mapping, f(x). Deep neural networks are Universal Approximators [18, 67, 112], where every multivariate continuous function can, in principle, be approximated by the superposition of a sequence of continuous functions. With this in mind, it helps to revisit some of the basics of functions and set theory.

Name      Relation   Set
Domain    D          {∀ x ∈ D}
Codomain  C          {∀ y ∈ C}
Range     R ⊆ C      {y ∈ C | ∃ x ∈ D : f(x) = y}
Image     R̃ ⊆ R      {y ∈ C | ∃ x ∈ D̃ : f(x) = y}
Preimage  D̃ ⊆ D      {x ∈ D | ∃ y ∈ R̃ : f(x) = y}

Table 1. Explanation of important set terms denoting their relationships and what elements are in their set.

We can view the Domain, D, as all valid inputs to the neural network. In the study of Computer Vision this is any valid image, regardless of whether this image is meaningful to humans or not. We can then define our Range, R, as all possible outputs that our neural net can produce. In the case of Image Classification this would be all labels that we are trying to learn. Our Codomain, C, is a superset of our Range, R ⊆ C, and may include elements that our map cannot reach. In our example of Image Classification our Codomain would represent all possible labels.

In practice, we are likely only interested in studying some subset of our domain, D̃ ⊆ D. This subset can be arbitrary and may be something like our training set, the set of images interesting to humans, or even some subset of our training data. Regardless of what this subset is, when its elements are passed through our mapping function we call the set of outputs an Image, R̃ ⊆ R. A “reverse” of this function may then be defined, called the Preimage, f ∗[R̃].
The Preimage is defined as the set of elements in the domain that map to some image in the codomain. It is important to note that the Preimage is not the inverse of the image. Many texts use the notation f−1, but we will use f ∗ to avoid confusion.¹ Table 1 and Figure 1 are included to help explain these concepts.

Figure 1. The diagram illustrating concepts from Set Theory, explaining the Domain (D), Codomain, Range (R), Image (R̃), and Preimage. Within the Domain it marks the Subset D̃ and the Preimage {x ∈ D | f(x) ∈ R̃}; within the Codomain it marks the Range {f(x) | x ∈ D} and the Image {f(x) | x ∈ D̃}.

¹ This notation is used for a pullback, which is a nearly identical concept.

We will define a Target, T, as the set of data we intend to model. Unfortunately, this distribution may be unobtainable and is often intractable. That is, we are unable to provide a formal description of the distribution. An example, which we will use in Chapter 4 and Chapter 5, is “the set of all possible human faces.” We do not have a proper mathematical description of this set, making it intractable, nor is it possible for us to completely sample from this set, as that would require infinite time.² We instead collect a set of sample data Ω ⊆ T, which may be used to train the model (e.g. FFHQ [79] or CelebA [108]). It is important to note that we may not know how well Ω approximates T, especially when T is intractable. Our model processes data from Ω to generate output, O. When performing Classification/Discrimination tasks, our output may be a label (or a list of labels), but in generative tasks we instead seek to approximate the target distribution, T̃.

² This set would include all faces that were and all faces that will be.

We should keep this model in mind when evaluating our work, so we can best understand what our models can and cannot do. Our data are discrete and sampled from the distributions we are trying to approximate, and great care must be taken to determine what is in our distribution or not.
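The set terminology of this section can be made concrete on a small finite example. The following is a minimal sketch and not code from this work: the toy parity labeler `f`, the sets `domain`, `codomain`, and `subset`, and the helpers `image` and `preimage` are all hypothetical names chosen for illustration. It shows that the Range can be a strict subset of the Codomain, and that taking the Preimage of an Image need not return the subset we started from, which is why the Preimage is not an inverse.

```python
from typing import Callable, Hashable, Iterable, Set


def image(f: Callable[[Hashable], Hashable], subset: Iterable[Hashable]) -> Set[Hashable]:
    """Return {f(x) | x in subset}: the Image of a subset under f."""
    return {f(x) for x in subset}


def preimage(
    f: Callable[[Hashable], Hashable],
    domain: Iterable[Hashable],
    targets: Set[Hashable],
) -> Set[Hashable]:
    """Return {x in domain | f(x) in targets}: the Preimage of a set of outputs."""
    return {x for x in domain if f(x) in targets}


def f(x: int) -> str:
    """Hypothetical stand-in for a learned classifier: labels an integer by parity."""
    return "even" if x % 2 == 0 else "odd"


# Domain D: every valid input. Codomain C: every label we allow,
# including one ("unknown") that the mapping never produces.
domain = set(range(10))
codomain = {"even", "odd", "unknown"}

# Range R: the outputs the mapping can actually reach.
value_range = image(f, domain)

# A subset of the Domain we care about (for example, a training set) and its Image.
subset = {0, 2, 4}
subset_image = image(f, subset)

# Preimage of that Image: everything in the Domain that maps into it.
recovered = preimage(f, domain, subset_image)

print("Range:", sorted(value_range))
print("Image of subset:", sorted(subset_image))
print("Preimage of image:", sorted(recovered))
```

In this sketch the Preimage of the Image is strictly larger than the original subset, because inputs outside the subset share the same label.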
2.2 Scale Is Not All You Need

In March of 2019 Richard Sutton wrote a short article titled The Bitter Lesson [142]. This article had a large impact on the machine learning community. Sutton makes the argument that methods based predominantly on leveraging human knowledge are ill-founded and that our historical progress has shown us that focusing on search has resulted in success. Sutton acknowledges the benefits of leveraging human knowledge as well as how in practice this can often be constraining, preventing our machines from leveraging more general computation. Either through misinterpretation by Sutton or by readers, a popular belief arose in the community: “Scale Is All You Need”.

This notion needs to be addressed, for if the belief is taken at face value then the only work that needs to be done is that of scaling compute and data gathering. Some interpret this to mean that scaling is sufficient even if more efficient methods may exist, but we will show that scaling alone is insufficient. We do not dispute that scale is a necessary and essential component, but contend that it alone is insufficient both to explain recent progress and to provide direction for further advancement. These claims let critical conditions remain implicit, assuming shared assumptions among readers. These subtle details are consequential to generating efficient machine learning models, as understanding what data increases performance allows us to also better design algorithms to maximally incorporate information. Two aspects of scaling must be addressed: that of scaling data (Section 2.2.1) and that of scaling compute (Section 2.2.2).

2.2.1 Scaling Data. Undeniably, one of the reasons for major advances has been the scaling of data. There is a simple argument that may suggest scaling data will be sufficient. We need to examine this argument to understand where it holds and where it does not. Our goal in machine learning is to learn some distribution, which we will call our Target Distribution, T.
If we uniformly and randomly sample from our target distribution, one can conclude that with scale we will also increase our covering of the distribution. We may view this another way: if we select some arbitrary point in our target distribution, then as we continue to sample, the distance between it and some data point, $d_i$, in our set of sampled points will decrease:

$$\exists\, \epsilon \in \mathbb{R} \ \text{s.t.}\ \|d_i - d_j\|_p^p < \epsilon \mid \forall\, d_i, d_j \in T,$$

where $\|\cdot\|_p^p$ represents an arbitrary $L_p$ distance. We can refine this more generally, which will better help us as we increase complexity. We can partition our distribution $T$ into disjoint continuous partitions $\{P_0, \ldots, P_n\}$. That is, $P_i \cap P_j = \emptyset \mid \forall\, i \neq j$ and $\bigcup_i P_i = T$. We can reach a similar natural conclusion: as the number of samples increases, the probability that there does not exist a sample $s$ belonging to partition $P_i$ goes to zero,

$$\lim_{n \to \infty} \Pr(\nexists\, s \in P_i) = 0,$$

where $n$ in the limit counts the samples drawn. This generalization helps us in two ways. First, our partitions can be of arbitrary size and shape, allowing us to use them as abstractions, such as semantic representations,³ where a semantic representation may represent categories of our data. For example, if our model is generating human faces we may consider hair color as a semantic representation. Second, this formulation can be repeated for each partition, which allows us to extend the notion to a more realistic setting where data is discrete (i.e. discretization).

While this logic may be natural, it relies on assumptions that are not true in practice. Notably, it assumes both that the data are independent and identically distributed (i.i.d.) and that our sampling process is unbiased. These assumptions are representative neither of real-world data nor of the way in which we sample. In practice, as we increase the number of samples we increase the diversity of our data. This diversity, or variance, in data has a large impact on our models’ ability to generalize.
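The idealized covering argument can be made concrete under its own i.i.d. assumption. If a partition holds probability mass $p$, the chance that $n$ independent draws all miss it is $(1 - p)^n$, which goes to zero as $n$ grows. The sketch below is only an illustration: the function names and the example masses are hypothetical and are not taken from any experiment in this work.

```python
import math


def miss_probability(partition_mass: float, num_samples: int) -> float:
    """Probability that none of `num_samples` i.i.d. draws lands in a
    partition that holds `partition_mass` of the total probability."""
    return (1.0 - partition_mass) ** num_samples


def samples_for_coverage(partition_mass: float, max_miss: float) -> int:
    """Smallest sample count whose miss probability is at most `max_miss`."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - partition_mass))


# A common partition (1% of the mass) is almost surely covered quickly.
common = miss_probability(0.01, 1_000)

# A rare partition (one in a million) is still missed with probability
# close to 1/e after a million draws.
rare = miss_probability(1e-6, 1_000_000)

# Number of draws needed before that rare partition is missed at most 1%
# of the time.
needed = samples_for_coverage(1e-6, 0.01)

print(f"common: {common:.2e}  rare: {rare:.3f}  needed: {needed}")
```

Even in this idealized setting, the number of samples needed to cover a partition grows roughly as the inverse of its mass, so rare partitions are covered slowly; the biased, non-i.i.d. sampling found in practice can only make this harder to reason about.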
We will see in Chapter 3 that introducing data augmentation to our models results in a significant improvement in their performance. These augmentations create additional variance in the data and help the model to not overfit. Scaling data in the way we typically gather it can grow the variance to a greater degree than our typical data augmentation methods can. But this represents a fundamental limitation as well. We cannot scale infinitely, and as we gather more data we inevitably turn from increasing the variance to contracting it. There are only so many unique things in the world. To understand this, we may think about randomly throwing darts at a dartboard. As we start, each new dart likely lands far from the others. But as we continue, we increase our coverage of the dartboard and our new darts land close to an existing dart. This variance contraction means that we cannot rely on scaling data indefinitely.

³ We still need to maintain care to ensure our semantic representations are disjoint. This does not allow us to pick arbitrary semantic representations.

Additionally, an extra challenge comes from scaling data. Once the data becomes large enough, we are unable to properly investigate it. This means we will not be able to properly verify that our model is not trained on the data it is being tested on. For this reason, we want to use the minimum amount of data required to train our models, to reduce our burden of verification.

In practice, our data is heavy tailed, with many samples being underrepresented. Ultimately, despite high amounts of data, subsets exist in a low-data regime. Our models may benefit from shared similarities, via a superposition of representations, but we are still motivated to develop models which work better when data is sparse.
By better understanding how to make our models learn efficiently in limited data regimes, we hope to build techniques that allow our larger models to efficiently model data that is within the long tail.

2.2.2 Model Size. We face similar complexities when it comes to the scaling of our models. Inherently, our model parameters change our loss landscape [100], with larger models providing more ways for data to be disentangled [95, 45, 29]. It can be shown that by using different loss functions we may even trick ourselves into believing our models have found emergent capabilities [159] when they may not have [133]. With increased model parameters our models are more likely to overfit our data, making it difficult to generalize. With such sizes in terms of data and parameters, it becomes difficult to distinguish between our models memorizing the data and modeling the data. In practice, we benefit from physical limitations, which also put pressure on making our models as small as possible. The larger our models are, the more expensive they are to run.

2.3 The Foundations That Shape Us

To train our models cost-effectively we want them to be both parameter efficient and data efficient. With too much data, we may spend disproportionate time loading from disk and simply ingesting the data. With too many parameters, we must split, or shard, our model across large supercomputing infrastructures. Key to Sutton’s Bitter Lesson was that models should be powerful and flexible. With our trend in scaling, we have also seen tremendous improvements in the algorithms that we use, such as the advent of the transformer [150]. Scale cannot be enough to explain our progress, as we have found that as research progresses, many smaller models end up significantly outperforming larger models [66], and this thesis is a further demonstration of that. These algorithms may be referred to as our neural architectures, as we build them to work together.
In the following sections we introduce some of the key architectures that will be used throughout this work. There exist far more frameworks and methods [46, 112]; we focus only on those used herein.

2.3.1 Transformers. The transformer model has become the backbone of modern machine learning models. This is due to its high flexibility, being able to form a relationship between all elements it attends to. Unlike many other architectures, the transformer is not limited by the locality of the data, being able to discover relationships between data regardless of position in a sequence. This greater flexibility comes at an increased computational complexity, but enables the model to form relationships that could not be efficiently formed through previous architectures. These models are fairly simple in construction, having two main components: attention [43, 114, 150] and a feed-forward layer.

[Figure 2: transformer block diagram.]
Figure 2. The Transformer model architecture from Vaswani et al. Diagram depicts dot-product self attention.

Figure 2 depicts part of the transformer model from Vaswani et al.’s work, showing the dot-product self attention (DPSA) variant, which is used throughout this work. The figure depicts a “post-norm” configuration, with the normalization layers appearing after the attention and feed-forward units, but modern configurations usually use “pre-norm” due to increased stability. The core of the transformer model is attention, defined as:

$$\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2.1}$$

where $Q$, $K$, and $V$ represent queries, keys, and values, respectively. These are learned, most usually parameterized by a single-layer feed-forward network. In the DPSA configuration, these networks share the same input. The factor $1/\sqrt{d_k}$ in this equation is a softmax temperature scale, the inverse square root of the embedding dimension $d_k$ (a user-defined hyperparameter). The queries and keys are multiplied together, learning a similarity matrix. The softmax of this is then referred to as the “score”, as its values are defined by a probability distribution. The value tensor is then weighted by the score, defining our attention function.

Commonly, this configuration is done in a “multi-headed” manner. Instead of performing a single attention we may project our $Q$, $K$, and $V$ tensors into an embedding so that we may process multiple attention calculations in parallel. At the end of the attention mechanism these tensors are concatenated. This tends to make our models more efficient, as each head is independent and can learn unique representations, as we will see in Chapter 4.

The transformer model typically includes the usage of positional encoding, which adds extra data to the model to indicate the position of tokens, or data, in a sequence.

[Figure 3: taxonomy tree.
Generative Models
  Explicit Density
    Tractable Density: Autoregressive, Normalizing Flows
    Approximate Density: Diffusion Models, VAEs
  Implicit Density
    Markov Chain: GSNs
    Direct: GANs]
Figure 3. Taxonomy of Generative Models, based on Goodfellow’s Taxonomy [40].

2.3.2 Adversarial Generation. Generative Adversarial Networks (GANs) are a form of generative model introduced by Goodfellow et al. [41], which first enabled the generation of high-quality synthetic imagery. Not necessarily restricted to image synthesis, these models enable unsupervised learning by training two models simultaneously. If our goal is to train an image generator, we train both a model to generate images and a model to discriminate between real and fake images. The discriminator model requires labeled data, but only the binary distinction of real data or synthesized data. These models then train competitively, playing a minimax game, which often leads to high-quality generation:

$$\min_{G} \max_{D} \ \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D\left(G(z)\right)\right)\right]. \tag{2.2}$$

$G$ learns a differentiable map $z \mapsto x$ that pushes forward a simple prior (usually a spherical Gaussian) toward the data manifold, while $D$ learns to spot discrepancies. While these models have shown great success and pushed the bounds of what is possible, they are not without problems. Training is notoriously unstable: mode collapse, vanishing gradients, and catastrophic forgetting are common. In addition, many generative models have greatly increased in size. These size increases have resulted in more impressive images, but the models have also become harder and costlier to train, with lower throughput. There must then be a trade-off between capabilities and performance, depending on the application. In Chapter 4 we will use a GAN to demonstrate an improved variant of an attention mechanism, improving throughput and quality while decreasing the total number of parameters.

2.3.3 Normalizing Flows. Normalizing Flows provide exact log-likelihoods by composing a sequence of bijective, differentiable transforms $f = f_1 \circ \cdots \circ f_k$:

$$p_x(x) = p_u(u)\,\left|\det J_f(u)\right|^{-1} \tag{2.3}$$

Here $p_u$ is a tractable base distribution and $\det J_f$ denotes the Jacobian determinant. The Jacobian determinant allows for a change of variables, allowing data from one distribution ($u \in U$) to be expressed in another coordinate system ($x \in X$). A simplified example that many readers may be more familiar with is the change of coordinates from Cartesian to Polar coordinates:

$$J = \det\frac{\partial(x, y)}{\partial(r, \theta)} = \begin{vmatrix} \dfrac{\partial x}{\partial r} & \dfrac{\partial x}{\partial \theta} \\[4pt] \dfrac{\partial y}{\partial r} & \dfrac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r\cos^{2}\theta + r\sin^{2}\theta = r \tag{2.4}$$

Given the Jacobian determinant it becomes trivial to convert an integral from Cartesian to Polar coordinates by the equation:

$$\iint f(x, y)\, dx\, dy = \iint f(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta$$

This idea extends greatly, with far more complex formulations of coordinate transforms.
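The polar-coordinate Jacobian of Equation 2.4 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical. It estimates the determinant with central finite differences and then uses the factor $r$ to integrate the constant function 1 over the unit disk.

```python
import math


def polar_to_cartesian(r: float, theta: float) -> tuple[float, float]:
    """Map polar coordinates (r, theta) to Cartesian coordinates (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)


def jacobian_determinant(r: float, theta: float, step: float = 1e-6) -> float:
    """Estimate det d(x, y)/d(r, theta) with central finite differences."""
    x_rp, y_rp = polar_to_cartesian(r + step, theta)
    x_rm, y_rm = polar_to_cartesian(r - step, theta)
    x_tp, y_tp = polar_to_cartesian(r, theta + step)
    x_tm, y_tm = polar_to_cartesian(r, theta - step)
    dx_dr = (x_rp - x_rm) / (2 * step)
    dy_dr = (y_rp - y_rm) / (2 * step)
    dx_dt = (x_tp - x_tm) / (2 * step)
    dy_dt = (y_tp - y_tm) / (2 * step)
    return dx_dr * dy_dt - dx_dt * dy_dr


def unit_disk_area(num_rings: int = 1000) -> float:
    """Integrate f = 1 over the unit disk in polar coordinates.

    The theta integral contributes 2*pi; the r integral uses a midpoint
    rule with the Jacobian factor r as the weight.
    """
    dr = 1.0 / num_rings
    return sum(2 * math.pi * ((i + 0.5) * dr) * dr for i in range(num_rings))


for r, theta in [(0.5, 0.3), (2.0, 1.2), (3.5, -2.0)]:
    print(f"r={r}: estimated determinant {jacobian_determinant(r, theta):.6f}")
print(f"unit disk area: {unit_disk_area():.6f}")
```

The estimated determinant tracks $r$ at every test point regardless of $\theta$, and the weighted integral recovers the area of the unit disk, $\pi$.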
The importance of these transforms is that they generate an isomorphic mapping from one space to another, where every element in one coordinate system maps precisely to a unique element in the other. Through the composition of these transformations we can then define a nice tractable distribution, such as a Gaussian, and learn a coordinate transform that maps our data. This, in effect, allows us to turn our intractable distribution into a tractable one. We should remain careful, as there are still some pitfalls and our distribution is still only an approximation.

What makes this different from Approximate Density models, such as VAEs and Diffusion models, is that those models do not generate isomorphic functions. Like flows, they are able to generate a probability density function, making them “explicit” (Figure 3), but these models are by nature lossy. Where Flows are bijective, diffusion models and VAEs are not.

[Figure 4: three panels. (a) Injection: one-to-one. (b) Surjection: onto. (c) Bijection: one-to-one and onto.]
Figure 4. Visual representation of injections, surjections, and bijections. Source: Wolfram MathWorld.

The two most common forms of Normalizing Flows, which are also used within this thesis, are:

Affine coupling flows: Partition the input $x$ into two parts, $(x_0, x_1)$, such that $f(x_0, x_1) = \left(x_0,\ x_1 \odot e^{s(x_0)} + t(x_0)\right)$, which yields computationally inexpensive triangular Jacobians (e.g. RealNVP [26], Glow [86]).

Autoregressive flows: Parameterize each dimension conditioned on previous ones, yielding a composition of triangular Jacobians (MAF/IAF [121, 88]).

Unlike transformer models, the architectures of Normalizing Flows are highly restrictive. These restrictions come with the benefit of increased interpretability, but at the cost of additional computation and less flexibility. Deciding where to make these trade-offs is difficult, and determining the capability of these models remains a challenge.
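To make the affine coupling transform concrete, the sketch below implements a single coupling layer in plain Python. It assumes the standard RealNVP form, in which both the scale $s$ and the shift $t$ depend only on $x_0$; the functions `scale_net` and `shift_net` are hypothetical fixed stand-ins for what would be learned networks, so this is an illustration rather than the implementation used in this thesis.

```python
import math


def scale_net(x0: list[float]) -> list[float]:
    """Hypothetical stand-in for a learned scale network s(x0)."""
    return [math.tanh(0.5 * v) for v in x0]


def shift_net(x0: list[float]) -> list[float]:
    """Hypothetical stand-in for a learned shift network t(x0)."""
    return [0.1 * v + 1.0 for v in x0]


def coupling_forward(x0: list[float], x1: list[float]):
    """Apply y0 = x0, y1 = x1 * exp(s(x0)) + t(x0).

    The Jacobian is triangular with diagonal (1, ..., 1, exp(s(x0))), so
    its log-determinant is just the sum of the scale outputs.
    """
    s, t = scale_net(x0), shift_net(x0)
    y1 = [b * math.exp(si) + ti for b, si, ti in zip(x1, s, t)]
    log_det = sum(s)
    return list(x0), y1, log_det


def coupling_inverse(y0: list[float], y1: list[float]):
    """Invert the layer; x0 passes through, so s and t can be recomputed."""
    s, t = scale_net(y0), shift_net(y0)
    x1 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y1, s, t)]
    return list(y0), x1


x0, x1 = [0.2, -1.0], [1.5, 0.3]
y0, y1, log_det = coupling_forward(x0, x1)
x0_rec, x1_rec = coupling_inverse(y0, y1)
print("forward:", y0, y1, "log|det J| =", log_det)
print("recovered:", x0_rec, x1_rec)
```

Because $x_0$ passes through unchanged, the inverse never needs to invert `scale_net` or `shift_net` themselves; this is what lets those networks be arbitrarily complex while the layer as a whole stays bijective with a cheap determinant.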
Unfortunately, these models tend to be greatly understudied, with only a handful of models having been trained with more than 100M parameters, which is fairly small by modern standards.

2.4 The Tyranny of Measurements

As a final note, we must be ever vigilant of the metrics that we use. Quantitative metrics are a critical part of the scientific method, evidencing our hypotheses and theories. Yet, metrics are only guides, proxying the things we wish to measure. We must stress the importance of this distinction, as it is necessary to properly evaluate our models and interpret what they are doing. Within this thesis several of our works face the challenges of interpreting our metrics and the absence of them. In Chapters 4 and 5 we perform image synthesis tasks, where our models create new data that is representative of what they were trained on. There are no metrics that properly convey what is or is not a good image. For example, a common metric for measuring the capabilities of image models is the Fréchet Inception Distance (FID) [60]. This metric was shown to correlate with human judgement of image quality, but was developed when image quality was much worse. For comparison, the paper that introduced FID demonstrated models with an FID around 12.5 on the CelebA dataset, while the current state of the art is 3.15 [146]. These correlations have helped improve state-of-the-art systems but, because the metric is not perfectly aligned with an actual measurement of realism, the discrepancies grow as our models improve.

The rapid success of machine learning is a double-edged sword. The approximations that helped us make our progress may no longer be sufficient. With all metrics, we must constantly check their alignment, to ensure that we are progressing in the directions we intend. This is quite similar to the gradient descent process we use in machine learning, where early on we may make large improvements with highly suboptimal steps towards the optima.
Yet, as our model becomes better, we tend to make smaller steps to ensure we are progressing in the right direction.

CHAPTER III

ESCAPING THE BIG DATA PARADIGM

The first principle is that you must not fool yourself and you are the easiest person to fool.
Richard Feynman

Nota Bene: This chapter is based on the previously published co-authored work Escaping the Big Data Paradigm with Compact Transformers [51] and the associated blog post published through PyTorch’s Medium page [155].

– Ali Hassani and Steven Walton are joint primary authors of this work. Together they wrote the majority of the code, performed the majority of the experiments, and wrote the majority of the paper. The majority of the code was written during pair-programming sessions between the two.
– Steven Walton worked a bit more on designing the experiments and developing the theory, ensuring claims were thoroughly evidenced and finding relevant literature.
– Ali Hassani worked a bit more on code and launching experiments, increasing code quality and ensuring experiments were launched effectively, maximizing machine utilization.
– Nikhil Shah helped manage launching experiments and contributed to the paper writing.
– Abulikemu Abuduweili provided code and feedback for the NLP experiments.
– Humphrey Shi was the advisor, contributing overall guidance on the research as well as funding for the work. Humphrey also contributed to the writing of the paper and to ensuring the research stayed on track.

Critical to any data analysis is the preparation of that data. The ways in which we encode our data have significant impacts on the way that data is processed. It is not sufficient to simply apply the right modeling tools to the data; one first needs to ensure that the data is properly processed. In machine learning systems, this processing is typically done by both man and machine. The ingress and egress of data is critical, and will influence what structures in the data can ultimately be recovered.
In this chapter we introduce the work Escaping the Big Data Paradigm with Compact Transformers [51]. This work demonstrates that Vision Transformers do not need large amounts of data to be performant, instead being able to be trained from scratch and be effective in limited data regimes. Our results run counter to conventional wisdom around scaling, demonstrating that scale may decrease performance rather than increase it. On small datasets, like CIFAR-10, our small models are able to achieve performance comparable to much larger ViT models that also have large pretraining. On medium datasets, like ImageNet, we are able to outperform ViTs of comparable sizes, and achieve accuracies only slightly lower than large models with large pretraining.

3.1 Vision Transformers

With Vaswani et al.’s [150] demonstration of a dot-product self-attention based transformer architecture in language, there were several attempts to integrate such architectures into vision models [6, 129, 69, 68]. Cordonnier et al. [16] first showed that by downsampling and adding a positional encoding layer, a BERT [24] style Transformer architecture could learn convolutional filters, given a sufficient number of attention heads. Unfortunately, these researchers were memory bound and were using 2 × 2 invertible down-sampling. Dosovitskiy et al. [28] improved upon this work, claiming “An Image is Worth 16×16 Words” and introducing the Vision Transformer. Instead of using a 2 × 2 down-sampling, they used larger 16 × 16 patches, giving the paper its name. Additionally, Dosovitskiy et al. significantly scaled both data and compute. While Cordonnier et al.’s network was ≈12M parameters, Dosovitskiy et al. used three networks of 86M, 307M, and 632M parameters.

[Figure 5: labels recoverable from the diagram are Convolution, Patching, Reshape, (Optional) Positional Embedding, Transformer Encoder ×N, Sequence Pooling, MLP Head, and Class (Duck), with the variants Compact Convolutional Transformer and Compact Vision Transformer.]
Figure 5. Architectural design of Compact Transformers.
While Cordonnier et al. exclusively trained on CIFAR-10 and CIFAR-100 [147], Dosovitskiy et al. performed pretraining with the proprietary JFT-300M dataset [141], ImageNet-21k, and ImageNet-1k [23]. Their work showed that with large-data pretraining one could outperform ResNet [55] models, although later work showed that when ResNets are trained with modern training procedures, classification accuracy becomes similar [161]. Dosovitskiy et al. performed a wide variety of experiments, including using a CNN to generate their patch embedding and fine-tuning at higher resolutions than pretraining [148, 89]. Their results suggested that only through large pretraining and large models could ResNets be beaten. Dosovitskiy et al.’s work made an important claim:

    “Transformers lack some of the inductive biases inherent to CNNs, such as translational equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive biases.”

If this problem could not be resolved, it would greatly limit research contributions by labs without large compute infrastructures.¹

The community was quick to challenge Dosovitskiy et al.’s claim. Touvron et al.’s Training Data-Efficient Image Transformers & Distillation Through Attention [149] quickly followed in an attempt to address the claim, introducing the DeiT model. In particular, they criticized the large pretraining and sought to counter the claim that transformers do not generalize when trained on insufficient amounts of data. Their work similarly uses three models, but these are tiny (5M parameters), small (22M parameters), and base (86M parameters). The ViT was modified to introduce a knowledge transfer² token, and the training scheme was modified to include distillation from a pretrained convolution-based network.
For their convolutional network they selected a RegNetY-16GF [127] network (84M parameters) as the default teacher network.

¹ Often called “GPU Poor”.
² We use the phrasing knowledge transfer instead of distillation for increased clarity, as the “teacher” network has fewer parameters than the “student” network.

3.2 Data Efficient Vision Transformers

While we recognize the importance of these works, we believe alternative conclusions are possible. The ViT results could be explained by several alternative hypotheses, including the size of the network and the training techniques. DeiT’s results showed that part of the claim must be false, as even smaller models could achieve better performance, but this relied upon inheriting the local inductive biases transferred by a CNN rather than the transformer learning them itself, which Cordonnier et al. had demonstrated is possible. The critical question remained: can transformer models be trained to outperform ResNets when model size and data are held equal? Both works suggested that the answer was no. On the other hand, Transformers are universal approximators, and Cordonnier et al.’s work suggests there is no reason one should believe in this data threshold requirement. Additionally, we believed ViT and DeiT were rejecting valuable information by only passing a slice of the transformer’s outputs to the classification sub-network. In an effort to resolve this, we proposed three hypotheses:

– Non-overlapping image patches bias the transformer networks due to information loss at the boundaries.
– A learned transformation to map the transformer’s outputs to the classification sub-network will improve performance.
– Transformer networks rely more on data variance than data quantity.

3.2.1 Convolutional Tokenizer. Our hypothesis regarding data variance was motivated by the discussion in the background section (Chapter 2.2.1), which argued that these models gain more benefit from data variance than from data quantity.
While diversity is a common side-effect of scaling, it is a distinct phenomenon. Our hypothesis regarding non-overlapping patches was inspired by the subword tokenization that is commonly used by many language models [35, 135, 24, 150] and by experience with computational modeling. The belief here is that by using non-overlapping patches we weaken the network’s ability to incorporate information along the boundaries of the patches. Such boundary conditions often plague computational models, requiring ghost cells and other forms of boundary communication techniques to de-bias calculations.

[Figure 6: three pipelines.
Compact Convolutional Transformer (CCT), Convolutional Tokenization, Transformer with Sequence Pooling: Inputs, ConvLayer, Pooling, Reshape, Optional Positional Embedding, Transformer Encoder, Sequence Pooling, Linear Layer, Output.
Compact Vision Transformer (CVT), Patch-Based Tokenization, Transformer with Sequence Pooling: Inputs, Embed to Patches, Linear Projection, Reshape, Positional Embedding, Transformer Encoder, Sequence Pooling, Linear Layer, Output.
Vision Transformer (ViT), Patch-Based Tokenization, Transformer with Class Tokenization: Inputs, Embed to Patches, Linear Projection, Reshape, Class Token, Positional Embedding, Transformer Encoder, Slice (Class Token), Linear Layer, Output.]
Figure 6. A comparison of the Vision Transformer variants used throughout this study. On the left is the batching and embedding process (tokenization). On the right is the main neural architecture. The Transformer Encoder blocks and Linear Layers (classification sub-network) are identical for all models. CVT follows ViT, removing the class token and introducing SeqPool. In CCT we modify the tokenization process, building from CVT.

ViT uses a simple patch and embedding procedure, where the image is evenly divided into patches. This is illustrated on the right half of Figure 5, under Compact Vision Transformer (CVT). The process is to do a Group Normalization, ReLU, MaxPool, patch, and embed. Notably, Dosovitskiy et al. did the patching and embedding simultaneously with a convolution, matching strides to the kernel size.³ This same strategy is used for our ViT-Lite and Compact Vision Transformer (CVT) models. This procedure can be seen in Figure 6.

We propose removing the restriction that the convolutional kernels and strides match, allowing these patches to overlap. This has an additional beneficial side-effect, allowing for better generalizability by not requiring image dimensions to be integer multiples of the kernel size. This extends the embedding process to allow for arbitrary image sizes and aspect ratios. Additionally, we remove the Group Normalization layer from the ViT model, finding it unnecessary. Given an image or feature map $x \in \mathbb{R}^{H \times W \times C}$ we can process our image as follows:

$$x_0 = \mathrm{MaxPool}\left(\mathrm{ReLU}\left(\mathrm{Conv2d}(x)\right)\right) \tag{3.1}$$

Our convolution has a number of filters equal to the embedding dimension of the transformer backbone, and both our convolution and pooling operations allow for overlapping, which can introduce local inductive biases.

3.2.2 SeqPool. In order to map the sequential output of a transformer to the linear representation required by a feed-forward classification network, ViT uses a single class index, or token, similar to language models like BERT [24]. This class token is learnable and allows the output of the transformer to be sliced along the learned index. Unfortunately, this underutilizes the relationships learned by the transformer encoding layers. This method makes the assumption that the transformer encoder can, and will, decouple the relationships of the training data. This disentanglement is the main task of the classification sub-network; thus, forcing our Transformer to also perform it likely leads to underutilization and overly constrains the encoding layers.
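The effect of decoupling stride from kernel size in the tokenizer of Equation 3.1 can be illustrated with the standard output-size formula for convolution and pooling, $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below only counts tokens and does not implement the tokenizer itself. The 3×3 kernel with stride 1 follows Table 2b, while the padding values and the pooling configuration (kernel 3, stride 2, padding 1) are assumptions made for illustration.

```python
def output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def dropped_pixels(size: int, kernel: int, stride: int) -> int:
    """Pixels along one axis that no unpadded window ever covers."""
    windows = output_size(size, kernel, stride)
    return size - ((windows - 1) * stride + kernel)


def patch_tokens(height: int, width: int, patch: int) -> int:
    """Sequence length for ViT-style patching (stride equals kernel size)."""
    return output_size(height, patch, patch) * output_size(width, patch, patch)


def conv_tokens(height: int, width: int) -> int:
    """Sequence length for one overlapping Conv2d + MaxPool stage.

    Assumed settings: conv kernel 3, stride 1, padding 1, followed by
    pooling with kernel 3, stride 2, padding 1.
    """
    def one_axis(n: int) -> int:
        after_conv = output_size(n, kernel=3, stride=1, padding=1)
        return output_size(after_conv, kernel=3, stride=2, padding=1)

    return one_axis(height) * one_axis(width)


for side in (32, 30):
    print(
        f"{side}x{side}: patch tokens={patch_tokens(side, side, 4)}, "
        f"pixels dropped per axis={dropped_pixels(side, 4, 4)}, "
        f"conv tokens={conv_tokens(side, side)}"
    )
```

With a 4×4 patch, a 30×30 input silently loses two pixels along each axis because 30 is not a multiple of 4, whereas the overlapping, padded convolution and pooling windows still see every pixel of either input size.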
³ This can be seen at github.com/google-research/vit_jax/models_vit.py:264 (https://github.com/google-research/vision_transformer/blob/main/vit_jax/models_vit.py#L263-L270).

We propose SeqPool, an attention-inspired pooling method. The method is based on the assumption that the transformer encoder’s output sequence contains information relevant to classification. While this method is more computationally complex than slicing, it can reduce overall computation due to the removal of an additional token that must be processed by the entirety of the network. We use a network to generate a contraction $S : \mathbb{R}^{b \times n \times d} \mapsto \mathbb{R}^{b \times d}$, whose output is then an appropriate shape to be processed by the classification sub-network:

$$\mathrm{Softmax}\left(g(x_L)^{T}\right) x_L \tag{3.2}$$

Unlike dot-product attention, we are not using keys, queries, and values, but instead learning a weighting of our sequence. Our function $g$ is a single feed-forward layer mapping $g : \mathbb{R}^{b \times n \times d} \mapsto \mathbb{R}^{b \times n \times 1}$, producing one score per sequence element. We apply a softmax to these scores and use them to weight our original input, producing the flattened output. This process can be seen as a learnable submersion, incorporating information across sequential data better, seemingly allowing us to take advantage of neuron polysemanticity [134, 62] and superpositionality [31].

3.3 Experiments

We perform a variety of experiments in order to test our research hypotheses. We name our models similarly to those of ViT, using the more explicit format:

[model]-[N layers]/[patch size]×[N convolutions] (3.3)

The original ViT-B/16 model has 12 transformer encoder layers and a patch size of 16; making the number of layers explicit, we write it as ViT-12/16. We use this convention for all ViT and CVT models, dropping the number of convolutions. For CCT we specify the number of convolutions, even if only one. This section is organized to first provide details of our experiments and resources. Chapter 3.3.4 contains our main results, demonstrating high performance Vision Transformer models on small datasets.
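The SeqPool operation of Equation 3.2 can be sketched for a single example (dropping the batch dimension) in plain Python. This is an illustration under stated assumptions: $g$ is taken to be one linear layer that produces a single score per token, and the weights below are hypothetical values chosen by hand rather than learned parameters.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def seq_pool(tokens: list[list[float]], weight: list[float], bias: float = 0.0) -> list[float]:
    """Pool an n x d token sequence into one d-dimensional vector.

    `weight` and `bias` play the role of g: each token gets one scalar
    score, the scores are normalized with a softmax over the sequence,
    and the tokens are averaged with those weights.
    """
    scores = [sum(w * v for w, v in zip(weight, tok)) + bias for tok in tokens]
    attention = softmax(scores)
    dim = len(tokens[0])
    return [sum(a * tok[j] for a, tok in zip(attention, tokens)) for j in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# With all-zero weights every token scores the same, so SeqPool reduces
# to plain average pooling.
uniform = seq_pool(tokens, weight=[0.0, 0.0])

# With a weight that strongly favors large first features, nearly all of
# the mass goes to the last token.
peaked = seq_pool(tokens, weight=[10.0, 0.0])

print("uniform:", uniform)
print("peaked:", peaked)
```

The two extremes show the range a learned $g$ can cover: it can behave like average pooling, like selecting a single token (much as a class token slice does), or anything in between, depending on the content of the sequence.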
Chapter 3.3.5 includes details of our ablations, detailing the effects of our changes to the architecture. Chapter 3.3.6 provides a scaling study, investigating the scaling of both data and parameters. Finally, Chapter 3.3.7 includes our NLP experiments, to demonstrate that these results generalize to language models.

3.3.1 Datasets. Our primary focus is on small datasets, where we train on CIFAR-10, CIFAR-100 [147], MNIST [94], and Fashion-MNIST [164]. We also test our models on Oxford Flowers-102 [120]⁴ for generalizability, due to its large similarity between classes and high variance within classes. We also use ImageNet [23] to test the scalability of our approach, allowing for more direct comparisons to ViT and DeiT. We also test our approach in Natural Language Processing, using AG-News [172], TREC [101], SST [138], IMDb [116], and DBpedia [2].

3.3.2 Computational Resources. For most experiments we use a machine with an Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz and 4 NVIDIA RTX 2080Tis (11GB). The exception was the CPU test, which was performed with an AMD Ryzen 9 5900X, where we found one could reach 90% accuracy in under 30 minutes. Our ImageNet experiments were performed on a single machine with either 2 AMD EPYC 7662s and 8 NVIDIA RTX A6000s (48GB) or 2 AMD EPYC 7713s and 8 NVIDIA A100s (80GB).

3.3.3 Hyperparameters. We used the PyTorch Image Models library (timm) [160] to train our models for all image experiments. Our augmentations include CutMix [167], Mixup [170], RandAugment [19], and Random Erasing [178].

⁴ We used the dataset from Kaggle, which has a different data split than torchvision. Further discussion is provided later.
Model        # Layers   # Heads   Ratio   Dim
ViT-Lite-6   6          4         2       256
ViT-Lite-7   7          4         2       256
CVT-6        6          4         2       256
CVT-7        7          4         2       256
CCT-2        2          2         1       128
CCT-4        4          2         1       128
CCT-6        6          4         2       256
CCT-7        7          4         2       256
CCT-14       14         6         3       384
(a) Transformer Hyperparameters

Model          # Layers   # Convs   Kernel   Stride
ViT-Lite-7/8   7          1         8×8      8×8
ViT-Lite-7/4   7          1         4×4      4×4
CVT-7/8        7          1         8×8      8×8
CVT-7/4        7          1         4×4      4×4
CCT-2/3x2      2          2         3×3      1×1
CCT-7/3x1      7          1         3×3      1×1
CCT-7/7x2      7          2         7×7      2×2
(b) Tokenizer Hyperparameters

Table 2. Hyperparameters used in different model configurations. Table 2a shows transformer hyperparameters while Table 2b shows those for tokenizers.

We performed hyperparameter sweeps for our differing methods and report the best results we achieved. All hyperparameter experiments were trained for 300 epochs and use a learning rate of $5 \times 10^{-4}$, a cosine learning rate scheduler, and the AdamW (weight-decayed Adam) optimizer ($\beta = [0.9, 0.999]$) [85, 177]. For CNN models we found that some performed best with AdamW while others were more performant with SGD with momentum 0.9. For reproducibility we release the checkpoints corresponding to the reported numbers and the YAML files corresponding to our experimental settings. These can be found on our public GitHub repository.⁵

3.3.4 Transformers On Small Datasets. The main result of this work is the success of training Vision Transformers on small datasets. We follow the aforementioned training procedure, except that we train our best model further, as it did not appear to be saturated. Our full results can be read in Table 3, where we show a comparison of various ResNet based models, ViTs, CVT, and CCT, testing on our small vision datasets with comparisons of model size and required compute.
Notably, on CIFAR-10, we are able to achieve a 10% improvement over similarly sized ViT-Lite models (ViT-Lite-7/8) and an 18% improvement over the ViT-12/16 (ViT-B/16) model while our model has a 95.6% reduction in the number of parameters. Our best model only contains a single convolutional layer within the embedding process, meaning that the transformer architecture is performing the main computation, achieving an accuracy of 98% while using only 3.76M parameters. This result is only slightly less than Vaswani et al.’s much larger models that include JFT-300M or ImageNet-21k pretraining and outperforms ViT-12/32, ViT-24/16, and ViT-24/32 when using ImageNet-1k pretraining and fine-tuning at 384 resolution (Table 5 of Vaswani et al.). We found that an increase in convolutions tended to harm model performance.

⁵ https://github.com/SHI-Labs/Compact-Transformers

Model               CIFAR-10  CIFAR-100  FashionMNIST  MNIST   # Params  FLOPs
Convolutional Networks (Designed for ImageNet)
ResNet18            90.27%    66.46%     94.78%        99.80%  11.18 M   0.04 G
ResNet34            90.51%    66.84%     94.78%        99.77%  21.29 M   0.08 G
ResNet50            91.63%    68.27%     94.99%        99.79%  23.53 M   0.08 G
MobileNetV2/0.5     84.78%    56.32%     93.93%        99.70%  0.70 M    < 0.01 G
MobileNetV2/1.0     89.07%    63.69%     94.85%        99.75%  2.24 M    0.01 G
MobileNetV2/1.25    90.60%    65.24%     95.05%        99.77%  3.47 M    0.01 G
MobileNetV2/2.0     91.02%    67.44%     95.26%        99.75%  8.72 M    0.02 G
Convolutional Networks (Designed for CIFAR)
ResNet56 [56]       94.63%    74.81%     95.25%        99.27%  0.85 M    0.13 G
ResNet110 [56]      95.08%    76.63%     95.32%        99.28%  1.73 M    0.26 G
ResNet164-v1 [57]   94.07%    74.84%     −             −       1.70 M    0.26 G
ResNet164-v2 [57]   94.54%    75.67%     −             −       1.70 M    0.26 G
ResNet1k-v1 [57]    92.39%    72.18%     −             −       10.33 M   1.55 G
ResNet1k-v2 [57]    95.08%    77.29%     −             −       10.33 M   1.55 G
ResNet1k-v2⋆ [57]   95.38%    −          −             −       10.33 M   1.55 G
Proxyless-G [12]    97.92%    −          −             −       5.7 M     −
Vision Transformers
ViT-12/16           83.04%    57.97%     93.61%        99.63%  85.63 M   0.43 G
ViT-Lite-7/16       78.45%    52.87%     93.24%        99.68%  3.89 M    0.02 G
ViT-Lite-6/16       78.12%    52.68%     93.09%        99.66%  3.36 M    0.02 G
ViT-Lite-7/8        89.10%    67.27%     94.49%        99.69%  3.74 M    0.06 G
ViT-Lite-6/8        88.29%    66.40%     94.36%        99.73%  3.22 M    0.06 G
ViT-Lite-7/4        93.57%    73.94%     95.16%        99.77%  3.72 M    0.26 G
ViT-Lite-6/4        93.08%    73.33%     95.14%        99.74%  3.19 M    0.22 G
Compact Vision Transformers
CVT-7/8             89.79%    70.11%     94.50%        99.70%  3.74 M    0.06 G
CVT-6/8             89.50%    68.80%     94.53%        99.74%  3.21 M    0.05 G
CVT-7/4             94.01%    76.49%     95.32%        99.76%  3.72 M    0.25 G
CVT-6/4             93.60%    74.23%     95.00%        99.75%  3.19 M    0.22 G
Compact Convolutional Transformers
CCT-2/3×2           89.75%    66.93%     94.08%        99.70%  0.28 M    0.04 G
CCT-4/3×2           91.97%    71.51%     94.74%        99.73%  0.48 M    0.05 G
CCT-6/3×2           94.43%    77.14%     95.34%        99.75%  3.33 M    0.25 G
CCT-7/3×2           95.04%    77.72%     95.16%        99.76%  3.85 M    0.29 G
CCT-6/3×1           95.70%    79.40%     95.41%        99.79%  3.23 M    1.02 G
CCT-7/3×1           96.53%    80.92%     95.56%        99.82%  3.76 M    1.19 G
CCT-7/3×1⋆          98.00%    82.72%     −             −       3.76 M    1.19 G

Table 3. Comparisons of various models when trained on small datasets. ⋆ was trained for longer, see Table 4 for additional details. Our 3.76M parameter CCT model is able to outperform both ResNets and ViTs across all datasets, with longer training only being necessary to outperform the 5.7M Proxyless-G model on CIFAR-10.

# Epochs   Pos. Emb.    CIFAR-10   CIFAR-100
300        Learnable    96.53%     80.92%
1500       Sinusoidal   97.48%     82.72%
5000       Sinusoidal   98.00%     82.87%

Table 4. Training of CCT-7/3×1 with an increased number of epochs.

These results show that our CCT based model is able to outperform both standard Vision Transformers and ResNet models. We demonstrate that neither large scale pretraining nor knowledge distillation is needed to overcome the biases found in smaller scale data. Furthermore, we strongly suspect that the underlying issue is due to the tokenization process of overlapping patches. We include a comparison of saliency maps [32, 137] in Figure 7, comparing visualizations on ImageNet. Saliency maps operate by visualizing the gradient accumulations across the network.
We should take care not to over-interpret the semantic meaning of these maps, but the visualizations do clearly indicate how the original patching may be recovered in the standard ViT model while we have a much smoother representation in CCT, evidencing the first research hypothesis.

Model          CLS  # Conv  Conv Size  Aug  Tuning  C-10    C-100   # Params  FLOPS
“Large” Models (≈ 85M Parameters)
ViT-12/16      CT   ✗       ✗          ✗    ✗       69.82%  40.57%  85.63 M   0.43 G
ViT-12/16      CT   ✗       ✗          ✓    ✓       80.72%  56.73%  85.63 M   0.43 G
CVT-12/16      SP   ✗       ✗          ✓    ✓       80.84%  58.05%  85.63 M   0.34 G
ViT-12/8       CT   ✗       ✗          ✓    ✓       90.24%  69.81%  85.20 M   1.45 G
ViT-12/4       CT   ✗       ✗          ✓    ✓       94.07%  76.08%  85.12 M   5.61 G
CCT-12/7×1     SP   1       7×7        ✓    ✓       93.72%  76.21%  85.20 M   5.55 G
CCT-12/3×2     SP   2       3×3        ✓    ✓       94.50%  77.05%  85.53 M   5.63 G
Small Models (≈ 4M Parameters)
ViT-Lite-7/16  CT   ✗       ✗          ✗    ✗       71.78%  41.59%  3.89 M    0.02 G
ViT-Lite-7/8   CT   ✗       ✗          ✗    ✗       83.38%  55.69%  3.74 M    0.06 G
ViT-Lite-7/4   CT   ✗       ✗          ✗    ✗       83.59%  58.43%  3.72 M    0.26 G
CVT-7/16       SP   ✗       ✗          ✗    ✗       72.26%  42.37%  3.89 M    0.02 G
CVT-7/8        SP   ✗       ✗          ✗    ✗       84.24%  55.49%  3.74 M    0.06 G
CVT-7/8        SP   ✗       ✗          ✓    ✗       87.15%  63.14%  3.74 M    0.06 G
CVT-7/4        SP   ✗       ✗          ✗    ✗       88.06%  62.06%  3.72 M    0.25 G
CVT-7/4        SP   ✗       ✗          ✓    ✗       91.72%  69.59%  3.72 M    0.25 G
CVT-7/4        SP   ✗       ✗          ✓    ✓       92.43%  73.01%  3.72 M    0.25 G
CVT-7/2        SP   ✗       ✗          ✗    ✗       84.80%  57.98%  3.76 M    1.18 G
CCT-7/7×1      SP   1       7×7        ✗    ✗       87.81%  62.83%  3.74 M    0.26 G
CCT-7/7×1      SP   1       7×7        ✓    ✗       91.85%  69.43%  3.74 M    0.26 G
CCT-7/7×1      CT   1       7×7        ✓    ✓       91.67%  72.07%  3.74 M    0.26 G
CCT-7/7×1      SP   1       7×7        ✓    ✓       92.29%  72.46%  3.74 M    0.26 G
CCT-7/3×2      CT   2       3×3        ✓    ✓       93.36%  74.77%  3.85 M    0.29 G
CCT-7/3×2      SP   2       3×3        ✓    ✓       93.65%  74.77%  3.85 M    0.29 G
CCT-7/3×1      SP   1       3×3        ✓    ✓       94.47%  75.59%  3.76 M    1.19 G

Table 5. Ablation study, transforming ViT into CCT. We measure CIFAR validation accuracy across each modification as well as the number of model parameters and computation (MACs). All ViT models use a class token (CT), while CVT and CCT use SeqPool (SP). We report the number of convolutions used during embedding (# Conv), its kernel size, if we utilized image augmentation (Aug), and tuning.

[Figure 7: image grid with columns labeled ImageNet, ViT, CCT, and NAT.]
Figure 7. Saliency maps of ViT, CCT, and NAT based on ImageNet-1k. It can be seen that CCT removes the blocking artifacts from ViT. CCT sometimes creates displacement, but this is resolved by NAT (presented in Chapter 4).

3.3.5 Ablations. We include ablations of our parameters to better understand the impact of our changes to the ViT model. In Table 5 we step through the process of converting our ViT model into CCT. In our table we denote whether we used a class token (CT) or SeqPool (SP), the number of convolutions used (overlapping patches), the kernel size, whether image augmentations were used, and additional tuning. Our tuning includes dropout, attention dropout, and stochastic depth. We separate our models into two sections: “Large” models, with approximately 85M parameters, and small models, with approximately 4M parameters.

By directly comparing similar ViT-Lite models to our CVT models we can see the effect of our SeqPool method. In all cases we see that there is a minor performance improvement due to this, with a much lower effect with the large 85M parameter models. When comparing on CIFAR-10, for models with 7 transformer encoders we observe a 0.7% increase with a patch size of 16, a 1.0% increase with a patch size of 8, and a 5.3% increase with a patch size of 4 (changes are relative to the baseline accuracy). For the larger 12 transformer layer models with a patch size of 16 we only notice a 0.1% increase, but these models included tuning and augmentation, likely reducing the impact. In the smaller models we see that the larger contribution to performance increase is due to decreased patch size. For ViT models, decreasing from a patch size of 16 to 8 increased model performance by 16.2%, but reducing to a patch size of 4 only accounted for an additional 0.3% increase.
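The percentages quoted in this comparison appear to be relative changes rather than differences in accuracy points. The short check below recomputes them from the CIFAR-10 column of Table 5 (rows without augmentation or tuning); the helper name `relative_change` is hypothetical and introduced only for this sketch.

```python
def relative_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return 100.0 * (new - old) / old


# CIFAR-10 accuracies (%) copied from Table 5, keyed by patch size.
vit = {16: 71.78, 8: 83.38, 4: 83.59}  # ViT-Lite-7/p (class token)
cvt = {16: 72.26, 8: 84.24, 4: 88.06}  # CVT-7/p (SeqPool)

# Effect of SeqPool at a fixed patch size.
seqpool_gain = {p: round(relative_change(vit[p], cvt[p]), 1) for p in (16, 8, 4)}

# Effect of shrinking the patch size: 16 -> 8, then 8 -> 4.
vit_patch_gain = [round(relative_change(vit[16], vit[8]), 1),
                  round(relative_change(vit[8], vit[4]), 1)]
cvt_patch_gain = [round(relative_change(cvt[16], cvt[8]), 1),
                  round(relative_change(cvt[8], cvt[4]), 1)]

print(seqpool_gain)    # SeqPool gains per patch size
print(vit_patch_gain)  # ViT gains from smaller patches
print(cvt_patch_gain)  # CVT gains from smaller patches
```

Under this reading, the recomputed values agree with the figures quoted in the text to one decimal place.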
For CVT the decrease to a patch size of 8 showed a similar 16.6% improvement, but further reduction to a patch size of 4 gave another 4.5% increase. Larger impacts can be observed when looking at CIFAR-100, except in the case of a patch size of 8, where SeqPool appears to have a slight negative (< 0.5%) impact. We see a +1.8%, −0.4%, and +6.2% difference for SeqPool across our 3 patch sizes. On ViT the patch reduction accounts for 33.9% and 4.9% improvements while CVT shows 31.0% and 11.8% improvements. With decreased patch sizing the transformer appears to be able to overcome the primary issues presented by smaller training sets. Our SeqPool method still demonstrates greater performance, especially as patch size decreases, showing greater network utilization.

The largest gains come from moving to CCT, which can also better take advantage of data augmentations, showing better capacity for generalization. For example, ViT-Lite-7/8, CVT-7/8, and CCT-7/7×1 all have 3.74M parameters, but their CIFAR-10 scores are 83.38%, 84.24%, and 87.81% respectively. Where CVT shows a 1% improvement, CCT shows 5.3%. When introducing augmentation, CVT-7/8 improves to 87.15% (+2.91%), while CCT-7/7×1 improves to 91.85% (+4.04%). We can also see in our CCT experiments that removing SeqPool and reintroducing the class token drops performance by 0.62%, demonstrating that SeqPool does not account for these differences. A similar pattern can be found with larger models, though our comparisons are not as thorough. These results show that the overlapping patches and better extraction of data from the transformer architecture result in significant improvements, evidencing our first two hypotheses.

We also include a short study on Positional Embedding, in Table 6. Because our overlapping tokenization allowed us to debias some of the positional relationships within the data, we test the importance of positional embedding.
While ViT and CVT benefit strongly from positional embedding, CCT only gets minor benefits. This further demonstrates the bias introduced by patching in ViT. Some additional positional embedding comparisons can be found in Figure 9.

Model                 PE          CIFAR-10          CIFAR-100
Conventional Vision Transformers are more dependent on Positional Embedding
ViT-12/16             Learnable   69.82% (+3.11%)   40.57% (+1.01%)
                      Sinusoidal  69.03% (+2.32%)   39.48% (−0.08%)
                      None        66.71% (baseline) 39.56% (baseline)
ViT-Lite-7/8          Learnable   83.38% (+7.25%)   55.69% (+7.15%)
                      Sinusoidal  80.86% (+4.73%)   53.50% (+4.96%)
                      None        76.13% (baseline) 48.54% (baseline)
CVT-7/8               Learnable   84.24% (+6.52%)   55.49% (+7.23%)
                      Sinusoidal  80.84% (+3.12%)   50.82% (+2.56%)
                      None        77.72% (baseline) 48.26% (baseline)
Compact Convolutional Transformers are less dependent on Positional Embedding
CCT-7/7               Learnable   82.03% (+0.21%)   63.01% (+3.24%)
                      Sinusoidal  81.15% (−0.67%)   60.40% (+0.63%)
                      None        81.82% (baseline) 59.77% (baseline)
CCT-7/3×2             Learnable   90.69% (+1.67%)   65.88% (+2.82%)
                      Sinusoidal  89.93% (+0.91%)   64.12% (+1.06%)
                      None        89.02% (baseline) 63.06% (baseline)
CCT-7/3×2†            Learnable   95.04% (+0.64%)   77.72% (+0.20%)
                      Sinusoidal  94.80% (+0.40%)   77.82% (+0.30%)
                      None        94.40% (baseline) 77.52% (baseline)
CCT-7/3×1†            Learnable   96.53% (+0.29%)   80.92% (+0.65%)
                      Sinusoidal  96.27% (+0.03%)   80.12% (−0.15%)
                      None        96.24% (baseline) 80.27% (baseline)
CCT-7/7×1-noSeqPool   Learnable   82.41% (+0.12%)   62.61% (+3.31%)
                      Sinusoidal  81.94% (−0.35%)   61.04% (+1.74%)
                      None        82.29% (baseline) 59.30% (baseline)
CCT-7/3×2-noSeqPool   Learnable   90.41% (+1.49%)   66.57% (+1.40%)
                      Sinusoidal  89.84% (+0.92%)   64.71% (−0.46%)
                      None        88.92% (baseline) 65.17% (baseline)

Table 6. Validation accuracy comparison of Positional Embedding methods. Augmentations and training techniques such as Mixup and CutMix were turned off for these experiments to better highlight differences. The numbers reported are best out of 4 runs with random initializations. † denotes model trained with extra augmentation and hyperparameter tuning.

3.3.6 Scaling Study. While the previous results demonstrate that pretraining is unnecessary for Vision Transformers to be effective on small datasets, we need to understand the relationship of model size, data quantity, and data quality. In order to address the Scale is All You Need arguments, we begin with the study of model size. Our main study of model size can be seen in our ablations (Table 5), where we observe that our larger 85.53M parameter model outperforms our 3.76M parameter model on both CIFAR-10 and CIFAR-100, showing very minor improvement on CIFAR-10 and a 2% increase on CIFAR-100. This result runs counter to ViT, where the larger model has a performance decrease of up to 16.5% and 30.5%, respectively. When given additional augmentation, the larger ViT model is only able to outperform our largest ViT-Lite model, ViT-Lite-7/16, which did not use tuning or augmented training. The two slightly smaller ViT-Lite models are still able to outperform this large model without the inclusion of additional augmentation or training, demonstrating that the smaller patch sizes play a more significant role, as discussed in Chapter 3.3.5. We believe that the smaller window sizes allow the transformer architecture to better integrate data across patches, learning convolutions similar to what Cordonnier et al. had shown, but further study is required to confirm or deny this. This relationship between patch size and performance applies to both large and small ViTs, with the large ViT approaching the performance of CCT (surpassing ViT-Lite models) once the patch size is reduced to 4, yet it still does not surpass the performance of small CCT models on CIFAR-10. Under most configurations, CVT also shows a decrease in performance, again with improved performance primarily being attributed to the patch size.
Performance decreases at a patch size of 2, similar to Cordonnier et al.’s configuration, showing that the patches can be too small. In a way, this demonstrates that scale plays an important role, but these trends run counter to the conventional wisdom. These results demonstrate the importance of the embedding process and that naïvely scaling architectures may instead hinder performance. Careful design of the neural architecture trumps scaling.

Model             Top-1    # Params   FLOPS    Epochs
ResNet50          77.15%   25.55 M    4.15 G   120
ResNet50 (2021)   79.80%   25.55 M    4.15 G   300
ViT-S             79.85%   22.05 M    4.61 G   300
CCT-14/7×2        80.67%   22.36 M    5.53 G   300
DeiT-S ⚗          81.16%   22.44 M    4.63 G   300
CCT-14/7×2 ⚗      81.34%   22.36 M    5.53 G   300

Table 7. ImageNet Top-1 validation accuracy comparison (no extra data or pretraining). ⚗ denotes models trained with distillation, following the knowledge distillation process described in Touvron et al. [149]. ResNet50 (2021) is reported from [161], which has the same training recipe as ours.

To study the relation of data to model performance we perform multiple scaling studies. In order to complete our parameter scaling study, we test our model’s performance on larger amounts of data, with ImageNet, but leave further large model and large data scaling studies to labs with resources similar to Vaswani et al. In Table 7 we train a 14 layer (22M param) model, and compare it to ViT-S and DeiT-S models from Touvron et al. [149]. It is difficult to get these models to be exactly the same parameter size, but our model is still able to outperform ViT on ImageNet-1k without any pretraining. We also compare to DeiT-S ⚗, where our model is slightly smaller, following the same knowledge distillation process. Our model again shows improvements, demonstrating that our procedure does not produce negative effects with increased data scale.
Model        Resolution   Pretraining   Top-1    # Params   FLOPs
CCT-14/7×2   224          -             97.19%   22.17 M    18.63 G
DeiT-B       384          ImageNet-1k   98.80%   86.25 M    55.68 G
ViT-L/16     384          JFT-300M      99.74%   304.71 M   191.30 G
ViT-H/14     384          JFT-300M      99.68%   661.00 M   504.00 G
CCT-14/7×2   384          ImageNet-1k   99.76%   22.17 M    18.63 G

Table 8. Flowers-102 Top-1 validation accuracy comparison. CCT outperforms other competitive models, having significantly fewer parameters and GFLOPs. This demonstrates the compactness on small datasets even with large images.

We also test our 22M parameter model on the Flowers-102 dataset, which is designed for high data variance and to test model generalizability. For this we are able to achieve an accuracy of over 97% without the use of any pretraining data or higher resolution tuning. These results can be found in Table 8. When using ImageNet-1k pretraining and including higher resolution tuning, following the procedure of DeiT, we are able to achieve state of the art results, outperforming models that included more than an order of magnitude more parameters and more than an order of magnitude more pretraining data.

It should be noted that we used the Flowers-102 dataset provided from Kaggle and that this uses a different data split than that which is included in the torchvision version⁶. This was brought to our attention through a GitHub issue,⁷ where a user was unable to replicate our results. We retrained our CCT-7/7×2 (4M params) and CCT-14/7×2 models at 224 resolution and obtained 68.26% and 68.85% accuracy, respectively. When applying the same procedure to ViT-S/16 we obtained a result of 48.63%, showing that our model still performs better when applied to this dataset.

⁶ The torchvision dataset collection did not include Flowers-102 when initially trained.
⁷ A wandb report showing training results can be found alongside the issue here: https://github.com/SHI-Labs/Compact-Transformers/issues/65

[Figure 8: line plot of accuracy (%) against percent of samples per class (%), from 10 to 100, for CCT-7/3x2, ViT-Lite-7/4, ResNet18, and MobileNet.]
Figure 8. Comparison of models with restricted number of samples per class. At 10% models are trained on only 5000 images. Transformer based models demonstrate better scalability than ResNet based models.

Moving on to further test the scalability of our model with respect to data, we study the performance with respect to the number of samples as well as the size of our images. In Figure 8 we restrict the number of samples in each class within the CIFAR-10 dataset. We compare the performance of CCT, ViT, ResNet18, and MobileNet when using only 10% of CIFAR-10 up to the full dataset. With only 10% of CIFAR-10, CCT is still able to achieve 77.7% accuracy, compared to ViT’s 67.9%. CCT is able to outperform the other models regardless of the data reduction. ViT shows worse performance with data scaling, only beating ResNet18 when including 70% or more of the data.

Additionally, we include a short study where we modify the image sizes of CIFAR-10 to understand the dependence on resolution, found in Figure 9. With smaller resolution images, models will likely be less able to rely upon local structures within the data, as they will be merged. When upscaling, we use a standard bicubic interpolation. In the first row of the graphs we train our models from scratch, allowing them to discover these associations. In the second row, we only run inference, testing our models’ capacity to generalize to novel resolutions. We also show comparisons without positional embedding, with Sinusoidal Positional Embedding, and with Learnable Positional Embedding. In our inference results, Learnable Positional Embedding models are unable to process larger resolution images than they were trained on, creating a significant limitation to this method. In all cases, except inference with Sinusoidal Positional Embedding, CCT is able to outperform ViT, further demonstrating data generalizability.

[Figure 9: two rows of three plots of accuracy (%) against image height and width (16, 24, 32, 48, 64) for ViT-Lite-7/4 and CCT-7/3x2. Panels: (a) No P.E., (b) Sinusoidal P.E., (c) Learnable P.E.]
Figure 9. Comparison of ViT-Lite and CCT accuracy on CIFAR-10 with differing image resolutions. In the first row, models are trained from scratch. In the second row, models are trained on 32 × 32 images and only run inference at the other resolutions. Fig. 9a is without positional embedding, Fig. 9b with sinusoidal positional embedding, and Fig. 9c with a learnable positional embedding. Inference with learnable positional embedding cannot be extended to larger images without modifying model parameters.

3.3.7 Natural Language Processing. Finally, we test our method on small natural language processing datasets. This network needs slight modification, incorporating GloVe [125] to provide word embeddings for our model. We do not train these embedding parameters, and we do not include GloVe, which has about 20M parameters, in our reported model sizes. To process the data we treat the text as single channel data, use an embedding dimension of 300, and use a convolution kernel of size 1. We also perform masking in the typical manner. By using CCT on these datasets we are able to achieve up to a 3% improvement when comparing to vanilla transformers.
Additionally, our CCT model is able to do this while using fewer parameters. Our best-performing CCT models have fewer than 1M parameters, making GloVe a significantly larger part of the network. We report a comparison of vanilla transformers, ViT, CVT, and CCT in Table 9.

Model           AGNews  TREC    SST     IMDb    DBpedia  # Params
Vanilla Transformer Encoders
Transformer-2   93.28%  90.40%  67.15%  86.01%  98.63%   1.086 M
Transformer-4   93.25%  92.54%  65.20%  85.98%  96.91%   2.171 M
Transformer-6   93.55%  92.78%  65.03%  85.87%  98.24%   4.337 M
Vision Transformers (ViT)
ViT-Lite-2/1    93.02%  90.32%  67.66%  87.69%  98.99%   0.238 M
ViT-Lite-2/2    92.20%  90.12%  64.44%  87.39%  98.88%   0.276 M
ViT-Lite-2/4    90.53%  90.00%  62.37%  86.17%  98.72%   0.353 M
ViT-Lite-4/1    93.48%  91.50%  66.81%  87.38%  99.04%   0.436 M
ViT-Lite-4/2    92.06%  90.42%  63.75%  87.00%  98.92%   0.474 M
ViT-Lite-4/4    90.93%  89.30%  60.83%  86.71%  98.81%   0.551 M
ViT-Lite-6/1    93.07%  91.92%  64.95%  87.58%  99.02%   3.237 M
ViT-Lite-6/2    92.56%  89.38%  62.78%  86.96%  98.89%   3.313 M
ViT-Lite-6/4    91.12%  90.36%  60.97%  86.42%  98.72%   3.467 M
Compact Vision Transformers (CVT)
CVT-2/1         93.24%  90.44%  67.88%  87.68%  98.98%   0.238 M
CVT-2/2         92.29%  89.96%  64.26%  86.99%  98.93%   0.276 M
CVT-2/4         91.10%  89.84%  62.22%  86.39%  98.75%   0.353 M
CVT-4/1         93.53%  92.58%  66.64%  87.27%  99.04%   0.436 M
CVT-4/2         92.35%  90.36%  63.90%  86.96%  98.93%   0.474 M
CVT-4/4         90.71%  90.14%  61.98%  86.77%  98.80%   0.551 M
CVT-6/1         93.38%  92.06%  65.94%  86.78%  99.02%   3.237 M
CVT-6/2         92.57%  91.14%  64.57%  86.61%  98.86%   3.313 M
CVT-6/4         91.35%  91.66%  61.63%  86.13%  98.76%   3.467 M
Compact Convolutional Transformers (CCT)
CCT-2/1x1       93.40%  90.86%  68.76%  88.95%  99.01%   0.238 M
CCT-2/2x1       93.38%  91.86%  67.19%  89.13%  99.04%   0.276 M
CCT-2/4x1       93.80%  91.42%  64.47%  88.92%  99.04%   0.353 M
CCT-4/1x1       93.49%  91.84%  68.21%  88.71%  99.03%   0.436 M
CCT-4/2x1       93.30%  93.54%  66.42%  88.94%  99.05%   0.474 M
CCT-4/4x1       93.09%  93.20%  66.57%  88.86%  99.02%   0.551 M
CCT-6/1x1       93.73%  91.22%  66.59%  88.81%  98.99%   3.237 M
CCT-6/2x1       93.29%  92.10%  65.02%  88.74%  99.02%   3.313 M
CCT-6/4x1       92.86%  92.96%  65.84%  88.68%  99.02%   3.467 M

Table 9. Top-1 validation accuracy on text classification datasets. The number of parameters does not include the word embedding layer, because we use pretrained word-embeddings and freeze those layers while training.

3.4 Conclusion
In this work we saw the importance of properly embedding information into our machine learning models. We need to ensure that this is done properly or we may severely limit our model’s capabilities. Even small, seemingly trivial differences can have tremendous effects on these models, making it important to take care when designing our neural architectures. If great care is not taken we will draw the wrong conclusions and hinder our own progress.

While pretraining can help with model performance, when working with very large datasets it becomes difficult to deduplicate data, and works have shown that, despite attempts to deduplicate, these datasets may still be reduced by upwards of 50% [1]. These duplications reduce model performance and generalizability, as they push the models to over-attend to certain semantics. While reducing the requisite dataset size doesn’t solve this problem, it certainly makes it a much more tractable problem. Given such results, it is difficult to distinguish whether large pretrained models are generalizing or simply memorizing data.

An important result of this work was the ability to achieve comparable performance while using orders of magnitude fewer parameters. While there are still a large number of parameters, having fewer decreases a model’s ability to overfit. Smaller models can also be used by more people, with fewer computational resources, and in more domains. Despite the rapid advancement of computational power, such small models are still critical tools for many areas of science, which may not have access to multiple GPUs or the ability to obtain large datasets.
While datasets like CIFAR-10 are considered to be small by machine learning standards, they are often orders of magnitude larger than datasets available within other research domains. This work makes transformer models available to these researchers.

CHAPTER IV
VARIADIC NEIGHBORHOOD ATTENTION

Random numbers should not be generated with a method chosen at random.
Donald Knuth

Nota Bene: This chapter is based on the previously published co-authored work Efficient Image Generation with Variadic Attention Heads [156], formerly released as StyleNAT: Giving Each Head a New Perspective. Additionally, this chapter involves content from Neighborhood Attention Transformer [52] (NAT) in order to facilitate the discussion of StyleNAT, but NAT is not the focus of this chapter.

– Steven Walton programmed the majority of the source code for StyleNAT and ran the majority of experiments. This includes creating all the research questions and designing all the necessary experiments to evidence them. His contributions also include all the visual analysis as well as the development of the attention maps to visualize restricted attention mechanisms. He was also the main writer of the paper. Steven also made significant contributions to the work of NAT, helping develop the theory (primarily around generalization), making contributions to the source code, providing advice, and helping write the paper.
– Ali Hassani developed the NATTEN CUDA kernel that was used in both StyleNAT and NAT. He provided important insights, especially with the rapidly changing NATTEN code, made contributions to the source code, helped perform experiments, and provided key insights for the development of the restricted attention visualization. Ali Hassani was also the primary author of the NAT paper, writing the majority of code, performing the majority of experiments, and was the largest contributor to the paper’s text.
– Xingqian Xu contributed advice and insights around the underlying StyleGAN architecture.
– Zhangyang Wang provided guidance during the research and feedback for the project.
– Jiachen Li provided feedback for the NAT design and contributed to the writing of the paper.
– Shen Li provided general design feedback for the NATTEN CUDA kernel and support for running the large scale experiments.
– Humphrey Shi was the advisor for both StyleNAT and NAT, contributing overall guidance on the research as well as funding for both works. Humphrey also contributed to the writing of the paper and to ensuring the research stayed on track.

While Chapter 3’s success with CCT demonstrated that ViTs could be significantly improved in terms of data and computational efficiency, it left the core neural architecture untouched. These impacts come from preparing the data for processing, but further improvements can be made by also improving the processing. Our ViT models still struggle with their O(n²) complexities, in both time and space, so making improvements to these layers can have significant impacts. Still, the work showed that transformers need neither big data nor big models to be successful. This motivates further work into improving these architectures themselves.

Transformers were born with language in mind, but had been adapted for vision. The computational challenges are particularly acute in Computer Vision due to the multi-dimensional data that must be processed, c × w × h, which frequently leads to out-of-memory (OOM) issues [174, 96, 171]. The de facto solution to this problem had been to use Convolutional Neural Networks (CNNs) [94, 93, 41]. This is because CNNs provide memory efficiency by operating only on a localized context window as well as naturally incorporating multi-dimensional spatial relationships. On the other hand, transformer networks attend over the entire data, allowing for arbitrary connections to be made.
As previously discussed (Chapter 3.1), transformers are capable of learning convolution filters, so it should be possible for them to be just as powerful. These benefits come at a cost of O(n²) in both computational and memory complexity, but our previous work demonstrated that smaller ViTs could outperform CNNs. This raises the question of whether ViTs can be better adapted to vision tasks. Are we able to achieve O(n) performance while also being able to incorporate both local and global structures within our data?

Figure 10. Samples from FFHQ-256 (left) with FID: 2.05, FFHQ-1024 (center) with FID: 4.17, and Church (right) with FID: 3.40 generated by our StyleNAT network, using Hydra Neighborhood Attention.

This chapter studies the core architecture of the network, by introducing Efficient Image Generation with Variadic Attention Heads [156], which allows the vision transformer to do more with less. The primary modification for this work is simple, yet powerful: allow attention heads to attend to independent receptive fields. Our results demonstrate that some simple modifications to our attention heads can allow our Vision Transformers to better integrate local and global relationships during image generation. The result of this is the ability to train a StyleGAN [79] based model, using a modified version of Neighborhood Attention [52, 50], which pushes the Pareto Frontier for image generation on FFHQ-256. Our model makes significant improvements in terms of visual fidelity while being smaller and having a higher throughput than other comparable models.

4.1 Localized Attention
In an effort to address the computational challenges of transformers, researchers looked to a number of different solutions. One such solution is to only perform attention on some localized region instead of the whole input.
This formulation is natural, as analysis of attention maps shows that there is strong correlation between neighboring tokens [90, 3, 152], or the presence of Attention Sinks [163]. Works like Image Transformer [123] and Stand Alone Self-Attention (SASA) [129] use localized context windows for their transformer algorithms, similar to the ideas proposed in Longformer [7]. These methods reduced the computational burden of attention mechanisms, approximating O(n) complexity, but had issues generalizing as the window size increased. Other works like HaloNet [151] and the Window Self-Attention (WSA) from Swin Transformer [109, 110] partitioned the query and context sets, independently performing self-attention. These blocks become highly parallelizable but do not account for cross-block interactions. Swin tried to address this issue by introducing shifted windows (SWSA), where subsequent attentions would shift their windows. With a hierarchical structure the network allows every pixel in an image to attend to every other, but incorporates biases around boundaries, similar to the issues faced in the non-overlapping blocks of ViT (Chapter 3).

4.2 Neighborhood Attention
[Figure 11: plot of ImageNet accuracy against GFLOPs for ConvNeXt (CVPR 2022; T, S, B), Swin Transformer (ICCV 2021; T, S, B), and Neighborhood Attention Transformer (NAT-M, T, S, B). Legend for model parameters: Mini ∼ 20M, Tiny ∼ 30M, Small ∼ 50M, Base ∼ 90M.]
Figure 11. Comparison of Neighborhood Attention, Swin, and ConvNeXt on ImageNet classification.

To resolve these issues, Hassani et al. developed the Neighborhood Attention Transformer (NAT) [52]. The architecture is similar to SASA but resolved the generalization issue, ensuring that when the window size is equal to the image size, Neighborhood Attention (NA) is identical to the traditional dot-product self-attention mechanism.
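To make the window behaviour concrete, the following is a minimal one-dimensional, single-head sketch of Neighborhood Attention in plain Python. It is not the NATTEN implementation: the function names `neighborhood` and `neighborhood_attention` are hypothetical, the positional bias is omitted, and the border handling (shifting the window inward so that every query keeps exactly `window` keys) is an assumption of this sketch. When the window equals the sequence length every query sees every key, which recovers ordinary dot-product self-attention, and the number of query-key pairs grows as n · window rather than n².

```python
import math


def neighborhood(i, n, window):
    """Key indices that query i attends to in a length-n sequence.

    Assumption of this sketch: the window is centred on i and shifted inward
    at the borders, so every query sees exactly `window` keys (window <= n).
    """
    start = min(max(i - window // 2, 0), n - window)
    return list(range(start, start + window))


def neighborhood_attention(q, k, v, window):
    """Single-head 1-D attention restricted to each query's neighborhood.

    q, k, v are lists of equal-length float vectors; no positional bias.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        idx = neighborhood(i, n, window)
        # Scaled dot-product scores against the neighboring keys only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in idx]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j][c] for w, j in zip(weights, idx))
                    for c in range(len(v[0]))])
    return out


def attended_pairs(n, window):
    """Number of query-key pairs evaluated: n * window instead of n * n."""
    return n * window
```

For example, `neighborhood(0, 8, 3)` selects keys 0 to 2 for the first token, while `neighborhood(3, 8, 8)` selects all eight keys, matching full self-attention.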
Like a convolution, NA considers a context window around each individual input query, Q. The keys, K, are then evaluated over the surrounding neighborhood (a square). If a (relative) positional bias [68, 128], B, is used then this must also be modified to account for the key location. Similarly, the value, V, must be updated to correspond wi