Smaller, Faster, Cheaper:
Architectural Designs for Efficient Machine Learning

by

Steven Walton

A dissertation accepted and approved in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

in Computer Science

Dissertation Committee:

Hank Childs, Chair

Humphrey Shi, Core Member

Daniel Lowd, Core Member

Thien Nguyen, Core Member

Edward Rubin, Institutional Representative

University of Oregon

Summer 2025


© 2025 Steven Walton
This work, including text and images of this document but not including

supplemental files (for example, not including software code and data), is licensed
under a Creative Commons

Attribution 4.0 International License.


http://creativecommons.org/licenses/by/4.0/


DISSERTATION ABSTRACT

Steven Walton

Doctor of Philosophy in Computer Science

Title: Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning

Major advancements in the capabilities of computer vision models have

been primarily fueled by rapid expansion of datasets, model parameters, and

computational budgets, leading to ever-increasing demands on computational

infrastructure. However, as these models are deployed in increasingly diverse and

resource-constrained environments, there is a pressing need for architectures that

can deliver high performance while requiring fewer computational resources.

This dissertation focuses on architectural principles through which models

can achieve increased performance while reducing their computational demands.

We discuss strides towards this goal through three directions. First, we focus on

data ingress and egress, investigating how information may be passed into and

retrieved from our core neural processing units. This ensures that our models

make the most of available data, allowing smaller architectures to become more

performant. Second, we investigate modifications to the core neural architecture,

applied to restricted attention in vision transformers. This section explores how

removing uniform context windows in restricted attention increases the expressivity

of the underlying neural architecture. Third, we explore the natural structures of

Normalizing Flows and how we can leverage these properties to better distill model

knowledge.



These contributions demonstrate that careful design of neural architectures

can increase the efficiency of machine learning algorithms, allowing them to become

smaller, faster, and cheaper.

This dissertation includes previously published and unpublished co-authored

material.



CURRICULUM VITAE

NAME OF AUTHOR: Steven Walton

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:

University of Oregon, Eugene, OR, USA
Embry-Riddle Aeronautical University, Prescott, AZ, USA

DEGREES AWARDED:

Doctor of Philosophy in Computer Science, 2025, University of Oregon
Master of Science in Computer Science, 2023, University of Oregon
Bachelor of Science in Space Physics, 2014, Embry-Riddle Aeronautical University

AREAS OF SPECIAL INTEREST:

Computer Vision
Machine Learning
Artificial Intelligence
Generative Modeling

PROFESSIONAL EXPERIENCE:

Graduate Researcher, University of Oregon, Eugene, OR, Aug. 2018 - Jun. 2025
Metropolis Intern, Nvidia, Sep. 2023 - Mar. 2024
Ph.D. Research Intern, Picsart AI Research, Eugene, OR, Jun. 2021 - Nov. 2022
Computer Science Intern, Lawrence Livermore National Laboratory, Livermore, CA, Jun. - Sept. 2020
Computer Science Intern, Lawrence Livermore National Laboratory, Livermore, CA, Jun. - Sept. 2019
ASTRO Intern, Oak Ridge National Laboratory, Oak Ridge, TN, Jun. - Aug. 2018



GRANTS, AWARDS AND HONORS:

Outstanding Reviewer, CVPR 2025

PUBLICATIONS:

Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, and Humphrey
Shi. Efficient image generation with variadic attention heads. In
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops, 2025

Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita
Orlov, and Humphrey Shi. Distilling normalizing flows. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, 2025.

Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun
Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar,
Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei
Hwu, Ming-Yu Liu, and Humphrey Shi. Generalized neighborhood
attention: Multi-dimensional sparse attention at the speed of light, 2025.
arXiv:2504.16922

Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta,
Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang,
Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal
Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye
Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson,
Aaditya Baranwal, Alexandru Coca, Mikah Dang, Sebastian Dziadzio,
Jakob D. Kunz, Kaiqu Liang, Alexander Lo, Brian Pulfer, Steven
Walton, Charig Yang, Kai Han, and Samuel Albanie. Zerobench: An
impossible visual benchmark for contemporary large multimodal models,
2025. arXiv:2502.09696

Noble Kennamer, Steven Walton, and Alexander Ihler. Design amortization
for bayesian optimal experimental design. Proceedings of the AAAI
Conference on Artificial Intelligence, 37(7):8220–8227, 2023.

Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi.
Neighborhood attention transformer. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages
6185–6194, 2023.



Steven Walton. Isomorphism, normalizing flows, and density
estimation: Preserving relationships between data, 2022.
https://www.cs.uoregon.edu/Reports/AREA-202307-Walton.pdf

Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li,
Steven Walton, and Humphrey Shi. Semask: Semantically masked
transformers for semantic segmentation. In Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV)
Workshops, pages 752–761, 2023.

Jiachen Li, Ali Hassani, Steven Walton, and Humphrey Shi. Convmlp:
Hierarchical convolutional mlps for vision. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 6307–6316, 2023.

David Pugmire, James Kress, Jieyang Chen, Hank Childs, Jong Choi,
Dmitry Ganyushin, Berk Geveci, Mark Kim, Scott Klasky, Xin Liang,
Jeremy Logan, Nicole Marsaglia, Kshitij Mehta, Norbert Podhorszki,
Caitlin Ross, Eric Suchyta, Nick Thompson, Steven Walton, Lipeng
Wan, Matthew Wolf, Jeffrey Nichols, Becky Verastegui, Arthur ‘Barney’
Maccabe, Oscar Hernandez, Suzanne Parete-Koon, and Theresa Ahearn.
Visualization as a service for scientific data. In Driving Scientific
and Engineering Discoveries Through the Convergence of HPC, Big Data
and AI, pages 157–174, Cham, 2020. Springer International
Publishing.

Steven Walton, Ali Hassani, Abulikemu Abuduweili, and Humphrey Shi.
Training compact transformers from scratch in 30 minutes with pytorch.
medium.com/pytorch, 2021. arXiv:2104.05704

Ali Hassani, Steven Walton, Nikhil Shah, Abulikemu Abuduweili, Jiachen
Li, and Humphrey Shi. Escaping the big data paradigm with compact
transformers, 2022.

Steven Walton. Datum: Dotted attention temporal upscaling method. 2020.
https://www.cs.uoregon.edu/Reports/DRP-202006-Walton.pdf



ACKNOWLEDGEMENTS

I’d like to thank my mentors and professors from my Universities for helping

get me to where I am today. Thank you Jeff Spear, for being the first to show me

how to be creative with math. To Karla Westphal, for helping me find passion and

dedication to the subject. To my undergraduate professors: Timothy Callahan,

Andri Gretarsson, Edward Poon, Hisaya Tsutsui, and Darrel Smith, who fostered

my passion for math and physics and gave me the tools to understand the world

around me. To my graduate professors and advisors, who helped get me through

these difficult times. I especially want to thank Hank Childs for encouraging me to

pursue Machine Learning and for being my acting advisor after Humphrey moved to

Georgia Tech. I want to thank Humphrey Shi for being my advisor and helping me

make all the connections and pushing me to become a better researcher.

I’d like to thank my friends and family for helping me get through this. It was a

journey that I could not have made alone. Noble, you’ve been a close friend for so

many years and your insights helped shape my research and encouraged me to go

to graduate school. You constantly challenge my ideas, often frustratingly so, but

they always end up better and more refined for it. Never change. Ali, I couldn’t

ask for a better co-author nor friend. Your intelligence and work ethic have always

pushed me to better myself, and I look forward to calling you “doctor”. I want

to thank my cat Hypatia, who has been my best friend for the last decade. She’s

had to listen to many explanations and I’m sorry she has not received formal

recognition for her contributions despite frequent appearances in my works

(including this one). Lastly, I want to thank my wonderful girlfriend: Jaichung Lee.

We have been through so much and I could not have crossed the finish line without



you. I know it was as much of a challenge for you as it was for me, and this PhD

would not have been possible without your many efforts. Thank you.



DEDICATED TO

My mom, and the many years of watching Star Trek together.

My dad, and the many years of reading Asimov together.

Jaichung, and the many years to build the future together.



TABLE OF CONTENTS

Chapter Page

I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . 18

1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.2. Research Goals and Approaches . . . . . . . . . . . . . . . . . 19

1.3. Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . 21

1.4. Co-Authored Material . . . . . . . . . . . . . . . . . . . . . 22

II. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.1. Learned Data Mappings . . . . . . . . . . . . . . . . . . . . 24

2.2. Scale Is Not All You Need . . . . . . . . . . . . . . . . . . . 27

2.2.1. Scaling Data . . . . . . . . . . . . . . . . . . . . . . 28

2.2.2. Model Size . . . . . . . . . . . . . . . . . . . . . . . 30

2.3. The Foundations That Shape Us . . . . . . . . . . . . . . . . 31

2.3.1. Transformers . . . . . . . . . . . . . . . . . . . . . . 31

2.3.2. Adversarial Generation . . . . . . . . . . . . . . . . . 33

2.3.3. Normalizing Flows . . . . . . . . . . . . . . . . . . . 34

2.4. The Tyranny of Measurements . . . . . . . . . . . . . . . . . 37

III. ESCAPING THE BIG DATA PARADIGM . . . . . . . . . . . . . 39

3.1. Vision Transformers . . . . . . . . . . . . . . . . . . . . . . 40

3.2. Data Efficient Vision Transformers . . . . . . . . . . . . . . . 43

3.2.1. Convolutional Tokenizer . . . . . . . . . . . . . . . . . 43

3.2.2. SeqPool . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3.1. Datasets . . . . . . . . . . . . . . . . . . . . . . . . 47


3.3.2. Computational Resources . . . . . . . . . . . . . . . . 47

3.3.3. Hyperparameters . . . . . . . . . . . . . . . . . . . . 47

3.3.4. Transformers On Small Datasets . . . . . . . . . . . . . 48

3.3.5. Ablations . . . . . . . . . . . . . . . . . . . . . . . 51

3.3.6. Scaling Study . . . . . . . . . . . . . . . . . . . . . 54

3.3.7. Natural Language Processing . . . . . . . . . . . . . . . 61

3.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 61

IV. VARIADIC NEIGHBORHOOD ATTENTION . . . . . . . . . . . . 64

4.1. Localized Attention . . . . . . . . . . . . . . . . . . . . . . 67

4.2. Neighborhood Attention . . . . . . . . . . . . . . . . . . . . 68

4.3. Variadic Attention Heads . . . . . . . . . . . . . . . . . . . . 70

4.4. Generating The Right Experiment . . . . . . . . . . . . . . . . 72

4.4.1. Datasets . . . . . . . . . . . . . . . . . . . . . . . . 77

4.4.2. Hyperparameters . . . . . . . . . . . . . . . . . . . . 78

4.5. When Faced With Sparse Attention . . . . . . . . . . . . . . . 80

4.6. A Bump While Headed To Church . . . . . . . . . . . . . . . . 83

4.7. Metrics Are Not Enough . . . . . . . . . . . . . . . . . . . . 85

4.7.1. The Face Says It All . . . . . . . . . . . . . . . . . . . 86

4.7.2. Quick Training on Deep Fake Detection . . . . . . . . . . 87

4.7.3. Fingerprints . . . . . . . . . . . . . . . . . . . . . . 89

4.7.3.1. StyleGAN . . . . . . . . . . . . . . . . . . . 89

4.7.3.2. StyleSwin . . . . . . . . . . . . . . . . . . . 91

4.7.3.3. StyleNAT . . . . . . . . . . . . . . . . . . . 93

4.7.4. Attention To Details . . . . . . . . . . . . . . . . . . 95


V. DISTILLATION OF INVERTIBLE NETWORKS . . . . . . . . . . 98

5.1. Model Distillation . . . . . . . . . . . . . . . . . . . . . . . 99

5.2. Distilling Normalizing Flows . . . . . . . . . . . . . . . . . . 100

5.2.1. Categories of Flow Distillations . . . . . . . . . . . . . . 101

5.2.1.1. Latent Knowledge Distillation . . . . . . . . . . 101

5.2.1.2. Intermediate Latent Knowledge Distillation . . . . 102

5.2.1.3. Synthesized Knowledge Distillation . . . . . . . . 102

5.2.1.4. All Together . . . . . . . . . . . . . . . . . . 104

5.3. Distillation Experiments . . . . . . . . . . . . . . . . . . . . 105

5.3.1. Density Estimation . . . . . . . . . . . . . . . . . . . 106

5.3.2. Image Generation . . . . . . . . . . . . . . . . . . . . 108

5.4. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 111

VI. CONCLUSION AND FUTURE DIRECTIONS . . . . . . . . . . . 113

6.1. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.2. Future Directions . . . . . . . . . . . . . . . . . . . . . . . 114

6.2.1. Core Challenges . . . . . . . . . . . . . . . . . . . . 115

6.2.2. Scaling . . . . . . . . . . . . . . . . . . . . . . . . 116

6.2.3. Ingress and Egress of Data . . . . . . . . . . . . . . . . 117

6.2.3.1. Parameterization . . . . . . . . . . . . . . . . 117

6.2.3.2. Automated Preprocessing . . . . . . . . . . . . 118

6.2.3.3. Making The Most of it . . . . . . . . . . . . . 118

6.2.4. Core Processing Architectures . . . . . . . . . . . . . . 119

6.2.4.1. Flexible Learning . . . . . . . . . . . . . . . . 120

6.2.4.2. Is Beauty in the Eye of the Beholder? . . . . . . . 120

6.2.5. Structurally Aware Architectures . . . . . . . . . . . . . 121


6.3. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 122



LIST OF FIGURES

Figure Page

1. Domain, Range, Image, Preimage diagram . . . . . . . . . . . . . . 26

2. Vision Transformer Architecture . . . . . . . . . . . . . . . . . . 32

3. Taxonomy of Generative Models . . . . . . . . . . . . . . . . . . 33

4. Diagram of Injection, Surjection, and Bijection . . . . . . . . . . . . 36

5. Architectural design of Compact Transformers . . . . . . . . . . . . 41

6. Variations of ViT Architectures . . . . . . . . . . . . . . . . . . 44

7. Vision Transformer Salient Maps . . . . . . . . . . . . . . . . . . 52

8. Accuracy of ViTs on Restricted Samples per Class . . . . . . . . . . 59

9. Vision Transformer Resolution Based Performance . . . . . . . . . . 60

10. StyleNAT Samples (FFHQ-256, FFHQ-1024, LSUN Church) . . . . . . 66

11. NAT vs Swin Vs ConvNext ImageNet Performance . . . . . . . . . . 68

12. Neighborhood Attention Transformer (NAT) . . . . . . . . . . . . . 70

13. StyleNAT Architecture . . . . . . . . . . . . . . . . . . . . . . 72

14. StyleNAT: FID vs. Throughput vs. Parameters . . . . . . . . . . . 74

15. StyleNAT Samples: FFHQ & LSUN Church . . . . . . . . . . . . . 81

16. StyleNAT FID vs Iteration . . . . . . . . . . . . . . . . . . . . 82

17. StyleGAN3 Visual Artifacts . . . . . . . . . . . . . . . . . . . . 90

18. StyleSwin Visual Artifacts . . . . . . . . . . . . . . . . . . . . . 92

19. StyleNAT Visual Artifacts . . . . . . . . . . . . . . . . . . . . . 93

20. StyleNAT and StyleSwin Attention Maps . . . . . . . . . . . . . . 96

21. StyleNAT and StyleSwin Attention Maps . . . . . . . . . . . . . . 97

22. Illustration of Knowledge Transfer for Normalizing Flows . . . . . . . 103


23. Distilling Normalizing Flow CIFAR-10 Samples . . . . . . . . . . . 110

24. Distilling Normalizing Flow CelebA Samples . . . . . . . . . . . . . 111



LIST OF TABLES

Table Page

1. Terminology of Mathematical Sets . . . . . . . . . . . . . . . . . 25

2. CCT Hyperparameters . . . . . . . . . . . . . . . . . . . . . . 48

3. CCT Main Results . . . . . . . . . . . . . . . . . . . . . . . . 49

4. Extended CCT Training . . . . . . . . . . . . . . . . . . . . . . 50

5. CCT Ablation Study . . . . . . . . . . . . . . . . . . . . . . . 51

6. CCT Positional Embedding Comparison . . . . . . . . . . . . . . . 55

7. CCT ImageNet Accuracy . . . . . . . . . . . . . . . . . . . . . 57

8. CCT Flowers-102 Accuracy . . . . . . . . . . . . . . . . . . . . 58

9. CCT Text Classification . . . . . . . . . . . . . . . . . . . . . . 62

10. StyleNAT Configurations . . . . . . . . . . . . . . . . . . . . . 78

11. Comparison of Generative Models . . . . . . . . . . . . . . . . . . 80

12. StyleNAT Ablations . . . . . . . . . . . . . . . . . . . . . . . . 83

13. GLOW and MAF Model Configurations . . . . . . . . . . . . . . . 106

14. Distilling Normalizing Flow Density Estimation Metrics . . . . . . . . 107

15. Distilling Normalizing Flow Performance Metrics . . . . . . . . . . . 108

16. Distilling Normalizing Flow Model Configurations for Image Generation . 108

17. Distilling Normalizing Flow Image Generation Metrics . . . . . . . . . 109

18. Distilling Normalizing Flow CelebA FID . . . . . . . . . . . . . . . 111



CHAPTER I

INTRODUCTION

I don’t believe in empirical science.

I only believe in a priori truth.

Kurt Gödel

1.1 Motivation

This thesis focuses on the efficient training of machine

learning algorithms, primarily applied to Computer Vision. Our focus is to develop

methods which allow for a reduction in computational resources required to train

and deploy models.

Machine Learning is a subfield of Artificial Intelligence which aims to

process data and automate the discovery of structures within the data. This

process reduces the burden of needing to derive explicit formulations, instead

allowing automation through optimization. This process allows algorithms to

“learn” by “training” on the data.

Computer Vision applies to a wide range of problems related to perception.

Traditionally associated with image and video processing, the field extends

to processing of other data, such as LIDAR, radio, depth estimation, and

other forms of signal processing. The domain involves a broad range of tasks,

including: regression, which models quantitative relationships between variables;

discrimination, the process of distinguishing relevant objects or patterns; and

generation, or data synthesis. The primary focus of this thesis revolves around

discrimination and generation of images.

Image processing presents unique challenges, often due to the high

dimensionality of the embeddings. This high dimensionality causes difficulties in



formulating explicit descriptions of our data and the underlying structures within

it. The goal of computer vision is to create the machinery necessary to automate

this process for us, as efficiently as possible. While we may not be able to

fully formulate such descriptions, the descriptions we provide our algorithms can both

help and hinder them. For example, images usually have spatial relationships,

with pixels that are spatially close having a high probability of being related to

one another. This has led to the use of Convolutional Neural Networks, as their

architecture is able to exploit this natural bias. But such relationships may not

always hold. For example, a QR code contains sharp transitions, where neighboring

pixels do not aid the prediction of one another. More flexible architectures, such as

attention, can better process such imagery by reducing the importance of locality.

Therefore, to efficiently process data we must consider the biases implicit to the

neural architectures that we use.
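The contrast between locally correlated images and QR-like patterns can be illustrated with a small sketch. The example below is not taken from this thesis. It assumes two hypothetical 32×32 binary images: `smooth`, with two large uniform regions, and `noisy`, where each pixel is an independent coin flip standing in for one QR module. The helper `neighbor_agreement` is likewise an illustrative name. It measures how often horizontally adjacent pixels share a value.

```python
import random

# Hypothetical toy images, not data from the thesis: each is a 32x32 grid of 0/1 pixels.
SIZE = 32


def neighbor_agreement(img):
    """Return the fraction of horizontally adjacent pixel pairs with equal values."""
    agree = 0
    total = 0
    for row in img:
        for left, right in zip(row, row[1:]):
            agree += left == right
            total += 1
    return agree / total


random.seed(0)  # fixed seed so the sketch is repeatable

# "Natural-like" image: two large uniform regions, so neighbors usually match.
smooth = [[0 if col < SIZE // 2 else 1 for col in range(SIZE)] for _ in range(SIZE)]

# "QR-like" image: every pixel is an independent coin flip (one pixel per module).
noisy = [[random.randint(0, 1) for _ in range(SIZE)] for _ in range(SIZE)]

print(f"smooth image:  {neighbor_agreement(smooth):.2f}")
print(f"QR-like image: {neighbor_agreement(noisy):.2f}")
```

Under these assumptions the smooth image scores close to 1, while the coin-flip image scores near 0.5, meaning a neighbor carries essentially no information about a pixel. This is the sense in which a locality bias helps on the former and offers little on the latter.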

The modern success of these algorithms has presented additional challenges.

It has been found that many of these methods can be improved through simple

means: making them larger and providing them with more training data [142].

While this has led to dramatic improvements, it has similarly led to dramatic

increases in the computational resources necessary to train and deploy these

models. Once trained, these models may still be quite difficult to deploy, as their

high computational demands greatly limit where they can be used. This has

led many researchers to consider how these models can be more efficiently trained,

requiring: less data, less time to train, and fewer computational resources. Similar

challenges exist with respect to the deployment of these models.

1.2 Research Goals and Approaches

The focus of this thesis revolves around two primary questions:



– How do we reduce the model’s data dependence?

– How do we reduce the model’s computational demands?

These questions are fundamentally intertwined, necessitating solutions which

address the problems simultaneously. Naturally, reducing the amount of data

that a model must ingest reduces the amount of time that the model must be trained

for. Conversely, the more efficiently a model extracts information from its

data, the less data it will need to achieve a given performance level. This is because

a model’s parameters do not just determine its information capacity, but also play an

integral role in shaping the solution space during training [133]. Many works have found

that, once trained, models are often significantly over-parametrized, meaning only

a subset of their parameters are being used to model the data [33, 97]. These

findings are further evidenced by the continued increasing performance of smaller

models [66], and strongly suggest our models can be trained more efficiently.

Our motivation to reduce a model’s data dependence extends beyond our desire

to be cost effective. Large real-world datasets present two primary challenges

which require our models to be data efficient. First, many important structures

within the data are subtle and difficult to recover. Second, data is often heavy-tailed,

meaning we have few samples of the many rare cases in the tail. Fundamentally, these require our

models to generalize relationships from minimal examples. While we may focus on

explicitly constrained data to aid the interpretation of our work, data efficiency continues to provide benefits

as our models and data expand in size.

These feats are primarily accomplished through the development

of neural architectures and optimization methods. This thesis focuses on the

former, specifically, studying the design of Computer Vision architectures which

reduce: parameters, data dependence, and system resources. These goals must



be simultaneously optimized. Our objective is not to develop models with a small

number of parameters if they also require substantially greater costs during training

or deployment. Similarly, we would undermine our own goals if we reduced a

model’s data dependence at significant cost to its performance.

This thesis investigates three critical aspects of our neural architectures and

is structured to follow a natural progression in complexity. The first work focuses

on understanding how our core neural architecture takes in data and how

to efficiently extract the relationships it uncovers. Without efficiently providing

data to and extracting data from a model, the model becomes wasteful, which hinders the

ability to develop more efficient core architectures. The second work focuses on the

core architecture, which performs the majority of the data processing. These two works

study these aspects as applied to vision transformers, building directly on

one another. The third work revolves around knowledge distillation of Normalizing

Flows. These models are structurally aware, explicitly designed to preserve the

structures within the data. Through these three lenses, this thesis seeks to better

understand how to build neural architectures that are smaller, faster, and cheaper.

1.3 Dissertation Outline

This dissertation is organized as follows:

Chapter 2 provides the background and foundational information

needed to understand the research objectives. This background is necessary

for understanding how the works are connected and the ways we seek to resolve

underlying issues.

Chapter 3 presents the work Escaping the Big Data Paradigm with Compact

Transformers [51], and focuses on efficiently embedding and extracting data from

Vision Transformers.



Chapter 4 presents the work Efficient Image Generation with Variadic

Attention Heads [156], as well as the work it builds upon: Neighborhood Attention

Transformer [52].

Chapter 5 presents the work Distilling Normalizing Flows, which provides

a framework for knowledge distillation with Normalizing Flow architectures and

studies the categories of distillation methods.

Chapter 6 provides an overview of the findings and recommendations for

future work.

1.4 Co-Authored Material

The research presented herein involves previously published material. Below

is a listing of the prior works in relation to the chapter material. Details of the division

of labor are provided in the preface to each chapter.

– Chapter 2: This chapter includes material that was part of Steven Walton’s

Area Exam [154].

– Chapter 3: This chapter contains materials from Escaping the Big Data

Paradigm with Compact Transformers [51]. This work was a collaboration

with Ali Hassani, Nikhil Shah, Abulikemu Abuduweili, Jiachen Li, Humphrey

Shi, and myself.

– Chapter 4: This chapter contains materials from both Neighborhood Attention

Transformer [52] and Efficient Image Generation with Variadic Attention

Heads [156], with a focus on the latter. The former was a collaboration

between Ali Hassani, Jiachen Li, Shen Li, Humphrey Shi, and myself. The

latter was a collaboration between Ali Hassani, Xingqian Xu, Zhangyang

Wang, Humphrey Shi, and myself.



– Chapter 5: This chapter contains material from a collaboration between

Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey

Shi, and myself.



CHAPTER II

BACKGROUND

Mathematicians do not deal in

objects, but in the relationships

among objects.

Henri Poincaré

Nota Bene: Some of the text and figures from this section were part

of Steven Walton’s Area Exam [154], which has been publicly released by The

University of Oregon. Steven was the sole author of this work.

This section covers the background necessary for understanding the

motivation and purpose of the work performed. It includes some necessary

discussion about how machine learning algorithms work, how data is processed, and

the inherent biases of different learning architectures, the last of which is the

main focus of this thesis. While subsequent chapters will have less mathematical

notation and formulation, those herein provide important context and intuition for

the work ahead. To reach our goal of making our machine learning models smaller,

faster, and cheaper, we need to have some core understandings as to how these

models work. It is not enough to treat them as black boxes; rather we have to look

inside. Much of machine learning terminology has not been standardized, so this

section also serves to contextualize the terminology and its usage within this

thesis.

2.1 Learned Data Mappings

The learning procedure can be understood as a mapping between two sets, where

our neural network is a learned mapping, f(x). Deep neural networks are Universal

Approximators [18, 67, 112], where every multivariate continuous function can,

in principle, be approximated by the superposition of a sequence of continuous

functions.

Name       Relation      Set
Domain     D             {∀ x ∈ D}
Codomain   C             {∀ y ∈ C}
Range      R ⊆ C         {y ∈ C | ∃ x ∈ D : f(x) = y}
Image      R̃ ⊆ R         {y ∈ C | ∃ x ∈ D̃ : f(x) = y}
Preimage   f*[R̃] ⊆ D     {x ∈ D | ∃ y ∈ R̃ : f(x) = y}

Table 1. Explanation of important set terms, denoting their relationships and which
elements are in each set.

With this in mind, it helps to revisit some of the basics of functions and

set theory. We can view the Domain, D, as all valid inputs to the neural network.

In the study of Computer Vision this is any valid image, regardless of whether

this image is meaningful to humans or not. We can then define our Range, R,

as all possible outputs that our neural net can produce. In the case of Image

Classification this would be all labels that we are trying to learn. Our Codomain,

C, is a superset of our Range, R ⊆ C, and may include elements that our

map cannot reach. In our example of Image Classification our Codomain would

represent all possible labels.

In practice, we are likely only interested in studying some subset of our

domain, D̃ ⊆ D. This subset can be arbitrary and may be something like our

training set, the set of images interesting to humans, or even some subset of our

training data. Regardless of what this subset is, when its elements are passed through our

mapping function we call the outputs an Image, R̃ ⊆ R. A “reverse” of this

function may then be defined, called the Preimage, f*[R̃]. The Preimage is defined

as the set of elements in the domain that map to some element of the image. It

is important to note that the Preimage is not the inverse of the image: since distinct

inputs may share an output, the Preimage can contain elements that were not in D̃. Many texts

use the notation f⁻¹, but we will use f* to avoid confusion.¹ Table 1 and Figure 1

are included to help explain these concepts.

[Figure 1: diagram showing the Domain, containing the Subset D̃ and the Preimage {x ∈ D | f(x) ∈ R̃}, mapped to the Codomain, containing the Range {f(x) | x ∈ D} and the Image {f(x) | x ∈ D̃}.]

Figure 1. A diagram illustrating concepts from Set Theory, explaining the
Domain (D), Codomain, Range (R), Image (R̃), and Preimage.
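These set relationships can be made concrete with a small finite example. The sketch below is illustrative only: the function `f`, the sets `domain`, `codomain`, and `subset`, and all values are hypothetical stand-ins chosen for this example, with `f` playing the role of a trained classifier.

```python
# Toy illustration of Domain, Codomain, Range, Image, and Preimage.
# The mapping f and every set below are hypothetical, chosen only for illustration.


def f(x: int) -> int:
    """A toy mapping standing in for a trained network: x -> x mod 3."""
    return x % 3


domain = set(range(10))      # D: every valid input
codomain = {0, 1, 2, 3, 4}   # C: every label we could name (3 and 4 are never produced)

# Range R: every output the map can actually reach from the whole domain.
range_ = {f(x) for x in domain}

# Image of a subset of the domain: the outputs produced by that subset only.
subset = {0, 3, 6}
image = {f(x) for x in subset}

# Preimage of the image: every element of the domain that maps into the image.
preimage = {x for x in domain if f(x) in image}

print("Range:   ", sorted(range_))    # [0, 1, 2]
print("Image:   ", sorted(image))     # [0]
print("Preimage:", sorted(preimage))  # [0, 3, 6, 9]
```

In this toy case the Preimage contains 9 even though 9 was not in the chosen subset, which shows that taking the Preimage of an Image does not, in general, recover the original subset.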

We will define a Target, T , as the set of data we intend to model.

Unfortunately, this distribution may be unobtainable and is often intractable. That

is, we are unable to provide a formal description of the distribution. An example,

which we will use in Chapter 4 and Chapter 5, is “the set of all possible human

faces.” We do not have a proper mathematical description of this set, making it

intractable, nor is it possible for us to completely sample from this set, as it would

require infinite time.² We instead collect a set of sample data Ω ⊆ T, which may

be used to train the model (e.g. FFHQ [79] or CelebA [108]). It is important to

note that we may not know how well Ω approximates T, especially when T

is intractable. Our model processes data from Ω to generate output, O.

When performing Classification/Discrimination tasks, our output may be a label (or a

list of labels), but in generative tasks we instead seek an approximation of the target

distribution, T̃. We should keep this model in mind when evaluating our work, so

we can best understand what our models can and cannot do. Our data are discrete

and sampled from the distributions we are trying to approximate, and great care

must be taken to determine what is and is not in our distribution.

¹ This notation is used for a pullback, which is a nearly identical concept.

² This set would include all faces that were and all faces that will be.

2.2 Scale Is Not All You Need

In March of 2019 Richard Sutton wrote a short article titled The Bitter

Lesson [142]. This article had a large impact on the machine learning community.

Sutton makes the argument that methods based predominantly on leveraging

human knowledge are ill-founded and that our historical progress has shown us

that focusing on search has resulted in success. Sutton acknowledges the benefits

of leveraging human knowledge as well as how in practice this can often be

constraining, preventing our machines from leveraging more general computation.

Either through misinterpretation by Sutton or by readers, a popular belief

arose in the community: “Scale Is All You Need”. This notion needs to be

addressed, for if the belief is taken at face value then the only work left to do is

scaling compute and gathering data. Some interpret the claim to mean that scaling

is sufficient, even if more efficient methods may exist, but we will show

that scaling alone is insufficient. We do not dispute that scale is a necessary and

essential component; we dispute only that it alone is sufficient to both explain recent progress

and provide direction for further advancement. These claims let critical

conditions remain implicit, assuming shared assumptions among readers. These



subtle details are consequential to generating efficient machine learning models,

as understanding what data increases performance allows us to also better design

algorithms to maximally incorporate information.

Two aspects of scaling must be addressed: that of scaling data

(Chapter 2.2.1) and that of scaling compute (Chapter 2.2.2).

2.2.1 Scaling Data. Undeniably, one of the reasons for major

advances has been the scaling of data. There is a simple argument that may

suggest scaling data will be sufficient. We need to examine this argument to understand where

it holds and where it does not.

Our goal in machine learning is to learn some distribution, which we will call

our Target Distribution, T . If we uniformly and randomly sample from our target

distribution, one can conclude that with scale we will also increase our covering

of the distribution. We may view this another way: if we select some arbitrary

point in our target distribution, as we continue to sample then the distance

between it and some data point, di, in our set of sampled points will decrease.

$\exists\, \varepsilon \in \mathbb{R}$ s.t. $\|d_i - d_j\|_p^p < \varepsilon \mid \forall d_i, d_j \in T$, where $\|\cdot\|_p^p$ represents an arbitrary

$L_p$ distance.

We can refine this more generally, which will better help us as we increase

complexity. We can partition our distribution T into disjoint continuous partitions

$\{P_0, \ldots, P_n\}$. That is: $P_i \cap P_j = \emptyset \;\; \forall i \neq j$ and $\bigcup_i P_i = T$. We can reach a similar

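natural conclusion: as the number of samples $N$ increases, the probability that there

does not exist a sample belonging to partition $P_i$ goes to zero: $\lim_{N \to \infty} \Pr(\nexists\, s \in S_N : s \in P_i) = 0$, where $S_N$ denotes the set of $N$ samples drawn so far.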
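Under the i.i.d. assumption this limit can be made concrete. As a short worked step (the symbols $p_i$ and $N$ are introduced here for illustration and are not part of the original argument): if each sample independently lands in partition $P_i$ with probability $p_i > 0$, then the probability that none of $N$ samples lands in $P_i$ is

```latex
\Pr(\text{no sample in } P_i \text{ after } N \text{ draws}) = (1 - p_i)^N \xrightarrow{\; N \to \infty \;} 0 .
```

The rate depends on $p_i$: a partition with $p_i = 0.01$ still has roughly a 37% chance of being empty after 100 draws, since $(0.99)^{100} \approx 0.366$.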

This generalization helps us in two ways. Our partitions can be of

arbitrary size and shape, allowing us to use them as abstractions, such as semantic



representations.3 A semantic representation may represent categories of our

data. For example, if our model is generating human faces we may consider hair

color as a semantic representation. This formulation can also be repeated for each

partition, which allows us to extend the notion to a more realistic setting where

data is discrete (i.e. discretization).

While this logic may be natural, it relies on assumptions that are not true

in practice. Notably, it assumes both that the data are independent and identically

distributed (i.i.d.) and that our sampling process is unbiased. These assumptions

are not representative of real-world data, nor of the way in which we sample.

In practice, as we increase the number of samples we increase the diversity of

our data. This diversity, or variance, in data has a large impact on our models’

ability to generalize. We will see in Chapter 3 that introducing data augmentation

to our models results in a significant improvement in their performance. These

augmentations create additional variance in the data and help the model to not

overfit.

Scaling of data in the way we typically gather data can grow the variance

to a greater degree than our typical data augmentation methods can. But this

represents a fundamental limitation as well. We cannot scale infinitely, and as we

gather more data inevitably we turn from increasing variance to contracting the

variance. There are only so many unique things in the world. To understand this,

we may think about randomly throwing darts at a dartboard. As we start, every

new dart likely lands far from the others. But as we continue

we increase our coverage over the dartboard and our new darts land close to an

3We still need to maintain care to ensure our semantic representations are disjoint. This does
not allow us to pick arbitrary semantic representations.



existing dart. This variance contraction means that we cannot rely on scaling data

indefinitely.
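The dartboard intuition can be checked numerically. The following is a minimal sketch, assuming darts are drawn uniformly from the unit square (an idealization that real data collection does not satisfy); the function names and the sample counts are illustrative choices and not part of the original work.

```python
import math
import random


def nearest_neighbor_distance(existing, point):
    """Euclidean distance from `point` to its closest neighbor in `existing`."""
    return min(math.dist(point, other) for other in existing)


def mean_new_dart_distance(n_existing, n_trials=200, seed=0):
    """Average distance from a new dart to the nearest of `n_existing` darts.

    Darts are sampled uniformly on the unit square (an assumption made
    purely for illustration).
    """
    rng = random.Random(seed)
    board = [(rng.random(), rng.random()) for _ in range(n_existing)]
    total = 0.0
    for _ in range(n_trials):
        new_dart = (rng.random(), rng.random())
        total += nearest_neighbor_distance(board, new_dart)
    return total / n_trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(mean_new_dart_distance(n), 4))
```

Under these assumptions the expected distance shrinks roughly as $1/\sqrt{n}$ in two dimensions, so each additional sample contributes less new coverage than the last.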

Additionally, an extra challenge comes from scaling data. Once the data becomes

sufficiently large, we are unable to properly investigate it. This means we will not be able

to properly verify that our model is not trained on the data it is being tested on.

In this manner, we want to use the minimum amount of data required to train our

models, to reduce our burden of verification.

In practice, our data is heavy tailed, with many samples being

underrepresented. Ultimately, despite high amounts of data, subsets exist in a low

data regime. Our models may benefit from shared similarities, via a superposition

of representations, but we are still motivated to develop models which work better

when data is sparse. By better understanding how to make our models efficiently

learn in limited data regimes we hope to build techniques that allow our larger

models to efficiently model data that is within the long tail.

2.2.2 Model Size. We face similar complexities when it comes

to the scaling of our models. Inherently our model parameters change our

loss landscape [100], with larger models providing more ways for data to be

disentangled [95, 45, 29]. It can be shown that by using different loss

functions we may even trick ourselves into believing our models have found

emergent capabilities [159] when they may not have [133].

With increased model parameters our models are more likely to overfit

our data, making it difficult to generalize. With such sizes in terms of data and

parameters it becomes difficult to distinguish between our models memorizing

the data vs modeling the data. In practice, we benefit from physical limitations,



which also put pressure on making our models as small as possible. The larger our

models are, the more expensive they are to run.

2.3 The Foundations That Shape Us

To cost effectively train our models we want them to be both

parameter efficient and data efficient. With too much data, we may spend a

disproportionate amount of time loading from disk and simply ingesting the data. With too

many parameters, we must split, or shard, our model across large supercomputing

infrastructures.

Key to Sutton’s Bitter Lesson was that models should be powerful and

flexible. With our trend in scaling, we have also seen tremendous improvements

in the algorithms that we use, such as the advent of the transformer [150]. Scale

cannot be enough to explain our progress, as we have found that as research

progresses, many smaller models end up significantly outperforming larger

models [66], and this thesis is further demonstration of that.

These algorithms may be referred to as our neural architectures, as we

build them to work together. In the following sections we introduce some of the

key architectures that will be used throughout this work. There exist far more

frameworks and methods [46, 112], and we focus only on those used herein.

2.3.1 Transformers. The transformer model has become the

backbone of modern machine learning models. This is due to its high flexibility,

being able to form a relationship between all elements it attends to. Unlike many

other architectures, the transformer is not limited by the locality of the data, with

it being able to discover relationships between data regardless of its position in a

sequence. This greater flexibility comes at an increased computational complexity,



but enables the model to form relationships that could not be efficiently formed

through other previous architectures.

These models are fairly simple in construction, having two main

components: attention [43, 114, 150], and a feed-forward layer.

Figure 2. The Transformer model architecture from Vaswani et al. Diagram
depicts dot-product self-attention.

Figure 2 depicts part of the transformer model from Vaswani et al.’s

work, showing the dot-product self-attention (DPSA) variant, which is used

throughout this work. The figure depicts a “post-norm” configuration, with the

normalization layers appearing after the attention and feed-forward units, but

modern configurations usually use “pre-norm” due to increased stability. The core

of the transformer model is attention, defined as:

$$\mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \qquad (2.1)$$

Where Q, K, and V represent queries, keys, and values, respectively. These

are learnable parameters, most usually parameterized by a single layer feed-forward

network. In the DPSA configuration, these networks share the same input. The factor 1/√dk in



Generative Models
  Explicit Density
    Tractable Density: Autoregressive, Normalizing Flows
    Approximate Density: Diffusion Models, VAEs
  Implicit Density
    Markov Chain: GSNs
    Direct: GANs

Figure 3. Taxonomy of Generative Models, based on Goodfellow’s Taxonomy [40]

this equation is a softmax temperature scale, which is the inverse square root of

the embedding dimension (a user defined hyperparameter). The queries and keys

are multiplied together, learning a similarity matrix. The softmax of this is then

referred to as the “score”, as its values are defined by a probability distribution.

The value tensor is then weighted by the score, defining our attention function.
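Equation 2.1 can be sketched in a few lines. The following is a minimal single-head illustration using NumPy, assuming the query, key, and value projections are single linear layers without bias; the names `w_q`, `w_k`, and `w_v`, the random initialization, and the tensor sizes are hypothetical and chosen only for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head dot-product self-attention (Equation 2.1).

    x:             (n, d_model) input sequence of n tokens.
    w_q, w_k, w_v: (d_model, d_k) projection matrices (hypothetical names).
    Returns the attended values (n, d_k) and the score matrix (n, n).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v   # all three share the same input
    d_k = q.shape[-1]
    similarity = (q @ k.T) / np.sqrt(d_k)  # scaled query-key similarity
    score = softmax(similarity, axis=-1)   # each row sums to one
    return score @ v, score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_model, d_k = 5, 16, 8
    x = rng.normal(size=(n, d_model))
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    out, score = self_attention(x, w_q, w_k, w_v)
    print(out.shape, score.shape)
```

The score matrix has n × n entries, which is the source of the quadratic cost in sequence length.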

Commonly, this configuration is done in a “multi-headed” manner. Instead

of performing a single attention calculation, we may project our Q, K, and V tensors

into separate embeddings so that we may process multiple attention calculations in

parallel. The attention mechanism concludes by concatenating the resulting tensors.

This tends to make our models more efficient as each head is independent and can

learn unique representations, as we will see in Chapter 4.

The transformer model typically includes the usage of positional encoding,

which adds extra data to the model to indicate the position of tokens, or data, in a

sequence.

2.3.2 Adversarial Generation. Generative Adversarial Networks

are a form of generative models introduced by Goodfellow et al. [41] which first

enabled the generation of high quality synthetic imagery. Not necessarily restricted



to image synthesis, these models enable unsupervised learning by simultaneously

training two models at once. If our goal is to train an image generator, we train both a

model to generate images and a model to discriminate real and fake images. The

discriminator model requires labeled data, but only the binary distinction of real

data or synthesized data. These models are then trained competitively, playing

a minimax game, which often leads to high quality generation.

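$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D\left(G(z)\right)\right)\right]. \qquad (2.2)$$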

G learns a differentiable map z ↦ x that pushes forward a simple prior

(usually spherical Gaussian) toward the data manifold, while D learns to spot

discrepancies. While these models have shown great success and pushed the

bounds of what is possible, they are not without problems. Training is notoriously

unstable—mode collapse, vanishing gradients, and catastrophic forgetting are

common.
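To make the objective in Equation 2.2 concrete, the sketch below evaluates an empirical estimate of the value function from discriminator outputs. It is illustrative only: the function name `gan_value` and the toy probabilities are assumptions, and no actual generator or discriminator is trained.

```python
import math


def gan_value(d_real, d_fake):
    """Empirical estimate of the minimax value in Equation 2.2.

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    The discriminator tries to maximize this value; the generator tries to
    minimize it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


if __name__ == "__main__":
    # A confident, correct discriminator yields a value near 0.
    print(gan_value([0.99, 0.98], [0.01, 0.02]))
    # A discriminator that cannot tell real from fake outputs 0.5 everywhere,
    # giving log(1/2) + log(1/2) = -log(4).
    print(gan_value([0.5, 0.5], [0.5, 0.5]))
```

The value −log 4 corresponds to the equilibrium described by Goodfellow et al. [41], where the generator's distribution matches the data and the discriminator outputs 1/2 everywhere.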

In addition, many generative models have greatly increased in size. These

size increases have resulted in more impressive images, but the models have also become harder

and costlier to train, and slower in throughput. There then must

be a trade-off of capabilities and performance, depending on the applications. In

Chapter 4 we will use a GAN to demonstrate an improved variant of an attention

mechanism, improving throughput and quality while decreasing the total number of

parameters.

2.3.3 Normalizing Flows. Normalizing flows provide exact log-likelihoods

by composing a sequence of bijective, differentiable transforms $f = f_1 \circ \cdots \circ f_k$:

$$p_x(x) = p_u(u) \left|\det J_f(u)\right|^{-1} \qquad (2.3)$$



Here pu is a tractable base distribution and det Jf denotes the Jacobian

determinant. The Jacobian determinant allows for a change of variable, allowing

data from one distribution (u ∈ U) to be expressed in another coordinate system

(x ∈ X). A simplified example that many readers may be more familiar with is the

change of coordinates from a Cartesian space into Polar coordinates

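$$J = \det \frac{\partial(x, y)}{\partial(r, \theta)} = \begin{vmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r\cos^2\theta + r\sin^2\theta = r \qquad (2.4)$$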

Given the Jacobian determinant it becomes trivial to convert an integral from

Cartesian to Polar coordinates by the equation:

$$\iint f(x, y)\,dx\,dy = \iint f(r\cos\theta, r\sin\theta)\, r\,dr\,d\theta$$
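The result of Equation 2.4 can be verified numerically. The following sketch approximates the Jacobian of the map (r, θ) ↦ (x, y) with central finite differences and compares its determinant to r; the helper names and the step size `h` are illustrative assumptions.

```python
import math


def polar_to_cartesian(r, theta):
    """Map polar coordinates (r, theta) to Cartesian (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)


def jacobian_determinant(r, theta, h=1e-6):
    """Central finite-difference estimate of det d(x, y) / d(r, theta)."""
    x_rp, y_rp = polar_to_cartesian(r + h, theta)
    x_rm, y_rm = polar_to_cartesian(r - h, theta)
    x_tp, y_tp = polar_to_cartesian(r, theta + h)
    x_tm, y_tm = polar_to_cartesian(r, theta - h)
    dx_dr, dy_dr = (x_rp - x_rm) / (2 * h), (y_rp - y_rm) / (2 * h)
    dx_dt, dy_dt = (x_tp - x_tm) / (2 * h), (y_tp - y_tm) / (2 * h)
    return dx_dr * dy_dt - dx_dt * dy_dr


if __name__ == "__main__":
    for r, theta in [(0.5, 0.3), (2.0, 1.2), (3.0, -2.0)]:
        # The estimate should agree with r up to finite-difference error.
        print(r, jacobian_determinant(r, theta))
```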

This idea extends greatly, with far more complex formulations of coordinate

transforms. The importance of these transforms is that they generate an isomorphic

mapping from one space to another, where every element in one coordinate

precisely maps to a unique element in the other. Through the composition of

these transformations we can then define a nice tractable distribution, such as a

Gaussian, and learn a coordinate transform that maps our data. This, in effect,

allows us to turn our intractable distribution into a tractable one. We should

remain careful, as there are still some pitfalls and our distribution is still only an

approximation.



What makes this different from Approximate Density models, such as VAEs

and Diffusion models, is that those models do not generate isomorphic functions.

Like flows, they are able to generate a probability density function, making them

“explicit” (Figure 3), but these models are by nature lossy. Where Flows are

bijective, diffusion and VAEs are not.

(a) Injection: one-to-one (b) Surjection: onto (c) Bijection: one-to-one
and onto

Figure 4. Visual representation of injections, surjections, and bijections. Source:
Wolfram MathWorld

The two most common forms of Normalizing Flows, which are also used

within this thesis, are:

Affine coupling flows: Partition input $x$ into two units, $(x_0, x_1)$, such that

$f(x_0, x_1) = (x_0,\; x_1 \odot e^{s(x_0)} + t(x_0))$, which makes computationally inexpensive

triangular Jacobians (e.g. RealNVP [26], Glow [86]).

Autoregressive flows: Parameterize each dimension conditioned on previous ones,

yielding a composition of triangular Jacobians (MAF/IAF [121, 88]).
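The affine coupling transform can be sketched as follows. This is a toy illustration rather than the RealNVP or Glow implementation: the scale and translation functions `s` and `t` below are hypothetical fixed functions standing in for learned networks, and the inputs are plain Python lists.

```python
import math


def s(x0):
    """Toy scale function (stand-in for a learned network)."""
    return [math.tanh(v) for v in x0]


def t(x0):
    """Toy translation function (stand-in for a learned network)."""
    return [0.5 * v for v in x0]


def coupling_forward(x0, x1):
    """y0 = x0, y1 = x1 * exp(s(x0)) + t(x0); also returns log|det J|."""
    scale, shift = s(x0), t(x0)
    y1 = [b * math.exp(a) + c for a, b, c in zip(scale, x1, shift)]
    log_det = sum(scale)  # the Jacobian is triangular with diagonal exp(s(x0))
    return x0, y1, log_det


def coupling_inverse(y0, y1):
    """Exact inverse: x1 = (y1 - t(y0)) * exp(-s(y0))."""
    scale, shift = s(y0), t(y0)
    x1 = [(b - c) * math.exp(-a) for a, b, c in zip(scale, y1, shift)]
    return y0, x1


if __name__ == "__main__":
    x0, x1 = [0.2, -1.0], [1.5, 0.3]
    y0, y1, log_det = coupling_forward(x0, x1)
    r0, r1 = coupling_inverse(y0, y1)
    print(y1, log_det, r1)
```

Because the first unit passes through unchanged, the inverse never needs to invert `s` or `t`, which is what keeps both directions cheap.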

Unlike transformer models, the architectures of Normalizing Flows are highly

restrictive. These restrictions come with the benefits of increased interpretability,

but at the cost of additional computation and less flexibility. Deciding where to make these

trade-offs is difficult, and determining the capability of

these models remains a challenge. Unfortunately these models tend to be greatly understudied, with



only a handful of models having been trained with > 100M parameters, which is

fairly small by modern standards.

2.4 The Tyranny of Measurements

As a final note, we must be ever vigilant of the metrics that we use.

Quantitative metrics are a critical part of the scientific method, evidencing our

hypotheses and theories. Yet, metrics are only guides, proxying the things we wish

to measure. We must stress the importance of this distinction as it is necessary

to properly evaluate our models and interpret what they are doing. Within this

thesis several of our works face the challenges of interpreting our metrics and the

absence of them. In Chapters 4 and 5 we perform image synthesis tasks, where our

models create new data that is representative of what they trained on. There are

no metrics that properly convey what is a good image or not.

For example, a common metric for measuring the capabilities of image

models is the Fréchet Inception Distance (FID) [60]. This metric was shown

to correlate with human judgement of image quality, but was developed when

image quality was much worse. For comparison, the paper that introduced FID

demonstrated models with an FID around 12.5 on the CelebA dataset, while the

current state of the art is 3.15 [146]. These correlations have helped improve

state-of-the-art systems, but because the metric is not perfectly aligned with an actual measurement of

realism, the discrepancies grow as our models improve.

The rapid success of machine learning is a double-edged sword. Our

approximations that helped us make our progress may no longer be sufficient.

With all metrics, we must constantly check their alignment, to ensure that we

are progressing in the directions we intend. This is quite similar to the gradient

descent process we use in machine learning, where early on we may make large



improvements with highly suboptimal steps towards the optimum. Yet, as our model

becomes better, we tend to make smaller steps to ensure we are progressing in the

right direction.



CHAPTER III

ESCAPING THE BIG DATA PARADIGM

The first principle is that you must

not fool yourself and you are the

easiest person to fool.

Richard Feynman

Nota Bene: This chapter is based on the previously published co-authored

work Escaping the Big Data Paradigm with Compact Transformers [51] and the

associated blog post published through PyTorch’s Medium page [155].

– Ali Hassani and Steven Walton are joint primary authors of this work.

Together they wrote the majority of the code, performed the majority of

experiments and writing of the paper. The majority of code was written

during pair-programming sessions between the two.

– Steven Walton worked a bit more on designing the experiments and

developing the theory, ensuring claims were thoroughly evidenced and finding

relevant literature.

– Ali Hassani worked a bit more on code and launching experiments, increasing

code quality and ensuring experiments were launched effectively, maximizing

machine utilization.

– Nikhil Shah helped manage launching experiments and contributed to the

paper writing.

– Abulikemu Abuduweili provided code and feedback for the NLP experiments.



– Humphrey Shi was the advisor, contributing overall guidance on the research

as well as funding for the work. Humphrey also contributed to the writing of

the paper and ensuring research stayed on track.

Critical to any data analysis is the preparation of that data. The ways

in which we encode our data have significant impacts on the way that data is

processed. It is not sufficient to simply apply the right modeling tools to the

data, but one first needs to ensure that the data is properly processed. In machine

learning systems, this processing is typically done by both man and machine. The

ingress and egress of data is critical, and will influence what structures in the data

can ultimately be recovered.

In this chapter we introduce the work Escaping the Big Data Paradigm with

Compact Transformers [51]. This work demonstrates that Vision Transformers do

not need large amounts of data to be performant, instead being able to be trained

from scratch and be effective in limited data regimes. Our results run counter

to conventional wisdom around scaling, demonstrating that scale may decrease

performance, rather than increase it. On small datasets, like CIFAR-10, our small

models are able to achieve comparable performance to much larger ViT models

that also have large pretraining. On medium datasets, like ImageNet, we are able

to outperform ViTs of comparable sizes, and achieve accuracies only slightly lower

than large models with large pretraining.

3.1 Vision Transformers

With Vaswani et al.’s [150] demonstration of dot-product self-attention

based transformer architectures in language, there were several attempts to

integrate them into vision models [6, 129, 69, 68]. Cordonnier et al. [16] first

showed that, by downsampling and adding a positional encoding layer, a



[Figure 5 diagram labels: Convolution / Patching, Reshape, (Optional) Positional Embedding, Transformer Encoder ×N, Sequence Pooling, MLP Head, Class (Duck); panels: Compact Convolutional Transformer, Compact Vision Transformer]

Figure 5. Architectural design of Compact Transformers

BERT [24] style Transformer architecture could learn convolutional filters, given a

sufficient number of attention heads. Unfortunately, these researchers were memory

bound and were using 2 × 2 invertible down-sampling. Dosovitskiy et al. [28]

improved upon this work, claiming “An Image is Worth 16×16 Words”, introducing

the Vision Transformer. Instead of using a 2 × 2 down-sampling, they used larger

16 × 16 patches, giving the paper its name. Additionally, Dosovitskiy et al.

significantly scaled both data and compute. While Cordonnier et al.’s

network was ≈12M parameters, Dosovitskiy et al. used three networks with 86M, 307M,

and 632M parameters. While Cordonnier et al. exclusively trained on CIFAR-10 and

CIFAR-100 [147], Dosovitskiy et al. performed pretraining with the proprietary

JFT-300M dataset [141], ImageNet-21k, and ImageNet-1k [23]. Their work showed

that with large-data pretraining one could outperform ResNet [55] trained

models, although later work showed that by training ResNets with modern training

procedures classification accuracy becomes similar [161]. Dosovitskiy et al.

performed a wide variety of experiments, including using a CNN to generate their



patch embedding and fine-tuning at higher resolutions than pretraining [148, 89].

Their results suggested that only through large pretraining and large models could

ResNets be beaten.

Dosovitskiy et al.’s work made an important claim: “Transformers lack some

of the inductive biases inherent to CNNs, such as translational equivariance and

locality, and therefore do not generalize well when trained on insufficient amounts

of data. However, the picture changes if the models are trained on larger datasets

(14M-300M images). We find that large scale training trumps inductive biases.”

If this problem could not be resolved then this would greatly limit research

contributions by labs without large compute infrastructures 1. The community was

quick to challenge Dosovitskiy et al.’s claim.

Touvron et al.’s Training Data-Efficient Image Transformers & Distillation

Through Attention [149] quickly followed in an attempt to address the claim,

introducing the DeiT model. In particular, they criticized the large pretraining and

sought to counter the claim that transformers do not generalize when trained on

insufficient amounts of data. Their work similarly uses three models for training, but

these are tiny (5M parameters), small (22M parameters), and base (86M parameters).

The ViT was modified to introduce a knowledge transfer2 token, and the training

scheme was modified to include distillation from a pretrained convolutional based

network. For their convolutional network they selected a RegNetY-16GF [127]

network (84M parameters) as the default teacher network.

1Often called “GPU Poor”

2We use the phrasing knowledge transfer instead of distillation for increased clarity, as the
“teacher” network has fewer parameters than the “student” network



3.2 Data Efficient Vision Transformers

While we recognize the importance of these works we believe alternative

conclusions are possible. The ViT results could be explained by several alternative

hypotheses, including the size of the network and through training techniques.

DeiT’s results showed that part of the claim must be false, as even smaller

models could achieve better performance, but this relied upon inheriting the local

inductive biases transferred by a CNN rather than learning them themselves, which

Cordonnier et al. had demonstrated is possible. The critical question remained:

Can transformer models be trained to outperform ResNets when model size and

data were held equal? Both works suggested that the answer was no. On the other

hand, Transformers are universal approximators and Cordonnier et al.’s work

suggests there’s no reason one should believe this data threshold requirement.

Additionally, we believed ViT and DeiT were rejecting valuable information by

only passing a slice of the transformer’s outputs to the classification sub-network.

In an effort to resolve this, we proposed three hypotheses:

– Non-overlapping image patches bias the transformer networks due to

information loss at the boundaries.

– A learned transformation to map the transformer’s outputs to the

classification sub-network will improve performance.

– Transformer networks rely more on data variance than data quantity.

3.2.1 Convolutional Tokenizer. The third hypothesis was motivated

by the discussion in the background section (Chapter 2.2.1), where these

models were gaining more benefit from data variance than data quantity. While

diversity is a common side-effect of scaling, it is a distinct phenomenon. The first

was inspired by subword tokenization that is commonly used by many language

models [35, 135, 24, 150] and experience with computational modeling. The belief

here is that by using non-overlapping patches we weaken the network’s ability

to incorporate information along the boundaries of the images. Such boundary

conditions often plague computational models, requiring ghost cells and other forms

of boundary communication techniques to de-bias calculations.

[Figure 6 diagram, three panels:

Compact Convolutional Transformer (CCT), Convolutional Tokenization Transformer with Sequence Pooling: Inputs, ConvLayer, Pooling, Reshape, Optional Positional Embedding, Transformer Encoder, Sequence Pooling, Linear Layer, Output.

Compact Vision Transformer (CVT), Patch-Based Tokenization Transformer with Sequence Pooling: Inputs, Embed to Patches, Linear Projection, Reshape, Positional Embedding, Transformer Encoder, Sequence Pooling, Linear Layer, Output.

Vision Transformer (ViT), Patch-Based Tokenization Transformer with Class Tokenization: Inputs, Embed to Patches, Linear Projection, Reshape, Class Token, Positional Embedding, Transformer Encoder, Slice, Linear Layer, Output.]

Figure 6. A comparison of the Vision Transformer variants used throughout this
study. On the left is the batching and embedding process (tokenization). On
the right is the main neural architecture. The Transformer Encoder blocks and
Linear Layers (classification sub-network) are identical for all models. CVT follows
ViT, removing the class token and introducing SeqPool. In CCT we modify the
tokenization process, building from CVT.

ViT uses a simple patch and embedding procedure, where the image is

evenly divided into patches. This is illustrated on the right half of Figure 5, under

Compact Vision Transformer (CVT). The process is to do a Group Normalization,

ReLU, MaxPool, patch, and embed. Notably, Dosovitskiy et al. did the patching



and embedding simultaneously with a convolution, matching strides to the

kernel size 3. This same strategy is used for our ViT-Lite and Compact Vision

Transformer (CVT) models. This procedure can be seen in Figure 6.

We propose removing the restriction of making the convolutional kernels and

strides match, allowing these patches to overlap. This would have an additional

beneficial side-effect, allowing for better generalizability, by not requiring images to

be integer multiples of the kernel size. This extends the embedding process to allow

for arbitrary image sizes and aspect ratios. Additionally, we remove the Group

Normalization layer from the ViT model, finding it unnecessary. Given an image or

feature map $x \in \mathbb{R}^{H \times W \times C}$ we can process our image as follows:

$$x_0 = \mathrm{MaxPool}\left(\mathrm{ReLU}\left(\mathrm{Conv2d}(x)\right)\right) \qquad (3.1)$$

Our convolution has a number of filters equal to the embedding dimension of the

transformer backbone, and both our convolution and pooling operations allow for

overlapping, which can introduce local inductive biases.
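To illustrate how overlapping kernels change the sequence handed to the transformer, the sketch below counts tokens using the standard output-size formula for convolution and pooling layers. The padding values and the 3×3, stride-2 pooling layer are assumptions made for the example (the text above does not specify them), and `tokens_after` is a hypothetical helper.

```python
def output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def tokens_after(height, width, layers):
    """Number of tokens after applying `layers` and flattening H x W.

    layers: iterable of (kernel, stride, padding) tuples applied in order.
    """
    for kernel, stride, padding in layers:
        height = output_size(height, kernel, stride, padding)
        width = output_size(width, kernel, stride, padding)
    return height * width


if __name__ == "__main__":
    # Non-overlapping patching: kernel == stride, as in ViT.
    print(tokens_after(32, 32, [(4, 4, 0)]))      # 8 x 8 = 64 tokens
    print(tokens_after(224, 224, [(16, 16, 0)]))  # 14 x 14 = 196 tokens
    # Overlapping tokenizer: 3x3 conv (stride 1, padding 1) followed by an
    # assumed 3x3 max pool (stride 2, padding 1).
    print(tokens_after(32, 32, [(3, 1, 1), (3, 2, 1)]))  # 16 x 16 = 256 tokens
    # Image sizes need not be multiples of the kernel size.
    print(tokens_after(30, 45, [(3, 1, 1), (3, 2, 1)]))  # 15 x 23 = 345 tokens
```

Note that when stride is smaller than the kernel, the number of tokens is set by the stride and padding rather than by the kernel size, so overlap does not by itself shorten or lengthen the sequence.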

3.2.2 SeqPool. In order to map the sequential output of a

transformer to the linear representation required by a feed-forward classification

network ViT uses a singular class index, or token, similar to language models like

BERT [24]. This class token is learnable and then allows for the output of the

transformer to be sliced along the learned index. Unfortunately, this underutilizes

the relationships learned by the transformer encoding layers. This method

makes the assumption that the transformer encoder can, and will, decouple the

relationships of the training data. This disentanglement is the main task of the

classification subnetwork; forcing our Transformer to also perform it likely

leads to underutilization and overly constrains the encoding layers.

3This can be seen at https://github.com/google-research/vision_transformer/blob/main/vit_jax/models_vit.py#L263-L270


We propose SeqPool, an attention inspired pooling method. The method is

based on the assumption that the transformer encoder’s output sequence contains

information relevant to classification. While this method is more computationally

complex than slicing, it can reduce overall computation due to removal of an

additional token that must be processed by the entirety of the network. We use a

network to generate a contraction $S : \mathbb{R}^{b \times n \times d} \mapsto \mathbb{R}^{b \times d}$, which then is an appropriate

shape to be processed by the classification sub-network.

$$\mathrm{Softmax}\left(g(x_L)^{T}\right) x_L \qquad (3.2)$$

Unlike dot-product attention we are not using keys, queries, and values, but instead

learning a weighting of our sequence. Our function $g$ is a single feed-forward layer

mapping $g : \mathbb{R}^{b \times n \times d} \mapsto \mathbb{R}^{b \times n \times 1}$, producing one scalar per token. We apply a softmax to this output and use it to weight our original

input, producing the flattened output. This process can be seen as a learnable

submersion, incorporating across sequential data better, seemingly allowing us to

take advantage of neuron polysemanticity [134, 62] and superpositionality [31].
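Equation 3.2 can be sketched with NumPy as follows. This is a minimal illustration rather than the released implementation: it assumes g is a bias-free linear layer producing one scalar per token, and the weight vector `w_g` and the tensor sizes are hypothetical.

```python
import numpy as np


def seq_pool(x_l, w_g):
    """Attention-style sequence pooling (Equation 3.2).

    x_l: (b, n, d) output sequence of the transformer encoder.
    w_g: (d, 1) weights of the scoring layer g (assumed linear, no bias).
    Returns a (b, d) tensor for the classification sub-network.
    """
    logits = x_l @ w_g                            # (b, n, 1): one score per token
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over n
    pooled = np.swapaxes(weights, 1, 2) @ x_l     # (b, 1, n) @ (b, n, d)
    return pooled[:, 0, :]                        # flatten to (b, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b, n, d = 2, 64, 128
    x_l = rng.normal(size=(b, n, d))
    w_g = rng.normal(size=(d, 1))
    print(seq_pool(x_l, w_g).shape)  # (2, 128)
```

With all-zero scoring weights the softmax is uniform and the operation reduces to average pooling over the sequence; learning `w_g` lets the network depart from that baseline.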

3.3 Experiments

We perform a variety of experiments in order to test our research

hypotheses. We name our models similar to those of ViT, using the more explicit

format:

[model]-[N layers]/[patch size]x[N convolutions]. (3.3)

The original ViT-B/16 model has 12 transformer encoder layers and a patch

size of 16, where we make the number of layers explicit: ViT-12/16. We use this

convention for all ViT and CVT models, dropping the number of convolutions.

For CCT we specify the number of convolutions, even if only one. This section is

organized to first provide details of our experiments and resources. Chapter 3.3.4



contains our main results, demonstrating high performance Vision Transformer

models on small datasets. Chapter 3.3.5 includes details of our ablations, detailing

the effects of our changes to the architecture. Chapter 3.3.6 provides a scaling

study, investigating the scaling of both data and parameters. Finally, Chapter 3.3.7

includes our NLP experiments, to demonstrate that these results generalize to

language models.

3.3.1 Datasets. Our primary focus is on small datasets, where we

train on CIFAR-10, CIFAR-100 [147], MNIST [94], and Fashion-MNIST [164].

We also test our models on Oxford Flowers-102 [120] 4 for generalizability due

to its large similarity between classes and high variance for intra-class similarity.

We also use ImageNet [23] to test the scalability of our approach, allowing for

more direct comparisons to ViT and DeiT. We also test our approach in Natural

Language Processing, using AG-News [172], TREC [101], SST [138], IMDb [116],

and DBpedia [2].

3.3.2 Computational Resources. For most experiments we use a

machine with an Intel(R) Core(TM) i9-9960X CPU @ 3.10GHz and 4 NVIDIA

RTX 2080Tis (11GB). The exception was the CPU test which was performed with

an AMD Ryzen 9 5900X, where we found you could reach 90% accuracy in under

30 minutes. Our ImageNet experiments were performed on a single machine with

either 2 AMD EPYC 7662s and 8 NVIDIA RTX A6000s (48GB) or 2 AMD EPYC

7713s and 8 NVIDIA A100s (80GB).

3.3.3 Hyperparameters. We used the PyTorch Image Models library

(timm) [160] to train our models for all image experiments. Our augmentations

include CutMix [167], Mixup [170], RandAugment [19], and Random Erasing [178].

4We used the dataset from Kaggle, which has a different data split than torchvision. Further
discussion is provided later.

Model # Layers # Heads Ratio Dim

ViT-Lite-6 6 4 2 256
ViT-Lite-7 7 4 2 256

CVT-6 6 4 2 256
CVT-7 7 4 2 256

CCT-2 2 2 1 128
CCT-4 4 2 1 128
CCT-6 6 4 2 256
CCT-7 7 4 2 256
CCT-14 14 6 3 384

(a) Transformer Hyperparameters

Model # Layers # Convs Kernel Stride

ViT-Lite-7/8 7 1 8×8 8×8
ViT-Lite-7/4 7 1 4×4 4×4

CVT-7/8 7 1 8×8 8×8
CVT-7/4 7 1 4×4 4×4

CCT-2/3x2 2 2 3×3 1×1
CCT-7/3x1 7 1 3×3 1×1
CCT-7/7x2 7 2 7×7 2×2

(b) Tokenizer Hyperparameters

Table 2. Hyperparameters used in different model configurations. Table 2a (left)
shows transformer hyperparameters while Table 2b (right) shows those for
tokenizers.
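As a rough cross-check of Table 2a, the encoder parameter count implied by the layer count, dimension, and ratio can be estimated in a few lines. This is a sketch under stated assumptions: we read "Ratio" as the MLP expansion ratio, assume a standard encoder block (fused query/key/value projection, output projection, two-layer MLP, two LayerNorms, all with biases), and ignore the tokenizer, positional embedding, pooling, and classifier. The function names are hypothetical. Note that the number of heads does not change the count.

```python
def encoder_block_params(dim: int, mlp_ratio: int) -> int:
    """Parameters in one assumed standard transformer encoder block."""
    hidden = dim * mlp_ratio
    attention = 3 * dim * dim + 3 * dim   # query/key/value projection
    attention += dim * dim + dim          # attention output projection
    mlp = dim * hidden + hidden           # first MLP linear layer
    mlp += hidden * dim + dim             # second MLP linear layer
    norms = 2 * (2 * dim)                 # two LayerNorms (scale and shift)
    return attention + mlp + norms


def encoder_params(n_layers: int, dim: int, mlp_ratio: int) -> int:
    """Parameters in a stack of identical encoder blocks."""
    return n_layers * encoder_block_params(dim, mlp_ratio)


if __name__ == "__main__":
    # (layers, dim, ratio) rows copied from Table 2a.
    configs = {"CCT-2": (2, 128, 1), "CCT-7": (7, 256, 2), "CCT-14": (14, 384, 3)}
    for name, (layers, dim, ratio) in configs.items():
        millions = encoder_params(layers, dim, ratio) / 1e6
        print(f"{name}: ~{millions:.2f} M encoder parameters")
```

Under these assumptions the 7-layer configuration comes out near 3.7M encoder parameters, which is in the same range as the 3.76M reported for CCT-7/3×1 in Table 3; the remainder would come from the components not modelled here.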

We performed hyperparameter sweeps for our differing methods and report the best

results we achieved. All hyperparameter experiments were trained for 300 epochs,

used a learning rate of 5 × 10^{-4}, a cosine learning rate scheduler, and the weighted

Adam (AdamW) optimizer (β = (0.9, 0.999)) [85, 177]. For CNN models we found that some

performed best with AdamW while others were more performant with SGD with

momentum 0.9. For reproducibility we release our checkpoints corresponding to

the reported numbers and YAML files corresponding to our experimental settings.

These can be found on our public GitHub repository 5.
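The cosine learning rate schedule described above can be sketched as follows. This is a minimal illustration, not the training code: it assumes no warmup and a minimum learning rate of zero, neither of which is stated in this section, and the function name `cosine_lr` is hypothetical. The exact settings are those in the released YAML files.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 300,
              base_lr: float = 5e-4, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given epoch (no warmup assumed)."""
    # Clamp the epoch so that out-of-range values stay on the schedule.
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 300):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

The schedule starts at the base learning rate of 5 × 10^{-4}, passes through half of it at the midpoint, and decays to the minimum at epoch 300.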

3.3.4 Transformers On Small Datasets. The main result of this

work is the success of training Vision Transformers on small datasets. We follow

the aforementioned training procedure, except that we train our best model further, as it

did not appear to be saturated. Our full results can be found in Table 3, where we

show a comparison of various ResNet based models, ViTs, CVT, and CCT, testing

our small vision datasets with comparisons of model size and required compute.

Notably, on CIFAR-10, we are able to achieve a 10% improvement over similarly

5https://github.com/SHI-Labs/Compact-Transformers

Model CIFAR-10 CIFAR-100 FashionMNIST MNIST # Params FLOPs

Convolutional Networks (Designed for ImageNet)

ResNet18 90.27% 66.46% 94.78% 99.80% 11.18 M 0.04 G
ResNet34 90.51% 66.84% 94.78% 99.77% 21.29 M 0.08 G
ResNet50 91.63% 68.27% 94.99% 99.79% 23.53 M 0.08 G

MobileNetV2/0.5 84.78% 56.32% 93.93% 99.70% 0.70 M < 0.01 G
MobileNetV2/1.0 89.07% 63.69% 94.85% 99.75% 2.24 M 0.01 G
MobileNetV2/1.25 90.60% 65.24% 95.05% 99.77% 3.47 M 0.01 G
MobileNetV2/2.0 91.02% 67.44% 95.26% 99.75% 8.72 M 0.02 G

Convolutional Networks (Designed for CIFAR)

ResNet56[56] 94.63% 74.81% 95.25% 99.27% 0.85 M 0.13 G
ResNet110[56] 95.08% 76.63% 95.32% 99.28% 1.73 M 0.26 G
ResNet164-v1[57] 94.07% 74.84% − − 1.70 M 0.26 G
ResNet164-v2[57] 94.54% 75.67% − − 1.70 M 0.26 G
ResNet1k-v1[57] 92.39% 72.18% − − 10.33 M 1.55 G
ResNet1k-v2[57] 95.08% 77.29% − − 10.33 M 1.55 G
ResNet1k-v2⋆[57] 95.38% − − − 10.33 M 1.55 G
Proxyless-G[12] 97.92% − − − 5.7 M −

Vision Transformers

ViT-12/16 83.04% 57.97% 93.61% 99.63% 85.63 M 0.43 G

ViT-Lite-7/16 78.45% 52.87% 93.24% 99.68% 3.89 M 0.02 G
ViT-Lite-6/16 78.12% 52.68% 93.09% 99.66% 3.36 M 0.02 G

ViT-Lite-7/8 89.10% 67.27% 94.49% 99.69% 3.74 M 0.06 G
ViT-Lite-6/8 88.29% 66.40% 94.36% 99.73% 3.22 M 0.06 G

ViT-Lite-7/4 93.57% 73.94% 95.16% 99.77% 3.72 M 0.26 G
ViT-Lite-6/4 93.08% 73.33% 95.14% 99.74% 3.19 M 0.22 G

Compact Vision Transformers

CVT-7/8 89.79% 70.11% 94.50% 99.70% 3.74 M 0.06 G
CVT-6/8 89.50% 68.80% 94.53% 99.74% 3.21 M 0.05 G

CVT-7/4 94.01% 76.49% 95.32% 99.76% 3.72 M 0.25 G
CVT-6/4 93.60% 74.23% 95.00% 99.75% 3.19 M 0.22 G

Compact Convolutional Transformers

CCT-2/3×2 89.75% 66.93% 94.08% 99.70% 0.28 M 0.04 G
CCT-4/3×2 91.97% 71.51% 94.74% 99.73% 0.48 M 0.05 G
CCT-6/3×2 94.43% 77.14% 95.34% 99.75% 3.33 M 0.25 G
CCT-7/3×2 95.04% 77.72% 95.16% 99.76% 3.85 M 0.29 G

CCT-6/3×1 95.70% 79.40% 95.41% 99.79% 3.23 M 1.02 G
CCT-7/3×1 96.53% 80.92% 95.56% 99.82% 3.76 M 1.19 G
CCT-7/3×1⋆ 98.00% 82.72% − − 3.76 M 1.19 G

Table 3. Comparisons of various models when trained on small datasets. ⋆ was
trained for longer, see Table 4 for additional details. Our 3.76M parameter CCT
model is able to outperform both ResNets and ViTs across all datasets, with
longer training only being necessary to outperform the 5.7M Proxyless-G model on
CIFAR-10.

sized ViT-Lite models (ViT-Lite-7/8) and an 18% improvement over the ViT-

12/16 (ViT-B/16) model while our model has a 95.6% reduction in the number

of parameters. Our best model only contains a single convolutional layer within

the embedding process, meaning that the transformer architecture is performing

the main computation, achieving an accuracy of 98% while using only 3.76M

parameters. This result is only slightly below Vaswani et al.’s much larger

models that include JFT-300M or ImageNet-21k pretraining, and it outperforms ViT-12/32,

ViT-24/16, and ViT-24/32 when those models use ImageNet-1k pretraining and fine-tuning

at 384 resolution (Table 5 of Vaswani et al.). We found that increasing the number of

convolutions tended to harm model performance.
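The headline figures above can be reproduced from the CIFAR-10 column of Table 3. The sketch below assumes that the quoted improvements are relative changes in accuracy and that the parameter reduction is measured against ViT-12/16; the helper names are hypothetical.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


def parameter_reduction(small: float, large: float) -> float:
    """Fraction of parameters removed going from `large` to `small`, in percent."""
    return 100.0 * (1.0 - small / large)


if __name__ == "__main__":
    # CIFAR-10 accuracy (%) and parameter counts (millions) copied from Table 3.
    cct_acc, cct_params = 98.00, 3.76        # CCT-7/3x1 (trained for longer)
    vit_lite_acc = 89.10                     # ViT-Lite-7/8
    vit_acc, vit_params = 83.04, 85.63       # ViT-12/16

    print(round(relative_gain(cct_acc, vit_lite_acc), 1))       # 10.0
    print(round(relative_gain(cct_acc, vit_acc), 1))            # 18.0
    print(round(parameter_reduction(cct_params, vit_params), 1))  # 95.6
```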

# Epochs Pos. Emb. CIFAR-10 CIFAR-100

300 Learnable 96.53% 80.92%
1500 Sinusoidal 97.48% 82.72%
5000 Sinusoidal 98.00% 82.87%

Table 4. Training of CCT-7/3×1 with an increased number of epochs.

These results show that our CCT based model is able to outperform both

standard Vision Transformers as well as ResNet models. We demonstrate that

neither large scale pretraining nor knowledge distillation are needed to overcome

the biases found in smaller scale data. Furthermore, we strongly suspect that

the underlying issue is due to the tokenization process of overlapping patches.

We include a comparison of saliency maps [32, 137] in Figure 7, comparing

visualizations on ImageNet. Saliency maps operate by visualizing the

gradient accumulations across the network. We should take care not to over-interpret

the semantic meaning of these maps, but the visualizations do clearly

indicate how the original patching may be recovered in the standard ViT model

while we have a much smoother representation in CCT, evidencing the first

research hypothesis.

Model CLS # Conv Conv Size Aug Tuning C-10 C-100 # Params FLOPS

“Large” Models (≈ 85M Parameters)

ViT-12/16 CT ✗ ✗ ✗ ✗ 69.82% 40.57% 85.63 M 0.43 G

ViT-12/16 CT ✗ ✗ ✓ ✓ 80.72% 56.73% 85.63 M 0.43 G
CVT-12/16 SP ✗ ✗ ✓ ✓ 80.84% 58.05% 85.63 M 0.34 G

ViT-12/8 CT ✗ ✗ ✓ ✓ 90.24% 69.81% 85.20 M 1.45 G
ViT-12/4 CT ✗ ✗ ✓ ✓ 94.07% 76.08% 85.12 M 5.61 G

CCT-12/7×1 SP 1 7× 7 ✓ ✓ 93.72% 76.21% 85.20 M 5.55 G
CCT-12/3×2 SP 2 3× 3 ✓ ✓ 94.50% 77.05% 85.53 M 5.63 G

Small Models (≈ 4M Parameters)

ViT-Lite-7/16 CT ✗ ✗ ✗ ✗ 71.78% 41.59% 3.89 M 0.02 G
ViT-Lite-7/8 CT ✗ ✗ ✗ ✗ 83.38% 55.69% 3.74 M 0.06 G
ViT-Lite-7/4 CT ✗ ✗ ✗ ✗ 83.59% 58.43% 3.72 M 0.26 G

CVT-7/16 SP ✗ ✗ ✗ ✗ 72.26% 42.37% 3.89 M 0.02 G
CVT-7/8 SP ✗ ✗ ✗ ✗ 84.24% 55.49% 3.74 M 0.06 G
CVT-7/8 SP ✗ ✗ ✓ ✗ 87.15% 63.14% 3.74 M 0.06 G
CVT-7/4 SP ✗ ✗ ✗ ✗ 88.06% 62.06% 3.72 M 0.25 G
CVT-7/4 SP ✗ ✗ ✓ ✗ 91.72% 69.59% 3.72 M 0.25 G
CVT-7/4 SP ✗ ✗ ✓ ✓ 92.43% 73.01% 3.72 M 0.25 G
CVT-7/2 SP ✗ ✗ ✗ ✗ 84.80% 57.98% 3.76 M 1.18 G

CCT-7/7×1 SP 1 7× 7 ✗ ✗ 87.81% 62.83% 3.74 M 0.26 G
CCT-7/7×1 SP 1 7× 7 ✓ ✗ 91.85% 69.43% 3.74 M 0.26 G

CCT-7/7×1 CT 1 7× 7 ✓ ✓ 91.67% 72.07% 3.74 M 0.26 G
CCT-7/7×1 SP 1 7× 7 ✓ ✓ 92.29% 72.46% 3.74 M 0.26 G

CCT-7/3×2 CT 2 3× 3 ✓ ✓ 93.36% 74.77% 3.85 M 0.29 G
CCT-7/3×2 SP 2 3× 3 ✓ ✓ 93.65% 74.77% 3.85 M 0.29 G

CCT-7/3×1 SP 1 3× 3 ✓ ✓ 94.47% 75.59% 3.76 M 1.19 G

Table 5. Ablation study, transforming ViT into CCT. We measure CIFAR
validation accuracy across each modification as well as the number of model
parameters and computation (MACs). All ViT models use a class token (CT),
while CVT and CCT use SeqPool (SP). We report the number of convolutions
used during embedding (# Conv), its kernel size, if we utilized image augmentation
(Aug), and tuning.

3.3.5 Ablations. We include ablations of our parameters to better

understand the impact of our changes to the ViT model. In Table 5 we step

through the process of converting our ViT model into CCT. In our table we

denote if we used a class token (CT) or SeqPool (SP), the number of convolutions

[Figure 7 image columns, left to right: ImageNet, ViT, CCT, NAT]

Figure 7. Saliency maps of ViT, CCT, and NAT based on ImageNet-1k. It can be
seen that CCT removes the blocking artifacts from ViT. CCT sometimes creates
displacement, but this is resolved by NAT (presented in Chapter 4).

used (overlapping patches), the kernel size, whether image augmentations were used,

and whether additional tuning was applied. Our tuning includes dropout, attention dropout, and

stochastic depth. We separate our models into two sections: “Large” models

with approximately 85M parameters, and small models with approximately 4M

parameters.

By directly comparing similar ViT-Lite models to our CVT models we can

see the effect of our SeqPool method. In all cases we see that there is a minor

performance improvement due to this, with a much lower effect with the large

85M parameter models. On CIFAR-10, for models with 7 transformer

encoders, we observe a relative increase of 0.7% with a patch size of 16, 1.0% with a patch size of 8,

and 5.3% with a patch size of 4. For the larger 12 transformer layer models

with a patch size of 16 we only notice a 0.1% increase, but these models included

tuning and augmentation, likely reducing the impact.

In the smaller models we see that the larger contribution to performance

increase is due to decreased patch size. For ViT models, decreasing from a patch

size of 16 to 8 increased model performance by 16.2%, but reducing to a patch

size of 4 only accounted for an additional 0.3% increase. For CVT the decrease

to a patch size of 8 showed a similar 16.6% improvement, but further reduction to

a patch of 4 gave another 4.5% increase. Larger impacts can be observed when

looking at CIFAR-100, except in the case of a patch size of 8 where SeqPool

appears to have a slight negative (< 0.5%) impact. We see differences of +1.8%, −0.4%,

and +6.2% for SeqPool across our three patch sizes. On ViT the patch reduction

accounts for 33.9% and 4.9% improvements, while CVT shows 31.0% and 11.8%

improvements. With decreased patch sizing the transformer appears to be able

to overcome the primary issues presented by smaller training sets. Our SeqPool

method still demonstrates greater performance, especially as patch size decreases,

showing greater network utilization.

The largest gains come from moving to CCT, which can also better take

advantage of data augmentations, showing better capacity for generalization. For

example, ViT-Lite-7/8, CVT-7/8, and CCT-7/7×1 all have 3.74M parameters,

but their CIFAR-10 scores are 83.38%, 84.24%, and 87.81%, respectively. Where

CVT shows a 1% improvement over ViT-Lite, CCT shows 5.3%. When introducing augmentation, CVT-7/8

improves to 87.15% (+2.91 points), while CCT-7/7×1 improves to 91.85% (+4.04 points).

We can also see in our CCT experiments that

removing SeqPool and reintroducing the class token drops performance

by 0.62 points, demonstrating that SeqPool does not account for these differences. A

similar pattern can be found with larger models, though our comparisons are not as

thorough. These results show that the overlapping patches and better extraction

of data from the transformer architecture result in significant improvements,

evidencing our first two hypotheses.
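The comparisons in this subsection mix two units: some figures match relative changes in accuracy, while others match differences in percentage points. The sketch below recomputes both from the CIFAR-10 column of Table 5 so that the two can be told apart. The row labels follow Table 5, and the helper name `accuracy_change` is hypothetical.

```python
from typing import Tuple


def accuracy_change(before: float, after: float) -> Tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return round(points, 2), round(relative, 2)


if __name__ == "__main__":
    # CIFAR-10 accuracies (%) copied from Table 5.
    comparisons = {
        "ViT-Lite-7/8 -> CVT-7/8 (SeqPool)": (83.38, 84.24),
        "ViT-Lite-7/8 -> CCT-7/7x1 (conv tokenizer)": (83.38, 87.81),
        "CVT-7/8: adding augmentation": (84.24, 87.15),
        "CCT-7/7x1: adding augmentation": (87.81, 91.85),
        "CCT-7/7x1: class token -> SeqPool": (91.67, 92.29),
    }
    for label, (before, after) in comparisons.items():
        points, relative = accuracy_change(before, after)
        print(f"{label}: {points:+.2f} points, {relative:+.2f}% relative")
```

Read this way, the 1% and 5.3% figures are relative changes, while 2.91, 4.04, and 0.62 are percentage-point differences.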

We also include a short study on positional embedding in Table 6.

Because our overlapping tokenization allowed us to debias some of the positional

relationships within the data, we test the importance of positional

embedding. While ViT and CVT benefit strongly from positional embedding,

CCT only gets minor benefits. This further demonstrates the bias introduced by

patching in ViT. Some additional positional embedding comparisons can be found

in Figure 9.

3.3.6 Scaling Study. While the previous results demonstrate that

pretraining is unnecessary for Vision Transformers to be effective on small datasets,

Model                 PE          CIFAR-10            CIFAR-100

Conventional Vision Transformers are more dependent on Positional Embedding

ViT-12/16             Learnable   69.82% (+3.11%)     40.57% (+1.01%)
                      Sinusoidal  69.03% (+2.32%)     39.48% (−0.08%)
                      None        66.71% (baseline)   39.56% (baseline)

ViT-Lite-7/8          Learnable   83.38% (+7.25%)     55.69% (+7.15%)
                      Sinusoidal  80.86% (+4.73%)     53.50% (+4.96%)
                      None        76.13% (baseline)   48.54% (baseline)

CVT-7/8               Learnable   84.24% (+6.52%)     55.49% (+7.23%)
                      Sinusoidal  80.84% (+3.12%)     50.82% (+2.56%)
                      None        77.72% (baseline)   48.26% (baseline)

Compact Convolutional Transformers are less dependent on Positional Embedding

CCT-7/7               Learnable   82.03% (+0.21%)     63.01% (+3.24%)
                      Sinusoidal  81.15% (−0.67%)     60.40% (+0.63%)
                      None        81.82% (baseline)   59.77% (baseline)

CCT-7/3×2             Learnable   90.69% (+1.67%)     65.88% (+2.82%)
                      Sinusoidal  89.93% (+0.91%)     64.12% (+1.06%)
                      None        89.02% (baseline)   63.06% (baseline)

CCT-7/3×2†            Learnable   95.04% (+0.64%)     77.72% (+0.20%)
                      Sinusoidal  94.80% (+0.40%)     77.82% (+0.30%)
                      None        94.40% (baseline)   77.52% (baseline)

CCT-7/3×1†            Learnable   96.53% (+0.29%)     80.92% (+0.65%)
                      Sinusoidal  96.27% (+0.03%)     80.12% (−0.15%)
                      None        96.24% (baseline)   80.27% (baseline)

CCT-7/7×1-noSeqPool   Learnable   82.41% (+0.12%)     62.61% (+3.31%)
                      Sinusoidal  81.94% (−0.35%)     61.04% (+1.74%)
                      None        82.29% (baseline)   59.30% (baseline)

CCT-7/3×2-noSeqPool   Learnable   90.41% (+1.49%)     66.57% (+1.40%)
                      Sinusoidal  89.84% (+0.92%)     64.71% (−0.46%)
                      None        88.92% (baseline)   65.17% (baseline)

Table 6. Validation accuracy comparison of positional embedding methods.
Augmentations and training techniques such as Mixup and CutMix were turned
off for these experiments to better highlight differences. The numbers reported are
best out of 4 runs with random initializations. † denotes model trained with extra
augmentation and hyperparameter tuning.

we need to understand the relationship of model size, data quantity, and data

quality.

In order to address the Scale is All You Need arguments, we begin with the

study of model size. Our main study of model size can be seen in our ablations

(Table 5), where we observe that our larger 85.53M parameter model outperforms

our 3.76M parameter model on both CIFAR-10 and CIFAR-100, showing very

minor improvement on CIFAR-10 and a 2% increase on CIFAR-100.

This result runs counter to ViT, where the larger model has a performance

decrease of up to 16.5% and 30.5%, respectively. When given additional

augmentation, the larger ViT model is only able to outperform our largest ViT-

Lite-7/16 model, which did not use tuning or augmented training. The two slightly

smaller ViT-Lite models are still able to outperform this large model without the

inclusion of additional augmentation or training, demonstrating that the smaller

patch sizes play a more significant role, as discussed in Chapter 3.3.5. We believe

that the smaller window sizes allow the transformer architecture to better integrate

data across patches, learning convolutions similar to what Cordonnier et al. had

shown, but further study is required to confirm or refute this. The inverse relationship

between patch size and performance applies to both large and small ViTs, with

the large ViT approaching the performance of CCT (surpassing ViT-Lite models)

once the patch size is reduced to 4, yet it still does not surpass the performance of small

CCT models on CIFAR-10.

Under most configurations, CVT also shows a decrease in performance,

again with improved performance primarily being attributed to the patch

size. Performance decreases at a patch size of 2, similar to Cordonnier et al.’s

configuration, showing that the patches can be too small. In a way, this

demonstrates that scale plays an important role, but these trends run counter

to the conventional wisdom. These results demonstrate the importance of the

embedding process and that naïvely scaling architectures may instead hinder

performance. Careful design of the neural architecture trumps scaling.

Model Top-1 # Params FLOPS Epochs

ResNet50 77.15% 25.55 M 4.15 G 120
ResNet50 (2021) 79.80% 25.55 M 4.15 G 300
ViT-S 79.85% 22.05 M 4.61 G 300
CCT-14/7×2 80.67% 22.36 M 5.53 G 300

DeiT-S ⚗ 81.16% 22.44 M 4.63 G 300
CCT-14/7×2 ⚗ 81.34% 22.36 M 5.53 G 300

Table 7. ImageNet Top-1 validation accuracy comparison (no extra data or
pretraining). ⚗ denotes models trained with distillation, following the knowledge
distillation process described in Touvron et al. [149]. ResNet50 (2021) is reported
from [161], which has the same training recipe as ours.

To study the relation of data to model performance we perform multiple scaling

studies. In order to complete our parameter scaling study, we test our model’s

performance on larger amounts of data, with ImageNet, but leave further large

model and large data scaling studies to labs with resources similar to Vaswani et al.

In Table 7 we train a 14 layer (22M param) model, and compare it to ViT-S and

DeiT-S models from Touvron et al. [149]. It is difficult to get these models to be

exactly the same parameter size, but our model is able to still outperform ViT

on ImageNet-1k without any pretraining. We also compare to DeiT-S ⚗, where
our model is slightly smaller, following the same knowledge distillation process.

Our model again shows improvements, demonstrating that our procedure does not

produce negative effects with increased data scale.

Model Resolution Pretraining Top-1 # Params FLOPs

CCT-14/7×2 224 - 97.19% 22.17 M 18.63 G

DeiT-B 384 ImageNet-1k 98.80% 86.25 M 55.68 G
ViT-L/16 384 JFT-300M 99.74% 304.71 M 191.30 G
ViT-H/14 384 JFT-300M 99.68% 661.00 M 504.00 G
CCT-14/7×2 384 ImageNet-1k 99.76% 22.17 M 18.63 G

Table 8. Flowers-102 Top-1 validation accuracy comparison. CCT outperforms
other competitive models, having significantly fewer parameters and GFLOPs. This
demonstrates the compactness on small datasets even with large images.

We also test our 22M parameter model on the Flowers-102 dataset, which

is designed for high data variance and to test model generalizability. For this we

are able to achieve an accuracy of over 97% without the use of any pretraining

data or higher resolution tuning. These results can be found in Table 8. When

using ImageNet-1k pretraining and including higher resolution tuning, following the

procedure of DeiT, we are able to achieve state-of-the-art results, outperforming

models with more than an order of magnitude more parameters and more than

an order of magnitude more pretraining data. It should be noted that we used the

Flowers-102 dataset provided from Kaggle and that this uses a different data

split than that which is included in the torchvision version6. This was brought

to our attention through a GitHub issue,7 where a user was unable to replicate

our results. We retrained our CCT-7/7×2 (4M params) and CCT-14/7×2 models

at 224 resolution and obtained 68.26% and 68.85% accuracy, respectively. When

applying the same procedure to ViT-S/16 we obtained a result of 48.63%,

indicating that our model retains its advantage on this split of the dataset.

6The torchvision dataset collection did not include Flowers-102 when we initially trained our models.

7A wandb report showing training results can be found alongside the issue here: https:
//github.com/SHI-Labs/Compact-Transformers/issues/65

[Figure 8 plot: accuracy (%) versus percent of samples per class (%), from 10 to 100, for CCT-7/3x2, ViT-Lite-7/4, ResNet18, and MobileNet.]

Figure 8. Comparison of models with restricted number of samples per class.
At 10% models are trained on only 5000 images. Transformer based models
demonstrate better scalability than ResNet based models.

Moving on to further test the scalability of our model with respect to data,

we study the performance with respect to the number of samples as well as the size

of our images. In Figure 8 we restrict the number of samples in each class within

the CIFAR-10 dataset. We compare the performance of CCT, ViT, ResNet18,

and MobileNet when using only 10% of CIFAR-10 up to the full dataset. With

only 10% of CIFAR-10, CCT is still able to achieve 77.7% accuracy, compared to

ViT’s 67.9%. CCT is able to outperform the other models regardless of the data

reduction. ViT shows worse performance with data scaling, only beating ResNet18

when including 70% or more of the data.
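The per-class restriction used for Figure 8 can be sketched as a stratified subsample of the training labels. This is an illustrative procedure, not the one from our released code: the function name, the seed handling, and the choice to sample uniformly at random within each class are assumptions. CIFAR-10 has 5,000 training images per class, so keeping 10% per class leaves 5,000 images in total.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def subsample_per_class(labels: Sequence[int], fraction: float,
                        seed: int = 0) -> List[int]:
    """Return sorted dataset indices keeping `fraction` of each class."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Group dataset indices by their class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    kept: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


if __name__ == "__main__":
    # Stand-in labels with the CIFAR-10 training layout: 10 classes x 5,000.
    labels = [c for c in range(10) for _ in range(5000)]
    subset = subsample_per_class(labels, fraction=0.10)
    print(len(subset))  # 5000
```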

Additionally, we include a short study where we modify the image sizes

of CIFAR-10 to understand the dependence on resolution, found in Figure 9.

With smaller resolution images models will likely be less able to rely upon local

structures within the data, as they will be merged. When upscaling, we use a

standard bicubic interpolation. In the first row of the graphs we train our models

[Figure 9 plots: accuracy (%) versus image height and width (16, 24, 32, 48, 64) for ViT-Lite-7/4 and CCT-7/3x2; panels: (a) No P.E., (b) Sinusoidal P.E., (c) Learnable P.E.]

Figure 9. Comparison of ViT-Lite and CCT accuracy on CIFAR-10 with differing
image resolutions. In the first row, models are trained from scratch. In the second row,
models are trained on 32 × 32 images and only run inference at each resolution. Fig. 9a is
without positional embedding, Fig. 9b with sinusoidal positional embedding, and Fig. 9c with a
learnable positional embedding. Inference with learnable positional embedding
cannot be extended to larger images without modifying model parameters.

from scratch, allowing them to discover these associations. In the second row, we

only run inference, testing our models’ capacity to generalize to novel resolutions.

We also show comparisons without positional embedding, with Sinusoidal Positional

Embedding, and with Learnable Positional Embedding. In our inference results

Learnable Positional Embedding models are unable to process larger resolution

images than they were trained on, creating a significant limitation to this method.

In all cases, except inference with Sinusoidal Positional Embedding, CCT is able to

outperform ViT, further demonstrating data generalizability.
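The limitation of learnable positional embeddings at inference time follows from their fixed size: a learned table has one row per training-time token. A sinusoidal table, by contrast, can be regenerated for any sequence length without new parameters. The sketch below assumes the standard sine and cosine formulation from the original transformer; the exact variant used in our experiments is not specified in this section, and the function name is hypothetical.

```python
import math
from typing import List


def sinusoidal_positional_embedding(n_positions: int, dim: int) -> List[List[float]]:
    """Build an (n_positions x dim) table of interleaved sine/cosine features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table: List[List[float]] = []
    for pos in range(n_positions):
        row: List[float] = []
        for i in range(0, dim, 2):
            # Wavelengths grow geometrically with the feature index.
            angle = pos / (10000.0 ** (i / dim))
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


if __name__ == "__main__":
    # With a patch size of 4, a 32x32 image gives 64 tokens and 64x64 gives 256.
    trained = sinusoidal_positional_embedding(64, 8)
    larger = sinusoidal_positional_embedding(256, 8)
    # The first 64 rows agree, so extending the sequence needs no new parameters.
    print(larger[:64] == trained)
```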

3.3.7 Natural Language Processing. Finally, we test our

method on small natural language processing datasets. This network needs slight

modification, incorporating GloVe [125] to provide word embeddings for our model.

We do not train these embedding parameters, and we do not include GloVe, which has

about 20M parameters, in our model parameter counts. To process the data we treat the

text as single channel data, use an embedding dimension of 300, and a convolution

kernel of size 1. We also perform masking in the typical manner.

By using CCT on these datasets we are able to achieve up to a 3%

improvement when comparing to vanilla transformers. Additionally, our CCT

model is able to do this while using fewer parameters. Our best-performing CCT

models have fewer than 1M parameters, making GloVe a significantly

larger part of the network. We report a comparison of vanilla transformers, ViT,

CVT, and CCT in Table 9.

3.4 Conclusion

In this work we saw the importance of properly embedding information into

our machine learning models. We need to ensure that this is done properly or we

may severely limit our model’s capabilities. Even small seemingly trivial differences

Model AGNews TREC SST IMDb DBpedia # Params

Vanilla Transformer Encoders

Transformer-2 93.28% 90.40% 67.15% 86.01% 98.63% 1.086 M
Transformer-4 93.25% 92.54% 65.20% 85.98% 96.91% 2.171 M
Transformer-6 93.55% 92.78% 65.03% 85.87% 98.24% 4.337 M

Vision Transformers (ViT)

ViT-Lite-2/1 93.02% 90.32% 67.66% 87.69% 98.99% 0.238 M
ViT-Lite-2/2 92.20% 90.12% 64.44% 87.39% 98.88% 0.276 M
ViT-Lite-2/4 90.53% 90.00% 62.37% 86.17% 98.72% 0.353 M
ViT-Lite-4/1 93.48% 91.50% 66.81% 87.38% 99.04% 0.436 M
ViT-Lite-4/2 92.06% 90.42% 63.75% 87.00% 98.92% 0.474 M
ViT-Lite-4/4 90.93% 89.30% 60.83% 86.71% 98.81% 0.551 M
ViT-Lite-6/1 93.07% 91.92% 64.95% 87.58% 99.02% 3.237 M
ViT-Lite-6/2 92.56% 89.38% 62.78% 86.96% 98.89% 3.313 M
ViT-Lite-6/4 91.12% 90.36% 60.97% 86.42% 98.72% 3.467 M

Compact Vision Transformers (CVT)

CVT-2/1 93.24% 90.44% 67.88% 87.68% 98.98% 0.238 M
CVT-2/2 92.29% 89.96% 64.26% 86.99% 98.93% 0.276 M
CVT-2/4 91.10% 89.84% 62.22% 86.39% 98.75% 0.353 M
CVT-4/1 93.53% 92.58% 66.64% 87.27% 99.04% 0.436 M
CVT-4/2 92.35% 90.36% 63.90% 86.96% 98.93% 0.474 M
CVT-4/4 90.71% 90.14% 61.98% 86.77% 98.80% 0.551 M
CVT-6/1 93.38% 92.06% 65.94% 86.78% 99.02% 3.237 M
CVT-6/2 92.57% 91.14% 64.57% 86.61% 98.86% 3.313 M
CVT-6/4 91.35% 91.66% 61.63% 86.13% 98.76% 3.467 M

Compact Convolutional Transformers (CCT)

CCT-2/1x1 93.40% 90.86% 68.76% 88.95% 99.01% 0.238 M
CCT-2/2x1 93.38% 91.86% 67.19% 89.13% 99.04% 0.276 M
CCT-2/4x1 93.80% 91.42% 64.47% 88.92% 99.04% 0.353 M
CCT-4/1x1 93.49% 91.84% 68.21% 88.71% 99.03% 0.436 M
CCT-4/2x1 93.30% 93.54% 66.42% 88.94% 99.05% 0.474 M
CCT-4/4x1 93.09% 93.20% 66.57% 88.86% 99.02% 0.551 M
CCT-6/1x1 93.73% 91.22% 66.59% 88.81% 98.99% 3.237 M
CCT-6/2x1 93.29% 92.10% 65.02% 88.74% 99.02% 3.313 M
CCT-6/4x1 92.86% 92.96% 65.84% 88.68% 99.02% 3.467 M

Table 9. Top-1 validation accuracy on text classification datasets. The number of
parameters does not include the word embedding layer, because we use pretrained
word-embeddings and freeze those layers while training.

can have tremendous effects on these models, making it important to take care when

designing our neural architectures. If great care is not taken we will draw the

wrong conclusions and hinder our own progress.

While pretraining can help with model performance, when working with

very large datasets it becomes difficult to deduplicate data, and works have shown

that, despite attempts to deduplicate, these datasets may still be reduced by upwards of

50% [1]. These duplications reduce model performance and generalizability, as they

push the models to over-attend to certain semantics. While reducing the requisite

dataset size doesn’t solve this problem, it certainly makes it much more tractable.

Such results make it difficult to distinguish whether large pretrained

models are generalizing or simply memorizing data.

An important result of this work was the ability to achieve comparable

performance while using orders of magnitude fewer parameters. While there are

still a large number of parameters, having fewer decreases a model’s ability to

overfit. Smaller models can also be used by more people, with fewer

computational resources, and in more domains. Despite the rapid advancement

of computational power, such small models are still critical tools for many areas

of science, which may not have access to multiple GPUs or the ability to obtain

large datasets. While datasets like CIFAR-10 are considered to be small by machine

learning standards, they are often orders of magnitude larger than datasets

available within other research domains. This work makes transformer models

available to these researchers.

CHAPTER IV

VARIADIC NEIGHBORHOOD ATTENTION

Random numbers should not be

generated with a method chosen at

random.

Donald Knuth

Nota Bene: This chapter is based on the previously published co-authored

work Efficient Image Generation with Variadic Attention Heads [156], formerly

released as StyleNAT: Giving Each Head a New Perspective. Additionally, this

chapter involves content from Neighborhood Attention Transformer [52] (NAT) in

order to facilitate the discussion of StyleNAT, but is not the focus of this chapter.

– Steven Walton programmed the majority of the source code for StyleNAT

and ran the majority of experiments. This includes creating all the research

questions and designing all the necessary experiments to evidence them. His

contributions also include all the visual analysis as well as the development

of the attention maps to visualize restricted attention mechanisms. He

was also the main writer of the paper. Steven also made significant

contributions to the work of NAT: he helped develop the theory (primarily

around generalization), contributed to the source code, provided

advice, and helped write the paper.

– Ali Hassani developed the NATTEN CUDA kernel that was used in both

StyleNAT and NAT. He provided important insights, especially with the

rapidly changing NATTEN code, made contributions to the source code,

helped perform experiments, and provided key insights for the development of

the restricted attention visualization. Ali Hassani was also the primary author

of the NAT paper, writing the majority of code, performing the majority of

experiments, and was the largest contributor to the paper’s text.

– Xingqian Xu contributed advice and insights around the underlying

StyleGAN architecture.

– Zhangyang Wang provided guidance during the research and feedback for the

project.

– Jiachen Li provided feedback for the NAT design and contributed to the

writing of the paper.

– Shen Li provided general design feedback for the NATTEN CUDA kernel

and support for running the large scale experiments.

– Humphrey Shi was the advisor for both StyleNAT and NAT, contributing

overall guidance on the research as well as funding for both works. Humphrey

also contributed to the writing of the paper and ensuring research stayed on

track.

While Chapter 3’s success with CCT demonstrated that ViTs could be

significantly improved in terms of data and computational efficiency, it left the

core neural architecture untouched. These impacts come from preparing the

data for processing, but further improvements can be made by also improving

the processing. Our ViT models still struggle with their O(n^2) complexities, in

both time and space, so making improvements to these layers can have significant

impacts. Still, the work showed that transformers did not need big data nor

big models to be successful. This motivates further work into improving these

architectures themselves.

Transformers were born with language in mind, but had been adapted for

vision. The computational costs are particularly challenging in Computer

Vision due to the multi-dimensional data that must be processed (c × w × h),

which frequently leads to out-of-memory (OOM) issues [174, 96, 171]. The de facto

solution to this problem had been to use Convolutional Neural Networks

(CNNs)[94, 93, 41]. This is because CNNs provide memory efficiency by operating

only on a localized context window as well as naturally incorporating multi-

dimensional spatial relationships.

On the other hand, transformer networks attend over the entire data,

allowing for arbitrary connections to be made. As previously discussed

(Chapter 3.1), transformers are capable of learning convolution filters, so it should

be possible for them to be just as powerful. These benefits come at a cost of O(n^2)

in both computational and memory complexity, but our previous

work demonstrated that smaller ViTs could outperform CNNs. This then raises the

question of whether ViTs can be better adapted to vision tasks. Are we able to achieve O(n)

performance while also being able to incorporate both local and global structures

within our data?
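
To make the gap between O(n^2) and O(n) concrete, the following back-of-the-envelope sketch counts the entries of a single head's attention matrix when every pixel is treated as one token. The function names, the one-token-per-pixel assumption, and the 7 × 7 neighborhood are illustrative choices for this sketch, not settings taken from our experiments.

```python
def global_attention_entries(width: int, height: int) -> int:
    """Entries in one head's full attention matrix: n x n, with n = width * height."""
    n = width * height
    return n * n


def local_attention_entries(width: int, height: int, kernel: int) -> int:
    """Entries when each token attends only to a kernel x kernel neighborhood."""
    n = width * height
    return n * kernel * kernel


if __name__ == "__main__":
    # Assumed resolutions, one token per pixel, 7 x 7 local window.
    for side in (32, 64, 256):
        full = global_attention_entries(side, side)
        local = local_attention_entries(side, side, kernel=7)
        print(f"{side}x{side}: global={full:,} local={local:,}")
```

Under these assumptions, doubling the side length of the image multiplies the global count by 16 but the local count by only 4, which is the difference between quadratic and linear growth in the number of tokens.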

Figure 10. Samples from FFHQ-256 (left) with FID: 2.05, FFHQ-1024 (center)
with FID: 4.17, and Church (right) with FID: 3.40 generated by our StyleNAT
network, using Hydra Neighborhood Attention.

This chapter studies the core architecture of the network by introducing

Efficient Image Generation with Variadic Attention Heads [156], which allows

the vision transformer to do more with less. The primary modification for this

work is simple, yet powerful: allow attention heads to attend to independent

receptive fields. Our results demonstrate that some simple modifications to our

attention heads can allow our Vision Transformers to better integrate local and

global relationships during image generation. The result of this is the ability to

train a StyleGAN [79] based model, using a modified version of Neighborhood

Attention [52, 50], which pushes the Pareto Frontier for image generation on

FFHQ-256. Our model significantly improves visual fidelity

while being smaller and having higher throughput than comparable models.
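
The idea of letting heads attend to independent receptive fields can be illustrated with a minimal sketch on a one-dimensional sequence. Everything here is an assumption made for illustration: the `HeadConfig` type, the `receptive_field` helper, the specific kernel sizes and dilations, and the choice to simply drop out-of-range positions at the borders. It is not the implementation used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadConfig:
    """Hypothetical per-head receptive field: window size and dilation."""
    kernel_size: int  # number of keys the head attends to (odd)
    dilation: int     # spacing between attended positions


def receptive_field(query: int, length: int, head: HeadConfig) -> list[int]:
    """Key positions one head sees for `query` on a 1D sequence of `length` tokens.

    Positions that fall outside the sequence are dropped, a simplification
    chosen only to keep this sketch short.
    """
    half = head.kernel_size // 2
    positions = (query + offset * head.dilation for offset in range(-half, half + 1))
    return [p for p in positions if 0 <= p < length]


# Assumed example: two local heads and two dilated heads in the same layer.
heads = [HeadConfig(3, 1), HeadConfig(3, 1), HeadConfig(3, 4), HeadConfig(3, 8)]

if __name__ == "__main__":
    for index, head in enumerate(heads):
        print(index, receptive_field(query=16, length=32, head=head))
```

Each head in this sketch attends to the same number of keys, so the per-head cost is unchanged, yet the heads with larger dilation reach tokens much farther from the query than the local heads do.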

4.1 Localized Attention

In an effort to address the computational challenges of transformers,

researchers looked to a number of different solutions. One such solution is to

only perform attention on some localized region instead of the whole input. This

formulation is natural, as analysis of attention maps shows strong

correlation between neighboring tokens [90, 3, 152] or the presence of Attention Sinks [163].

Works like Image Transformer [123] and Stand Alone Self-Attention (SASA) [129]

use localized context windows for their transformer algorithms, similar to the

ideas proposed in Longformer [7]. These methods reduced the computational

burden of attention mechanisms, approximating O(n) complexity, but had issues

generalizing as the window size increased. Other works like HaloNet [151] and

the Window Self-Attention (WSA) from Swin Transformer [109, 110] partitioned

the query and context sets, independently performing self-attention. These blocks

are highly parallelizable but do not account for cross-block interactions.

Swin tried to address this issue by introducing shifted windows (SWSA), where

subsequent attention layers shift their windows. With a hierarchical structure, the

network allows every pixel in an image to attend to every other pixel,

but it incorporates biases around boundaries, similar to the issues faced in the

non-overlapping blocks of ViT (Chapter 3).
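
The cross-block limitation of partitioned attention, and how shifting the partition relieves it, can be seen in a small one-dimensional sketch. The helpers `window_ids` and `can_attend`, the sequence length of 8, and the cyclic form of the shift are assumptions made for illustration. The sketch does not model any masking, so the wrap-around grouping at the sequence borders is an artifact of this simplification.

```python
def window_ids(length: int, window: int, shift: int = 0) -> list[int]:
    """Assign each of `length` tokens to a non-overlapping window.

    With shift > 0 the partition is cyclically shifted, loosely mimicking
    shifted-window attention in one dimension.
    """
    return [((i + shift) % length) // window for i in range(length)]


def can_attend(i: int, j: int, ids: list[int]) -> bool:
    """Two tokens attend to each other only when they share a window."""
    return ids[i] == ids[j]


if __name__ == "__main__":
    regular = window_ids(length=8, window=4)
    shifted = window_ids(length=8, window=4, shift=2)
    print(regular)
    print(shifted)
    # Tokens 3 and 4 are adjacent, yet only the shifted partition groups them.
    print(can_attend(3, 4, regular), can_attend(3, 4, shifted))
```

In the regular partition, neighboring tokens on either side of a window edge never interact within a layer. Alternating regular and shifted partitions across layers lets information cross those edges, at the price of special handling near the boundaries.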

4.2 Neighborhood Attention

[Figure 11 plot: ImageNet accuracy (vertical axis) against GFLOPs (horizontal axis) for the Neighborhood Attention Transformer (NAT-M, NAT-T, NAT-S, NAT-B), Swin Transformer (ICCV 2021; Swin-T, Swin-S, Swin-B), and ConvNeXt (CVPR 2022; ConvNeXt-T, ConvNeXt-S, ConvNeXt-B). Legend, model parameters: Mini ∼ 20M, Tiny ∼ 30M, Small ∼ 50M, Base ∼ 90M.]

Figure 11. Comparison of Neighborhood Attention, Swin, and ConvNeXt on
ImageNet classification.

To resolve these issues, Hassani et al. developed the Neighborhood Attention

Transformer (NAT) [52]. The architecture is similar to SASA but resolves the

generalization issue by ensuring that, when the window size equals the image

size, Neighborhood Attention (NA) is identical to the traditional

dot-product self-attention mechanism. Like a convolution, NA considers a context

window around each individual input query, Q. The keys, K, are then evaluated over

the surrounding neighborhood (a square). If a (relative) positional bias [68, 128], B,

is used, then this must also be modified to account for the key location. Similarly,

the value, V , must be updated to correspond wi