MEASURING LONG-TERM MEMORIES AT THE FEATURE LEVEL REVEALS MECHANISMS OF INTERFERENCE RESOLUTION

by

MAXWELL L. DRASCHER

A DISSERTATION

Presented to the Department of Psychology and the Division of Graduate Studies of the University of Oregon in partial fulfillment of the requirements for the degree of Doctor of Philosophy

March 2023

DISSERTATION APPROVAL PAGE

Student: Maxwell L. Drascher

Title: Measuring Long-term Memories at the Feature Level Reveals Mechanisms of Interference Resolution

This dissertation has been accepted and approved in partial fulfillment of the requirements for the Doctor of Philosophy degree in the Department of Psychology by:

Brice Kuhl, Chairperson
Ulrich Mayr, Core Member
Margaret Sereno, Core Member
James Murray, Institutional Representative

and

Krista Chronister, Vice Provost for Graduate Studies

Original approval signatures are on file with the University of Oregon Division of Graduate Studies.

Degree awarded March 2023

© 2023 Maxwell L. Drascher
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

DISSERTATION ABSTRACT

Maxwell L. Drascher
Doctor of Philosophy
Department of Psychology
March 2023

Title: Measuring Long-term Memories at the Feature Level Reveals Mechanisms of Interference Resolution

When memories share similar features, this can lead to interference and, ultimately, forgetting. At the same time, many highly similar memories are remembered vividly for years to come. Understanding what causes interference and how it is overcome is key to understanding the vast human memory capacity. One unresolved challenge is that interference has primarily been studied with dichotomous measures of memory ("remembered", "forgotten"). This limits our understanding because memories are not all-or-none; they comprise multiple features, each of which can be recalled with different levels of detail or bias. To investigate this issue, this dissertation focuses on the use of face stimuli. Faces are a unique class of stimuli for studying memory interference in that they are readily parameterizable and humans are experts at perceiving them. This means that they can be manipulated to be similar enough to cause interference, while subtle differences can still be stored and later probed from long-term memory. This dissertation develops a methodology to create synthetic faces that can be manipulated and probed along a set of perceptually-important feature dimensions. This development process included documenting face landmark positions, sorting faces based on perceived similarity, and collecting subjective ratings on a corpus of 1,148 face images. In a series of three experiments, I then applied this novel methodology to understand how memories change at the feature level when there is interference between highly similar memories. I found two memory changes that specifically occurred when there was interference between highly similar stimuli: (1) during recollection there was a bias to exaggerate the subtle differences and (2) distinguishing features were recalled with greater consistency. Critically, these memory changes were adaptive in that they were associated with fewer interference-related errors. Finally, in a separate fMRI experiment, I used the same corpus of faces and feature dimensions to reconstruct faces based on patterns of fMRI activity evoked while viewing them. I argue that this approach can be utilized in the future to measure neural representational changes during interference resolution.
Together, our findings provide important insights into how the memory system resolves interference between highly similar memories. This dissertation includes previously published and unpublished co-authored material.

CURRICULUM VITAE

NAME OF AUTHOR: Maxwell L. Drascher

GRADUATE AND UNDERGRADUATE SCHOOLS ATTENDED:
University of Oregon, Eugene
Skidmore College, Saratoga Springs

DEGREES AWARDED:
Doctor of Philosophy, Psychology, 2023, University of Oregon
Master of Science, Psychology, 2017, University of Oregon
Bachelor of Arts, Psychology, 2012, Skidmore College

AREAS OF SPECIAL INTEREST:
Cognitive Neuroscience

PROFESSIONAL EXPERIENCE:
Graduate Research & Teaching Assistant, University of Oregon, September 2016 – March 2023
Project Manager, Claremont Graduate University, August 2015 – August 2016
Research Specialist, Princeton University, June 2014 – July 2015
Research Assistant, UMASS Boston, August 2012 – January 2014

GRANTS, AWARDS, AND HONORS:
NSF GRFP Honorable Mention, University of Oregon, 2018
First Year Fellowship, University of Oregon, 2016
Psi Chi, Skidmore College, 2011
Periclean Honors Forum, Skidmore College, 2008

PUBLICATIONS:

Drascher, M. L., & Kuhl, B. A. (2022). Long-term memory interference is resolved via repulsion and precision along diagnostic memory dimensions. Psychonomic Bulletin & Review, 1-15.

Chanales, A. J., Tremblay-McGaw, A. G., Drascher, M. L., & Kuhl, B. A. (2021). Adaptive repulsion of long-term memory representations is triggered by event similarity. Psychological Science, 32(5), 705-720.

Siperstein, G. N., Parker, R. C., & Drascher, M. L. (2013). National snapshot of adults with intellectual disabilities in the labor force. Journal of Vocational Rehabilitation, 39(3), 157-165.

ACKNOWLEDGMENTS

This dissertation would not be here without the many people in my life. I am very grateful for the support I have received. I will highlight certain people, but there are too many to name. I would first like to thank my advisor, Dr. Brice Kuhl. Dr. Kuhl has provided me with countless hours of advice and wisdom throughout my time at the University of Oregon. Dr. Kuhl's scientific perspective is all over this document. I am very grateful for all the support he has given me. I would also like to thank the other members of my committee: Dr. Ulrich Mayr, Dr. Margaret Sereno, and Dr. James Murray. Thank you for your time, attention, and valuable insight. Thanking my committee would not be complete without a special mention of a member who sadly passed away, Dr. Sarah Dubrow. Dr. Dubrow made my experience as a graduate student richer and more joyful. I hope that hints of her thinking can be seen in this dissertation. I would also like to thank the members of the Kuhl lab, past and present. Several people directly played a role in this research. I would like to specifically thank Dr. Hongmi Lee, Dr. Nicole Long, and Sarah Sweigart for their roles early in this research. I would also like to thank Alex Tremblay-McGaw for her overall support and for her role in landmarking face stimuli and collecting much of the data. Thank you to Paul Keene for his role in data analysis. Many other members of the lab have helped me, guided my thinking, and provided support. Thank you to everyone else in the lab who I neglected to highlight here. I would also like to thank the staff both in the Psychology Department and at the Lewis Center for Neuroimaging for providing facilities, equipment, training, and support integral to collection of the data included in this project.
I greatly appreciate all of the assistance they have provided to me. It is also important to thank one of the most influential people on my journey here, my undergraduate thesis advisor, Dr. Hugh Foley. Dr. Foley gave me confidence and helped instill a love of research in me. He has always been very supportive of me. It is remarkable how much of this work parallels what I worked on with him over ten years ago. Importantly, I would never have made it to this point without the support of my loved ones. Thank you for not giving up on me. Thank you to my parents. And thank you to my amazing partner, Kathy Padgett. This research was supported in part by an NSF CAREER Award (BCS-1752921) and an NIH-NINDS R01 (NS107727) awarded to Dr. Kuhl.

For Mom. Thank you for believing in me.

TABLE OF CONTENTS

I. INTRODUCTION
  Introduction: how do we remember highly similar information?
  Memory interference background
    Cognitive perspectives on memory interference
    The role of item similarity in interference
    Neural origins of interference
  Adaptive memory changes as a route to interference resolution
    Adaptive memory errors and distortions
    Interference resolution
  Measuring the feature space of memory
    Behavioral measures of memory content
    Neural measures of memory content
  Goal and structure of the dissertation

II. STANDARDIZED SET OF 1,148 FACE STIMULI WITH LANDMARKS, SORTING, AND RATING DATA
  Introduction
  Methods
    Face image corpus
    Face image landmarking
    Active appearance model application
    Subjective similarity sorting
      Procedure
      Reliability analysis
    Subjective ratings
      Participants
      Procedure
      Reliability analysis
  Results
    Landmark validation
    Active appearance model application
    Subjective similarity sorting
    Subjective ratings
  Discussion

III. LONG-TERM MEMORY INTERFERENCE IS RESOLVED VIA REPULSION AND PRECISION ALONG DIAGNOSTIC MEMORY DIMENSIONS
  Introduction
  Methods
    Participants
    Materials
      Cue words
      Faces
    Procedure
      Learning phase
      Reconstruction phase
      Reconstruction search space
    Analysis methods
      Performance-based exclusion criteria
      Measuring associative memory
      Measuring bias
      Measuring precision
      Measuring the relationship between reconstruction bias and associative interference
  Results
    Associative memory test
    Face reconstruction accuracy
    Face reconstruction bias
    Face reconstruction precision
    Relationship between reconstruction bias and associative interference
  Discussion
IV. RECONSTRUCTING FACE IMAGES FROM DISTRIBUTED PATTERNS OF FMRI ACTIVITY USING THE ACTIVE APPEARANCE MODEL
  Introduction
  Methods
    Participants
    Procedure
    Stimuli
    fMRI imaging acquisition
    fMRI data preprocessing
    Regions of interest
    Face reconstruction analysis
    Statistical tests
  Results
    Behavioral performance
    Reconstruction of faces using AAM
    Reconstruction performance compared to eigenface model
    Reconstruction performance for appearance vs shape components
    Reconstruction performance for individual AAM components
    The effect of repetition on reconstruction performance
    Predicting subjective ratings
  Discussion
    Differences in content representation
    Effect of repetition
    Comparison to eigenfaces
    Future directions

V. GENERAL DISCUSSION
  Integrated summary of results
  The role of selective attention in interference resolution
  Relationship between behavioral findings and neural accounts of interference resolution
  Future directions
  Broader implications
  Conclusion: how do we remember highly similar information?
APPENDICES
  A. CHAPTER II SUPPLEMENTARY MATERIAL
  B. CHAPTER III SUPPLEMENTARY MATERIAL
  C. CHAPTER IV SUPPLEMENTARY MATERIAL

REFERENCES CITED

LIST OF FIGURES

1.1. Simplified summary of the Hulbert and Norman (2015) model
2.1. Example of the 62 landmark positions
2.2. Example trial from sorting task
2.3. Illustration of the top ten shape components
2.4. Illustration of the top ten appearance components
2.5. Example of five stimuli reconstructed with AAM components
2.6. Top three multidimensional scaling (MDS) components
2.7. Boxplot of the correlation between dissimilarity matrices
2.8. Scree plot of a hierarchical clustering analysis based on sorted stimuli
2.9. Scree plot of a hierarchical clustering analysis based on sorted stimuli
2.10. Example mean images of the face groupings
2.11. Relationship between the ratings for all stimuli
3.1. Experimental paradigm and design
3.2. Associative memory test accuracy across learning rounds
3.3. Face reconstruction accuracy
3.4. Feature memory along the diagnostic and non-diagnostic dimensions
3.5. Relationship between reconstruction bias and associative memory
4.1. Experimental design
4.2. Visualization of the ROIs
4.3. Schematic of the face reconstruction analysis
4.4. Reconstruction examples from one participant from OCC
4.5. AFC accuracy for AAM components
4.6. AFC accuracy for AAM components within PPC
4.7. AFC accuracy for AAM components within temporal ROI
4.8. AFC accuracy for AAM compared to eigenface components
4.9. AFC accuracy by the number of components included
4.10. AFC accuracy for shape compared to appearance components
4.11. AFC accuracy for shape compared to appearance within PPC ROIs
4.12. AFC accuracy for shape compared to appearance within temporal ROIs
4.13. Correlation between predicted and true on individual AAM components
4.14. Best (and worst) predicted AAM components
4.15. AFC accuracy by repetition number
4.16. AFC accuracy by repetition number for PPC ROIs
4.17. AFC accuracy by repetition number for temporal ROIs
4.18. Average correlation between predicted and averaged ratings
S2.1. Density plot of masculinity/femininity ratings
S2.2. Example of 8 stimulus pairs matched on affect, but differing on gender
S2.3. Example of 8 stimulus pairs matched on gender, but differing on affect
S3.1. Differential memory effects for affect vs. gender
S3.2. Relationship between reconstruction bias and precision
S3.3. Bias and precision in the competitive vs. non-competitive conditions
S3.4. Histogram of reconstruction responses in the competitive condition
S4.1. AFC accuracy for correct trials, separately for 1st and 2nd appearance
S4.2. AFC accuracy for correct trials, separately for 1st and 2nd appearance, and for PPC ROIs
S4.3. AFC accuracy for correct trials, separately for 1st and 2nd appearance, and for temporal ROIs
S4.4. Appearance components predicted by only OCC and only PPC
S4.5. Shape components predicted by only OCC and both OCC and TEMP
S4.6. AAM components with significant negative correlations

Chapter I
INTRODUCTION

Introduction: how do we remember highly similar information?

The capacity of the human memory system is seemingly limitless (Brady et al., 2008). Yet, we also quite often forget. One of the central problems in memory research is understanding the reasons and circumstances that distinguish a memory that will be successfully recalled from one that will be forgotten (Anderson, 2003; Anderson et al., 1994; Anderson & Spellman, 1995; Crowder, 2014; Fawcett & Hulbert, 2020; Smith & Hunt, 2000). Much is already understood about challenges to memory, in particular how similarity between memories can lead to interference, which increases the chance of forgetting (Anderson & Neely, 1996). However, in part due to methodological barriers, most of the progress in this research has come from studies that focus only on whether a memory is remembered or forgotten (Cooper & Ritchey, 2019). In contrast, relatively little attention has been paid to subtler changes to memory—at the individual feature level. This is critical because, as I will argue, measuring changes to a memory's features may be key to understanding the impact of interference, how interference is resolved, and ultimately the vast human memory capacity. This dissertation focuses on memories as a multi-dimensional constellation of individual features (Cooper & Ritchey, 2019; Horner & Burgess, 2013; Horner & Burgess, 2014; Xue, 2018). For example, when you meet someone new, you form a memory of that person that consists of specific pieces of information or features (e.g. eye color or the shirt they were wearing). A week later, given a cue (e.g. their name), you may be able to retrieve that memory.
Critically, the features you perceived when first meeting them are unlikely to all be retrieved with the same accuracy or level of detail (and some will be forgotten). The central challenge that this dissertation addresses is applying this perspective of memory to the study of episodic memory interference. To study interference, we need a class of stimuli that can be manipulated to be sufficiently similar to cause interference, while remaining distinct enough to be retrieved from memory. The features of these stimuli also need to be measurable along continuous dimensions that make it possible to track subtle memory changes. This dissertation focuses on first developing a methodology that meets these criteria (Chapter 2). In human research, faces are uniquely suited for this purpose. I then present two applications of this methodology: a behavioral paradigm (Chapter 3) and a neuroimaging paradigm (Chapter 4). I conclude with a discussion of what has been learned as well as potential future applications (Chapter 5). To preview, I found two distinct changes at the level of individual memory features, each of which may play a role in resolving interference, ultimately making it possible for humans to remember so many highly similar pieces of information. The rest of this introductory chapter summarizes important background information. In the next section, I highlight some of what is already known about memory interference (Memory interference background). I then discuss the idea of adaptive memory distortions, the notion that deviations from perfectly accurate memories can be advantageous for navigating the world (Adaptive memory changes as a route to interference resolution). That perspective helps explain why it is important to measure memories at the feature level. Here, I also introduce a computational model (Hulbert & Norman, 2015) that motivated much of this work. This model proposes a theory of how memory features may change as an adaptive response to interference. The next section provides background on how the features of memory have previously been measured, both through behavioral probes and fMRI analysis (Measuring the feature space of memory). This dissertation represents an innovation in the behavioral measurement of feature memory and presents a path toward innovation in measuring feature memory based on fMRI activity. I conclude with a brief preview of the rest of the dissertation (Goal and structure of the dissertation).

Memory interference background

Cognitive perspectives on memory interference

Every day we encounter moments, items, or events that are quite similar to ones we've seen before. For example, it is a cliché to note that you can't remember what you ate for breakfast. This is because the memory of your breakfast from this morning is probably quite similar to other mornings. In contrast, you may have no trouble recalling what you ate at the restaurant you visited for the first time last week, as there are no other memories of meals at the same restaurant. The disruption in our memory system caused by another similar, competing memory, such as yesterday's breakfast, is known as memory interference (Anderson & Neely, 1996; Anderson & Spellman, 1995; Smith & Hunt, 2000). There are several theories about the mechanisms underlying memory interference. That is, how is the architecture of our memory system organized such that similar items would create disruptions?
Some theoretical perspectives focus on retrieval, assuming that the competing items are stored in long-term memory. Under this perspective, memory interference reflects a disruption in the ability to retrieve those memories (Rajsic et al., 2017; Tulving, 1974). For example, in cases where memories share a common cue, the memory with the strongest association with the cue will tend to win out; further, it may actually suppress the association between the cue and the competing item (Anderson & Neely, 1996; Anderson et al., 1994; Gillund & Shiffrin, 1984; Melton & Irwin, 1940; Rundus, 1973). Alternatively, the memories may remain retrievable, but competition can create binding errors where a cue from one item is errantly associated with a competitor. In this case, we would expect to see not only memory errors, but specifically "swap" errors where competitors are preferentially recalled (Bays et al., 2009). Relatedly, depending on how similar a competitor is, it may simply be confused with the target during retrieval (Diana et al., 2004; Schurgin et al., 2020). Other interference perspectives focus on changes to the memory of the item itself. This could involve weakening of the memory representation of the item overall, or specific changes to features (see Interference resolution, below). Because the means of measuring memories at the feature level have been limited until recently, mechanistic accounts of changes to the memories themselves remain somewhat speculative. Ultimately, many of these potential interference mechanisms may occur depending on the circumstances, but determining when and to what degree each occurs will require research that more fully maps the circumstances of interference and the effects thereof.

The role of item similarity in interference

Many factors go into the degree of memory interference experienced. One key factor is the degree of similarity between competing items. Since interference is triggered by similarity, it is intuitive to suspect that greater similarity between competing items leads to greater memory interference. In fact, a large body of research from a broad array of paradigms supports this view (Anderson et al., 1994; Anderson & Spellman, 1995; Baddeley, 1964; Baddeley & Dale, 1966; Chanales et al., 2017; Smith & Hunt, 2000; Watson & Lee, 2013; Yeung et al., 2013). While similarity does tend to cause interference, there are contexts where high similarity can be beneficial to memory in comparison to moderate similarity (Anderson, 2003; Bauml & Hartinger, 2002; Kahana et al., 2007; Lin & Luck, 2009; Mate & Baques, 2009; Sanocki & Sulman, 2011). This suggests that, depending on the context, there may be a specific level of similarity where interference peaks, with both lower and higher levels of similarity tending to cause less interference. This question has primarily been studied with dichotomous measures of memory, where forgetting an item is used as an indicator of greater interference. However, recent working memory studies have used continuous measures of memory features to address this question. This research suggests that highly similar items may damage the precision of remembered memory features, whereas less similar items may disrupt the ability to retrieve the memory at all (Li et al., 2020; Sun et al., 2017).
These studies provide clues about the role of similarity in interference; however, it remains unknown to what extent these findings apply to long-term memory and what neural mechanisms underlie this relationship.

Neural origins of interference

One of the most influential frameworks for understanding the human memory system is the complementary learning systems (CLS) framework (Battaglia et al., 2011; McClelland et al., 1995; Norman, 2010; Norman & O'Reilly, 2003; O'Reilly & McClelland, 1994; O'Reilly & Norman, 2002; O'Reilly & Rudy, 2001; Schapiro et al., 2017; Schlichting et al., 2015). Under this perspective, there are two basic learning systems: (1) a 'fast' system designed to remember specific events and (2) a 'slow' system for extracting generalities over time. Although all memories can still be impacted by the neural architecture of the slow system, this dissertation focuses on the fast system—which supports the formation of distinct memories with a large amount of detail and where interference is an obstacle to be overcome. The formation of distinct memories is largely driven by the architecture of the hippocampus. When a novel stimulus or event is experienced, the hippocampus forms distinct representations in CA3. This automatic formation of a distinct representation is known as pattern separation (Bakker et al., 2008; Yassa & Stark, 2011). Computational models suggest that the sparse coding in the dentate gyrus drives the formation of these distinct representations independently of the content of the memory (Norman & O'Reilly, 2003; O'Reilly & Norman, 2002; Schapiro et al., 2017). These unique, orthogonalized neural representations decrease the likelihood of interference. When attempting to retrieve a memory, a partial retrieval cue can lead to the reactivation of the full memory representation in CA3 (O'Reilly & Norman, 2002). Computational models suggest that the recurrent connections in CA3 lead to a process known as pattern completion, where the activation pattern in this region converges toward the full memory representation. CA3 first outputs to CA1, where memory features are likely represented. The reinstatement of the memory then continues to cascade out to the entorhinal cortex and to connected and distributed cortical regions (Xue, 2018). Memory interference likely arises from two competing needs in the memory system: (1) representing the features of memories that may have little distinction, while simultaneously (2) forming distinct pathways for correct retrieval (Colgin et al., 2008). This tradeoff appears to be addressed by the hippocampal architecture outlined above: pattern separation automatically creates distinct representations in the dentate gyrus and CA3, and then, during retrieval, memory features are represented in broad patterns across cortical regions. Yet, despite a neural architecture that seems, in part, designed to diminish memory interference, interference still quite often occurs. Further, since pattern separation occurs automatically when a new item is experienced, it cannot explain interference resolution. This means that instances of overcoming interference must be explained by another, experience-dependent neural process.

Adaptive memory changes as a route to interference resolution

Adaptive memory errors and distortions

Our memory system is not akin to taking a photograph in which a perfect representation is stored; instead, there are systematic errors and distortions (Schacter et al., 2011). Memory is better understood as a reconstructive process.
Evidence for this comes from systematic patterns in memory errors, and from the discovery of neural overlap between areas involved in memory retrieval and those involved in imagining the future (Benoit & Schacter, 2015; Schacter & Madore, 2016). Memories are shaped by our already-formed preconceptions about the world (Tompary & Thompson-Schill, 2021), changes in the current environment (Brunec et al., 2018; Zheng et al., 2022), attentional focus (Hutchinson et al., 2016; Swan et al., 2016), behavioral goals during both encoding (Long & Kuhl, 2018) and retrieval (Favila et al., 2018; Mack et al., 2016), and other memories (Scotti et al., 2021). There are a great number of well-established memory errors and distortions, many of which reflect adaptive attributes of our memory system. For example, when participants are presented with a list of semantically related words (e.g. "shell", "omelet", "yolk", "frittata", "scramble"), they will tend to develop a false memory for a related word not on the list (e.g. "egg"; Deese, 1959; Roediger & McDermott, 1995). Although "egg" was never seen, it may be adaptive to falsely remember that word because it reflects the gist of what was experienced, thereby better allowing you to apply that experience in the future (Schacter et al., 2011). Another example is when information later learned about an already-experienced event is incorporated into the memory for that event. This reflects an adaptive memory change in which memories are flexible enough to incorporate new information; however, it can show up as an error when false information is incorporated into the memory (Schacter et al., 2011). By identifying these patterns, we can begin to see how certain memory "errors" are only errors with respect to veridical memory, not flaws in how the memory system is working. This perspective makes clear that both the form and precision of memory measurement, and an analysis method that minimizes assumptions about what counts as a memory error, are critical. The view that memory errors can be adaptive is key to understanding how memories may change in response to interference. For example, the act of retrieval can lead to forgetting in the context of interference. In one of the main paradigms used to study this effect, participants study the association between an item and a semantic category (Anderson et al., 1994; Hulbert & Norman, 2015). Then, in a retrieval practice phase, half of the items from half of the semantic categories are cued for retrieval. After several rounds of practice, retrieval of all items is tested based on the cue. Unsurprisingly, participants have a better memory for the items that are practiced (RP+) compared to control items from categories that were not practiced (Nrp). The key, consistent finding from this paradigm is that unpracticed items from the practiced categories (RP-) are recalled less well than the control items (Nrp). This effect is known as retrieval-induced forgetting (RIF; Anderson et al., 2000; Bauml, 2002). This memory error is quite often adaptive (Bjork, 1989). For example, consider learning two different techniques for cooking scrambled eggs. There are several differences, but one key difference is that in J. Kenji Lopez-Alt's version you add salt at the beginning, whereas in Gordon Ramsay's version you add salt at the end. Later, when you practice Lopez-Alt's version, it would be disruptive to accidentally recall Ramsay's admonishment not to add salt at the beginning.
That is an example of a memory error, caused by interference, that is not simply all-or-none. Critically, in this case, if you are practicing one egg technique, forgetting the other technique is adaptive and increasingly likely with more experience practicing the recipe.

Interference resolution

Forgetting is not always adaptive though; sometimes you want to remember both of the competitive items. Fortunately, this sort of memory interference can be overcome. In fact, overcoming this sort of interference can actually strengthen the once-forgotten items (Storm et al., 2008). Storm and colleagues demonstrated this by interleaving a relearning phase with the retrieval practice phase in a RIF paradigm. They found the typical RIF effect for RP- items that were not relearned. However, this forgetting actually enhanced learning for those items: for RP- items that were relearned, memory was better than for control items that were studied the same number of times. This suggests that the memory change involved in RIF is more than simply strengthening or weakening memory signals or retrieval routes; there is likely something more complex going on that works adaptively to prioritize certain memories when circumstances suggest that is beneficial, but also allows for correct retrieval of two or more competing memories when that is beneficial. A neural network model that can explain this effect comes from Hulbert and Norman (2015). They model a memory as a combination of its constituent features, with each feature as a node in the network (Fig. 1.1). This means that when items are competitive with one another due to similar features, many of their nodes may overlap. Under this model, it is the representational overlap that causes interference. In a RIF paradigm, when one item is retrieved, the features of that memory are activated and the connections between those nodes become strengthened. This enhanced memory strength increases the likelihood that the RP+ item will be correctly remembered later. At the same time, because there are shared features with the competitive RP- item, that item also becomes weakly reactivated. This weak activation works to weaken the connections between the unique features of the RP- item and the features shared with the RP+ item. This process decreases the likelihood that the RP- item will be correctly recalled later. This is especially true if the cue is related to the shared features; another retrieval route may be needed. This explains the typical RIF effect. If the RP- item is then relearned, the unique features are strongly activated, and other features that were not a strong part of the original representation may become more emphasized. This activation strengthens the connections among this new constellation of nodes. Over time, through interleaved practice, both memories can become strong, but with more distinct neural representations. This process of separating the memory representations over the course of learning is known as pattern differentiation or repulsion.

Figure 1.1. Simplified summary of the Hulbert and Norman (2015) model. Left: The memory of an event can be thought of as a constellation of individual features (circles). When two events are similar, they likely have many shared features (purple) and may interfere with one another. Right: Over the course of interleaved learning, the model predicts that the unique features of each event (red: event 1, blue: event 2) become strengthened and the shared features become weakened.
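To make this dynamic concrete, below is a minimal toy sketch of the core idea (my own illustration, not the published network model): features are units, retrieval strongly activates the target's features while only moderately activating the competitor's unique features, and connection weights follow a U-shaped ("nonmonotonic") plasticity rule in which strong co-activation strengthens and moderate activation weakens. The unit assignments, activation levels, and learning rates are all arbitrary choices made for illustration.

```python
import numpy as np

n_units = 12
shared = [0, 1, 2, 3]        # features common to both events (purple in Fig. 1.1)
unique1 = [4, 5, 6, 7]       # features unique to event 1 (red)
unique2 = [8, 9, 10, 11]     # features unique to event 2 (blue)

W = np.zeros((n_units, n_units))  # symmetric feature-to-feature connection weights

def nonmonotonic_update(W, act, lr_up=0.05, lr_down=0.15):
    """U-shaped plasticity: pairs of strongly co-active units are strengthened;
    pairs where one unit is only moderately active are weakened; pairs
    involving an inactive unit are left unchanged."""
    for i in range(n_units):
        for j in range(i + 1, n_units):
            lo, hi = sorted((act[i], act[j]))
            if lo > 0.8:                    # both strongly active
                W[i, j] += lr_up
            elif lo > 0.2 and hi > 0.8:     # moderate paired with strong
                W[i, j] -= lr_down
            W[j, i] = W[i, j]
    return W

def retrieval_activation(target_unique, competitor_unique):
    """Retrieving one event fully activates its own features, while the
    competitor's unique features are only moderately co-activated via the
    shared features (the interference-driving state)."""
    act = np.zeros(n_units)
    act[shared] = 1.0
    act[target_unique] = 1.0
    act[competitor_unique] = 0.5
    return act

# Interleaved practice of the two competitive events
for _ in range(5):
    W = nonmonotonic_update(W, retrieval_activation(unique1, unique2))
    W = nonmonotonic_update(W, retrieval_activation(unique2, unique1))

# Connections among each event's unique features grow, while connections
# between shared and unique features shrink on balance: the two memories
# come to rely on distinct retrieval routes (the repulsion dynamic).
print("within unique features:", W[np.ix_(unique1, unique1)].mean())
print("shared-to-unique:      ", W[np.ix_(shared, unique1)].mean())
```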
Although the Hulbert and Norman (2015) model makes sense from a theoretical or computational perspective, there is limited experimental evidence for all aspects of their account. They predicted that repulsion would be detected in the form of lower similarity in hippocampal representational patterns. In their study, they found that greater repulsion in the left hippocampus was associated with greater learning on the RP- items relative to control items. Thus, they established initial evidence for a relationship between repulsion and learning. Importantly, they had no measurement of memory features, and thus no way of evaluating that aspect of their model. Others have expanded on this initial evidence. One study that used an associative learning task found that the hippocampal representations of similar scene images were driven apart (repulsion) through learning (Favila et al., 2016). Critically, these neural changes were associated with interference reduction. Another study tracked hippocampal representations as participants learned routes (Chanales et al., 2017). The key manipulation was that each route shared an overlapping portion with one other route, making those two memories competitive with one another. They found that, prior to learning, the hippocampal representations of overlapping routes were as similar to one another as those of non-overlapping routes. This makes sense according to a pattern separation account (see Neural origins of interference, above). Critically, the hippocampal representations of overlapping routes became more dissimilar over the course of learning, consistent with a repulsion account. This effect was specific to more difficult trials, i.e. where interference needed to be overcome. Both of these studies could be explained by the Hulbert and Norman (2015) model, but without reference to specific features, other explanations cannot be ruled out. More recently, one study looked at pattern similarity changes between competitive scene images over the course of learning (Wanjia et al., 2021). They found that the repulsion effect was specific to the CA3/dentate gyrus subfield (as opposed to CA1), consistent with theories of hippocampal function (see Neural origins of interference, above). As evidence of the adaptive impact of these changes, the effect occurred specifically when interference was overcome and was strongest for items with the greatest initial interference. As an initial way to link repulsion to memory features, the repulsed activity patterns carried relatively more information about the correct compared to the incorrect learned association. Another recent study has also shown evidence for a shift in information after interference resolution (Zhao et al., 2021). Here, the authors focused on changes in cortical regions and identified regions where more information about the unique features of competitive items was represented. They did not find evidence for overall repulsion, but these results are consistent with the idea of interference resolution involving shifts in the representation of specific memory features. Overall, there is strong and growing evidence that interference resolution, at least in part, involves a shift in neural representations such that competitive items are further apart in representational space. The extent to which this shift is consistent with the ideas of Hulbert and Norman (2015) remains unresolved. One alternative that could potentially explain these results is the idea of hippocampal remapping.
It has long been documented in animal studies that there are place cells in the hippocampus that preferentially fire in certain locations. When these animals are put in a new context, or believe they are in a new context, there is a rapid "remapping" where cells fire for new preferred locations (Bostock et al., 1991; Colgin et al., 2008; Muller & Kubie, 1987; Wills et al., 2005). Although the evidence for this phenomenon is strongest in navigation, there is growing evidence that the same logic could apply to any content the hippocampus represents (Colgin et al., 2008; Wanjia et al., 2021). Thus, another way interference could be resolved is through the association of distinct internal contexts with competitive stimuli. That is, these neural changes in the hippocampus could be unrelated to feature changes. Thus, there are multiple potential models that are consistent with the idea of neural repulsion—and that is focusing specifically on the hippocampus, when other regions are likely playing a role as well (Zhao et al., 2021). To provide evidence that these neural changes are related to memory features, we must establish whether they have an impact on how the item is remembered (not just whether it is remembered). Further, we need a way to decode these pattern similarity shifts into a meaningful feature space that corresponds to how the memories themselves may change.

Measuring the feature space of memory

Behavioral measures of memory content

Most long-term memory studies have focused on whether an entire event is remembered or forgotten (Cooper & Ritchey, 2019). However, memory is not an all-or-none event; memories can be measured in much more informative ways. For example, participants can rate a recalled item on a self-report scale in terms of how confident they are in their retrieval or how vividly it was retrieved (Kuhl & Chun, 2014; St-Laurent et al., 2015; Bonnici et al., 2016; Ford & Kensinger, 2016). Measuring perceived vividness captures the idea that memories can be recalled at more of a gist level, or with much greater detail (Brady et al., 2008; Schacter et al., 2011). What perceived vividness fails to capture is that the individual features of a memory can vary in vividness and may be distorted. Memories are best understood as a multi-dimensional constellation of features (Cooper & Ritchey, 2019; Horner & Burgess, 2013; Horner & Burgess, 2014; Xue, 2018). Therefore, researchers have increasingly begun to measure long-term memories along continuous feature dimensions, probing features such as location on the screen (Berens et al., 2020; Harlow & Yonelinas, 2016; Nilakantan et al., 2018), orientation (Richter et al., 2016), and color (Brady et al., 2013; Chanales et al., 2020). Continuous measures of feature memory are tremendously useful because they can be utilized to estimate both the precision and accuracy of individual memory features. I define precision as a measure of how detailed the memory for a particular feature is. For example, you could remember that a person's eyes are blue, or you could have a more precise memory for a specific shade of blue. Importantly, although precision is often conflated with accuracy, I view the two as independent. That is, you could have a very detailed memory for the eye color and be wrong. I define accuracy as how close to the true value a feature memory is. There are multiple ways to measure precision and bias; one of the most straightforward is to bin the data.
You might create a range around the true value within which a small degree of error is still considered accurate, and then create bins further away to indicate increasing degrees of inaccuracy (e.g. Nilakantan et al., 2017). Another, similar approach is to take the absolute value of the error as a measure of the average distance from the true value. An alternative approach, mixture modeling, views responses as a mixture of multiple underlying distributions (Zhang & Luck, 2008). For example, the overall distribution can be driven by some responses that reflect a successfully retrieved memory with some amount of precision and some responses that reflect random guessing. This analysis is helpful for determining not only the precision and accuracy of memory features, but also how often they are retrieved at all. Regardless of which analysis approach is taken, in the context of interference we might expect there to be distortions in feature memory. Therefore, I view as important not only a measure of accuracy, but also a measure of whether there is any directional bias in the errors made. Again, bias can be independent of precision. Traditionally in studies of memory interference, we might expect the confusion caused by two competing memories to lead to integration, where there are no longer two distinct memories, only a more gist-level recollection of both. This would cause memories to be recalled with a bias towards each other. The Hulbert and Norman (2015) model, however, suggests that over the course of overcoming interference, the differences between competing items become highlighted. Under these circumstances we might see a bias away from a competing item. Measures of precision, accuracy, and bias have been widely used to study the impact of interference in working memory (e.g. Sun et al., 2017). However, they have seldom been used to study long-term memory interference. The stimuli need several important properties for this approach to apply. (1) To test the predictions of the Hulbert and Norman (2015) model, the stimuli need at least two independently measurable dimensions that can each be manipulated to act as a shared or unique feature. (2) These dimensions need to be perceptually-important. For example, two tree images can cause interference with one another, and an experimenter could create an underlying dimension that defines that perceptual difference. However, that dimension would not be meaningful to participants based on viewing those two images alone. Therefore, you could not expect to detect changes to memory that align with this latent dimension. In contrast, for perceptually-important dimensions (e.g. color or location) we would expect to be able to detect changes in memory. (3) Interference needs to be restricted to where the experimenter intends it to occur. In working memory studies of interference, trials can be treated independently because information does not need to be retained beyond that trial. In contrast, in long-term memory studies the information needs to be retained throughout the experiment. Thus, for example, utilizing the angle of a grating as a continuous memory measure works when each trial can be treated independently; in a long-term memory study, however, the grating stimuli would all interfere with one another. In Chapter 2 I develop a method that allows for the creation of synthetic face stimuli that meet these criteria.
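To make these measures concrete, below is a minimal sketch, using simulated data rather than any of the dissertation's actual analysis code, of how accuracy, binned accuracy, competitor-referenced bias, and precision can be computed from continuous feature responses. The dimension scaling, tolerance window, and noise levels are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated trials: each studied item has a true value on a continuous
# feature dimension (scaled 0-1 here) and a close competitor on the same
# dimension, as in a high-interference design.
true_vals = rng.uniform(0.3, 0.7, size=200)
competitors = true_vals + rng.choice([-0.1, 0.1], size=200)
responses = true_vals + rng.normal(0.0, 0.05, size=200)  # recalled values

errors = responses - true_vals

# Accuracy: average absolute distance from the true value.
accuracy = np.abs(errors).mean()

# Binned accuracy (cf. Nilakantan et al., 2017): proportion of responses
# falling within a tolerance window around the true value.
binned_accuracy = (np.abs(errors) <= 0.05).mean()

# Bias: sign each error relative to the competitor, so that positive values
# indicate recall pulled toward the competitor (integration) and negative
# values indicate recall pushed away from it (repulsion).
toward = np.sign(competitors - true_vals)
bias = (errors * toward).mean()

# Precision: the spread of responses around their own mean error,
# independent of whether that mean is shifted (i.e., biased).
precision = 1.0 / errors.std(ddof=1)

print(f"accuracy={accuracy:.3f}  binned={binned_accuracy:.2f}  "
      f"bias={bias:.3f}  precision={precision:.1f}")
```

A mixture-model analysis (Zhang & Luck, 2008) would go one step further, modeling the error distribution as a mixture of a precise memory component and a uniform guessing component, thereby also estimating how often the feature is retrieved at all.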
In Chapter 3 I demonstrate an approach to analyzing memory feature data in a way that decouples accuracy and precision. With the view that memory distortions often serve an adaptive function, I link both bias and the accuracy-independent measure of precision for a specific feature with improved performance on a separate measure of memory.

Neural measures of memory content

Episodic memories are supported by a broad pattern across many cortical regions. Successful retrieval of those memories is associated with reactivating neural patterns similar to those originally elicited by the event (Kuhl et al., 2011; Xue et al., 2010; Zeithamova et al., 2012). Recent advances in fMRI data analysis have improved the ability to characterize neural representations in terms of the specific information they represent (Cohen et al., 2017; Davis & Poldrack, 2013; Norman, Polyn, et al., 2006; Rugg et al., 2002). This can help determine differences between brain regions and, of particular interest here, how information is transformed over the course of processing and, perhaps, over time through learning. Multiple fMRI analysis approaches can to some degree measure the content of memories, including univariate activation and adaptation (Davis & Poldrack, 2013). Of most interest here are multi-voxel pattern analysis (MVPA) techniques that are driven not by the activation levels of individual voxels or regions alone, but by distributed representational patterns (Kriegeskorte et al., 2008). These approaches tend to have the greatest sensitivity, particularly when attempting to delineate representations in multi-dimensional feature spaces (Davis & Poldrack, 2013). One approach that has proved particularly effective in this domain is representational similarity analysis (RSA). RSA involves creating a dissimilarity matrix between pairs of stimuli or conditions based on neural activity within a brain region (Kriegeskorte et al., 2008). This approach utilizes all available informational content and maps it to a common representational space that can be compared across regions or time, or compared to stimulus feature spaces. In condition-rich designs where there are many unique stimuli, this approach can be very effective at mapping a complex representational space (Drucker & Aguirre, 2009; Kriegeskorte et al., 2008; Nestor et al., 2016). In experimental contexts where a small number of competitive stimuli need to be learned, however, it may be more difficult to make that type of mapping. A particular form of this approach has recently been used to focus on the similarity of competitive pairs (rather than a full feature space), tracked over the course of learning (Chanales et al., 2017; Wanjia et al., 2021). RSA and other approaches (e.g. Chadwick et al., 2011) that distinguish items without reference to specific features are adept at distinguishing similar items in the hippocampus. Techniques such as multi-dimensional scaling (MDS) can further help to visualize and interpret representational changes in the form of a feature space. However, these MDS features do not necessarily correspond to interpretable feature memory changes. Further, the ability to distinguish competitive items in other cortical regions—where these representations are shifted to in the long term and where the feature information is reactivated—may be more limited with this approach. The MVPA approach that this dissertation focuses on is decoding.
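Before elaborating on decoding, the following is a minimal sketch of the RSA computation described above. The data here are random placeholders (a hypothetical voxel-by-stimulus pattern matrix and a hypothetical stimulus feature space); the matrix sizes and distance metrics are illustrative choices, not the specific pipeline used in this dissertation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical inputs: one activity pattern per stimulus for a single ROI
# (e.g. single-trial beta estimates), plus each stimulus's position in a
# feature space (e.g. faces along manipulated feature dimensions).
n_stimuli, n_voxels = 20, 500
patterns = rng.normal(size=(n_stimuli, n_voxels))
features = rng.normal(size=(n_stimuli, 4))

# Neural representational dissimilarity matrix (RDM):
# 1 - Pearson correlation between every pair of stimulus patterns.
neural_rdm = squareform(pdist(patterns, metric="correlation"))

# Model RDM: pairwise distances between stimuli in the feature space.
model_rdm = squareform(pdist(features, metric="euclidean"))

# Compare the two spaces over the unique stimulus pairs with a rank
# correlation, since the scales of the two RDMs are arbitrary.
lower = np.tril_indices(n_stimuli, k=-1)
rho, p = spearmanr(neural_rdm[lower], model_rdm[lower])
print(f"neural-model RDM correlation: rho = {rho:.2f} (p = {p:.2f})")
```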
Decoding puts the output into a meaningful dimension that can correspond to hypotheses about how memory features change in response to interference. Decoding approaches have the potential to be more powerful because they do not weight all voxels equally (as RSA does). Although decoding may have first appeared to have limited power (Carlson et al., 2003; Cox & Savoy, 2003; Haxby et al., 2001), the upper limits of its power and specificity have continuously been pushed (Huth et al., 2016; Mozafari et al., 2020; VanRullen & Reddy, 2019). This has included increasingly complex output from complex stimulus classes (Dado et al., 2022; Lin & Hsieh, 2022). Thus, this approach has the tantalizing potential to bring meaning to shifts in overall neural pattern similarity. Decoding could be a powerful approach to bring meaning to the shifts detected through similarity-based approaches that have been used to find repulsion in the context of memory interference (see Interference resolution, above). If these similarity shifts are driven by changes in feature information, then these shifts could be decodable. Similarity-based approaches have been demonstrated to detect small but meaningful shifts in neural representations over time; it is unclear whether decoding will be able to reliably measure small shifts, but if it could, the implications would be quite powerful. In Chapter 2 I develop a set of dimensions that describe a large set of stimuli that are good candidates for use in the study of memory interference (faces). In Chapter 4 I describe an approach to decode those dimensions from fMRI activity.

Goal and structure of the dissertation

The primary goal of the dissertation is to develop and validate an approach to studying how the features of memories change in response to interference. In Chapter 2 I focus on the development and validation of the methodology. In Chapters 3 and 4 I focus on specific applications of this methodology. I conclude in Chapter 5 with a discussion of how this approach can be leveraged going forward in the context of long-term memory interference and in cognitive neuroscience more generally. In Chapter 2, I document the development and initial validation of a set of face stimuli standardized for use in psychology experiments, accompanied by a number of useful metrics and the ability to control and manipulate the stimuli. Face stimuli are a perfect stimulus class for use in long-term memory studies with high interference because humans are experts at processing faces and can later remember fine-grained differences as distinct identities.
In order to eventually apply this methodology to decoding neural representations in the context of interference, we first need to establish the ability to decode and to identify the most decodable features. Thus, in Chapter 4, I investigate the ability to decode face features from perceptual data. I discuss differences in the ability to decode different data-driven face components and subjective ratings, and how those differences vary between brain regions. I conclude in Chapter 5 with a summary of our results and the broader implications of those findings. I also discuss how the approaches applied in Chapters 3 and 4 can be utilized in concert moving forward. The goal is to help bridge the gap between neural and behavioral perspectives on resolving memory interference. A convergent approach is key to understanding how memory changes in response to interference support the vast human memory capacity.

Chapter II
STANDARDIZED SET OF 1,148 FACE STIMULI WITH LANDMARKS, SORTING, AND RATING DATA

This chapter contains unpublished co-authored material. Maxwell L. Drascher is the primary author of this chapter with input from his advisor Brice A. Kuhl. Drascher and Kuhl designed the study together. Drascher conducted all data collection and wrote the scripts for experiment presentation, data analysis, and figure creation. Drascher wrote the manuscript with editorial assistance from Kuhl.

Introduction

Face images represent a unique category of visual stimuli in that they are relatively uniform and contain common features, yet humans can still perceive and later remember subtle differences. Human expertise with faces also makes those subtle differences measurable and amenable to parameterization. These properties make faces an appealing class of stimuli that can be leveraged to study a broad range of cognitive domains. Developing the ability to measure and manipulate faces along distinct dimensions is key to unlocking the full potential of face stimuli in experimental settings. Faces are composed of many features, including measurable physical dimensions such as eye color and skin tone. Faces also vary along high-level, socially-relevant dimensions such as gender, trustworthiness, and dominance. Even these seemingly more abstract dimensions are perceived similarly across different participants, which makes them measurable (Oosterhof & Todorov, 2008). However, the ability to tightly control and reliably measure higher-level face dimensions is not as straightforward as for many of the most commonly used feature spaces in cognitive research, such as color (e.g. Bays et al., 2009; Chanales et al., 2021; Zhao et al., 2021; Zhang & Luck, 2008), orientation (e.g. Haynes & Rees, 2005; Kamitani & Tong, 2005; Korkki et al., 2020; Pertzov et al., 2017), or location on the screen (e.g. Berens et al., 2020; Harlow & Yonelinas, 2016; Nilakantan et al., 2017). One approach to manipulating faces is to use actors showing different facial expressions (e.g. Benda & Scherf, 2020; Chung et al., 2019; Conley et al., 2018; Ebner et al., 2010; Engell & Haxby, 2007; Furl et al., 2013; Said et al., 2010; Thomas et al., 2001; Tottenham et al., 2009). However, this approach does not directly create a continuous feature space. One method of generating a continuous face space is morphing, which utilizes two or more specific face images to generate a continuous space between them (Steyvers, 1999; Leopold et al., 2001).
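At its core, morphing can be thought of as interpolation between parameterized faces. The following is a minimal Python sketch of that logic (a convex combination of two face vectors; this illustrates the general idea, not the implementations used in the studies cited above):

```python
import numpy as np

def morph(face_a, face_b, t):
    """Linear morph between two parameterized faces; t in [0, 1] sweeps the continuum."""
    return (1.0 - t) * face_a + t * face_b

# Stand-ins for two faces represented as feature vectors (e.g., landmark positions)
rng = np.random.default_rng(8)
face_a = rng.normal(size=124)
face_b = rng.normal(size=124)

# Seven evenly spaced steps along the morph continuum from face_a to face_b
continuum = [morph(face_a, face_b, t) for t in np.linspace(0.0, 1.0, 7)]
```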
Morphing can generate a continuous dimension between two different emotional expressions of the same face (e.g. Arsalidou et al., 2011; LaBar et al., 2003; Sato et al., 2004; Won et al., 2020), but this space may not align well with the true space between those emotions (Hays et al., 2020). It can also generate a dimension between any two specific face images. This type of dimension is a useful tool in experimental designs where it can be implicitly learned during an experiment, for example in a category learning paradigm (e.g. Ashby et al., 2020; Goldstone et al., 2001). However, it is less useful for generating perceptually-important dimensions (e.g. affect). Alternative approaches to parameterizing faces generate dimensions based on the variance in base image properties across a large pool of faces. One early approach was eigenfaces, which are generated from a principal component analysis (PCA) on the pixel intensity values across three color channels (Turk & Pentland, 1991). This approach is powerful given the high degree of information it can capture with a limited number of components. However, because the components are completely data-driven, the individual components are highly influenced by the stimuli used, are difficult to interpret, and are often poorly aligned with features important for face perception. This approach was subsequently improved with the active appearance model (AAM; Chang & Tsao, 2017; Cootes et al., 2001; Edwards et al., 1998). The AAM is more labor-intensive, with the requirement of landmarking the face stimuli, but yields greater reproduction of the face images with fewer components. This approach also yields components that do not necessarily align with human perception of faces; however, they tend to be more interpretable, both individually and through the inclusion of two broad component groupings (shape and appearance). Both eigenfaces and the AAM create measurable and manipulable face dimensions; however, they lack a clear, innate connection to features important to face perception. An alternative approach is to use artificially generated and manipulable face images (e.g. Roesch et al., 2011). For a long time, the capabilities of these types of stimuli were limited and they were often unavailable to researchers; however, the available technology is becoming increasingly realistic (e.g. Hays et al., 2020; Peterson et al., 2022). For the tightest experimental control, one of these options may be optimal. However, there is evidence that synthetic faces may be processed differently than real faces (Balas & Pacella, 2015; Schindler et al., 2017; Wheatley et al., 2011). Thus, although synthetic images are extremely valuable, they should be used with caution, especially when studying face processing specifically. Until it is demonstrated that there are no differences in behavioral and neural responses, real face stimuli will remain a valuable resource for cognitive scientists. Researchers looking to use real face stimuli in their experiments have many options to choose from (e.g. Bainbridge et al., 2013; Benda & Scherf, 2020; DeBruine et al., 2017; Ebner et al., 2010; Ma et al., 2015; Minear & Park, 2004; Walker et al., 2018). Although many of these databases have historically lacked diversity, they are becoming increasingly diverse (e.g. Chen et al., 2021; Chung et al., 2019; Conley et al., 2018; Lakshmi et al., 2020; Ma & Wittenbrink, 2020).
All of these databases contain information on certain stimulus properties; however, depending on the needs of a particular experiment, the list of compatible databases may be narrowed significantly or may not exist at all. This is particularly true when there are multiple, complementary purposes for the stimuli. One important metric that is not often available for face stimuli is an overall measure of similarity. Similarity is a key measure in many experimental designs; however, it is difficult to apply to more complex stimulus classes because similarity on any one dimension or combination of dimensions does not necessarily correspond to overall perceptions of similarity (Jiang et al., 2021). Therefore, data specifically meant to capture overall similarity, such as sorting faces into groupings, is required. Another important metric that is often not available is facial landmarking. When looking to apply the AAM, the list of face database options either is limited substantially (e.g. Koestinger et al., 2011; Milborrow et al., 2010) or requires the labor-intensive process of manually landmarking a new stimulus set. Although properties like these are available in some circumstances, when selecting face stimuli there are cases where it would be beneficial to use the same stimuli for many purposes. Thus, a large database that contains not only subjective ratings but also less commonly available information, such as sorting data and landmark positions, may be optimal. The current manuscript describes the data collection and validation process for a broad set of data describing face stimulus properties, and offers advice for future applications based both on previously published uses (Drascher & Kuhl, 2022) and on potential future uses. The ultimate goal is to create a freely accessible resource of face stimuli with accompanying data that facilitates their use across a broad array of purposes. The database contains a total of 1,148 faces, all forward facing, and cropped to a uniform size and position in the frame. The faces were selected to be diverse in terms of gender, age, ethnicity, and facial expression. The key distinguishing features of this corpus are the breadth and uniqueness of the data available on a face corpus of this size. All images have been independently rated on several important social dimensions, have been sorted based on appearance, and have been hand landmarked. This diversity of information allows for the use of face stimuli with a high degree of experimental control on reliable, perceptually-important dimensions (Drascher & Kuhl, 2022), while being large enough to be used as a training set in neuroimaging-based image reconstruction designs (Lee & Kuhl, 2016). This dramatically increases the utility of the face stimuli by allowing the same stimuli to be used for multiple potential applications and facilitating the bridging of behavioral and neuroimaging findings.

Methods

Face image corpus

A total of 1,148 faces were selected from a variety of online sources (see Lee & Kuhl, 2016; a small number of faces [8] were removed from the 1,156 in that set for having attributes highly distinct from the rest of the set, making them unlikely to be successfully reconstructed). All faces were forward-facing and were cropped and resized to 179 x 251 pixels. The faces were selected to be diverse in terms of gender, age, ethnicity, and facial expression. The full corpus is available at: https://osf.io/4uydh.
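For researchers downloading the corpus, a minimal Python sketch for loading the images and verifying their dimensions is shown below. The local directory and filename pattern are hypothetical assumptions; only the image size and count come from the description above:

```python
from pathlib import Path
from PIL import Image  # Pillow

# Hypothetical local copy of the corpus (https://osf.io/4uydh)
corpus_dir = Path("face_corpus")
image_paths = sorted(corpus_dir.glob("*.png"))  # filename pattern is an assumption

for path in image_paths:
    with Image.open(path) as img:
        assert img.size == (179, 251)           # width x height, as described above
print(f"{len(image_paths)} faces loaded")       # expected: 1148
```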
Face image landmarking

All of the face stimuli were hand landmarked at each of 62 locations (see Fig. 2.1 for an example; see Chang and Tsao (2017) for a similar landmarking scheme). Landmark locations were chosen to represent the overall shape of the face as well as the shape and relative position of internal features. The landmark locations overlap substantially with those used previously (Chang & Tsao, 2017); however, there are many differences that account for differences in the stimulus set. For example, in our scheme, no landmarks were included to track the top of the head because the face images were cropped there; however, landmarks were included to mark the hairline. In our scheme, we also included a high number of landmarks to track eyebrow and mouth shape, in order to capture the variance in expression in this set. In total, 9 landmarks tracked the cheek and jaw line, 10 tracked eye shape, 12 tracked eyebrow shape and position, 16 tracked mouth/lip shape, 11 tracked nose shape, and 3 tracked the hairline. Most critically, after initially creating the landmarks, the positions were adjusted through piloting in order to capture the variance within the stimulus set while also being able to be consistently applied across stimuli and different raters.

Figure 2.1. Example of the 62 landmark positions on one of the face images. The positions were designed to capture the variance in the shape of faces in the corpus (see https://osf.io/4uydh/). This landmarking was completed on 1,148 unique images.

The consistency of landmark positions was measured with a series of two-way mixed-effects, agreement intraclass correlation coefficients (ICCs) on the vertical and horizontal position of each landmark (124 total). The two-way mixed-effects model treats the items as random but the effect of rater as fixed; we opted for this model because we were not interested in generalizing our findings to other potential raters. Agreement ICCs were used because the consistency of the absolute position of the landmarks is critical. After piloting, the majority of landmarks had high reliability within and between raters (most landmarks above 0.75). A small number of landmarks maintained low ICCs (below 0.4); however, we attribute this to those landmark locations having low variability between images (due to properties that were standardized in the set, such as face position), which led to poor reliability as assessed by ICC despite small absolute differences between raters. Eight research assistants were then trained on the landmarks of one of the original two raters. During training, performance was assessed with a series of ICCs compared to the original rater and/or other raters who had already completed training. This helped pinpoint landmark locations that needed to be fine-tuned for each rater. Training lasted until the ICC was high (consistently above 0.75 for most landmarks, excluding the low-reliability landmarks described above). Six of the eight research assistants completed training with high reliability and collectively landmarked the remaining images. In order to continually evaluate performance and consistency between raters, images were periodically repeated across different raters. In total, 510 images were landmarked by two or more raters and 36 were landmarked by three or more.
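For reference, the agreement ICC used throughout this section can be computed from the two-way ANOVA mean squares; this corresponds to ICC(A,1) in the notation of McGraw and Wong (1996), which with a fixed set of raters is computed identically to the random-effects version. The following is a minimal Python sketch with simulated data, not the analysis code used for this chapter:

```python
import numpy as np

def icc_agreement(ratings):
    """Single-measure agreement ICC, ICC(A,1), for an items x raters array."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-item means
    col_means = ratings.mean(axis=0)  # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = ratings - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Example: one landmark coordinate for 54 images, each placed twice
rng = np.random.default_rng(2)
true_pos = rng.uniform(0, 251, size=54)
placements = np.column_stack(
    [true_pos + rng.normal(scale=2.0, size=54) for _ in range(2)]
)
print(icc_agreement(placements))  # near 1 when absolute positions agree
```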
In cases where large differences were identified between raters (greater than 5 SDs on one landmark), visual inspection revealed that in the vast majority of cases the large differences reflected an ambiguous property of the stimulus (e.g. whether dark pixels represent a shadow or the continuation of an eyebrow) rather than an error. After landmarking was completed, we made small automatic adjustments to certain landmark positions. Specifically, landmarks that fell near the edge (e.g. along the jawline) were sometimes placed just outside the image range; those landmarks were automatically shifted back within the image. Additionally, in order to handle landmarks for the bottom lip accidentally being placed slightly above those for the top lip (in cases where the image had closed lips), those landmarks were shifted 2 pixels apart vertically. In instances where stimuli were landmarked by multiple raters, we used the average landmark position after this initial preprocessing. A small number of stimuli (54) were landmarked multiple times by one rater. These repeated landmarks were averaged together prior to averaging with landmark positions from other raters.

Active appearance model application

Application of the active appearance model (AAM) was similar to prior approaches (Chang & Tsao, 2017; Cootes et al., 2001; Edwards et al., 1998; Van Ginneken et al., 2002); however, the approach needed to be modified for application to color images. First, a principal component analysis (PCA) was run on the vertical and horizontal positions of the 62 landmarks across all 1,148 images. This generated shape components and the mean face shape. Many of these components likely reflect noise; thus, we filtered out the bottom 1% of components in terms of variance explained, leaving 61 shape components. In order to generate appearance components, each stimulus was smoothly warped using inverse weighted interpolation (with a radius of 10 and power of 5) to match the mean face shape (Bookstein, 1989). Warping of the images was applied in MATLAB (adapted from Archibald, 2009). The process of warping the images to the mean shape created a set of shape-free face stimuli. A PCA was then performed on red/green/blue intensities across all pixels (179 x 251) of the shape-free faces. This generated a set of 1,148 appearance components; we retained the components that explained 99% of the variance, leaving 753 appearance components. The AAM was applied in MATLAB using a modified version of the Active Shape Model (ASM) and Active Appearance Model (AAM) package (Kroon, 2012).

Subjective similarity sorting

Procedure. This data was collected on a slightly larger sample of face stimuli, prior to the removal of eight images (see Face image corpus). A group of six research assistants completed this task. On each trial, a random sample of 65 face stimuli was presented on the screen simultaneously (Fig. 2.2). Sorters were instructed to group the stimuli based on “which faces looked more closely genetically related to one another (i.e. the faces grouped together are more likely to have common ancestors).” We used this language so that similarities or differences due to gender, age, and hairstyle would play a minimal role in the sorting. Faces were placed into groups by clicking the mouse to put a square of a certain color around each face. Each color was associated with a number on the keyboard (0-9), which allowed the sorters to change the color/group.
There were no restrictions on how many stimuli needed to be in a group; however, every face needed to be put into a group, even if it was by itself. With each new set of stimuli, sorters could group the faces into up to ten groups but were not required to use all ten. The group numbers were treated as independent on each trial, so if a stimulus was in group 1 on one trial, that could be ignored on subsequent trials. The only relevant information was which stimuli were grouped together. There was no time limit or restriction on the ability to switch groupings. Pressing the return key sorted the faces into the groups visually, giving the sorters the chance to make any changes. When they were finished, they pressed space to proceed to the next trial.

Figure 2.2. Example trial from the sorting task. A total of 65 images were presented on the screen at a time. Participants used the mouse to click on an image to put a colored box around it, with the color indicating one of ten possible groups (top). The color was switched by pressing the corresponding number on the keyboard. Participants were required to sort every face into a group, with no restrictions on how many groups were used or the size of a group.

Faces were presented in pseudo-random sets such that, across all sorters, every face was presented with every other face at least once. This design allowed for the creation of a dissimilarity matrix (collapsed across sorters) that had information from at least one trial for every pairwise combination in the set (1,335,180 matrix cells; 667,590 unique pairs). The overall order of the trials was random, as was the position of the images on the screen. The images were displayed at full size (251 x 179 pixels) in 13 columns and 5 rows on a 27-inch iMac screen.

Reliability analysis. Using the sorting data collapsed across all sorters, we generated a dissimilarity matrix across all stimuli, with each cell calculated as 1 minus the proportion of times a pair of stimuli was grouped together when the two appeared in the same trial. In order to measure reliability across all raters, we generated dissimilarity matrices based on every combination of 3 sorters (a total of 20 dissimilarity matrices). For each dissimilarity matrix in this set, there was one complementary matrix based on the 3 remaining sorters. We then calculated the correlation between all 10 non-overlapping pairs of matrices.

Subjective ratings

Participants. Face ratings were collected online via Amazon Mechanical Turk (MTurk). A total of 111 MTurk participants completed the rating task. All participants were located in the United States, were at least 18 years old, and had an MTurk job approval rate of 0.9 or above. Informed consent was obtained in accordance with procedures approved by the University of Oregon Institutional Review Board. Participants were paid $2.50 for completing their ratings. We set a goal of obtaining 5 unique raters for each stimulus and rating type, for a total of 100 participants (each participant rated 25% of the stimuli). Participants were removed based on a set of exclusion criteria that ensured compliance and effort in the task (see below). Participants who were excluded from further analysis were replaced until we reached this recruitment goal. A total of 11 participants were removed based on our exclusion criteria. We excluded participants who responded too quickly (less than 500 ms) on a high percentage (greater than 15%) of trials.
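A minimal Python sketch of this screening logic is shown below; it implements the response-time criterion just described, along with the rating-consistency criterion described in the next paragraph. The function and variable names are hypothetical, and the thresholds are the ones reported here:

```python
import numpy as np

MIN_RT_MS = 500       # responses faster than this are suspect
MAX_FAST_PROP = 0.15  # exclude if more than 15% of trials are too fast
MIN_REPEAT_R = 0.4    # exclude if repeated ratings correlate below 0.4

def passes_screening(rts_ms, first_ratings, repeat_ratings):
    """Return True if a participant passes both exclusion criteria.

    rts_ms: response times for all trials; first_ratings/repeat_ratings:
    this participant's ratings of the repeated stimuli, in matching order.
    """
    too_fast = np.mean(np.asarray(rts_ms) < MIN_RT_MS) > MAX_FAST_PROP
    repeat_r = np.corrcoef(first_ratings, repeat_ratings)[0, 1]
    return (not too_fast) and (repeat_r >= MIN_REPEAT_R)
```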
The response-time criterion was intended to exclude participants who were not engaging with the task and were simply clicking quickly to get to the next trial; two participants were removed based on that criterion. Additionally, we excluded participants who were inconsistent in their ratings of repeated stimuli (r < 0.4); nine participants were removed based on that standard. The variability in responses was individually examined for each participant to ensure that responses varied and that there were no systematic patterns in the responses (e.g. repeatedly clicking the same number on consecutive trials). No additional participants were removed based on systematic response patterns.

Procedure. On each trial, participants were presented with one face stimulus in the center of the screen. Below each stimulus were nine buttons representing the range of the rating scale. Participants were instructed to rate each face based on their personal opinion by using the mouse to click the corresponding number (1-9) on the screen. Each participant made ratings on one of five dimensions: dominance, trustworthiness, attractiveness, happiness, or masculinity/femininity. In the first four cases the prompt was, “how dominant/trustworthy/attractive/happy is this person?”, with 1 labeled as “not at all” and 9 labeled as “extremely”. For masculinity/femininity, 1 was labeled as “extremely feminine” and 9 was labeled as “extremely masculine”. This data was collected on the same slightly larger set of faces as the sorting task. Each participant was randomly assigned to one of four possible stimulus pools, each containing 25% of the stimuli. Participants were presented with a total of 289 unique stimuli, with 10% (29) randomly repeated, for a total of 318 trials.

Reliability analysis. Responses made more quickly than 500 ms were presumed to be errant and were removed from all analyses; this occurred for 23/100 participants (excluding participants already removed), but very few times (M = 5.30 ± 10.86). We assessed reliability both within and across participants. Reliability within participants was measured with a correlation between repeated images and with an average of the absolute differences. We then averaged these results for each rating type. In order to assess the inter-rater reliability of these ratings, we calculated a two-way random-effects ICC on the consistency amongst raters. Here, unlike with the landmark positions, we did not use agreement ICCs because we intended to standardize the ratings, so the absolute rating number was not important, only the relative positions of the face stimuli. We ran this analysis separately for each rating type and stimulus pool, and then averaged the findings for each rating type. Missing data were omitted listwise.

Results

Landmark validation

As an initial test of reliability, one rater landmarked 54 images twice. As a measure of test-retest reliability, we ran a series of two-way mixed-effects, agreement ICCs on the vertical and horizontal position of each landmark (124 total) for the repeated images. The ICC was high across all landmark positions (M = 0.93 ± 0.091, Median = 0.95, range: 0.18-0.99; 99.2% above 0.4), indicating high reliability for this rater. As a measure of the reliability of the landmarks across raters, we used the same ICC analysis but with different raters rather than repeated images from the same rater.
The stimuli that overlapped between raters differed across pairs, so we calculated the ICC separately for every pairwise combination of raters who had overlap in the stimuli they landmarked. Out of 21 possible pairwise combinations of raters, there were 15 combinations with overlapping images, with each rater included at least twice. Among the 15 combinations, there was substantial variability in the number of overlapping stimuli (M = 102.33 ± 92.68, Median = 79, range: 3-305). For each pairwise set of raters we calculated the mean across all 124 ICCs. The ICC was consistently high (M = 0.78 ± 0.06, range: 0.68-0.87). We also calculated the percentage of ICCs within each pair that were above 0.4; this percentage was consistently high (M = 92.0% ± 6.2%, range: 80.7-99.2%), suggesting that for the vast majority of landmarks there was no concern about reliability. In fact, we calculated the same statistic for the percentage of ICCs above 0.75 and found that the majority of positions were reliably high (M = 71.1% ± 10.4%, range: 51.6-87.1%). Although there were a small number of landmark positions with low ICCs, these were the landmarks identified during development as having low variance across the stimuli in the set (see Methods and Discussion).

Active appearance model application

Previous uses of the AAM have utilized the top 25 shape and top 25 appearance components, because those components alone capture the majority of the visual variance (Chang & Tsao, 2017). In this instance, the top 25 shape components capture 95% of the variance in landmark position, and the top 25 appearance components collectively capture 79% of the shape-free visual variance (see Figs. 2.3 and 2.4 for a visual representation of the top components). Thus, in this analysis, 50 components explain most of the variance and act as a potentially good cut-point for maximizing efficiency in representing face images. However, depending on the purpose of utilizing the components, other cut-points may make sense. We found that visual reconstructions of face images were strong with fewer than 50 components, but also continued to improve up to and beyond that point (Fig. 2.5).

Figure 2.3. Illustration of the top ten shape components manipulated individually. The center column (black rectangle) is the mean face. Each row shows the mean face manipulated up (right) or down (left) on an individual shape component. The components are ordered from highest (1) to lowest (10) variance explained. Collectively these ten components explain 86% of the variance in landmark positions.

Figure 2.4. Illustration of the top ten appearance components manipulated individually. The center column (black rectangle) is the mean face. Each row shows the mean face manipulated up (right) or down (left) on an individual appearance component. The components are ordered from highest (1) to lowest (10) variance explained. Collectively these ten components explain 69% of the shape-free visual variance.

Figure 2.5. Example of five stimuli reconstructed with differing numbers of AAM components. The left column shows the original image. The columns to the right show the image reconstructed with differing numbers of AAM components included. First, “full AAM” was reconstructed with all 61 shape components and all 753 appearance components. The next four columns show reconstructions with 100 to 10 components included, with half coming from each type. For example, the 100 components column includes 50 shape components and 50 appearance components.
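The logic of fitting the component models and reconstructing faces from a truncated set of components can be sketched as follows (Python with scikit-learn; the analyses for this chapter used MATLAB, and the data here are random stand-ins with the dimensionality of the landmark model):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n_faces = 1148
landmarks = rng.normal(size=(n_faces, 124))  # stand-in for 62 (x, y) landmark pairs

# Shape model: PCA over landmark positions yields components and the mean shape
shape_pca = PCA().fit(landmarks)
scores = shape_pca.transform(landmarks)

def reconstruct(face_scores, k):
    """Rebuild one face's landmark vector from its top-k component scores."""
    return shape_pca.mean_ + face_scores[:k] @ shape_pca.components_[:k]

# Reconstruction error shrinks as more components are included (cf. Fig. 2.5);
# the appearance model works the same way over shape-free RGB pixel values
for k in (10, 25, 50, 100):
    err = np.linalg.norm(reconstruct(scores[0], k) - landmarks[0])
    print(k, round(err, 2))
```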
Subjective similarity sorting

The AAM allows for the creation of artificially generated faces by changing or manipulating the values of the components. These components can be utilized to create stimuli with a controlled amount of similarity on one or more components. Furthermore, the AAM allows for the active manipulation of stimuli during an experiment, either as controlled by the experimenter or interactively with the participant. However, due to the data-driven nature of the AAM, the components do not necessarily map directly onto perceptually-important dimensions. In order to make that mapping, we collected data on the perception of the faces. First, we analyzed data on the clustering of faces based on subjective similarity sorting. A group of six research assistants acted as sorters of the face stimuli. The number of trials each sorter completed ranged from 87-520 (M = 218.67 ± 158.08, Median = 171.5). Each stimulus was presented at least once to 5 of the 6 sorters (the sixth sorter saw 1,148 of the 1,156 stimuli). On average, each stimulus was presented 73.8 times (± 2.91) to each sorter. Across all sorters, the average number of groupings created on each trial was 6.9 (± 1.77) out of a maximum of 10. Using the sorting data collapsed across all sorters, we generated a dissimilarity matrix across all stimuli, with each cell calculated as 1 minus the proportion of times a pair of stimuli was grouped together when the two appeared in the same trial (see Fig. 2.6 for a visualization). In order to assess reliability across all sorters, we first generated dissimilarity matrices based on every combination of 3 sorters (a total of 20 dissimilarity matrices). For each dissimilarity matrix in this set, there was one complementary matrix based on the 3 remaining sorters. As a measure of reliability, we calculated the correlation between all 10 non-overlapping pairs of matrices (Fig. 2.7). Any missing cells in either matrix of a pair were removed from the analysis. Of the 667,590 unique stimulus combinations in the full dissimilarity matrix, most were kept in this analysis (M = 481,317 ± 25,820, range: 438,311-512,307). On average, the correlation between the dissimilarity matrices was 0.46 ± 0.05 (range: 0.37-0.52). This indicates consistent inter-rater reliability regardless of the combination of sorters included. The unexplained variance could reflect differences between sorters; alternatively, it could reflect how sorting is influenced by the unique combination of faces appearing on each trial.

Figure 2.6. Top three multidimensional scaling (MDS) components across all face stimuli. The MDS analysis was run on the dissimilarity matrix generated from sorting the face stimuli.

Figure 2.7. Boxplot of the correlation between dissimilarity matrices generated by every unique split of sorters. Individual correlations are indicated with an “x”. The correlations between the matrices were all near the mean of 0.46 ± 0.05.

As a demonstration of one application of this data, we proceeded to identify clusters of face stimuli that resembled one another most closely. We generated a distance matrix based on the Euclidean distance between the rows of the full dissimilarity matrix. We then performed a hierarchical clustering analysis on this distance matrix, using Ward’s minimum variance method as implemented by the “hclust” function in R (Murtagh & Legendre, 2014).
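A compact Python sketch of this pipeline, from co-grouping counts to clusters, is given below. The chapter's analysis used R's hclust; scipy's ward linkage is a close analogue (Murtagh and Legendre (2014) discuss how R's ward.D and ward.D2 variants differ in detail). The count matrices here are simulated:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
n = 100  # stand-in for the full stimulus set

# Co-grouping counts accumulated over sorting trials (simulated here)
pair_counts = np.triu(rng.integers(1, 10, size=(n, n)), 1)  # co-presentations
grouped = rng.binomial(pair_counts, 0.3)                    # co-groupings
presented = pair_counts + pair_counts.T
together = grouped + grouped.T

# Dissimilarity = 1 - proportion of co-presentations on which a pair was grouped
dsm = np.zeros((n, n))
off_diag = presented > 0
dsm[off_diag] = 1 - together[off_diag] / presented[off_diag]

# Euclidean distance between DSM rows, then Ward's minimum variance clustering
row_dist = pdist(dsm, metric="euclidean")
tree = linkage(row_dist, method="ward")
clusters = fcluster(tree, t=9, criterion="maxclust")  # e.g., cut at 9 clusters
print(np.bincount(clusters)[1:])                      # cluster sizes
```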
Based on a scree plot of the merge heights for different numbers of clusters, there were multiple logical cut-points depending on the intended purpose (Figs. 2.8 and 2.9). One way of visually inspecting the groupings is to look at the average face image across all stimuli included in a cluster. As one example, using 9 clusters (the approach used in Chapter 3), the number of stimuli included in each cluster ranged from 58-293 (M = 128.44). Based on visual inspection, this cut-point successfully generated distinct locations in face space (Fig. 2.10).

Figure 2.8. Scree plot of a hierarchical clustering analysis based on the distance between sorted face stimuli. 1-20 groups (x-axis) are included here, with the height of the groups plotted on the y-axis. Based on this plot, 4 is the most logical cut-point.

Figure 2.9. Scree plot of a hierarchical clustering analysis based on the distance between sorted face stimuli. 4-20 groups (x-axis) are included here, with the height of the groups plotted on the y-axis. By zooming in on a higher number of groups, additional logical cut-points emerge.

Figure 2.10. Example mean images of the face groupings, based on a hierarchical clustering analysis of the sorting data. In this example, the number of groups was set to 9.

Subjective ratings

Participants consistently utilized most or all of the range of the scale to make ratings (M = 8.31 ± 0.98, out of a maximum of 9). Interestingly, there was some variation between dimensions, with participants using slightly less of the range for attractiveness (M = 7.80 ± 1.01), dominance (M = 8.05 ± 1.23), and trustworthiness (M = 8.10 ± 1.02), but the full range for affect (M = 8.80 ± 0.523) and gender (M = 8.80 ± 0.523). This difference could be attributable to something about the stimulus set, or it could be related to the more bivalent nature of affect and gender, which could push responses away from the center of the scale. Participants did not respond only at the extremes, though; they tended to use every response within that range, with the number of distinct responses given closely corresponding to the range (M = 8.24 ± 1.07). The average standard deviation in responses was 1.87 ± 0.57. Participants were consistent in the ratings they made, with high correlations between the ratings for repeated stimuli (dominance: M = 0.72 ± 0.13, range: 0.46-0.94; trustworthiness: M = 0.69 ± 0.15, range: 0.41-0.96; attractiveness: M = 0.78 ± 0.13, range: 0.52-0.95; affect: M = 0.88 ± 0.12, range: 0.67-0.98; gender: M = 0.88 ± 0.12, range: 0.51-0.97) and low average absolute differences (dominance: M = 0.78 ± 0.35, range: 0.10-1.45; trustworthiness: M = 0.85 ± 0.29, range: 0.41-1.52; attractiveness: M = 0.58 ± 0.20, range: 0.32-1.00; affect: M = 0.57 ± 0.25, range: 0.24-1.03; gender: M = 0.69 ± 0.22, range: 0.38-1.10). In order to assess the inter-rater reliability of these ratings, we calculated the ICC on the consistency amongst raters for each rating and stimulus pool. The ICC was consistently high for affect (M = 0.74 ± 0.034, range: 0.72-0.79) and gender (M = 0.78 ± 0.039, range: 0.74-0.83). The other ratings were less consistently scored, as indicated by lower ICCs for dominance (M = 0.28 ± 0.087, range: 0.18-0.36), trustworthiness (M = 0.38 ± 0.010, range: 0.28-0.50), and attractiveness (M = 0.40 ± 0.052, range: 0.35-0.46). In order to prepare the ratings for future applications, we combined the ratings from each participant. First, ratings of stimuli repeated within participants were averaged together.
The ratings were then z-scored within participants to help account for any differences in how participants utilized the rating scale. With the ratings now on a standardized scale, we averaged across participants. The distributions and pairwise relationships between ratings are an important way to evaluate the validity of the data (Fig. 2.11). One initial validation is the bimodal distribution for gender (see Fig. S2.1). In fact, if classification were performed with the 0 point of the scale as the dividing line, classification accuracy would be nearly perfect (female: 98.8%; male: 99.3%). Another important validation is the high correlation (r = .72) between affect and trustworthiness. The low correlation between gender and affect (r = -.12) is a good validation that the facial expressions in the set did not systematically vary by gender. One surprisingly strong relationship was between trustworthiness and dominance (r = -.56). Previous research has suggested that these are independent face dimensions (Oosterhof & Todorov, 2008). The relationship we found here could be driven by differences between the stimulus sets.

Figure 2.11. Relationships between the ratings for all stimuli. The ratings were z-scored within each participant and then averaged across participants. Top right: the pairwise correlations between each of the 5 ratings. Bottom left: the pairwise scatterplots between each of the ratings. Each dot represents one image. Diagonal: density plot of each rating.

Discussion

Face images are a tremendously valuable stimulus class in psychological research. Here we make available a large corpus of face stimuli, all forward-facing, aligned, and uniform in size, but with a high degree of diversity on perceptually-important features. The stimuli have all been hand landmarked for use in an AAM, have been sorted on similarity, and have been rated on perceptually-important feature dimensions. Combined, these attributes allow these stimuli to be utilized with a high degree of experimental control. One of the main applications is the creation of synthetic face stimuli through the manipulation of the AAM components. With 61 shape and 753 appearance components, there is a large potential search space, with many locations in the space generating a combination of features that do not exist naturally. One way of choosing locations from which to generate realistic faces is by mapping the grouping data to the AAM components (e.g. with a linear regression model). This allows for the generation of the “average” face from each group. In Chapter 3 we validated that faces generated from different groups were less likely to cause interference with one another. This is useful for creating a stimulus design structure where interference occurs only where it is intended. The AAM components also allow for the experimental manipulation of specific components. When these components are combined with the subjective ratings, we can learn and utilize the relationship between the two. In one approach (see Chapter 3), we fit two regularized regression models with the AAM shape or appearance components as the outcome measures and all subjective measures collected as input variables. The weights from these models allow for the shifting of AAM components in relation to a specified shift in a subjective dimension (e.g. affect). This approach allowed for the creation of stimuli manipulated to be exactly the same except slightly different on one perceptually-important dimension (see Figs. S2.2 and S2.3 for examples).
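A minimal sketch of this mapping-and-shifting logic is shown below in Python (scikit-learn's ridge regression stands in for the regularized regression; the actual models were fit separately for shape and appearance components, and the rating column index here is hypothetical):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n_faces, n_ratings, n_components = 1148, 5, 753
ratings = rng.normal(size=(n_faces, n_ratings))        # z-scored subjective ratings
components = rng.normal(size=(n_faces, n_components))  # AAM appearance scores

# Regularized regression from all subjective ratings to the AAM components
model = Ridge(alpha=1.0).fit(ratings, components)
weights = model.coef_                 # shape: (n_components, n_ratings)

# Shift one face by +0.5 standardized units on a chosen subjective dimension
AFFECT = 3                            # hypothetical column index for affect
shifted_face = components[0] + 0.5 * weights[:, AFFECT]
```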
Further, this approach allowed memory to be probed along those same manipulated dimensions. For many experimental designs, this is an improvement over approaches such as warping, which rely on dimensions that need to be learned during the experiment. Other approaches, such as actors with different expressions, do not allow for a multi-dimensional, continuous search space. There are times, however, when it may be preferable to employ natural face images. The extent to which the perception of synthetic face stimuli matches the perception of real face images remains unclear. With recent advances in synthetic faces, there is evidence that they may be indistinguishable behaviorally (Shen et al., 2021), but that does not necessarily mean that they are the same in neural representational space (Dado et al., 2022). The present corpus has a diverse array of faces that is large enough to fill out the full range of the key perceptual scales included. Indeed, participants tended to use the full scale when making ratings. These ratings allow quantification of differences between face stimuli on specific dimensions, enabling experimental control without artificial manipulation. Further, the sorting data allow for the use of an overall similarity metric. Separately or combined, these metrics can be used in experiments that aim to manipulate the similarity between stimuli, or they could be used in a category learning design based on the latent clustering of the face images. The capability to have natural and synthetic images mapped onto the same latent dimensions further provides opportunities for integrated research approaches. For example, being able to map the synthetic and natural images into the same space allows for the possibility of testing whether there are differences in neural responses between them. Similarly, it allows for a more integrated research program in which different experiments can use face stimuli from the same pool that can all be mapped into a shared feature space. The stimuli and data we are currently making available are well suited for research purposes, but there are areas to target for improvement going forward. Overall we found high reliability in our metrics, but there were some potential gaps. In particular, we found high reliability in most landmark positions, both within and across raters. However, a small number of landmarks had considerably lower reliability scores. We attribute that to the lack of variability in those particular landmarks, because features that the face stimuli were matched on (e.g. eye position) were the most likely to have low reliability. This lack of variation in those landmark positions could lead to any errant deviation lowering the measured reliability. If that interpretation is correct, we may need a better metric to assess reliability; alternatively, those landmarks may not be adding any value to the AAM and should be left out. Sorting reliability indicated high correspondence between different sorters; however, there was some unaccounted-for variance. The correlations between dissimilarity matrices derived from different combinations of sorters were all near the mean of 0.46. The unexplained variance could reflect differences between sorters; alternatively, it could reflect how sorting is influenced by the unique combination of faces appearing on each trial. Evidence for either explanation could be found through the collection of additional sorting data.
One barrier to the use of this data is that it was collected by research assistants rather than experimental participants. We plan to validate the current sorting results with independent experimental data. Our subjective ratings demonstrated high inter-rater reliability for affect and gender; however, the reliability was much more modest for trustworthiness, dominance, and attractiveness. It may be the case that these dimensions are more subjective and could vary by participant. The collection of additional data could help clarify whether this reflects a real difference in the ability to consistently rate these dimensions or is driven by the present participant pool. Collecting more data for these less reliable dimensions could compensate for the variability in response patterns and better reflect the average perception of each dimension. There are a number of potential applications for this corpus of face stimuli. One important plan to further increase the utility of this face corpus is to make the fMRI data collected in this dissertation (see Chapter 4) publicly available. One example use case would be to generate a dissimilarity matrix based on the neural data and compare it to the behaviorally-derived one; this would provide a way to compare perceived face similarity between behaviorally- and neurally-derived measures. Further, the neural dissimilarity matrix could be used as an overall similarity index for experimental design purposes, as described above for the behavioral data. Each similarity metric could be useful in different experimental contexts, particularly since these similarity metrics are distinct both in terms of the measurement tool and the task. The neural data may ultimately provide a purer reflection of overall similarity. We make all stimuli, landmark positions, AAM components, sorting data, and subjective rating data available to other researchers. We hope that this provides a valuable set of face stimuli for a variety of experimental designs and purposes. We view them as particularly valuable in cases where multiple continuous face dimensions need to be quantified or manipulated. These attributes will help researchers utilize face stimuli to their full potential.

Chapter III
LONG-TERM MEMORY INTERFERENCE IS RESOLVED VIA REPULSION AND PRECISION ALONG DIAGNOSTIC MEMORY DIMENSIONS

From Drascher, M. L., & Kuhl, B. A. (2022). Long-term memory interference is resolved via repulsion and precision along diagnostic memory dimensions. Psychonomic Bulletin & Review, 1-15.

Introduction

When episodic memories are similar, this can lead to interference and forgetting. A critical point of emphasis in theories of episodic memory has been to not only characterize the contexts and situations in which interference occurs, but to consider the mechanisms that resolve interference (Anderson et al., 1994; Anderson & Spellman, 1995; Anderson, 2003; Crowder, 2014; Fawcett & Hulbert, 2020; Smith & Hunt, 2000). To the extent that similarity is a root cause of interference, one potentially powerful way to reduce interference is to accentuate subtle differences between memories (Hulbert & Norman, 2015; Smith & Hunt, 2000). However, there is surprisingly little evidence characterizing whether or how the contents of episodic memories change as an adaptive response to interference. One way to accentuate differences between similar memories is by increasing memory precision.
For example, if two students look similar, more precise memories for the features of those students’ faces (e.g., their specific eye colors) should render those memories more distinct. This concept is similar to the idea from perceptual learning that stimulus dimensions are ‘stretched’ to allow more fine-grained perceptual discriminations (Goldstone, 1998; Nosofsky, 1986). Analogously, increasing memory precision should expand the space between similar memories, thereby reducing interference. An alternative, though not mutually exclusive, possibility is that differences between similar events are accentuated by misremembering event features as being more different than they actually were. For example, a pair of recent studies demonstrated that when otherwise identical objects were associated with slightly different colors, the color difference between those objects was systematically exaggerated in memory (Chanales et al., 2021; Zhao et al., 2021). Critically, this memory repulsion only emerged with extensive practice and coincided with reductions in interference-related memory errors. In fact, during early stages of learning, there was an ‘attraction’ in color memory (Chanales et al., 2021). Notably, repulsion-like biases have also been observed in working memory (Bae & Luck, 2017; Chunharas et al., 2018; Chunharas et al., 2019; Golomb, 2015) and visual attention (Chen et al., 2019; Won et al., 2020; Yu & Geng, 2019). To the extent that episodic memory interference triggers changes in precision or bias, these changes should be most likely to occur (or most beneficial) along feature dimensions that are diagnostic of differences between similar memories. For example, if two students have identical hair color but slightly different eye color, then eye color would represent a diagnostic feature dimension. Targeted changes in discrimination accuracy along diagnostic feature dimensions have been observed during category learning (Goldstone & Steyvers, 2001; Kruschke, 1996; Theves et al., 2020) and in working memory (Chunharas et al., 2018). Computational models of episodic memory interference have proposed that episodic memory representations also undergo targeted changes that specifically exaggerate differences between similar memories (Hulbert & Norman, 2015), but empirical support for this proposal remains limited. While precision and bias may both contribute to the resolution of memory interference, they are orthogonal constructs. Whereas precision refers to a reduction in memory variability, bias refers to a shift in a memory distribution. However, both measures require that memory be expressed using continuous values. Additionally, calculating precision requires that individual memories be sampled multiple times (to observe variability in the response). Despite recent progress towards utilizing continuous feature measures in episodic memory research (e.g. Berens et al., 2020; Brady et al., 2013; Cooper et al., 2019; Cooper & Ritchey, 2019; Harlow & Donaldson, 2013; Harlow & Yonelinas, 2016; Nilakantan et al., 2017; Nilakantan et al., 2018; Rhodes et al., 2020; Richter et al., 2016), prior studies have not specifically compared the relative contributions of precision and bias to the resolution of episodic memory interference. Here, using multi-dimensional stimuli (faces), we tested whether similarity between stimuli induces adaptive changes in episodic feature memory (precision and/or bias) along diagnostic versus non-diagnostic feature dimensions.
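To illustrate how these two constructs can be separated when memory is expressed on a continuous scale, the following is a minimal Python sketch with simulated responses. The sign convention (positive bias indicates repulsion away from the competitor) mirrors the analyses reported later, but the function itself is illustrative rather than the analysis code used here:

```python
import numpy as np

def bias_and_precision(responses, target, competitor):
    """Separate bias and precision for repeated 1-D feature memory responses."""
    responses = np.asarray(responses, dtype=float)
    # Bias: shift of the response distribution, signed so that positive
    # values indicate movement away from the competitor (repulsion)
    away = np.sign(target - competitor)
    bias = away * (responses.mean() - target)
    # Precision: inverse of response variability (requires repeated samples)
    precision = 1.0 / responses.std(ddof=1)
    return bias, precision

# Example: target at +1 and competitor at -1 on the diagnostic dimension;
# these responses overshoot the target away from the competitor (repulsion)
print(bias_and_precision([1.3, 1.5, 1.2, 1.6], target=1.0, competitor=-1.0))
```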
We developed a set of synthetic face stimuli that were manipulated on perceptually-important dimensions (Oosterhof & Todorov, 2008), as well as a behavioral face reconstruction task that allowed participants to express face memory by actively adjusting the synthetic faces. We used this innovative methodology across three experiments (including a preregistered third experiment) that each included a simple learning paradigm in which participants studied associations between faces and cue words (professions). Critically, most of the faces had a competitive pairmate that differed only on a counterbalanced diagnostic dimension (affect or gender). After extensive study and retrieval practice, we probed participants’ memories for both feature dimensions simultaneously. Our central hypothesis was that competition would yield adaptive changes along the diagnostic feature dimension. Specifically, we predicted that memory for diagnostic features would be biased to exaggerate differences between similar memories (repulsion) and that repulsion would be associated with lower memory interference. We also predicted greater precision for diagnostic features and, importantly, tested whether repulsion and precision were independently predictive of memory interference.

Methods

We conducted three experiments with the same core experimental design and procedure. The only differences across the experiments were that (1) the similarity of competitive pairmates increased very slightly from experiment 1 to 2 to 3, and (2) the minimum number of learning rounds was increased from experiment 1 to experiments 2 and 3 to account for the greater similarity/difficulty. Analyses and predictions for experiment 3 were preregistered (https://osf.io/s2gnq) after analyzing data from experiments 1 and 2. Thus, analyses are first reported for experiments 1 and 2, and then, separately, for experiment 3 (to test for replication). Exploratory analyses that combined data across experiments are also reported.

Participants

Participants were undergraduate students from the University of Oregon who received course credit for participation. A total of 40 participants were recruited for experiment 1. Four participants were excluded from analyses due to technical/procedural errors (see preregistration for full exclusion criteria: https://osf.io/s2gnq), resulting in a sample of 36 participants (Mage = 19.11 ± 1.65, 18-25 years, 25 females). We sought a similar sample size in experiment 2 and recruited 41 participants (Mage = 20.49 ± 2.47, 18-28 years, 28 females); no participants were excluded for technical/procedural errors. Based on the effect sizes in experiments 1 and 2 and corresponding power analyses, we recruited a sample of 60 participants for the preregistered experiment 3 (see https://osf.io/s2gnq). Three participants were excluded for technical/procedural errors, resulting in a sample of 57 participants (Mage = 19.00 ± 2.41, 18-22 years, 40 females). Each experiment involved a single session for each participant that lasted 90-120 minutes. Informed consent was obtained in accordance with procedures approved by the University of Oregon Institutional Review Board. All participants who were not excluded due to technical/procedural errors were included in our analyses of associative memory test performance (see Procedure). Inclusion in all subsequent analyses was based on a set of performance-based exclusion criteria (see Performance-based exclusion criteria).

Materials
Cue words. For each participant and each experiment, the same set of 12 cue words was used (farmer, dentist, lawyer, teacher, chef, tailor, plumber, actor, artist, surgeon, judge, barber). Each cue word was assigned to a unique face, with the assignment randomized for each participant. All of the cue words referred to professions, consisted of one or two syllables, and were displayed in white in all capital letters.

Faces. Face images appeared in color with a uniform ellipse shape with a horizontal radius of 81 pixels and a vertical radius of 120 pixels. For all experiments, face images were generated from a set of eight base faces. The base faces were derived from a separate experimental procedure in which participants sorted a corpus of 1,008 faces into ‘families’ based on subjective assessment of the likelihood that faces were genetically related. Clustering algorithms were applied to the sorting responses to identify distinct clusters (families). Each of the eight base faces represents the mean face from a cluster, normalized for features not relevant to the grouping (see https://osf.io/6cew9/ for full details of the stimulus generation methods). Critically, because of the way in which the eight base faces were generated, the base faces were distinct from each other according to characteristics that were orthogonal to the dimensions of affect and gender (the dimensions manipulated in the current experiments). For each participant in each experiment, half of the base faces (four) were assigned to a competitive condition and half (four) were assigned to a non-competitive condition. The assignment of base faces to conditions was randomized for each participant. Base faces were manipulated along two dimensions (affect and gender) in order to generate the specific faces that participants studied (studied faces). For the four base faces assigned to the competitive condition, we created pairmates by generating two studied faces from each base face, with the common base being the source of competition. For the four faces assigned to the non-competitive condition, each base face was manipulated to generate a single studied face. Thus, a total of 12 studied faces were generated and used for each experiment. For each experiment, each studied face was manipulated to fall into one of four locations in a two x two (affect x gender) space. That is, within each experiment, each studied face had one of two affect values and one of two gender values. To manipulate these dimensions, we collected subjective affect and gender ratings for all of the 1,008 faces in the corpus (see https://osf.io/znc58/) and then used regression analyses to learn the mapping between the gender and affect ratings and face image parameters (739 parameters in total) derived from an Active Appearance Model (AAM) (Chang & Tsao, 2017; Cootes et al., 2001; Edwards et al., 1998). Thus, the regression weights allowed different affect and gender values to be translated into the 739-parameter feature space to manipulate the base faces. In order to maximize the independence of the affect and gender dimensions, for each of the AAM parameters, the dimension (affect or gender) with the highest-magnitude regression weight was retained and the regression weight for the other dimension was set to 0. Thus, each face dimension (affect, gender) was associated with a distinct set of AAM parameters.
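A minimal sketch of this winner-take-all step is shown below in Python (the weight vectors are random stand-ins for the fitted regression weights; only the zeroing rule comes from the description above):

```python
import numpy as np

rng = np.random.default_rng(7)
n_params = 739
w_affect = rng.normal(size=n_params)  # regression weights: affect -> AAM parameters
w_gender = rng.normal(size=n_params)  # regression weights: gender -> AAM parameters

# For each AAM parameter, keep only the dimension with the larger-magnitude
# weight and zero the other, so affect and gender control disjoint parameters
affect_wins = np.abs(w_affect) >= np.abs(w_gender)
w_affect_indep = np.where(affect_wins, w_affect, 0.0)
w_gender_indep = np.where(affect_wins, 0.0, w_gender)

# No parameter is driven by both dimensions after this step
assert not np.any((w_affect_indep != 0) & (w_gender_indep != 0))
```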
For the non-competitive condition, the four studied faces corresponded to the four locations in affect-gender space (one face per location), with the assignment of base faces to locations randomly determined for each participant. For the competitive condition, the eight studied faces again corresponded to the four locations in affect-gender space (two faces per location), with the assignment of base faces to locations randomly determined for each participant. Critically, the eight faces in the competitive condition included four sets of pairmates. For two of those sets, the pairmates within each set differed on affect and were matched on gender (i.e., diagnostic dimension = affect, non-diagnostic dimension = gender). For the other two sets, the pairmates differed on gender and were matched on affect (i.e., diagnostic dimension = gender, non-diagnostic dimension = affect) (see Fig. 3.1A). For the sets of pairmates that shared the same diagnostic dimension, each set corresponded to a different value on the non-diagnostic dimension, but the pairmates within each set had the same value on the non-diagnostic dimension. For example, for the two sets of pairmates for which gender was the diagnostic dimension, each set of pairmates would have a different value on the affect dimension, but the pairmates within each set would have the same value on the affect dimension.

Figure 3.1. Experimental paradigm and design. A. Examples of competitive pairmates from experiment 1, with the location of the faces in affect-gender space shown below. Top: example of pairmates matched on affect (non-diagnostic dimension) but differing slightly on gender (diagnostic dimension). Bottom: example of pairmates matched on gender (non-diagnostic dimension) but differing slightly on affect (diagnostic dimension). B. Learning phase. Each round of the learning phase (up to 12 rounds total) consisted of three tasks. During study, participants viewed and studied associations between cue words and faces. During recall, participants viewed a cue word and were instructed to recall the corresponding face as vividly as possible; the correct face image then appeared. During the associative memory test, participants attempted to match each face image with its corresponding cue word, selected from a set of 6 options: target, competitor (the cue word of the pairmate face), and 4 lures (cues from other faces). C. Face reconstruction task. Left: participants were first shown a cue and instructed to visualize the corresponding face. Then, an altered version of that face appeared (shifted a random amount on the affect and gender dimensions). Center: participants used mouse clicks in a two-dimensional box to search the affect-gender space until the reconstructed face matched their memory for the target. Right: schematic of the search space showing the true location of the target (green dot) and competitor (red dot). Example reconstruction responses (open green dots) demonstrate our predictions: a bias away from the competitor (repulsion) on the diagnostic dimension and lower variability (greater precision) along the diagnostic compared to the non-diagnostic dimension.

For experiment 1, the difference between competitive faces along the diagnostic dimension was determined based on the authors’ subjective assessment and initial pilot data. The goal was for the differences to be very subtle, yet learnable (see Fig. 3.1A for examples). Note: the units for these differences were not meaningful and are therefore not reported.
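The counterbalancing scheme just described can be summarized in a short Python sketch (illustrative only; the labels and data structures are hypothetical, and the real experiments were implemented in MATLAB):

```python
import random

random.seed(0)
base_faces = list(range(8))
random.shuffle(base_faces)
competitive, non_competitive = base_faces[:4], base_faces[4:]

# Non-competitive: one studied face per cell of the two x two (affect x gender) space
cells = [(a, g) for a in ("low", "high") for g in ("low", "high")]
noncomp_faces = list(zip(non_competitive, random.sample(cells, 4)))

# Competitive: two pairmate sets differ on affect (matched on gender), two on
# gender (matched on affect); sets sharing a diagnostic dimension take
# different values on the non-diagnostic dimension
comp_faces = []
for diagnostic, bases in (("affect", competitive[:2]), ("gender", competitive[2:])):
    nondiag_values = random.sample(["low", "high"], 2)
    for base, nd in zip(bases, nondiag_values):
        for d in ("low", "high"):  # pairmates differ only on the diagnostic dimension
            cell = (d, nd) if diagnostic == "affect" else (nd, d)
            comp_faces.append({"base": base, "diagnostic": diagnostic, "cell": cell})

print(len(comp_faces))  # 8 studied faces, i.e., 4 pairmate sets
```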
For experiment 2, the difference between competitive pairs was reduced by 25% relative to experiment 1 in order to slightly increase the difficulty/interference. This was motivated by evidence that repulsion is more likely to occur when discrimination is relatively more difficult (Chanales et al., 2021). For experiment 3, the difference between competitive pairs on the gender dimension was the same as in experiment 2, but the difference on the affect dimension was reduced by 50% relative to experiment 1. This was motivated by evidence, from experiment 2, that interference was somewhat lower along the affect dimension compared to the gender dimension. Note that since the differences between competitive pairs in experiment 1 were quite small to begin with, the changes across experiments were subtle. For additional consideration of differences between affect versus gender across experiments, see Fig. S3.1. Within each experiment, the difference between competing faces (pairmates) on the diagnostic dimension is described in relative terms (scaled units), with each face being 1 unit from the center of face space and, therefore, 2 units from each other. All faces were also exactly 1 unit away from the affect and gender borders in the response window (see Reconstruction phase, below). Analyses of face memory from the reconstruction phase were performed based on the distance, in units, between participants' responses and the actual location of the studied faces. Procedure Each experiment consisted of two main phases: a learning phase and a reconstruction phase. The purpose of the learning phase was for participants to extensively study and practice remembering the cue-face associations. The reconstruction phase served as the critical memory test for measuring bias and precision in face memory. All experiments were run in Matlab, using the Psychophysics Toolbox extensions (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997). All phases of the experiment had a gray background. Learning phase. The learning phase consisted of up to 12 rounds, with each round split into two sub-rounds. Each sub-round included three blocks corresponding to the following experimental tasks, in the following order: study, recall, and associative memory test (Fig. 3.1B), with the exception that rounds one and two did not include the recall task. For each participant and each round of the learning phase, the 12 associations were randomly split into two groups of six associations each (four competitive, two non-competitive), with each group of six associations assigned to a separate sub-round. In other words, in each round of the learning phase, half of the associations went through study/recall/associative memory test and then the other half of the associations went through study/recall/associative memory test (with the exception, as noted above, that rounds one and two did not include the recall task). The rationale for splitting the associations into two sub-rounds was to facilitate learning by reducing the amount of information per block. In the study task, participants viewed and studied the cue-face pairings. On each trial (2000 ms), a cue appeared directly above a face image. In between trials, there was a fixation cross for 200 ms. Participants were instructed to study the cue-face pairings; no response was made. In the recall task, participants attempted to recall the face associated with each cue. On each trial, a cue was presented above a blank ellipse (representing the to-be-recalled face) for 2500 ms.
Participants were instructed to recall the associated face image as vividly as possible. Although no response was made, the correct face would then appear below the cue for 1000 ms as a way of providing feedback. In between trials, there was a 200 ms fixation cross. In the associative memory test, participants attempted to match face images with corresponding cue words. On each trial, a face image was presented for 2000 ms and was then replaced by a set of six different cue words displayed in the bottom half of the screen (three cues in each of two rows, with the position randomly determined for each trial). The cue words included all of the cues from the current sub-round. For faces in the competitive condition, the set of cues included the correct answer (target), the cue that had been paired with the current face's pairmate (interference error), and four cues that had been paired with the other, unrelated faces (lures). For faces in the non-competitive condition, the set of cues included the correct answer (target) and five cues that had been paired with unrelated faces (lures). Participants made responses by clicking on the cue word with the mouse. After each response was registered, feedback indicated whether the response was correct (“Correct!”; 500 ms) or incorrect, with the correct cue indicated (e.g., “Incorrect. This is the BARBER.”; 2000 ms). During the first two rounds of the learning phase, each study block presented each cue-face association three times. In subsequent rounds, each association was studied once per block. As noted above, there was no recall task in the first two rounds of the learning phase. In subsequent rounds, each association was recalled twice per recall block. Across all rounds of the learning phase, each association was tested three times per associative test block. For each task block (study/recall/associative test), the order in which each association was presented/tested was pseudo-randomly determined, with the following constraints: (1) all of the associations in each block were studied/presented once before any were repeated, (2) a given association was never presented/tested consecutively, and (3) competing associations (face pairmates) were never presented/tested in consecutive trials. These constraints helped ensure that any comparisons between stimuli/associations were memory-based. In experiment 1, participants repeated the learning phase for at least nine rounds and until they reached 100% accuracy on the associative memory test, up to a maximum of 12 rounds. Most participants had reached perfect accuracy after nine rounds (24/36), and nearly all did so after 10 rounds (31/36). Only two participants went through all 12 rounds, with one achieving perfect performance and the other being removed for continued poor performance (see below for performance-based exclusion criteria). In experiments 2 and 3, all participants completed 12 rounds of the learning phase regardless of associative memory test performance. For each experiment, participants were given the opportunity to take a break after every two rounds, with the length of the break determined by the participant. Participants were instructed to press the space bar when ready to proceed. Reconstruction phase. After the learning phase, participants' memories for the features of the faces were probed with a surprise reconstruction task (Fig. 3.1C). On each trial in the reconstruction task, participants were first shown a cue (e.g.,
“What does the BARBER look like?”) above a blank ellipse for 2500 ms and were instructed to bring the target face to mind. Next, an altered version of the target face appeared in the ellipse with a response box beneath the face representing the search space (see Reconstruction search space, below, for details). Participants used a mouse to click through the box; the face image above the box changed according to the location of each mouse click in the box. Although participants were not explicitly made aware of this, the box represented a two-dimensional affect-gender space. Participants were instructed to continue searching (clicking through the box) until the face matched their memory for the target face. Participants finalized their response by pressing the space bar. There was no limit on the response time. A fixation cross appeared for 200 ms between trials. Each of the 12 studied faces was probed (reconstructed) a total of four times in the reconstruction phase (48 trials total). The rationale for probing faces multiple times was so that the precision (variability) of reconstructions for each face could be measured. Faces were reconstructed in a pseudo-random block order. In each of four consecutive blocks (with no break or demarcation between blocks), each of the 12 faces was reconstructed once. As in the learning phase, the same face was never tested consecutively and pairmate faces were never tested in consecutive trials. After the reconstruction phase, there was a short phase in which participants were prompted to provide a rating on a 9-point scale for both affect and gender for each stimulus. Results from this task (which was only included for validation) are not described here. Reconstruction search space. In the reconstruction task, the altered face presented on each trial was derived from the same base face as the target face, but the affect and gender values were randomly selected from a range of possible values. This range of possible values corresponded to the size of the two-dimensional search space (i.e., the size of the response box). Importantly, the range of the search space and the center of the search space were identical across all trials, but the mapping of the dimensions to the x and y axes (e.g., x axis = affect, y axis = gender) and the direction/orientation of the axes (e.g., left = low, right = high) were randomly varied for each trial so that participants would not learn to associate a given face with a fixed spatial position in the response box. For each experiment, the size of the search space relative to the distance between pairmate faces was identical. That is, for each experiment the height and width of the search space was exactly twice the distance between pairmate faces on the diagnostic dimension. Thus, with pairmate faces 2 units apart (in our standardized units), the height and width of the search space was 4 units. For each trial, the location of the correct answer (target face) and the location of the pairmate face (for faces in the competitive condition) always corresponded to one of four possible locations (the center of each quadrant), with all four of those locations contained in the search space (see Fig. 3.1C). Analysis methods Performance-based exclusion criteria. For analyses that involved the reconstruction task data, we excluded a small number of participants based on performance during rounds 9-12 of the associative memory test.
Participants were excluded if (a) their error rate for non-competitive trials was greater than 20% for any of these rounds or (b) they selected lures on greater than 20% of the competitive trials for any of these rounds. Based on these criteria, one participant was excluded from analysis of the reconstruction task data in experiment 1 (yielding N = 35), four were excluded from experiment 2 (yielding N = 37), and eight were excluded from experiment 3 (yielding N = 49) (see https://osf.io/dj6q2/ for other exclusion criteria that were established but did not apply). The rationale for having a high threshold for inclusion of participants in the reconstruction task analysis was to minimize cases where participants reconstructed an entirely wrong face and to instead focus on bias/precision in otherwise correctly remembered faces. Measuring associative memory. As noted above, the associative memory test was used to confirm that participants achieved high accuracy in associating cues with faces. The associative memory test also allowed for a manipulation check of whether the competitive condition induced interference (lower associative memory accuracy) compared to the non-competitive condition. Data from the associative memory test were first analyzed in terms of accuracy on competitive compared to non-competitive trials. We ran a separate repeated measures ANOVA for each experiment with factors of condition (competitive, non-competitive) and learning round (1-9 for experiment 1, 1-12 for experiments 2 and 3). For competitive trials, we also separated errors by whether they were attributable to competition (interference errors) or not (lures). If errors were random, interference errors would occur on 1/5th (20%) of the error trials. To test whether interference errors occurred at above-chance levels, we therefore ran one-sample t-tests, for each experiment, comparing the mean percentage of interference errors (across all learning rounds) to 20%. Measuring bias. As described above, on each trial in the reconstruction task the target face was located in one of four locations (the center of the four quadrants). Thus, for both the x and y axes of the search space, the target was half-way between the center and the border of the search space (Fig. 3.1C). To measure potential bias, for each experiment all responses were aligned onto a common axis and rescaled onto a common scale, separately for each feature dimension (affect, gender). For the rescaled data, the range of possible responses for each dimension was -2 to 2, with 0 being the center of the face space (i.e., the center of the search space). For the competitive condition, the location of the target face on the diagnostic dimension = 1 and the location of the pairmate face = -1 (Fig. 3.1C). Thus, a bias away from the pairmate face would be represented by values greater than 1, whereas a bias toward the pairmate face (or toward the center of face space) would be represented by values lower than 1. For the non-diagnostic dimension, the location of the target face and the pairmate face = 1. Although faces from the non-competitive condition were included in the reconstruction task, bias was not measured for these faces because the distinction between diagnostic versus non-diagnostic dimensions did not exist. Rather, non-competitive faces were of critical importance in the associative memory test, where they served to establish an overall memory interference effect.
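As a concrete illustration of this rescaling, the following is a minimal sketch in R, assuming the raw click has already been expressed as a position within the response box; the function and argument names are hypothetical, not the actual analysis code.

# Hypothetical sketch: convert a raw click to the common scaled units above.
# 'click_frac' is the click position within the response box (0 to 1),
# 'flipped' indicates whether this trial's axis orientation was reversed,
# and 'target_sign' is the side of face space (+1 or -1) holding the target.
to_scaled_units <- function(click_frac, flipped, target_sign) {
  u <- (click_frac - 0.5) * 4      # the box spans 4 units, centered on 0
  if (flipped) u <- -u             # undo this trial's random axis flip
  u * target_sign                  # align so the target always sits at +1
}
# After alignment, values > 1 on the diagnostic dimension indicate repulsion
# (away from the pairmate at -1) and values < 1 indicate attraction.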
It is important to note that, for the reconstruction task, the response range on each trial was asymmetrically distributed around the target. If the response range had been symmetrically distributed around the target, then the correct response on each trial would have, by definition, been the center of the search space—which likely would have led participants to learn to simply respond in the center. However, the drawback of the approach we used is that, for the diagnostic dimension in the competitive condition, there was more opportunity to respond toward the pairmate face (values between -2 and 1) than away from the pairmate face (values between 1 and 2). Of course, this asymmetry works against our predicted effect of repulsion (values greater than 1). Nonetheless, in order to account for the asymmetrically restricted response range, we estimated the true mean by fitting truncated normal distributions to the data. For each participant, separate models were run for the diagnostic and non-diagnostic dimensions, with each model pooling data across faces and feature dimensions (affect, gender) in order to include a sufficient number of data points. Thus, each model included 32 data points (eight faces in the competitive condition x four reconstruction trials per face). Maximum-likelihood estimation was used to find the mean and standard deviation of a truncated normal distribution that best fit the data. The distributions were modelled using the truncnorm and MASS packages in R. We constrained the search space of the mean to a range of plausible values evenly balanced on either side of the target (±1 unit) and constrained the standard deviation to be a maximum of 1 and a minimum of .1. Although we view the modelled means as a better estimate of the true means, there are some sources of variance that the models do not account for. For example, the models do not account for potentially unique distributions for each feature dimension and/or stimulus. Furthermore, there is evidence that there may be inherent, global biases in how face features are later recalled (Won et al., 2020; Bülthoff & Zhao, 2019). Critically, however, any global biases would equally influence the diagnostic and non-diagnostic dimensions. Therefore, our analysis primarily focused on differences in modeled means for the diagnostic versus non-diagnostic dimensions. Measuring precision. In order to measure the precision with which diagnostic and non-diagnostic features were remembered for each face, we calculated the standard deviation of responses across the four reconstruction trials for each face, separately for the diagnostic and non-diagnostic feature dimensions. We then computed the mean of these standard deviation values for each participant, separately for the diagnostic and non-diagnostic dimensions. Measuring the relationship between reconstruction bias and associative interference. In order to determine whether bias on the diagnostic feature dimension plays an adaptive role in reducing memory interference, we ran a series of mixed-effects models that focused on the relationship between bias measured during the reconstruction task and accuracy on the associative memory test (averaged across the last four rounds in order to capture the end state of learning). Although this analysis was performed at the level of individual items (faces), the accuracy value for each face was defined as the average accuracy for that face and its pairmate. As such, both pairmates within each set had the same accuracy value.
The rationale for averaging accuracy across pairmates was that if, for example, participants associate two competing faces (pairmates) with the same cue word (profession), rather than treating one of these associations as ‘correct’ and the other as ‘incorrect,’ it is more appropriate for the error to be shared across the two faces. For the analyses relating reconstruction bias to associative memory accuracy we excluded participants who had perfect accuracy, across all trials, on the final four rounds of the associative memory test. The rationale for this exclusion was that, for these participants, there was no variance in associative memory for the model to explain. Additionally, we did not run this analysis for experiment 1 given the near-ceiling performance on the associative memory test over the last four rounds (11 participants [31%] had 100% accuracy; the remaining participants had mean accuracy of 95.96 ± 3.01% with an average within-participant SD of 3.62 ± 1.70). For experiments 2 and 3—which used more similar pairmates—associative memory accuracy was lower and, therefore, fewer participants were excluded due to ceiling performance (seven participants [19%] in e2 and six participants [12%] in e3; mean accuracy for the remaining participants, e2: M = 92.47 ± 7.58%, e3: M = 93.56 ± 6.26%). For these models, it was critical to compute reconstruction bias at the level of individual faces. However, the method described above of estimating the average bias for each participant by pooling across trials/faces was not feasible for this analysis given the small number of observations (four trials per face). Thus, for this analysis we simply used the mean of the reconstruction response (across the four trials per face). In order to address the concern that any observed relationship between reconstruction bias and associative memory accuracy might be driven by potential ‘swap errors,’ our preregistered approach was to exclude any individual responses (trials) for which the scaled response was between -2 and 0 and to only retain responses for which the scaled response was between 0 and 2. For the diagnostic dimension, any responses that were closer to the competing pairmate than to the target were therefore excluded. All remaining responses were included in the mean response for each face. In the rare cases in which a face was associated with an excluded response on all four reconstruction trials, that face was entirely excluded from analysis. For experiment 2, this occurred for a total of four faces distributed across four participants; for experiment 3, this occurred for a total of six faces distributed across six participants. While this preregistered approach for exclusion of potential swap errors was intended as a conservative approach for eliminating the influence of extreme errors, all of our main results remained significant when no responses were excluded. Additionally, in exploratory analyses that combined data across experiments 2 and 3, instead of excluding extreme responses altogether, responses between -2 and 0 were capped at a value of 0, which allowed all trials to be retained in the model but reduced the influence of extreme responses. Mixed effects models were implemented in R using the lme4 package. Likelihood ratio tests were used to compare models with relevant variables to null models that excluded those variables.
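As an illustration of this model-comparison logic, here is a minimal sketch in R using lme4; the data frame and column names (d, accuracy, bias, diag_dim, subj) are hypothetical stand-ins, not the preregistered code.

library(lme4)
# Hypothetical sketch: does item-level bias on the diagnostic dimension
# predict associative memory accuracy, beyond a fixed effect of which
# dimension (affect vs. gender) was diagnostic?
full <- lmer(accuracy ~ bias + diag_dim + (1 + bias | subj), data = d, REML = FALSE)
null <- lmer(accuracy ~ diag_dim + (1 + bias | subj), data = d, REML = FALSE)
anova(null, full)  # likelihood ratio test for the fixed effect of bias
# If a model fails to converge or reaches a singular fit, the random slope
# for bias is dropped: (1 | subj) in place of (1 + bias | subj).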
In order to account for potential differences related to whether the diagnostic dimension was affect versus gender, all models included this categorical variable as a fixed effect. In order to allow the relationship between reconstruction bias and associative memory accuracy to vary for each participant, we modeled the relationship between bias and associative memory accuracy with random intercepts and random slopes for each participant, where possible. Our preregistered approach to dealing with models that failed to converge or that reached a singular fit was to rerun the same model with the random slope for bias removed (see Barr et al., 2013). While all of our preregistered models did converge, an exploratory model which used the difference in bias on the diagnostic versus non-diagnostic dimension as a predictor failed to converge when a random slope was included; thus, we removed the random slope. Exploratory models that included only unsigned error or precision as predictors (without bias) failed to converge when random slopes were included for these variables; thus, we removed random slopes for these variables. Finally, exploratory models that included bias along with precision and unsigned error as predictors also failed to converge when random slopes were included for all variables; when removing random slopes, we prioritized retaining a random slope for bias, which led to the exclusion of random slopes for precision and unsigned error. Results Associative memory test To test whether associative memory accuracy differed between the competitive and non-competitive conditions we conducted repeated measures ANOVAs for each experiment with factors of condition (competitive, non-competitive) and round (e1: the first nine rounds; e2 and e3: 12 rounds). For each experiment, there was a significant main effect of condition (e1: F(1,35) = 26.14, p < 0.001, η²G = 0.034; e2: F(1,40) = 67.43, p < 0.001, η²G = 0.10; e3: F(1,56) = 88.21, p < 0.001, η²G = 0.16), with lower accuracy in the competitive condition (Fig. 3.2A). To confirm that this difference specifically reflected interference, we considered the types of errors made. For the competitive condition, errors could correspond to selecting the competitor cue or one of the four lures (Fig. 3.2B). If errors were random, the competitor would be selected on 1/5th of the error trials. However, combining error trials across rounds, the competitor was selected at above-chance levels (e1: M = 60.18 ± 19.68%, t(35) = 12.25, p < 0.001, d = 2.04; e2: M = 71.29 ± 15.78%, t(40) = 20.82, p < 0.001, d = 3.25; e3: M = 78.63 ± 11.58%, t(56) = 38.21, p < 0.001, d = 5.06), confirming that increased errors in the competitive condition reflected interference from the competitor face. Figure 3.2. Associative memory test accuracy across learning rounds. A. Percent correct responses on the associative memory test during each round of the learning phase, separated by the non-competitive (blue) and competitive (orange) conditions and by experiment number. Performance was significantly higher for the non-competitive compared to the competitive condition in each of the three experiments. For accuracy in the competitive condition separated according to whether the diagnostic dimension was affect versus gender, see Fig. S3.1A. B. Error rates for the competitive condition on the associative memory test during each round of the learning phase. Data are separated by error type (competitor: red; lure average: grey) and experiment number.
Competitors (the cues associated with the pairmate faces) were selected at a rate that exceeded the average rate of selecting one of the four lures. Error bars represent SEM. Face reconstruction accuracy To test whether face reconstruction accuracy was above chance, we measured the Euclidean distance between each response and the target face location (in the two-dimensional response space; Fig. 3.1C). For each participant, the mean Euclidean distance between responses and target locations was compared against a permuted distribution (calculated by shuffling responses within participant 10,000 times). Above-chance accuracy (better than 97.5% of the permuted means) was observed for every participant (Fig. 3.3). Figure 3.3. Face reconstruction accuracy. A. The mean Euclidean distance between the reconstructed location and the target location was significantly lower than chance for every participant, as determined by comparing responses to a distribution of shuffled responses (10,000 shuffles per participant). The plot is arranged from participants with the lowest to highest mean Euclidean distance (left to right), with each participant represented by an individual dot (e1: blue; e2: orange; e3: pink). The distribution of shuffled responses for each participant is represented by a boxplot. B. Histogram of z scores reflecting each participant's mean Euclidean distance relative to the distribution of shuffled data (M = -6.87 ± 1.68, range: -9.97 to -2.59). Lower z scores reflect better performance (lower Euclidean distance). Face reconstruction bias To test our critical prediction of repulsion along the diagnostic face dimension, we compared feature bias (see Methods) for the diagnostic vs. non-diagnostic dimensions in the competitive condition (Fig. 3.4A). We first tested predictions in experiments 1 and 2, and then tested for replication in experiment 3. A repeated measures ANOVA with factors of dimension (diagnostic, non-diagnostic) and experiment (e1, e2) revealed significantly greater bias toward repulsion on the diagnostic dimension (F(1,70) = 22.25, p < 0.001, η²G = 0.061). There was a trend toward a significant interaction between dimension and experiment (F(1,70) = 3.96, p = 0.0506, η²G = 0.011), with a relatively weaker effect size in experiment 1 (d = 0.27) than experiment 2 (d = 0.73). Consistent with our preregistered hypothesis, experiment 3 replicated the greater bias toward repulsion on the diagnostic dimension, with a large effect size (t(48) = 5.87, p < 0.001, d = 0.83). Figure 3.4. Feature memory from the reconstruction task along the diagnostic and non-diagnostic dimensions. A. There was greater bias towards repulsion (higher modeled mean response) on the diagnostic (orange) compared to the non-diagnostic (blue) dimension. B. There was greater precision (lower standard deviation of responses across the four reconstruction trials for each face) on the diagnostic compared to the non-diagnostic dimension. For analyses separated according to whether the diagnostic dimension was affect versus gender, see Fig. S3.1B,C. For analyses comparing the diagnostic and non-diagnostic dimensions with the non-competitive condition, see Fig. S3.3. Note: error bars represent SEM. Although our preregistered analyses focused on the comparison between diagnostic and non-diagnostic dimensions, we also tested whether reconstructions on the diagnostic dimension significantly differed from the veridical location of target faces.
Indeed, combining data across all three experiments, the modeled means for the diagnostic dimension were significantly greater than the true value of 1 (t(120) = 4.39, p < 0.001, d = 0.40), reflecting a bias away from the competing face. This effect did not significantly differ across experiments (F(2,118) = 2.15, p = 0.12, η²G = 0.035). In contrast, on the non-diagnostic dimension there was a small, but significant, bias toward the center of face space (modeled means < 1; t(120) = -2.33, p = 0.021, d = 0.21). This effect significantly differed across experiments (F(2,118) = 9.56, p < 0.001, η²G = 0.14). In fact, in experiment 1 responses were significantly above 1 (t(34) = 2.15, p = 0.039, d = 0.36), and in experiments 2 and 3 they were significantly below 1 (e2: t(36) = -2.45, p = 0.019, d = 0.40; e3: t(48) = -3.98, p < 0.001, d = 0.57). While the absolute values of reconstructed responses should be interpreted with some caution (due to potential global biases), the consistent bias toward repulsion on the diagnostic dimension supports our prediction that competition triggers targeted repulsion on the diagnostic dimension. Face reconstruction precision We next tested whether reconstruction precision differed across diagnostic vs. non-diagnostic dimensions (Fig. 3.4B). We defined precision as the standard deviation across repeated reconstructions of the same face (see Methods). For the competitive condition, a repeated measures ANOVA with factors of dimension (diagnostic, non-diagnostic) and experiment (e1, e2) revealed significantly greater precision—i.e., lower reconstruction variability—on the diagnostic dimension (F(1,70) = 16.81, p < 0.001, η²G = 0.044). This effect did not interact with experiment (F(1,70) = 0.34, p = 0.56, η²G = 0.001). The effect of greater precision on the diagnostic dimension was replicated (consistent with our preregistered prediction) in experiment 3 (t(48) = 5.45, p < 0.001, d = 0.74). Although our measure of precision was mathematically independent of our measure of bias, it is notable that these measures were correlated such that faces reconstructed with greater precision also tended to be associated with greater bias (see Fig. S3.2A). Importantly, however, the effect of greater precision on the diagnostic versus non-diagnostic dimension remained significant even when high-bias items were excluded from analysis (see Fig. S3.2B). Relationship between reconstruction bias and associative interference Finally, we tested our prediction that greater reconstruction bias (repulsion) on the diagnostic dimension is associated with better associative memory test performance (less interference). Due to near-ceiling associative memory performance in experiment 1 (Fig. 3.2), we focused on experiment 2 data. We ran a mixed-effects model that predicted item-level associative memory accuracy with fixed effects of (a) bias on the diagnostic dimension (continuous variable) and (b) whether the diagnostic dimension was affect or gender (categorical variable). Bias was modelled with random intercepts and slopes for each participant. Using a likelihood ratio test, we compared this model to a model without bias. Critically, model fit was significantly better when bias was included (χ²(1) = 4.67, p = 0.031), with bias positively predicting associative memory accuracy (βbias = 3.58, SE = 1.62). As a control, we repeated the same analysis, but with bias on the non-diagnostic dimension; here, bias failed to improve model fit (χ²(1) = 0.021, p = 0.89, βbias = -0.31, SE = 2.14).
For experiment 3, we predicted (using a preregistered analysis) a replication of the relationship between diagnostic dimension bias and associative memory accuracy. We observed a small effect in the predicted direction, but it was not significant (χ²(1) = 0.24, p = 0.63, βbias = 0.69, SE = 1.41). In our preregistered analysis, we excluded reconstruction responses (trials) that were more similar to the competitor than the target. The rationale for this was to ensure that extreme responses (potential swap errors) did not have an outsized influence on the model (see Methods). However, this approach fully eliminated these trials rather than minimizing their influence. Therefore, as an exploratory analysis, we replaced these extreme reconstruction scores with a value of 0 (equal distance between the target and competitor, see Methods). This allowed all trials to be included, but reduced the influence of extreme responses (see Fig. S3.4 for further analysis of what these extreme responses may represent). For this exploratory analysis, we combined data from experiments 2 and 3, with experiment (e2, e3) added as a fixed effect. Compared to a null model, adding bias on the diagnostic dimension significantly improved model fit (χ²(1) = 15.88, p < 0.001), with positive bias (repulsion) predicting higher associative memory accuracy (βbias = 4.45, SE = 1.04). Adding an interaction between experiment and bias did not improve model fit (χ²(1) = 1.39, p = 0.24, βexp.bias = -2.47, SE = 2.08), indicating that the relationship between bias and associative memory did not differ across experiments. Moreover, bias significantly improved model fit when applied to experiment 3 data alone (χ²(1) = 3.98, p = 0.046, βbias = 2.45, SE = 1.19), confirming that the relationship between bias and associative memory was not driven only by experiment 2 data. As a control, we ran the same model comparison but with bias on the non-diagnostic dimension as a predictor; there was no significant difference between models (χ²(1) = 0.14, p = 0.71, βbias = -0.40, SE = 1.08). Further, the degree of bias on the diagnostic dimension relative to the non-diagnostic dimension (i.e., the bias difference score) also significantly improved model fit compared to a null model without bias, χ²(1) = 19.87, p < 0.001, βbias.diff = 2.71, SE = 0.60 (random slopes were excluded due to reaching singularity). In an additional set of exploratory analyses that again combined data from experiments 2 and 3, we tested whether reconstruction bias on the diagnostic dimension predicted associative memory accuracy beyond what was predicted by unsigned error (absolute distance from the target on the diagnostic dimension) and precision (on the diagnostic dimension). Note: the following analyses did not include random slopes for unsigned error or precision (see Methods for rationale). Using hierarchical linear regressions with fixed effects of experiment (e2, e3) and feature dimension (whether the diagnostic dimension was affect or gender), model fit was significantly improved, compared to a null model, when unsigned error or precision was added (unsigned error: χ²(1) = 16.42, p < 0.001, βunsigned.error = -5.72, SE = 1.40; precision: χ²(1) = 30.27, p < 0.001, βprecision = -5.91, SE = 1.06). In other words, lower unsigned error and greater precision were associated with better associative memory.
Critically, however, model fit significantly improved when bias was added to a model that already included unsigned error and precision (χ²(1) = 4.39, p = 0.036, βbias = 2.38, SE = 1.11). Thus, bias predicted associative memory accuracy beyond what was explained by precision and unsigned error. Notably, model fit also significantly improved when precision was added to a model that already included unsigned error and bias (χ²(1) = 26.51, p < 0.001, βprecision = -5.64, SE = 1.08). Taken together, these exploratory analyses indicate that bias (repulsion) and precision—despite being correlated measures (Fig. S3.2A)—were independently predictive of associative memory performance (Fig. 3.5). Figure 3.5. Relationship between reconstruction bias on the diagnostic dimension and associative memory accuracy. For the purpose of visualization, a mixed-effects model was run with mean associative memory accuracy (from the final four rounds of the learning phase) as the dependent variable and with experiment number, unsigned error, and bias included as predictors (gender/affect and precision were excluded). Stronger bias towards repulsion (reconstruction bias values > 1 reflect repulsion) was associated with higher associative memory accuracy (i.e., lower interference). Each dot represents a specific face image, with each participant plotted with a unique color. Each line represents the modelled, participant-specific relationship between reconstruction bias and associative memory accuracy. Note: bends in the lines reflect effects of absolute error. Discussion Across three experiments we found that similarity between long-term memories induced adaptive and feature-specific changes to the contents of those memories. We measured these changes using a two-dimensional face space (affect, gender), allowing us to separately measure memory along a dimension that was diagnostic of differences between similar faces and a dimension that was non-diagnostic of differences. We found that memory along diagnostic feature dimensions exhibited two key properties: (1) a systematic bias (repulsion) that exaggerated the difference between similar memories, and (2) greater precision (lower variability). Finally, we found that repulsion and precision were independently predictive of interference-related memory errors. Although our paradigm was modeled after classic memory interference studies (Anderson, 2003; Anderson et al., 1994), the repulsion effect we observed is distinct from classic interference effects. If anything, interference predicts an attraction in remembered features. However, an important feature of our design is that face memory was only tested after extensive study and practice (Chanales et al., 2021; Zhao et al., 2021). Indeed, we found that greater repulsion in feature memory was associated with lower interference in the associative memory test. While it is important to note that this relationship failed to replicate using our preregistered analysis method in experiment 3, we view the updated method as a better approach for handling extreme responses, and the relationship we observed generalized across experiments and was independently significant in experiment 3. The relationship between repulsion and associative memory accuracy is notable when considering that repulsion fundamentally reflects a form of memory error. However, the error we observed was not randomly distributed; instead, it was systematically biased away from competing memories, thereby increasing the representational distance between memories.
These findings complement evidence of conceptually similar biases in working memory (Bae & Luck, 2017; Chen et al., 2019; Chunharas et al., 2018; Chunharas et al., 2019; Golomb, 2015) and visual attention (Won et al., 2020; Yu & Geng, 2019). The ubiquity of these biases across domains suggests that repulsion is a fundamental, adaptive mechanism for resolving interference. A central and novel focus of the present study was to compare repulsion along diagnostic versus non-diagnostic feature dimensions. The fact that repulsion was stronger for the diagnostic dimension provides important evidence that memories were not globally exaggerated (relative to the center of face space) in response to competition. Critically, in studies where only one feature dimension is probed (Chanales et al., 2021; Zhao et al., 2021), this interpretation cannot be ruled out. It is also noteworthy that because the mapping between affect and gender and the diagnostic and non-diagnostic dimensions was counterbalanced within participants, our results cannot be explained in terms of a bias along one feature dimension that generalized across all faces, as might occur in category learning (Goldstone, 1998; Goldstone & Steyvers, 2001). Finally, the relationship between repulsion and memory interference was selective to the diagnostic feature dimension, confirming that global biases were not adaptive. Thus, competition triggered targeted and adaptive distortions that preferentially occurred along the dimension that was essential for discrimination. These findings provide novel support for computational models of memory interference which propose targeted, feature-specific changes in memory representations (Hulbert & Norman, 2015; Norman et al., 2007; Norman, Newman, et al., 2006). As with the repulsion effects, the precision effects we observed are in sharp contrast to typical interference effects. Specifically, whereas prior studies have shown that interference reduces precision in feature memory (Berens et al., 2020; Pertzov et al., 2017; Sun et al., 2017), our findings reveal that memory interference was associated with a relative gain in memory precision when comparing the diagnostic versus non-diagnostic dimensions. Importantly, however, we defined precision as the standard deviation across repeated tests of the same memory. This measure of precision was orthogonal to repulsion (or accuracy) as it reflected the consistency with which faces were remembered, regardless of the distance between remembered and actual values (absolute error). Put another way, if each face feature is represented by a distribution of potentially remembered values, repulsion would reflect a shift in this distribution whereas precision would reflect reduced variance in this distribution (Yu & Geng, 2019). This is a key point because prior measures of memory precision have often assumed a distribution centered around the actual (veridical) memory value (e.g., Brady et al., 2013; Cooper & Ritchey, 2019; Harlow & Donaldson, 2013; Harlow & Yonelinas, 2016; Nilakantan et al., 2017; Nilakantan et al., 2018; Rhodes et al., 2020; Richter et al., 2016). While this is a reasonable assumption in many contexts, the current findings provide clear evidence, in the context of memory interference, that this assumption is violated. An interesting avenue for future research will be to characterize the relationship between repulsion and precision.
Here, these measures were mathematically distinct and were independently predictive of associative memory interference. Yet, repulsion and precision both have the consequence of increasing representational distance between competing memories and may therefore serve a common purpose. In fact, there was a robust correlation between these measures, with greater repulsion predicting greater precision (Fig. S3.2A). Thus, it is possible that repulsion and precision are distinct facets of a common underlying mechanism. In summary, we demonstrate that episodic memories are modified and distorted in targeted and adaptive ways in response to interference. Whereas it is intuitive to conceptualize interference resolution as a reduction in memory errors, our findings support a distinct view in which systematic memory errors enhance discriminability between similar memories. Chapter IV RECONSTRUCTING FACE IMAGES FROM DISTRIBUTED PATTERNS OF FMRI ACTIVITY USING THE ACTIVE APPEARANCE MODEL This chapter contains unpublished co-authored material. Maxwell L. Drascher is the primary author of this chapter with input from his advisor Brice A. Kuhl. Drascher and Kuhl designed the study together. Drascher supervised or conducted all data collection, and wrote the scripts for experiment presentation, data analysis, and figure creation. Data analysis was conducted with assistance from Paul Keene. Drascher wrote the manuscript with editorial assistance from Kuhl. Introduction Neural decoding has become an increasingly popular and important tool for neuroscientists (Norman, Polyn, et al., 2006). This approach to studying neural activity patterns began with broad, dichotomous decisions between perceptual categories (Carlson et al., 2003; Cox & Savoy, 2003; Haxby et al., 2001), but its utility expanded with the ability to make continuous feature predictions. This was first pursued with a focus on single, simple features such as the orientation of visual gratings (Ester et al., 2015; Kamitani & Tong, 2005; Serences et al., 2009), motion direction (Kamitani & Tong, 2006), and color (Brouwer & Heeger, 2009). More recently, researchers have utilized a neural reconstruction approach in which complex, multi-dimensional stimulus classes are decoded across a set of dimensions that can collectively represent the image. Based on these decoded feature values, a neural reconstruction of the stimulus can be generated. This type of visualization can provide a window into what information is prioritized internally given a particular prompt and how that may differ depending on both experimental conditions and individual differences (Nestor et al., 2020). Neural reconstruction approaches have often been applied to natural images (Beliy et al., 2019; Kay et al., 2008; Miyawaki et al., 2008; Mozafari et al., 2020; Naselaris et al., 2009; Seeliger et al., 2018) and movies (Nishimoto et al., 2011; Wen et al., 2018), but one stimulus class that has been of particular interest has been faces (Cowen et al., 2014; Dado et al., 2022; Güçlütürk et al., 2017; Lee & Kuhl, 2016; Nemrodov et al., 2019; Nestor et al., 2016; VanRullen & Reddy, 2019). Faces are of particular interest because humans are experts at perceiving and remembering faces (Kanwisher, 2000). This makes faces an important stimulus class for cognitive scientists, both in terms of studying how faces are processed as well as a stimulus class where small differences can be perceived and remembered (Nestor et al., 2020).
Improving the ability to accurately reconstruct faces from neural activity would both offer important insights into face processing and open up many experimental design possibilities. Here, we approached face reconstruction from fMRI data in a similar way to some previous attempts (Cowen et al., 2014; Lee & Kuhl, 2016). In those attempts, a principal component analysis (PCA) was run on a set of face stimuli in order to generate a set of dimensions that describe each face in the set (eigenfaces). Here, we compare that approach to an updated approach to parameterizing the face images, the active appearance model (AAM). This approach previously performed better than eigenfaces at reconstructing face images based on electrophysiological recordings from macaque monkeys (Chang & Tsao, 2017). This updated approach also allows for face images to be represented with greater fidelity, with fewer components (50 vs. 300 in the studies cited above). The higher fidelity raises the ceiling on the ability to accurately and vividly reconstruct faces. The efficiency in representation makes subsequent analyses more efficient and interpretable; each component is more likely to represent features important for face perception and less likely to represent other visual properties in the images. This increases the chance of successfully mapping brain activity patterns to these components, and may increase the likelihood of results generalizing across participants. Further, the AAM also introduces broad groupings of components (shape, appearance) that could help parse differences in face representation between brain regions (Chang & Tsao, 2017). In the present study, over 1,000 distinct face stimuli were viewed in the scanner across 35 participants. These face stimuli were diverse in terms of gender, age, facial expression, race, and ethnicity, and have a wide range of information describing each, including AAM components and subjective ratings (see Chapter 2). Our approach here focuses on using a regularized regression algorithm to map patterns of fMRI activity while participants were viewing a face stimulus to individual face components. We focus on using independent models based on activity from different brain regions. In particular, we included regions that are important for early perception, face processing, and also higher-level processing and memory. We compared the model's performance at predicting individual face components and component types across these brain regions. This approach also allows for the generation of reconstructed face images based on different regions, which provides a “view” into the participants' internal representation. Reconstruction accuracy was measured in terms of the distance between the predicted and true component values. Methods Participants A total of 40 right-handed, native English-speaking participants (Mage = 22.68 ± 4.17, 18-34 years, 24 females), with normal or corrected-to-normal vision, from the University of Oregon community participated in the experiment. Five participants were excluded from analysis, four for having a high degree of head movement (greater than 10 instances of framewise displacement above 0.5 mm on multiple functional runs), and one for having a high non-response rate (greater than 20% on 6/9 runs). This resulted in a final sample of 35 participants (Mage = 22.60 ± 3.95, 18-31 years, 20 females). Informed consent was obtained in accordance with procedures approved by the University of Oregon Institutional Review Board.
Procedure Participants completed nine fMRI runs of a repetition detection task, each consisting of 56 trials and lasting 7 min and 38 s. On each trial participants viewed an image centrally presented on the screen for 2 s, followed by a 6 s fixation cross (Fig. 4.1). Participants were instructed to pay attention to the images, because although most images were presented only once, a small percentage would be repeated. Using a button box, participants indicated whether each image was “new” or “old”. Responses were included if they were made within 7 s of stimulus onset. The button (index or middle finger) used to indicate new/old was randomly assigned for each participant. Each scanning run included 4 scene images and 48 unique face images, with 4 of those face images being repeated. The repeated images were used as the test images for image reconstruction, in order to have a better estimate of the neural response to those images. No images were repeated across runs. The trial order was pseudorandomized within each run, with the constraint that test faces did not appear consecutively and there were at least 3 trials between repetitions of the same test face. Two of the runs were the same for all participants (fixed runs) in terms of stimulus inclusion and order; however, the position of each fixed run within the series of nine runs was randomized. The other seven runs were randomized within participant. In total, each participant viewed 432 unique images, with 396 used for training and 36 held out for testing. The test images were the same for all participants. The training images not included in the fixed runs were assigned to each participant in a pseudorandomized manner, with half male and half female. Figure 4.1. Experimental design. In the scanner, participants viewed images of faces or scenes one at a time and judged whether each image was “old” (repeated within the run) or “new” (novel). An additional 10th functional run with the same task, but with a dynamic video of faces, was included for many participants (31/40 total, 27/35 after exclusions), but that is not the focus of the present manuscript. Although the structure laid out above was the intended design, an error in the stimulus code led some participants (14 total, 12 after exclusions) to be accidentally assigned stimuli from the fixed runs as training stimuli in other runs as well, which led to many faces being mistakenly repeated across runs for those participants (M = 65.58 ± 7.38, 54-82 stimuli). Stimuli A total of 1,148 faces were selected from a variety of online sources (see Lee & Kuhl, 2016). All faces were forward-facing and cropped and resized to 179 x 251 pixels. The faces were selected to be diverse in terms of age, race, ethnicity, and facial expression, with half male and half female. The full corpus is available at: https://osf.io/4uydh. A total of 36 images were pseudorandomly selected to be the test images, balanced for gender (half male, half female) and selected to include diversity in terms of race and facial expression. All other images were pseudorandomly selected to be included as training images for each participant, balanced for gender and with previously unused images prioritized. Of the 1,030 images in the training pool (outside of the fixed runs and testing), each stimulus was used at least once across all included participants (M = 10.0 ± 2.61, 1-18 times). A small percentage of scene trials (four per run) were included to select face-preferring voxels.
A total of 112 scene images were collected from freely available sources and cropped/resized to match the size of the face images. A diverse selection of indoor and outdoor scenes was included in the corpus. Scene images not included in the fixed runs were assigned to each participant in a random order. fMRI imaging acquisition Imaging data were collected on a Siemens 3 T Skyra scanner at the Robert and Beverly Lewis Center for Neuroimaging at the University of Oregon. Whole-brain functional images were collected using a T2*-weighted multiband EPI sequence (TR 2 s; TE 25 ms; flip angle 90°; grid size 104 x 104; voxel size 2 x 2 x 2 mm) and a 32-channel head coil. All participants had at least 9 functional scan runs, with most having 10 functional scan runs (see Procedure). After the functional runs, a whole-brain T1-weighted MPRAGE 3D anatomical volume (grid size 176 x 256 x 256; voxel size 1 x 1 x 1 mm) was also collected. fMRI data preprocessing fMRI data preprocessing was performed using fMRIPrep 21.0.0 (Esteban et al., 2018), based on Nipype (Gorgolewski et al., 2011). A field map was estimated from two consecutive gradient-recalled echo acquisitions. The corresponding phase-map(s) were phase-unwrapped with prelude (FSL 6.0.5.1:57b01774). The T1-weighted (T1w) images were corrected for intensity non-uniformity with N4BiasFieldCorrection (Tustison et al., 2010) and skull-stripped using antsBrainExtraction.sh with OASIS30ANTs as the target template (ANTs 2.3.3; Avants et al., 2008). Brain tissue segmentation of cerebrospinal fluid (CSF), white-matter (WM) and gray-matter (GM) was performed on the brain-extracted T1w using fast (FSL; Zhang, Brady, & Smith, 2001). A T1w-reference map was computed after registration of 2 T1w images (after INU-correction) using mri_robust_template (FreeSurfer 6.0.1; Reuter et al., 2010). Brain surfaces were reconstructed using recon-all (FreeSurfer 6.0.1; Dale et al., 1999), and the brain mask estimated previously was refined with a custom variation of the method to reconcile ANTs-derived and FreeSurfer-derived segmentations of the cortical gray-matter of Mindboggle (Klein et al., 2017). Volume-based spatial normalization to one standard space (MNI152NLin2009cAsym) was performed through nonlinear registration with antsRegistration (ANTs 2.3.3), using brain-extracted versions of both T1w reference and the template (Fonov et al., 2009). The estimated field map was then aligned with rigid registration to the target EPI (echo-planar imaging) reference run. The field coefficients were mapped onto the reference EPI using the transform. Functional runs were slice-time corrected to half of the slice acquisition range using 3dTshift from AFNI (Cox and Hyde, 1997). The BOLD reference was then co-registered to the T1w reference using bbregister (FreeSurfer; Greve & Fischl, 2009), using boundary-based registration with six degrees of freedom. The masks generated for each functional run were intersected to produce a single mask. Functional data were smoothed with a 2.0 mm FWHM Gaussian kernel using 3dBlurToFWHM from AFNI (Cox & Hyde, 1997). Each voxel was then standardized to a mean of 100 across time (within run), with values representing percentage signal change (in reference to the mean) and bounded between 0 and 200. The data were modeled with a generalized least squares regression using AFNI's 3dREMLfit (Cox and Hyde, 1997).
This analysis was performed by first generating a design matrix using AFNI's 3dDeconvolve function (Cox and Hyde, 1997). The design matrix included 6 motion parameters (x/y/z movement/rotation) as nuisance regressors, calculated using mcflirt (FSL 6.0.5.1:57b01774; Jenkinson et al., 2002). Framewise displacement (FD) was included as an additional nuisance variable, as calculated in Nipype (Power et al., 2014). Linear trends and low-frequency drifts were regressed out by including Legendre polynomials (up to fourth order). Timepoints with an FD above 0.5 mm were censored from this analysis. All nuisance regressors were regressed out at the run level. The hemodynamic response was modelled with a gamma function. Regions of interest All regions of interest (ROIs) were generated in a participant-specific manner based on FreeSurfer's Destrieux atlas (Destrieux et al., 2010; Fig. 4.2). These anatomical ROIs were co-registered to the functional images. We focused our analyses on three broad cortical ROIs: occipital (OCC), posterior parietal (PPC), and temporal (TEMP). We had no a priori reason to expect meaningful or interpretable hemispheric differences; therefore, all analyses used bilateral ROIs. OCC was defined as a combination of several regions across the occipital lobe that had little potential overlap with temporal or parietal regions (inferior occipital gyrus and sulcus, cuneus, middle occipital gyrus, superior occipital gyrus, occipital pole, calcarine sulcus, anterior and posterior transverse collateral sulcus, middle occipital sulcus and lunatus sulcus, superior occipital sulcus and transverse occipital sulcus, anterior occipital sulcus and preoccipital notch). The total number of voxels varied by participant (M = 6,467 ± 768, range: 5,204-7,979). PPC was defined as a combination of bilateral angular gyrus (ANG), intraparietal sulcus (IPS), supramarginal gyrus (SMG; combination of SMG and Jensen sulcus), and superior parietal cortex (SPC). The total number of voxels varied by participant (ANG: M = 1,830 ± 232, range: 1,463-2,357; IPS: M = 1,192 ± 145, range: 913-1,526; SMG: M = 1,832 ± 308, range: 1,375-2,640; SPC: M = 1,420 ± 221, range: 861-1,889). TEMP was defined as a combination of inferior temporal (Inf), superior temporal (Sup), middle temporal (Mid), temporal pole (Pole), and transverse temporal (Trans). Included with these other regions in some analyses, but not in the overall ROI, was the fusiform gyrus (FUS). The total number of voxels varied by participant (Inf: M = 2,103 ± 274, range: 1,400-2,525; Sup: M = 5,512 ± 586, range: 4,513-6,787; Mid: M = 2,149 ± 272, range: 1,523-2,742; Pole: M = 1,287 ± 183, range: 877-1,658; Trans: M = 126 ± 29, range: 45-190; FUS: M = 1,270 ± 173, range: 838-1,671). Figure 4.2. Visualization of the ROIs on the inflated surface of an averaged template brain supplied by FreeSurfer (magenta represents OCC; blue represents SPC; light blue represents IPS; green represents SMG; blue/green represents ANG; brown represents Sup; orange/brown represents Mid; orange represents Inf; yellow represents FUS). Left, ventral view. Right, lateral view. Face reconstruction analysis AAM components were generated based on the full corpus of face stimuli. Each face stimulus had 25 unique shape and 25 unique appearance component values. See Chapter 2 for a full explanation of how these components were generated. As a comparison face parameterization method, eigenface components were generated on the full stimulus set.
Similar to the AAM, eigenfaces are generated through principal components analysis (PCA). Here, however, the PCA was performed on the raw image information (179 x 251 x 3 red/green/blue values), rather than performing a separate PCA on shape information first. Consistent with prior work, we focused on the top 300 components (Cowen et al., 2014; Lee & Kuhl, 2016).

For each participant and individual ROI, we performed a multivariate (multi-response) ridge regression using the beta values for all voxels within the ROI as predictors and the AAM components as the outcome values (Fig. 4.3). The regressions were performed using the "glmnet" package in R (Friedman et al., 2010). The weights relating the voxel activity and face components were then applied to the voxel activity evoked by the held-out test images. Model predictions, unless otherwise noted, were based on the beta values averaged across all trials including the same test image. For one set of analyses, we separated out the first compared to the second appearance of the test images.

Figure 4.3. Schematic of the face reconstruction analysis. Top. The model was trained using a ridge regression that used the evoked fMRI pattern to predict the 50 AAM components for each face. Bottom. This model was then used to predict the AAM components based on the evoked fMRI activity patterns on a set of held-out test trials. Face reconstructions were generated based on these predicted values (see Chapter 2 for the reconstruction procedure).

The accuracy of the reconstruction for each stimulus was determined by a series of two-alternative forced choice (AFC) tests. For each reconstruction, we measured whether it was more similar to the original face or to a lure image. Similarity was measured by the Euclidean distance between the predicted and true component values. If the predicted component values were more similar to the true values than to the lure values, the test was considered correct. The AFC accuracy for each image was the average across AFC tests using each of the other test images as the lure (35 total comparisons). Because each test involves two faces, chance performance is 50%.

In order to assess performance at the individual component level, we used the same predictions as described above. For each participant, we calculated the correlation between the predicted and true score across all 36 test images. We then converted correlation values into Fisher's z. Performance was assessed in reference to a correlation of 0.

For the subjective ratings (affect, gender, trustworthiness, dominance, and attractiveness), we ran a separate model with the same structure. Here, rather than predicting the individual AAM components, the five ratings were predicted. The subjective ratings are based on the average rating from MTurk participants (see Chapter 2). Performance on these predictions was evaluated in the same way as for the individual AAM components above.

Statistical tests

Performance compared to chance was assessed using one-sample t-tests for each separate model. Statistical tests were assessed at the 0.05 alpha threshold. No corrections were made for multiple comparisons.

Results

Behavioral performance

Overall, participants performed well at the repetition detection task and were engaged. Participants began the task with high accuracy in block 1 (M = 89.7% ± 16.6%) and steadily improved through block 9 (M = 93.0% ± 7.2%; overall: M = 91.5% ± 6.3%). The mean sensitivity (d′) was 2.60 ± 0.73. The average response time was 1765 ms ± 502 ms.
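Before turning to the reconstruction results, the core of the modeling and scoring pipeline described in the Methods above can be summarized in a short sketch. The original analysis used the glmnet package in R; the version below is a simplified Python analogue using scikit-learn, with a fixed regularization strength and illustrative variable names (both assumptions, not the actual analysis code):

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_and_predict(train_betas, train_components, test_betas, alpha=1.0):
    """Map voxel betas to AAM components with a multi-output ridge model.

    train_betas: (n_train_faces, n_voxels) beta values for one ROI.
    train_components: (n_train_faces, 50) shape + appearance components.
    Returns predicted components for the held-out test trials.
    """
    model = Ridge(alpha=alpha)
    model.fit(train_betas, train_components)
    return model.predict(test_betas)

def afc_accuracy(predicted, true):
    """Two-alternative forced choice (AFC) scoring of each reconstruction.

    For each test face, the prediction is compared (Euclidean distance)
    against the face's true components and against each other test face's
    true components (the lures); with 36 test faces this yields 35
    comparisons per face, and chance performance is 50%.
    """
    n = len(true)
    accuracy = np.zeros(n)
    for i in range(n):
        d_target = np.linalg.norm(predicted[i] - true[i])
        d_lures = [np.linalg.norm(predicted[i] - true[j])
                   for j in range(n) if j != i]
        accuracy[i] = np.mean([d_target < d for d in d_lures])
    return accuracy
```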
Reconstruction of faces using AAM

Reconstructions were generated based on models run separately for each ROI and participant (see Fig. 4.4 for example reconstructions). We focused our initial analyses on the three main overall ROIs (Fig. 4.5). In order to quantify the quality of each reconstruction, we computed the similarity between the predicted AAM components and the true components and compared this against the similarity between the predicted components and another test image's true components in a series of two-alternative forced choice (AFC) tests. If the predicted components were more similar to the true components than to the comparison image's components, the test was considered correct. To assess performance compared to chance (50% accuracy), we ran a series of one-sample t-tests for each ROI. All three ROIs were significantly above chance (OCC: t(34) = 6.85, p < 0.001, d = 1.16, M = 53.4%; PPC: t(34) = 3.59, p = 0.001, d = 0.61, M = 51.1%; TEMP: t(34) = 3.48, p = 0.001, d = 0.59, M = 51.2%). The three regions, however, significantly differed in their level of accuracy (F(2, 68) = 19.03, p < 0.001, ηp² = 0.17), with OCC performing significantly better than PPC (t(34) = 5.2, p < 0.001, d = 0.92) and TEMP (t(34) = 4.67, p < 0.001, d = 0.82). Performance in PPC and TEMP did not significantly differ (t(34) = -0.56, p = 0.58, d = 0.10).

Figure 4.4. Reconstruction examples from one participant from OCC. The reconstructions were based on both AAM (middle) and eigenface (bottom) components, compared to the true image (top). The AFC accuracy for each reconstruction is in the bottom right corner. The first three columns (left to right) are examples where both face models performed well, the next two are examples where the models diverged in their performance, and the final column shows poor performance for both models.

Figure 4.5. Alternative forced choice (AFC) accuracy for AAM components, modeled separately for occipital (OCC), posterior parietal (PPC), and temporal (TEMP) cortical ROIs. All three ROIs reconstructed face images at above chance levels, with OCC reconstructions performing significantly better than PPC and TEMP. Error bars represent SEM.

Because the PPC ROI comprised several distinct regions, we next sought to explore whether individual regions had predictive power (Fig. 4.6). We performed the same analysis, but focused on angular gyrus (ANG), intraparietal sulcus (IPS), supramarginal gyrus (SMG), and superior parietal cortex (SPC) separately (all four combined formed our PPC ROI). Average performance in all four ROIs was similar to the PPC overall, indicating little or no gain from modeling all of these ROIs together. However, there was additional variance in model performance across participants (PPC overall: SD = 1.75; ANG: SD = 2.70; IPS: SD = 2.33; SMG: SD = 2.97; SPC: SD = 3.47). Thus, when examined individually, only SPC performed significantly above chance (t(34) = 2.37, p = 0.02, d = 0.4, M = 51.4%). All other regions failed to reach significance at the 0.05 threshold (ANG: t(34) = 0.95, p = 0.35, d = 0.16, M = 50.4%; IPS: t(34) = 1.94, p = 0.06, d = 0.33, M = 50.8%; SMG: t(34) = 1.94, p = 0.06, d = 0.33, M = 51.0%). An overall ANOVA, however, indicated no significant difference between the 4 ROIs (F(3, 102) = 0.71, p = 0.55, ηp² = 0.01).
Figure 4.6. Alternative forced choice (AFC) accuracy for AAM components, modeled separately for ROIs within the overall PPC region: angular gyrus (ANG), intraparietal sulcus (IPS), supramarginal gyrus (SMG), and superior parietal cortex (SPC). Only predictions based on SPC activity performed significantly above chance. Error bars represent SEM.

Because the TEMP ROI likewise comprised several distinct regions, we next explored whether individual regions had predictive power (Fig. 4.7). We performed the same analysis, but focused on inferior temporal (Inf), superior temporal (Sup), middle temporal (Mid), temporal pole (Pole), and transverse temporal (Trans) regions. We also included the fusiform gyrus (FUS), which was not included in the overall temporal ROI. Average performance was significantly above chance for Inf (t(34) = 3.35, p = 0.002, d = 0.57, M = 51.6%) and Sup (t(34) = 3.24, p = 0.003, d = 0.55, M = 51.3%). Mean performance was similar in FUS and Mid, but failed to reach significance (FUS: t(34) = 1.97, p = 0.056, d = 0.33, M = 51.0%; Mid: t(34) = 1.39, p = 0.17, d = 0.24, M = 50.8%). Mean performance was at chance levels for Pole and Trans (Pole: t(34) = 0.30, p = 0.77, d = 0.05, M = 50.1%; Trans: t(34) = 0.72, p = 0.47, d = 0.12, M = 50.2%). An overall ANOVA, however, indicated no significant difference between the 6 ROIs (F(5, 170) = 2.14, p = 0.063, ηp² = 0.04).

Figure 4.7. Alternative forced choice (AFC) accuracy for AAM components, modeled separately for ROIs within the broader temporal ROI: inferior temporal (Inf), superior temporal (Sup), middle temporal (Mid), temporal pole (Pole), and transverse temporal (Trans). We also included the fusiform gyrus (FUS), which was not included in the overall temporal ROI. Average performance was significantly above chance for Inf and Sup. Error bars represent SEM.

Reconstruction performance compared to eigenface model

There was no evidence that the 50 AAM components were more accurately reconstructed than the 300 eigenface components (Fig. 4.8; see Fig. 4.4 for a visualization). A model (AAM, eigenface) x ROI (OCC, PPC, TEMP) repeated measures ANOVA found no significant difference in model performance (F(1, 34) = 0.49, p = 0.49, ηp² = 0.001) and no interaction with ROI (F(2, 68) = 0.94, p = 0.39, ηp² = 0.001).

Figure 4.8. AFC accuracy for AAM compared to eigenface components. We found no difference in AFC accuracy comparing the 50 AAM components (orange) to the 300 eigenface components (blue; EF) in occipital (OCC), posterior parietal (PPC), and temporal (TEMP) cortical ROIs. Error bars represent SEM.

We followed this up with the same approach, but with our four individual PPC ROIs. We again found no significant difference in model performance (F(1, 34) = 0.64, p = 0.43, ηp² = 0.001) and no interaction (F(3, 102) = 0.48, p = 0.70, ηp² = 0.002). We repeated this approach with the six TEMP ROIs and again found no significant difference in model performance (F(1, 34) = 0.09, p = 0.77, ηp² < 0.001) and no interaction (F(5, 170) = 1.13, p = 0.34, ηp² = 0.004). Although no overall difference was found, this may have been an unfair comparison given the different numbers of components included. Therefore, we repeated our model (AAM, eigenface) x ROI (OCC, PPC, TEMP) repeated measures ANOVA, but with only the top 50 eigenface components. We again found no significant difference in model performance (F(1, 34) = 0.33, p = 0.57, ηp² < 0.001) and no interaction (F(2, 68) = 1.27, p = 0.29, ηp² = 0.002).
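The two recurring statistical comparisons in this section, one-sample t-tests against 50% chance and model x ROI repeated measures ANOVAs, can be sketched as follows. This is a simplified illustration using scipy and statsmodels; the variable and column names are hypothetical, not the actual analysis code:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.anova import AnovaRM

def test_against_chance(afc_by_participant, chance=0.5):
    """One-sample t-test of per-participant AFC accuracy vs. chance,
    with Cohen's d computed as (M - chance) / SD."""
    t, p = stats.ttest_1samp(afc_by_participant, chance)
    d = (np.mean(afc_by_participant) - chance) / np.std(afc_by_participant, ddof=1)
    return t, p, d

def model_by_roi_anova(df):
    """Model (AAM, eigenface) x ROI repeated measures ANOVA on AFC accuracy.

    df: long-format pandas DataFrame with one row per participant x model
    x ROI, and columns 'participant', 'model', 'roi', and 'acc'.
    """
    return AnovaRM(df, depvar='acc', subject='participant',
                   within=['model', 'roi']).fit()
```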
We further examined whether there were any accuracy differences as different numbers of components were included in the analysis (Fig. 4.9). For the purposes of statistical analysis, we looked at models with 2, 4, 6, 8, and 10 components included. We proceeded in steps of 2 so that one additional shape and one additional appearance component could be added at each step (ordered in terms of variance explained). We limited ourselves to the first few components in order to avoid overfitting the model. We ran a series of model (AAM, eigenface) x component number (2, 4, 6, 8, 10) repeated measures ANOVAs for each ROI (OCC, PPC, TEMP). In OCC, we found a significant main effect of component number, indicating an improvement in performance with more components included (F(4, 136) = 23.20, p < 0.001, ηp² = 0.03). Consistent with our previous analyses, there was no significant effect of model (F(1, 34) = 1.39, p = 0.25, ηp² = 0.004). There was, however, a significant interaction between component number and model (F(4, 136) = 2.54, p = 0.04, ηp² = 0.003). This seems to indicate some difference in the rate of AFC improvement for each model as more components are added, but these differences even out as further components are included. We found no significant effects for PPC or TEMP.

Figure 4.9. AFC accuracy by the number of components included for AAM (orange) and EF (blue) components. In OCC (top), performance improved at differential rates over the first 10 components, but plateaued at the same level for both models. In PPC (middle) and TEMP (bottom), performance plateaued immediately for both model types. The components included are ordered in terms of explained visual variance for each model type. Only even numbers were included, so that for the AAM components each step added the next top shape and the next top appearance component. The shaded region indicates a zoomed-in portion of the graph showing steps of 2 components, rather than the steps of 10 used on the rest of the x-axis. Error bars represent SEM.

Reconstruction performance for appearance vs shape components

One of the advantages of using the AAM is that the components can be divided into two broad categories. This allows us to look at differences between regions in terms of their ability to reconstruct different types of components (Fig. 4.10). Model performance was assessed in the same way here, but with the components divided into 25 shape and 25 appearance components. A model (shape, appearance) x ROI (OCC, PPC, TEMP) repeated measures ANOVA found no significant difference in model performance (F(1, 34) = 3.12, p = 0.09, ηp² = 0.02). However, there was a significant interaction (F(2, 68) = 10.88, p < 0.01, ηp² = 0.03). Follow-up t-tests found no difference between appearance and shape for OCC (t(34) = 1.06, p = 0.30, d = 0.17). In TEMP, however, the model performed significantly better at predicting shape than appearance components (t(34) = 3.76, p < 0.001, d = 0.80). PPC trended in the same direction, but did not quite reach significance (t(34) = 1.96, p = 0.058, d = 0.45). Follow-up one-sample t-tests found that AFC performance for shape was significantly above chance for PPC (t(34) = 4.42, p < 0.001, d = 0.75) and TEMP (t(34) = 5.82, p < 0.001, d = 0.98). However, both failed to reach significance for appearance (PPC: t(34) = 1.22, p = 0.23, d = 0.21; TEMP: t(34) = 1.30, p = 0.20, d = 0.22).

Figure 4.10. AFC accuracy for shape compared to appearance components. AFC accuracy was significantly higher for the 25 shape (blue) than the 25 appearance (orange) AAM components for TEMP.
There was a trend in the same direction for PPC, but no evidence for a difference in OCC. Error bars represent SEM.

Given the trend towards an overall advantage in PPC for shape compared to appearance components, we followed up by looking at ROIs within PPC (Fig. 4.11). A model (appearance, shape) x ROI (ANG, IPS, SMG, SPC) repeated measures ANOVA found a significant difference in model performance, with shape performing significantly better than appearance across these ROIs (F(1, 34) = 7.83, p = 0.01, ηp² = 0.03), and no significant interaction (F(3, 102) = 2.14, p = 0.10, ηp² = 0.02). Although there was no significant interaction, follow-up t-tests revealed significantly better performance for shape components in ANG (t(34) = 2.85, p = 0.01, d = 0.63) and SMG (t(34) = 2.51, p = 0.02, d = 0.58), but not for IPS (t(34) = 0.76, p = 0.45, d = 0.20) or SPC (t(34) = 0.26, p = 0.79, d = 0.05). In fact, AFC performance for shape was significantly above chance for ANG (t(34) = 2.67, p = 0.01, d = 0.29) and SMG (t(34) = 3.26, p = 0.003, d = 0.55), but failed to reach significance for appearance in both regions (ANG: t(34) = 0.92, p = 0.36, d = 0.16; SMG: t(34) = 0.06, p = 0.95, d = 0.01).

Figure 4.11. AFC accuracy for shape compared to appearance components within PPC ROIs. AFC accuracy was significantly higher for the 25 shape (blue) than the 25 appearance (orange) AAM components for two PPC ROIs, ANG and SMG, but no different for IPS and SPC. Error bars represent SEM.

We found an overall advantage in temporal cortex for shape compared to appearance components. We followed up by considering TEMP subregions (Fig. 4.12). A model (appearance, shape) x ROI repeated measures ANOVA found a significant difference in model performance, with shape again performing significantly better than appearance across these ROIs (F(1, 34) = 7.75, p = 0.01, ηp² = 0.03), and no significant interaction (F(5, 170) = 1.30, p = 0.27, ηp² = 0.01) or effect of ROI (F(5, 170) = 1.89, p = 0.10, ηp² = 0.02). Although there was no significant interaction, follow-up t-tests revealed significantly better performance for shape components in Sup (t(34) = 3.73, p < 0.001, d = 0.72) and Mid (t(34) = 2.88, p = 0.01, d = 0.60), but not for the other included regions. In fact, performance for shape was significantly above chance for Sup (t(34) = 5.36, p < 0.001, d = 0.91) and Mid (t(34) = 3.61, p < 0.001, d = 0.61), but failed to reach significance for appearance in both regions (Sup: t(34) = 0.92, p = 0.36, d = 0.16; Mid: t(34) = 0.07, p = 0.94, d = 0.01).

Figure 4.12. AFC accuracy for shape compared to appearance components within temporal ROIs. AFC accuracy was significantly higher for the 25 shape (blue) than the 25 appearance (orange) AAM components for two TEMP ROIs, Sup and Mid. There were no other significant differences in the included ROIs. Error bars represent SEM.

Reconstruction performance for individual AAM components

Although accuracy for shape and appearance components provides an overall summary of what is best reconstructed from fMRI activity, the limited number of components (50) provides the additional opportunity to examine reconstruction performance for individual components. For each participant, we calculated the correlation between the predicted and true score across all 36 test images (Fig. 4.13). For the purpose of statistical testing, we converted all correlations into Fisher's z and ran a series of one-sample t-tests compared to a correlation of 0.
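A minimal sketch of this per-component scoring, again with illustrative names and array layouts rather than the actual analysis code:

```python
import numpy as np
from scipy import stats

def component_correlations(predicted, true):
    """Correlate predicted and true scores for each AAM component
    across the test images for one participant, returning Fisher's z.

    predicted, true: (n_test_faces, n_components) arrays.
    """
    n_components = true.shape[1]
    r = np.array([np.corrcoef(predicted[:, k], true[:, k])[0, 1]
                  for k in range(n_components)])
    return np.arctanh(r)  # Fisher's z transform

# Across participants, each component's z values can then be tested
# against zero, e.g.: stats.ttest_1samp(z_by_participant[:, k], 0.0)
```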
Performance in OCC was significantly better than chance for 10 out of 25 shape components and 12 out of 25 appearance components (22/50 total). The pattern of significance was biased towards early components, but not exclusively so. For PPC, 5 out of 25 shape components and 4 out of 25 appearance components were significantly predicted. Interestingly, although the significant appearance components were biased towards early ones (3, 4, 5, 12), the same pattern did not appear for the shape components (1, 10, 17, 18, 20). For TEMP, 6 out of 25 shape components and 3 out of 25 appearance components were significantly predicted. Interestingly, there was a high degree of correspondence between PPC and TEMP, with all 5 significant shape components for PPC included in the 6 total for TEMP, and all 3 significant appearance components for TEMP included in the 4 significant for PPC. Although many more components were significant for OCC, the pattern aligned with TEMP and PPC, with only appearance component 12 being significantly predicted by PPC alone. A total of 3 appearance components (A3, A4, A5) and 5 shape components (S1, S10, S17, S18, S20) were significant for all three ROIs. Of note, although the most consistently predicted appearance components were high in terms of variance explained, the same pattern does not appear to hold for shape, where many later components were the best predicted across these ROIs (see Fig. 4.14 for a visualization of these components).

Figure 4.13. Average correlation between predicted and true score on individual AAM components, sorted by the 25 appearance (top) and 25 shape (bottom) components, and by OCC (red), PPC (green), and TEMP (blue). The components are ordered in terms of visual variance explained. Shaded areas represent SEM.

Figure 4.14. Best (and worst) predicted AAM components. The mean face (center) is depicted, shifted uniform amounts along each component (rows). The selected components were significantly predicted by all 3 ROIs (OCC, PPC, TEMP). There were 3 appearance components (A3, A4, A5) and 5 shape components (S1, S10, S17, S18, S20) that met this criterion. One additional shape component (S13) is also pictured, as its predictions were significantly negatively correlated with the true values.

Interestingly, there were a few components with a significantly negative correlation (7 total: 1 OCC shape, 3 PPC shape, 1 TEMP shape, 2 TEMP appearance; see Fig. 4.14, S4.6). Of particular note, shape component 13 had a negative correlation for all 3 ROIs. A subsequent investigation of this unexpected effect found that it was driven by a small number of stimuli (5) that were outliers in terms of model predictions on this component, all in the negative direction. When those stimuli were removed from the analysis, predictions for this component were no longer negatively correlated with the true scores (OCC: M = 0.05 ± 0.18; PPC: M = 0.05 ± 0.17; TEMP: M = 0.07 ± 0.17).

The effect of repetition on reconstruction performance

For our initial set of analyses, we assumed that averaging across repetitions of the test faces would yield the strongest reconstruction performance. However, we proceeded to examine the test trials separately, both to test that assumption and to examine whether there were any meaningful differences between the first and second time a stimulus was seen (Fig. 4.15). For ease of interpretation, this analysis focused on participants (N = 23) who saw the test items only twice (see Methods).
A repetition (1, 2) x ROI (OCC, PPC, TEMP) repeated measures ANOVA found no significant effect of repetition (F(1, 22) = 3.78, p = 0.06, ηp² = 0.04) and no interaction (F(2, 44) = 2.45, p = 0.10, ηp² = 0.01). However, follow-up t-tests revealed a significant drop in performance between repetitions 1 and 2 in OCC (t(22) = 3.12, p = 0.005, d = 0.65), but not in PPC (t(22) = 1.33, p = 0.20, d = 0.39) or TEMP (t(22) = 0.61, p = 0.55, d = 0.18).

Figure 4.15. AFC accuracy by repetition number. There was a significant drop in AFC accuracy for predictions made by OCC on the 2nd test trial compared to the 1st test trial of each test image. There was no effect of repetition in PPC or TEMP. For ease of interpretation, the plot and results focus on participants (N = 23) who saw all test items only twice. Error bars represent SEM.

As a follow-up, separating the predictions allowed us to examine whether there was any gain in performance on correct compared to incorrect trials. Participants were consistently accurate at the task, however, so there were too few incorrect trials to examine separately. Instead, we reran the same analyses on correct trials only and found the same pattern of statistical results (see Fig. S4.1).

We were further interested in whether there would be any pattern within PPC (Fig. 4.16). However, a repetition (1, 2) x ROI repeated measures ANOVA found no significant effect of repetition (F(1, 22) = 1.54, p = 0.23, ηp² = 0.01) and no interaction (F(3, 66) = 1.37, p = 0.26, ηp² = 0.02). Follow-up t-tests revealed no significant differences between repetitions 1 and 2 in any ROI. Follow-up analyses looking at correct trials only confirmed this same pattern of results (see Fig. S4.2).

Figure 4.16. AFC accuracy by repetition number for PPC ROIs. There was no significant effect of repetition on AFC accuracy for any individual PPC ROI (ANG, IPS, SMG, SPC). For ease of interpretation, the plot and results focus on participants (N = 23) who saw all test items only twice. Error bars represent SEM.

We were also interested in whether there would be any pattern within TEMP (Fig. 4.17). However, a repetition (1, 2) x ROI repeated measures ANOVA found no significant effect of repetition (F(1, 22) = 2.77, p = 0.11, ηp² = 0.03) and no interaction (F(5, 110) = 0.67, p = 0.65, ηp² = 0.01). Follow-up t-tests revealed no significant differences between repetitions 1 and 2 in any ROI. Follow-up analyses looking at correct trials only were broadly in line with this pattern of results (see Fig. S4.3). In this case, however, FUS, which had trended towards significance (t(22) = 2.05, p = 0.053, d = 0.59), did significantly differ, with performance falling on the repeated trial (t(22) = 2.52, p = 0.020, d = 0.59).

Figure 4.17. AFC accuracy by repetition number for temporal ROIs. There was no significant effect of repetition on AFC accuracy for any individual TEMP ROI (FUS, Inf, Sup, Mid, Pole, Trans). For ease of interpretation, the plot and results focus on participants (N = 23) who saw all test items only twice. Error bars represent SEM.

Predicting subjective ratings

In order to see how well perceptually-important information can be predicted, we used the same modeling procedure, but replaced the AAM components with five subjectively-rated dimensions (affect, gender, trustworthiness, dominance, and attractiveness). We used the same procedure as with the individual AAM components to calculate the relationship between predicted and true subjective rating values (Fig. 4.18).
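In code terms, this substitution is small. Reusing the hypothetical fit_and_predict() helper sketched in the Methods, only the outcome matrix changes (all variable names below are illustrative, not the actual analysis code):

```python
# The five averaged MTurk ratings replace the 50 AAM components as the
# regression outcomes; evaluation then proceeds exactly as for individual
# AAM components (per-dimension correlation, Fisher's z, one-sample t-test).
rating_dims = ['affect', 'gender', 'trustworthiness', 'dominance', 'attractiveness']
train_ratings = ratings_df.loc[train_face_ids, rating_dims].to_numpy()
predicted_ratings = fit_and_predict(train_betas, train_ratings, test_betas)
```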
We found that model performance was significantly above chance for trustworthiness in all three ROIs (OCC: t(34) = 5.47, p < 0.001, d = 0.93, M = 0.13; PPC: t(34) = 4.15, p < 0.001, d = 0.70, M = 0.10; TEMP: t(34) = 4.55, p < 0.001, d = 0.77, M = 0.11). Performance was above chance for affect (t(34) = 7.04, p < 0.001, d = 1.19, M = 0.19) and gender (t(34) = 3.60, p = 0.001, d = 0.61, M = 0.12) only in OCC. Dominance and attractiveness predictions were not significantly correlated with the true values in any ROI.

Figure 4.18. Average correlation (r) between predicted and averaged subjective ratings (affect, gender, trustworthiness, dominance, and attractiveness) for OCC (red), PPC (green), and TEMP (blue). For plotting purposes we used r, but for statistical testing we converted the correlations to Fisher's z. Model performance was significantly above chance for trustworthiness in all three ROIs, and above chance for gender and affect only in OCC. Error bars represent SEM.

Discussion

The current study evaluated the ability to reconstruct face images based on distributed patterns of fMRI activity evoked during perception of the face stimuli. We focused on a data-driven approach to parameterizing face images, the active appearance model (AAM). We then used ridge regression to predict the top 50 AAM components based on fMRI activity within different ROIs. We evaluated the performance of these models with alternative forced choice (AFC) accuracy in relation to the true component values. This analysis established the model's ability to reconstruct these face components at above chance levels across several different cortical ROIs. Further, we established differences in the ability to predict certain types of components within particular ROIs.

Differences in content representation

One advantage of the AAM is that it allows for the comparison of two broad classes of features (see Chapter 2). The shape components are derived first and come from the top 25 principal components of the locations of several face landmarks (e.g. landmarks along the chin, around the eyes). These components represent a mix of holistic (e.g. facial expression), configural (e.g. relative position of the eyes), and local information (e.g. nose or mouth shape). They also capture less perceptually-important information, such as head tilt. The appearance components are the top 25 principal components once all shape variance is removed. These components represent all information related to color and texture, including perceptually-important (e.g. facial hair) and unimportant (e.g. lighting) information. Broadly, the shape components may tend to represent more high-level information (e.g. affect), whereas the appearance components capture more low-level visual information (e.g. skin tone). We found no difference in the ability to reconstruct either component type in occipital cortex. Further investigation could look at regions within our broad occipital ROI to identify whether any areas performed better at predicting one component type. In contrast to occipital cortex, we found that temporal cortex better reconstructed shape than appearance information. This is unsurprising, given that it is later in the visual processing stream and tends to represent higher-level information (Cichy et al., 2014; Martin et al., 2018). When we looked at regions within posterior parietal and temporal cortex, we found that, where shape and appearance differed, shape was better predicted.
In particular, within PPC we found this pattern in ANG and SMG, and within temporal cortex we found it in Sup and Mid. The results in ANG are of particular interest here because of its role in representing the content of memory (Kuhl & Chun, 2014). This region also previously performed best at reconstructing face images from working memory (Lee & Kuhl, 2016). The current analysis suggests this was likely driven by higher-level visual features.

We further examined how well the model performed at predicting individual components. We found that the components best predicted by fMRI patterns do not necessarily correspond to the visual variance explained. This was especially true for the shape components, where the components most consistently predicted across the three ROIs tended towards lower visual importance (components 10, 17, 18, 20). For the appearance components, there does appear to be a relationship between better predictions and more visually important components; however, the pattern is not completely linear, with some later components (10, 11, 16, 17) being significantly predicted by occipital cortex.

Examining model performance in terms of individual components helps to explain similarities and differences between the ROIs. Our analyses identified three appearance components that were consistently predicted by all three main ROIs. All three of those components appear to partially capture variance associated with gender and, to a lesser extent, skin tone. They also appear to capture less perceptually-important information (e.g. lighting). Our analyses also identified five shape components that were consistently predicted by all three ROIs. Many of these components reflect complex features relevant to the specific identity of faces, such as nose shape and size. Many of these components also capture eyebrow position, which is likely related to higher-level information, such as affect. Outside of these components, there were several components that were predicted by only one or two of the ROIs. These components have a wide range of potential interpretations and span the range of components (see Fig. S4.4,5 for a visualization of the components).

As one way to provide additional insight into what the model may be decoding, we utilized the same modeling approach to decode perceptually-important dimensions across the same brain regions (affect, gender, dominance, trustworthiness, and attractiveness). Here we found that OCC significantly predicted affect, gender, and trustworthiness. PPC and TEMP each significantly predicted only trustworthiness. Given the importance of trustworthiness in immediate perceptual evaluations (Oosterhof & Todorov, 2008) and previous evidence of its representation in the brain (Cao et al., 2020), it is not surprising that it was one of the best predicted subjective dimensions. We did, however, expect to be able to successfully predict more of these dimensions. In particular, the failure of PPC and TEMP to predict gender is surprising given the discussion above about how some of the best predicted individual AAM components appear to load strongly onto gender.

Although we have evidence that multiple AAM components and subjective ratings can be decoded from neural activity, this analysis does not establish what is driving the success of the model. For example, certain neurons could be tuned to a particular AAM component, to another dimension related to a component (e.g.
gender), to a face exemplar high or low on that component, or to one particular feature of a component (e.g. nose shape). Furthermore, regions that successfully predict the same component may not be driven by the same underlying neural representation.

Effect of repetition

Although we expected that averaging voxel activity patterns across the two test trials would yield the best results, we were also interested in whether there was any difference in the ability to reconstruct images based on the first or second time a test image was presented. We found evidence for a drop in performance between repetitions 1 and 2 in occipital cortex and the fusiform gyrus, though the latter was only significant when incorrect trials were removed. Although we were specifically interested in parietal regions involved in memory, we failed to find any regions that demonstrated significant evidence of "memory amplification", where the second appearance of an item is better represented (Favila et al., 2018).

There are several potential explanations for the drop in performance on the repeated trial. One explanation is that the first presentation of a face demanded more attention because participants were encoding the features of the face. On the second appearance, however, participants may have received an immediate familiarity signal, responded "old", and then lapsed in attention for the remainder of the trial. An alternative account is that the model was trained only on the first appearance of face images; therefore, any changes specific to repeated presentations, whether due to an attention or memory effect, were absent from the training data. There are too few repetition trials to explore this possibility with the present data.

Comparison to eigenfaces

Previous efforts to reconstruct faces have successfully employed eigenfaces to parameterize face images (Cowen et al., 2014; Lee & Kuhl, 2016). Here we used an improved parameterization technique that represents face images more realistically and efficiently, and that has previously demonstrated greater reconstruction success (Chang & Tsao, 2017). Despite our expectations, we found no advantage to using the AAM compared to eigenfaces. One potential explanation was the difference in the number of components used (300 for eigenfaces, 50 for the AAM). However, we found no advantage for 50 AAM components when compared to 50 eigenface components. Furthermore, although our choice of 50 AAM components was based on prior usage (Chang & Tsao, 2017) and not our own data, we found no evidence that a different number of components would have been better. An important consideration is that although the AAM approach was no better than eigenfaces when measured with AFC accuracy, this approach to assessing performance is expressed only in relation to the predictability of the components themselves. That is, although both methods performed about as well at reconstructing within their respective spaces, the AAM renders more realistic reconstructions, which could potentially be judged as more similar to the true face on subjective judgments. A behavioral study would be needed to assess that possibility.
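As a concrete point of contrast with the AAM, the eigenface parameterization discussed above amounts to a PCA run directly on raw pixel values. A minimal sketch, assuming the corpus has been loaded as a single image array (names and layout are illustrative, not the actual analysis code):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenfaces(images, n_components=300):
    """Fit an eigenface basis by running PCA directly on raw pixels.

    images: (n_faces, 179, 251, 3) array of RGB face images. Unlike the
    AAM, no shape model is fit first; shape and appearance variance are
    mixed together in the flattened red/green/blue values.
    """
    flattened = images.reshape(len(images), -1).astype(float)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(flattened)  # (n_faces, n_components)
    return pca, scores

def render_faces(pca, scores, image_shape=(179, 251, 3)):
    """Invert the PCA to render face images from component scores."""
    flattened = pca.inverse_transform(scores)
    return flattened.reshape(-1, *image_shape)
```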
Future directions

Although the present results establish the ability to reconstruct face images at above chance levels, the magnitude of the effect was modest compared to recent attempts that took a similar methodological approach (Cowen et al., 2014; Lee & Kuhl, 2016) or to other recent approaches with strong reconstruction performance (Dado et al., 2022; Güçlütürk et al., 2017; VanRullen & Reddy, 2019). Moving forward, this is a rich dataset that needs to be explored further in order to increase the power and effectiveness of this approach.

Our ultimate goal was not only to maximize face reconstruction accuracy, but to do so in a way that was practical to implement as part of a larger experiment where internal face representations could serve as a decodable dependent variable. For this reason, we focused on regions that would be most likely to carry decodable internal representations. One recent study, for example, found that a remembered face was best decoded from temporal voxels (VanRullen & Reddy, 2019). For the same reason, we also focused on an experimental setup with a training set of faces from only one fMRI session. This led to a training set substantially smaller than those of other recent studies (Dado et al., 2022; Güçlütürk et al., 2017; VanRullen & Reddy, 2019). We designed a procedure that attempted to balance maximizing the total number of unique faces viewed against not overtaxing the attention of participants. There is also a tradeoff in the design between including more, faster trials and fewer, slower trials that individually yield better estimates. Our balancing of these tradeoffs led to a total of 396 unique training stimuli and 36 test images.

Ultimately, the most applicable approach would involve not only limiting training to one scan session, but actually limiting it to only a portion of the session. In pursuit of this possibility, one important aspect of the present work is an investigation into the utility of across-participant reconstructions, as opposed to the more typical within-participant approach. Such an approach has recently been shown to improve the reconstruction of natural images (Akamatsu et al., 2021). Here, one important aspect of the design was the inclusion of two "fixed" runs where all participants saw the same images in the same order. These fixed runs make the data more amenable to being transformed into a shared space across all participants (Chen et al., 2015; Chen et al., 2017). With this approach, the training set for each participant can be multiplied by the number of participants included. When applied to the present data, it has the potential to greatly improve reconstruction accuracy. If that approach proves viable, it opens up the possibility of participants only needing to be shown those two fixed runs (or possibly one) in order to be transformed into this shared space. This would allow the rest of a scan session to be devoted to a specific experimental design that decodes a participant's internal representation of face images over the course of the experiment.

Going forward, there are a number of additional analyses being actively explored that could potentially increase the power of this approach: (1) Use a leave-one-out analysis that does not treat the test faces as a distinct set. This approach would make the measurement less noisy and less influenced by any particular test image. (2) Include a voxel selection technique, such as the inclusion of only face-selective voxels.
This could help reduce the likelihood of overfitting the model. (3) Adapt these data to more recent approaches to face parameterization, which have greatly improved reconstructions in other contexts. (4) Leverage the relationship between particular regions and components to create reconstructions pieced together from the most reliably decoded region/component pairings. Although this would be less informative about any particular region, it could lead to better overall predictions. With these possibilities (and others), there is reason to believe the power of this approach can be improved.

Chapter V

GENERAL DISCUSSION

The goal of this dissertation was to establish an approach to studying interference resolution in episodic memory. Long-term memories are too often studied without full appreciation of their multi-dimensional and reconstructive nature. Further, cognitive and neural perspectives on this phenomenon often proceed on separate tracks. Thus, I sought to develop an approach that could bridge findings from behavioral and neural paradigms and computational models of interference resolution. This integrated approach will be highly valuable in understanding how the human memory system is able to efficiently store so many potentially confusable memories. In establishing this approach, I have already gained some key insights into the mechanisms of interference resolution.

Integrated summary of results

I began (Chapter 2) by collecting and establishing the validity and reliability of several metrics describing a large face stimulus corpus. Specifically, I first landmarked a large sample of faces by hand. The 62 positions were landmarked with high reliability across raters and allowed me to implement an active appearance model (AAM). The AAM components represent a face space that can be used to generate and manipulate synthetic face stimuli. In Chapter 4, I established the utility of these components in reconstructing faces from evoked patterns of fMRI activity.

I next collected data on the sorting of faces based on similarity. I found that sorters were consistent in their grouping of faces. In Chapter 3 I describe an application that combines the sorting data with the AAM components. This approach involved generating eight distinct face "families" based on which faces tended to be grouped together. From this starting point, I manipulated the similarity of faces within an orthogonal face space where faces generated from the same family caused high interference, but faces from different families did not.

I also collected subjective ratings on all face images. I found that the ratings were reliable both within and across participants. In Chapter 3 I describe an application combining the AAM components with the ratings. I focused on mapping the AAM components to the two most reliably rated dimensions (affect, gender). However, other subjective dimensions could be utilized moving forward (attractiveness, trustworthiness, dominance). Critically, I found that participants were highly accurate in retrieving both features (affect, gender) from memory.

Beyond the validity of this approach to research, this technology helped to unlock key theoretical insights. In Chapter 3 I found compelling evidence that resolving memory interference is associated with systematic, subtle changes in how memory features are recalled. The key to establishing this finding was the ability to independently manipulate two perceptually-important dimensions.
I was able to manipulate the faces to be similar enough to cause interference, but distinct enough (on one diagnostic dimension) to learn. By probing memory on the same dimensions the faces were manipulated on, I was able to measure feature memory in a continuous, perceptually-important space. I found that competition induced repulsion, whereby memories shifted away from their competitor on the diagnostic compared to the non-diagnostic dimension. I also found an increase in precision, again on the diagnostic compared to the non-diagnostic dimension. Both of these changes in recalled feature information were associated with better associative memory, suggesting that they play an adaptive role in interference resolution.

In Chapter 4 I began the process of mapping the face dimensions I developed to patterns of neural activity. The goal was to establish a method that could be used to measure potential changes in the neural representation of faces during competitive learning. First, I established the ability to reliably reconstruct face images by decoding AAM components from patterns of fMRI activity. I proceeded to explore which dimensions were best reconstructed and from which brain regions. I found evidence that face images were, overall, best reconstructed from occipital cortex. I also found a pattern where temporal and posterior parietal regions reconstructed shape components better than appearance components. I further identified specific AAM components and subjective dimensions that were the most strongly predicted. In particular, affect, gender, and trustworthiness were well predicted from the occipital region. Although the ability to generate image reconstructions is one important way to visualize the power of the model, ultimately the predictions made at the specific component or dimension level are the key to utilizing this approach experimentally (see Future directions, below).

Together, these results establish a role for both repulsion and increased precision in interference resolution. They also establish a set of methods that enable a path forward to linking these findings to adaptive neural changes also associated with interference resolution. Below I discuss the full theoretical implications and potential next steps.

The role of selective attention in interference resolution

The dominant theoretical focus of this dissertation is on a mechanistic account of interference resolution where learning through interleaved practice leads to changes in how feature-level information is remembered (repulsion, precision). It is important to consider how or whether these changes may be explained by behavioral strategies adopted by participants. Namely, given the task demands, participants were likely to actively develop strategies to learn the cue-face associations. One advantageous strategy would be to attempt to identify the differences between competing faces and then to selectively attend to those differences. In fact, a strategy of attending to differences has previously been shown to eliminate retrieval-induced forgetting (Smith & Hunt, 2000). Moreover, recent work has demonstrated that diagnostic features of competing memories are more strongly represented in neural activity patterns during memory retrieval (Zhao et al., 2021). While selective attention to diagnostic features potentially played a role in learning the competing associations, a selective attention account does not readily explain the key results of repulsion and precision found in Chapter 3.
That is, an account based on increased attention to diagnostic features does not predict that memory for these features will fundamentally shift with learning. In contrast, our results indicate a systematic distortion in feature memory that occurs, with learning, for specific features of specific items. While attention can create perceptual distortions in certain instances, these distortions operate on the dimensions themselves. For example, during perceptual learning, repeatedly attending to a particular dimension may "stretch" the perception of that dimension (Goldstone, 1998; Nosofsky, 1986). Critically, this type of stretching would lead to increased precision for any item that could be perceived along that dimension; in other words, the stretching should generalize across items. However, an important design feature of my experiments is that I counterbalanced, across items, which feature (affect, gender) corresponded to the diagnostic versus non-diagnostic dimensions. Therefore, the increased precision I observed for competitive items cannot be explained by a global change (stretching) in the perception of affect or gender. Instead, the increases in precision were specific to individual items. Because these item-level changes were predictive of a reduction in interference, I believe they were a driving force in resolving interference. While selective attention to diagnostic dimensions likely plays some role in interference resolution and is worth further investigation, the key point is that it fails to account for the findings of repulsion and precision that are the focus of this dissertation. Below I discuss an alternative theoretical framework that more readily accounts for these feature-level changes.

Relationship between behavioral findings and neural accounts of interference resolution

The hippocampus has important properties that reduce the likelihood of interference. Pattern separation helps create a unique neural code in CA3 when a new event is being encoded (Yassa & Stark, 2011). A key facet of pattern separation is that, regardless of the constituent features of the event, the associated neural representation is orthogonalized. This reduces the likelihood that any two events will interfere with one another during memory retrieval. However, when two events are highly similar (e.g. two egg recipe videos), this automatic mechanism may not be able to prevent interference. In those instances, recent fMRI evidence indicates that there is an experience-dependent process in the hippocampus, repulsion, that resolves interference over the course of learning. This process shifts the neural representations of items that have high similarity (Chanales et al., 2017) or that have caused more interference (Wanjia et al., 2021) to become more distinct. Hippocampal repulsion predicts interference reduction, suggesting that these neural changes may play a mechanistic role in resolving interference (Favila et al., 2016; Wanjia et al., 2021).

This pattern of neural repulsion mirrors our findings of behavioral repulsion and precision. Namely, our findings show that highly similar items come to be recalled as more distinct, just as neural representations shift to become more distinct in similar experimental contexts. Both the neural and the behaviorally-measured changes have been linked to interference resolution. However, neural repulsion has not yet been shown to be associated with changes in the recollection of memory features.
One theory that could explain that connection comes from Hulbert and Norman (2015). They model neural repulsion as a process where the diagnostic features of competing items become strengthened and the non-diagnostic features become weakened (see Interference resolution, Chapter 1). From this perspective, these changes to memory feature representations are what underlie neural repulsion. If the neural representations of specific memory features become strengthened or weakened, it follows that those same features would be recalled differently.

A challenge I addressed is translating the Hulbert and Norman (2015) model predictions into specific memory measures; I found evidence for two potential accounts. The most straightforward is the development of a more precise memory for the strengthened compared to the weakened feature (i.e. the diagnostic compared to the non-diagnostic dimension). Regardless of accuracy, if there is a strong representation of a particular feature, we would expect repeated retrieval attempts to have lower variance. Alternatively (or additionally), when a diagnostic feature becomes a larger part of the overall memory representation, that "over-representation" may create an exaggerated difference compared to the competing item (specifically on that feature). Both of these explanations represent adaptive distortions compatible with this computational model and, more broadly, with how the reconstructive nature of memory can facilitate our interactions with the world (Schacter et al., 2011).

Future directions

Importantly, although our behavioral findings fit with the Hulbert and Norman (2015) model of neural repulsion, the present data are insufficient to make a full causal connection. An important future goal is to investigate the extent to which behavioral and neural repulsion are related. Future experiments could track changes in memory features simultaneously through behavioral probes and trial-level neural decoding, both mapped to the same underlying face features. If feature memory shifts can be linked to shifts in decoding predictions, in a time-locked way, this would be strong evidence for a relationship between the two.

Although our approach in Chapter 4 framed the results in terms of reconstructing face images, the most important objective for future applicability is the ability to predict specific face dimensions. The results in Chapter 4 suggest that affect and gender are the best predicted subjective dimensions. These same dimensions were the focus of Chapter 3, making them strong candidates to focus on moving forward.

Although those dimensions are the most promising, there are still challenges to overcome before implementing this approach. One of the most pressing challenges is the development of a brief fMRI training session that can reliably predict those dimensions. Optimally, the training session would be short enough to still allow time for additional scan runs focused on interference manipulations. The fMRI study in Chapter 4 included over an hour of functional scanning and 396 unique training images. Given the number of features used (thousands of voxels), the model is already data-starved, so reducing the number of trials without any other changes is not a viable approach. The most straightforward solution would be the transformation of each participant's neural data into a shared feature space (Chen et al., 2015; Chen et al., 2017; Haxby et al., 2011) so that model training could be performed across participants.
This would vastly expand the size of the training set by allowing all trials across participants to be used for model training, with one participant iteratively held out for model testing. This type of training set would continue to expand as more participants are added, making the approach potentially more powerful over time, especially if used over multiple, successive experiments. The key to utilizing this type of approach is the inclusion of a fixed scan session where every participant is presented with the same stimuli in the same sequence. This fixed scan session can then be used to "functionally align" data across participants. If this approach proves viable, future experiments would only need to include the fixed scan run in order to map a new participant's data into the shared space. This would allow the rest of the scan session to be devoted to an experiment that the model makes predictions on, rather than devoting more scan time to adding to the training set. This type of approach has proved successful in decoding natural images (Akamatsu et al., 2021); however, it has not yet been applied to decoding face dimensions. It is unknown whether that success will translate, and how long a fixed scan run would need to be for this approach to work for faces.

An alternative approach could focus on developing a training session specifically designed to decode the target dimensions. By focusing on as few as two dimensions, rather than the full multi-dimensional face space, there are potentially more powerful design options. Chapter 4 utilized an event-related design, where each face stimulus was treated as an independent event. I viewed this as necessary to target all unique properties of each face. Under a more targeted approach, however, one could utilize a blocked design where faces matched on the target dimensions are displayed in succession. Such an approach would forgo much of the variability in faces that the current design detected, but it may achieve a stronger signal for the dimensions of interest. This type of targeted approach could lead to high decodability in a more limited amount of time.

Neither of these potential approaches directly addresses another outstanding challenge: whether the ability to reconstruct from perception will translate to memory retrieval. Previous face reconstruction attempts have reconstructed remembered faces at above chance levels in temporal cortex (VanRullen & Reddy, 2019) and posterior parietal cortex (Lee & Kuhl, 2016). Thus, one potential path to applying the decoding of face features to memory is simply to focus on specific regions where there is an established ability to reconstruct. Reconstruction accuracies, however, were lower compared to perception-based reconstructions, so this approach may not lead to decoding accuracies sufficiently high to track changes in the representation between trials. An alternative would be to train models on memory. Training on long-term memory would likely lead to too small a training set (due to practical limits on how many faces a participant can memorize in a single experimental session); however, training on working memory could be viable (see Lee & Kuhl, 2016). Another possibility would involve learning a more complex mapping to translate activation patterns from perception to memory.
Although we know that patterns elicited during encoding are reinstated during retrieval, there is a lot of unaccounted-for variance in reinstatement patterns (Xue, 2018). There are a number of potential explanations (e.g. a similar representation that is merely depressed, an alteration in how features are represented, a bias towards higher-level features, or a shift in where items are primarily processed), all of which may play a role to different extents depending on the brain region (Favila et al., 2022). Accounting for these differences could be key to unlocking the full power of this method.

Broader implications

I established evidence that memories are systematically altered at the specific feature level by the presence of other memories. This has important implications for the measurement of memory generally, particularly in the context of the growing use of continuous memory measures. Given the present findings, there are several important considerations for memory researchers: (1) Changes in one memory feature do not necessarily mean changes in other memory features. I found changes on the diagnostic relative to the non-diagnostic dimension. This is taken into account when it is the target of experimental manipulation, but often is not when a single probed feature is intended to measure overall memory accuracy, bias, or precision. (2) Different dimensions may demonstrate unique memory properties. I found initial evidence for differences in memory between affect and gender (see Fig. S3.1). Although these dimensions were counterbalanced and the differences did not impact our results of interest, such factors can lead to overall distortions in memory (Bülthoff & Zhao, 2019; Won et al., 2020). Again, the impact of this can be missed if only one feature is probed. (3) Adaptive (or maladaptive) feature memory changes must be taken into account. I found a systematic pattern of repulsion; a study that views memory only in reference to the veridical value would treat that type of adaptive distortion as inaccuracy. (4) Precision should not be conflated with accuracy. I found a pattern where repulsed memories were also highly precise.
Inspired by cognitive models and a variety of neural findings, this dissertation identified and translated models of neural computations into specific behavioral consequences. The tools validated here offer further opportunities to translate experimental manipulations into comparable behaviorally or neurally derived measures.

Conclusion: how do we remember highly similar information?

Let’s return to the question I began this dissertation with: how do humans store so many memories? First, not every memory needs to be remembered as distinct. Some memories are best forgotten; others are best integrated into semantic knowledge or schemas without a full episodic memory specific to the event. When it is advantageous to form a distinct memory, there is a dedicated pathway of the memory system that helps avoid interference. This system, however, is not always effective, and memories can be (and quite often are) forgotten because of other memories. However, in cases where there is additional pressure on the system to store a once-forgotten or interference-prone memory, interference can be overcome. The exact neural computations involved in this process are still being studied. Here, I established two adaptive memory changes that help reduce interference within the time course of the experiment. These feature memory changes highlight or even exaggerate differences between competing memories in a way that could prevent forgetting over the long term. I propose ways this could relate to underlying neural processes that repulse the representations of competing memories and chart a path forward to establishing the full relationship between the two. Such an interference-resolution mechanism would explain the vast human memory capacity.

137 APPENDIX A
CHAPTER II SUPPLEMENTARY MATERIAL

Figure S2.1. Density plot of masculinity/femininity ratings across all stimuli, divided by hand labels of female (orange) and male (blue) perceived gender.

138 Figure S2.2. Example of 8 stimulus pairs (left to right) from Chapter 3 (experiment 1) matched on affect, but differing on gender (top to bottom).

Figure S2.3. Example of 8 stimulus pairs (left to right) from Chapter 3 (experiment 1) matched on gender, but differing on affect (top to bottom).

139 APPENDIX B
CHAPTER III SUPPLEMENTARY MATERIAL

140 Figure S3.1. Differential memory effects for affect vs. gender. A. Accuracy (percent correct) on the associative memory test during each round of the learning phase (competitive condition only), separated by whether the diagnostic dimension was affect (green) or gender (purple) and by experiment number. For exp. 1, there was no difference in accuracy for affect vs. gender (F(1,35) = 1.84, p = 0.18, ηG² = 0.004). For exp. 2, the similarity between competitive pairmates was increased on both affect and gender (see Methods). Although not intended, this resulted in accuracy being significantly higher when affect was the diagnostic dimension compared to gender (F(1,40) = 13.22, p < 0.001, ηG² = 0.026). We addressed this difference in exp. 3 by further (and selectively) increasing the similarity between competitive pairmates along the affect dimension. This change was successful, as there was no longer a significant difference in accuracy when the diagnostic dimension was affect vs. gender (F(1,56) = 2.22, p = 0.14, ηG² = 0.003). B.
Reconstruction bias as a function of dimension (diagnostic, non-diagnostic), whether the diagnostic dimension was affect or gender, and experiment number. Here, bias was measured as the (un-modeled) mean response because there were too few trials to perform the modeling approach used in the main text. A repeated measures ANOVA revealed a robust main effect of dimension, reflecting significantly greater bias towards repulsion (higher mean reconstruction bias) on the diagnostic vs. non-diagnostic dimension (F(1,118) = 85.05, p < 0.001, ηG² = 0.089). However, there was also a significant interaction between diagnostic/non-diagnostic dimension and gender/affect (F(1,118) = 34.41, p < 0.001, ηG² = 0.047), reflecting a greater difference between diagnostic vs. non-diagnostic dimensions when the diagnostic dimension was affect (this effect did not further interact with experiment number; three-way interaction: F(2,118) = 0.39, p = 0.68, ηG² = 0.001). Nonetheless, the effect of diagnostic vs. non-diagnostic dimension was significant whether the diagnostic dimension was affect (F(1,118) = 88.44, p < 0.001, ηG² = 0.234) or gender (F(1,118) = 4.27, p = 0.041, ηG² = 0.008). C. Reconstruction precision (SD of responses across the 4 reconstruction trials for each face) as a function of dimension (diagnostic, non-diagnostic), whether the diagnostic dimension was affect or gender, and experiment number. A repeated measures ANOVA revealed a robust main effect of dimension (F(1,118) = 45.30, p < 0.001, ηG² = 0.054), reflecting greater precision (lower SD) for the diagnostic than the non-diagnostic dimension. However, there was also a significant interaction between diagnostic/non-diagnostic dimension and gender/affect (F(1,118) = 36.60, p < 0.001, ηG² = 0.034), reflecting a greater difference between the diagnostic vs. non-diagnostic dimensions when the diagnostic dimension was affect (this effect did not further interact with experiment number; three-way interaction: F(2,118) = 0.086, p = 0.92, ηG² < 0.001). When affect was the diagnostic dimension, the effect of diagnostic vs. non-diagnostic dimension was significant (F(1,118) = 76.34, p < 0.001, ηG² = 0.147). In contrast, when gender was the diagnostic dimension, the effect of diagnostic vs. non-diagnostic dimension was not significant (F(1,118) = 1.22, p = 0.27, ηG² = 0.003). Note: error bars represent SEM.

141 Figure S3.2. Relationship between reconstruction bias and precision. A. A mixed-effects model was run with precision (mean SD) as the dependent variable and with experiment number (1, 2, 3), dimension (diagnostic, non-diagnostic), and bias included as predictors. Measures of precision and bias were computed at the item level, with each measure based on the mean value across the 4 reconstruction trials for each face. The relationship between bias and precision was modeled with random intercepts and slopes for each participant. Compared against a model without bias, adding bias significantly improved model fit (χ²(1) = 140.8, p < 0.001, β = -0.25, SE = 0.015), reflecting the fact that stronger bias (repulsion) was associated with greater precision (lower SD). In the plot, each dot represents a specific face image, each experiment is a unique color (e1: blue; e2: orange; e3: pink), and each line represents the modeled, participant-specific relationship between reconstruction bias and precision. Notably, the effect of bias on precision did not interact with dimension type
(diagnostic vs. non-diagnostic: χ²(1) = 1.61, p = 0.20, β = 0.027, SE = 0.022; or experiment: χ²(2) = 2.32, p = 0.31). Thus, although the diagnostic dimension was associated with greater bias and precision compared to the non-diagnostic dimension (see main text), the relationship between bias and precision was not specific to the diagnostic dimension. B. One potential account of why precision was greater on the diagnostic dimension is that greater repulsion (towards the boundary) reduced the response space for reconstruction (compressing variance). To address this, we computed mean precision for each dimension as a function of the level of bias on the diagnostic and non-diagnostic dimensions. Three equal-width bias bins were created within the half of the response range that was closer to the target than the competitor (0-2): ‘Attraction’ represents bias in the direction of the competitor face (range of bias values: 0-0.67); ‘Target’ represents responses centered around the true value (range: 0.67-1.33); ‘Repulsion’ represents bias away from the competitor face (range: 1.33-2). Because not all participants contributed to each bin, the mean and SEM were calculated across items, ignoring participant. Qualitatively, while precision was markedly higher overall for the Repulsion bin (lower SD), the tendency for greater precision on the diagnostic vs. non-diagnostic dimension was not selective to the Repulsion bin; in fact, the effect was least consistent in the Repulsion bin. In order to statistically confirm that the difference in precision for diagnostic vs. non-diagnostic dimensions was not an artifact of high bias values, we calculated the mean precision for the diagnostic and non-diagnostic dimensions including only faces within the Attraction and Target bins (0-1.33). Data were included from all experiments, but one participant (from e3) was excluded for having no items within the specified range. The remaining participants each had at least 2 items in the specified range for both the diagnostic (M = 5.70) and non-diagnostic (M = 5.16) dimensions (out of 8 possible items). Even with this restricted range (which excluded high-bias items), there was significantly greater precision on the diagnostic compared to the non-diagnostic dimension (F(1,117) = 25.16, p < 0.001, ηG² = 0.051). This effect did not interact with experiment (F(2,117) = 2.20, p = 0.12, ηG² = 0.009). Note: error bars represent SEM.

142 Figure S3.3. Reconstruction bias and precision for faces in the competitive vs. non-competitive conditions. For the non-competitive condition there was no distinction between diagnostic vs. non-diagnostic dimensions. Thus, for each face in the non-competitive condition, data from both dimensions were included. With 4 items in the non-competitive condition for each participant, and 4 reconstruction trials per face, this yielded 32 total values per participant for the non-competitive condition (2 dimensions x 4 items x 4 trials). Bias was modeled using the same method as for the diagnostic and non-diagnostic dimensions (see Methods). A. Bias was significantly greater (higher modeled mean) for the diagnostic dimension (of faces in the competitive condition) than for the non-competitive condition (F(1,118) = 22.11, p < 0.001, ηG² = 0.043). This difference did not interact with experiment (F(2,118) = 0.13, p = 0.88, ηG² < 0.001).
There was also a significant difference between the non-diagnostic dimension and the non-competitive condition (F(1,118) = 6.73, p = 0.011, ηG² = 0.015; not shown in the figure), with no interaction by experiment (F(2,118) = 1.90, p = 0.15, ηG² = 0.008). Specifically, for the non-diagnostic dimension there was a relative bias toward the center of face space (the modeled mean tended to be lower than 1; see Figure 3.4), whereas for the non-competitive condition the modeled mean was higher (almost exactly at the true value of 1). B. Precision was significantly greater (lower mean SD) for the diagnostic dimension (of faces in the competitive condition) than for the non-competitive condition (F(1,118) = 39.44, p < 0.001, ηG² = 0.073). This difference did not interact with experiment (F(2,118) = 0.12, p = 0.89, ηG² < 0.001). Notably, there was no significant difference in precision between the non-competitive condition and the non-diagnostic dimension (F(1,118) = 0.006, p = 0.94, ηG² < 0.001; not shown in the figure), nor was there an interaction by experiment (F(2,118) = 1.75, p = 0.18, ηG² = 0.006). Note: Each dot represents a participant, with color indicating the experiment (e1: blue; e2: orange; e3: pink); error bars represent SEM.

143 Figure S3.4. Histogram of reconstruction responses across all experiments, participants, and items in the competitive condition. Responses were separated by whether the dimension was diagnostic (orange) or non-diagnostic (blue). As in all other analyses, responses were rescaled such that the location of the target was at 1 (black dotted line), the center of the face space was at 0, and, in the case of the diagnostic dimension, the location of the competitor was at -1 (red dotted line). To better characterize the distributions, separate mixture models were generated for the diagnostic and non-diagnostic dimensions. Each model included three distributions: a target distribution (the correct face), a competitor distribution (the competitor face), and a uniform distribution (random guessing). For the target distribution, we used a truncated normal distribution: we set the mean to our estimate from the main bias analysis across all participants from all experiments (diagnostic: 1.17; non-diagnostic: 0.9) and allowed the standard deviation to vary within a ‘generous’ range that extended well beyond what would plausibly explain the data (0.3–2). We used the same approach for the competitor distribution but changed the mean. For the diagnostic dimension, we mirrored the target bias value by setting the competitor mean to -1.17. For the non-diagnostic dimension, since there was no competitor, we set the competitor mean at the value where a competitor would be (-1). Although there was no competitor in the case of the non-diagnostic dimension, we included this component to allow a fairer comparison across the diagnostic and non-diagnostic dimensions. In particular, the non-diagnostic dimension allows for a baseline estimate of the percentage of swap errors (recalling the competitor) in a situation where there should not be any. For the diagnostic dimension, the best-fitting model estimated that the target distribution explained 91.9% of responses (SD = 0.8), as reflected by the orange line. The model estimated that 6.1% of responses were random guesses and 2.0% of responses were swap errors (SD = 0.5). For the non-diagnostic dimension, the best-fitting model estimated that the target distribution explained 84.1% of responses (SD = 1), as reflected by the blue line. The model estimated that 15.9% of responses were random guesses and 0% were swap errors. Taken together, these mixture model results suggest that the target distributions largely explained responses, with relatively little influence from random guesses and swap errors. That said, because the mixture models require a relatively high number of data points, these models were not well-suited to characterizing distributions for individual items (faces) and participants.
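As a rough illustration of the modeling logic described above, the sketch below fits a three-component mixture (truncated normal target, truncated normal competitor, uniform guessing) by maximum likelihood. The response range of [-2, 2], the optimizer, and all names are assumptions for illustration; this is not the exact fitting code used here.

```python
import numpy as np
from scipy.stats import truncnorm, uniform
from scipy.optimize import minimize

LO, HI = -2.0, 2.0  # assumed response range (target at 1, competitor at -1)

def trunc_pdf(x, mu, sd):
    # Normal density truncated to the response range.
    a, b = (LO - mu) / sd, (HI - mu) / sd
    return truncnorm.pdf(x, a, b, loc=mu, scale=sd)

def neg_log_lik(params, x, mu_target, mu_comp):
    sd_t, sd_c, w_target, w_comp = params
    w_guess = 1.0 - w_target - w_comp  # remaining mass = uniform guessing
    density = (w_target * trunc_pdf(x, mu_target, sd_t)
               + w_comp * trunc_pdf(x, mu_comp, sd_c)
               + w_guess * uniform.pdf(x, loc=LO, scale=HI - LO))
    return -np.sum(np.log(density + 1e-12))

def fit_mixture(responses, mu_target, mu_comp):
    # SDs constrained to the 'generous' 0.3-2 range; weights sum to <= 1.
    result = minimize(
        neg_log_lik, x0=[0.8, 0.8, 0.8, 0.05],
        args=(np.asarray(responses), mu_target, mu_comp),
        method="SLSQP",
        bounds=[(0.3, 2.0), (0.3, 2.0), (0.0, 1.0), (0.0, 1.0)],
        constraints=[{"type": "ineq", "fun": lambda p: 1.0 - p[2] - p[3]}],
    )
    sd_t, sd_c, w_target, w_comp = result.x
    return {"target": w_target, "swap": w_comp,
            "guess": 1.0 - w_target - w_comp,
            "sd_target": sd_t, "sd_competitor": sd_c}

# e.g., diagnostic dimension: fit_mixture(responses, mu_target=1.17, mu_comp=-1.17)
```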
144 APPENDIX C
CHAPTER IV SUPPLEMENTARY MATERIAL

Figure S4.1. Alternative forced choice (AFC) accuracy for correct trials only, modeled separately for the 1st and 2nd appearance (x-axis), and for occipital (OCC), posterior parietal (PPC), and temporal (TEMP) cortical ROIs. For ease of interpretation, the current plot and results focus on participants (N = 23) who saw the test items only twice. A repetition (1, 2) x ROI (OCC, PPC, TEMP) repeated measures ANOVA found no significant effect of repetition (F(1, 22) = 2.45, p = 0.13, ηG² = 0.03) and no interaction (F(2, 44) = 1.4, p = 0.26, ηG² = 0.01). There was a significant effect of ROI (F(2, 44) = 11.54, p < 0.001, ηG² = 0.07). Follow-up t-tests revealed a significant drop in performance between repetitions 1 and 2 in OCC (t(22) = 2.76, p = 0.01, d = 0.56), but not in PPC (t(22) = 1.13, p = 0.27, d = 0.32) or TEMP (t(22) = 0.66, p = 0.52, d = 0.19). Error bars represent SEM.

145 Figure S4.2. Alternative forced choice (AFC) accuracy for correct trials only, modeled separately for the 1st and 2nd appearance (x-axis), and for ROIs within the overall PPC region: angular gyrus (ANG), intraparietal sulcus (IPS), supramarginal gyrus (SMG), and superior parietal cortex (SPC). For ease of interpretation, the current plot and results focus on participants (N = 23) who saw the test items only twice. A repetition (1, 2) x ROI (ANG, IPS, SMG, SPC) repeated measures ANOVA found no significant effect of repetition (F(1, 22) = 1.95, p = 0.18, ηG² = 0.01) and no interaction (F(3, 66) = 1.35, p = 0.27, ηG² = 0.02). There was also no significant effect of ROI (F(3, 66) = 0.03, p = 0.99, ηG² < 0.01). Follow-up t-tests revealed no significant differences in any ROI between repetitions 1 and 2. Error bars represent SEM.

146 Figure S4.3. Alternative forced choice (AFC) accuracy for correct trials only, modeled separately for the 1st and 2nd appearance (x-axis), and for ROIs within the overall temporal ROI: inferior temporal (Inf), superior temporal (Sup), middle temporal (Mid), temporal pole (Pole), and transverse temporal (Trans). We also included the fusiform gyrus (FUS), which was not included in the overall temporal ROI. For ease of interpretation, the current plot and results focus on participants (N = 23) who saw the test items only twice. A repetition (1, 2) x ROI (Inf, Sup, Mid, Pole, Trans, FUS) repeated measures ANOVA found no significant effect of repetition (F(1, 22) = 2.93, p = 0.10, ηG² = 0.03) and no interaction (F(5, 110) = 0.57, p = 0.72, ηG² = 0.01). There was also no significant effect of ROI (F(5, 110) = 1.21, p = 0.31, ηG² = 0.01). Follow-up t-tests revealed that only FUS significantly differed between repetitions 1 and 2, with performance falling on the repeated trial (t(22) = 2.52, p = 0.02, d = 0.59). Error bars represent SEM.
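For context on how AFC accuracy of this kind can be scored, the following is a generic sketch and not necessarily the exact scoring used in Chapter 4: each reconstructed feature vector is compared against the true face and a foil, and accuracy is the proportion of pairwise comparisons in which the true face wins. All names are hypothetical.

```python
import numpy as np

def afc_accuracy(pred_features, true_features):
    """Pairwise 2-alternative forced choice over reconstructed feature vectors.

    pred_features, true_features: (n_faces, n_features) arrays. The
    reconstruction of face i 'wins' a trial against foil j when it
    correlates more strongly with face i's true features than with
    face j's. Returns the proportion of trials won (chance = 0.5).
    """
    n = len(pred_features)
    wins, trials = 0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r_target = np.corrcoef(pred_features[i], true_features[i])[0, 1]
            r_foil = np.corrcoef(pred_features[i], true_features[j])[0, 1]
            wins += int(r_target > r_foil)
            trials += 1
    return wins / trials
```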
147 Figure S4.4. AAM appearance components significantly predicted by only OCC (black) and only PPC (red). The mean face (center) is depicted, shifted uniform amounts for each component (rows).

148 Figure S4.5. AAM shape components significantly predicted by only OCC (black) and by both OCC and TEMP (blue). The mean face (center) is depicted, shifted uniform amounts for each component (rows).

149 Figure S4.6. AAM components with significant negative correlations, for all ROIs (black), TEMP only (blue), and PPC only (red). The mean face (center) is depicted, shifted uniform amounts for each component (rows).

150 REFERENCES CITED

Akamatsu, Y., Harakawa, R., Ogawa, T., & Haseyama, M. (2021). Perceived image decoding from brain activity using shared information of multi-subject fMRI data. IEEE access, 9, 26593-26606. Anderson, M. C. (2003). Rethinking interference theory: Executive control and the mechanisms of forgetting. Journal of memory and language, 49(4), 415-445. Anderson, M. C., Bjork, E. L., & Bjork, R. A. (2000). Retrieval-induced forgetting: Evidence for a recall-specific mechanism. Psychonomic bulletin & review, 7(3), 522-530. Anderson, M. C., Bjork, R. A., & Bjork, E. L. (1994). Remembering can cause forgetting: retrieval dynamics in long-term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20(5), 1063. Anderson, M. C., & Neely, J. H. (1996). Interference and inhibition in memory retrieval. In Memory (pp. 237-313). Academic Press. Anderson, M. C., & Spellman, B. A. (1995). On the status of inhibitory mechanisms in cognition: memory retrieval as a model case. Psychological review, 102(1), 68. Archibald, F. (2009). Warping Using Thin Plate Splines. MATLAB Central File Exchange. Arsalidou, M., Morris, D., & Taylor, M. J. (2011). Converging evidence for the advantage of dynamic facial expressions. Brain topography, 24(2), 149-163. Ashby, S. R., Bowman, C. R., & Zeithamova, D. (2020). Perceived similarity ratings predict generalization success after traditional category learning and a new paired-associate learning task. Psychonomic bulletin & review, 27(4), 791-800. Avants, B. B., Epstein, C. L., Grossman, M., & Gee, J. C. (2008). Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis, 12(1), 26-41. Baddeley, A. D. (1964). Semantic and acoustic similarity in short-term memory. Nature, 204(4963), 1116-1117. 151 Baddeley, A. D., & Dale, H. C. (1966). The effect of semantic similarity on retroactive interference in long- and short-term memory. Journal of Verbal Learning and Verbal Behavior, 5(5), 417-420. Bae, G. Y., & Luck, S. J. (2017). Interactions between visual working memory representations. Attention, Perception, & Psychophysics, 79(8), 2376-2395. Bainbridge, W. A., Isola, P., & Oliva, A. (2013). The intrinsic memorability of face images. Journal of Experimental Psychology: General, 142(4), 1323-1334. Bakker, A., Kirwan, C. B., Miller, M., & Stark, C. E. (2008). Pattern separation in the human hippocampal CA3 and dentate gyrus. Science, 319(5870), 1640-1642. Balas, B., & Pacella, J. (2015). Artificial faces are harder to remember. Computers in human behavior, 52, 331-337. Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of memory and language, 68(3), 255-278. Bates, D., Mächler, M., Bolker, B., & Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv. https://doi.org/10.48550/arXiv.1406.5823 Battaglia, F. P., Benchenane, K., Sirota, A., Pennartz, C. M., & Wiener, S. I. (2011). The hippocampus: hub of brain network communication for memory.
Trends in cognitive sciences, 15(7), 310-318. Bäuml, K. H., & Hartinger, A. (2002). On the role of item similarity in retrieval-induced forgetting. Memory, 10(3), 215-224. Bays, P. M., Catalao, R. F., & Husain, M. (2009). The precision of visual working memory is set by allocation of a shared resource. Journal of vision, 9(10), 7-7. Beliy, R., Gaziv, G., Hoogi, A., Strappini, F., Golan, T., & Irani, M. (2019). From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI. Advances in Neural Information Processing Systems, 32. Benda, M. S., & Scherf, K. S. (2020). The Complex Emotion Expression Database: A validated stimulus set of trained actors. PloS one, 15(2), e0228248. Benoit, R. G., & Schacter, D. L. (2015). Specifying the core network supporting episodic simulation and episodic memory by activation likelihood estimation. Neuropsychologia, 75, 450-457. 152 Berens, S. C., Richards, B. A., & Horner, A. J. (2020). Dissociating memory accessibility and precision in forgetting. Nature Human Behaviour, 4(8), 866-877. Bjork, R. A. (1989). Retrieval inhibition as an adaptive mechanism in human memory. Varieties of memory and consciousness: Essays in honour of Endel Tulving, 309-330. Bonnici, H. M., Richter, F. R., Yazar, Y., & Simons, J. S. (2016). Multimodal feature integration in the angular gyrus during episodic and semantic retrieval. Journal of Neuroscience, 36(20), 5462-5471. Bookstein, F. L. (1989). Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence, 11(6), 567-585. Bostock, E., Muller, R. U., & Kubie, J. L. (1991). Experience‐dependent modifications of hippocampal place cell firing. Hippocampus, 1(2), 193-205. Brady, T. F., Konkle, T., Alvarez, G. A., & Oliva, A. (2008). Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences, 105(38), 14325-14329. Brady, T. F., Konkle, T., Gill, J., Oliva, A., & Alvarez, G. A. (2013). Visual long-term memory has the same limit on fidelity as visual working memory. Psychological science, 24(6), 981-990. Brainard, D. H. (1997). The psychophysics toolbox. Spatial vision, 10(4), 433-436. Brouwer, G. J., & Heeger, D. J. (2009). Decoding and reconstructing color from responses in human visual cortex. Journal of Neuroscience, 29(44), 13992-14003. Brunec, I. K., Moscovitch, M., & Barense, M. D. (2018). Boundaries shape cognitive representations of spaces and events. Trends in Cognitive Sciences, 22(7), 637-650. Bülthoff, I., & Zhao, M. (2020). Personally familiar faces: Higher precision of memory for idiosyncratic than for categorical information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 46(7), 1309. Cao, R., Li, X., Todorov, A., & Wang, S. (2020). A flexible neural representation of faces in the human brain. Cerebral Cortex Communications, 1(1), tgaa055. Carlson, T. A., Schrater, P., & He, S. (2003). Patterns of activity in the categorical representations of objects. Journal of cognitive neuroscience, 15(5), 704-717. 153 Chadwick, M. J., Hassabis, D., & Maguire, E. A. (2011). Decoding overlapping memories in the medial temporal lobes using high-resolution fMRI. Learning & Memory, 18(12), 742-746. Chanales, A. J., Oza, A., Favila, S. E., & Kuhl, B. A. (2017). Overlap among spatial memories triggers repulsion of hippocampal representations. Current Biology, 27(15), 2307-2317. Chanales, A. J., Tremblay-McGaw, A. G., Drascher, M. L., & Kuhl, B. A. 
(2021). Adaptive repulsion of long-term memory representations is triggered by event similarity. Psychological science, 32(5), 705- 720. Chang, L., & Tsao, D. Y. (2017). The code for facial identity in the primate brain. Cell, 169(6), 1013–1028. Chen, J., Leber, A. B., & Golomb, J. D. (2019). Attentional capture alters feature perception. Journal of Experimental Psychology: Human Perception and Performance, 45(11), 1443. Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., & Hasson, U. (2017). Shared memories reveal shared structure in neural activity across individuals. Nature neuroscience, 20(1), 115. Chen, J. M., Norman, J. B., & Nam, Y. (2021). Broadening the stimulus set: introducing the American multiracial faces database. Behavior Research Methods, 53(1), 371-389. Chen, P. H. C., Chen, J., Yeshurun, Y., Hasson, U., Haxby, J., & Ramadge, P. J. (2015). A reduced- dimension fMRI shared response model. In Advances in Neural Information Processing Systems (pp. 460-468). Chung, K. M., Kim, S., Jung, W. H., & Kim, Y. (2019). Development and validation of the Yonsei face database (YFace DB). Frontiers in psychology, 10, 2626. Chunharas, C., Brady, T., & Ramachandran, V. S. (2018). Selective amplification of salient features of visual memories during early memory consolidation. PsyArXiv. https://doi.org/10.31234/osf.io/5dcxa Chunharas, C., Rademaker, R. L., Brady, T., & Serences, J. (2019). Adaptive memory distortion in visual working memory. PsyArXiv. Cichy, R. M., Pantazis, D., & Oliva, A. (2014). Resolving human object recognition in space and time. Nature neuroscience, 17(3), 455-462. 154 Cohen, J. D., Daw, N., Engelhardt, B., Hasson, U., Li, K., Niv, Y., Norman, K.A., Pillow, J., Ramadge, P.J., Turk-Browne, N.B., & Willke, T. L. (2017). Computational approaches to fMRI analysis. Nature neuroscience, 20(3), 304-313. Colgin, L. L., Moser, E. I., & Moser, M. B. (2008). Understanding memory through hippocampal remapping. Trends in neurosciences, 31(9), 469-477. Conley, M. I., Dellarco, D. V., Rubien-Thomas, E., Cohen, A. O., Cervera, A., Tottenham, N., & Casey, B. J. (2018). The racially diverse affective expression (RADIATE) face stimulus set. Psychiatry research, 270, 1059-1067. Cooper, R. A., Kensinger, E. A., & Ritchey, M. (2019). Memories fade: The relationship between memory vividness and remembered visual salience. Psychological science, 30(5), 657-668. Cooper, R. A., & Ritchey, M. (2019). Cortico-hippocampal network connections support the multidimensional quality of episodic memory. Elife, 8, e45591. Cootes, T. F., Edwards, G. J., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions on Pattern Analysis & Machine Intelligence, 23(6), 681–685. Cowen, A. S., Chun, M. M., & Kuhl, B. A. (2014). Neural portraits of perception: reconstructing face images from evoked brain activity. Neuroimage, 94, 12-22. Cox, R. W., & Hyde, J. S. (1997). Software tools for analysis and visualization of fMRI data. NMR in Biomedicine: An International Journal Devoted to the Development and Application of Magnetic Resonance In Vivo, 10(4‐5), 171-178. Cox, D. D., & Savoy, R. L. (2003). Functional magnetic resonance imaging (fMRI)“brain reading”: detecting and classifying distributed patterns of fMRI activity in human visual cortex. Neuroimage, 19(2), 261-270. Crowder, R. G. (2014). The interference theory of forgetting in long-term memory. In Principles of Learning and Memory (pp. 234-279). Psychology Press. 
Dado, T., Güçlütürk, Y., Ambrogioni, L., Ras, G., Bosch, S., van Gerven, M., & Güçlü, U. (2022). Hyperrealistic neural decoding for reconstructing faces from fMRI activations via the GAN latent space. Scientific reports, 12(1), 1-9. 155 Dale, A. M., Fischl, B., & Sereno, M. I. (1999). Cortical surface-based analysis: I. Segmentation and surface reconstruction. Neuroimage, 9(2), 179-194. Davis, T., & Poldrack, R. A. (2013). Measuring neural representations with fMRI: practices and pitfalls. Annals of the New York Academy of Sciences, 1296(1), 108-134. DeBruine, L., & Jones, B. (2017). Face Research Lab London Set [Dataset]. figshare. https://doi.org/10.6084/m9.figshare.5047666.v5 Deese, J. (1959). On the prediction of occurrence of particular verbal intrusions in immediate recall. Journal of experimental psychology, 58(1), 17. Destrieux, C., Fischl, B., Dale, A., & Halgren, E. (2010). Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. Neuroimage, 53(1), 1-15. Diana, R. A., Peterson, M. J., & Reder, L. M. (2004). The role of spurious feature familiarity in recognition memory. Psychonomic bulletin & review, 11(1), 150-156. Drascher, M. L., & Kuhl, B. A. (2022). Long-term memory interference is resolved via repulsion and precision along diagnostic memory dimensions. Psychonomic Bulletin & Review, 1-15. Drucker, D. M., & Aguirre, G. K. (2009). Different spatial scales of shape similarity representation in lateral and ventral LOC. Cerebral Cortex, 19(10), 2269-2280. Ebner, N. C., Riediger, M., & Lindenberger, U. (2010). FACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behavior Research Methods, 42(1), 351-362. Edwards, G. J., Cootes, T. F., & Taylor, C. J. (1998). Face recognition using active appearance models. In European conference on computer vision. Springer, Berlin, Heidelberg, pp. 581-595. Engell, A. D., & Haxby, J. V. (2007). Facial expression and gaze-direction in human superior temporal sulcus. Neuropsychologia, 45(14), 3234-3241. Esteban, O., Markiewicz, C. J., Blair, R. W., Moodie, C. A., Isik, A. I., Erramuzpe, A., Kent, J. D., Goncalves, M., DuPre, E., Snyder, M., & Gorgolewski, K. J. (2019). fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature methods, 16(1), 111-116. 156 Ester, E. F., Sprague, T. C., & Serences, J. T. (2015). Parietal and frontal cortex encode stimulus-specific mnemonic representations during visual working memory. Neuron, 87(4), 893-905. Favila, S. E., Chanales, A. J. H., & Kuhl, B. A. (2016). Experience-dependent hippocampal pattern differentiation prevents interference during subsequent learning. Nature Communications, 7(1), 11066. Favila, S. E., Kuhl, B. A., & Winawer, J. (2022). Perception and memory have distinct spatial tuning properties in human visual cortex. Nature communications, 13(1), 1-21. Favila, S. E., Samide, R., Sweigart, S. C., & Kuhl, B. A. (2018). Parietal representations of stimulus features are amplified during memory retrieval and flexibly aligned with top-down goals. Journal of Neuroscience, 38(36), 7809-7821. Fawcett, J. M., & Hulbert, J. C. (2020). The many faces of forgetting: Toward a constructive view of forgetting in everyday life. Journal of Applied Research in Memory and Cognition, 9(1), 1-18. Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1), 1. Fonov, V. S., Evans, A. C., McKinstry, R.
C., Almli, C. R., & Collins, D. L. (2009). Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage, (47), S102. Ford, J. H., & Kensinger, E. A. (2016). Effects of internal and external vividness on hippocampal connectivity during memory retrieval. Neurobiology of learning and memory, 134, 78-90. Furl, N., Henson, R. N., Friston, K. J., & Calder, A. J. (2013). Top-down control of visual responses to fear by the amygdala. Journal of Neuroscience, 33(44), 17435-17443. Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological review, 91(1), 1. Goldstone, R. L. (1998). Perceptual learning. Annual review of psychology, 49(1), 585-612. Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78(1), 27-43. Goldstone, R. L., & Steyvers, M. (2001). The sensitization and differentiation of dimensions during category learning. Journal of experimental psychology: General, 130(1), 116. 157 Golomb, J. D. (2015). Divided spatial attention and feature-mixing errors. Attention, Perception, & Psychophysics, 77(8), 2562-2569. Gorgolewski, K., Burns, C. D., Madison, C., Clark, D., Halchenko, Y. O., Waskom, M. L., & Ghosh, S. S. (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Frontiers in neuroinformatics, 13. Greve, D. N., & Fischl, B. (2009). Accurate and robust brain image alignment using boundary-based registration. Neuroimage, 48(1), 63-72. Güçlütürk, Y., Güçlü, U., Seeliger, K., Bosch, S., van Lier, R., & van Gerven, M. A. (2017). Reconstructing perceived faces from brain activations with deep adversarial neural decoding. Advances in neural information processing systems, 30. Harlow, I. M., & Donaldson, D. I. (2013). Source accuracy data reveal the thresholded nature of human episodic memory. Psychonomic Bulletin & Review, 20(2), 318-325. Harlow, I. M., & Yonelinas, A. P. (2016). Distinguishing between the success and precision of recollection. Memory, 24(1), 114-127. Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539), 2425-2430. Haxby, J. V., Guntupalli, J. S., Connolly, A. C., Halchenko, Y. O., Conroy, B. R., Gobbini, M. I., Hanke, M., & Ramadge, P. J. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2), 404-416. Hays, J., Wong, C., & Soto, F. A. (2020). FaReT: A free and open-source toolkit of three-dimensional models and software to study face perception. Behavior research methods, 52(6), 2604-2622. Haynes, J. D., & Rees, G. (2005). Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nature neuroscience, 8(5), 686-691. Horner, A. J., & Burgess, N. (2013). The associative structure of memory for multi-element events. Journal of Experimental Psychology: General, 142(4), 1370. 158 Horner, A. J., & Burgess, N. (2014). Pattern completion in multielement event engrams. Current Biology, 24(9), 988-992. Hulbert, J. C., & Norman, K. A. (2015). Neural differentiation tracks improved recall of competing memories following interleaved study and retrieval practice. Cerebral Cortex, 25(10), 3994-4008. Hutchinson, J. B., Pak, S. S., & Turk-Browne, N. B. (2016). Biased competition during long-term memory formation. 
Journal of cognitive neuroscience, 28(1), 187-197. Huth, A. G., Lee, T., Nishimoto, S., Bilenko, N. Y., Vu, A. T., & Gallant, J. L. (2016). Decoding the semantic content of natural movies from human brain activity. Frontiers in systems neuroscience, 10, 81. Jenkinson, M., Bannister, P., Brady, M., & Smith, S. (2002). Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage, 17(2), 825-841. Jiang, Z., Sanders, D. M. W., & Cowell, R. A. (2022). Visual and semantic similarity norms for a photographic image stimulus set containing recognizable objects, animals and scenes. Behavior Research Methods, 1-17. Kahana, M. J., Zhou, F., Geller, A. S., & Sekuler, R. (2007). Lure similarity affects visual episodic recognition: Detailed tests of a noisy exemplar model. Memory & cognition, 35(6), 1222-1232. Kamitani, Y., & Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature neuroscience, 8(5), 679. Kamitani, Y., & Tong, F. (2006). Decoding seen and attended motion directions from activity in the human visual cortex. Current biology, 16(11), 1096-1102. Kanwisher, N. (2000). Domain specificity in face perception. Nature neuroscience, 3(8), 759-763. Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. Nature, 452(7185), 352-355. Klein, A., Ghosh, S. S., Bao, F. S., Giard, J., Häme, Y., Stavsky, E., Lee, N., Rossa, B., Reuter, M., Chaibub Neto, E., & Keshavan, A. (2017). Mindboggling morphometry of human brains. PLoS computational biology, 13(2), e1005350. 159 Kleiner, M., Brainard, D., & Pelli, D. (2007). What’s new in Psychtoolbox-3? Perception, 36 (ECVP Abstract Supplement), 14. Koestinger, M., Wohlhart, P., Roth, P. M., & Bischof, H. (2011, November). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops) (pp. 2144-2151). IEEE. Korkki, S. M., Richter, F. R., Jeyarathnarajah, P., & Simons, J. S. (2020). Healthy ageing reduces the precision of episodic memory retrieval. Psychology and Aging, 35(1), 124. Krakauer, J. W., Ghazanfar, A. A., Gomez-Marin, A., MacIver, M. A., & Poeppel, D. (2017). Neuroscience needs behavior: correcting a reductionist bias. Neuron, 93(3), 480-490. Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in systems neuroscience, 4. Kroon, D. J. (2012). Active Shape Model (ASM) and Active Appearance Model (AAM). MATLAB Central File Exchange. Kruschke, J. K. (1996). Dimensional relevance shifts in category learning. Connection Science, 8(2), 225-248. Kuhl, B. A., & Chun, M. M. (2014). Successful remembering elicits event-specific activity patterns in lateral parietal cortex. Journal of Neuroscience, 34(23), 8051-8060. Kuhl, B. A., Rissman, J., Chun, M. M., & Wagner, A. D. (2011). Fidelity of neural reactivation reveals competition between memories. Proceedings of the National Academy of Sciences, 108(14), 5903-5908. LaBar, K. S., Crupain, M. J., Voyvodic, J. T., & McCarthy, G. (2003). Dynamic perception of facial affect and identity in the human brain. Cerebral Cortex, 13(10), 1023-1033. Lakshmi, A., Wittenbrink, B., Correll, J., & Ma, D. S. (2021). The India face set: International and cultural boundaries impact face impressions and perceptions of category membership.
Frontiers in psychology, 12, 627678. Lee, H., & Kuhl, B. A. (2016). Reconstructing perceived and retrieved faces from activity patterns in lateral parietal cortex. Journal of Neuroscience, 36(22), 6069-6082. 160 Leopold, D. A., O'Toole, A. J., Vetter, T., & Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature neuroscience, 4(1), 89-94. Li, A. Y., Fukuda, K., Lee, A. C., & Barense, M. D. (2020). Visual interference can help and hinder memory: Capturing representational detail using the Validated Circular Shape Space. bioRxiv, 535922. Lin, P. H., & Luck, S. J. (2009). The influence of similarity on visual working memory representations. Visual Cognition, 17(3), 356-372. Long, N. M., & Kuhl, B. A. (2018). Bottom-up and top-down factors differentially influence stimulus representations across large-scale attentional networks. Journal of Neuroscience, 38(10), 2495- 2504. Ma, D. S., Correll, J., and Wittenbrink, B. (2015). The Chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4):1122–1135. Ma, D. S., Kantner, J., & Wittenbrink, B. (2021). Chicago face database: Multiracial expansion. Behavior Research Methods, 53(3), 1289-1300. Mack, M. L., Love, B. C., & Preston, A. R. (2016). Dynamic updating of hippocampal object representations reflects new conceptual knowledge. Proceedings of the National Academy of Sciences, 113(46), 13203-13208. Mate, J., & Baqués, J. (2009). Short article: Visual similarity at encoding and retrieval in an item recognition task. Quarterly Journal of Experimental Psychology, 62(7), 1277-1284. Martin, C. B., Douglas, D., Newsome, R. N., Man, L. L., & Barense, M. D. (2018). Integrative and distinctive coding of visual and conceptual object features in the ventral visual stream. elife, 7, e31873. McClelland, J. L., McNaughton, B. L., & O'reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3), 419. 161 Melton, A. W., & Irwin, J. M. (1940). The influence of degree of interpolated learning on retroactive inhibition and the overt transfer of specific responses. The American Journal of Psychology, 53(2), 173-203. Milborrow, S., Morkel, J., & Nicolls, F. (2010). The MUCT landmarked face database. Pattern recognition association of South Africa, 201(0). Minear, M., & Park, D. C. (2004). A lifespan database of adult facial stimuli. Behavior research methods, instruments, & computers, 36(4), 630-633. Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M. A., Morito, Y., Tanabe, H. C., Sadato, N., & Kamitani, Y. (2008). Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5), 915-929. Mozafari, M., Reddy, L., & VanRullen, R. (2020, July). Reconstructing natural scenes from fMRI patterns using BigBiGAN. In 2020 international joint conference on neural networks (IJCNN) (pp. 1-8). IEEE. Muller, R. U., & Kubie, J. L. (1987). The effects of changes in the environment on the spatial firing of hippocampal complex-spike cells. Journal of Neuroscience, 7(7), 1951-1968. Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of classification, 31(3), 274-295. Naselaris, T., Prenger, R. J., Kay, K. N., Oliver, M., & Gallant, J. L. (2009). 
Bayesian reconstruction of natural images from human brain activity. Neuron, 63(6), 902-915. Nemrodov, D., Behrmann, M., Niemeier, M., Drobotenko, N., & Nestor, A. (2019). Multimodal evidence on shape and surface information in individual face processing. Neuroimage, 184, 813-825. Nestor, A., Lee, A. C., Plaut, D. C., & Behrmann, M. (2020). The face of image reconstruction: progress, pitfalls, prospects. Trends in cognitive sciences, 24(9), 747-759. Nestor, A., Plaut, D. C., & Behrmann, M. (2016). Feature-based face representations and image reconstruction from behavioral and neural data. Proceedings of the National Academy of Sciences, 113(2), 416-421. 162 Nilakantan, A. S., Bridge, D. J., Gagnon, E. P., VanHaerents, S. A., & Voss, J. L. (2017). Stimulation of the posterior cortical-hippocampal network enhances precision of memory recollection. Current Biology, 27(3), 465-470. Nilakantan, A. S., Bridge, D. J., VanHaerents, S., & Voss, J. L. (2018). Distinguishing the precision of spatial recollection from its success: Evidence from healthy aging and unilateral mesial temporal lobe resection. Neuropsychologia, 119, 101-106. Nishimoto, S., Vu, A. T., Naselaris, T., Benjamini, Y., Yu, B., & Gallant, J. L. (2011). Reconstructing visual experiences from brain activity evoked by natural movies. Current biology, 21(19), 1641-1646. Norman, K. A. (2010). How hippocampus and cortex contribute to recognition memory: revisiting the complementary learning systems model. Hippocampus, 20(11), 1217-1227. Norman, K. A., Newman, E. L., & Detre, G. (2007). A neural network model of retrieval-induced forgetting. Psychological review, 114(4), 887. Norman, K. A., Newman, E., Detre, G., & Polyn, S. (2006). How inhibitory oscillations can train neural networks and punish competitors. Neural computation, 18(7), 1577-1610. Norman, K. A., Polyn, S. M., Detre, G. J., & Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in cognitive sciences, 10(9), 424-430. Norman, K. A., & O'Reilly, R. C. (2003). Modeling hippocampal and neocortical contributions to recognition memory: a complementary-learning-systems approach. Psychological review, 110(4), 611. Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of experimental psychology: General, 115(1), 39. Oosterhof, N. N., & Todorov, A. (2008). The functional basis of face evaluation. Proceedings of the National Academy of Sciences, 105(32), 11087-11092. O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade‐off. Hippocampus, 4(6), 661-682. O'Reilly, R. C., & Norman, K. A. (2002). Hippocampal and neocortical contributions to memory: Advances in the complementary learning systems framework. Trends in cognitive sciences, 6(12), 505-510. 163 O'Reilly, R. C., & Rudy, J. W. (2001). Conjunctive representations in learning and memory: principles of cortical and hippocampal function. Psychological review, 108(2), 311. Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial vision, 10, 437-442. Pertzov, Y., Manohar, S., & Husain, M. (2017). Rapid forgetting results from competition over time between items in visual working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(4), 528. Peterson, J. C., Uddenberg, S., Griffiths, T. L., Todorov, A., & Suchow, J. W. (2022). Deep models of superficial face judgments. 
Proceedings of the National Academy of Sciences, 119(17), e2115228119. Power, J. D., Mitra, A., Laumann, T. O., Snyder, A. Z., Schlaggar, B. L., & Petersen, S. E. (2014). Methods to detect, characterize, and remove motion artifact in resting state fMRI. Neuroimage, 84, 320-341. Rajsic, J., Swan, G., Wilson, D. E., & Pratt, J. (2017). Accessibility limits recall from visual working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43(9), 1415. Reuter, M., Rosas, H. D., & Fischl, B. (2010). Highly accurate inverse consistent registration: a robust approach. Neuroimage, 53(4), 1181-1196. Rhodes, S., Abbene, E. E., Meierhofer, A. M., & Naveh-Benjamin, M. (2020). Age differences in the precision of memory at short and long delays. Psychology and Aging, 35(8), 1073. Richter, F. R., Cooper, R. A., Bays, P. M., & Simons, J. S. (2016). Distinct neural mechanisms underlie the success, precision, and vividness of episodic memory. Elife, 5, e18260. Roediger, H. L., & McDermott, K. B. (1995). Creating false memories: Remembering words not presented in lists. Journal of experimental psychology: Learning, Memory, and Cognition, 21(4), 803. Roesch, E. B., Tamarit, L., Reveret, L., Grandjean, D., Sander, D., & Scherer, K. R. (2011). FACSGen: A tool to synthesize emotional facial expressions through systematic manipulation of facial action units. Journal of Nonverbal Behavior, 35(1), 1-16. 164 Rugg, M. D., Otten, L. J., & Henson, R. N. (2002). The neural basis of episodic memory: evidence from functional neuroimaging. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 357(1424), 1097-1110. Rundus, D. (1973). Negative effects of using list items as recall cues. Journal of Verbal Learning and Verbal Behavior, 12(1), 43-50. Said, C. P., Moore, C. D., Engell, A. D., Todorov, A., & Haxby, J. V. (2010). Distributed representations of dynamic facial expressions in the superior temporal sulcus. Journal of vision, 10(5), 11-11. Sanocki, T., & Sulman, N. (2011). Color relations increase the capacity of visual short-term memory. Perception, 40(6), 635-648. Sato, W., Yoshikawa, S., Kochiyama, T., & Matsumura, M. (2004). The amygdala processes the emotional significance of facial expressions: an fMRI investigation using the interaction between expression and face direction. Neuroimage, 22(2), 1006-1013. Schacter, D. L., Guerin, S. A., & Jacques, P. L. S. (2011). Memory distortion: an adaptive perspective. Trends in cognitive sciences, 15(10), 467-474. Schacter, D. L., & Madore, K. P. (2016). Remembering the past and imagining the future: Identifying and enhancing the contribution of episodic memory. Memory Studies, 9(3), 245-255. Schapiro, A. C., Turk-Browne, N. B., Botvinick, M. M., & Norman, K. A. (2017). Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711), 20160049. Schindler, S., Zell, E., Botsch, M., & Kissler, J. (2017). Differential effects of face-realism and emotion on event-related brain potentials and their implications for the uncanny valley theory. Scientific reports, 7(1), 1-13. Schlichting, M. L., Mumford, J. A., & Preston, A. R. (2015). Learning-related representational changes reveal dissociable integration and separation signatures in the hippocampus and prefrontal cortex. Nature communications, 6, 8151. 165 Schurgin, M. W., Wixted, J. T., & Brady, T. F. (2020). 
Psychophysical scaling reveals a unified theory of visual memory strength. Nature human behaviour, 4(11), 1156-1172. Scotti, P. S., Hong, Y., Golomb, J. D., & Leber, A. B. (2021). Statistical learning as a reference point for memory distortions: Swap and shift errors. Attention, Perception, & Psychophysics, 1-21. Seeliger, K., Güçlü, U., Ambrogioni, L., Güçlütürk, Y., & van Gerven, M. A. (2018). Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181, 775-785. Serences, J. T., Ester, E. F., Vogel, E. K., & Awh, E. (2009). Stimulus-specific delay activity in human primary visual cortex. Psychological science, 20(2), 207-214. Shen, B., RichardWebster, B., O'Toole, A., Bowyer, K., & Scheirer, W. J. (2021, December). A study of the human perception of synthetic faces. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021) (pp. 1-8). IEEE. Smith, R. E., & Hunt, R. R. (2000). The influence of distinctive processing on retrieval-induced forgetting. Memory & Cognition, 28(4), 503-508. Steyvers, M. (1999). Morphing techniques for manipulating face images. Behavior Research Methods, Instruments, & Computers, 31(2), 359-369. St-Laurent, M., Abdi, H., & Buchsbaum, B. R. (2015). Distributed patterns of reactivation predict vividness of recollection. Journal of Cognitive Neuroscience, 27(10), 2000-2018. Storm, B. C., Bjork, E. L., & Bjork, R. A. (2008). Accelerated relearning after retrieval-induced forgetting: the benefit of being forgotten. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(1), 230. Sun, S. Z., Fidalgo, C., Barense, M. D., Lee, A. C., Cant, J. S., & Ferber, S. (2017). Erasing and blurring memories: The differential impact of interference on separate aspects of forgetting. Journal of Experimental Psychology: General, 146(11), 1606. Swan, G., Collins, J., & Wyble, B. (2016). Memory for a single object has differently variable precisions for relevant and irrelevant features. Journal of vision, 16(3), 32-32. Theves, S., Fernández, G., & Doeller, C. F. (2020). The hippocampus maps concept space, not feature space. Journal of Neuroscience, 40(38), 7318-7325. 166 Thomas, K. M., Drevets, W. C., Whalen, P. J., Eccard, C. H., Dahl, R. E., Ryan, N. D., & Casey, B. J. (2001). Amygdala response to facial expressions in children and adults. Biological psychiatry, 49(4), 309-316. Tompary, A., & Thompson-Schill, S. L. (2021). Semantic influences on episodic memory distortions. Journal of Experimental Psychology: General. Tottenham, N., Tanaka, J. W., Leon, A. C., McCarry, T., Nurse, M., Hare, T. A., Marcus, D.J., Westerlund, A., Casey, B.J., & Nelson, C. (2009). The NimStim set of facial expressions: judgments from untrained research participants. Psychiatry research, 168(3), 242-249. Tulving, E. (1974). Cue-dependent forgetting: When we forget something we once knew, it does not necessarily mean that the memory trace has been lost; it may only be inaccessible. American scientist, 62(1), 74-82. Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of cognitive neuroscience, 3(1), 71- 86. Tustison, N. J., Avants, B. B., Cook, P. A., Zheng, Y., Egan, A., Yushkevich, P. A., & Gee, J. C. (2010). N4ITK: improved N3 bias correction. IEEE transactions on medical imaging, 29(6), 1310-1320. Van Ginneken, B., Frangi, A. F., Staal, J. J., ter Haar Romeny, B. M., & Viergever, M. A. (2002). Active shape model segmentation with optimal features. 
IEEE transactions on medical imaging, 21(8), 924- 933. VanRullen, R., & Reddy, L. (2019). Reconstructing faces from fMRI patterns using deep generative neural networks. Communications biology, 2(1), 1-10. Walker, M., Schönborn, S., Greifeneder, R., & Vetter, T. (2018). The Basel Face Database: A validated set of photographs reflecting systematic differences in Big Two and Big Five personality dimensions. PloS one, 13(3), e0193190. Wanjia, G., Favila, S. E., Kim, G., Molitor, R. J., & Kuhl, B. A. (2021). Abrupt hippocampal remapping signals resolution of memory interference. Nature communications, 12(1), 1-11. Watson, H. C., & Lee, A. C. (2013). The perirhinal cortex and recognition memory interference. Journal of Neuroscience, 33(9), 4192-4200. 167 Wen, H., Shi, J., Zhang, Y., Lu, K. H., Cao, J., & Liu, Z. (2018). Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral cortex, 28(12), 4136-4160. Wheatley, T., Weinberg, A., Looser, C., Moran, T., & Hajcak, G. (2011). Mind perception: Real but not artificial faces sustain neural activity beyond the N170/VPP. PloS one, 6(3), e17960. Wills, T. J., Lever, C., Cacucci, F., Burgess, N., & O'keefe, J. (2005). Attractor dynamics in the hippocampal representation of the local environment. Science, 308(5723), 873-876. Won, B. Y., Haberman, J., Bliss-Moreau, E., & Geng, J. J. (2020). Flexible target templates improve visual search accuracy for faces depicting emotion. Attention, Perception, & Psychophysics, 1-15. Xue, G. (2018). The neural representations underlying human episodic memory. Trends in Cognitive Sciences, 22(6), 544-561. Xue, G., Dong, Q., Chen, C., Lu, Z., Mumford, J. A., & Poldrack, R. A. (2010). Greater neural pattern similarity across repetitions is associated with better memory. Science, 330(6000), 97-101. Yassa, M. A., & Stark, C. E. (2011). Pattern separation in the hippocampus. Trends in neurosciences, 34(10), 515-525. Yeung, L. K., Ryan, J. D., Cowell, R. A., & Barense, M. D. (2013). Recognition memory impairments caused by false recognition of novel objects. Journal of Experimental Psychology: General, 142(4), 1384. Yu, X., & Geng, J. J. (2019). The attentional template is shifted and asymmetrically sharpened by distractor context. Journal of experimental psychology: human perception and performance, 45(3), 336. Zeithamova, D., Dominick, A. L., & Preston, A. R. (2012). Hippocampal and ventral medial prefrontal activation during retrieval-mediated learning supports novel inference. Neuron, 75(1), 168-179. Zhang, Y., Brady, M., & Smith, S. (2001). Segmentation of brain MR images through a hidden Markov random field model and the expectation-maximization algorithm. IEEE transactions on medical imaging, 20(1), 45-57. Zhang, W., & Luck, S. J. (2008). Discrete fixed-resolution representations in visual working memory. Nature, 453(7192), 233-235. 168 Zhao, Y., Chanales, A. J., & Kuhl, B. A. (2021). Adaptive memory distortions are predicted by feature representations in parietal cortex. Journal of Neuroscience, 41(13), 3014-3024. Zheng, J., Schjetnan, A. G., Yebra, M., Gomes, B. A., Mosher, C. P., Kalia, S. K., Valiante, T.A., Mamelak, A.N., Kreiman, G., & Rutishauser, U. (2022). Neurons detect cognitive boundaries to structure episodic memories in humans. Nature Neuroscience, 25(3), 358-368. 169